Add scan flow MVP and local Axiom skill workspace

This snapshot establishes the camera-to-result recognition flow and related tests while checking in the project skill/docs assets required for the configured local tooling.
2026-04-19 21:11:32 +02:00
parent 577214d474
commit a60a76b797
679 changed files with 138964 additions and 73 deletions
--- a/.claude/skills/axiom-ios-ml/speech/SKILL.md
+++ b/.claude/skills/axiom-ios-ml/speech/SKILL.md
@@ -0,0 +1,496 @@
+---
+name: speech
+description: Use when implementing speech-to-text, live transcription, or audio transcription. Covers SpeechAnalyzer (iOS 26+), SpeechTranscriber, volatile/finalized results, AssetInventory model management, audio format handling.
+license: MIT
+version: 1.0.0
+---
+
+# Speech-to-Text with SpeechAnalyzer
+
+## Overview
+
+SpeechAnalyzer is Apple's new speech-to-text API introduced in iOS 26. It powers Notes, Voice Memos, Journal, and Call Summarization. The on-device model is faster, more accurate, and better for long-form/distant audio than SFSpeechRecognizer.
+
+**Key principle**: SpeechAnalyzer is modular—add transcription modules to an analysis session. Results stream asynchronously using Swift's AsyncSequence.
+
+## Decision Tree - SpeechAnalyzer vs SFSpeechRecognizer
+
+```
+Need speech-to-text?
+  ├─ iOS 26+ only?
+  │   └─ Yes → SpeechAnalyzer (preferred)
+  ├─ Need iOS 10-25 support?
+  │   └─ Yes → SFSpeechRecognizer (or DictationTranscriber)
+  ├─ Long-form audio (meetings, lectures)?
+  │   └─ Yes → SpeechAnalyzer
+  ├─ Distant audio (across room)?
+  │   └─ Yes → SpeechAnalyzer
+  └─ Short dictation commands?
+      └─ Either works
+```
+
+**SpeechAnalyzer advantages**:
+- Better for long-form and conversational audio
+- Works well with distant speakers (meetings)
+- On-device, private
+- Model managed by system (no app size increase)
+- Powers Notes, Voice Memos, Journal
+
+**DictationTranscriber** (iOS 26+): Same languages as SFSpeechRecognizer, but doesn't require user to enable Siri/dictation in Settings.
+
+## Red Flags
+
+Use this skill when you see:
+- "Live transcription"
+- "Transcribe audio"
+- "Speech-to-text"
+- "SpeechAnalyzer" or "SpeechTranscriber"
+- "Volatile results"
+- Building Notes-like or Voice Memos-like features
+
+## Pattern 1 - File Transcription (Simplest)
+
+Transcribe an audio file to text in one function.
+
+```swift
+import Speech
+
+func transcribe(file: URL, locale: Locale) async throws -> AttributedString {
+    // Set up transcriber
+    let transcriber = SpeechTranscriber(
+        locale: locale,
+        preset: .offlineTranscription
+    )
+
+    // Collect results asynchronously
+    async let transcriptionFuture = try transcriber.results
+        .reduce(AttributedString()) { str, result in
+            str + result.text
+        }
+
+    // Set up analyzer with transcriber module
+    let analyzer = SpeechAnalyzer(modules: [transcriber])
+
+    // Analyze the file
+    if let lastSample = try await analyzer.analyzeSequence(from: file) {
+        try await analyzer.finalizeAndFinish(through: lastSample)
+    } else {
+        await analyzer.cancelAndFinishNow()
+    }
+
+    return try await transcriptionFuture
+}
+```
+
+**Key points**:
+- `analyzeSequence(from:)` reads file and feeds audio to analyzer
+- `finalizeAndFinish(through:)` ensures all results are finalized
+- Results are `AttributedString` with timing metadata
+
+## Pattern 2 - Live Transcription Setup
+
+For real-time transcription from microphone.
+
+### Step 1 - Configure SpeechTranscriber
+
+```swift
+import Speech
+
+class TranscriptionManager: ObservableObject {
+    private var transcriber: SpeechTranscriber?
+    private var analyzer: SpeechAnalyzer?
+    private var analyzerFormat: AudioFormatDescription?
+    private var inputBuilder: AsyncStream<AnalyzerInput>.Continuation?
+
+    @Published var finalizedTranscript = AttributedString()
+    @Published var volatileTranscript = AttributedString()
+
+    func setUp() async throws {
+        // Create transcriber with options
+        transcriber = SpeechTranscriber(
+            locale: Locale.current,
+            transcriptionOptions: [],
+            reportingOptions: [.volatileResults],  // Enable real-time updates
+            attributeOptions: [.audioTimeRange]     // Include timing
+        )
+
+        guard let transcriber else { throw TranscriptionError.setupFailed }
+
+        // Create analyzer with transcriber module
+        analyzer = SpeechAnalyzer(modules: [transcriber])
+
+        // Get required audio format
+        analyzerFormat = await SpeechAnalyzer.bestAvailableAudioFormat(
+            compatibleWith: [transcriber]
+        )
+
+        // Ensure model is available
+        try await ensureModel(for: transcriber)
+
+        // Create input stream
+        let (stream, builder) = AsyncStream<AnalyzerInput>.makeStream()
+        inputBuilder = builder
+
+        // Start analyzer
+        try await analyzer?.start(inputSequence: stream)
+    }
+}
+```
+
+### Step 2 - Ensure Model Availability
+
+```swift
+func ensureModel(for transcriber: SpeechTranscriber) async throws {
+    let locale = Locale.current
+
+    // Check if language is supported
+    let supported = await SpeechTranscriber.supportedLocales
+    guard supported.contains(where: {
+        $0.identifier(.bcp47) == locale.identifier(.bcp47)
+    }) else {
+        throw TranscriptionError.localeNotSupported
+    }
+
+    // Check if model is installed
+    let installed = await SpeechTranscriber.installedLocales
+    if installed.contains(where: {
+        $0.identifier(.bcp47) == locale.identifier(.bcp47)
+    }) {
+        return  // Already installed
+    }
+
+    // Download model
+    if let downloader = try await AssetInventory.assetInstallationRequest(
+        supporting: [transcriber]
+    ) {
+        // Track progress if needed
+        let progress = downloader.progress
+        try await downloader.downloadAndInstall()
+    }
+}
+```
+
+**Note**: Models are stored in system storage, not app storage. Limited number of languages can be allocated at once.
+
+### Step 3 - Handle Results
+
+```swift
+func startResultHandling() {
+    Task {
+        guard let transcriber else { return }
+
+        do {
+            for try await result in transcriber.results {
+                let text = result.text
+
+                if result.isFinal {
+                    // Finalized result - won't change
+                    finalizedTranscript += text
+                    volatileTranscript = AttributedString()
+
+                    // Access timing info
+                    for run in text.runs {
+                        if let timeRange = run.audioTimeRange {
+                            print("Time: \(timeRange)")
+                        }
+                    }
+                } else {
+                    // Volatile result - will be replaced
+                    volatileTranscript = text
+                }
+            }
+        } catch {
+            print("Transcription failed: \(error)")
+        }
+    }
+}
+```
+
+## Pattern 3 - Audio Recording and Streaming
+
+Connect AVAudioEngine to SpeechAnalyzer.
+
+```swift
+import AVFoundation
+
+class AudioRecorder {
+    private let audioEngine = AVAudioEngine()
+    private var outputContinuation: AsyncStream<AVAudioPCMBuffer>.Continuation?
+    private let transcriptionManager: TranscriptionManager
+
+    func startRecording() async throws {
+        // Request permission
+        guard await AVAudioApplication.requestRecordPermission() else {
+            throw RecordingError.permissionDenied
+        }
+
+        // Configure audio session (iOS)
+        #if os(iOS)
+        let session = AVAudioSession.sharedInstance()
+        try session.setCategory(.playAndRecord, mode: .spokenAudio)
+        try session.setActive(true, options: .notifyOthersOnDeactivation)
+        #endif
+
+        // Set up transcriber
+        try await transcriptionManager.setUp()
+        transcriptionManager.startResultHandling()
+
+        // Stream audio to transcriber
+        for await buffer in try audioStream() {
+            try await transcriptionManager.streamAudio(buffer)
+        }
+    }
+
+    private func audioStream() throws -> AsyncStream<AVAudioPCMBuffer> {
+        let inputNode = audioEngine.inputNode
+        let format = inputNode.outputFormat(forBus: 0)
+
+        inputNode.installTap(
+            onBus: 0,
+            bufferSize: 4096,
+            format: format
+        ) { [weak self] buffer, time in
+            self?.outputContinuation?.yield(buffer)
+        }
+
+        audioEngine.prepare()
+        try audioEngine.start()
+
+        return AsyncStream { continuation in
+            outputContinuation = continuation
+        }
+    }
+}
+```
+
+### Stream Audio with Format Conversion
+
+```swift
+extension TranscriptionManager {
+    private var converter: AVAudioConverter?
+
+    func streamAudio(_ buffer: AVAudioPCMBuffer) async throws {
+        guard let inputBuilder, let analyzerFormat else {
+            throw TranscriptionError.notSetUp
+        }
+
+        // Convert to analyzer's required format
+        let converted = try convertBuffer(buffer, to: analyzerFormat)
+
+        // Send to analyzer
+        let input = AnalyzerInput(buffer: converted)
+        inputBuilder.yield(input)
+    }
+
+    private func convertBuffer(
+        _ buffer: AVAudioPCMBuffer,
+        to format: AudioFormatDescription
+    ) throws -> AVAudioPCMBuffer {
+        // Lazy initialize converter
+        if converter == nil {
+            let sourceFormat = buffer.format
+            let destFormat = AVAudioFormat(cmAudioFormatDescription: format)!
+            converter = AVAudioConverter(from: sourceFormat, to: destFormat)
+        }
+
+        guard let converter else {
+            throw TranscriptionError.conversionFailed
+        }
+
+        let outputBuffer = AVAudioPCMBuffer(
+            pcmFormat: converter.outputFormat,
+            frameCapacity: buffer.frameLength
+        )!
+
+        try converter.convert(to: outputBuffer, from: buffer)
+        return outputBuffer
+    }
+}
+```
+
+## Pattern 4 - Stopping Transcription
+
+Properly finalize to get remaining volatile results as finalized.
+
+```swift
+func stopRecording() async {
+    // Stop audio
+    audioEngine.stop()
+    audioEngine.inputNode.removeTap(onBus: 0)
+    outputContinuation?.finish()
+
+    // Finalize transcription (converts remaining volatile to final)
+    try? await analyzer?.finalizeAndFinishThroughEndOfInput()
+
+    // Cancel any pending tasks
+    recognizerTask?.cancel()
+}
+```
+
+**Critical**: Always call `finalizeAndFinishThroughEndOfInput()` to ensure volatile results are finalized.
+
+## Pattern 5 - Model Asset Management
+
+### Check Supported Languages
+
+```swift
+// Languages the API supports
+let supported = await SpeechTranscriber.supportedLocales
+
+// Languages currently installed on device
+let installed = await SpeechTranscriber.installedLocales
+```
+
+### Deallocate Languages
+
+Limited number of languages can be allocated. Deallocate unused ones.
+
+```swift
+func deallocateLanguages() async {
+    let allocated = await AssetInventory.allocatedLocales
+    for locale in allocated {
+        await AssetInventory.deallocate(locale: locale)
+    }
+}
+```
+
+## Pattern 6 - Displaying Results with Timing
+
+Highlight text during audio playback using timing metadata.
+
+```swift
+struct TranscriptView: View {
+    let transcript: AttributedString
+    @Binding var playbackTime: CMTime
+
+    var body: some View {
+        Text(highlightedTranscript)
+    }
+
+    var highlightedTranscript: AttributedString {
+        var result = transcript
+
+        for (range, run) in transcript.runs {
+            guard let timeRange = run.audioTimeRange else { continue }
+
+            let isActive = timeRange.containsTime(playbackTime)
+            if isActive {
+                result[range].backgroundColor = .yellow
+            }
+        }
+
+        return result
+    }
+}
+```
+
+## Anti-Patterns
+
+### Don't - Forget to finalize
+
+```swift
+// BAD - volatile results lost
+func stopRecording() {
+    audioEngine.stop()
+    // Missing finalize!
+}
+
+// GOOD - volatile results become finalized
+func stopRecording() async {
+    audioEngine.stop()
+    try? await analyzer?.finalizeAndFinishThroughEndOfInput()
+}
+```
+
+### Don't - Ignore format conversion
+
+```swift
+// BAD - format mismatch may fail silently
+inputBuilder.yield(AnalyzerInput(buffer: rawBuffer))
+
+// GOOD - convert to analyzer's format
+let format = await SpeechAnalyzer.bestAvailableAudioFormat(compatibleWith: [transcriber])
+let converted = try convertBuffer(rawBuffer, to: format)
+inputBuilder.yield(AnalyzerInput(buffer: converted))
+```
+
+### Don't - Skip model availability check
+
+```swift
+// BAD - may crash if model not installed
+let transcriber = SpeechTranscriber(locale: locale, ...)
+// Start using immediately
+
+// GOOD - ensure model is ready
+let transcriber = SpeechTranscriber(locale: locale, ...)
+try await ensureModel(for: transcriber)
+// Now safe to use
+```
+
+## Presets Reference
+
+| Preset | Use Case |
+|--------|----------|
+| `.offlineTranscription` | File transcription, no real-time feedback needed |
+| `.progressiveLiveTranscription` | Live transcription with volatile updates |
+
+## Options Reference
+
+### TranscriptionOptions
+- Default: None (standard transcription)
+
+### ReportingOptions
+- `.volatileResults`: Enable real-time approximate results
+
+### AttributeOptions
+- `.audioTimeRange`: Include CMTimeRange for each text segment
+
+## Platform Availability
+
+| Platform | SpeechTranscriber | DictationTranscriber |
+|----------|-------------------|---------------------|
+| iOS 26+ | Yes | Yes |
+| macOS Tahoe+ | Yes | Yes |
+| watchOS 26+ | No | Yes |
+| tvOS 26+ | TBD | TBD |
+
+**Hardware requirements**: Varies by device. Use `supportedLocales` to check.
+
+## Integration with Apple Intelligence
+
+Combine with Foundation Models for summarization:
+
+```swift
+import FoundationModels
+
+func generateTitle(for transcript: String) async throws -> String {
+    let session = LanguageModelSession()
+    let prompt = "Generate a short, clever title for this story: \(transcript)"
+    let response = try await session.respond(to: prompt)
+    return response.content
+}
+```
+
+See `axiom-ios-ai` skill for Foundation Models details.
+
+## Checklist
+
+Before shipping speech-to-text:
+
+- [ ] Check locale support with `supportedLocales`
+- [ ] Ensure model with `AssetInventory.assetInstallationRequest`
+- [ ] Handle download progress for user feedback
+- [ ] Convert audio to `bestAvailableAudioFormat`
+- [ ] Enable `.volatileResults` for live transcription
+- [ ] Call `finalizeAndFinishThroughEndOfInput()` on stop
+- [ ] Handle timing with `.audioTimeRange` if needed
+- [ ] Clear volatile results when finalized result arrives
+- [ ] Request microphone permission before recording
+
+## Resources
+
+**WWDC**: 2025-277
+
+**Docs**: /speech, /speech/speechanalyzer, /speech/speechtranscriber
+
+**Skills**: coreml (on-device ML), axiom-ios-ai (Foundation Models)