This snapshot establishes the camera-to-result recognition flow and related tests while checking in the project skill/docs assets required for the configured local tooling.
497 lines
14 KiB
Markdown
497 lines
14 KiB
Markdown
---
|
|
name: speech
|
|
description: Use when implementing speech-to-text, live transcription, or audio transcription. Covers SpeechAnalyzer (iOS 26+), SpeechTranscriber, volatile/finalized results, AssetInventory model management, audio format handling.
|
|
license: MIT
|
|
version: 1.0.0
|
|
---
|
|
|
|
# Speech-to-Text with SpeechAnalyzer
|
|
|
|
## Overview
|
|
|
|
SpeechAnalyzer is Apple's new speech-to-text API introduced in iOS 26. It powers Notes, Voice Memos, Journal, and Call Summarization. The on-device model is faster, more accurate, and better for long-form/distant audio than SFSpeechRecognizer.
|
|
|
|
**Key principle**: SpeechAnalyzer is modular—add transcription modules to an analysis session. Results stream asynchronously using Swift's AsyncSequence.
|
|
|
|
## Decision Tree - SpeechAnalyzer vs SFSpeechRecognizer
|
|
|
|
```
|
|
Need speech-to-text?
|
|
├─ iOS 26+ only?
|
|
│ └─ Yes → SpeechAnalyzer (preferred)
|
|
├─ Need iOS 10-25 support?
|
|
│ └─ Yes → SFSpeechRecognizer (or DictationTranscriber)
|
|
├─ Long-form audio (meetings, lectures)?
|
|
│ └─ Yes → SpeechAnalyzer
|
|
├─ Distant audio (across room)?
|
|
│ └─ Yes → SpeechAnalyzer
|
|
└─ Short dictation commands?
|
|
└─ Either works
|
|
```
|
|
|
|
**SpeechAnalyzer advantages**:
|
|
- Better for long-form and conversational audio
|
|
- Works well with distant speakers (meetings)
|
|
- On-device, private
|
|
- Model managed by system (no app size increase)
|
|
- Powers Notes, Voice Memos, Journal
|
|
|
|
**DictationTranscriber** (iOS 26+): Same languages as SFSpeechRecognizer, but doesn't require user to enable Siri/dictation in Settings.
|
|
|
|
## Red Flags
|
|
|
|
Use this skill when you see:
|
|
- "Live transcription"
|
|
- "Transcribe audio"
|
|
- "Speech-to-text"
|
|
- "SpeechAnalyzer" or "SpeechTranscriber"
|
|
- "Volatile results"
|
|
- Building Notes-like or Voice Memos-like features
|
|
|
|
## Pattern 1 - File Transcription (Simplest)
|
|
|
|
Transcribe an audio file to text in one function.
|
|
|
|
```swift
|
|
import Speech
|
|
|
|
func transcribe(file: URL, locale: Locale) async throws -> AttributedString {
|
|
// Set up transcriber
|
|
let transcriber = SpeechTranscriber(
|
|
locale: locale,
|
|
preset: .offlineTranscription
|
|
)
|
|
|
|
// Collect results asynchronously
|
|
async let transcriptionFuture = try transcriber.results
|
|
.reduce(AttributedString()) { str, result in
|
|
str + result.text
|
|
}
|
|
|
|
// Set up analyzer with transcriber module
|
|
let analyzer = SpeechAnalyzer(modules: [transcriber])
|
|
|
|
// Analyze the file
|
|
if let lastSample = try await analyzer.analyzeSequence(from: file) {
|
|
try await analyzer.finalizeAndFinish(through: lastSample)
|
|
} else {
|
|
await analyzer.cancelAndFinishNow()
|
|
}
|
|
|
|
return try await transcriptionFuture
|
|
}
|
|
```
|
|
|
|
**Key points**:
|
|
- `analyzeSequence(from:)` reads file and feeds audio to analyzer
|
|
- `finalizeAndFinish(through:)` ensures all results are finalized
|
|
- Results are `AttributedString` with timing metadata
|
|
|
|
## Pattern 2 - Live Transcription Setup
|
|
|
|
For real-time transcription from microphone.
|
|
|
|
### Step 1 - Configure SpeechTranscriber
|
|
|
|
```swift
|
|
import Speech
|
|
|
|
class TranscriptionManager: ObservableObject {
|
|
private var transcriber: SpeechTranscriber?
|
|
private var analyzer: SpeechAnalyzer?
|
|
private var analyzerFormat: AudioFormatDescription?
|
|
private var inputBuilder: AsyncStream<AnalyzerInput>.Continuation?
|
|
|
|
@Published var finalizedTranscript = AttributedString()
|
|
@Published var volatileTranscript = AttributedString()
|
|
|
|
func setUp() async throws {
|
|
// Create transcriber with options
|
|
transcriber = SpeechTranscriber(
|
|
locale: Locale.current,
|
|
transcriptionOptions: [],
|
|
reportingOptions: [.volatileResults], // Enable real-time updates
|
|
attributeOptions: [.audioTimeRange] // Include timing
|
|
)
|
|
|
|
guard let transcriber else { throw TranscriptionError.setupFailed }
|
|
|
|
// Create analyzer with transcriber module
|
|
analyzer = SpeechAnalyzer(modules: [transcriber])
|
|
|
|
// Get required audio format
|
|
analyzerFormat = await SpeechAnalyzer.bestAvailableAudioFormat(
|
|
compatibleWith: [transcriber]
|
|
)
|
|
|
|
// Ensure model is available
|
|
try await ensureModel(for: transcriber)
|
|
|
|
// Create input stream
|
|
let (stream, builder) = AsyncStream<AnalyzerInput>.makeStream()
|
|
inputBuilder = builder
|
|
|
|
// Start analyzer
|
|
try await analyzer?.start(inputSequence: stream)
|
|
}
|
|
}
|
|
```
|
|
|
|
### Step 2 - Ensure Model Availability
|
|
|
|
```swift
|
|
func ensureModel(for transcriber: SpeechTranscriber) async throws {
|
|
let locale = Locale.current
|
|
|
|
// Check if language is supported
|
|
let supported = await SpeechTranscriber.supportedLocales
|
|
guard supported.contains(where: {
|
|
$0.identifier(.bcp47) == locale.identifier(.bcp47)
|
|
}) else {
|
|
throw TranscriptionError.localeNotSupported
|
|
}
|
|
|
|
// Check if model is installed
|
|
let installed = await SpeechTranscriber.installedLocales
|
|
if installed.contains(where: {
|
|
$0.identifier(.bcp47) == locale.identifier(.bcp47)
|
|
}) {
|
|
return // Already installed
|
|
}
|
|
|
|
// Download model
|
|
if let downloader = try await AssetInventory.assetInstallationRequest(
|
|
supporting: [transcriber]
|
|
) {
|
|
// Track progress if needed
|
|
let progress = downloader.progress
|
|
try await downloader.downloadAndInstall()
|
|
}
|
|
}
|
|
```
|
|
|
|
**Note**: Models are stored in system storage, not app storage. Limited number of languages can be allocated at once.
|
|
|
|
### Step 3 - Handle Results
|
|
|
|
```swift
|
|
func startResultHandling() {
|
|
Task {
|
|
guard let transcriber else { return }
|
|
|
|
do {
|
|
for try await result in transcriber.results {
|
|
let text = result.text
|
|
|
|
if result.isFinal {
|
|
// Finalized result - won't change
|
|
finalizedTranscript += text
|
|
volatileTranscript = AttributedString()
|
|
|
|
// Access timing info
|
|
for run in text.runs {
|
|
if let timeRange = run.audioTimeRange {
|
|
print("Time: \(timeRange)")
|
|
}
|
|
}
|
|
} else {
|
|
// Volatile result - will be replaced
|
|
volatileTranscript = text
|
|
}
|
|
}
|
|
} catch {
|
|
print("Transcription failed: \(error)")
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
## Pattern 3 - Audio Recording and Streaming
|
|
|
|
Connect AVAudioEngine to SpeechAnalyzer.
|
|
|
|
```swift
|
|
import AVFoundation
|
|
|
|
class AudioRecorder {
|
|
private let audioEngine = AVAudioEngine()
|
|
private var outputContinuation: AsyncStream<AVAudioPCMBuffer>.Continuation?
|
|
private let transcriptionManager: TranscriptionManager
|
|
|
|
func startRecording() async throws {
|
|
// Request permission
|
|
guard await AVAudioApplication.requestRecordPermission() else {
|
|
throw RecordingError.permissionDenied
|
|
}
|
|
|
|
// Configure audio session (iOS)
|
|
#if os(iOS)
|
|
let session = AVAudioSession.sharedInstance()
|
|
try session.setCategory(.playAndRecord, mode: .spokenAudio)
|
|
try session.setActive(true, options: .notifyOthersOnDeactivation)
|
|
#endif
|
|
|
|
// Set up transcriber
|
|
try await transcriptionManager.setUp()
|
|
transcriptionManager.startResultHandling()
|
|
|
|
// Stream audio to transcriber
|
|
for await buffer in try audioStream() {
|
|
try await transcriptionManager.streamAudio(buffer)
|
|
}
|
|
}
|
|
|
|
private func audioStream() throws -> AsyncStream<AVAudioPCMBuffer> {
|
|
let inputNode = audioEngine.inputNode
|
|
let format = inputNode.outputFormat(forBus: 0)
|
|
|
|
inputNode.installTap(
|
|
onBus: 0,
|
|
bufferSize: 4096,
|
|
format: format
|
|
) { [weak self] buffer, time in
|
|
self?.outputContinuation?.yield(buffer)
|
|
}
|
|
|
|
audioEngine.prepare()
|
|
try audioEngine.start()
|
|
|
|
return AsyncStream { continuation in
|
|
outputContinuation = continuation
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
### Stream Audio with Format Conversion
|
|
|
|
```swift
|
|
extension TranscriptionManager {
|
|
private var converter: AVAudioConverter?
|
|
|
|
func streamAudio(_ buffer: AVAudioPCMBuffer) async throws {
|
|
guard let inputBuilder, let analyzerFormat else {
|
|
throw TranscriptionError.notSetUp
|
|
}
|
|
|
|
// Convert to analyzer's required format
|
|
let converted = try convertBuffer(buffer, to: analyzerFormat)
|
|
|
|
// Send to analyzer
|
|
let input = AnalyzerInput(buffer: converted)
|
|
inputBuilder.yield(input)
|
|
}
|
|
|
|
private func convertBuffer(
|
|
_ buffer: AVAudioPCMBuffer,
|
|
to format: AudioFormatDescription
|
|
) throws -> AVAudioPCMBuffer {
|
|
// Lazy initialize converter
|
|
if converter == nil {
|
|
let sourceFormat = buffer.format
|
|
let destFormat = AVAudioFormat(cmAudioFormatDescription: format)!
|
|
converter = AVAudioConverter(from: sourceFormat, to: destFormat)
|
|
}
|
|
|
|
guard let converter else {
|
|
throw TranscriptionError.conversionFailed
|
|
}
|
|
|
|
let outputBuffer = AVAudioPCMBuffer(
|
|
pcmFormat: converter.outputFormat,
|
|
frameCapacity: buffer.frameLength
|
|
)!
|
|
|
|
try converter.convert(to: outputBuffer, from: buffer)
|
|
return outputBuffer
|
|
}
|
|
}
|
|
```
|
|
|
|
## Pattern 4 - Stopping Transcription
|
|
|
|
Properly finalize to get remaining volatile results as finalized.
|
|
|
|
```swift
|
|
func stopRecording() async {
|
|
// Stop audio
|
|
audioEngine.stop()
|
|
audioEngine.inputNode.removeTap(onBus: 0)
|
|
outputContinuation?.finish()
|
|
|
|
// Finalize transcription (converts remaining volatile to final)
|
|
try? await analyzer?.finalizeAndFinishThroughEndOfInput()
|
|
|
|
// Cancel any pending tasks
|
|
recognizerTask?.cancel()
|
|
}
|
|
```
|
|
|
|
**Critical**: Always call `finalizeAndFinishThroughEndOfInput()` to ensure volatile results are finalized.
|
|
|
|
## Pattern 5 - Model Asset Management
|
|
|
|
### Check Supported Languages
|
|
|
|
```swift
|
|
// Languages the API supports
|
|
let supported = await SpeechTranscriber.supportedLocales
|
|
|
|
// Languages currently installed on device
|
|
let installed = await SpeechTranscriber.installedLocales
|
|
```
|
|
|
|
### Deallocate Languages
|
|
|
|
Limited number of languages can be allocated. Deallocate unused ones.
|
|
|
|
```swift
|
|
func deallocateLanguages() async {
|
|
let allocated = await AssetInventory.allocatedLocales
|
|
for locale in allocated {
|
|
await AssetInventory.deallocate(locale: locale)
|
|
}
|
|
}
|
|
```
|
|
|
|
## Pattern 6 - Displaying Results with Timing
|
|
|
|
Highlight text during audio playback using timing metadata.
|
|
|
|
```swift
|
|
struct TranscriptView: View {
|
|
let transcript: AttributedString
|
|
@Binding var playbackTime: CMTime
|
|
|
|
var body: some View {
|
|
Text(highlightedTranscript)
|
|
}
|
|
|
|
var highlightedTranscript: AttributedString {
|
|
var result = transcript
|
|
|
|
for (range, run) in transcript.runs {
|
|
guard let timeRange = run.audioTimeRange else { continue }
|
|
|
|
let isActive = timeRange.containsTime(playbackTime)
|
|
if isActive {
|
|
result[range].backgroundColor = .yellow
|
|
}
|
|
}
|
|
|
|
return result
|
|
}
|
|
}
|
|
```
|
|
|
|
## Anti-Patterns
|
|
|
|
### Don't - Forget to finalize
|
|
|
|
```swift
|
|
// BAD - volatile results lost
|
|
func stopRecording() {
|
|
audioEngine.stop()
|
|
// Missing finalize!
|
|
}
|
|
|
|
// GOOD - volatile results become finalized
|
|
func stopRecording() async {
|
|
audioEngine.stop()
|
|
try? await analyzer?.finalizeAndFinishThroughEndOfInput()
|
|
}
|
|
```
|
|
|
|
### Don't - Ignore format conversion
|
|
|
|
```swift
|
|
// BAD - format mismatch may fail silently
|
|
inputBuilder.yield(AnalyzerInput(buffer: rawBuffer))
|
|
|
|
// GOOD - convert to analyzer's format
|
|
let format = await SpeechAnalyzer.bestAvailableAudioFormat(compatibleWith: [transcriber])
|
|
let converted = try convertBuffer(rawBuffer, to: format)
|
|
inputBuilder.yield(AnalyzerInput(buffer: converted))
|
|
```
|
|
|
|
### Don't - Skip model availability check
|
|
|
|
```swift
|
|
// BAD - may crash if model not installed
|
|
let transcriber = SpeechTranscriber(locale: locale, ...)
|
|
// Start using immediately
|
|
|
|
// GOOD - ensure model is ready
|
|
let transcriber = SpeechTranscriber(locale: locale, ...)
|
|
try await ensureModel(for: transcriber)
|
|
// Now safe to use
|
|
```
|
|
|
|
## Presets Reference
|
|
|
|
| Preset | Use Case |
|
|
|--------|----------|
|
|
| `.offlineTranscription` | File transcription, no real-time feedback needed |
|
|
| `.progressiveLiveTranscription` | Live transcription with volatile updates |
|
|
|
|
## Options Reference
|
|
|
|
### TranscriptionOptions
|
|
- Default: None (standard transcription)
|
|
|
|
### ReportingOptions
|
|
- `.volatileResults`: Enable real-time approximate results
|
|
|
|
### AttributeOptions
|
|
- `.audioTimeRange`: Include CMTimeRange for each text segment
|
|
|
|
## Platform Availability
|
|
|
|
| Platform | SpeechTranscriber | DictationTranscriber |
|
|
|----------|-------------------|---------------------|
|
|
| iOS 26+ | Yes | Yes |
|
|
| macOS Tahoe+ | Yes | Yes |
|
|
| watchOS 26+ | No | Yes |
|
|
| tvOS 26+ | TBD | TBD |
|
|
|
|
**Hardware requirements**: Varies by device. Use `supportedLocales` to check.
|
|
|
|
## Integration with Apple Intelligence
|
|
|
|
Combine with Foundation Models for summarization:
|
|
|
|
```swift
|
|
import FoundationModels
|
|
|
|
func generateTitle(for transcript: String) async throws -> String {
|
|
let session = LanguageModelSession()
|
|
let prompt = "Generate a short, clever title for this story: \(transcript)"
|
|
let response = try await session.respond(to: prompt)
|
|
return response.content
|
|
}
|
|
```
|
|
|
|
See `axiom-ios-ai` skill for Foundation Models details.
|
|
|
|
## Checklist
|
|
|
|
Before shipping speech-to-text:
|
|
|
|
- [ ] Check locale support with `supportedLocales`
|
|
- [ ] Ensure model with `AssetInventory.assetInstallationRequest`
|
|
- [ ] Handle download progress for user feedback
|
|
- [ ] Convert audio to `bestAvailableAudioFormat`
|
|
- [ ] Enable `.volatileResults` for live transcription
|
|
- [ ] Call `finalizeAndFinishThroughEndOfInput()` on stop
|
|
- [ ] Handle timing with `.audioTimeRange` if needed
|
|
- [ ] Clear volatile results when finalized result arrives
|
|
- [ ] Request microphone permission before recording
|
|
|
|
## Resources
|
|
|
|
**WWDC**: 2025-277
|
|
|
|
**Docs**: /speech, /speech/speechanalyzer, /speech/speechtranscriber
|
|
|
|
**Skills**: coreml (on-device ML), axiom-ios-ai (Foundation Models)
|