Add scan flow MVP and local Axiom skill workspace

This snapshot establishes the camera-to-result recognition flow and related tests while checking in the project skill/docs assets required for the configured local tooling.
This commit is contained in:
Matthias
2026-04-19 21:11:32 +02:00
parent 577214d474
commit a60a76b797
679 changed files with 138964 additions and 73 deletions

View File

@@ -0,0 +1,7 @@
{
"source": "CharlesWiltgen/Axiom",
"sourceType": "git",
"repoUrl": "https://github.com/CharlesWiltgen/Axiom",
"subpath": ".claude-plugin/plugins/axiom/skills/axiom-ios-ml",
"installedAt": "2026-04-12T08:05:35.624Z"
}

View File

@@ -0,0 +1,136 @@
---
name: axiom-ios-ml
description: Use when deploying ANY machine learning model on-device, converting models to CoreML, compressing models, or implementing speech-to-text. Covers CoreML conversion, MLTensor, model compression (quantization/palettization/pruning), stateful models, KV-cache, multi-function models, async prediction, SpeechAnalyzer, SpeechTranscriber.
license: MIT
---
# iOS Machine Learning Router
**You MUST use this skill for ANY on-device machine learning or speech-to-text work.**
## When to Use
Use this router when:
- Converting PyTorch/TensorFlow models to CoreML
- Deploying ML models on-device
- Compressing models (quantization, palettization, pruning)
- Working with large language models (LLMs)
- Implementing KV-cache for transformers
- Using MLTensor for model stitching
- Building speech-to-text features
- Transcribing audio (live or recorded)
## Boundary with ios-ai
**ios-ml vs ios-ai — know the difference:**
| Developer Intent | Router |
|-----------------|--------|
| "Use Apple Intelligence / Foundation Models" | **ios-ai** — Apple's on-device LLM |
| "Run my own ML model on device" | **ios-ml** — CoreML conversion + deployment |
| "Add text generation with @Generable" | **ios-ai** — Foundation Models structured output |
| "Deploy a custom LLM with KV-cache" | **ios-ml** — Custom model optimization |
| "Use Vision framework for image analysis" | **ios-vision** — Not ML deployment |
| "Use pre-trained Apple NLP models" | **ios-ai** — Apple's models, not custom |
**Rule of thumb**: If the developer is converting/compressing/deploying their own model → ios-ml. If they're using Apple's built-in AI → ios-ai. If they're doing computer vision → ios-vision.
## Routing Logic
### CoreML Work
**Implementation patterns**`/skill coreml`
- Model conversion workflow
- MLTensor for model stitching
- Stateful models with KV-cache
- Multi-function models (adapters/LoRA)
- Async prediction patterns
- Compute unit selection
**API reference**`/skill coreml-ref`
- CoreML Tools Python API
- MLModel lifecycle
- MLTensor operations
- MLComputeDevice availability
- State management APIs
- Performance reports
**Diagnostics**`/skill coreml-diag`
- Model won't load
- Slow inference
- Memory issues
- Compression accuracy loss
- Compute unit problems
### Speech Work
**Implementation patterns**`/skill speech`
- SpeechAnalyzer setup (iOS 26+)
- SpeechTranscriber configuration
- Live transcription
- File transcription
- Volatile vs finalized results
- Model asset management
## Decision Tree
1. Implementing / converting ML models? → coreml
2. CoreML API reference? → coreml-ref
3. Debugging ML issues (load, inference, compression)? → coreml-diag
4. Speech-to-text / transcription? → speech
## Anti-Rationalization
| Thought | Reality |
|---------|---------|
| "CoreML is just load and predict" | CoreML has compression, stateful models, compute unit selection, and async prediction. coreml covers all. |
| "My model is small, no optimization needed" | Even small models benefit from compute unit selection and async prediction. coreml has the patterns. |
| "I'll just use SFSpeechRecognizer" | iOS 26 has SpeechAnalyzer with better accuracy and offline support. speech skill covers the modern API. |
## Critical Patterns
**coreml**:
- Model conversion (PyTorch → CoreML)
- Compression (palettization, quantization, pruning)
- Stateful KV-cache for LLMs
- Multi-function models for adapters
- MLTensor for pipeline stitching
- Async concurrent prediction
**coreml-diag**:
- Load failures and caching
- Inference performance issues
- Memory pressure from models
- Accuracy degradation from compression
**speech**:
- SpeechAnalyzer + SpeechTranscriber setup
- AssetInventory model management
- Live transcription with volatile results
- Audio format conversion
## Example Invocations
User: "How do I convert a PyTorch model to CoreML?"
→ Invoke: `/skill coreml`
User: "Compress my model to fit on iPhone"
→ Invoke: `/skill coreml`
User: "Implement KV-cache for my language model"
→ Invoke: `/skill coreml`
User: "Model loads slowly on first launch"
→ Invoke: `/skill coreml-diag`
User: "My compressed model has bad accuracy"
→ Invoke: `/skill coreml-diag`
User: "Add live transcription to my app"
→ Invoke: `/skill speech`
User: "Transcribe audio files with SpeechAnalyzer"
→ Invoke: `/skill speech`
User: "What's MLTensor and how do I use it?"
→ Invoke: `/skill coreml-ref`

View File

@@ -0,0 +1,473 @@
---
name: coreml-diag
description: CoreML diagnostics - model load failures, slow inference, memory issues, compression accuracy loss, compute unit problems, conversion errors.
license: MIT
version: 1.0.0
---
# CoreML Diagnostics
## Quick Reference
| Symptom | First Check | Pattern |
|---------|-------------|---------|
| Model won't load | Deployment target | 1a-1c |
| Slow first load | Cache miss | 2a |
| Slow inference | Compute units | 2b-2c |
| High memory | Concurrent predictions | 3a-3b |
| Bad accuracy after compression | Granularity | 4a-4c |
| Conversion fails | Operation support | 5a-5b |
## Decision Tree
```
CoreML issue
├─ Load failure?
│ ├─ "Unsupported model version" → 1a
│ ├─ "Failed to create compute plan" → 1b
│ └─ Other load error → 1c
├─ Performance issue?
│ ├─ First load slow, subsequent fast? → 2a
│ ├─ All predictions slow? → 2b
│ └─ Slow only on specific device? → 2c
├─ Memory issue?
│ ├─ Memory grows during predictions? → 3a
│ └─ Out of memory on load? → 3b
├─ Accuracy degraded?
│ ├─ After palettization? → 4a
│ ├─ After quantization? → 4b
│ └─ After pruning? → 4c
└─ Conversion issue?
├─ Operation not supported? → 5a
└─ Wrong output? → 5b
```
---
## Pattern 1a - "Unsupported model version"
**Symptom**: Model fails to load with version error.
**Cause**: Model compiled for newer OS than device supports.
**Diagnosis**:
```python
# Check model's minimum deployment target
import coremltools as ct
model = ct.models.MLModel("Model.mlpackage")
print(model.get_spec().specificationVersion)
```
| Spec Version | Minimum iOS |
|--------------|-------------|
| 4 | iOS 13 |
| 5 | iOS 14 |
| 6 | iOS 15 |
| 7 | iOS 16 |
| 8 | iOS 17 |
| 9 | iOS 18 |
**Fix**: Re-convert with lower deployment target:
```python
mlmodel = ct.convert(
traced,
minimum_deployment_target=ct.target.iOS16 # Lower target
)
```
**Tradeoff**: Loses newer optimizations (SDPA fusion, per-block quantization, MLTensor).
---
## Pattern 1b - "Failed to create compute plan"
**Symptom**: Model loads on some devices but not others.
**Cause**: Unsupported operations for target compute unit.
**Diagnosis**:
1. Open model in Xcode
2. Create Performance Report
3. Check "Unsupported" operations
4. Hover for hints
**Fix**:
```swift
// Force CPU-only to bypass unsupported GPU/NE operations
let config = MLModelConfiguration()
config.computeUnits = .cpuOnly
let model = try MLModel(contentsOf: url, configuration: config)
```
**Better fix**: Update model precision or operations during conversion:
```python
# Float16 often better supported
mlmodel = ct.convert(traced, compute_precision=ct.precision.FLOAT16)
```
---
## Pattern 1c - General Load Failures
**Symptom**: Model fails to load with unclear error.
**Checklist**:
1. Check file exists and is readable
2. Check compiled vs source model (runtime needs `.mlmodelc`)
3. Check available disk space (cache needs room)
4. Check model isn't corrupted (re-convert)
```swift
// Debug logging
let config = MLModelConfiguration()
config.parameters = [.reporter: { print($0) }] // iOS 17+
```
---
## Pattern 2a - Slow First Load (Cache Miss)
**Symptom**: First prediction after install/update is slow, subsequent are fast.
**Cause**: Device specialization not cached.
**Diagnosis**:
1. Profile with Core ML Instrument
2. Look at Load event label:
- "prepare and cache" = cache miss (slow)
- "cached" = cache hit (fast)
**Why cache misses**:
- First launch after install
- System update invalidated cache
- Low disk space cleared cache
- Model file was modified
**Mitigation**:
```swift
// Warm cache in background at app launch
Task.detached(priority: .background) {
_ = try? await MLModel.load(contentsOf: modelURL)
}
```
**Note**: Cache is tied to (model path + configuration + device). Different configs = different cache entries.
---
## Pattern 2b - All Predictions Slow
**Symptom**: Predictions consistently slow, not just first one.
**Diagnosis**:
1. Create Xcode Performance Report
2. Check compute unit distribution
3. Look for high-cost operations
**Common causes**:
| Cause | Fix |
|-------|-----|
| Running on CPU when GPU/NE available | Check `computeUnits` config |
| Model too large for Neural Engine | Compress model |
| Frequent CPU↔GPU↔NE transfers | Adjust segmentation |
| Dynamic shapes recompiling | Use fixed/enumerated shapes |
**Profile compute unit usage**:
```swift
let plan = try await MLComputePlan.load(contentsOf: modelURL)
for op in plan.modelStructure.operations {
let info = plan.computeDeviceInfo(for: op)
print("\(op.name): \(info.preferredDevice)")
}
```
---
## Pattern 2c - Slow on Specific Device
**Symptom**: Fast on Mac, slow on iPhone (or vice versa).
**Cause**: Different hardware characteristics.
**Diagnosis**:
```swift
// Check available compute
let devices = MLModel.availableComputeDevices
print(devices) // Different per device
```
**Common issues**:
| Scenario | Cause | Fix |
|----------|-------|-----|
| Fast on M-series Mac, slow on iPhone | Model optimized for GPU | Use palettization (Neural Engine) |
| Fast on iPhone, slow on Intel Mac | No Neural Engine | Use quantization (GPU) |
| Slow on older devices | Less compute power | Use more aggressive compression |
**Recommendation**: Profile on target devices, not just development Mac.
---
## Pattern 3a - Memory Grows During Predictions
**Symptom**: Memory increases with each prediction, doesn't release.
**Cause**: Input/output buffers accumulating from concurrent predictions.
**Diagnosis**:
```
Instruments → Allocations + Core ML template
Look for: Many concurrent prediction intervals
Check: MLMultiArray allocations growing
```
**Fix**: Limit concurrent predictions:
```swift
actor PredictionLimiter {
private let maxConcurrent = 2
private var inFlight = 0
func predict(_ model: MLModel, input: MLFeatureProvider) async throws -> MLFeatureProvider {
while inFlight >= maxConcurrent {
await Task.yield()
}
inFlight += 1
defer { inFlight -= 1 }
return try await model.prediction(from: input)
}
}
```
---
## Pattern 3b - Out of Memory on Load
**Symptom**: App crashes or model fails to load on memory-constrained devices.
**Cause**: Model too large for device memory.
**Diagnosis**:
```bash
# Check model size
ls -lh Model.mlpackage/Data/com.apple.CoreML/weights/
```
**Fix options**:
| Approach | Compression | Memory Impact |
|----------|-------------|---------------|
| 8-bit palettization | 2x smaller | 2x less memory |
| 4-bit palettization | 4x smaller | 4x less memory |
| Pruning (50%) | ~2x smaller | ~2x less memory |
**Note**: Compressed weights are decompressed just-in-time (iOS 17+), so smaller on-disk = smaller in memory.
---
## Pattern 4a - Bad Accuracy After Palettization
**Symptom**: Model output degraded after palettization.
**Diagnosis**:
1. What bit depth? (2-bit most likely to fail)
2. What granularity? (per-tensor loses more than per-grouped-channel)
**Fix progression**:
```python
# Step 1: Try grouped channels (iOS 18+)
config = OpPalettizerConfig(
nbits=4,
granularity="per_grouped_channel",
group_size=16
)
# Step 2: If still bad, try more bits
config = OpPalettizerConfig(nbits=6, ...)
# Step 3: If still need 4-bit, use calibration
from coremltools.optimize.torch.palettization import DKMPalettizer
# ... training-time compression
```
**Key insight**: 4-bit per-tensor has only 16 clusters for entire weight matrix. Grouped channels = 16 clusters per 16 channels = much better granularity.
---
## Pattern 4b - Bad Accuracy After Quantization
**Symptom**: Model output degraded after INT8/INT4 quantization.
**Diagnosis**:
1. What bit depth?
2. What granularity?
**Fix progression**:
```python
# Step 1: Use per-block (iOS 18+)
config = OpLinearQuantizerConfig(
dtype="int4",
granularity="per_block",
block_size=32
)
# Step 2: Use calibration data
from coremltools.optimize.torch.quantization import LayerwiseCompressor
compressor = LayerwiseCompressor(model, config)
quantized = compressor.compress(calibration_loader)
```
**Note**: INT4 quantization works best on Mac GPU. For Neural Engine, prefer palettization.
---
## Pattern 4c - Bad Accuracy After Pruning
**Symptom**: Model output degraded after weight pruning.
**Diagnosis**:
1. What sparsity level?
2. Post-training or training-time?
**Thresholds** (model-dependent):
- 0-30% sparsity: Usually safe
- 30-50% sparsity: May need calibration
- 50%+ sparsity: Usually needs training-time
**Fix**:
```python
# Use calibration-based pruning
from coremltools.optimize.torch.pruning import LayerwiseCompressor
config = MagnitudePrunerConfig(
target_sparsity=0.4,
n_samples=128
)
compressor = LayerwiseCompressor(model, config)
sparse = compressor.compress(calibration_loader)
```
---
## Pattern 5a - Operation Not Supported
**Symptom**: Conversion fails with unsupported operation error.
**Diagnosis**:
```
Error: "Op 'custom_op' is not supported for conversion"
```
**Options**:
1. **Check if op is in coremltools**: May need newer version
```bash
pip install --upgrade coremltools
```
2. **Use composite ops**: Split into supported primitives
```python
# Instead of custom_op(x)
# Use: supported_op1(supported_op2(x))
```
3. **Register custom op**: Advanced, requires MIL programming
```python
from coremltools.converters.mil import Builder as mb
@mb.register_torch_op
def custom_op(context, node):
# Map to MIL operations
...
```
---
## Pattern 5b - Conversion Succeeds but Wrong Output
**Symptom**: Model converts but predictions differ from PyTorch.
**Diagnosis checklist**:
1. **Input normalization**: Ensure preprocessing matches
```python
# PyTorch often uses ImageNet normalization
# CoreML may need explicit preprocessing
```
2. **Shape ordering**: PyTorch (NCHW) vs CoreML (NHWC for some ops)
```python
# Check shapes in conversion
ct.convert(..., inputs=[ct.ImageType(shape=(1, 3, 224, 224))])
```
3. **Precision differences**: Float16 may differ from Float32
```python
# Force Float32 to match PyTorch
ct.convert(..., compute_precision=ct.precision.FLOAT32)
```
4. **Random ops**: Dropout, random initialization differ
```python
# Ensure eval mode
model.eval()
```
**Debug**:
```python
# Compare outputs layer by layer
import numpy as np
torch_output = model(input).detach().numpy()
coreml_output = mlmodel.predict({"input": input.numpy()})["output"]
print(f"Max diff: {np.max(np.abs(torch_output - coreml_output))}")
```
---
## Pressure Scenario - "Model works on simulator but not device"
**Wrong approach**: Assume simulator bug, ignore.
**Right approach**:
1. Check model spec version vs device iOS version (Pattern 1a)
2. Check compute unit availability (Pattern 2c)
3. Profile on actual device, not simulator
4. Simulator uses host Mac's GPU/CPU, not device Neural Engine
---
## Pressure Scenario - "Ship now, optimize later"
**Wrong approach**: Compress to smallest possible size without testing.
**Right approach**:
1. Ship Float16 baseline first
2. Profile on target devices
3. Apply compression incrementally with accuracy testing
4. Document compression settings for future optimization
---
## Diagnostic Checklist
When CoreML isn't working:
- [ ] Check deployment target matches device iOS
- [ ] Check model file is compiled (.mlmodelc)
- [ ] Profile load: cached vs uncached
- [ ] Profile prediction: which compute units
- [ ] Check memory: concurrent predictions limited
- [ ] For compression issues: try higher granularity
- [ ] For conversion issues: check op support, precision
## Resources
**WWDC**: 2023-10047, 2023-10049, 2024-10159, 2024-10161
**Docs**: /coreml, /coreml/mlmodel
**Skills**: coreml, coreml-ref

View File

@@ -0,0 +1,467 @@
---
name: coreml-ref
description: CoreML API reference - MLModel lifecycle, MLTensor operations, coremltools conversion, compression APIs, state management, compute device availability, performance profiling.
license: MIT
version: 1.0.0
---
# CoreML API Reference
## Part 1 - Model Lifecycle
### MLModel Loading
```swift
// Synchronous load (blocks thread)
let model = try MLModel(contentsOf: compiledModelURL)
// Async load (preferred)
let model = try await MLModel.load(contentsOf: compiledModelURL)
// With configuration
let config = MLModelConfiguration()
config.computeUnits = .all // .cpuOnly, .cpuAndGPU, .cpuAndNeuralEngine
let model = try await MLModel.load(contentsOf: url, configuration: config)
```
### Model Asset Types
| Type | Extension | Purpose |
|------|-----------|---------|
| Source | `.mlmodel`, `.mlpackage` | Development, editing |
| Compiled | `.mlmodelc` | Runtime execution |
**Note**: Xcode compiles source models automatically. At runtime, use compiled models.
### Caching Behavior
First load triggers device specialization (can be slow). Subsequent loads use cache.
```
Load flow:
├─ Check cache for (model path + configuration + device)
│ ├─ Found → Cached load (fast)
│ └─ Not found → Device specialization
│ ├─ Parse model
│ ├─ Optimize operations
│ ├─ Segment for compute units
│ ├─ Compile for each unit
│ └─ Cache result
```
Cache invalidated by: system updates, low disk space, model modification.
### Multi-Function Models
```swift
// Load specific function
let config = MLModelConfiguration()
config.functionName = "sticker" // Function name from model
let model = try MLModel(contentsOf: url, configuration: config)
```
---
## Part 2 - Compute Availability
### MLComputeDevice (iOS 17+)
```swift
// Check available compute devices
let devices = MLModel.availableComputeDevices
// Check for Neural Engine
let hasNeuralEngine = devices.contains { device in
if case .neuralEngine = device { return true }
return false
}
// Check for specific GPU
for device in devices {
switch device {
case .cpu:
print("CPU available")
case .gpu(let gpu):
print("GPU: \(gpu.name)")
case .neuralEngine(let ne):
print("Neural Engine: \(ne.totalCoreCount) cores")
@unknown default:
break
}
}
```
### MLModelConfiguration.ComputeUnits
| Value | Behavior |
|-------|----------|
| `.all` | Best performance (default) |
| `.cpuOnly` | CPU only |
| `.cpuAndGPU` | Exclude Neural Engine |
| `.cpuAndNeuralEngine` | Exclude GPU |
---
## Part 3 - Prediction APIs
### Synchronous Prediction
```swift
// Single prediction (NOT thread-safe)
let output = try model.prediction(from: input)
// Batch prediction
let outputs = try model.predictions(from: batch)
```
### Async Prediction (iOS 17+)
```swift
// Single prediction (thread-safe, supports concurrency)
let output = try await model.prediction(from: input)
// With cancellation
let output = try await withTaskCancellationHandler {
try await model.prediction(from: input)
} onCancel: {
// Prediction will be cancelled
}
```
### State-Based Prediction
```swift
// Create state from model
let state = model.makeState()
// Prediction with state (state updated in-place)
let output = try model.prediction(from: input, using: state)
// Async with state
let output = try await model.prediction(from: input, using: state)
```
---
## Part 4 - MLTensor (iOS 18+)
### Creating Tensors
```swift
import CoreML
// From MLShapedArray
let shapedArray = MLShapedArray<Float>(scalars: [1, 2, 3, 4], shape: [2, 2])
let tensor = MLTensor(shapedArray)
// From nested collections
let tensor = MLTensor([[1.0, 2.0], [3.0, 4.0]])
// Zeros/ones
let zeros = MLTensor(zeros: [3, 3], scalarType: Float.self)
```
### Math Operations
```swift
// Element-wise
let sum = tensor1 + tensor2
let product = tensor1 * tensor2
let scaled = tensor * 2.0
// Reductions
let mean = tensor.mean()
let sum = tensor.sum()
let max = tensor.max()
// Comparison
let mask = tensor .> mean // Boolean mask
// Softmax
let probs = tensor.softmax()
```
### Indexing and Reshaping
```swift
// Slicing (Python-like syntax)
let row = tensor[0] // First row
let col = tensor[.all, 0] // First column
let slice = tensor[0..<2, 1..<3]
// Reshaping
let reshaped = tensor.reshaped(to: [4])
let expanded = tensor.expandingShape(at: 0)
```
### Materialization
**Critical**: Tensor operations are async. Must materialize to access data.
```swift
// Materialize to MLShapedArray (blocks until complete)
let array = await tensor.shapedArray(of: Float.self)
// Access scalars
let values = array.scalars
```
---
## Part 5 - Core ML Tools (Python)
### Basic Conversion
```python
import coremltools as ct
import torch
# Trace PyTorch model
model.eval()
traced = torch.jit.trace(model, example_input)
# Convert
mlmodel = ct.convert(
traced,
inputs=[ct.TensorType(shape=example_input.shape)],
outputs=[ct.TensorType(name="output")],
minimum_deployment_target=ct.target.iOS18
)
mlmodel.save("Model.mlpackage")
```
### Dynamic Shapes
```python
# Fixed shape
ct.TensorType(shape=(1, 3, 224, 224))
# Range dimension
ct.TensorType(shape=(1, ct.RangeDim(1, 2048)))
# Enumerated shapes
ct.TensorType(shape=ct.EnumeratedShapes(shapes=[(1, 256), (1, 512), (1, 1024)]))
```
### State Types
```python
# For stateful models (KV-cache)
states = [
ct.StateType(
name="keyCache",
wrapped_type=ct.TensorType(shape=(1, 32, 2048, 128))
),
ct.StateType(
name="valueCache",
wrapped_type=ct.TensorType(shape=(1, 32, 2048, 128))
)
]
mlmodel = ct.convert(traced, inputs=inputs, states=states, ...)
```
---
## Part 6 - Compression APIs (coremltools.optimize)
### Post-Training Palettization
```python
from coremltools.optimize.coreml import (
OpPalettizerConfig,
OptimizationConfig,
palettize_weights
)
# Per-tensor (iOS 17+)
config = OpPalettizerConfig(mode="kmeans", nbits=4)
# Per-grouped-channel (iOS 18+, better accuracy)
config = OpPalettizerConfig(
mode="kmeans",
nbits=4,
granularity="per_grouped_channel",
group_size=16
)
opt_config = OptimizationConfig(global_config=config)
compressed = palettize_weights(model, opt_config)
```
### Post-Training Quantization
```python
from coremltools.optimize.coreml import (
OpLinearQuantizerConfig,
OptimizationConfig,
linear_quantize_weights
)
# INT8 per-channel (iOS 17+)
config = OpLinearQuantizerConfig(mode="linear", dtype="int8")
# INT4 per-block (iOS 18+)
config = OpLinearQuantizerConfig(
mode="linear",
dtype="int4",
granularity="per_block",
block_size=32
)
opt_config = OptimizationConfig(global_config=config)
compressed = linear_quantize_weights(model, opt_config)
```
### Post-Training Pruning
```python
from coremltools.optimize.coreml import (
OpMagnitudePrunerConfig,
OptimizationConfig,
prune_weights
)
config = OpMagnitudePrunerConfig(target_sparsity=0.5)
opt_config = OptimizationConfig(global_config=config)
sparse = prune_weights(model, opt_config)
```
### Training-Time Palettization (PyTorch)
```python
from coremltools.optimize.torch.palettization import (
DKMPalettizerConfig,
DKMPalettizer
)
config = DKMPalettizerConfig(global_config={"n_bits": 4})
palettizer = DKMPalettizer(model, config)
# Prepare (inserts palettization layers)
prepared = palettizer.prepare()
# Training loop
for epoch in range(epochs):
train_one_epoch(prepared, data_loader)
palettizer.step()
# Finalize
final = palettizer.finalize()
```
### Calibration-Based Compression
```python
from coremltools.optimize.torch.pruning import (
MagnitudePrunerConfig,
LayerwiseCompressor
)
config = MagnitudePrunerConfig(
target_sparsity=0.4,
n_samples=128
)
compressor = LayerwiseCompressor(model, config)
compressed = compressor.compress(calibration_loader)
```
---
## Part 7 - Multi-Function Models
### Merging Models
```python
from coremltools.models import MultiFunctionDescriptor
from coremltools.models.utils import save_multifunction
# Create descriptor
desc = MultiFunctionDescriptor()
desc.add_function("function_a", "model_a.mlpackage")
desc.add_function("function_b", "model_b.mlpackage")
# Merge (deduplicates shared weights)
save_multifunction(desc, "merged.mlpackage")
```
### Inspecting Functions (Xcode)
Open model in Xcode → Predictions tab → Functions listed above inputs.
---
## Part 8 - Performance Profiling
### MLComputePlan (iOS 18+)
```swift
let plan = try await MLComputePlan.load(contentsOf: modelURL)
// Inspect operations
for op in plan.modelStructure.operations {
let info = plan.computeDeviceInfo(for: op)
print("Op: \(op.name)")
print(" Preferred: \(info.preferredDevice)")
print(" Estimated cost: \(info.estimatedCost)")
}
```
### Xcode Performance Reports
1. Open model in Xcode
2. Select Performance tab
3. Click + to create report
4. Select device and compute units
5. Click "Run Test"
**New in iOS 18**: Shows estimated time per operation, compute device support hints.
### Core ML Instrument
```
Instruments → Core ML template
├─ Load events: "cached" vs "prepare and cache"
├─ Prediction intervals
├─ Compute unit usage
└─ Neural Engine activity
```
---
## Part 9 - Deployment Targets
| Target | Key Features |
|--------|--------------|
| iOS 16 | Weight compression (palettization, quantization, pruning) |
| iOS 17 | Async prediction, MLComputeDevice, activation quantization |
| iOS 18 | MLTensor, State, SDPA fusion, per-block quantization, multi-function |
**Recommendation**: Always set `minimum_deployment_target=ct.target.iOS18` for best optimizations.
---
## Part 10 - Conversion Pass Pipelines
```python
# Default pipeline
mlmodel = ct.convert(traced, ...)
# With palettization support
mlmodel = ct.convert(
traced,
pass_pipeline=ct.PassPipeline.DEFAULT_PALETTIZATION,
...
)
```
## Resources
**WWDC**: 2023-10047, 2023-10049, 2024-10159, 2024-10161
**Docs**: /coreml, /coreml/mlmodel, /coreml/mltensor, /documentation/coremltools
**Skills**: coreml, coreml-diag

View File

@@ -0,0 +1,468 @@
---
name: coreml
description: Use when deploying custom ML models on-device, converting PyTorch models, compressing models, implementing LLM inference, or optimizing CoreML performance. Covers model conversion, compression, stateful models, KV-cache, multi-function models, MLTensor.
license: MIT
version: 1.0.0
---
# CoreML On-Device Machine Learning
## Overview
CoreML enables on-device machine learning inference across all Apple platforms. It abstracts hardware details while leveraging Apple Silicon's CPU, GPU, and Neural Engine for high-performance, private, and efficient execution.
**Key principle**: Start with the simplest approach, then optimize based on profiling. Don't over-engineer compression or caching until you have real performance data.
## Decision Tree - CoreML vs Foundation Models
```
Need on-device ML?
├─ Text generation (LLM)?
│ ├─ Simple prompts, structured output? → Foundation Models (ios-ai skill)
│ └─ Custom model, fine-tuned, specific architecture? → CoreML
├─ Custom trained model?
│ └─ Yes → CoreML
├─ Image/audio/sensor processing?
│ └─ Yes → CoreML
└─ Apple's built-in intelligence?
└─ Yes → Foundation Models (ios-ai skill)
```
## Red Flags
Use this skill when you see:
- "Convert PyTorch model to CoreML"
- "Model too large for device"
- "Slow inference performance"
- "LLM on-device"
- "KV-cache" or "stateful model"
- "Model compression" or "quantization"
- MLModel, MLTensor, or coremltools in context
## Pattern 1 - Basic Model Conversion
The standard PyTorch → CoreML workflow.
```python
import coremltools as ct
import torch
# Trace the model
model.eval()
traced_model = torch.jit.trace(model, example_input)
# Convert to CoreML
mlmodel = ct.convert(
traced_model,
inputs=[ct.TensorType(shape=example_input.shape)],
minimum_deployment_target=ct.target.iOS18
)
# Save
mlmodel.save("MyModel.mlpackage")
```
**Critical**: Always set `minimum_deployment_target` to enable latest optimizations.
## Pattern 2 - Model Compression (Post-Training)
Three techniques, each with different tradeoffs:
### Palettization (Best for Neural Engine)
Clusters weights into lookup tables. Use per-grouped-channel for better accuracy.
```python
from coremltools.optimize.coreml import (
OpPalettizerConfig,
OptimizationConfig,
palettize_weights
)
# 4-bit with grouped channels (iOS 18+)
op_config = OpPalettizerConfig(
mode="kmeans",
nbits=4,
granularity="per_grouped_channel",
group_size=16
)
config = OptimizationConfig(global_config=op_config)
compressed_model = palettize_weights(model, config)
```
| Bits | Compression | Accuracy Impact |
|------|-------------|-----------------|
| 8-bit | 2x | Minimal |
| 6-bit | 2.7x | Low |
| 4-bit | 4x | Moderate (use grouped channels) |
| 2-bit | 8x | High (requires training-time) |
### Quantization (Best for GPU on Mac)
Linear mapping to INT8/INT4. Use per-block for better accuracy.
```python
from coremltools.optimize.coreml import (
OpLinearQuantizerConfig,
OptimizationConfig,
linear_quantize_weights
)
# INT4 per-block quantization (iOS 18+)
op_config = OpLinearQuantizerConfig(
mode="linear",
dtype="int4",
granularity="per_block",
block_size=32
)
config = OptimizationConfig(global_config=op_config)
compressed_model = linear_quantize_weights(model, config)
```
### Pruning (Combine with other techniques)
Sets weights to zero for sparse representation. Can combine with palettization.
```python
from coremltools.optimize.coreml import (
OpMagnitudePrunerConfig,
OptimizationConfig,
prune_weights
)
op_config = OpMagnitudePrunerConfig(
target_sparsity=0.4 # 40% zeros
)
config = OptimizationConfig(global_config=op_config)
sparse_model = prune_weights(model, config)
```
## Pattern 3 - Training-Time Compression
When post-training compression loses too much accuracy, fine-tune with compression.
```python
from coremltools.optimize.torch.palettization import (
DKMPalettizerConfig,
DKMPalettizer
)
# Configure 4-bit palettization
config = DKMPalettizerConfig(global_config={"n_bits": 4})
# Prepare model
palettizer = DKMPalettizer(model, config)
prepared_model = palettizer.prepare()
# Fine-tune (your training loop)
for epoch in range(num_epochs):
train_epoch(prepared_model, data_loader)
palettizer.step()
# Finalize
final_model = palettizer.finalize()
```
**Tradeoff**: Better accuracy than post-training, but requires training data and time.
## Pattern 4 - Calibration-Based Compression (iOS 18+)
Middle ground: uses calibration data without full training.
```python
from coremltools.optimize.torch.pruning import (
MagnitudePrunerConfig,
LayerwiseCompressor
)
# Configure
config = MagnitudePrunerConfig(
target_sparsity=0.4,
n_samples=128 # Calibration samples
)
# Create pruner
pruner = LayerwiseCompressor(model, config)
# Calibrate
sparse_model = pruner.compress(calibration_data_loader)
```
## Pattern 5 - Stateful Models (KV-Cache for LLMs)
For transformer models, use state to avoid recomputing key/value vectors.
### PyTorch Model with State
```python
class StatefulLLM(nn.Module):
def __init__(self):
super().__init__()
# Register state buffers
self.register_buffer("keyCache", torch.zeros(batch, heads, seq_len, dim))
self.register_buffer("valueCache", torch.zeros(batch, heads, seq_len, dim))
def forward(self, input_ids, causal_mask):
# Update caches in-place during forward
# ... attention with KV-cache ...
return logits
```
### Conversion with State
```python
import coremltools as ct
mlmodel = ct.convert(
traced_model,
inputs=[
ct.TensorType(name="input_ids", shape=(1, ct.RangeDim(1, 2048))),
ct.TensorType(name="causal_mask", shape=(1, 1, ct.RangeDim(1, 2048), ct.RangeDim(1, 2048)))
],
states=[
ct.StateType(name="keyCache", ...),
ct.StateType(name="valueCache", ...)
],
minimum_deployment_target=ct.target.iOS18
)
```
### Using State at Runtime
```swift
// Create state from model
let state = model.makeState()
// Run prediction with state (updated in-place)
let output = try model.prediction(from: input, using: state)
```
**Performance**: 1.6x speedup on Mistral-7B (M3 Max) compared to manual KV-cache I/O.
## Pattern 6 - Multi-Function Models (Adapters/LoRA)
Deploy multiple adapters in a single model, sharing base weights.
```python
from coremltools.models import MultiFunctionDescriptor
from coremltools.models.utils import save_multifunction
# Convert individual models
sticker_model = ct.convert(sticker_adapter_model, ...)
storybook_model = ct.convert(storybook_adapter_model, ...)
# Save individually
sticker_model.save("sticker.mlpackage")
storybook_model.save("storybook.mlpackage")
# Merge with shared weights
desc = MultiFunctionDescriptor()
desc.add_function("sticker", "sticker.mlpackage")
desc.add_function("storybook", "storybook.mlpackage")
save_multifunction(desc, "MultiAdapter.mlpackage")
```
### Loading Specific Function
```swift
let config = MLModelConfiguration()
config.functionName = "sticker" // or "storybook"
let model = try MLModel(contentsOf: modelURL, configuration: config)
```
## Pattern 7 - MLTensor for Pipeline Stitching (iOS 18+)
Simplifies computation between models (decoding, post-processing).
```swift
import CoreML
// Create tensors
let scores = MLTensor(shape: [1, vocab_size], scalars: logits)
// Operations (executed asynchronously on Apple Silicon)
let topK = scores.topK(k: 10)
let probs = (topK.values / temperature).softmax()
// Sample from distribution
let sampled = probs.multinomial(numSamples: 1)
// Materialize to access data (blocks until complete)
let shapedArray = await sampled.shapedArray(of: Int32.self)
```
**Key insight**: MLTensor operations are async. Call `shapedArray()` to materialize results.
## Pattern 8 - Async Prediction for Concurrency
Thread-safe concurrent predictions for throughput.
```swift
class ImageProcessor {
let model: MLModel
func processImages(_ images: [CGImage]) async throws -> [Output] {
try await withThrowingTaskGroup(of: Output.self) { group in
for image in images {
group.addTask {
// Check cancellation before expensive work
try Task.checkCancellation()
let input = try self.prepareInput(image)
// Async prediction - thread safe!
return try await self.model.prediction(from: input)
}
}
return try await group.reduce(into: []) { $0.append($1) }
}
}
}
```
**Warning**: Limit concurrent predictions to avoid memory pressure from multiple input/output buffers.
```swift
// Limit concurrency
let semaphore = AsyncSemaphore(value: 2)
for image in images {
group.addTask {
await semaphore.wait()
defer { semaphore.signal() }
return try await process(image)
}
}
```
## Anti-Patterns
### Don't - Load models on main thread at launch
```swift
// BAD - blocks UI
class AppDelegate {
let model = try! MLModel(contentsOf: url) // Blocks!
}
// GOOD - lazy async loading
class ModelManager {
private var model: MLModel?
func getModel() async throws -> MLModel {
if let model { return model }
model = try await Task.detached {
try MLModel(contentsOf: url)
}.value
return model!
}
}
```
### Don't - Reload model for each prediction
```swift
// BAD - reloads every time
func predict(_ input: Input) throws -> Output {
let model = try MLModel(contentsOf: url) // Expensive!
return try model.prediction(from: input)
}
// GOOD - keep model loaded
class Predictor {
private let model: MLModel
func predict(_ input: Input) throws -> Output {
try model.prediction(from: input)
}
}
```
### Don't - Compress without profiling first
```swift
// BAD - blind compression
let compressed = palettize_weights(model, 2bit_config) // May break accuracy!
// GOOD - profile, then compress iteratively
// 1. Profile Float16 baseline
// 2. Try 8-bit check accuracy
// 3. Try 6-bit check accuracy
// 4. Try 4-bit with grouped channels check accuracy
// 5. Only use 2-bit with training-time compression
```
### Don't - Ignore deployment target
```python
# BAD - misses optimizations
mlmodel = ct.convert(traced_model, inputs=[...])
# GOOD - enables SDPA fusion, per-block quantization, etc.
mlmodel = ct.convert(
traced_model,
inputs=[...],
minimum_deployment_target=ct.target.iOS18
)
```
## Pressure Scenarios
### Scenario 1 - "Model is 5GB, need it under 2GB for iPhone"
**Wrong approach**: Jump straight to 2-bit palettization.
**Right approach**:
1. Start with 8-bit palettization → check accuracy
2. Try 6-bit → check accuracy
3. Try 4-bit with `per_grouped_channel` → check accuracy
4. If still too large, use calibration-based compression
5. If still losing accuracy, use training-time compression
### Scenario 2 - "LLM inference is too slow"
**Wrong approach**: Try different compute units randomly.
**Right approach**:
1. Profile with Core ML Instrument
2. Check if load is cached (look for "cached" vs "prepare and cache")
3. Enable stateful KV-cache
4. Check SDPA optimization is enabled (iOS 18+ deployment target)
5. Consider INT4 quantization for GPU on Mac
### Scenario 3 - "Need multiple LoRA adapters in one app"
**Wrong approach**: Ship separate models for each adapter.
**Right approach**:
1. Convert each adapter model separately
2. Use `MultiFunctionDescriptor` to merge with shared base
3. Load specific function via `config.functionName`
4. Weights are deduplicated automatically
## Checklist
Before deploying a CoreML model:
- [ ] Set `minimum_deployment_target` to latest supported iOS
- [ ] Profile baseline Float16 performance
- [ ] Check if model load is cached
- [ ] Consider compression only if size/performance requires it
- [ ] Test accuracy after each compression step
- [ ] Use async prediction for concurrent workloads
- [ ] Limit concurrent predictions to manage memory
- [ ] Use state for transformer KV-cache
- [ ] Use multi-function for adapter variants
## Resources
**WWDC**: 2023-10047, 2023-10049, 2024-10159, 2024-10161
**Docs**: /coreml, /coreml/mlmodel, /coreml/mltensor
**Skills**: coreml-ref, coreml-diag, axiom-ios-ai (Foundation Models)

View File

@@ -0,0 +1,496 @@
---
name: speech
description: Use when implementing speech-to-text, live transcription, or audio transcription. Covers SpeechAnalyzer (iOS 26+), SpeechTranscriber, volatile/finalized results, AssetInventory model management, audio format handling.
license: MIT
version: 1.0.0
---
# Speech-to-Text with SpeechAnalyzer
## Overview
SpeechAnalyzer is Apple's new speech-to-text API introduced in iOS 26. It powers Notes, Voice Memos, Journal, and Call Summarization. The on-device model is faster, more accurate, and better for long-form/distant audio than SFSpeechRecognizer.
**Key principle**: SpeechAnalyzer is modular—add transcription modules to an analysis session. Results stream asynchronously using Swift's AsyncSequence.
## Decision Tree - SpeechAnalyzer vs SFSpeechRecognizer
```
Need speech-to-text?
├─ iOS 26+ only?
│ └─ Yes → SpeechAnalyzer (preferred)
├─ Need iOS 10-25 support?
│ └─ Yes → SFSpeechRecognizer (or DictationTranscriber)
├─ Long-form audio (meetings, lectures)?
│ └─ Yes → SpeechAnalyzer
├─ Distant audio (across room)?
│ └─ Yes → SpeechAnalyzer
└─ Short dictation commands?
└─ Either works
```
**SpeechAnalyzer advantages**:
- Better for long-form and conversational audio
- Works well with distant speakers (meetings)
- On-device, private
- Model managed by system (no app size increase)
- Powers Notes, Voice Memos, Journal
**DictationTranscriber** (iOS 26+): Same languages as SFSpeechRecognizer, but doesn't require user to enable Siri/dictation in Settings.
## Red Flags
Use this skill when you see:
- "Live transcription"
- "Transcribe audio"
- "Speech-to-text"
- "SpeechAnalyzer" or "SpeechTranscriber"
- "Volatile results"
- Building Notes-like or Voice Memos-like features
## Pattern 1 - File Transcription (Simplest)
Transcribe an audio file to text in one function.
```swift
import Speech
func transcribe(file: URL, locale: Locale) async throws -> AttributedString {
// Set up transcriber
let transcriber = SpeechTranscriber(
locale: locale,
preset: .offlineTranscription
)
// Collect results asynchronously
async let transcriptionFuture = try transcriber.results
.reduce(AttributedString()) { str, result in
str + result.text
}
// Set up analyzer with transcriber module
let analyzer = SpeechAnalyzer(modules: [transcriber])
// Analyze the file
if let lastSample = try await analyzer.analyzeSequence(from: file) {
try await analyzer.finalizeAndFinish(through: lastSample)
} else {
await analyzer.cancelAndFinishNow()
}
return try await transcriptionFuture
}
```
**Key points**:
- `analyzeSequence(from:)` reads file and feeds audio to analyzer
- `finalizeAndFinish(through:)` ensures all results are finalized
- Results are `AttributedString` with timing metadata
## Pattern 2 - Live Transcription Setup
For real-time transcription from microphone.
### Step 1 - Configure SpeechTranscriber
```swift
import Speech
class TranscriptionManager: ObservableObject {
private var transcriber: SpeechTranscriber?
private var analyzer: SpeechAnalyzer?
private var analyzerFormat: AudioFormatDescription?
private var inputBuilder: AsyncStream<AnalyzerInput>.Continuation?
@Published var finalizedTranscript = AttributedString()
@Published var volatileTranscript = AttributedString()
func setUp() async throws {
// Create transcriber with options
transcriber = SpeechTranscriber(
locale: Locale.current,
transcriptionOptions: [],
reportingOptions: [.volatileResults], // Enable real-time updates
attributeOptions: [.audioTimeRange] // Include timing
)
guard let transcriber else { throw TranscriptionError.setupFailed }
// Create analyzer with transcriber module
analyzer = SpeechAnalyzer(modules: [transcriber])
// Get required audio format
analyzerFormat = await SpeechAnalyzer.bestAvailableAudioFormat(
compatibleWith: [transcriber]
)
// Ensure model is available
try await ensureModel(for: transcriber)
// Create input stream
let (stream, builder) = AsyncStream<AnalyzerInput>.makeStream()
inputBuilder = builder
// Start analyzer
try await analyzer?.start(inputSequence: stream)
}
}
```
### Step 2 - Ensure Model Availability
```swift
func ensureModel(for transcriber: SpeechTranscriber) async throws {
let locale = Locale.current
// Check if language is supported
let supported = await SpeechTranscriber.supportedLocales
guard supported.contains(where: {
$0.identifier(.bcp47) == locale.identifier(.bcp47)
}) else {
throw TranscriptionError.localeNotSupported
}
// Check if model is installed
let installed = await SpeechTranscriber.installedLocales
if installed.contains(where: {
$0.identifier(.bcp47) == locale.identifier(.bcp47)
}) {
return // Already installed
}
// Download model
if let downloader = try await AssetInventory.assetInstallationRequest(
supporting: [transcriber]
) {
// Track progress if needed
let progress = downloader.progress
try await downloader.downloadAndInstall()
}
}
```
**Note**: Models are stored in system storage, not app storage. Limited number of languages can be allocated at once.
### Step 3 - Handle Results
```swift
func startResultHandling() {
Task {
guard let transcriber else { return }
do {
for try await result in transcriber.results {
let text = result.text
if result.isFinal {
// Finalized result - won't change
finalizedTranscript += text
volatileTranscript = AttributedString()
// Access timing info
for run in text.runs {
if let timeRange = run.audioTimeRange {
print("Time: \(timeRange)")
}
}
} else {
// Volatile result - will be replaced
volatileTranscript = text
}
}
} catch {
print("Transcription failed: \(error)")
}
}
}
```
## Pattern 3 - Audio Recording and Streaming
Connect AVAudioEngine to SpeechAnalyzer.
```swift
import AVFoundation
class AudioRecorder {
private let audioEngine = AVAudioEngine()
private var outputContinuation: AsyncStream<AVAudioPCMBuffer>.Continuation?
private let transcriptionManager: TranscriptionManager
func startRecording() async throws {
// Request permission
guard await AVAudioApplication.requestRecordPermission() else {
throw RecordingError.permissionDenied
}
// Configure audio session (iOS)
#if os(iOS)
let session = AVAudioSession.sharedInstance()
try session.setCategory(.playAndRecord, mode: .spokenAudio)
try session.setActive(true, options: .notifyOthersOnDeactivation)
#endif
// Set up transcriber
try await transcriptionManager.setUp()
transcriptionManager.startResultHandling()
// Stream audio to transcriber
for await buffer in try audioStream() {
try await transcriptionManager.streamAudio(buffer)
}
}
private func audioStream() throws -> AsyncStream<AVAudioPCMBuffer> {
let inputNode = audioEngine.inputNode
let format = inputNode.outputFormat(forBus: 0)
inputNode.installTap(
onBus: 0,
bufferSize: 4096,
format: format
) { [weak self] buffer, time in
self?.outputContinuation?.yield(buffer)
}
audioEngine.prepare()
try audioEngine.start()
return AsyncStream { continuation in
outputContinuation = continuation
}
}
}
```
### Stream Audio with Format Conversion
```swift
extension TranscriptionManager {
private var converter: AVAudioConverter?
func streamAudio(_ buffer: AVAudioPCMBuffer) async throws {
guard let inputBuilder, let analyzerFormat else {
throw TranscriptionError.notSetUp
}
// Convert to analyzer's required format
let converted = try convertBuffer(buffer, to: analyzerFormat)
// Send to analyzer
let input = AnalyzerInput(buffer: converted)
inputBuilder.yield(input)
}
private func convertBuffer(
_ buffer: AVAudioPCMBuffer,
to format: AudioFormatDescription
) throws -> AVAudioPCMBuffer {
// Lazy initialize converter
if converter == nil {
let sourceFormat = buffer.format
let destFormat = AVAudioFormat(cmAudioFormatDescription: format)!
converter = AVAudioConverter(from: sourceFormat, to: destFormat)
}
guard let converter else {
throw TranscriptionError.conversionFailed
}
let outputBuffer = AVAudioPCMBuffer(
pcmFormat: converter.outputFormat,
frameCapacity: buffer.frameLength
)!
try converter.convert(to: outputBuffer, from: buffer)
return outputBuffer
}
}
```
## Pattern 4 - Stopping Transcription
Properly finalize to get remaining volatile results as finalized.
```swift
func stopRecording() async {
// Stop audio
audioEngine.stop()
audioEngine.inputNode.removeTap(onBus: 0)
outputContinuation?.finish()
// Finalize transcription (converts remaining volatile to final)
try? await analyzer?.finalizeAndFinishThroughEndOfInput()
// Cancel any pending tasks
recognizerTask?.cancel()
}
```
**Critical**: Always call `finalizeAndFinishThroughEndOfInput()` to ensure volatile results are finalized.
## Pattern 5 - Model Asset Management
### Check Supported Languages
```swift
// Languages the API supports
let supported = await SpeechTranscriber.supportedLocales
// Languages currently installed on device
let installed = await SpeechTranscriber.installedLocales
```
### Deallocate Languages
Limited number of languages can be allocated. Deallocate unused ones.
```swift
func deallocateLanguages() async {
let allocated = await AssetInventory.allocatedLocales
for locale in allocated {
await AssetInventory.deallocate(locale: locale)
}
}
```
## Pattern 6 - Displaying Results with Timing
Highlight text during audio playback using timing metadata.
```swift
struct TranscriptView: View {
let transcript: AttributedString
@Binding var playbackTime: CMTime
var body: some View {
Text(highlightedTranscript)
}
var highlightedTranscript: AttributedString {
var result = transcript
for (range, run) in transcript.runs {
guard let timeRange = run.audioTimeRange else { continue }
let isActive = timeRange.containsTime(playbackTime)
if isActive {
result[range].backgroundColor = .yellow
}
}
return result
}
}
```
## Anti-Patterns
### Don't - Forget to finalize
```swift
// BAD - volatile results lost
func stopRecording() {
audioEngine.stop()
// Missing finalize!
}
// GOOD - volatile results become finalized
func stopRecording() async {
audioEngine.stop()
try? await analyzer?.finalizeAndFinishThroughEndOfInput()
}
```
### Don't - Ignore format conversion
```swift
// BAD - format mismatch may fail silently
inputBuilder.yield(AnalyzerInput(buffer: rawBuffer))
// GOOD - convert to analyzer's format
let format = await SpeechAnalyzer.bestAvailableAudioFormat(compatibleWith: [transcriber])
let converted = try convertBuffer(rawBuffer, to: format)
inputBuilder.yield(AnalyzerInput(buffer: converted))
```
### Don't - Skip model availability check
```swift
// BAD - may crash if model not installed
let transcriber = SpeechTranscriber(locale: locale, ...)
// Start using immediately
// GOOD - ensure model is ready
let transcriber = SpeechTranscriber(locale: locale, ...)
try await ensureModel(for: transcriber)
// Now safe to use
```
## Presets Reference
| Preset | Use Case |
|--------|----------|
| `.offlineTranscription` | File transcription, no real-time feedback needed |
| `.progressiveLiveTranscription` | Live transcription with volatile updates |
## Options Reference
### TranscriptionOptions
- Default: None (standard transcription)
### ReportingOptions
- `.volatileResults`: Enable real-time approximate results
### AttributeOptions
- `.audioTimeRange`: Include CMTimeRange for each text segment
## Platform Availability
| Platform | SpeechTranscriber | DictationTranscriber |
|----------|-------------------|---------------------|
| iOS 26+ | Yes | Yes |
| macOS Tahoe+ | Yes | Yes |
| watchOS 26+ | No | Yes |
| tvOS 26+ | TBD | TBD |
**Hardware requirements**: Varies by device. Use `supportedLocales` to check.
## Integration with Apple Intelligence
Combine with Foundation Models for summarization:
```swift
import FoundationModels
func generateTitle(for transcript: String) async throws -> String {
let session = LanguageModelSession()
let prompt = "Generate a short, clever title for this story: \(transcript)"
let response = try await session.respond(to: prompt)
return response.content
}
```
See `axiom-ios-ai` skill for Foundation Models details.
## Checklist
Before shipping speech-to-text:
- [ ] Check locale support with `supportedLocales`
- [ ] Ensure model with `AssetInventory.assetInstallationRequest`
- [ ] Handle download progress for user feedback
- [ ] Convert audio to `bestAvailableAudioFormat`
- [ ] Enable `.volatileResults` for live transcription
- [ ] Call `finalizeAndFinishThroughEndOfInput()` on stop
- [ ] Handle timing with `.audioTimeRange` if needed
- [ ] Clear volatile results when finalized result arrives
- [ ] Request microphone permission before recording
## Resources
**WWDC**: 2025-277
**Docs**: /speech, /speech/speechanalyzer, /speech/speechtranscriber
**Skills**: coreml (on-device ML), axiom-ios-ai (Foundation Models)