Add scan flow MVP and local Axiom skill workspace

This snapshot establishes the camera-to-result recognition flow and related tests while checking in the project skill/docs assets required for the configured local tooling.
2026-04-19 21:11:32 +02:00
parent 577214d474
commit a60a76b797
679 changed files with 138964 additions and 73 deletions
--- a/.claude/skills/axiom-ios-ml/.openskills.json
+++ b/.claude/skills/axiom-ios-ml/.openskills.json
@@ -0,0 +1,7 @@
+{
+  "source": "CharlesWiltgen/Axiom",
+  "sourceType": "git",
+  "repoUrl": "https://github.com/CharlesWiltgen/Axiom",
+  "subpath": ".claude-plugin/plugins/axiom/skills/axiom-ios-ml",
+  "installedAt": "2026-04-12T08:05:35.624Z"
+}
--- a/.claude/skills/axiom-ios-ml/SKILL.md
+++ b/.claude/skills/axiom-ios-ml/SKILL.md
@@ -0,0 +1,136 @@
+---
+name: axiom-ios-ml
+description: Use when deploying ANY machine learning model on-device, converting models to CoreML, compressing models, or implementing speech-to-text. Covers CoreML conversion, MLTensor, model compression (quantization/palettization/pruning), stateful models, KV-cache, multi-function models, async prediction, SpeechAnalyzer, SpeechTranscriber.
+license: MIT
+---
+
+# iOS Machine Learning Router
+
+**You MUST use this skill for ANY on-device machine learning or speech-to-text work.**
+
+## When to Use
+
+Use this router when:
+- Converting PyTorch/TensorFlow models to CoreML
+- Deploying ML models on-device
+- Compressing models (quantization, palettization, pruning)
+- Working with large language models (LLMs)
+- Implementing KV-cache for transformers
+- Using MLTensor for model stitching
+- Building speech-to-text features
+- Transcribing audio (live or recorded)
+
+## Boundary with ios-ai
+
+**ios-ml vs ios-ai — know the difference:**
+
+| Developer Intent | Router |
+|-----------------|--------|
+| "Use Apple Intelligence / Foundation Models" | **ios-ai** — Apple's on-device LLM |
+| "Run my own ML model on device" | **ios-ml** — CoreML conversion + deployment |
+| "Add text generation with @Generable" | **ios-ai** — Foundation Models structured output |
+| "Deploy a custom LLM with KV-cache" | **ios-ml** — Custom model optimization |
+| "Use Vision framework for image analysis" | **ios-vision** — Not ML deployment |
+| "Use pre-trained Apple NLP models" | **ios-ai** — Apple's models, not custom |
+
+**Rule of thumb**: If the developer is converting/compressing/deploying their own model → ios-ml. If they're using Apple's built-in AI → ios-ai. If they're doing computer vision → ios-vision.
+
+## Routing Logic
+
+### CoreML Work
+
+**Implementation patterns** → `/skill coreml`
+- Model conversion workflow
+- MLTensor for model stitching
+- Stateful models with KV-cache
+- Multi-function models (adapters/LoRA)
+- Async prediction patterns
+- Compute unit selection
+
+**API reference** → `/skill coreml-ref`
+- CoreML Tools Python API
+- MLModel lifecycle
+- MLTensor operations
+- MLComputeDevice availability
+- State management APIs
+- Performance reports
+
+**Diagnostics** → `/skill coreml-diag`
+- Model won't load
+- Slow inference
+- Memory issues
+- Compression accuracy loss
+- Compute unit problems
+
+### Speech Work
+
+**Implementation patterns** → `/skill speech`
+- SpeechAnalyzer setup (iOS 26+)
+- SpeechTranscriber configuration
+- Live transcription
+- File transcription
+- Volatile vs finalized results
+- Model asset management
+
+## Decision Tree
+
+1. Implementing / converting ML models? → coreml
+2. CoreML API reference? → coreml-ref
+3. Debugging ML issues (load, inference, compression)? → coreml-diag
+4. Speech-to-text / transcription? → speech
+
+## Anti-Rationalization
+
+| Thought | Reality |
+|---------|---------|
+| "CoreML is just load and predict" | CoreML has compression, stateful models, compute unit selection, and async prediction. coreml covers all. |
+| "My model is small, no optimization needed" | Even small models benefit from compute unit selection and async prediction. coreml has the patterns. |
+| "I'll just use SFSpeechRecognizer" | iOS 26 has SpeechAnalyzer with better accuracy and offline support. speech skill covers the modern API. |
+
+## Critical Patterns
+
+**coreml**:
+- Model conversion (PyTorch → CoreML)
+- Compression (palettization, quantization, pruning)
+- Stateful KV-cache for LLMs
+- Multi-function models for adapters
+- MLTensor for pipeline stitching
+- Async concurrent prediction
+
+**coreml-diag**:
+- Load failures and caching
+- Inference performance issues
+- Memory pressure from models
+- Accuracy degradation from compression
+
+**speech**:
+- SpeechAnalyzer + SpeechTranscriber setup
+- AssetInventory model management
+- Live transcription with volatile results
+- Audio format conversion
+
+## Example Invocations
+
+User: "How do I convert a PyTorch model to CoreML?"
+→ Invoke: `/skill coreml`
+
+User: "Compress my model to fit on iPhone"
+→ Invoke: `/skill coreml`
+
+User: "Implement KV-cache for my language model"
+→ Invoke: `/skill coreml`
+
+User: "Model loads slowly on first launch"
+→ Invoke: `/skill coreml-diag`
+
+User: "My compressed model has bad accuracy"
+→ Invoke: `/skill coreml-diag`
+
+User: "Add live transcription to my app"
+→ Invoke: `/skill speech`
+
+User: "Transcribe audio files with SpeechAnalyzer"
+→ Invoke: `/skill speech`
+
+User: "What's MLTensor and how do I use it?"
+→ Invoke: `/skill coreml-ref`
--- a/.claude/skills/axiom-ios-ml/coreml-diag/SKILL.md
+++ b/.claude/skills/axiom-ios-ml/coreml-diag/SKILL.md
@@ -0,0 +1,473 @@
+---
+name: coreml-diag
+description: CoreML diagnostics - model load failures, slow inference, memory issues, compression accuracy loss, compute unit problems, conversion errors.
+license: MIT
+version: 1.0.0
+---
+
+# CoreML Diagnostics
+
+## Quick Reference
+
+| Symptom | First Check | Pattern |
+|---------|-------------|---------|
+| Model won't load | Deployment target | 1a-1c |
+| Slow first load | Cache miss | 2a |
+| Slow inference | Compute units | 2b-2c |
+| High memory | Concurrent predictions | 3a-3b |
+| Bad accuracy after compression | Granularity | 4a-4c |
+| Conversion fails | Operation support | 5a-5b |
+
+## Decision Tree
+
+```
+CoreML issue
+├─ Load failure?
+│   ├─ "Unsupported model version" → 1a
+│   ├─ "Failed to create compute plan" → 1b
+│   └─ Other load error → 1c
+├─ Performance issue?
+│   ├─ First load slow, subsequent fast? → 2a
+│   ├─ All predictions slow? → 2b
+│   └─ Slow only on specific device? → 2c
+├─ Memory issue?
+│   ├─ Memory grows during predictions? → 3a
+│   └─ Out of memory on load? → 3b
+├─ Accuracy degraded?
+│   ├─ After palettization? → 4a
+│   ├─ After quantization? → 4b
+│   └─ After pruning? → 4c
+└─ Conversion issue?
+    ├─ Operation not supported? → 5a
+    └─ Wrong output? → 5b
+```
+
+---
+
+## Pattern 1a - "Unsupported model version"
+
+**Symptom**: Model fails to load with version error.
+
+**Cause**: Model compiled for newer OS than device supports.
+
+**Diagnosis**:
+```python
+# Check model's minimum deployment target
+import coremltools as ct
+model = ct.models.MLModel("Model.mlpackage")
+print(model.get_spec().specificationVersion)
+```
+
+| Spec Version | Minimum iOS |
+|--------------|-------------|
+| 4 | iOS 13 |
+| 5 | iOS 14 |
+| 6 | iOS 15 |
+| 7 | iOS 16 |
+| 8 | iOS 17 |
+| 9 | iOS 18 |
+
+**Fix**: Re-convert with lower deployment target:
+```python
+mlmodel = ct.convert(
+    traced,
+    minimum_deployment_target=ct.target.iOS16  # Lower target
+)
+```
+
+**Tradeoff**: Loses newer optimizations (SDPA fusion, per-block quantization, MLTensor).
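+
+As a quick triage aid, the spec-version table above can be encoded as a lookup. This is a minimal sketch in plain Python; `MIN_IOS_FOR_SPEC` and `can_load` are hypothetical names introduced here for illustration and are not part of coremltools.
+
+```python
+# Hypothetical helper: mirrors the spec-version table above.
+MIN_IOS_FOR_SPEC = {4: 13, 5: 14, 6: 15, 7: 16, 8: 17, 9: 18}
+
+
+def can_load(spec_version: int, device_ios_major: int) -> bool:
+    """Return True if the device's iOS major version meets the spec's minimum."""
+    required = MIN_IOS_FOR_SPEC.get(spec_version)
+    if required is None:
+        raise ValueError(f"Spec version {spec_version} is not in the table")
+    return device_ios_major >= required
+
+
+print(can_load(8, 16))  # spec 8 needs iOS 17, so this prints False
+```
+
+Feed it the value printed by `specificationVersion` in the diagnosis step to decide whether a lower deployment target is needed.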
+
+---
+
+## Pattern 1b - "Failed to create compute plan"
+
+**Symptom**: Model loads on some devices but not others.
+
+**Cause**: Unsupported operations for target compute unit.
+
+**Diagnosis**:
+1. Open model in Xcode
+2. Create Performance Report
+3. Check "Unsupported" operations
+4. Hover for hints
+
+**Fix**:
+```swift
+// Force CPU-only to bypass unsupported GPU/NE operations
+let config = MLModelConfiguration()
+config.computeUnits = .cpuOnly
+let model = try MLModel(contentsOf: url, configuration: config)
+```
+
+**Better fix**: Update model precision or operations during conversion:
+```python
+# Float16 often better supported
+mlmodel = ct.convert(traced, compute_precision=ct.precision.FLOAT16)
+```
+
+---
+
+## Pattern 1c - General Load Failures
+
+**Symptom**: Model fails to load with unclear error.
+
+**Checklist**:
+1. Check file exists and is readable
+2. Check compiled vs source model (runtime needs `.mlmodelc`)
+3. Check available disk space (cache needs room)
+4. Check model isn't corrupted (re-convert)
+
+```swift
+// Debug logging: catch the load error and print its full description
+do {
+    _ = try MLModel(contentsOf: url, configuration: MLModelConfiguration())
+} catch {
+    print("Model load failed: \(error)")
+}
+```
+
+---
+
+## Pattern 2a - Slow First Load (Cache Miss)
+
+**Symptom**: First prediction after install/update is slow, subsequent are fast.
+
+**Cause**: Device specialization not cached.
+
+**Diagnosis**:
+1. Profile with Core ML Instrument
+2. Look at Load event label:
+   - "prepare and cache" = cache miss (slow)
+   - "cached" = cache hit (fast)
+
+**Why cache misses**:
+- First launch after install
+- System update invalidated cache
+- Low disk space cleared cache
+- Model file was modified
+
+**Mitigation**:
+```swift
+// Warm cache in background at app launch
+Task.detached(priority: .background) {
+    _ = try? await MLModel.load(contentsOf: modelURL)
+}
+```
+
+**Note**: Cache is tied to (model path + configuration + device). Different configs = different cache entries.
+
+---
+
+## Pattern 2b - All Predictions Slow
+
+**Symptom**: Predictions consistently slow, not just first one.
+
+**Diagnosis**:
+1. Create Xcode Performance Report
+2. Check compute unit distribution
+3. Look for high-cost operations
+
+**Common causes**:
+
+| Cause | Fix |
+|-------|-----|
+| Running on CPU when GPU/NE available | Check `computeUnits` config |
+| Model too large for Neural Engine | Compress model |
+| Frequent CPU↔GPU↔NE transfers | Adjust segmentation |
+| Dynamic shapes recompiling | Use fixed/enumerated shapes |
+
+**Profile compute unit usage**:
+```swift
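+// MLComputePlan requires iOS 17.4+ / macOS 14.4+
+let plan = try await MLComputePlan.load(contentsOf: modelURL, configuration: MLModelConfiguration())
+if case .program(let program) = plan.modelStructure,
+   let main = program.functions["main"] {
+    for op in main.block.operations {
+        let preferred = plan.deviceUsage(for: op)?.preferred
+        print("\(op.operatorName): \(String(describing: preferred))")
+    }
+}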
+```
+
+---
+
+## Pattern 2c - Slow on Specific Device
+
+**Symptom**: Fast on Mac, slow on iPhone (or vice versa).
+
+**Cause**: Different hardware characteristics.
+
+**Diagnosis**:
+```swift
+// Check available compute
+let devices = MLModel.availableComputeDevices
+print(devices)  // Different per device
+```
+
+**Common issues**:
+
+| Scenario | Cause | Fix |
+|----------|-------|-----|
+| Fast on M-series Mac, slow on iPhone | Model optimized for GPU | Use palettization (Neural Engine) |
+| Fast on iPhone, slow on Intel Mac | No Neural Engine | Use quantization (GPU) |
+| Slow on older devices | Less compute power | Use more aggressive compression |
+
+**Recommendation**: Profile on target devices, not just development Mac.
+
+---
+
+## Pattern 3a - Memory Grows During Predictions
+
+**Symptom**: Memory increases with each prediction, doesn't release.
+
+**Cause**: Input/output buffers accumulating from concurrent predictions.
+
+**Diagnosis**:
+```
+Instruments → Allocations + Core ML template
+Look for: Many concurrent prediction intervals
+Check: MLMultiArray allocations growing
+```
+
+**Fix**: Limit concurrent predictions:
+```swift
+actor PredictionLimiter {
+    private let maxConcurrent = 2
+    private var inFlight = 0
+
+    func predict(_ model: MLModel, input: MLFeatureProvider) async throws -> MLFeatureProvider {
+        // Re-check after each suspension; sleeping avoids a busy spin
+        while inFlight >= maxConcurrent {
+            try await Task.sleep(nanoseconds: 5_000_000)
+        }
+        inFlight += 1
+        defer { inFlight -= 1 }
+        return try await model.prediction(from: input)
+    }
+}
+```
+
+---
+
+## Pattern 3b - Out of Memory on Load
+
+**Symptom**: App crashes or model fails to load on memory-constrained devices.
+
+**Cause**: Model too large for device memory.
+
+**Diagnosis**:
+```bash
+# Check model size
+ls -lh Model.mlpackage/Data/com.apple.CoreML/weights/
+```
+
+**Fix options**:
+
+| Approach | Size vs. Float16 | Memory Impact |
+|----------|-------------|---------------|
+| 8-bit palettization | 2x smaller | 2x less memory |
+| 4-bit palettization | 4x smaller | 4x less memory |
+| Pruning (50%) | ~2x smaller | ~2x less memory |
+
+**Note**: Compressed weights are decompressed just-in-time (iOS 17+), so smaller on-disk = smaller in memory.
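+
+The ratios in the table follow from simple arithmetic on bits per weight. The sketch below assumes a Float16 baseline (16 bits per weight) and ignores lookup-table, sparse-index, and metadata overhead; `estimated_weight_mb` and the 100M-parameter figure are illustrative assumptions, not measurements.
+
+```python
+# Assumption: Float16 baseline; lookup-table and sparse-index overhead are ignored.
+def estimated_weight_mb(num_params, bits_per_weight=16, sparsity=0.0):
+    """Rough weight size in megabytes for a given bit depth and sparsity."""
+    kept = num_params * (1.0 - sparsity)
+    return kept * bits_per_weight / 8 / 1_000_000
+
+
+params = 100_000_000  # illustrative 100M-parameter model
+print(estimated_weight_mb(params))           # Float16 baseline: 200.0
+print(estimated_weight_mb(params, 8))        # 8-bit palettization: 100.0
+print(estimated_weight_mb(params, 4))        # 4-bit palettization: 50.0
+print(estimated_weight_mb(params, 16, 0.5))  # 50% pruning: 100.0
+```
+
+Compare the estimate with the device's memory budget before choosing a bit depth; real files will be somewhat larger than this lower bound.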
+
+---
+
+## Pattern 4a - Bad Accuracy After Palettization
+
+**Symptom**: Model output degraded after palettization.
+
+**Diagnosis**:
+1. What bit depth? (2-bit most likely to fail)
+2. What granularity? (per-tensor loses more than per-grouped-channel)
+
+**Fix progression**:
+
+```python
+from coremltools.optimize.coreml import OpPalettizerConfig
+
+# Step 1: Try grouped channels (iOS 18+)
+config = OpPalettizerConfig(
+    nbits=4,
+    granularity="per_grouped_channel",
+    group_size=16
+)
+
+# Step 2: If still bad, try more bits (same granularity settings)
+config = OpPalettizerConfig(nbits=6, granularity="per_grouped_channel", group_size=16)
+
+# Step 3: If you still need 4-bit, fine-tune with training-time palettization (DKM)
+from coremltools.optimize.torch.palettization import DKMPalettizer
+# ... training-time compression
+```
+
+**Key insight**: 4-bit per-tensor has only 16 clusters for entire weight matrix. Grouped channels = 16 clusters per 16 channels = much better granularity.
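+
+The cluster counts behind this insight can be worked out directly. The sketch below assumes each group of output channels gets its own lookup table, as described above; `palette_entries` and the 512-channel layer are hypothetical and only illustrate the arithmetic.
+
+```python
+# Hypothetical helper: counts lookup-table entries available to one weight matrix.
+def palette_entries(out_channels, nbits, group_size=None):
+    """Total palette values across the matrix.
+
+    group_size=None models per-tensor granularity (one shared table).
+    """
+    entries_per_table = 2 ** nbits
+    if group_size is None:
+        return entries_per_table
+    num_groups = -(-out_channels // group_size)  # ceiling division
+    return entries_per_table * num_groups
+
+
+print(palette_entries(512, 4))      # per-tensor: 16
+print(palette_entries(512, 4, 16))  # grouped: 32 tables x 16 entries = 512
+```
+
+For this illustrative layer, grouping gives the palettizer 32 times as many representable values at the same bit depth, at the cost of storing more tables.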
+
+---
+
+## Pattern 4b - Bad Accuracy After Quantization
+
+**Symptom**: Model output degraded after INT8/INT4 quantization.
+
+**Diagnosis**:
+1. What bit depth?
+2. What granularity?
+
+**Fix progression**:
+
+```python
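+# Step 1: Use per-block (iOS 18+)
+from coremltools.optimize.coreml import OpLinearQuantizerConfig
+config = OpLinearQuantizerConfig(
+    dtype="int4",
+    granularity="per_block",
+    block_size=32
+)
+
+# Step 2: Use calibration data (GPTQ) on the PyTorch model
+from coremltools.optimize.torch.layerwise_compression import (
+    LayerwiseCompressor,
+    LayerwiseCompressorConfig,
+)
+calib_config = LayerwiseCompressorConfig.from_dict(
+    {"global_config": {"algorithm": "gptq", "weight_dtype": "uint4"}}
+)
+compressor = LayerwiseCompressor(model, calib_config)
+quantized = compressor.compress(calibration_loader, device="cpu")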
+```
+
+**Note**: INT4 quantization works best on Mac GPU. For Neural Engine, prefer palettization.
+
+---
+
+## Pattern 4c - Bad Accuracy After Pruning
+
+**Symptom**: Model output degraded after weight pruning.
+
+**Diagnosis**:
+1. What sparsity level?
+2. Post-training or training-time?
+
+**Thresholds** (model-dependent):
+- 0-30% sparsity: Usually safe
+- 30-50% sparsity: May need calibration
+- 50%+ sparsity: Usually needs training-time
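+
+The thresholds above can be captured as a small decision helper. This is only a sketch: the cut-offs are the rough, model-dependent values from the list, and `pruning_strategy` and its return labels are hypothetical names chosen for illustration.
+
+```python
+# Hypothetical helper encoding the rough thresholds above; they are model-dependent.
+def pruning_strategy(target_sparsity):
+    """Suggest a pruning workflow for a target sparsity in [0, 1)."""
+    if not 0.0 <= target_sparsity < 1.0:
+        raise ValueError("target_sparsity must be in [0, 1)")
+    if target_sparsity <= 0.30:
+        return "post-training"
+    if target_sparsity < 0.50:
+        return "calibration"
+    return "training-time"
+
+
+print(pruning_strategy(0.4))  # calibration
+```
+
+Treat the result as a starting point and confirm it with an accuracy check on your own evaluation set.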
+
+**Fix**:
+```python
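+# Use calibration-based pruning (SparseGPT)
+from coremltools.optimize.torch.layerwise_compression import (
+    LayerwiseCompressor,
+    LayerwiseCompressorConfig,
+)
+
+config = LayerwiseCompressorConfig.from_dict({
+    "global_config": {"algorithm": "sparse_gpt", "target_sparsity": 0.4},
+    "calibration_nsamples": 128,
+})
+compressor = LayerwiseCompressor(model, config)
+sparse = compressor.compress(calibration_loader, device="cpu")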
+```
+
+---
+
+## Pattern 5a - Operation Not Supported
+
+**Symptom**: Conversion fails with unsupported operation error.
+
+**Diagnosis**:
+```
+Error: "Op 'custom_op' is not supported for conversion"
+```
+
+**Options**:
+
+1. **Check if op is in coremltools**: May need newer version
+```bash
+pip install --upgrade coremltools
+```
+
+2. **Use composite ops**: Split into supported primitives
+```python
+# Instead of custom_op(x)
+# Use: supported_op1(supported_op2(x))
+```
+
+3. **Register custom op**: Advanced, requires MIL programming
+```python
+from coremltools.converters.mil import Builder as mb
+from coremltools.converters.mil.frontend.torch.torch_op_registry import register_torch_op
+
+@register_torch_op
+def custom_op(context, node):
+    # Map to MIL operations
+    ...
+```
+
+---
+
+## Pattern 5b - Conversion Succeeds but Wrong Output
+
+**Symptom**: Model converts but predictions differ from PyTorch.
+
+**Diagnosis checklist**:
+
+1. **Input normalization**: Ensure preprocessing matches
+```python
+# PyTorch often uses ImageNet normalization
+# CoreML may need explicit preprocessing
+```
+
+2. **Shape ordering**: PyTorch (NCHW) vs CoreML (NHWC for some ops)
+```python
+# Check shapes in conversion
+ct.convert(..., inputs=[ct.ImageType(shape=(1, 3, 224, 224))])
+```
+
+3. **Precision differences**: Float16 may differ from Float32
+```python
+# Force Float32 to match PyTorch
+ct.convert(..., compute_precision=ct.precision.FLOAT32)
+```
+
+4. **Random ops**: Dropout, random initialization differ
+```python
+# Ensure eval mode
+model.eval()
+```
+
+**Debug**:
+```python
+# Compare the final outputs of the PyTorch and Core ML models
+import numpy as np
+
+torch_output = model(input).detach().numpy()
+coreml_output = mlmodel.predict({"input": input.numpy()})["output"]
+
+print(f"Max diff: {np.max(np.abs(torch_output - coreml_output))}")
+```
+
+---
+
+## Pressure Scenario - "Model works on simulator but not device"
+
+**Wrong approach**: Assume simulator bug, ignore.
+
+**Right approach**:
+1. Check model spec version vs device iOS version (Pattern 1a)
+2. Check compute unit availability (Pattern 2c)
+3. Profile on actual device, not simulator
+4. Simulator uses host Mac's GPU/CPU, not device Neural Engine
+
+---
+
+## Pressure Scenario - "Ship now, optimize later"
+
+**Wrong approach**: Compress to smallest possible size without testing.
+
+**Right approach**:
+1. Ship Float16 baseline first
+2. Profile on target devices
+3. Apply compression incrementally with accuracy testing
+4. Document compression settings for future optimization
+
+---
+
+## Diagnostic Checklist
+
+When CoreML isn't working:
+
+- [ ] Check deployment target matches device iOS
+- [ ] Check model file is compiled (.mlmodelc)
+- [ ] Profile load: cached vs uncached
+- [ ] Profile prediction: which compute units
+- [ ] Check memory: concurrent predictions limited
+- [ ] For compression issues: try higher granularity
+- [ ] For conversion issues: check op support, precision
+
+## Resources
+
+**WWDC**: 2023-10047, 2023-10049, 2024-10159, 2024-10161
+
+**Docs**: /coreml, /coreml/mlmodel
+
+**Skills**: coreml, coreml-ref
--- a/.claude/skills/axiom-ios-ml/coreml-ref/SKILL.md
+++ b/.claude/skills/axiom-ios-ml/coreml-ref/SKILL.md
@@ -0,0 +1,467 @@
+---
+name: coreml-ref
+description: CoreML API reference - MLModel lifecycle, MLTensor operations, coremltools conversion, compression APIs, state management, compute device availability, performance profiling.
+license: MIT
+version: 1.0.0
+---
+
+# CoreML API Reference
+
+## Part 1 - Model Lifecycle
+
+### MLModel Loading
+
+```swift
+// Synchronous load (blocks thread)
+let model = try MLModel(contentsOf: compiledModelURL)
+
+// Async load (preferred)
+let model = try await MLModel.load(contentsOf: compiledModelURL)
+
+// With configuration
+let config = MLModelConfiguration()
+config.computeUnits = .all  // .cpuOnly, .cpuAndGPU, .cpuAndNeuralEngine
+let model = try await MLModel.load(contentsOf: url, configuration: config)
+```
+
+### Model Asset Types
+
+| Type | Extension | Purpose |
+|------|-----------|---------|
+| Source | `.mlmodel`, `.mlpackage` | Development, editing |
+| Compiled | `.mlmodelc` | Runtime execution |
+
+**Note**: Xcode compiles source models automatically. At runtime, use compiled models.
+
+### Caching Behavior
+
+First load triggers device specialization (can be slow). Subsequent loads use cache.
+
+```
+Load flow:
+  ├─ Check cache for (model path + configuration + device)
+  │   ├─ Found → Cached load (fast)
+  │   └─ Not found → Device specialization
+  │       ├─ Parse model
+  │       ├─ Optimize operations
+  │       ├─ Segment for compute units
+  │       ├─ Compile for each unit
+  │       └─ Cache result
+```
+
+Cache invalidated by: system updates, low disk space, model modification.
+
+### Multi-Function Models
+
+```swift
+// Load specific function
+let config = MLModelConfiguration()
+config.functionName = "sticker"  // Function name from model
+
+let model = try MLModel(contentsOf: url, configuration: config)
+```
+
+---
+
+## Part 2 - Compute Availability
+
+### MLComputeDevice (iOS 17+)
+
+```swift
+// Check available compute devices
+let devices = MLModel.availableComputeDevices
+
+// Check for Neural Engine
+let hasNeuralEngine = devices.contains { device in
+    if case .neuralEngine = device { return true }
+    return false
+}
+
+// Check for specific GPU
+for device in devices {
+    switch device {
+    case .cpu:
+        print("CPU available")
+    case .gpu(let gpu):
+        print("GPU: \(gpu.metalDevice.name)")
+    case .neuralEngine(let ne):
+        print("Neural Engine: \(ne.totalCoreCount) cores")
+    @unknown default:
+        break
+    }
+}
+```
+
+### MLModelConfiguration.ComputeUnits
+
+| Value | Behavior |
+|-------|----------|
+| `.all` | Best performance (default) |
+| `.cpuOnly` | CPU only |
+| `.cpuAndGPU` | Exclude Neural Engine |
+| `.cpuAndNeuralEngine` | Exclude GPU |
+
+---
+
+## Part 3 - Prediction APIs
+
+### Synchronous Prediction
+
+```swift
+// Single prediction (NOT thread-safe)
+let output = try model.prediction(from: input)
+
+// Batch prediction
+let outputs = try model.predictions(from: batch)
+```
+
+### Async Prediction (iOS 17+)
+
+```swift
+// Single prediction (thread-safe, supports concurrency)
+let output = try await model.prediction(from: input)
+
+// With cancellation
+let output = try await withTaskCancellationHandler {
+    try await model.prediction(from: input)
+} onCancel: {
+    // Prediction will be cancelled
+}
+```
+
+### State-Based Prediction
+
+```swift
+// Create state from model
+let state = model.makeState()
+
+// Prediction with state (state updated in-place)
+let output = try model.prediction(from: input, using: state)
+
+// Async with state
+let output = try await model.prediction(from: input, using: state)
+```
+
+---
+
+## Part 4 - MLTensor (iOS 18+)
+
+### Creating Tensors
+
+```swift
+import CoreML
+
+// From MLShapedArray
+let shapedArray = MLShapedArray<Float>(scalars: [1, 2, 3, 4], shape: [2, 2])
+let tensor = MLTensor(shapedArray)
+
+// From nested collections
+let tensor = MLTensor([[1.0, 2.0], [3.0, 4.0]])
+
+// Zeros/ones
+let zeros = MLTensor(zeros: [3, 3], scalarType: Float.self)
+```
+
+### Math Operations
+
+```swift
+// Element-wise
+let sum = tensor1 + tensor2
+let product = tensor1 * tensor2
+let scaled = tensor * 2.0
+
+// Reductions
+let mean = tensor.mean()
+let total = tensor.sum()  // renamed: `sum` is already declared above
+let max = tensor.max()
+
+// Comparison
+let mask = tensor .> mean  // Boolean mask
+
+// Softmax
+let probs = tensor.softmax()
+```
+
+### Indexing and Reshaping
+
+```swift
+// Slicing (Python-like syntax)
+let row = tensor[0]           // First row
+let col = tensor[.all, 0]     // First column
+let slice = tensor[0..<2, 1..<3]
+
+// Reshaping
+let reshaped = tensor.reshaped(to: [4])
+let expanded = tensor.expandingShape(at: 0)
+```
+
+### Materialization
+
+**Critical**: Tensor operations are dispatched asynchronously. Materialize the tensor to access its data.
+
+```swift
+// Materialize to MLShapedArray (suspends until the computation completes)
+let array = await tensor.shapedArray(of: Float.self)
+
+// Access scalars
+let values = array.scalars
+```
+
+---
+
+## Part 5 - Core ML Tools (Python)
+
+### Basic Conversion
+
+```python
+import coremltools as ct
+import torch
+
+# Trace PyTorch model
+model.eval()
+traced = torch.jit.trace(model, example_input)
+
+# Convert
+mlmodel = ct.convert(
+    traced,
+    inputs=[ct.TensorType(shape=example_input.shape)],
+    outputs=[ct.TensorType(name="output")],
+    minimum_deployment_target=ct.target.iOS18
+)
+
+mlmodel.save("Model.mlpackage")
+```
+
+### Dynamic Shapes
+
+```python
+# Fixed shape
+ct.TensorType(shape=(1, 3, 224, 224))
+
+# Range dimension
+ct.TensorType(shape=(1, ct.RangeDim(1, 2048)))
+
+# Enumerated shapes
+ct.TensorType(shape=ct.EnumeratedShapes(shapes=[(1, 256), (1, 512), (1, 1024)]))
+```
+
+### State Types
+
+```python
+# For stateful models (KV-cache)
+states = [
+    ct.StateType(
+        name="keyCache",
+        wrapped_type=ct.TensorType(shape=(1, 32, 2048, 128))
+    ),
+    ct.StateType(
+        name="valueCache",
+        wrapped_type=ct.TensorType(shape=(1, 32, 2048, 128))
+    )
+]
+
+mlmodel = ct.convert(traced, inputs=inputs, states=states, ...)
+```
+
+---
+
+## Part 6 - Compression APIs (coremltools.optimize)
+
+### Post-Training Palettization
+
+```python
+from coremltools.optimize.coreml import (
+    OpPalettizerConfig,
+    OptimizationConfig,
+    palettize_weights
+)
+
+# Per-tensor (iOS 17+)
+config = OpPalettizerConfig(mode="kmeans", nbits=4)
+
+# Per-grouped-channel (iOS 18+, better accuracy)
+config = OpPalettizerConfig(
+    mode="kmeans",
+    nbits=4,
+    granularity="per_grouped_channel",
+    group_size=16
+)
+
+opt_config = OptimizationConfig(global_config=config)
+compressed = palettize_weights(model, opt_config)
+```
+
+### Post-Training Quantization
+
+```python
+from coremltools.optimize.coreml import (
+    OpLinearQuantizerConfig,
+    OptimizationConfig,
+    linear_quantize_weights
+)
+
+# INT8 per-channel (iOS 17+)
+config = OpLinearQuantizerConfig(mode="linear", dtype="int8")
+
+# INT4 per-block (iOS 18+)
+config = OpLinearQuantizerConfig(
+    mode="linear",
+    dtype="int4",
+    granularity="per_block",
+    block_size=32
+)
+
+opt_config = OptimizationConfig(global_config=config)
+compressed = linear_quantize_weights(model, opt_config)
+```
+
+### Post-Training Pruning
+
+```python
+from coremltools.optimize.coreml import (
+    OpMagnitudePrunerConfig,
+    OptimizationConfig,
+    prune_weights
+)
+
+config = OpMagnitudePrunerConfig(target_sparsity=0.5)
+opt_config = OptimizationConfig(global_config=config)
+sparse = prune_weights(model, opt_config)
+```
+
+### Training-Time Palettization (PyTorch)
+
+```python
+from coremltools.optimize.torch.palettization import (
+    DKMPalettizerConfig,
+    DKMPalettizer
+)
+
+config = DKMPalettizerConfig.from_dict({"global_config": {"n_bits": 4}})
+palettizer = DKMPalettizer(model, config)
+
+# Prepare (inserts palettization layers)
+prepared = palettizer.prepare()
+
+# Training loop
+for epoch in range(epochs):
+    train_one_epoch(prepared, data_loader)
+    palettizer.step()
+
+# Finalize
+final = palettizer.finalize()
+```
+
+### Calibration-Based Compression
+
+```python
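+from coremltools.optimize.torch.layerwise_compression import (
+    LayerwiseCompressor,
+    LayerwiseCompressorConfig,
+)
+
+# SparseGPT pruning driven by calibration samples
+config = LayerwiseCompressorConfig.from_dict({
+    "global_config": {"algorithm": "sparse_gpt", "target_sparsity": 0.4},
+    "calibration_nsamples": 128,
+})
+
+compressor = LayerwiseCompressor(model, config)
+compressed = compressor.compress(calibration_loader, device="cpu")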
+```
+
+---
+
+## Part 7 - Multi-Function Models
+
+### Merging Models
+
+```python
+from coremltools.models.utils import MultiFunctionDescriptor, save_multifunction
+
+# Create descriptor: add_function(model_path, src_function_name, target_function_name)
+desc = MultiFunctionDescriptor()
+desc.add_function("model_a.mlpackage", src_function_name="main", target_function_name="function_a")
+desc.add_function("model_b.mlpackage", src_function_name="main", target_function_name="function_b")
+desc.default_function_name = "function_a"
+
+# Merge (deduplicates shared weights)
+save_multifunction(desc, "merged.mlpackage")
+```
+
+### Inspecting Functions (Xcode)
+
+Open model in Xcode → Predictions tab → Functions listed above inputs.
+
+---
+
+## Part 8 - Performance Profiling
+
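+### MLComputePlan (iOS 17.4+)
+
+```swift
+let plan = try await MLComputePlan.load(contentsOf: modelURL, configuration: MLModelConfiguration())
+
+// Inspect operations of an ML program's main function
+if case .program(let program) = plan.modelStructure,
+   let main = program.functions["main"] {
+    for op in main.block.operations {
+        print("Op: \(op.operatorName)")
+        print("  Preferred: \(String(describing: plan.deviceUsage(for: op)?.preferred))")
+        print("  Estimated cost: \(String(describing: plan.estimatedCost(of: op)?.weight))")
+    }
+}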
+```
+
+### Xcode Performance Reports
+
+1. Open model in Xcode
+2. Select Performance tab
+3. Click + to create report
+4. Select device and compute units
+5. Click "Run Test"
+
+**New in iOS 18**: Shows estimated time per operation, compute device support hints.
+
+### Core ML Instrument
+
+```
+Instruments → Core ML template
+  ├─ Load events: "cached" vs "prepare and cache"
+  ├─ Prediction intervals
+  ├─ Compute unit usage
+  └─ Neural Engine activity
+```
+
+---
+
+## Part 9 - Deployment Targets
+
+| Target | Key Features |
+|--------|--------------|
+| iOS 16 | Weight compression (palettization, quantization, pruning) |
+| iOS 17 | Async prediction, MLComputeDevice, activation quantization |
+| iOS 18 | MLTensor, State, SDPA fusion, per-block quantization, multi-function |
+
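+**Recommendation**: Set `minimum_deployment_target=ct.target.iOS18` for best optimizations when the app can require iOS 18; otherwise use the lowest target you must support, since devices on older OS versions cannot load the model.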
+
+---
+
+## Part 10 - Conversion Pass Pipelines
+
+```python
+# Default pipeline
+mlmodel = ct.convert(traced, ...)
+
+# With palettization support
+mlmodel = ct.convert(
+    traced,
+    pass_pipeline=ct.PassPipeline.DEFAULT_PALETTIZATION,
+    ...
+)
+```
+
+## Resources
+
+**WWDC**: 2023-10047, 2023-10049, 2024-10159, 2024-10161
+
+**Docs**: /coreml, /coreml/mlmodel, /coreml/mltensor, /documentation/coremltools
+
+**Skills**: coreml, coreml-diag
--- a/.claude/skills/axiom-ios-ml/coreml/SKILL.md
+++ b/.claude/skills/axiom-ios-ml/coreml/SKILL.md
@@ -0,0 +1,468 @@
+---
+name: coreml
+description: Use when deploying custom ML models on-device, converting PyTorch models, compressing models, implementing LLM inference, or optimizing CoreML performance. Covers model conversion, compression, stateful models, KV-cache, multi-function models, MLTensor.
+license: MIT
+version: 1.0.0
+---
+
+# CoreML On-Device Machine Learning
+
+## Overview
+
+CoreML enables on-device machine learning inference across all Apple platforms. It abstracts hardware details while leveraging Apple Silicon's CPU, GPU, and Neural Engine for high-performance, private, and efficient execution.
+
+**Key principle**: Start with the simplest approach, then optimize based on profiling. Don't over-engineer compression or caching until you have real performance data.
+
+## Decision Tree - CoreML vs Foundation Models
+
+```
+Need on-device ML?
+  ├─ Text generation (LLM)?
+  │   ├─ Simple prompts, structured output? → Foundation Models (ios-ai skill)
+  │   └─ Custom model, fine-tuned, specific architecture? → CoreML
+  ├─ Custom trained model?
+  │   └─ Yes → CoreML
+  ├─ Image/audio/sensor processing?
+  │   └─ Yes → CoreML
+  └─ Apple's built-in intelligence?
+      └─ Yes → Foundation Models (ios-ai skill)
+```
+
+## Red Flags
+
+Use this skill when you see:
+- "Convert PyTorch model to CoreML"
+- "Model too large for device"
+- "Slow inference performance"
+- "LLM on-device"
+- "KV-cache" or "stateful model"
+- "Model compression" or "quantization"
+- MLModel, MLTensor, or coremltools in context
+
+## Pattern 1 - Basic Model Conversion
+
+The standard PyTorch → CoreML workflow.
+
+```python
+import coremltools as ct
+import torch
+
+# Trace the model
+model.eval()
+traced_model = torch.jit.trace(model, example_input)
+
+# Convert to CoreML
+mlmodel = ct.convert(
+    traced_model,
+    inputs=[ct.TensorType(shape=example_input.shape)],
+    minimum_deployment_target=ct.target.iOS18
+)
+
+# Save
+mlmodel.save("MyModel.mlpackage")
+```
+
+**Critical**: Always set `minimum_deployment_target` to enable latest optimizations.
+
+## Pattern 2 - Model Compression (Post-Training)
+
+Three techniques, each with different tradeoffs:
+
+### Palettization (Best for Neural Engine)
+
+Clusters weights into lookup tables. Use per-grouped-channel for better accuracy.
+
+```python
+from coremltools.optimize.coreml import (
+    OpPalettizerConfig,
+    OptimizationConfig,
+    palettize_weights
+)
+
+# 4-bit with grouped channels (iOS 18+)
+op_config = OpPalettizerConfig(
+    mode="kmeans",
+    nbits=4,
+    granularity="per_grouped_channel",
+    group_size=16
+)
+
+config = OptimizationConfig(global_config=op_config)
+compressed_model = palettize_weights(model, config)
+```
+
+| Bits | Compression | Accuracy Impact |
+|------|-------------|-----------------|
+| 8-bit | 2x | Minimal |
+| 6-bit | 2.7x | Low |
+| 4-bit | 4x | Moderate (use grouped channels) |
+| 2-bit | 8x | High (requires training-time) |
+
+### Quantization (Best for GPU on Mac)
+
+Linear mapping to INT8/INT4. Use per-block for better accuracy.
+
+```python
+from coremltools.optimize.coreml import (
+    OpLinearQuantizerConfig,
+    OptimizationConfig,
+    linear_quantize_weights
+)
+
+# INT4 per-block quantization (iOS 18+)
+op_config = OpLinearQuantizerConfig(
+    mode="linear",
+    dtype="int4",
+    granularity="per_block",
+    block_size=32
+)
+
+config = OptimizationConfig(global_config=op_config)
+compressed_model = linear_quantize_weights(model, config)
+```
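+
+The following toy sketch illustrates why per-block granularity tends to help. It is a simplification, not the coremltools algorithm: it uses symmetric quantization with one scale per block, and the helper names and sample weights are hypothetical.
+
+```python
+def quantize_dequantize(values, nbits=4):
+    """Symmetric linear quantization of one block, then dequantization."""
+    qmax = 2 ** (nbits - 1) - 1                      # 7 for signed 4-bit
+    scale = max(abs(v) for v in values) / qmax or 1.0
+    return [round(v / scale) * scale for v in values]
+
+
+def per_block(values, block_size, nbits=4):
+    """Quantize each block with its own scale."""
+    out = []
+    for start in range(0, len(values), block_size):
+        out.extend(quantize_dequantize(values[start:start + block_size], nbits))
+    return out
+
+
+def mean_abs_error(a, b):
+    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)
+
+
+# Small weights followed by large ones: a single scale wastes precision
+# on the small values, while per-block scales adapt to each group.
+weights = [0.01, -0.02, 0.03, -0.04, 4.0, -3.5, 2.0, 1.0]
+err_tensor = mean_abs_error(weights, quantize_dequantize(weights))
+err_block = mean_abs_error(weights, per_block(weights, block_size=4))
+```
+
+With one scale for the whole list, the four small weights all round to zero; with a scale per block they keep most of their precision, so `err_block` comes out lower than `err_tensor`.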
+
+### Pruning (Combine with other techniques)
+
+Sets weights to zero for sparse representation. Can combine with palettization.
+
+```python
+from coremltools.optimize.coreml import (
+    OpMagnitudePrunerConfig,
+    OptimizationConfig,
+    prune_weights
+)
+
+op_config = OpMagnitudePrunerConfig(
+    target_sparsity=0.4  # 40% zeros
+)
+
+config = OptimizationConfig(global_config=op_config)
+sparse_model = prune_weights(model, config)
+```
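+
+As a rough illustration of what magnitude pruning does to the weights, the sketch below zeroes the smallest-magnitude 40% of a flat list. The `magnitude_prune` helper and the sample values are hypothetical; coremltools applies the same idea per weight tensor.
+
+```python
+def magnitude_prune(weights, target_sparsity):
+    """Zero out the smallest-magnitude fraction of the weights."""
+    n_zero = int(len(weights) * target_sparsity)
+    # Indices of the n_zero weights closest to zero.
+    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:n_zero]
+    pruned = list(weights)
+    for i in smallest:
+        pruned[i] = 0.0
+    return pruned
+
+
+weights = [0.5, -0.01, 0.3, 0.02, -0.7, 0.05, -0.2, 0.9, 0.001, -0.4]
+pruned = magnitude_prune(weights, target_sparsity=0.4)
+```
+
+The larger weights are untouched, which is why moderate sparsity often costs little accuracy and why the result can still be palettized afterwards.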
+
+## Pattern 3 - Training-Time Compression
+
+When post-training compression loses too much accuracy, fine-tune with compression.
+
+```python
+from coremltools.optimize.torch.palettization import (
+    DKMPalettizerConfig,
+    DKMPalettizer
+)
+
+# Configure 4-bit palettization
+config = DKMPalettizerConfig.from_dict({"global_config": {"n_bits": 4}})
+
+# Prepare model
+palettizer = DKMPalettizer(model, config)
+prepared_model = palettizer.prepare()
+
+# Fine-tune (your training loop)
+for epoch in range(num_epochs):
+    train_epoch(prepared_model, data_loader)
+    palettizer.step()
+
+# Finalize
+final_model = palettizer.finalize()
+```
+
+**Tradeoff**: Better accuracy than post-training, but requires training data and time.
+
+## Pattern 4 - Calibration-Based Compression (iOS 18+)
+
+Middle ground: uses calibration data without full training.
+
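+```python
+from coremltools.optimize.torch.layerwise_compression import (
+    LayerwiseCompressor,
+    LayerwiseCompressorConfig
+)
+
+# Configure SparseGPT-style pruning (verify keys against your coremltools version)
+config = LayerwiseCompressorConfig.from_dict({
+    "global_config": {"algorithm": "sparse_gpt", "target_sparsity": 0.4},
+    "calibration_nsamples": 128  # Calibration samples
+})
+
+# Create pruner
+pruner = LayerwiseCompressor(model, config)
+
+# Calibrate
+sparse_model = pruner.compress(calibration_data_loader, device="cpu")
+```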
+
+## Pattern 5 - Stateful Models (KV-Cache for LLMs)
+
+For transformer models, use state to avoid recomputing key/value vectors.
+
+### PyTorch Model with State
+
+```python
+import torch
+import torch.nn as nn
+
+class StatefulLLM(nn.Module):
+    def __init__(self, batch, heads, seq_len, dim):
+        super().__init__()
+        # Register state buffers (names must match the StateType names at conversion)
+        self.register_buffer("keyCache", torch.zeros(batch, heads, seq_len, dim))
+        self.register_buffer("valueCache", torch.zeros(batch, heads, seq_len, dim))
+
+    def forward(self, input_ids, causal_mask):
+        # Update caches in-place during forward
+        # ... attention with KV-cache ...
+        return logits
+```
+
+### Conversion with State
+
+```python
+import coremltools as ct
+
+mlmodel = ct.convert(
+    traced_model,
+    inputs=[
+        ct.TensorType(name="input_ids", shape=(1, ct.RangeDim(1, 2048))),
+        ct.TensorType(name="causal_mask", shape=(1, 1, ct.RangeDim(1, 2048), ct.RangeDim(1, 2048)))
+    ],
+    states=[
+        ct.StateType(name="keyCache", ...),
+        ct.StateType(name="valueCache", ...)
+    ],
+    minimum_deployment_target=ct.target.iOS18
+)
+```
+
+### Using State at Runtime
+
+```swift
+// Create state from model
+let state = model.makeState()
+
+// Run prediction with state (updated in-place)
+let output = try model.prediction(from: input, using: state)
+```
+
+**Performance**: 1.6x speedup on Mistral-7B (M3 Max) compared to manual KV-cache I/O.
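+
+To see where the savings come from, the sketch below counts how many per-token key/value projections are needed to generate a sequence with and without a cache. It is a back-of-the-envelope model with a hypothetical `kv_projections` helper, not a measurement of Core ML itself.
+
+```python
+def kv_projections(num_tokens, use_cache):
+    """Count key/value projections needed to generate num_tokens tokens."""
+    total = 0
+    cached = 0
+    for step in range(1, num_tokens + 1):
+        # Without a cache, every step re-projects all `step` tokens seen so far;
+        # with a cache, only tokens not yet cached need projecting.
+        total += step - cached
+        if use_cache:
+            cached = step
+    return total
+
+
+without_cache = kv_projections(8, use_cache=False)  # 1 + 2 + ... + 8
+with_cache = kv_projections(8, use_cache=True)      # one per new token
+```
+
+The uncached count grows quadratically with sequence length while the cached count grows linearly, which is why keeping the cache as model state matters more as sequences get longer.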
+
+## Pattern 6 - Multi-Function Models (Adapters/LoRA)
+
+Deploy multiple adapters in a single model, sharing base weights.
+
+```python
+import coremltools as ct
+from coremltools.models.utils import MultiFunctionDescriptor, save_multifunction
+
+# Convert individual models
+sticker_model = ct.convert(sticker_adapter_model, ...)
+storybook_model = ct.convert(storybook_adapter_model, ...)
+
+# Save individually
+sticker_model.save("sticker.mlpackage")
+storybook_model.save("storybook.mlpackage")
+
+# Merge with shared weights: map each package's "main" function to a new name
+desc = MultiFunctionDescriptor()
+desc.add_function("sticker.mlpackage", src_function_name="main", target_function_name="sticker")
+desc.add_function("storybook.mlpackage", src_function_name="main", target_function_name="storybook")
+desc.default_function_name = "sticker"
+
+save_multifunction(desc, "MultiAdapter.mlpackage")
+```
+
+### Loading Specific Function
+
+```swift
+let config = MLModelConfiguration()
+config.functionName = "sticker"  // or "storybook"
+
+let model = try MLModel(contentsOf: modelURL, configuration: config)
+```
+
+## Pattern 7 - MLTensor for Pipeline Stitching (iOS 18+)
+
+Simplifies computation between models (decoding, post-processing).
+
+```swift
+import CoreML
+
+// Create tensors
+let scores = MLTensor(shape: [1, vocab_size], scalars: logits)
+
+// Operations (executed asynchronously on Apple Silicon)
+let topK = scores.topK(k: 10)
+let probs = (topK.values / temperature).softmax()
+
+// Sample from distribution
+let sampled = probs.multinomial(numSamples: 1)
+
+// Materialize to access data (suspends until the computation completes)
+let shapedArray = await sampled.shapedArray(of: Int32.self)
+```
+
+**Key insight**: MLTensor operations are async. Call `shapedArray()` to materialize results.
+
+## Pattern 8 - Async Prediction for Concurrency
+
+Thread-safe concurrent predictions for throughput.
+
+```swift
+class ImageProcessor {
+    let model: MLModel
+
+    func processImages(_ images: [CGImage]) async throws -> [Output] {
+        try await withThrowingTaskGroup(of: Output.self) { group in
+            for image in images {
+                group.addTask {
+                    // Check cancellation before expensive work
+                    try Task.checkCancellation()
+
+                    let input = try self.prepareInput(image)
+                    // Async prediction - thread safe!
+                    return try await self.model.prediction(from: input)
+                }
+            }
+
+            return try await group.reduce(into: []) { $0.append($1) }
+        }
+    }
+}
+```
+
+**Warning**: Limit concurrent predictions to avoid memory pressure from multiple input/output buffers.
+
+```swift
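+// Limit concurrency
+// Note: AsyncSemaphore is not a standard-library type; this assumes a
+// third-party or custom async semaphore with wait()/signal().
+let semaphore = AsyncSemaphore(value: 2)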
+
+for image in images {
+    group.addTask {
+        await semaphore.wait()
+        defer { semaphore.signal() }
+        return try await process(image)
+    }
+}
+```
+
+## Anti-Patterns
+
+### Don't - Load models on main thread at launch
+
+```swift
+// BAD - blocks UI
+class AppDelegate {
+    let model = try! MLModel(contentsOf: url)  // Blocks!
+}
+
+// GOOD - lazy async loading
+class ModelManager {
+    private var model: MLModel?
+
+    func getModel() async throws -> MLModel {
+        if let model { return model }
+        model = try await Task.detached {
+            try MLModel(contentsOf: url)
+        }.value
+        return model!
+    }
+}
+```
+
+### Don't - Reload model for each prediction
+
+```swift
+// BAD - reloads every time
+func predict(_ input: Input) throws -> Output {
+    let model = try MLModel(contentsOf: url)  // Expensive!
+    return try model.prediction(from: input)
+}
+
+// GOOD - keep model loaded
+class Predictor {
+    private let model: MLModel
+
+    func predict(_ input: Input) throws -> Output {
+        try model.prediction(from: input)
+    }
+}
+```
+
+### Don't - Compress without profiling first
+
+```python
+# BAD - blind compression
+compressed = palettize_weights(model, two_bit_config)  # May break accuracy!
+
+# GOOD - profile, then compress iteratively
+# 1. Profile Float16 baseline
+# 2. Try 8-bit → check accuracy
+# 3. Try 6-bit → check accuracy
+# 4. Try 4-bit with grouped channels → check accuracy
+# 5. Only use 2-bit with training-time compression
+```
+
+### Don't - Ignore deployment target
+
+```python
+# BAD - misses optimizations
+mlmodel = ct.convert(traced_model, inputs=[...])
+
+# GOOD - enables SDPA fusion, per-block quantization, etc.
+mlmodel = ct.convert(
+    traced_model,
+    inputs=[...],
+    minimum_deployment_target=ct.target.iOS18
+)
+```
+
+## Pressure Scenarios
+
+### Scenario 1 - "Model is 5GB, need it under 2GB for iPhone"
+
+**Wrong approach**: Jump straight to 2-bit palettization.
+
+**Right approach**:
+1. Start with 8-bit palettization → check accuracy
+2. Try 6-bit → check accuracy
+3. Try 4-bit with `per_grouped_channel` → check accuracy
+4. If still too large, use calibration-based compression
+5. If still losing accuracy, use training-time compression
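+
+A quick size estimate helps decide where to start on that ladder. The sketch below assumes the 5GB figure is a Float16 model and ignores lookup-table overhead and non-weight data, so treat the numbers as rough lower bounds; `estimated_size_gb` is a hypothetical helper.
+
+```python
+def estimated_size_gb(float16_size_gb, nbits):
+    """Scale a Float16 model size by the ratio of bits per weight."""
+    return float16_size_gb * nbits / 16
+
+
+sizes = {nbits: estimated_size_gb(5.0, nbits) for nbits in (8, 6, 4)}
+for nbits, size in sizes.items():
+    verdict = "fits" if size < 2.0 else "too large"
+    print(f"{nbits}-bit: ~{size} GB ({verdict} for a 2 GB target)")
+```
+
+Under these assumptions 8-bit alone lands at about 2.5 GB, so it is still worth running as an accuracy checkpoint, but 6-bit or lower is what actually meets the 2 GB target.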
+
+### Scenario 2 - "LLM inference is too slow"
+
+**Wrong approach**: Try different compute units randomly.
+
+**Right approach**:
+1. Profile with Core ML Instrument
+2. Check if load is cached (look for "cached" vs "prepare and cache")
+3. Enable stateful KV-cache
+4. Check SDPA optimization is enabled (iOS 18+ deployment target)
+5. Consider INT4 quantization for GPU on Mac
+
+### Scenario 3 - "Need multiple LoRA adapters in one app"
+
+**Wrong approach**: Ship separate models for each adapter.
+
+**Right approach**:
+1. Convert each adapter model separately
+2. Use `MultiFunctionDescriptor` to merge with shared base
+3. Load specific function via `config.functionName`
+4. Weights are deduplicated automatically
+
+## Checklist
+
+Before deploying a CoreML model:
+
+- [ ] Set `minimum_deployment_target` to latest supported iOS
+- [ ] Profile baseline Float16 performance
+- [ ] Check if model load is cached
+- [ ] Consider compression only if size/performance requires it
+- [ ] Test accuracy after each compression step
+- [ ] Use async prediction for concurrent workloads
+- [ ] Limit concurrent predictions to manage memory
+- [ ] Use state for transformer KV-cache
+- [ ] Use multi-function for adapter variants
+
+## Resources
+
+**WWDC**: 2023-10047, 2023-10049, 2024-10159, 2024-10161
+
+**Docs**: /coreml, /coreml/mlmodel, /coreml/mltensor
+
+**Skills**: coreml-ref, coreml-diag, axiom-ios-ai (Foundation Models)
--- a/.claude/skills/axiom-ios-ml/speech/SKILL.md
+++ b/.claude/skills/axiom-ios-ml/speech/SKILL.md
@@ -0,0 +1,496 @@
+---
+name: speech
+description: Use when implementing speech-to-text, live transcription, or audio transcription. Covers SpeechAnalyzer (iOS 26+), SpeechTranscriber, volatile/finalized results, AssetInventory model management, audio format handling.
+license: MIT
+version: 1.0.0
+---
+
+# Speech-to-Text with SpeechAnalyzer
+
+## Overview
+
+SpeechAnalyzer is Apple's new speech-to-text API introduced in iOS 26. It powers Notes, Voice Memos, Journal, and Call Summarization. The on-device model is faster, more accurate, and better for long-form/distant audio than SFSpeechRecognizer.
+
+**Key principle**: SpeechAnalyzer is modular—add transcription modules to an analysis session. Results stream asynchronously using Swift's AsyncSequence.
+
+## Decision Tree - SpeechAnalyzer vs SFSpeechRecognizer
+
+```
+Need speech-to-text?
+  ├─ iOS 26+ only?
+  │   └─ Yes → SpeechAnalyzer (preferred)
+  ├─ Need iOS 10-25 support?
+  │   └─ Yes → SFSpeechRecognizer (DictationTranscriber requires iOS 26+)
+  ├─ Long-form audio (meetings, lectures)?
+  │   └─ Yes → SpeechAnalyzer
+  ├─ Distant audio (across room)?
+  │   └─ Yes → SpeechAnalyzer
+  └─ Short dictation commands?
+      └─ Either works
+```
+
+**SpeechAnalyzer advantages**:
+- Better for long-form and conversational audio
+- Works well with distant speakers (meetings)
+- On-device, private
+- Model managed by system (no app size increase)
+- Powers Notes, Voice Memos, Journal
+
+**DictationTranscriber** (iOS 26+): Same languages as SFSpeechRecognizer, but doesn't require user to enable Siri/dictation in Settings.
+
+## Red Flags
+
+Use this skill when you see:
+- "Live transcription"
+- "Transcribe audio"
+- "Speech-to-text"
+- "SpeechAnalyzer" or "SpeechTranscriber"
+- "Volatile results"
+- Building Notes-like or Voice Memos-like features
+
+## Pattern 1 - File Transcription (Simplest)
+
+Transcribe an audio file to text in one function.
+
+```swift
+import AVFoundation
+import Speech
+
+func transcribe(file: URL, locale: Locale) async throws -> AttributedString {
+    // Set up transcriber
+    let transcriber = SpeechTranscriber(
+        locale: locale,
+        preset: .offlineTranscription
+    )
+
+    // Collect results asynchronously
+    async let transcriptionFuture = try transcriber.results
+        .reduce(AttributedString()) { str, result in
+            str + result.text
+        }
+
+    // Set up analyzer with transcriber module
+    let analyzer = SpeechAnalyzer(modules: [transcriber])
+
+    // Open the URL as an audio file, then analyze it
+    let audioFile = try AVAudioFile(forReading: file)
+    if let lastSample = try await analyzer.analyzeSequence(from: audioFile) {
+        try await analyzer.finalizeAndFinish(through: lastSample)
+    } else {
+        await analyzer.cancelAndFinishNow()
+    }
+
+    return try await transcriptionFuture
+}
+```
+
+**Key points**:
+- `analyzeSequence(from:)` reads file and feeds audio to analyzer
+- `finalizeAndFinish(through:)` ensures all results are finalized
+- Results are `AttributedString` with timing metadata
+
+## Pattern 2 - Live Transcription Setup
+
+For real-time transcription from microphone.
+
+### Step 1 - Configure SpeechTranscriber
+
+```swift
+import Speech
+
+class TranscriptionManager: ObservableObject {
+    private var transcriber: SpeechTranscriber?
+    private var analyzer: SpeechAnalyzer?
+    private var analyzerFormat: AudioFormatDescription?
+    private var inputBuilder: AsyncStream<AnalyzerInput>.Continuation?
+
+    @Published var finalizedTranscript = AttributedString()
+    @Published var volatileTranscript = AttributedString()
+
+    func setUp() async throws {
+        // Create transcriber with options
+        transcriber = SpeechTranscriber(
+            locale: Locale.current,
+            transcriptionOptions: [],
+            reportingOptions: [.volatileResults],  // Enable real-time updates
+            attributeOptions: [.audioTimeRange]     // Include timing
+        )
+
+        guard let transcriber else { throw TranscriptionError.setupFailed }
+
+        // Create analyzer with transcriber module
+        analyzer = SpeechAnalyzer(modules: [transcriber])
+
+        // Get required audio format
+        analyzerFormat = await SpeechAnalyzer.bestAvailableAudioFormat(
+            compatibleWith: [transcriber]
+        )
+
+        // Ensure model is available
+        try await ensureModel(for: transcriber)
+
+        // Create input stream
+        let (stream, builder) = AsyncStream<AnalyzerInput>.makeStream()
+        inputBuilder = builder
+
+        // Start analyzer
+        try await analyzer?.start(inputSequence: stream)
+    }
+}
+```
+
+### Step 2 - Ensure Model Availability
+
+```swift
+func ensureModel(for transcriber: SpeechTranscriber) async throws {
+    let locale = Locale.current
+
+    // Check if language is supported
+    let supported = await SpeechTranscriber.supportedLocales
+    guard supported.contains(where: {
+        $0.identifier(.bcp47) == locale.identifier(.bcp47)
+    }) else {
+        throw TranscriptionError.localeNotSupported
+    }
+
+    // Check if model is installed
+    let installed = await SpeechTranscriber.installedLocales
+    if installed.contains(where: {
+        $0.identifier(.bcp47) == locale.identifier(.bcp47)
+    }) {
+        return  // Already installed
+    }
+
+    // Download model
+    if let downloader = try await AssetInventory.assetInstallationRequest(
+        supporting: [transcriber]
+    ) {
+        // Track progress if needed
+        let progress = downloader.progress
+        try await downloader.downloadAndInstall()
+    }
+}
+```
+
+**Note**: Models are stored in system storage, not app storage. Only a limited number of languages can be allocated at once.
+
+### Step 3 - Handle Results
+
+```swift
+func startResultHandling() {
+    Task {
+        guard let transcriber else { return }
+
+        do {
+            for try await result in transcriber.results {
+                let text = result.text
+
+                if result.isFinal {
+                    // Finalized result - won't change
+                    finalizedTranscript += text
+                    volatileTranscript = AttributedString()
+
+                    // Access timing info
+                    for run in text.runs {
+                        if let timeRange = run.audioTimeRange {
+                            print("Time: \(timeRange)")
+                        }
+                    }
+                } else {
+                    // Volatile result - will be replaced
+                    volatileTranscript = text
+                }
+            }
+        } catch {
+            print("Transcription failed: \(error)")
+        }
+    }
+}
+```
+
+## Pattern 3 - Audio Recording and Streaming
+
+Connect AVAudioEngine to SpeechAnalyzer.
+
+```swift
+import AVFoundation
+
+class AudioRecorder {
+    private let audioEngine = AVAudioEngine()
+    private var outputContinuation: AsyncStream<AVAudioPCMBuffer>.Continuation?
+    private let transcriptionManager: TranscriptionManager
+
+    func startRecording() async throws {
+        // Request permission
+        guard await AVAudioApplication.requestRecordPermission() else {
+            throw RecordingError.permissionDenied
+        }
+
+        // Configure audio session (iOS)
+        #if os(iOS)
+        let session = AVAudioSession.sharedInstance()
+        try session.setCategory(.playAndRecord, mode: .spokenAudio)
+        try session.setActive(true, options: .notifyOthersOnDeactivation)
+        #endif
+
+        // Set up transcriber
+        try await transcriptionManager.setUp()
+        transcriptionManager.startResultHandling()
+
+        // Stream audio to transcriber
+        for await buffer in try audioStream() {
+            try await transcriptionManager.streamAudio(buffer)
+        }
+    }
+
+    private func audioStream() throws -> AsyncStream<AVAudioPCMBuffer> {
+        let inputNode = audioEngine.inputNode
+        let format = inputNode.outputFormat(forBus: 0)
+
+        inputNode.installTap(
+            onBus: 0,
+            bufferSize: 4096,
+            format: format
+        ) { [weak self] buffer, time in
+            self?.outputContinuation?.yield(buffer)
+        }
+
+        audioEngine.prepare()
+        try audioEngine.start()
+
+        return AsyncStream { continuation in
+            outputContinuation = continuation
+        }
+    }
+}
+```
+
+### Stream Audio with Format Conversion
+
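+```swift
+extension TranscriptionManager {
+    // Extensions cannot contain stored properties, so declare this in the
+    // TranscriptionManager class body instead:
+    //     private var converter: AVAudioConverter?
+
+    func streamAudio(_ buffer: AVAudioPCMBuffer) async throws {
+        guard let inputBuilder, let analyzerFormat else {
+            throw TranscriptionError.notSetUp
+        }
+
+        // Convert to analyzer's required format
+        let converted = try convertBuffer(buffer, to: analyzerFormat)
+
+        // Send to analyzer
+        let input = AnalyzerInput(buffer: converted)
+        inputBuilder.yield(input)
+    }
+
+    private func convertBuffer(
+        _ buffer: AVAudioPCMBuffer,
+        to format: AudioFormatDescription
+    ) throws -> AVAudioPCMBuffer {
+        // Lazy initialize converter
+        if converter == nil {
+            let sourceFormat = buffer.format
+            let destFormat = AVAudioFormat(cmAudioFormatDescription: format)!
+            converter = AVAudioConverter(from: sourceFormat, to: destFormat)
+        }
+        guard let converter else { throw TranscriptionError.conversionFailed }
+
+        // Size the output for the sample-rate ratio, not the input frame count
+        let ratio = converter.outputFormat.sampleRate / converter.inputFormat.sampleRate
+        let capacity = AVAudioFrameCount((Double(buffer.frameLength) * ratio).rounded(.up))
+        guard let outputBuffer = AVAudioPCMBuffer(
+            pcmFormat: converter.outputFormat,
+            frameCapacity: capacity
+        ) else { throw TranscriptionError.conversionFailed }
+
+        // The block-based convert handles sample-rate changes; convert(to:from:) does not
+        var conversionError: NSError?
+        var consumed = false
+        let status = converter.convert(to: outputBuffer, error: &conversionError) { _, inputStatus in
+            inputStatus.pointee = consumed ? .noDataNow : .haveData
+            defer { consumed = true }
+            return consumed ? nil : buffer
+        }
+        if status == .error { throw conversionError ?? TranscriptionError.conversionFailed }
+        return outputBuffer
+    }
+}
+```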
+
+## Pattern 4 - Stopping Transcription
+
+Properly finalize to get remaining volatile results as finalized.
+
+```swift
+func stopRecording() async {
+    // Stop audio
+    audioEngine.stop()
+    audioEngine.inputNode.removeTap(onBus: 0)
+    outputContinuation?.finish()
+
+    // Finalize transcription (converts remaining volatile to final)
+    try? await analyzer?.finalizeAndFinishThroughEndOfInput()
+
+    // Cancel any pending tasks
+    recognizerTask?.cancel()
+}
+```
+
+**Critical**: Always call `finalizeAndFinishThroughEndOfInput()` to ensure volatile results are finalized.
+
+## Pattern 5 - Model Asset Management
+
+### Check Supported Languages
+
+```swift
+// Languages the API supports
+let supported = await SpeechTranscriber.supportedLocales
+
+// Languages currently installed on device
+let installed = await SpeechTranscriber.installedLocales
+```
+
+### Deallocate Languages
+
+Only a limited number of languages can be allocated at once, so deallocate those you no longer need.
+
+```swift
+func deallocateLanguages() async {
+    let allocated = await AssetInventory.allocatedLocales
+    for locale in allocated {
+        await AssetInventory.deallocate(locale: locale)
+    }
+}
+```
+
+## Pattern 6 - Displaying Results with Timing
+
+Highlight text during audio playback using timing metadata.
+
+```swift
+import SwiftUI
+import CoreMedia
+import Speech
+
+struct TranscriptView: View {
+    let transcript: AttributedString
+    @Binding var playbackTime: CMTime
+
+    var body: some View {
+        Text(highlightedTranscript)
+    }
+
+    var highlightedTranscript: AttributedString {
+        var result = transcript
+
+        // Each run exposes its own range; `runs` does not yield (range, run) tuples
+        for run in transcript.runs {
+            guard let timeRange = run.audioTimeRange else { continue }
+
+            if timeRange.containsTime(playbackTime) {
+                result[run.range].backgroundColor = .yellow
+            }
+        }
+
+        return result
+    }
+}
+```
+
+## Anti-Patterns
+
+### Don't - Forget to finalize
+
+```swift
+// BAD - volatile results lost
+func stopRecording() {
+    audioEngine.stop()
+    // Missing finalize!
+}
+
+// GOOD - volatile results become finalized
+func stopRecording() async {
+    audioEngine.stop()
+    try? await analyzer?.finalizeAndFinishThroughEndOfInput()
+}
+```
+
+### Don't - Ignore format conversion
+
+```swift
+// BAD - format mismatch may fail silently
+inputBuilder.yield(AnalyzerInput(buffer: rawBuffer))
+
+// GOOD - convert to analyzer's format
+let format = await SpeechAnalyzer.bestAvailableAudioFormat(compatibleWith: [transcriber])
+let converted = try convertBuffer(rawBuffer, to: format)
+inputBuilder.yield(AnalyzerInput(buffer: converted))
+```
+
+### Don't - Skip model availability check
+
+```swift
+// BAD - may crash if model not installed
+let transcriber = SpeechTranscriber(locale: locale, ...)
+// Start using immediately
+
+// GOOD - ensure model is ready
+let transcriber = SpeechTranscriber(locale: locale, ...)
+try await ensureModel(for: transcriber)
+// Now safe to use
+```
+
+## Presets Reference
+
+| Preset | Use Case |
+|--------|----------|
+| `.offlineTranscription` | File transcription, no real-time feedback needed |
+| `.progressiveLiveTranscription` | Live transcription with volatile updates |
+
+## Options Reference
+
+### TranscriptionOptions
+- Default: None (standard transcription)
+
+### ReportingOptions
+- `.volatileResults`: Enable real-time approximate results
+
+### AttributeOptions
+- `.audioTimeRange`: Include CMTimeRange for each text segment
+
+## Platform Availability
+
+| Platform | SpeechTranscriber | DictationTranscriber |
+|----------|-------------------|---------------------|
+| iOS 26+ | Yes | Yes |
+| macOS 26 (Tahoe)+ | Yes | Yes |
+| watchOS 26+ | No | Yes |
+| tvOS 26+ | TBD | TBD |
+
+**Hardware requirements**: Varies by device. Use `supportedLocales` to check.
+
+## Integration with Apple Intelligence
+
+Combine with Foundation Models for summarization:
+
+```swift
+import FoundationModels
+
+func generateTitle(for transcript: String) async throws -> String {
+    let session = LanguageModelSession()
+    let prompt = "Generate a short, clever title for this story: \(transcript)"
+    let response = try await session.respond(to: prompt)
+    return response.content
+}
+```
+
+See `axiom-ios-ai` skill for Foundation Models details.
+
+## Checklist
+
+Before shipping speech-to-text:
+
+- [ ] Check locale support with `supportedLocales`
+- [ ] Ensure model with `AssetInventory.assetInstallationRequest`
+- [ ] Handle download progress for user feedback
+- [ ] Convert audio to `bestAvailableAudioFormat`
+- [ ] Enable `.volatileResults` for live transcription
+- [ ] Call `finalizeAndFinishThroughEndOfInput()` on stop
+- [ ] Handle timing with `.audioTimeRange` if needed
+- [ ] Clear volatile results when finalized result arrives
+- [ ] Request microphone permission before recording
+
+## Resources
+
+**WWDC**: 2025-277
+
+**Docs**: /speech, /speech/speechanalyzer, /speech/speechtranscriber
+
+**Skills**: coreml (on-device ML), axiom-ios-ai (Foundation Models)