This snapshot establishes the camera-to-result recognition flow and related tests while checking in the project skill/docs assets required for the configured local tooling.
1281 lines
36 KiB
Markdown
1281 lines
36 KiB
Markdown
---
|
|
name: axiom-vision-ref
|
|
description: Use when needing Vision framework API details for hand/body pose, segmentation, text recognition, barcode detection, document scanning, or Visual Intelligence integration. Covers VNRequest types, coordinate conversion, DataScannerViewController, RecognizeDocumentsRequest, SemanticContentDescriptor, IntentValueQuery.
|
|
license: MIT
|
|
compatibility: iOS 11+, iPadOS 11+, macOS 10.13+, tvOS 11+, visionOS 1+
|
|
metadata:
|
|
version: "1.1.0"
|
|
last-updated: "2026-01-03"
|
|
---
|
|
|
|
# Vision Framework API Reference
|
|
|
|
Comprehensive reference for Vision framework computer vision: subject segmentation, hand/body pose detection, person detection, face analysis, text recognition (OCR), barcode detection, and document scanning.
|
|
|
|
## When to Use This Reference
|
|
|
|
- **Implementing subject lifting** using VisionKit or Vision
|
|
- **Detecting hand/body poses** for gesture recognition or fitness apps
|
|
- **Segmenting people** from backgrounds or separating multiple individuals
|
|
- **Face detection and landmarks** for AR effects or authentication
|
|
- **Combining Vision APIs** to solve complex computer vision problems
|
|
- **Looking up specific API signatures** and parameter meanings
|
|
- **Recognizing text** in images (OCR) with VNRecognizeTextRequest
|
|
- **Detecting barcodes** and QR codes with VNDetectBarcodesRequest
|
|
- **Building live scanners** with DataScannerViewController
|
|
- **Scanning documents** with VNDocumentCameraViewController
|
|
- **Extracting structured document data** with RecognizeDocumentsRequest (iOS 26+)
|
|
|
|
**Related skills**: See `axiom-vision` for decision trees and patterns, `axiom-vision-diag` for troubleshooting
|
|
|
|
## Vision Framework Overview
|
|
|
|
Vision provides computer vision algorithms for still images and video:
|
|
|
|
**Core workflow**:
|
|
1. Create request (e.g., `VNDetectHumanHandPoseRequest()`)
|
|
2. Create handler with image (`VNImageRequestHandler(cgImage: image)`)
|
|
3. Perform request (`try handler.perform([request])`)
|
|
4. Access observations from `request.results`
|
|
|
|
**Coordinate system**: Lower-left origin, normalized (0.0-1.0) coordinates
|
|
|
|
**Performance**: Run on background queue - resource intensive, blocks UI if on main thread
|
|
|
|
## Request Handlers
|
|
|
|
Vision provides two request handlers for different scenarios.
|
|
|
|
### VNImageRequestHandler
|
|
|
|
Analyzes a **single image**. Initialize with the image, perform requests against it, discard.
|
|
|
|
```swift
|
|
let handler = VNImageRequestHandler(cgImage: image)
|
|
try handler.perform([request1, request2]) // Multiple requests, one image
|
|
```
|
|
|
|
**Initialize with**: `CGImage`, `CIImage`, `CVPixelBuffer`, `Data`, or `URL`
|
|
|
|
**Rule**: One handler per image. Reusing a handler with a different image is unsupported.
|
|
|
|
### VNSequenceRequestHandler
|
|
|
|
Analyzes a **sequence of frames** (video, camera feed). Initialize empty, pass each frame to `perform()`. Maintains inter-frame state for temporal smoothing.
|
|
|
|
```swift
|
|
let sequenceHandler = VNSequenceRequestHandler()
|
|
|
|
// In your camera/video frame callback:
|
|
func processFrame(_ pixelBuffer: CVPixelBuffer) throws {
|
|
try sequenceHandler.perform([request], on: pixelBuffer)
|
|
}
|
|
```
|
|
|
|
**Rule**: Create once, reuse across frames. The handler tracks state between calls.
|
|
|
|
### When to Use Which
|
|
|
|
| Use Case | Handler |
|
|
|----------|---------|
|
|
| Single photo or screenshot | `VNImageRequestHandler` |
|
|
| Video stream or camera frames | `VNSequenceRequestHandler` |
|
|
| Temporal smoothing (pose, segmentation) | `VNSequenceRequestHandler` |
|
|
| One-off analysis of a CVPixelBuffer | `VNImageRequestHandler` |
|
|
|
|
### Requests That Benefit from Sequence Handling
|
|
|
|
These requests use inter-frame state when run through `VNSequenceRequestHandler`:
|
|
- `VNDetectHumanBodyPoseRequest` — Smoother joint tracking
|
|
- `VNDetectHumanHandPoseRequest` — Smoother landmark tracking
|
|
- `VNGeneratePersonSegmentationRequest` — Temporally consistent masks
|
|
- `VNGeneratePersonInstanceMaskRequest` — Stable person identity across frames
|
|
- `VNDetectDocumentSegmentationRequest` — Stable document edges
|
|
- Any `VNStatefulRequest` subclass — Designed for sequences
|
|
|
|
### Common Mistake
|
|
|
|
Creating a new `VNImageRequestHandler` per video frame discards temporal context. Pose landmarks jitter, segmentation masks flicker, and you lose the smoothing that sequence handling provides.
|
|
|
|
```swift
|
|
// Wrong — loses temporal context every frame
|
|
func processFrame(_ buffer: CVPixelBuffer) throws {
|
|
let handler = VNImageRequestHandler(cvPixelBuffer: buffer)
|
|
try handler.perform([poseRequest])
|
|
}
|
|
|
|
// Right — maintains inter-frame state
|
|
let sequenceHandler = VNSequenceRequestHandler()
|
|
func processFrame(_ buffer: CVPixelBuffer) throws {
|
|
try sequenceHandler.perform([poseRequest], on: buffer)
|
|
}
|
|
```
|
|
|
|
## Subject Segmentation APIs
|
|
|
|
### VNGenerateForegroundInstanceMaskRequest
|
|
|
|
**Availability**: iOS 17+, macOS 14+, tvOS 17+, visionOS 1+
|
|
|
|
Generates class-agnostic instance mask of foreground objects (people, pets, buildings, food, shoes, etc.)
|
|
|
|
#### Basic Usage
|
|
|
|
```swift
|
|
let request = VNGenerateForegroundInstanceMaskRequest()
|
|
let handler = VNImageRequestHandler(cgImage: image)
|
|
|
|
try handler.perform([request])
|
|
|
|
guard let observation = request.results?.first as? VNInstanceMaskObservation else {
|
|
return
|
|
}
|
|
```
|
|
|
|
#### InstanceMaskObservation
|
|
|
|
**allInstances**: `IndexSet` containing all foreground instance indices (excludes background 0)
|
|
|
|
**instanceMask**: `CVPixelBuffer` with UInt8 labels (0 = background, 1+ = instance indices)
|
|
|
|
**instanceAtPoint(_:)**: Returns instance index at normalized point
|
|
|
|
```swift
|
|
let point = CGPoint(x: 0.5, y: 0.5) // Center of image
|
|
let instance = observation.instanceAtPoint(point)
|
|
|
|
if instance == 0 {
|
|
print("Background tapped")
|
|
} else {
|
|
print("Instance \(instance) tapped")
|
|
}
|
|
```
|
|
|
|
#### Generating Masks
|
|
|
|
**createScaledMask(for:croppedToInstancesContent:)**
|
|
|
|
Parameters:
|
|
- `for`: `IndexSet` of instances to include
|
|
- `croppedToInstancesContent`:
|
|
- `false` = Output matches input resolution (for compositing)
|
|
- `true` = Tight crop around selected instances
|
|
|
|
Returns: Single-channel floating-point `CVPixelBuffer` (soft segmentation mask)
|
|
|
|
```swift
|
|
// All instances, full resolution
|
|
let mask = try observation.createScaledMask(
|
|
for: observation.allInstances,
|
|
croppedToInstancesContent: false
|
|
)
|
|
|
|
// Single instance, cropped
|
|
let instances = IndexSet(integer: 1)
|
|
let croppedMask = try observation.createScaledMask(
|
|
for: instances,
|
|
croppedToInstancesContent: true
|
|
)
|
|
```
|
|
|
|
#### Instance Mask Hit Testing
|
|
|
|
Access raw pixel buffer to map tap coordinates to instance labels:
|
|
|
|
```swift
|
|
let instanceMask = observation.instanceMask
|
|
|
|
CVPixelBufferLockBaseAddress(instanceMask, .readOnly)
|
|
defer { CVPixelBufferUnlockBaseAddress(instanceMask, .readOnly) }
|
|
|
|
let baseAddress = CVPixelBufferGetBaseAddress(instanceMask)
|
|
let width = CVPixelBufferGetWidth(instanceMask)
|
|
let bytesPerRow = CVPixelBufferGetBytesPerRow(instanceMask)
|
|
|
|
// Convert normalized tap to pixel coordinates
|
|
let pixelPoint = VNImagePointForNormalizedPoint(
|
|
CGPoint(x: normalizedX, y: normalizedY),
|
|
width: imageWidth,
|
|
height: imageHeight
|
|
)
|
|
|
|
// Calculate byte offset
|
|
let offset = Int(pixelPoint.y) * bytesPerRow + Int(pixelPoint.x)
|
|
|
|
// Read instance label
|
|
let label = UnsafeRawPointer(baseAddress!).load(
|
|
fromByteOffset: offset,
|
|
as: UInt8.self
|
|
)
|
|
|
|
let instances = label == 0 ? observation.allInstances : IndexSet(integer: Int(label))
|
|
```
|
|
|
|
## VisionKit Subject Lifting
|
|
|
|
### ImageAnalysisInteraction (iOS)
|
|
|
|
**Availability**: iOS 16+, iPadOS 16+
|
|
|
|
Adds system-like subject lifting UI to views:
|
|
|
|
```swift
|
|
let interaction = ImageAnalysisInteraction()
|
|
interaction.preferredInteractionTypes = .imageSubject // Or .automatic
|
|
imageView.addInteraction(interaction)
|
|
```
|
|
|
|
**Interaction types**:
|
|
- `.automatic`: Subject lifting + Live Text + data detectors
|
|
- `.imageSubject`: Subject lifting only (no interactive text)
|
|
|
|
### ImageAnalysisOverlayView (macOS)
|
|
|
|
**Availability**: macOS 13+
|
|
|
|
```swift
|
|
let overlayView = ImageAnalysisOverlayView()
|
|
overlayView.preferredInteractionTypes = .imageSubject
|
|
nsView.addSubview(overlayView)
|
|
```
|
|
|
|
### Programmatic Access
|
|
|
|
#### ImageAnalyzer
|
|
|
|
```swift
|
|
let analyzer = ImageAnalyzer()
|
|
let configuration = ImageAnalyzer.Configuration([.text, .visualLookUp])
|
|
|
|
let analysis = try await analyzer.analyze(image, configuration: configuration)
|
|
```
|
|
|
|
#### ImageAnalysis
|
|
|
|
**subjects**: `[Subject]` - All subjects in image
|
|
|
|
**highlightedSubjects**: `Set<Subject>` - Currently highlighted (user long-pressed)
|
|
|
|
**subject(at:)**: Async lookup of subject at normalized point (returns `nil` if none)
|
|
|
|
```swift
|
|
// Get all subjects
|
|
let subjects = analysis.subjects
|
|
|
|
// Look up subject at tap
|
|
if let subject = try await analysis.subject(at: tapPoint) {
|
|
// Process subject
|
|
}
|
|
|
|
// Change highlight state
|
|
analysis.highlightedSubjects = Set([subjects[0], subjects[1]])
|
|
```
|
|
|
|
#### Subject Struct
|
|
|
|
**image**: `UIImage`/`NSImage` - Extracted subject with transparency
|
|
|
|
**bounds**: `CGRect` - Subject boundaries in image coordinates
|
|
|
|
```swift
|
|
// Single subject image
|
|
let subjectImage = subject.image
|
|
|
|
// Composite multiple subjects
|
|
let compositeImage = try await analysis.image(for: [subject1, subject2])
|
|
```
|
|
|
|
**Out-of-process**: VisionKit analysis happens out-of-process (performance benefit, image size limited)
|
|
|
|
## Person Segmentation APIs
|
|
|
|
### VNGeneratePersonSegmentationRequest
|
|
|
|
**Availability**: iOS 15+, macOS 12+
|
|
|
|
Returns single mask containing **all people** in image:
|
|
|
|
```swift
|
|
let request = VNGeneratePersonSegmentationRequest()
|
|
// Configure quality level if needed
|
|
try handler.perform([request])
|
|
|
|
guard let observation = request.results?.first as? VNPixelBufferObservation else {
|
|
return
|
|
}
|
|
|
|
let personMask = observation.pixelBuffer // CVPixelBuffer
|
|
```
|
|
|
|
### VNGeneratePersonInstanceMaskRequest
|
|
|
|
**Availability**: iOS 17+, macOS 14+
|
|
|
|
Returns **separate masks for up to 4 people**:
|
|
|
|
```swift
|
|
let request = VNGeneratePersonInstanceMaskRequest()
|
|
try handler.perform([request])
|
|
|
|
guard let observation = request.results?.first as? VNInstanceMaskObservation else {
|
|
return
|
|
}
|
|
|
|
// Same InstanceMaskObservation API as foreground instance masks
|
|
let allPeople = observation.allInstances // Up to 4 people (1-4)
|
|
|
|
// Get mask for person 1
|
|
let person1Mask = try observation.createScaledMask(
|
|
for: IndexSet(integer: 1),
|
|
croppedToInstancesContent: false
|
|
)
|
|
```
|
|
|
|
**Limitations**:
|
|
- Segments up to 4 people
|
|
- With >4 people: may miss people or combine them (typically background people)
|
|
- Use `VNDetectFaceRectanglesRequest` to count faces if you need to handle crowded scenes
|
|
|
|
## Hand Pose Detection
|
|
|
|
### VNDetectHumanHandPoseRequest
|
|
|
|
**Availability**: iOS 14+, macOS 11+
|
|
|
|
Detects **21 hand landmarks** per hand:
|
|
|
|
```swift
|
|
let request = VNDetectHumanHandPoseRequest()
|
|
request.maximumHandCount = 2 // Default: 2, increase if needed
|
|
|
|
let handler = VNImageRequestHandler(cgImage: image)
|
|
try handler.perform([request])
|
|
|
|
for observation in request.results as? [VNHumanHandPoseObservation] ?? [] {
|
|
// Process each hand
|
|
}
|
|
```
|
|
|
|
**Performance note**: `maximumHandCount` affects latency. Pose computed only for hands ≤ maximum. Set to lowest acceptable value.
|
|
|
|
### Hand Landmarks (21 points)
|
|
|
|
**Wrist**: 1 landmark
|
|
|
|
**Thumb** (4 landmarks):
|
|
- `.thumbTip`
|
|
- `.thumbIP` (interphalangeal joint)
|
|
- `.thumbMP` (metacarpophalangeal joint)
|
|
- `.thumbCMC` (carpometacarpal joint)
|
|
|
|
**Fingers** (4 landmarks each):
|
|
- Tip (`.indexTip`, `.middleTip`, `.ringTip`, `.littleTip`)
|
|
- DIP (distal interphalangeal joint)
|
|
- PIP (proximal interphalangeal joint)
|
|
- MCP (metacarpophalangeal joint)
|
|
|
|
### Group Keys
|
|
|
|
Access landmark groups:
|
|
|
|
| Group Key | Points |
|
|
|-----------|--------|
|
|
| `.all` | All 21 landmarks |
|
|
| `.thumb` | 4 thumb joints |
|
|
| `.indexFinger` | 4 index finger joints |
|
|
| `.middleFinger` | 4 middle finger joints |
|
|
| `.ringFinger` | 4 ring finger joints |
|
|
| `.littleFinger` | 4 little finger joints |
|
|
|
|
```swift
|
|
// Get all points
|
|
let allPoints = try observation.recognizedPoints(.all)
|
|
|
|
// Get index finger points only
|
|
let indexPoints = try observation.recognizedPoints(.indexFinger)
|
|
|
|
// Get specific point
|
|
let thumbTip = try observation.recognizedPoint(.thumbTip)
|
|
let indexTip = try observation.recognizedPoint(.indexTip)
|
|
|
|
// Check confidence
|
|
guard thumbTip.confidence > 0.5 else { return }
|
|
|
|
// Access location (normalized coordinates, lower-left origin)
|
|
let location = thumbTip.location // CGPoint
|
|
```
|
|
|
|
### Gesture Recognition Example (Pinch)
|
|
|
|
```swift
|
|
let thumbTip = try observation.recognizedPoint(.thumbTip)
|
|
let indexTip = try observation.recognizedPoint(.indexTip)
|
|
|
|
guard thumbTip.confidence > 0.5, indexTip.confidence > 0.5 else {
|
|
return
|
|
}
|
|
|
|
let distance = hypot(
|
|
thumbTip.location.x - indexTip.location.x,
|
|
thumbTip.location.y - indexTip.location.y
|
|
)
|
|
|
|
let isPinching = distance < 0.05 // Normalized threshold
|
|
```
|
|
|
|
### Chirality (Handedness)
|
|
|
|
```swift
|
|
let chirality = observation.chirality // .left or .right or .unknown
|
|
```
|
|
|
|
## Body Pose Detection
|
|
|
|
### VNDetectHumanBodyPoseRequest (2D)
|
|
|
|
**Availability**: iOS 14+, macOS 11+
|
|
|
|
Detects **18 body landmarks** (2D normalized coordinates):
|
|
|
|
```swift
|
|
let request = VNDetectHumanBodyPoseRequest()
|
|
try handler.perform([request])
|
|
|
|
for observation in request.results as? [VNHumanBodyPoseObservation] ?? [] {
|
|
// Process each person
|
|
}
|
|
```
|
|
|
|
### Body Landmarks (18 points)
|
|
|
|
**Face** (5 landmarks):
|
|
- `.nose`, `.leftEye`, `.rightEye`, `.leftEar`, `.rightEar`
|
|
|
|
**Arms** (6 landmarks):
|
|
- Left: `.leftShoulder`, `.leftElbow`, `.leftWrist`
|
|
- Right: `.rightShoulder`, `.rightElbow`, `.rightWrist`
|
|
|
|
**Torso** (7 landmarks):
|
|
- `.neck` (between shoulders)
|
|
- `.leftShoulder`, `.rightShoulder` (also in arm groups)
|
|
- `.leftHip`, `.rightHip`
|
|
- `.root` (between hips)
|
|
|
|
**Legs** (6 landmarks):
|
|
- Left: `.leftHip`, `.leftKnee`, `.leftAnkle`
|
|
- Right: `.rightHip`, `.rightKnee`, `.rightAnkle`
|
|
|
|
**Note**: Shoulders and hips appear in multiple groups
|
|
|
|
### Group Keys (Body)
|
|
|
|
| Group Key | Points |
|
|
|-----------|--------|
|
|
| `.all` | All 18 landmarks |
|
|
| `.face` | 5 face landmarks |
|
|
| `.leftArm` | shoulder, elbow, wrist |
|
|
| `.rightArm` | shoulder, elbow, wrist |
|
|
| `.torso` | neck, shoulders, hips, root |
|
|
| `.leftLeg` | hip, knee, ankle |
|
|
| `.rightLeg` | hip, knee, ankle |
|
|
|
|
```swift
|
|
// Get all body points
|
|
let allPoints = try observation.recognizedPoints(.all)
|
|
|
|
// Get left arm only
|
|
let leftArmPoints = try observation.recognizedPoints(.leftArm)
|
|
|
|
// Get specific joint
|
|
let leftWrist = try observation.recognizedPoint(.leftWrist)
|
|
```
|
|
|
|
### VNDetectHumanBodyPose3DRequest (3D)
|
|
|
|
**Availability**: iOS 17+, macOS 14+
|
|
|
|
Returns **3D skeleton with 17 joints** in meters (real-world coordinates):
|
|
|
|
```swift
|
|
let request = VNDetectHumanBodyPose3DRequest()
|
|
try handler.perform([request])
|
|
|
|
guard let observation = request.results?.first as? VNHumanBodyPose3DObservation else {
|
|
return
|
|
}
|
|
|
|
// Get 3D joint position
|
|
let leftWrist = try observation.recognizedPoint(.leftWrist)
|
|
let position = leftWrist.position // simd_float4x4 matrix
|
|
let localPosition = leftWrist.localPosition // Relative to parent joint
|
|
```
|
|
|
|
**3D Body Landmarks** (17 points): Same as 2D except no ears (15 vs 18 2D landmarks)
|
|
|
|
#### 3D Observation Properties
|
|
|
|
**bodyHeight**: Estimated height in meters
|
|
- With depth data: Measured height
|
|
- Without depth data: Reference height (1.8m)
|
|
|
|
**heightEstimation**: `.measured` or `.reference`
|
|
|
|
**cameraOriginMatrix**: `simd_float4x4` camera position/orientation relative to subject
|
|
|
|
**pointInImage(\_:)**: Project 3D joint back to 2D image coordinates
|
|
|
|
```swift
|
|
let wrist2D = try observation.pointInImage(leftWrist)
|
|
```
|
|
|
|
#### 3D Point Classes
|
|
|
|
**VNPoint3D**: Base class with `simd_float4x4` position matrix
|
|
|
|
**VNRecognizedPoint3D**: Adds identifier (joint name)
|
|
|
|
**VNHumanBodyRecognizedPoint3D**: Adds `localPosition` and `parentJoint`
|
|
|
|
```swift
|
|
// Position relative to skeleton root (center of hip)
|
|
let modelPosition = leftWrist.position
|
|
|
|
// Position relative to parent joint (left elbow)
|
|
let relativePosition = leftWrist.localPosition
|
|
```
|
|
|
|
#### Depth Input
|
|
|
|
Vision accepts depth data alongside images:
|
|
|
|
```swift
|
|
// From AVDepthData
|
|
let handler = VNImageRequestHandler(
|
|
cvPixelBuffer: imageBuffer,
|
|
depthData: depthData,
|
|
orientation: orientation
|
|
)
|
|
|
|
// From file (automatic depth extraction)
|
|
let handler = VNImageRequestHandler(url: imageURL) // Depth auto-fetched
|
|
```
|
|
|
|
**Depth formats**: Disparity or Depth (interchangeable via AVFoundation)
|
|
|
|
**LiDAR**: Use in live capture sessions for accurate scale/measurement
|
|
|
|
## Face Detection & Landmarks
|
|
|
|
### VNDetectFaceRectanglesRequest
|
|
|
|
**Availability**: iOS 11+
|
|
|
|
Detects face bounding boxes:
|
|
|
|
```swift
|
|
let request = VNDetectFaceRectanglesRequest()
|
|
try handler.perform([request])
|
|
|
|
for observation in request.results as? [VNFaceObservation] ?? [] {
|
|
let faceBounds = observation.boundingBox // Normalized rect
|
|
}
|
|
```
|
|
|
|
### VNDetectFaceLandmarksRequest
|
|
|
|
**Availability**: iOS 11+
|
|
|
|
Detects face with detailed landmarks:
|
|
|
|
```swift
|
|
let request = VNDetectFaceLandmarksRequest()
|
|
try handler.perform([request])
|
|
|
|
for observation in request.results as? [VNFaceObservation] ?? [] {
|
|
if let landmarks = observation.landmarks {
|
|
let leftEye = landmarks.leftEye
|
|
let nose = landmarks.nose
|
|
let leftPupil = landmarks.leftPupil // Revision 2+
|
|
}
|
|
}
|
|
```
|
|
|
|
**Revisions**:
|
|
- Revision 1: Basic landmarks
|
|
- Revision 2: Detects upside-down faces
|
|
- Revision 3+: Pupil locations
|
|
|
|
## Person Detection
|
|
|
|
### VNDetectHumanRectanglesRequest
|
|
|
|
**Availability**: iOS 13+
|
|
|
|
Detects human bounding boxes (torso detection):
|
|
|
|
```swift
|
|
let request = VNDetectHumanRectanglesRequest()
|
|
try handler.perform([request])
|
|
|
|
for observation in request.results as? [VNHumanObservation] ?? [] {
|
|
let humanBounds = observation.boundingBox // Normalized rect
|
|
}
|
|
```
|
|
|
|
**Use case**: Faster than pose detection when you only need location
|
|
|
|
## CoreImage Integration
|
|
|
|
### CIBlendWithMask Filter
|
|
|
|
Composite subject on new background using Vision mask:
|
|
|
|
```swift
|
|
// 1. Get mask from Vision
|
|
let observation = request.results?.first as? VNInstanceMaskObservation
|
|
let visionMask = try observation.createScaledMask(
|
|
for: observation.allInstances,
|
|
croppedToInstancesContent: false
|
|
)
|
|
|
|
// 2. Convert to CIImage
|
|
let maskImage = CIImage(cvPixelBuffer: visionMask)
|
|
|
|
// 3. Apply filter
|
|
let filter = CIFilter(name: "CIBlendWithMask")!
|
|
filter.setValue(sourceImage, forKey: kCIInputImageKey)
|
|
filter.setValue(maskImage, forKey: kCIInputMaskImageKey)
|
|
filter.setValue(newBackground, forKey: kCIInputBackgroundImageKey)
|
|
|
|
let output = filter.outputImage // Composited result
|
|
```
|
|
|
|
**Parameters**:
|
|
- **Input image**: Original image to mask
|
|
- **Mask image**: Vision's soft segmentation mask
|
|
- **Background image**: New background (or empty image for transparency)
|
|
|
|
**HDR preservation**: CoreImage preserves high dynamic range from input (Vision/VisionKit output is SDR)
|
|
|
|
## Text Recognition APIs
|
|
|
|
### VNRecognizeTextRequest
|
|
|
|
**Availability**: iOS 13+, macOS 10.15+
|
|
|
|
Recognizes text in images with configurable accuracy/speed trade-off.
|
|
|
|
#### Basic Usage
|
|
|
|
```swift
|
|
let request = VNRecognizeTextRequest()
|
|
request.recognitionLevel = .accurate // Or .fast
|
|
request.recognitionLanguages = ["en-US", "de-DE"] // Order matters
|
|
request.usesLanguageCorrection = true
|
|
|
|
let handler = VNImageRequestHandler(cgImage: image)
|
|
try handler.perform([request])
|
|
|
|
for observation in request.results as? [VNRecognizedTextObservation] ?? [] {
|
|
// Get top candidates
|
|
let candidates = observation.topCandidates(3)
|
|
let bestText = candidates.first?.string ?? ""
|
|
}
|
|
```
|
|
|
|
#### Recognition Levels
|
|
|
|
| Level | Performance | Accuracy | Best For |
|
|
|-------|-------------|----------|----------|
|
|
| `.fast` | Real-time | Good | Camera feed, large text, signs |
|
|
| `.accurate` | Slower | Excellent | Documents, receipts, handwriting |
|
|
|
|
**Fast path**: Character-by-character recognition (Neural Network → Character Detection)
|
|
|
|
**Accurate path**: Full-line ML recognition (Neural Network → Line/Word Recognition)
|
|
|
|
#### Properties
|
|
|
|
| Property | Type | Description |
|
|
|----------|------|-------------|
|
|
| `recognitionLevel` | `VNRequestTextRecognitionLevel` | `.fast` or `.accurate` |
|
|
| `recognitionLanguages` | `[String]` | BCP 47 language codes, order = priority |
|
|
| `usesLanguageCorrection` | `Bool` | Use language model for correction |
|
|
| `customWords` | `[String]` | Domain-specific vocabulary |
|
|
| `automaticallyDetectsLanguage` | `Bool` | Auto-detect language (iOS 16+) |
|
|
| `minimumTextHeight` | `Float` | Min text height as fraction of image (0-1) |
|
|
| `revision` | `Int` | API version (affects supported languages) |
|
|
|
|
#### Language Support
|
|
|
|
```swift
|
|
// Check supported languages for current settings
|
|
let languages = try VNRecognizeTextRequest.supportedRecognitionLanguages(
|
|
for: .accurate,
|
|
revision: VNRecognizeTextRequestRevision3
|
|
)
|
|
```
|
|
|
|
**Language correction**: Improves accuracy but takes processing time. Disable for codes/serial numbers.
|
|
|
|
**Custom words**: Add domain-specific vocabulary for better recognition (medical terms, product codes).
|
|
|
|
#### VNRecognizedTextObservation
|
|
|
|
**boundingBox**: Normalized rect containing recognized text
|
|
|
|
**topCandidates(_:)**: Returns `[VNRecognizedText]` ordered by confidence
|
|
|
|
#### VNRecognizedText
|
|
|
|
| Property | Type | Description |
|
|
|----------|------|-------------|
|
|
| `string` | `String` | Recognized text |
|
|
| `confidence` | `VNConfidence` | 0.0-1.0 |
|
|
| `boundingBox(for:)` | `VNRectangleObservation?` | Box for substring range |
|
|
|
|
```swift
|
|
// Get bounding box for substring
|
|
let text = candidate.string
|
|
if let range = text.range(of: "invoice") {
|
|
let box = try candidate.boundingBox(for: range)
|
|
}
|
|
```
|
|
|
|
## Barcode Detection APIs
|
|
|
|
### VNDetectBarcodesRequest
|
|
|
|
**Availability**: iOS 11+, macOS 10.13+
|
|
|
|
Detects and decodes barcodes and QR codes.
|
|
|
|
#### Basic Usage
|
|
|
|
```swift
|
|
let request = VNDetectBarcodesRequest()
|
|
request.symbologies = [.qr, .ean13, .code128] // Specific codes
|
|
|
|
let handler = VNImageRequestHandler(cgImage: image)
|
|
try handler.perform([request])
|
|
|
|
for barcode in request.results as? [VNBarcodeObservation] ?? [] {
|
|
let payload = barcode.payloadStringValue
|
|
let type = barcode.symbology
|
|
let bounds = barcode.boundingBox
|
|
}
|
|
```
|
|
|
|
#### Symbologies
|
|
|
|
**1D Barcodes**:
|
|
- `.codabar` (iOS 15+)
|
|
- `.code39`, `.code39Checksum`, `.code39FullASCII`, `.code39FullASCIIChecksum`
|
|
- `.code93`, `.code93i`
|
|
- `.code128`
|
|
- `.ean8`, `.ean13`
|
|
- `.gs1DataBar`, `.gs1DataBarExpanded`, `.gs1DataBarLimited` (iOS 15+)
|
|
- `.i2of5`, `.i2of5Checksum`
|
|
- `.itf14`
|
|
- `.upce`
|
|
|
|
**2D Codes**:
|
|
- `.aztec`
|
|
- `.dataMatrix`
|
|
- `.microPDF417` (iOS 15+)
|
|
- `.microQR` (iOS 15+)
|
|
- `.pdf417`
|
|
- `.qr`
|
|
|
|
**Performance**: Specifying fewer symbologies = faster detection
|
|
|
|
#### Revisions
|
|
|
|
| Revision | iOS | Features |
|
|
|----------|-----|----------|
|
|
| 1 | 11+ | Basic detection, one code at a time |
|
|
| 2 | 15+ | Codabar, GS1, MicroPDF, MicroQR, better ROI |
|
|
| 3 | 16+ | ML-based, multiple codes, better bounding boxes |
|
|
|
|
#### VNBarcodeObservation
|
|
|
|
| Property | Type | Description |
|
|
|----------|------|-------------|
|
|
| `payloadStringValue` | `String?` | Decoded content |
|
|
| `symbology` | `VNBarcodeSymbology` | Barcode type |
|
|
| `boundingBox` | `CGRect` | Normalized bounds |
|
|
| `topLeft/topRight/bottomLeft/bottomRight` | `CGPoint` | Corner points |
|
|
|
|
## VisionKit Scanner APIs
|
|
|
|
### DataScannerViewController
|
|
|
|
**Availability**: iOS 16+
|
|
|
|
Camera-based live scanner with built-in UI for text and barcodes.
|
|
|
|
#### Check Availability
|
|
|
|
```swift
|
|
// Hardware support
|
|
DataScannerViewController.isSupported
|
|
|
|
// Runtime availability (camera access, parental controls)
|
|
DataScannerViewController.isAvailable
|
|
```
|
|
|
|
#### Configuration
|
|
|
|
```swift
|
|
import VisionKit
|
|
|
|
let dataTypes: Set<DataScannerViewController.RecognizedDataType> = [
|
|
.barcode(symbologies: [.qr, .ean13]),
|
|
.text(textContentType: .URL), // Or nil for all text
|
|
// .text(languages: ["ja"]) // Filter by language
|
|
]
|
|
|
|
let scanner = DataScannerViewController(
|
|
recognizedDataTypes: dataTypes,
|
|
qualityLevel: .balanced, // .fast, .balanced, .accurate
|
|
recognizesMultipleItems: true,
|
|
isHighFrameRateTrackingEnabled: true,
|
|
isPinchToZoomEnabled: true,
|
|
isGuidanceEnabled: true,
|
|
isHighlightingEnabled: true
|
|
)
|
|
|
|
scanner.delegate = self
|
|
present(scanner, animated: true) {
|
|
try? scanner.startScanning()
|
|
}
|
|
```
|
|
|
|
#### RecognizedDataType
|
|
|
|
| Type | Description |
|
|
|------|-------------|
|
|
| `.barcode(symbologies:)` | Specific barcode types |
|
|
| `.text()` | All text |
|
|
| `.text(languages:)` | Text filtered by language |
|
|
| `.text(textContentType:)` | Text filtered by type (URL, phone, email) |
|
|
|
|
#### Delegate Protocol
|
|
|
|
```swift
|
|
protocol DataScannerViewControllerDelegate {
|
|
func dataScanner(_ dataScanner: DataScannerViewController,
|
|
didTapOn item: RecognizedItem)
|
|
|
|
func dataScanner(_ dataScanner: DataScannerViewController,
|
|
didAdd addedItems: [RecognizedItem],
|
|
allItems: [RecognizedItem])
|
|
|
|
func dataScanner(_ dataScanner: DataScannerViewController,
|
|
didUpdate updatedItems: [RecognizedItem],
|
|
allItems: [RecognizedItem])
|
|
|
|
func dataScanner(_ dataScanner: DataScannerViewController,
|
|
didRemove removedItems: [RecognizedItem],
|
|
allItems: [RecognizedItem])
|
|
|
|
func dataScanner(_ dataScanner: DataScannerViewController,
|
|
becameUnavailableWithError error: DataScannerViewController.ScanningUnavailable)
|
|
}
|
|
```
|
|
|
|
#### RecognizedItem
|
|
|
|
```swift
|
|
enum RecognizedItem {
|
|
case text(RecognizedItem.Text)
|
|
case barcode(RecognizedItem.Barcode)
|
|
|
|
var id: UUID { get }
|
|
var bounds: RecognizedItem.Bounds { get }
|
|
}
|
|
|
|
// Text item
|
|
struct Text {
|
|
let transcript: String
|
|
}
|
|
|
|
// Barcode item
|
|
struct Barcode {
|
|
let payloadStringValue: String?
|
|
let observation: VNBarcodeObservation
|
|
}
|
|
```
|
|
|
|
#### Async Stream
|
|
|
|
```swift
|
|
// Alternative to delegate
|
|
for await items in scanner.recognizedItems {
|
|
// Current recognized items
|
|
}
|
|
```
|
|
|
|
#### Custom Highlights
|
|
|
|
```swift
|
|
// Add custom views over recognized items
|
|
scanner.overlayContainerView.addSubview(customHighlight)
|
|
|
|
// Capture still photo
|
|
let photo = try await scanner.capturePhoto()
|
|
```
|
|
|
|
### VNDocumentCameraViewController
|
|
|
|
**Availability**: iOS 13+
|
|
|
|
Document scanning with automatic edge detection, perspective correction, and lighting adjustment.
|
|
|
|
#### Basic Usage
|
|
|
|
```swift
|
|
import VisionKit
|
|
|
|
let camera = VNDocumentCameraViewController()
|
|
camera.delegate = self
|
|
present(camera, animated: true)
|
|
```
|
|
|
|
#### Delegate Protocol
|
|
|
|
```swift
|
|
protocol VNDocumentCameraViewControllerDelegate {
|
|
func documentCameraViewController(_ controller: VNDocumentCameraViewController,
|
|
didFinishWith scan: VNDocumentCameraScan)
|
|
|
|
func documentCameraViewControllerDidCancel(_ controller: VNDocumentCameraViewController)
|
|
|
|
func documentCameraViewController(_ controller: VNDocumentCameraViewController,
|
|
didFailWithError error: Error)
|
|
}
|
|
```
|
|
|
|
#### VNDocumentCameraScan
|
|
|
|
| Property | Type | Description |
|
|
|----------|------|-------------|
|
|
| `pageCount` | `Int` | Number of scanned pages |
|
|
| `imageOfPage(at:)` | `UIImage` | Get page image at index |
|
|
| `title` | `String` | User-editable title |
|
|
|
|
```swift
|
|
func documentCameraViewController(_ controller: VNDocumentCameraViewController,
|
|
didFinishWith scan: VNDocumentCameraScan) {
|
|
controller.dismiss(animated: true)
|
|
|
|
for i in 0..<scan.pageCount {
|
|
let pageImage = scan.imageOfPage(at: i)
|
|
// Process with VNRecognizeTextRequest
|
|
}
|
|
}
|
|
```
|
|
|
|
## Document Analysis APIs
|
|
|
|
### VNDetectDocumentSegmentationRequest
|
|
|
|
**Availability**: iOS 15+, macOS 12+
|
|
|
|
Detects document boundaries for custom camera UIs or post-processing.
|
|
|
|
```swift
|
|
let request = VNDetectDocumentSegmentationRequest()
|
|
let handler = VNImageRequestHandler(ciImage: image)
|
|
try handler.perform([request])
|
|
|
|
guard let observation = request.results?.first as? VNRectangleObservation else {
|
|
return // No document found
|
|
}
|
|
|
|
// Get corner points (normalized)
|
|
let corners = [
|
|
observation.topLeft,
|
|
observation.topRight,
|
|
observation.bottomLeft,
|
|
observation.bottomRight
|
|
]
|
|
```
|
|
|
|
**vs VNDetectRectanglesRequest**:
|
|
- Document: ML-based, trained specifically on documents
|
|
- Rectangle: Edge-based, finds any quadrilateral
|
|
|
|
### RecognizeDocumentsRequest (iOS 26+)
|
|
|
|
**Availability**: iOS 26+, macOS 26+
|
|
|
|
Structured document understanding with semantic parsing.
|
|
|
|
#### Basic Usage
|
|
|
|
```swift
|
|
let request = RecognizeDocumentsRequest()
|
|
let observations = try await request.perform(on: imageData)
|
|
|
|
guard let document = observations.first?.document else {
|
|
return
|
|
}
|
|
```
|
|
|
|
#### DocumentObservation Hierarchy
|
|
|
|
```
|
|
DocumentObservation
|
|
└── document: DocumentObservation.Document
|
|
├── text: TextObservation
|
|
├── tables: [Container.Table]
|
|
├── lists: [Container.List]
|
|
└── barcodes: [Container.Barcode]
|
|
```
|
|
|
|
#### Table Extraction
|
|
|
|
```swift
|
|
for table in document.tables {
|
|
for row in table.rows {
|
|
for cell in row {
|
|
let text = cell.content.text.transcript
|
|
let detectedData = cell.content.text.detectedData
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
#### Detected Data Types
|
|
|
|
```swift
|
|
for data in document.text.detectedData {
|
|
switch data.match.details {
|
|
case .emailAddress(let email):
|
|
let address = email.emailAddress
|
|
case .phoneNumber(let phone):
|
|
let number = phone.phoneNumber
|
|
case .link(let url):
|
|
let link = url
|
|
case .address(let address):
|
|
let components = address
|
|
case .date(let date):
|
|
let dateValue = date
|
|
default:
|
|
break
|
|
}
|
|
}
|
|
```
|
|
|
|
#### TextObservation Hierarchy
|
|
|
|
```
|
|
TextObservation
|
|
├── transcript: String
|
|
├── lines: [TextObservation.Line]
|
|
├── paragraphs: [TextObservation.Paragraph]
|
|
├── words: [TextObservation.Word]
|
|
└── detectedData: [DetectedDataObservation]
|
|
```
|
|
|
|
## Visual Intelligence Integration
|
|
|
|
Visual Intelligence is a **system-level feature** (iOS 26+) that lets users point their camera at real-world objects and find matching content across apps. This is distinct from the Vision framework (VNRequest-based image analysis) covered above. Vision analyzes images within your app; Visual Intelligence lets the system invoke your app when users search with the camera or screenshots.
|
|
|
|
### How It Works
|
|
|
|
1. User activates Visual Intelligence camera or takes a screenshot
|
|
2. System analyzes what the user is looking at
|
|
3. System queries participating apps via `IntentValueQuery`
|
|
4. Your app receives a `SemanticContentDescriptor` with labels and/or pixel data
|
|
5. Your app searches its content and returns matching `AppEntity` results
|
|
6. Results appear in the Visual Intelligence UI with your app's branding
|
|
|
|
### Required Frameworks
|
|
|
|
```swift
|
|
import VisualIntelligence
|
|
import AppIntents
|
|
```
|
|
|
|
### SemanticContentDescriptor
|
|
|
|
The core object the system provides to describe what the user is looking at.
|
|
|
|
| Property | Type | Description |
|
|
|----------|------|-------------|
|
|
| `labels` | `[String]` | Classification labels for the detected item |
|
|
| `pixelBuffer` | `CVReadOnlyPixelBuffer?` | Visual data of the detected item |
|
|
|
|
Use labels for fast keyword matching against your content catalog. Use the pixel buffer for image-similarity search when labels are insufficient.
|
|
|
|
### IntentValueQuery
|
|
|
|
The entry point for Visual Intelligence to communicate with your app. Implement `values(for:)` to receive search requests and return matching entities.
|
|
|
|
```swift
|
|
struct LandmarkIntentValueQuery: IntentValueQuery {
|
|
@Dependency var modelData: ModelData
|
|
|
|
func values(for input: SemanticContentDescriptor) async throws -> [LandmarkEntity] {
|
|
if !input.labels.isEmpty {
|
|
return try await modelData.search(matching: input.labels)
|
|
}
|
|
guard let pixelBuffer = input.pixelBuffer else { return [] }
|
|
return try await modelData.search(matching: pixelBuffer)
|
|
}
|
|
}
|
|
```
|
|
|
|
### Returning Multiple Result Types
|
|
|
|
Use `@UnionValue` when your app can return different entity types from a single search.
|
|
|
|
```swift
|
|
@UnionValue
|
|
enum VisualSearchResult {
|
|
case landmark(LandmarkEntity)
|
|
case collection(CollectionEntity)
|
|
}
|
|
```
|
|
|
|
### Display Representation
|
|
|
|
Visual Intelligence uses your entity's `DisplayRepresentation` to show results. Provide a title, subtitle, and image for each result.
|
|
|
|
```swift
|
|
struct LandmarkEntity: AppEntity {
|
|
var id: String
|
|
var name: String
|
|
var location: String
|
|
|
|
static var typeDisplayRepresentation: TypeDisplayRepresentation {
|
|
TypeDisplayRepresentation(
|
|
name: LocalizedStringResource("Landmark", table: "AppIntents"),
|
|
numericFormat: "\(placeholder: .int) landmarks"
|
|
)
|
|
}
|
|
|
|
var displayRepresentation: DisplayRepresentation {
|
|
DisplayRepresentation(
|
|
title: "\(name)",
|
|
subtitle: "\(location)",
|
|
image: .init(named: thumbnailImageName)
|
|
)
|
|
}
|
|
}
|
|
```
|
|
|
|
### Deep Linking from Results
|
|
|
|
When a user taps a result, your app should open to the relevant content. Provide an `appLinkURL` on your entity.
|
|
|
|
```swift
|
|
var appLinkURL: URL? {
|
|
URL(string: "yourapp://landmark/\(id)")
|
|
}
|
|
```
|
|
|
|
### "More Results" Intent
|
|
|
|
For large result sets, provide a `VisualIntelligenceSearchIntent` that opens your app's full search UI.
|
|
|
|
```swift
|
|
struct ViewMoreLandmarksIntent: AppIntent, VisualIntelligenceSearchIntent {
|
|
static var title: LocalizedStringResource = "View More Landmarks"
|
|
|
|
@Parameter(title: "Semantic Content")
|
|
var semanticContent: SemanticContentDescriptor
|
|
|
|
func perform() async throws -> some IntentResult {
|
|
// Open your app's search view with the semantic content
|
|
return .result()
|
|
}
|
|
}
|
|
```
|
|
|
|
### Best Practices
|
|
|
|
- **Return results quickly** — Visual Intelligence expects low-latency responses. Limit to 10-20 most relevant results
|
|
- **Prefer labels first** — Label matching is faster than pixel buffer analysis. Fall back to pixel buffer when labels are empty or insufficient
|
|
- **Localize everything** — Display representations appear in the system UI. Use `LocalizedStringResource` for all user-facing text
|
|
- **Include images** — Results with thumbnails are more recognizable in the Visual Intelligence overlay
|
|
|
|
### Testing
|
|
|
|
1. Build and run on a physical device
|
|
2. Activate Visual Intelligence camera or take a screenshot of relevant content
|
|
3. Perform a visual search and verify your app's results appear
|
|
4. Tap results to verify deep linking opens the correct content
|
|
|
|
## API Quick Reference
|
|
|
|
### Subject Segmentation
|
|
|
|
| API | Platform | Purpose |
|
|
|-----|----------|---------|
|
|
| `VNGenerateForegroundInstanceMaskRequest` | iOS 17+ | Class-agnostic subject instances |
|
|
| `VNGeneratePersonInstanceMaskRequest` | iOS 17+ | Up to 4 people separately |
|
|
| `VNGeneratePersonSegmentationRequest` | iOS 15+ | All people (single mask) |
|
|
| `ImageAnalysisInteraction` (VisionKit) | iOS 16+ | UI for subject lifting |
|
|
|
|
### Pose Detection
|
|
|
|
| API | Platform | Landmarks | Coordinates |
|
|
|-----|----------|-----------|-------------|
|
|
| `VNDetectHumanHandPoseRequest` | iOS 14+ | 21 per hand | 2D normalized |
|
|
| `VNDetectHumanBodyPoseRequest` | iOS 14+ | 18 body joints | 2D normalized |
|
|
| `VNDetectHumanBodyPose3DRequest` | iOS 17+ | 17 body joints | 3D meters |
|
|
|
|
### Face & Person Detection
|
|
|
|
| API | Platform | Purpose |
|
|
|-----|----------|---------|
|
|
| `VNDetectFaceRectanglesRequest` | iOS 11+ | Face bounding boxes |
|
|
| `VNDetectFaceLandmarksRequest` | iOS 11+ | Face with detailed landmarks |
|
|
| `VNDetectHumanRectanglesRequest` | iOS 13+ | Human torso bounding boxes |
|
|
|
|
### Text & Barcode
|
|
|
|
| API | Platform | Purpose |
|
|
|-----|----------|---------|
|
|
| `VNRecognizeTextRequest` | iOS 13+ | Text recognition (OCR) |
|
|
| `VNDetectBarcodesRequest` | iOS 11+ | Barcode/QR detection |
|
|
| `DataScannerViewController` | iOS 16+ | Live camera scanner (text + barcodes) |
|
|
| `VNDocumentCameraViewController` | iOS 13+ | Document scanning with perspective correction |
|
|
| `VNDetectDocumentSegmentationRequest` | iOS 15+ | Programmatic document edge detection |
|
|
| `RecognizeDocumentsRequest` | iOS 26+ | Structured document extraction |
|
|
|
|
### Visual Intelligence
|
|
|
|
| API | Platform | Purpose |
|
|
|-----|----------|---------|
|
|
| `SemanticContentDescriptor` | iOS 26+ | Describes what the user is looking at (labels + pixel buffer) |
|
|
| `IntentValueQuery` | iOS 26+ | Entry point for receiving visual search requests |
|
|
| `VisualIntelligenceSearchIntent` | iOS 26+ | "More results" deep link to your app |
|
|
|
|
### Observation Types
|
|
|
|
| Observation | Returned By |
|
|
|-------------|-------------|
|
|
| `VNInstanceMaskObservation` | Foreground/person instance masks |
|
|
| `VNPixelBufferObservation` | Person segmentation (single mask) |
|
|
| `VNHumanHandPoseObservation` | Hand pose |
|
|
| `VNHumanBodyPoseObservation` | Body pose (2D) |
|
|
| `VNHumanBodyPose3DObservation` | Body pose (3D) |
|
|
| `VNFaceObservation` | Face detection/landmarks |
|
|
| `VNHumanObservation` | Human rectangles |
|
|
| `VNRecognizedTextObservation` | Text recognition |
|
|
| `VNBarcodeObservation` | Barcode detection |
|
|
| `VNRectangleObservation` | Document segmentation |
|
|
| `DocumentObservation` | Structured document (iOS 26+) |
|
|
|
|
## Resources
|
|
|
|
**WWDC**: 2019-234, 2021-10041, 2022-10024, 2022-10025, 2025-272, 2023-10176, 2023-111241, 2023-10048, 2020-10653, 2020-10043, 2020-10099
|
|
|
|
**Docs**: /vision, /visionkit, /visualintelligence, /visualintelligence/semanticcontentdescriptor, /vision/vnrecognizetextrequest, /vision/vndetectbarcodesrequest
|
|
|
|
**Skills**: axiom-vision, axiom-vision-diag
|