avocr is a fast, macOS-native OCR command-line tool powered by Apple's Vision framework. It handles images and PDFs, scales across worker processes, and can write plain text, JSON Lines, or searchable PDFs.
- Native OCR using VNRecognizeTextRequest: no model downloads or OCR server required.
- Fast batch processing with multiple worker processes and bounded prefetching.
- PDF-aware behavior: renders scanned pages, reuses embedded PDF text by default, or forces OCR when needed.
- Flexible outputs: stdout, per-document text files, per-page text files, JSONL with bounding boxes, or searchable PDFs.
- Predictable ordering: stdout and file output are emitted in input/page order even when workers finish out of order.
- Safe defaults: output files are regenerated per run; searchable PDF mode avoids accidental same-path overwrites unless --overwrite is used.
- macOS 13.0+ (macOS 14.0+ recommended)
- Swift 5.9+
- Apple Silicon or Intel Mac
Runtime OCR/PDF support comes from Apple system frameworks: Vision, PDFKit, CoreGraphics, ImageIO, and AppKit. Swift Package Manager resolves the only package dependency, Swift Argument Parser.
git clone https://github.com/fctrbl/avocr
cd avocr
swift build -c release
The binary is written to .build/release/avocr.
To install it on your PATH:
sudo cp .build/release/avocr /usr/local/bin/avocr
# OCR a PDF or image; writes ./document.txt
avocr document.pdf
# Stream text to stdout instead of creating files
avocr --stdout document.pdf > document.txt
# Process a directory recursively
avocr --output ./text ~/Documents/scans
# Faster, lower-accuracy mode with 8 workers
avocr --workers 8 --fast large-scan.pdf
# JSON Lines with text blocks and bounding boxes
avocr --stdout --format jsonl document.pdf > document.jsonl
Progress and logs are written to stderr, so --stdout is safe to pipe or redirect.
Supported file extensions are:
- PDFs: .pdf
- Images: .png, .jpg, .jpeg, .tif, .tiff, .bmp, .gif, .heic, .heif, .webp
Directories are scanned recursively. Hidden files and hidden directories are skipped unless --include-hidden is set.
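The discovery rules above can be sketched in a few lines of Python. This is an illustration of the described behavior, not avocr's actual FileEnumerator: the function name discover is hypothetical, and it assumes that "hidden" means a name starting with a dot and that extension matching is case-insensitive.

```python
import os

# Extensions listed above; case-insensitive matching is an assumption.
SUPPORTED = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff",
             ".bmp", ".gif", ".heic", ".heif", ".webp"}

def discover(root, include_hidden=False):
    """Return supported files under root, sorted, skipping hidden entries by default."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        if not include_hidden:
            # Prune hidden directories in place so os.walk does not descend into them.
            dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for name in filenames:
            if not include_hidden and name.startswith("."):
                continue
            if os.path.splitext(name)[1].lower() in SUPPORTED:
                found.append(os.path.join(dirpath, name))
    return sorted(found)
```

Calling discover(path, include_hidden=True) mirrors what --include-hidden is described as doing.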
By default, avocr writes output files to the current directory. Use --stdout to stream results, or --output <dir> to choose a destination directory.
Text output paths:
- Image: <basename>.txt
- PDF combined text: <basename>.txt
- PDF with --per-page: <basename>_page<page>.txt (0-based page numbers)
- JSONL file output: results.jsonl
Output files are truncated/regenerated on each run. If a batch contains multiple inputs with the same basename, later collisions get path-derived filenames instead of merging or overwriting earlier results.
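As a rough model of the text naming rules above, the following Python sketch maps one input to its expected text output paths. The function text_output_paths and its parameters are hypothetical. Collision handling is left out because the exact path-derived naming scheme is not specified here, and the sketch assumes --per-page affects only PDFs.

```python
from pathlib import Path

def text_output_paths(source, output_dir=".", per_page=False, page_count=1):
    """Return the text file paths described above for a single input."""
    src = Path(source)
    out = Path(output_dir)
    if src.suffix.lower() == ".pdf" and per_page:
        # One file per page, numbered from 0.
        return [out / f"{src.stem}_page{n}.txt" for n in range(page_count)]
    # Images and combined PDF text share the <basename>.txt pattern.
    return [out / f"{src.stem}.txt"]
```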
Examples:
avocr ~/scans/invoice.pdf
# writes ./invoice.txt
avocr --output ./text ~/scans/invoice.pdf
# writes ./text/invoice.txt
avocr --output ./text --per-page ~/scans/invoice.pdf
# writes ./text/invoice_page0.txt, ./text/invoice_page1.txt, ...
avocr --output ./text --format jsonl ~/scans
# writes ./text/results.jsonl
Text output marks each PDF page with a header:
=== Page 0 ===
Text from the first page.
=== Page 1 ===
Text from the second page.
Use --no-headers for raw text only.
In JSONL output, each line is one page/image result:
{"path":"doc.pdf","page":0,"text":"Extracted text...","blocks":[{"text":"Line 1","confidence":0.95,"bbox":{"x":0.1,"y":0.8,"width":0.8,"height":0.05}}]}
Fields:
- path: source file path
- page: 0-based page number; omitted for images
- text: full extracted text for that page/image
- blocks: OCR text blocks with confidence and normalized bounding boxes
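A consumer of the JSONL output usually needs pixel coordinates rather than normalized ones. The Python sketch below converts each block's bbox for a page of known pixel size. The function name block_pixel_boxes is hypothetical. The flip_y default assumes that y is measured from the bottom edge, as in Apple's Vision coordinate convention; whether avocr passes those coordinates through unchanged is an assumption, so check it against your own output.

```python
import json

def block_pixel_boxes(jsonl_line, page_width, page_height, flip_y=True):
    """Return (text, left, top, width, height) in pixels for each block on one line."""
    record = json.loads(jsonl_line)
    boxes = []
    for block in record.get("blocks", []):
        b = block["bbox"]
        left = b["x"] * page_width
        width = b["width"] * page_width
        height = b["height"] * page_height
        if flip_y:
            # Assumption: y is measured from the bottom edge, so flip to a top-left origin.
            top = (1.0 - b["y"] - b["height"]) * page_height
        else:
            top = b["y"] * page_height
        boxes.append((block["text"], round(left), round(top), round(width), round(height)))
    return boxes
```

Pass flip_y=False if the boxes already use a top-left origin.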
For PDFs, avocr uses embedded text when a page already has enough extractable text. This is much faster and avoids OCR noise on born-digital PDFs.
# Default: use embedded text when available, OCR scanned pages
avocr document.pdf
# Force OCR even if the PDF has an existing text layer
avocr --force-ocr document.pdf
--embed-text-layer creates PDFs that preserve the original page image and add invisible OCR text so the result is searchable/selectable.
# Write ./scan_ocr.pdf if scan.pdf is in the current directory
avocr --embed-text-layer scan.pdf
# Write ./searchable/scan.pdf
avocr --embed-text-layer --output ./searchable scan.pdf
# Replace the original file intentionally
avocr --embed-text-layer --overwrite scan.pdf
Notes:
- Searchable PDF mode only accepts PDF inputs.
- It cannot be combined with --stdout, --format jsonl, or --per-page.
- Without --overwrite, avocr will not write to the same source path; it appends _ocr when the destination would otherwise equal the input file.
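One reading of the destination rule in these notes can be written as a small Python function. The name searchable_pdf_destination is hypothetical, and the logic is inferred from the examples above rather than taken from avocr's source. In particular, it assumes --overwrite matters only when the computed destination equals the input path.

```python
from pathlib import Path

def searchable_pdf_destination(source, output_dir=".", overwrite=False):
    """Return where the searchable PDF would be written under the rule described above."""
    src = Path(source).resolve()
    dest = (Path(output_dir) / src.name).resolve()
    if dest == src and not overwrite:
        # Avoid clobbering the input: scan.pdf becomes scan_ocr.pdf.
        dest = dest.with_name(f"{src.stem}_ocr{src.suffix}")
    return dest
```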
# Multiple languages; whitespace is OK
avocr --lang "en-US, fr-FR, de-DE" document.pdf
# Disable language correction
avocr --no-correction handwritten-or-code.png
# Region of interest: x,y,w,h in normalized 0-1 coordinates
avocr --roi 0.5,0.5,0.5,0.5 document.pdf
# Ignore very small text blocks
avocr --min-text-height 0.02 document.pdf
# Auto-detect columns (default)
avocr --columns auto newspaper.pdf
# Force a fixed layout
avocr --columns 1 document.pdf
avocr --columns 2 newspaper.pdf
avocr --columns 3 tri-fold.pdf
Column detection is heuristic-based; use fixed columns when you know the layout.
avocr --workers 8 scans/
avocr --workers max scans/
Default worker count is the active CPU count. More workers can improve throughput but also increase memory use. For very large PDFs or high DPI, reduce --workers first if you see memory pressure.
Starting points:
- Apple Silicon M1/M2: --workers 8 to --workers 12
- Apple Silicon M3/M4: --workers 12 to --workers 16
- Intel Macs: --workers 4 to --workers 8
avocr --dpi 200 clean-scan.pdf # faster
avocr --dpi 300 document.pdf # default balance
avocr --dpi 400 tiny-text.pdf # slower, often more accurate
Allowed range is 72–600 DPI. Higher DPI can significantly increase memory and processing time.
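To see why DPI affects memory so strongly, it helps to estimate the size of one rendered page. The Python sketch below assumes a US Letter page (8.5 x 11 inches) held as an uncompressed bitmap at 4 bytes per pixel. Both figures are assumptions for illustration, and avocr's actual rendering format may differ.

```python
def page_bitmap_bytes(dpi, width_in=8.5, height_in=11.0, bytes_per_pixel=4):
    """Estimate the uncompressed bitmap size of one rendered page in bytes."""
    if not 72 <= dpi <= 600:
        raise ValueError("dpi must be within 72-600")
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel

# Print rough per-page sizes for the three DPI values used above.
for dpi in (200, 300, 400):
    print(dpi, round(page_bitmap_bytes(dpi) / 1e6, 1), "MB")
```

Under these assumptions a page is roughly 15 MB at 200 DPI, 34 MB at 300 DPI, and 60 MB at 400 DPI. The size grows with the square of the DPI, so doubling the DPI quadruples the per-page footprint.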
avocr --fast clean-printed-text.pdf
--fast is usually best for clean printed text. Leave it off for small text, noisy scans, or complex layouts.
avocr --prefetch 1 huge.pdf
avocr --prefetch 4 many-small-images/
--prefetch controls in-flight tasks per worker. Lower values reduce memory use; higher values can help when image/PDF loading is the bottleneck.
# Quiet output except OCR text/files
avocr --no-progress document.pdf
# Machine-readable progress on stderr
avocr --progress-format json --stdout document.pdf > document.txt
# JSON-formatted logs on stderr
avocr --log-format json --verbose document.pdf
--progress-format quiet is equivalent to disabling progress. --fail-fast, --max-errors <N>, and --retries <N> are useful for batch jobs.
-h, --help                              Show help information
    --version                           Show version
-i, --input <path>                      Input file/directory; can be repeated
    --include-hidden                    Include hidden files during directory scans
-o, --output <dir>                      Output directory (default: current directory)
    --stdout                            Write OCR results to stdout
-f, --format <text|jsonl>               Output format (default: text)
    --per-page                          Write one text file per PDF page
    --no-headers                        Suppress text page headers
    --no-progress                       Disable progress output
    --progress-format <bar|json|quiet>  Progress output format (stderr)
-v, --verbose                           Enable debug logging
    --log-format <text|json>            Log format (stderr)
    --fast                              Use fast recognition level
-c, --columns <auto|1|2|3>              Column layout mode
-l, --lang <codes>                      Comma-separated language codes
    --no-correction                     Disable Vision language correction
    --min-text-height <0-1>             Minimum normalized text height
    --roi <x,y,w,h>                     Normalized region of interest
    --dpi <72-600>                      PDF render DPI (default: 300)
    --use-existing-text                 Explicitly request default embedded-text behavior
    --force-ocr                         OCR PDFs even when embedded text exists
    --embed-text-layer                  Create searchable PDFs
    --overwrite                         Replace originals in searchable PDF mode
-j, --workers <N|max>                   Worker processes
    --prefetch <N>                      In-flight tasks per worker
    --fail-fast                         Stop after first processing error
    --max-errors <N>                    Stop after N processing errors
    --retries <N>                       Retry transient Vision errors
    --graceful-timeout <seconds>        Cleanup time after cancellation
The command prints help instead of running
Pass at least one input path, or use -i/--input:
avocr -i ./scans -i ./photos -o ./text
Out of memory or the Mac becomes sluggish
- Reduce --workers
- Reduce --prefetch
- Lower --dpi
- Process a smaller batch
Poor OCR accuracy
- Remove --fast
- Increase --dpi to 400
- Set the correct --lang
- Try --columns 1, --columns 2, or --columns 3 for known layouts
- Check scan quality and page rotation
Born-digital PDFs return text too quickly / do not look OCR'd
That is the default embedded-text shortcut. Use --force-ocr when you need OCR output specifically.
I need deterministic output for a pipeline
Use --stdout --no-progress or redirect stderr separately. Page results are emitted in input/page order even with multiple workers.
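When avocr is driven from a script, capturing stdout and stderr separately keeps OCR text apart from progress and log output. The Python helper below is a minimal sketch using the standard subprocess module. The name run_capture is hypothetical, and check=True assumes that a nonzero exit status signals failure.

```python
import subprocess

def run_capture(cmd):
    """Run cmd and return (stdout_text, stderr_text), keeping the two streams separate."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout, result.stderr

# Example with flags from this README (requires avocr on PATH):
# text, diagnostics = run_capture(["avocr", "--stdout", "--no-progress", "document.pdf"])
```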
swift test
swift run avocr --help
swift build -c release
The repository also includes a Python benchmark harness in benchmark/ for comparing avocr with other OCR engines.
- FileEnumerator: recursive input discovery and filtering
- WorkItems: PDF/image work-plan construction
- AVOCREngine: Vision OCR wrapper
- PDFRenderer: PDF rendering, embedded text extraction, searchable PDF creation
- ReadingOrder: text block sorting and column heuristics
- OutputWriter: text/JSONL/stdout/file output
- MultiprocessCoordinator: worker orchestration, cancellation, ordered output
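Ordered output from out-of-order workers is commonly handled with a small reorder buffer. The Python sketch below shows that general technique. It is not taken from avocr's source, and the name emit_in_order is hypothetical. Each result is assumed to carry the index of its position in the input/page order.

```python
def emit_in_order(completed):
    """Yield results in index order from (index, result) pairs that arrive in any order."""
    pending = {}
    next_index = 0
    for index, result in completed:
        pending[index] = result
        # Flush every buffered result that is now contiguous with what was already emitted.
        while next_index in pending:
            yield pending.pop(next_index)
            next_index += 1
```

Results that arrive early wait in the buffer until every earlier index has been emitted, so a slow page delays output behind it but never reorders it.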
- Handwriting recognition depends on Apple's Vision capabilities; results are best on clear printed text.
- Column detection is heuristic and may need manual --columns for complex layouts.
- Tables are emitted as text blocks, not structured table data.
- Fully rotated pages/text may require preprocessing.
- Available OCR languages are limited to the languages supported by the installed macOS Vision framework.
This is free and unencumbered software released into the public domain. See LICENSE for details.
Built with Apple's Vision framework, PDFKit, and Swift Argument Parser.