
Benchmark NuExtract3 + Unlimited-OCR on olmOCR-bench old_scans#26

Merged
davanstrien merged 2 commits into main from add-multimodel-oldscans-bench
Jun 29, 2026


@davanstrien

Owner

What

Extends the old_scans experiment to NuExtract3 (4.5B) and Unlimited-OCR
(3.3B), scored through the same harness — for my own benchmarking. Adds
unlimited_ocr.py, nuextract3.py, and BENCHMARKING.md.

Results

| Model | params | old_scans | present | absent | order | baseline |
|---|---|---|---|---|---|---|
| PaddleOCR-VL v1.6 | 0.9B | 38.6 | 31.2 | 95.7 | 27.7 | 84.7 |
| PaddleOCR-VL v1 | 0.9B | 38.2 | 32.3 | 95.7 | 24.9 | 88.8 |
| NuExtract3 | 4.5B | 37.8 | 41.6 | 41.4 | 30.5 | 100.0 |
| Unlimited-OCR | 3.3B | 30.6 | 29.0 | 50.0 | 25.4 | 89.8 |

The caveat (front and center in BENCHMARKING.md)

The single old_scans number conflates transcription (present) and
boilerplate exclusion (absent). NuExtract3 is the best transcriber
(present 41.6 vs. 31.2 for PaddleOCR-VL v1.6) and never hallucinates CJK
(baseline 100%). It "loses" on old_scans only because its markdown mode
transcribes letterheads and stamps, which drags down absent (41.4 vs. 95.7).
This was verified: the boilerplate comes out as plain body text, not as
strippable <figure>/HTML. So this is an architecture tradeoff, not a
read-quality deficit: paddle wins through boilerplate exclusion (its layout
pipeline), not through better reading.
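
To make the caveat concrete, here is a minimal sketch that ranks the four models on each column separately. The scores are copied from the Results table. The `SCORES` dict and the `rank_by` helper are illustrative names and are not part of the scripts added in this PR.

```python
# Scores copied from the Results table above (higher is better).
# SCORES and rank_by are hypothetical helpers for illustration only.
SCORES = {
    "PaddleOCR-VL v1.6": {"old_scans": 38.6, "present": 31.2, "absent": 95.7,
                          "order": 27.7, "baseline": 84.7},
    "PaddleOCR-VL v1": {"old_scans": 38.2, "present": 32.3, "absent": 95.7,
                        "order": 24.9, "baseline": 88.8},
    "NuExtract3": {"old_scans": 37.8, "present": 41.6, "absent": 41.4,
                   "order": 30.5, "baseline": 100.0},
    "Unlimited-OCR": {"old_scans": 30.6, "present": 29.0, "absent": 50.0,
                      "order": 25.4, "baseline": 89.8},
}


def rank_by(column):
    """Return model names sorted best-first on a single column."""
    return sorted(SCORES, key=lambda model: SCORES[model][column], reverse=True)


if __name__ == "__main__":
    for column in ("old_scans", "present", "absent", "order", "baseline"):
        print(f"{column:>9}: {' > '.join(rank_by(column))}")
```

Ranking on old_scans puts PaddleOCR-VL v1.6 first, while ranking on present, order, or baseline puts NuExtract3 first. Only the absent column pushes NuExtract3 to the bottom.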

Notes

  • Each model at its recommended DPI (NuExtract 170 / Unlimited 300), footnoted.
  • Unlimited's <|det|> grounding stripped; NuExtract non-thinking + greedy.
  • Not size-matched (3–4.5B vs 0.9B).
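
As a rough illustration of what the two DPI settings mean for the rendered input, the sketch below converts a PDF page size in points (72 per inch) to pixels. The US Letter page size is an assumption chosen for the example, since the page sizes in old_scans are not stated here, and `pixel_size` is a hypothetical helper rather than code from nuextract3.py or unlimited_ocr.py.

```python
POINTS_PER_INCH = 72  # PDF page sizes are expressed in points


def pixel_size(width_pt, height_pt, dpi):
    """Pixel dimensions of a page rendered at the given DPI."""
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


if __name__ == "__main__":
    # Assumption: a US Letter page (8.5 x 11 in = 612 x 792 pt).
    letter = (612, 792)
    for model, dpi in (("NuExtract3", 170), ("Unlimited-OCR", 300)):
        w, h = pixel_size(*letter, dpi)
        print(f"{model}: {dpi} DPI -> {w} x {h} px ({w * h / 1e6:.1f} MP)")
```

Under that assumption, the 300 DPI render has roughly three times as many pixels as the 170 DPI one, so the two models see quite different inputs for the same page.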

🤖 Generated with Claude Code

davanstrien and others added 2 commits June 27, 2026 18:31
old_scans: paddle v1.6 38.6 / v1 38.2 / NuExtract3 37.8 / Unlimited-OCR 30.6.
BENCHMARKING.md leads with the caveat that the single old_scans number conflates
transcription (present) and boilerplate exclusion (absent): NuExtract3 is the
BEST transcriber (present 41.6 vs paddle 31.2) + never hallucinates CJK (baseline
100%), but scores low on absent because markdown-mode transcribes letterheads/
stamps as plain text (verified: not strippable <figure>/HTML). Architecture
tradeoff, not a read-quality deficit.

Self-contained uv scripts: render PDF->PNG at each model's DPI, greedy, write the
bucket mount directly; NuExtract non-thinking; Unlimited <|det|> grounding stripped.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
The single old_scans number is fitness for olmOCR's goal (clean reading-order
text for LLM training); reasonable for that, wrong yardstick for faithful/archival
OCR where the boilerplate is the record.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
@davanstrien davanstrien merged commit 99f7550 into main Jun 29, 2026
1 check passed
@davanstrien davanstrien deleted the add-multimodel-oldscans-bench branch June 29, 2026 13:39