Dev by VikParuchuri · Pull Request #56 · datalab-to/pdftext

VikParuchuri · 2026-06-10T21:55:29Z

No description provided.

Bundled pdfium moves 6462 -> 7869. Text extraction improves on math-heavy content (previously dropped chars in formulas); loose charboxes shift slightly due to upstream FontBBox bounding changes. Benchmark alignment vs pymupdf unchanged (98.4). Keep the exact pin: every pypdfium2 minor swaps the pdfium binary and silently changes extraction output. Co-Authored-By: Claude Fable 5 <[email protected]>

Close page and textpage handles in get_pages (including the first page handle when flattening re-fetches), close the document in a finally in _get_pages and around link extraction, and release page/annotation handles in get_links. Narrow the get_rotation excepts to PdfiumError. Output is byte-identical (golden-diff verified). Co-Authored-By: Claude Fable 5 <[email protected]>
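The cleanup ordering described in this commit can be illustrated with a minimal sketch. `FakeHandle`, `extract_page_text`, and `run` are hypothetical stand-ins, not the pypdfium2 or pdftext API; the point is only that nested `try`/`finally` releases child handles before their parents, even when extraction raises.

```python
class FakeHandle:
    """Hypothetical stand-in for a native handle (not the pypdfium2 API)."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def close(self):
        # Record the release so the ordering can be inspected.
        self.log.append(f"close {self.name}")


def extract_page_text(log, fail=False):
    page = FakeHandle("page", log)
    try:
        textpage = FakeHandle("textpage", log)
        try:
            if fail:
                raise RuntimeError("extraction failed")
            return "text"
        finally:
            textpage.close()  # child handle is released first
    finally:
        page.close()  # then the page, on success or failure


def run(fail=False):
    log = []
    doc = FakeHandle("doc", log)
    try:
        try:
            extract_page_text(log, fail=fail)
        except RuntimeError:
            log.append("error surfaced")
    finally:
        doc.close()  # the document is always closed last
    return log
```

Calling `run(fail=True)` still closes the textpage, page, and document in that order, which is the property the commit is after.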

- Add password support for encrypted PDFs (password= kwarg on all public functions, threaded through workers; --password CLI flag) and raise a clear PdfPasswordError instead of a cryptic pdfium failure
- Validate page ranges in the library (ValueError before forking workers) and CLI (click.BadParameter; fixes off-by-one that allowed p == doc_len)
- Replace validation asserts with ValueError (survives python -O)
- Narrow get_fontname except to PdfiumError, decode with errors=replace
- Surface worker crashes as RuntimeError with context (BrokenProcessPool)

Co-Authored-By: Claude Fable 5 <[email protected]>
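The page-range validation described above could look roughly like the sketch below. The function name `validate_page_range` and the 0-based indexing are assumptions for illustration, not the library's actual signature. It raises `ValueError` rather than using `assert`, so the check survives `python -O`, and it rejects `p == doc_len`, the off-by-one case the commit mentions.

```python
def validate_page_range(page_range, doc_len):
    """Return the requested pages as a list, or raise ValueError.

    Hypothetical helper: assumes 0-based page indices, so the valid
    values are 0 .. doc_len - 1.
    """
    pages = list(page_range)
    # `p >= doc_len` (not `p > doc_len`) is what rejects p == doc_len.
    bad = [p for p in pages if not isinstance(p, int) or p < 0 or p >= doc_len]
    if bad:
        raise ValueError(
            f"Invalid pages {bad}: document has {doc_len} pages "
            f"(valid indices are 0 to {doc_len - 1})"
        )
    return pages
```

Running the check before any worker processes are started keeps the error in the caller's process, where it is easy to read.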

- handle_hyphens no longer drops the final character (was masked by trailing newlines from merge_text)
- Spans always carry superscript/subscript keys, matching the Span TypedDict; link-reconstructed spans previously dropped the flags entirely
- Block dicts consistently include rotation
- Guard zero-division in Bbox.rescale and negative area in intersection_pct; validate img_size dims in table_cell_text

Co-Authored-By: Claude Fable 5 <[email protected]>
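A minimal sketch of the two geometry guards, written as free functions over `[x0, y0, x1, y1]` lists. The signatures and the choice to raise on a zero-sized source are assumptions; the real `Bbox` methods may handle these cases differently.

```python
def rescale(bbox, old_size, new_size):
    """Scale a box from one page size to another (hypothetical signature)."""
    old_w, old_h = old_size
    new_w, new_h = new_size
    if old_w <= 0 or old_h <= 0:
        # Guard: a zero-sized source page would divide by zero below.
        raise ValueError(f"old_size must be positive, got {old_size}")
    sx, sy = new_w / old_w, new_h / old_h
    x0, y0, x1, y1 = bbox
    return [x0 * sx, y0 * sy, x1 * sx, y1 * sy]


def intersection_pct(a, b):
    """Fraction of box a's area that is covered by box b, in [0, 1]."""
    # Clamp each overlap extent at zero. Without the clamp, two boxes that
    # are disjoint on both axes give negative * negative = positive area.
    width = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    height = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    if area_a <= 0:
        return 0.0  # degenerate box: nothing to cover
    return (width * height) / area_a
```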

- Hoist pdfium FFI lookups to locals and reuse ctypes objects (FS_RECTF, c_doubles, font buffer) across the per-char loop instead of allocating per character
- Inline the charbox call (replicates pypdfium2 get_charbox semantics)
- Intern font dicts per page; font comparisons in span/word breaking use identity as the fast path
- Skip Bbox.rotate() on unrotated pages

Output is byte-identical (golden-diff verified). ~12% faster on a 65-page text-heavy PDF.

Co-Authored-By: Claude Fable 5 <[email protected]>
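The font-interning idea can be sketched as follows. The helper names and the font dict keys (`name`, `size`, `weight`) are illustrative assumptions, not the library's actual structures. One dict is created per distinct font on a page, so most comparisons are settled by an identity check without comparing fields.

```python
def intern_font(cache, name, size, weight):
    """Return one shared dict per distinct font (hypothetical keys)."""
    key = (name, size, weight)
    font = cache.get(key)
    if font is None:
        font = {"name": name, "size": size, "weight": weight}
        cache[key] = font
    return font


def same_font(a, b):
    # Identity is the fast path; equality remains as the correct fallback
    # for dicts that did not come from the same per-page cache.
    return a is b or a == b
```

A fresh `cache` dict per page keeps the interned objects short-lived and avoids sharing state between pages.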

Accumulator bboxes (words, spans, lines, blocks) are now copied at creation and mutated with merge_inplace instead of allocating a new Bbox per character/span/line merged. Add __slots__ and a compact __reduce__ to Bbox (smaller worker pickles). Output is byte-identical including char-level bboxes (golden-diff verified with keep_chars=True); workers=4 round-trip identical. Co-Authored-By: Claude Fable 5 <[email protected]>
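As a rough illustration of the accumulator pattern, the sketch below shows a `Bbox` with `__slots__`, an in-place merge, and a compact `__reduce__`. The single `bbox` list attribute and the method bodies are assumptions, not the library's actual class.

```python
class Bbox:
    """Illustrative sketch only; the real class has more methods."""

    __slots__ = ("bbox",)  # no per-instance __dict__

    def __init__(self, bbox):
        self.bbox = bbox

    def copy(self):
        # Accumulators start from a copy so the source box is never mutated.
        return Bbox(list(self.bbox))

    def merge_inplace(self, other):
        # Grow this box to cover `other` without allocating a new Bbox.
        b, o = self.bbox, other.bbox
        if o[0] < b[0]:
            b[0] = o[0]
        if o[1] < b[1]:
            b[1] = o[1]
        if o[2] > b[2]:
            b[2] = o[2]
        if o[3] > b[3]:
            b[3] = o[3]

    def __reduce__(self):
        # pickle stores just the class and the coordinate list.
        return (Bbox, (self.bbox,))
```

The copy at creation is what makes in-place mutation safe: a word's accumulator can grow as characters are merged while each character keeps its own box.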

- assign_scripts precomputes per-span geometry once per line and answers the any-other-span-above/below checks with top-2/bottom-2 aggregates instead of O(n^2) comparisons (was the single hottest function)
- postprocess_text folds special chars and ligatures into one str.translate table; control-char removal gets an ASCII translate fast path and a memoized fallback
- _reconstruct_spans drops per-char Bbox construction and per-char sort

Output is byte-identical (golden-diff verified).

Co-Authored-By: Claude Fable 5 <[email protected]>
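Two of the techniques above can be sketched in a few lines. The first shows why the two smallest values suffice to answer "is any other span above this one" in linear time. The second shows a single `str.translate` table. Function names, the tolerance parameter, the y-grows-downward convention, and the specific character mappings are all assumptions for illustration, not the library's actual code.

```python
def min_excluding_self(values):
    """For each i, min(values[j] for j != i), in O(n) via the two smallest."""
    best = second = float("inf")
    best_idx = -1
    for i, v in enumerate(values):
        if v < best:
            second, best, best_idx = best, v, i
        elif v < second:
            second = v
    # Only the element holding the minimum needs the runner-up.
    return [second if i == best_idx else best for i in range(len(values))]


def has_other_span_above(tops, tol=0.0):
    """Assumes y grows downward, so 'above' means a smaller top value."""
    others = min_excluding_self(tops)
    return [other < top - tol for top, other in zip(tops, others)]


# One table covers ligatures and special characters in a single pass.
# The mappings here are examples, not the library's actual table.
FOLD_TABLE = str.maketrans({
    "\ufb01": "fi",  # LATIN SMALL LIGATURE FI
    "\ufb02": "fl",  # LATIN SMALL LIGATURE FL
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u00a0": " ",   # NO-BREAK SPACE
})


def fold_text(text):
    return text.translate(FOLD_TABLE)
```

The "below" check is symmetric: track the two largest bottom values instead of the two smallest tops.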

Plain-text output and dictionary_output(keep_chars=False, disable_links=True) never read span chars, so workers delete them before pickling results back, cutting IPC cost (~2x on the worker path for a 65-page PDF). Chars are kept whenever links or keep_chars need them. Co-Authored-By: Claude Fable 5 <[email protected]>
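The worker-side pruning might be sketched like this. The nested `blocks`/`lines`/`spans`/`chars` layout and both function names are assumed for illustration. Characters are removed only when neither links nor `keep_chars` need them, which shrinks the pickled payload sent back from workers.

```python
import pickle


def drop_chars(pages):
    """Remove per-character data from every span, in place."""
    for page in pages:
        for block in page["blocks"]:
            for line in block["lines"]:
                for span in line["spans"]:
                    span.pop("chars", None)
    return pages


def prepare_worker_result(pages, keep_chars, disable_links):
    # Chars are still needed when the caller asked for them or when
    # link reconstruction will read them in the parent process.
    if keep_chars or not disable_links:
        return pages
    return drop_chars(pages)


def pickled_size(obj):
    # Rough proxy for the IPC cost of returning `obj` from a worker.
    return len(pickle.dumps(obj))
```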

Skips pypdfium2's per-call auto-cast (~6 calls/char). Output is byte-identical (golden-diff verified). Note: batching char extraction via FPDFText_GetText was tested and rejected — its text-index space diverges from the char-index space on real PDFs (5096 value mismatches + an index shift on the adversarial fixture), so per-char FPDFText_GetUnicode stays. Co-Authored-By: Claude Fable 5 <[email protected]>

- Generated fixtures (rotated, AES-encrypted, empty PDFs) built at test time with pymupdf; no binaries checked in
- New tests: rotated pages, encrypted PDFs (missing/wrong/correct password, through workers), empty pages, out-of-range page ranges, workers-vs-serial equivalence, links/refs with script flags, hyphen/postprocessing units, CLI validation paths, table input mismatch, repeated flatten
- Fix dead test (text_plain_text_output -> test_plain_text_output)
- CI: os matrix (ubuntu/windows/macos) x python (3.10/3.11/3.13), bump checkout/setup-python actions

34 tests passing, up from 8.

Co-Authored-By: Claude Fable 5 <[email protected]>

Table extraction (rotated pages were significantly broken):
- get_dynamic_gap_thresh measured the perpendicular axis for every rotation (branches shifted by one), making the dynamic threshold always negative and dead; now measures the text advance axis
- is_same_span rot-180 had an inverted sign (any gap merged) and two dimension typos normalizing x-coords by image height
- pdfium-injected \r\n chars no longer embed in rotated cell text
- table_output validates caller-supplied pages retain char data

Extraction correctness:
- Lone surrogates from broken ToUnicode CMaps are replaced with U+FFFD (previously crashed UTF-8 encoding of char-level JSON output)
- page["bbox"] for 90/270-rotated pages is now a valid display-space box instead of corner-reversed (x_start > x_end)
- get_lines rotation break compared radians against 45 (dead code); now breaks on roughly perpendicular text (45-135 degrees, circular), keeping pdfium's 180-degree negative-scale flips on the same line; vertical figure axis labels now separate from tick labels
- Bbox.rotate no longer aliases the source list at rotation 0

API consistency and robustness:
- refs key always present on pages (empty when disable_links=True)
- quote_loosebox exposed on plain-text functions
- File-like inputs fall back to serial instead of failing to pickle into worker initargs; empty trailing worker chunks dropped
- Settings env vars now namespaced (PDFTEXT_ prefix)
- CLI page-count check goes through _load_pdf for friendly password errors; worker_shutdown tolerant of teardown exceptions

README: document thread-unsafety, worker threshold, spawn main-guard.

Tests: 40 passing (6 new regression tests incl. a broken-CMap fixture).

Co-Authored-By: Claude Fable 5 <[email protected]>
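Two of the correctness fixes above lend themselves to a short sketch. Both function names are hypothetical, and the inclusive 45/135 boundaries are an assumption. The first replaces surrogate code points, which cannot be encoded as UTF-8, with U+FFFD. The second converts radians to degrees before comparing and measures the angular difference circularly, so a 180-degree flip does not count as a line break.

```python
import math


def replace_lone_surrogates(text):
    """Swap any surrogate code point for U+FFFD so UTF-8 encoding succeeds."""
    return "".join(
        "\ufffd" if 0xD800 <= ord(ch) <= 0xDFFF else ch for ch in text
    )


def is_roughly_perpendicular(rot_a, rot_b):
    """Rotations are in radians; compare in degrees on a circle."""
    diff = math.degrees(rot_a - rot_b) % 360.0
    if diff > 180.0:
        diff = 360.0 - diff  # fold to the shorter arc, 0..180
    # 45-135 degrees means roughly perpendicular. A difference near 0 or
    # near 180 (a flipped run) stays on the same line.
    return 45.0 <= diff <= 135.0
```

Comparing a raw radian value against 45, as the old code did, can never trigger for real rotations, since those stay within a few radians.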

- Refresh benchmark table from a full 200-doc run on pypdfium2 5.9.0 (pymupdf 1.25.3, pdfplumber 0.11): pdftext 1.36 s/doc at 97.54 alignment
- Label benchmark times as per-doc (they always were; the header said per-page)
- Document --password on both CLI modes; correct span rotation units to radians; fix benchmark script path and add --pdftext_workers
- Python requirement 3.9+ -> 3.10+ to match the package constraint
- Re-measure the pypdfium2-alone comparison (~3x after the hot-loop optimizations shifted the balance toward grouping)
- Drop the stale scikit-learn credit and decision-tree description

Co-Authored-By: Claude Fable 5 <[email protected]>

get_chars now fills flat arrays in the FFI loop instead of building a dict + Bbox per character; coordinate transforms, page rotation, and surrogate filtering are vectorized over the whole page. Word deduplication computes break flags and word bboxes with numpy (isin/reduceat) and only loops over words. Span building runs over plain lists with float accumulators, and span/char dicts are materialized once at the end - and only when the caller actually needs chars (links, tables, keep_chars), which also replaces the worker-side _drop_chars pass. Declare numpy as a runtime dependency - it was already imported unconditionally (utils.py, tables.py) but missing from the published wheel metadata. Output is byte-identical (golden-diff on all fixtures, 40 tests, 40-doc real-world fuzz, workers paths). Benchmark: 0.58 s/doc vs 0.95 before (2.3x of pymupdf, from 3.9x), alignment unchanged at 98.41. Co-Authored-By: Claude Fable 5 <[email protected]>
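The `reduceat` step can be illustrated with a small sketch that takes per-character boxes and word-start flags and produces one bbox per word. The function name and array layout are assumptions; the actual implementation derives its break flags differently and handles more cases.

```python
import numpy as np


def word_bboxes(char_boxes, starts_word):
    """One [x0, y0, x1, y1] row per word.

    char_boxes:  (n, 4) array-like of character boxes
    starts_word: (n,) boolean array-like, True where a new word begins
    """
    boxes = np.asarray(char_boxes, dtype=float).reshape(-1, 4)
    if len(boxes) == 0:
        return np.empty((0, 4))
    starts = np.flatnonzero(np.asarray(starts_word, dtype=bool))
    if len(starts) == 0 or starts[0] != 0:
        # The first character always opens a word.
        starts = np.concatenate(([0], starts))
    # reduceat reduces each run boxes[starts[k]:starts[k + 1]] in one call.
    mins = np.minimum.reduceat(boxes[:, :2], starts, axis=0)
    maxs = np.maximum.reduceat(boxes[:, 2:], starts, axis=0)
    return np.hstack([mins, maxs])
```

Because the start indices are strictly increasing, every segment is non-empty, which is the case where `reduceat` behaves like a plain per-slice reduction.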

CJK round-trip tests (Chinese/Japanese/Korean) plus Cyrillic, Greek, and Vietnamese coverage. README documents what was verified: LTR scripts extract correctly; RTL (Arabic/Hebrew) comes back in pdfium's visual order since pdfium does no bidi reordering; complex-script and emoji fidelity depend on the PDF's ToUnicode map (extractor-independent - pymupdf produces identical output on broken maps). The numpy rewrite was verified character-identical to the previous implementation across ten scripts. Co-Authored-By: Claude Fable 5 <[email protected]>

Clean idle-machine 200-doc run: pdftext 0.69 s/doc (2.0x pymupdf, was 1.36/4x), alignment unchanged at 97.54. Update the pypdfium2-alone overhead estimate to ~1.5-2x. Co-Authored-By: Claude Fable 5 <[email protected]>

VikParuchuri and others added 15 commits June 10, 2026 11:41

Update benchmark table after numpy rewrite

3e54ade


VikParuchuri wants to merge 15 commits into master from dev
