Dev#56

Open
VikParuchuri wants to merge 15 commits into master from dev
@VikParuchuri

No description provided.

VikParuchuri and others added 15 commits June 10, 2026 11:41
Bundled pdfium moves 6462 -> 7869. Text extraction improves on math-heavy
content (previously dropped chars in formulas); loose charboxes shift
slightly due to upstream FontBBox bounding changes. Benchmark alignment
vs pymupdf unchanged (98.4). Keep the exact pin: every pypdfium2 minor
swaps the pdfium binary and silently changes extraction output.

Co-Authored-By: Claude Fable 5 <[email protected]>
Close page and textpage handles in get_pages (including the first page
handle when flattening re-fetches), close the document in a finally in
_get_pages and around link extraction, and release page/annotation
handles in get_links. Narrow the get_rotation excepts to PdfiumError.
Output is byte-identical (golden-diff verified).

Co-Authored-By: Claude Fable 5 <[email protected]>
- Add password support for encrypted PDFs (password= kwarg on all public
  functions, threaded through workers; --password CLI flag) and raise a
  clear PdfPasswordError instead of a cryptic pdfium failure
- Validate page ranges in the library (ValueError before forking workers)
  and CLI (click.BadParameter; fixes off-by-one that allowed p == doc_len)
- Replace validation asserts with ValueError (survives python -O)
- Narrow get_fontname except to PdfiumError, decode with errors=replace
- Surface worker crashes as RuntimeError with context (BrokenProcessPool)

Co-Authored-By: Claude Fable 5 <[email protected]>
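
The page-range validation described in the commit above could look roughly like the sketch below. The function name `validate_page_range` and its signature are hypothetical and not taken from the pdftext source. It only illustrates why a `ValueError` is preferable to an `assert` (asserts are stripped under `python -O`) and where the `p == doc_len` off-by-one sits.

```python
def validate_page_range(page_range, doc_len):
    """Return the requested pages as a list, or raise ValueError.

    Hypothetical helper: pages are assumed to be zero-based indices.
    """
    pages = list(page_range)
    for p in pages:
        # Valid indices are 0 .. doc_len - 1, so p == doc_len must be rejected.
        if not isinstance(p, int) or p < 0 or p >= doc_len:
            raise ValueError(
                f"Page {p!r} is out of range for a document with {doc_len} pages"
            )
    return pages
```

Running the check before any worker processes are started means a bad range fails fast in the caller rather than surfacing later as a worker crash.
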
- handle_hyphens no longer drops the final character (was masked by
  trailing newlines from merge_text)
- Spans always carry superscript/subscript keys, matching the Span
  TypedDict; link-reconstructed spans previously dropped the flags
  entirely
- Block dicts consistently include rotation
- Guard zero-division in Bbox.rescale and negative area in
  intersection_pct; validate img_size dims in table_cell_text

Co-Authored-By: Claude Fable 5 <[email protected]>
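
As an illustration of the two geometry guards in the commit above, here is a minimal sketch on plain `[x0, y0, x1, y1]` lists. These free functions are assumptions for the example. The real `Bbox.rescale` and `intersection_pct` may differ in signature and in how they handle the degenerate cases.

```python
def intersection_pct(a, b):
    """Fraction of box `a` covered by box `b`; boxes are [x0, y0, x1, y1]."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    if area_a <= 0:
        return 0.0  # degenerate box: nothing to cover
    # Clamp each overlap at zero. Without this, two negative overlaps
    # (boxes disjoint on both axes) multiply into a positive "area".
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    return (w * h) / area_a


def rescale(bbox, old_size, new_size):
    """Scale a box from an old (width, height) to a new (width, height)."""
    old_w, old_h = old_size
    if old_w <= 0 or old_h <= 0:
        # Guard the division below instead of raising ZeroDivisionError.
        raise ValueError(f"Cannot rescale from non-positive size {old_size}")
    sx, sy = new_size[0] / old_w, new_size[1] / old_h
    return [bbox[0] * sx, bbox[1] * sy, bbox[2] * sx, bbox[3] * sy]
```
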
- Hoist pdfium FFI lookups to locals and reuse ctypes objects
  (FS_RECTF, c_doubles, font buffer) across the per-char loop instead
  of allocating per character
- Inline the charbox call (replicates pypdfium2 get_charbox semantics)
- Intern font dicts per page; font comparisons in span/word breaking
  use identity as the fast path
- Skip Bbox.rotate() on unrotated pages

Output is byte-identical (golden-diff verified). ~12% faster on a
65-page text-heavy PDF.

Co-Authored-By: Claude Fable 5 <[email protected]>
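
The font interning mentioned above can be sketched as follows. The dict keys (`name`, `size`, `weight`, `flags`) and both function names are assumptions chosen for illustration, not the actual pdftext structures.

```python
def intern_font(cache, font):
    """Return one shared dict per distinct font seen on a page.

    `cache` is a per-page dict; the key fields are hypothetical.
    """
    key = (font["name"], font["size"], font["weight"], font["flags"])
    # setdefault stores the first dict seen for this key and returns it
    # for every later lookup, so equal fonts share one object.
    return cache.setdefault(key, font)


def same_font(a, b):
    # Identity is the fast path for interned dicts; equality is the
    # fallback for dicts that did not go through intern_font.
    return a is b or a == b
```

With every character on a page pointing at the same interned dict, the span and word breaking checks usually resolve with a single pointer comparison rather than a field-by-field dict comparison.
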
Accumulator bboxes (words, spans, lines, blocks) are now copied at
creation and mutated with merge_inplace instead of allocating a new
Bbox per character/span/line merged. Add __slots__ and a compact
__reduce__ to Bbox (smaller worker pickles).

Output is byte-identical including char-level bboxes (golden-diff
verified with keep_chars=True); workers=4 round-trip identical.

Co-Authored-By: Claude Fable 5 <[email protected]>
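
A simplified stand-in for the accumulator pattern described above is shown below. This `Bbox` class is a sketch under assumptions (a single `bbox` list attribute, a `copy` method) and is not the pdftext class. It only shows how `__slots__`, `merge_inplace`, and a compact `__reduce__` fit together.

```python
class Bbox:
    # No per-instance __dict__: smaller objects and smaller pickles.
    __slots__ = ("bbox",)

    def __init__(self, bbox):
        self.bbox = bbox  # [x0, y0, x1, y1]

    def copy(self):
        # Copy the list so an accumulator never aliases its first member.
        return Bbox(list(self.bbox))

    def merge_inplace(self, other):
        """Grow this box to cover `other` without allocating a new Bbox."""
        b, o = self.bbox, other.bbox
        if o[0] < b[0]:
            b[0] = o[0]
        if o[1] < b[1]:
            b[1] = o[1]
        if o[2] > b[2]:
            b[2] = o[2]
        if o[3] > b[3]:
            b[3] = o[3]

    def __reduce__(self):
        # Pickle as "call Bbox with this one list" instead of a state dict.
        return (Bbox, (self.bbox,))
```

A word accumulator would start as `first_char_bbox.copy()` and then call `merge_inplace` for each further character, so the source character box is left untouched.
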
- assign_scripts precomputes per-span geometry once per line and answers
  the any-other-span-above/below checks with top-2/bottom-2 aggregates
  instead of O(n^2) comparisons (was the single hottest function)
- postprocess_text folds special chars and ligatures into one
  str.translate table; control-char removal gets an ASCII translate
  fast path and a memoized fallback
- _reconstruct_spans drops per-char Bbox construction and per-char sort

Output is byte-identical (golden-diff verified).

Co-Authored-By: Claude Fable 5 <[email protected]>
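
The top-2 aggregate idea from the `assign_scripts` change can be sketched in isolation. The helper names below are hypothetical, and the sketch assumes a coordinate system where a smaller `top` value means higher on the page. The actual script-detection rules are not reproduced here, only the trick of answering "does any other span satisfy X" in linear time.

```python
def min_excluding_self(values):
    """For each i, return min(values[j] for j != i) in O(n).

    Tracks the smallest and second-smallest values in one pass; an
    element only needs the runner-up when it is itself the minimum.
    """
    best = second = float("inf")
    best_idx = -1
    for i, v in enumerate(values):
        if v < best:
            second, best, best_idx = best, v, i
        elif v < second:
            second = v
    return [second if i == best_idx else best for i in range(len(values))]


def has_other_span_above(tops, tol=0.0):
    """True at index i if some other span's top is above span i's top."""
    others = min_excluding_self(tops)
    return [other < top - tol for top, other in zip(tops, others)]
```

The symmetric "any other span below" check would use the two largest bottoms in the same way.
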
Plain-text output and dictionary_output(keep_chars=False,
disable_links=True) never read span chars, so workers delete them
before pickling results back, cutting IPC cost (~2x on the worker
path for a 65-page PDF). Chars are kept whenever links or keep_chars
need them.

Co-Authored-By: Claude Fable 5 <[email protected]>
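
To illustrate why dropping chars before pickling reduces IPC cost, the sketch below strips the `chars` key from every span of a nested page structure. The `pages -> blocks -> lines -> spans` layout and the function name `drop_chars` are assumptions made for this example.

```python
import pickle


def drop_chars(pages):
    """Remove per-character data from every span, in place."""
    for page in pages:
        for block in page["blocks"]:
            for line in block["lines"]:
                for span in line["spans"]:
                    # pop with a default tolerates spans without chars.
                    span.pop("chars", None)
    return pages


def pickled_size(obj):
    """Size in bytes of the pickle a worker would send back."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))
```

Comparing `pickled_size(pages)` before and after `drop_chars(pages)` on a real result gives a rough measure of the payload saved per worker round trip.
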
Skips pypdfium2's per-call auto-cast (~6 calls/char). Output is
byte-identical (golden-diff verified).

Note: batching char extraction via FPDFText_GetText was tested and
rejected — its text-index space diverges from the char-index space on
real PDFs (5096 value mismatches + an index shift on the adversarial
fixture), so per-char FPDFText_GetUnicode stays.

Co-Authored-By: Claude Fable 5 <[email protected]>
- Generated fixtures (rotated, AES-encrypted, empty PDFs) built at test
  time with pymupdf; no binaries checked in
- New tests: rotated pages, encrypted PDFs (missing/wrong/correct
  password, through workers), empty pages, out-of-range page ranges,
  workers-vs-serial equivalence, links/refs with script flags,
  hyphen/postprocessing units, CLI validation paths, table input
  mismatch, repeated flatten
- Fix dead test (text_plain_text_output -> test_plain_text_output)
- CI: os matrix (ubuntu/windows/macos) x python (3.10/3.11/3.13),
  bump checkout/setup-python actions

34 tests passing, up from 8.

Co-Authored-By: Claude Fable 5 <[email protected]>
Table extraction (rotated pages were significantly broken):
- get_dynamic_gap_thresh measured the perpendicular axis for every
  rotation (branches shifted by one), making the dynamic threshold
  always negative and dead; now measures the text advance axis
- is_same_span rot-180 had an inverted sign (any gap merged) and two
  dimension typos normalizing x-coords by image height
- pdfium-injected \r\n chars no longer embed in rotated cell text
- table_output validates caller-supplied pages retain char data

Extraction correctness:
- Lone surrogates from broken ToUnicode CMaps are replaced with U+FFFD
  (previously crashed UTF-8 encoding of char-level JSON output)
- page["bbox"] for 90/270-rotated pages is now a valid display-space
  box instead of corner-reversed (x_start > x_end)
- get_lines rotation break compared radians against 45 (dead code);
  now breaks on roughly perpendicular text (45-135 degrees, circular),
  keeping pdfium's 180-degree negative-scale flips on the same line —
  vertical figure axis labels now separate from tick labels
- Bbox.rotate no longer aliases the source list at rotation 0
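
The lone-surrogate fix above can be illustrated with a small standalone sketch. The function name is hypothetical. In a Python `str`, any code point in the range U+D800 to U+DFFF is by definition unpaired, and encoding it as UTF-8 raises `UnicodeEncodeError`.

```python
import re

# Any surrogate code point inside a Python str is unpaired, because
# characters outside the BMP are stored as single code points.
_LONE_SURROGATE = re.compile(r"[\ud800-\udfff]")


def replace_lone_surrogates(text):
    """Replace unpaired surrogates with U+FFFD so the text encodes as UTF-8."""
    return _LONE_SURROGATE.sub("\ufffd", text)
```

After the replacement, `text.encode("utf-8")` succeeds, while valid characters outside the BMP are left unchanged.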

API consistency and robustness:
- refs key always present on pages (empty when disable_links=True)
- quote_loosebox exposed on plain-text functions
- File-like inputs fall back to serial instead of failing to pickle
  into worker initargs; empty trailing worker chunks dropped
- Settings env vars now namespaced (PDFTEXT_ prefix)
- CLI page-count check goes through _load_pdf for friendly password
  errors; worker_shutdown tolerant of teardown exceptions

README: document thread-unsafety, worker threshold, spawn main-guard.
Tests: 40 passing (6 new regression tests incl. a broken-CMap fixture).

Co-Authored-By: Claude Fable 5 <[email protected]>
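
The `get_lines` rotation break in the commit above compares angles circularly. A minimal sketch of such a check, assuming rotations are given in radians and using a hypothetical function name, could be:

```python
import math


def is_roughly_perpendicular(angle_a, angle_b):
    """True when two text directions differ by 45-135 degrees.

    Angles are in radians. The difference is taken modulo pi, so a
    180-degree flip counts as the same direction and does not break.
    """
    diff = abs(angle_a - angle_b) % math.pi
    return math.pi / 4 <= diff <= 3 * math.pi / 4
```

Comparing a radian value against the constant `45` can never succeed for real rotations, which is why the original check was dead code. Expressing the thresholds as `pi / 4` and `3 * pi / 4` keeps the units consistent.
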
- Refresh benchmark table from a full 200-doc run on pypdfium2 5.9.0
  (pymupdf 1.25.3, pdfplumber 0.11): pdftext 1.36 s/doc at 97.54
  alignment
- Label benchmark times as per-doc (they always were; the header said
  per-page)
- Document --password on both CLI modes; correct span rotation units
  to radians; fix benchmark script path and add --pdftext_workers
- Python requirement 3.9+ -> 3.10+ to match the package constraint
- Re-measure the pypdfium2-alone comparison (~3x after the hot-loop
  optimizations shifted the balance toward grouping)
- Drop the stale scikit-learn credit and decision-tree description

Co-Authored-By: Claude Fable 5 <[email protected]>
get_chars now fills flat arrays in the FFI loop instead of building a
dict + Bbox per character; coordinate transforms, page rotation, and
surrogate filtering are vectorized over the whole page. Word
deduplication computes break flags and word bboxes with numpy
(isin/reduceat) and only loops over words. Span building runs over
plain lists with float accumulators, and span/char dicts are
materialized once at the end - and only when the caller actually
needs chars (links, tables, keep_chars), which also replaces the
worker-side _drop_chars pass.

Declare numpy as a runtime dependency - it was already imported
unconditionally (utils.py, tables.py) but missing from the published
wheel metadata.

Output is byte-identical (golden-diff on all fixtures, 40 tests,
40-doc real-world fuzz, workers paths). Benchmark: 0.58 s/doc vs
0.95 before (2.3x of pymupdf, from 3.9x), alignment unchanged at
98.41.

Co-Authored-By: Claude Fable 5 <[email protected]>
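
The `reduceat` technique mentioned above can be shown on its own. The sketch assumes per-character coordinates stored as four flat arrays and a boolean flag marking the first character of each word. The function name and argument layout are illustrative, not the pdftext internals.

```python
import numpy as np


def word_bboxes(x0, y0, x1, y1, is_break):
    """Compute one bbox per word from flat per-char coordinate arrays.

    is_break[i] is True when char i starts a new word. Returns an
    array of shape (n_words, 4) holding [x0, y0, x1, y1] per word.
    """
    is_break = np.asarray(is_break, dtype=bool).copy()
    if is_break.size == 0:
        return np.empty((0, 4))
    is_break[0] = True  # the first char always opens a word
    starts = np.flatnonzero(is_break)
    # reduceat reduces each slice [starts[k], starts[k + 1]) in one call,
    # with the last slice running to the end of the array.
    return np.column_stack([
        np.minimum.reduceat(np.asarray(x0, dtype=float), starts),
        np.minimum.reduceat(np.asarray(y0, dtype=float), starts),
        np.maximum.reduceat(np.asarray(x1, dtype=float), starts),
        np.maximum.reduceat(np.asarray(y1, dtype=float), starts),
    ])
```

The Python-level loop then only has to run once per word rather than once per character.
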
CJK round-trip tests (Chinese/Japanese/Korean) plus Cyrillic, Greek,
and Vietnamese coverage. README documents what was verified: LTR
scripts extract correctly; RTL (Arabic/Hebrew) comes back in pdfium's
visual order since pdfium does no bidi reordering; complex-script and
emoji fidelity depend on the PDF's ToUnicode map (extractor-
independent - pymupdf produces identical output on broken maps).

The numpy rewrite was verified character-identical to the previous
implementation across ten scripts.

Co-Authored-By: Claude Fable 5 <[email protected]>
Clean idle-machine 200-doc run: pdftext 0.69 s/doc (2.0x pymupdf, was
1.36/4x), alignment unchanged at 97.54. Update the pypdfium2-alone
overhead estimate to ~1.5-2x.

Co-Authored-By: Claude Fable 5 <[email protected]>