Dev#56
Open
VikParuchuri wants to merge 15 commits into
Bundled pdfium moves 6462 -> 7869. Text extraction improves on math-heavy content (previously dropped chars in formulas); loose charboxes shift slightly due to upstream FontBBox bounding changes. Benchmark alignment vs pymupdf unchanged (98.4).

Keep the exact pin: every pypdfium2 minor swaps the pdfium binary and silently changes extraction output.

Co-Authored-By: Claude Fable 5 <[email protected]>
Close page and textpage handles in get_pages (including the first page handle when flattening re-fetches), close the document in a finally in _get_pages and around link extraction, and release page/annotation handles in get_links. Narrow the get_rotation excepts to PdfiumError.

Output is byte-identical (golden-diff verified).

Co-Authored-By: Claude Fable 5 <[email protected]>
- Add password support for encrypted PDFs (password= kwarg on all public functions, threaded through workers; --password CLI flag) and raise a clear PdfPasswordError instead of a cryptic pdfium failure
- Validate page ranges in the library (ValueError before forking workers) and CLI (click.BadParameter; fixes off-by-one that allowed p == doc_len)
- Replace validation asserts with ValueError (survives python -O)
- Narrow get_fontname except to PdfiumError, decode with errors=replace
- Surface worker crashes as RuntimeError with context (BrokenProcessPool)

Co-Authored-By: Claude Fable 5 <[email protected]>
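The page-range bullets can be illustrated with a small sketch. This is not the code from the PR: `validate_page_range` is a hypothetical helper name, and 0-based page indices are an assumption. It shows the two points the commit makes: an explicit `ValueError` survives `python -O` (which strips `assert`), and the upper bound must reject `p == doc_len`.

```python
from typing import Iterable, List, Optional


def validate_page_range(page_range: Optional[Iterable[int]], doc_len: int) -> List[int]:
    """Return validated 0-based page indices (hypothetical helper, not the PR's code)."""
    if page_range is None:
        # No range given: process every page.
        return list(range(doc_len))
    pages = list(page_range)
    for p in pages:
        # Valid indices are 0 .. doc_len - 1, so p == doc_len is out of range.
        # An explicit raise is used because `assert` is removed under python -O.
        if not isinstance(p, int) or p < 0 or p >= doc_len:
            raise ValueError(
                f"Page {p!r} is out of range for a document with {doc_len} pages"
            )
    return pages
```

Running a check like this before any worker pool is created means a bad range fails once in the parent process rather than inside a worker.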
- handle_hyphens no longer drops the final character (was masked by trailing newlines from merge_text)
- Spans always carry superscript/subscript keys, matching the Span TypedDict; link-reconstructed spans previously dropped the flags entirely
- Block dicts consistently include rotation
- Guard zero-division in Bbox.rescale and negative area in intersection_pct; validate img_size dims in table_cell_text

Co-Authored-By: Claude Fable 5 <[email protected]>
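To make the intersection guard concrete, here is a minimal sketch of a standalone function. The `[x0, y0, x1, y1]` list layout and the free-function signature are assumptions for illustration; the real `Bbox.intersection_pct` may differ. Without the clamp to zero, two disjoint boxes yield a negative width and a negative height whose product looks like a positive overlap.

```python
from typing import Sequence


def intersection_pct(a: Sequence[float], b: Sequence[float]) -> float:
    """Fraction of box `a` covered by box `b`; boxes are [x0, y0, x1, y1] (assumed layout)."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    if area_a <= 0:
        # Degenerate box: avoid dividing by zero.
        return 0.0
    # Clamp each overlap extent at zero so disjoint boxes cannot
    # produce a spurious positive product of two negative numbers.
    width = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    height = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    return (width * height) / area_a
```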
- Hoist pdfium FFI lookups to locals and reuse ctypes objects (FS_RECTF, c_doubles, font buffer) across the per-char loop instead of allocating per character
- Inline the charbox call (replicates pypdfium2 get_charbox semantics)
- Intern font dicts per page; font comparisons in span/word breaking use identity as the fast path
- Skip Bbox.rotate() on unrotated pages

Output is byte-identical (golden-diff verified). ~12% faster on a 65-page text-heavy PDF.

Co-Authored-By: Claude Fable 5 <[email protected]>
Accumulator bboxes (words, spans, lines, blocks) are now copied at creation and mutated with merge_inplace instead of allocating a new Bbox per character/span/line merged. Add __slots__ and a compact __reduce__ to Bbox (smaller worker pickles).

Output is byte-identical including char-level bboxes (golden-diff verified with keep_chars=True); workers=4 round-trip identical.

Co-Authored-By: Claude Fable 5 <[email protected]>
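A rough sketch of the pattern described above, under stated assumptions: this `Bbox` is a stripped-down stand-in with a single `bbox` list attribute, not the class from the PR. It shows `__slots__` (no per-instance `__dict__`), an in-place merge that avoids allocating a new object, and a `__reduce__` that pickles only the constructor arguments.

```python
class Bbox:
    """Simplified stand-in for illustration; attribute names are assumptions."""

    __slots__ = ("bbox",)  # no per-instance __dict__, so instances are smaller

    def __init__(self, bbox):
        # Copy at creation so an accumulator never aliases a source box.
        self.bbox = list(bbox)

    def merge_inplace(self, other: "Bbox") -> None:
        # Grow this box to cover `other` without allocating a new Bbox.
        b, o = self.bbox, other.bbox
        b[0] = min(b[0], o[0])
        b[1] = min(b[1], o[1])
        b[2] = max(b[2], o[2])
        b[3] = max(b[3], o[3])

    def __reduce__(self):
        # Pickle as "call Bbox(coords)": just the class reference and one list.
        return (Bbox, (self.bbox,))
```

With this shape, an accumulator is created once per word, span, line, or block and then widened as members are added, and `pickle.dumps` on an instance serializes only the coordinate list plus a reference to the class.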
- assign_scripts precomputes per-span geometry once per line and answers the any-other-span-above/below checks with top-2/bottom-2 aggregates instead of O(n^2) comparisons (was the single hottest function)
- postprocess_text folds special chars and ligatures into one str.translate table; control-char removal gets an ASCII translate fast path and a memoized fallback
- _reconstruct_spans drops per-char Bbox construction and per-char sort

Output is byte-identical (golden-diff verified).

Co-Authored-By: Claude Fable 5 <[email protected]>
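The top-2 aggregate trick from the first bullet can be sketched in isolation. This is an illustration of the general technique, not the `assign_scripts` code: `min_excluding_self` and `has_other_span_above` are hypothetical names, and the tolerance parameter is an assumption. Tracking the smallest and second-smallest value in one pass answers "what is the minimum over every *other* element?" for all elements in O(n).

```python
import math
from typing import List, Sequence


def min_excluding_self(values: Sequence[float]) -> List[float]:
    """For each index i, the minimum of values[j] over all j != i, in O(n)."""
    best, second, best_idx = math.inf, math.inf, -1
    for i, v in enumerate(values):
        if v < best:
            # New smallest value; the old smallest becomes the runner-up.
            best, second, best_idx = v, best, i
        elif v < second:
            second = v
    # The element holding the minimum falls back to the runner-up.
    return [second if i == best_idx else best for i in range(len(values))]


def has_other_span_above(tops: Sequence[float], tol: float = 0.0) -> List[bool]:
    """True where some other span starts higher (smaller y) than this one."""
    others = min_excluding_self(tops)
    return [other < top - tol for other, top in zip(others, tops)]
```

The "below" check is symmetric, using the largest and second-largest bottom coordinates.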
Plain-text output and dictionary_output(keep_chars=False, disable_links=True) never read span chars, so workers delete them before pickling results back, cutting IPC cost (~2x on the worker path for a 65-page PDF). Chars are kept whenever links or keep_chars need them.

Co-Authored-By: Claude Fable 5 <[email protected]>
Skips pypdfium2's per-call auto-cast (~6 calls/char). Output is byte-identical (golden-diff verified).

Note: batching char extraction via FPDFText_GetText was tested and rejected: its text-index space diverges from the char-index space on real PDFs (5096 value mismatches + an index shift on the adversarial fixture), so per-char FPDFText_GetUnicode stays.

Co-Authored-By: Claude Fable 5 <[email protected]>
- Generated fixtures (rotated, AES-encrypted, empty PDFs) built at test time with pymupdf; no binaries checked in
- New tests: rotated pages, encrypted PDFs (missing/wrong/correct password, through workers), empty pages, out-of-range page ranges, workers-vs-serial equivalence, links/refs with script flags, hyphen/postprocessing units, CLI validation paths, table input mismatch, repeated flatten
- Fix dead test (text_plain_text_output -> test_plain_text_output)
- CI: os matrix (ubuntu/windows/macos) x python (3.10/3.11/3.13), bump checkout/setup-python actions

34 tests passing, up from 8.

Co-Authored-By: Claude Fable 5 <[email protected]>
Table extraction (rotated pages were significantly broken):
- get_dynamic_gap_thresh measured the perpendicular axis for every rotation (branches shifted by one), making the dynamic threshold always negative and dead; now measures the text advance axis
- is_same_span rot-180 had an inverted sign (any gap merged) and two dimension typos normalizing x-coords by image height
- pdfium-injected \r\n chars no longer embed in rotated cell text
- table_output validates caller-supplied pages retain char data

Extraction correctness:
- Lone surrogates from broken ToUnicode CMaps are replaced with U+FFFD (previously crashed UTF-8 encoding of char-level JSON output)
- page["bbox"] for 90/270-rotated pages is now a valid display-space box instead of corner-reversed (x_start > x_end)
- get_lines rotation break compared radians against 45 (dead code); now breaks on roughly perpendicular text (45-135 degrees, circular), keeping pdfium's 180-degree negative-scale flips on the same line; vertical figure axis labels now separate from tick labels
- Bbox.rotate no longer aliases the source list at rotation 0

API consistency and robustness:
- refs key always present on pages (empty when disable_links=True)
- quote_loosebox exposed on plain-text functions
- File-like inputs fall back to serial instead of failing to pickle into worker initargs; empty trailing worker chunks dropped
- Settings env vars now namespaced (PDFTEXT_ prefix)
- CLI page-count check goes through _load_pdf for friendly password errors; worker_shutdown tolerant of teardown exceptions

README: document thread-unsafety, worker threshold, spawn main-guard.

Tests: 40 passing (6 new regression tests incl. a broken-CMap fixture).

Co-Authored-By: Claude Fable 5 <[email protected]>
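The lone-surrogate fix can be sketched as follows. This is a minimal illustration, not the code from the PR: `replace_lone_surrogates` is a hypothetical name, and it assumes each extracted character is already a single Python code point, so any code point in U+D800..U+DFFF is unpaired and cannot be encoded as UTF-8.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def replace_lone_surrogates(text: str) -> str:
    """Replace surrogate code points, which UTF-8 cannot encode, with U+FFFD."""
    return "".join(
        REPLACEMENT if 0xD800 <= ord(ch) <= 0xDFFF else ch
        for ch in text
    )
```

After this pass, `text.encode("utf-8")` no longer raises `UnicodeEncodeError` on output that came from a broken ToUnicode CMap.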
- Refresh benchmark table from a full 200-doc run on pypdfium2 5.9.0 (pymupdf 1.25.3, pdfplumber 0.11): pdftext 1.36 s/doc at 97.54 alignment
- Label benchmark times as per-doc (they always were; the header said per-page)
- Document --password on both CLI modes; correct span rotation units to radians; fix benchmark script path and add --pdftext_workers
- Python requirement 3.9+ -> 3.10+ to match the package constraint
- Re-measure the pypdfium2-alone comparison (~3x after the hot-loop optimizations shifted the balance toward grouping)
- Drop the stale scikit-learn credit and decision-tree description

Co-Authored-By: Claude Fable 5 <[email protected]>
get_chars now fills flat arrays in the FFI loop instead of building a dict + Bbox per character; coordinate transforms, page rotation, and surrogate filtering are vectorized over the whole page. Word deduplication computes break flags and word bboxes with numpy (isin/reduceat) and only loops over words. Span building runs over plain lists with float accumulators, and span/char dicts are materialized once at the end - and only when the caller actually needs chars (links, tables, keep_chars), which also replaces the worker-side _drop_chars pass.

Declare numpy as a runtime dependency - it was already imported unconditionally (utils.py, tables.py) but missing from the published wheel metadata.

Output is byte-identical (golden-diff on all fixtures, 40 tests, 40-doc real-world fuzz, workers paths).

Benchmark: 0.58 s/doc vs 0.95 before (2.3x of pymupdf, from 3.9x), alignment unchanged at 98.41.

Co-Authored-By: Claude Fable 5 <[email protected]>
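For readers unfamiliar with `reduceat`, here is a small sketch of how per-word bounding boxes can be computed from per-character boxes without a Python loop over characters. It is an illustration only: the `(n, 4)` array layout `[x0, y0, x1, y1]`, the `word_bboxes` name, and the `word_starts` input are assumptions, not the PR's data structures.

```python
import numpy as np


def word_bboxes(char_boxes: np.ndarray, word_starts: np.ndarray) -> np.ndarray:
    """Union of char boxes per word.

    char_boxes:  (n, 4) float array of [x0, y0, x1, y1] (assumed layout).
    word_starts: strictly increasing indices of each word's first char,
                 beginning with 0.
    """
    # reduceat reduces each slice [start_k, start_{k+1}) along axis 0,
    # with the last slice running to the end of the array.
    mins = np.minimum.reduceat(char_boxes[:, :2], word_starts, axis=0)
    maxs = np.maximum.reduceat(char_boxes[:, 2:], word_starts, axis=0)
    return np.hstack([mins, maxs])


chars = np.array([
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 0.0, 2.0, 1.5],
    [3.0, 0.0, 4.0, 1.0],
])
# Two words: chars 0-1 and char 2.
boxes = word_bboxes(chars, np.array([0, 2]))
```

Here `boxes` has one row per word, so any remaining Python-level work scales with the number of words rather than the number of characters.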
CJK round-trip tests (Chinese/Japanese/Korean) plus Cyrillic, Greek, and Vietnamese coverage. README documents what was verified: LTR scripts extract correctly; RTL (Arabic/Hebrew) comes back in pdfium's visual order since pdfium does no bidi reordering; complex-script and emoji fidelity depend on the PDF's ToUnicode map (extractor-independent - pymupdf produces identical output on broken maps).

The numpy rewrite was verified character-identical to the previous implementation across ten scripts.

Co-Authored-By: Claude Fable 5 <[email protected]>
Clean idle-machine 200-doc run: pdftext 0.69 s/doc (2.0x pymupdf, was 1.36/4x), alignment unchanged at 97.54. Update the pypdfium2-alone overhead estimate to ~1.5-2x.

Co-Authored-By: Claude Fable 5 <[email protected]>
No description provided.