feat: Visual parser spike #1990
Draft
Wallgau wants to merge 3 commits into
…cument is parsed and indexed as it ingests, instead of waiting blindly for it to land in Knowledge. After a preview-mode ingest, a dialog shows:

- **Parsed layout**: Docling's output rendered two ways:
  - paged formats (PDF/images): the real page raster with Docling's bounding-box overlay
  - office/non-paged formats (docx, pptx, xlsx, html, csv, md…): each region Docling detected, rendered as a colored, labeled block (Title / Heading / List / Table / Figure), since these formats have no page raster to draw boxes on
- **Search index**: a live step pipeline (layout parsed → chunks created → embeddings → stored in OpenSearch), driven by the existing `/tasks/enhanced` data, with failures landing on the right step.

The preview is **opt-in and ephemeral**: results live in an in-memory TTL cache, and the UI clearly communicates when the live window has ended (the document remains indexed and searchable).
    if (file.type.startsWith("image/") && objectUrl) {
      return (
        <img
          src={objectUrl}
67a0d1a to b0ff919
    // won't display PDFs inside an <iframe>.
    return (
      <object
        data={objectUrl}
Summary
Adds an ephemeral parse preview so users can see exactly how a document is parsed and indexed as it ingests, instead of waiting blindly for it to land in Knowledge.
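After a preview-mode ingest, a dialog shows:

- **Parsed layout**: Docling's output rendered two ways:
  - paged formats (PDF/images): the real page raster with Docling's bounding-box overlay
  - office/non-paged formats (docx, pptx, xlsx, html, csv, md…): each region Docling detected, rendered as a colored, labeled block (Title / Heading / List / Table / Figure), since these formats have no page raster to draw boxes on
- **Search index**: a live step pipeline (layout parsed → chunks created → embeddings → stored in OpenSearch), driven by the existing `/tasks/enhanced` data, with failures landing on the right step.

The preview is **opt-in and ephemeral**: results live in an in-memory TTL cache, and the UI clearly communicates when the live window has ended (the document remains indexed and searchable).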
How it works
- `IngestPreviewService`: in-memory TTL cache (single-worker safe), keyed per `(user_id, task_id, file_path)` so multi-file and folder uploads each hold their own parse preview.
- … `image_export_mode=embedded`, `include_page_images`, `include_images`); on success the `DoclingDocument` JSON is cached.
- `GET /api/ingest/preview/{task_id}/docling?file=…` → cached Docling JSON
- `GET /api/ingest/preview/{task_id}/index-proof?file=…` → chunk/embedding proof

Entry points
Scope & safeguards
- … `MAX_PREVIEWS_PER_TASK`) so large folder/connector syncs can't grow the cache without bound.
- … `MAX_PREVIEW_POLLS`) so a file that never produces a preview can't poll forever.
- … `File` object, so they show a spinner until Docling returns (no local pre-parse preview); binary office formats likewise show a placeholder until parsing finishes.

Out of scope (follow-ups)
Testing
- … `IngestPreviewService` cache (store/get, per-file, TTL expiry, per-task cap), preview endpoints, Docling preview options, langflow preview caching, router/task-service preview wiring.
- … `ruff` clean on touched files.
- … `biome` clean on touched files; `tsc` shows only pre-existing errors (`@docling/docling-components` types, test-runner globals).

Notes for reviewers
- … `@docling/docling-components` (renders the page-image box overlay for paged formats).
- `IngestPreviewService` is per-process (`TTLCache`), consistent with the repo's single-worker constraint; a Redis-backed cache would be required to scale horizontally.
- `docling_service.fetch_task_result` is unchanged in shape (an earlier HTML-export path was dropped to keep the blast radius small).
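
For reviewers who want a mental model of the per-process cache, here is a minimal sketch of the idea: entries keyed per `(user_id, task_id, file_path)`, a TTL on each entry, and a per-task cap. It is not the PR's implementation. The class name `PreviewCacheSketch`, its method names, the default TTL and cap values, and the choice to reject a store once the cap is reached are all assumptions made for illustration, and it uses only the standard library rather than the `TTLCache` the PR mentions.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

# Cache key: one preview per (user_id, task_id, file_path).
Key = Tuple[str, str, str]


class PreviewCacheSketch:
    """Hypothetical stand-in for a per-process preview cache (illustration only)."""

    def __init__(
        self,
        ttl_seconds: float = 600.0,  # assumed default, not taken from the PR
        max_per_task: int = 20,      # assumed default, not taken from the PR
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._max_per_task = max_per_task
        self._clock = clock
        # Maps key -> (expiry deadline, cached payload).
        self._entries: Dict[Key, Tuple[float, Any]] = {}

    def _purge_expired(self) -> None:
        """Drop every entry whose deadline has passed."""
        now = self._clock()
        expired = [k for k, (deadline, _) in self._entries.items() if deadline <= now]
        for key in expired:
            del self._entries[key]

    def store(self, user_id: str, task_id: str, file_path: str, payload: Any) -> bool:
        """Cache a preview; return False if the task already holds its cap."""
        self._purge_expired()
        key = (user_id, task_id, file_path)
        if key not in self._entries:
            held = sum(1 for (u, t, _) in self._entries if (u, t) == (user_id, task_id))
            if held >= self._max_per_task:
                return False  # assumption: over-cap previews are simply skipped
        self._entries[key] = (self._clock() + self._ttl, payload)
        return True

    def get(self, user_id: str, task_id: str, file_path: str) -> Optional[Any]:
        """Return the cached preview, or None if missing or expired."""
        self._purge_expired()
        entry = self._entries.get((user_id, task_id, file_path))
        return entry[1] if entry is not None else None
```

Because the state lives in a plain dictionary inside one process, two workers would each see only their own previews, which is why the single-worker constraint matters and why a shared store would be needed to scale horizontally. The injectable `clock` is only there so that TTL expiry can be tested without sleeping.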