feat: Visual parser spike #1990

Draft
Wallgau wants to merge 3 commits into main from visual-parser-spike

Wallgau commented Jun 30, 2026

Collaborator

Summary

Adds an ephemeral parse preview so users can see exactly how a document is parsed and indexed as it ingests, instead of waiting blindly for it to land in Knowledge.

After a preview-mode ingest, a dialog shows:

  • Parsed layout — Docling's output rendered two ways:
    • paged formats (PDF/images): the real page raster with Docling's bounding-box overlay
    • office/non-paged formats (docx, pptx, xlsx, html, csv, md…): each region Docling detected, rendered as a colored, labeled block (Title / Heading / List / Table / Figure), since these formats have no page raster to draw boxes on
  • Search index — a live step pipeline (layout parsed → chunks created → embeddings → stored in OpenSearch), driven by the existing /tasks/enhanced data, with failures landing on the right step.

The preview is opt-in and ephemeral: results live in an in-memory TTL cache and the UI clearly communicates when the live window has ended (the document remains indexed and searchable).
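
A minimal sketch of how the "Search index" step pipeline could derive per-step states so that a failure lands on the right step. The step names come from the description above; the function name, the `completed`/`failed` inputs, and the state labels are hypothetical, since the shape of the /tasks/enhanced payload is not shown in this PR description.

```python
# Ordered pipeline steps, as listed in the PR description.
STEPS = ["layout parsed", "chunks created", "embeddings", "stored in OpenSearch"]


def step_states(completed: int, failed: bool = False) -> list[tuple[str, str]]:
    """Return (step, state) pairs for the pipeline.

    `completed` is the number of steps that finished successfully (assumed
    input). The first unfinished step is "failed" if the task failed,
    otherwise "active"; every later step is "pending".
    """
    states = []
    for index, step in enumerate(STEPS):
        if index < completed:
            state = "done"
        elif index == completed:
            state = "failed" if failed else "active"
        else:
            state = "pending"
        states.append((step, state))
    return states
```

For example, a task that failed after chunking would be passed as `step_states(2, failed=True)`, which marks "embeddings" as the failed step and leaves "stored in OpenSearch" pending.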

How it works

  • IngestPreviewService — in-memory TTL cache (single-worker safe), keyed per (user_id, task_id, file_path) so multi-file and folder uploads each hold their own parse preview.
  • Docling is invoked in preview mode (image_export_mode=embedded, include_page_images, include_images); on success the DoclingDocument JSON is cached.
  • Two new endpoints:
    • GET /api/ingest/preview/{task_id}/docling?file=… → cached Docling JSON
    • GET /api/ingest/preview/{task_id}/index-proof?file=… → chunk/embedding proof
  • Frontend polls per-file (bounded), renders boxes (paged) or labeled blocks (non-paged), and falls back to a browser-native preview before Docling returns.
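
The cache behaviour described above can be sketched roughly as follows. This is not the PR's IngestPreviewService: it is a standard-library illustration of a per-process TTL cache keyed by (user_id, task_id, file_path) with a per-task cap. The class name, method names, and default values are assumptions.

```python
import time
from typing import Any, Callable, Optional

PreviewKey = tuple[str, str, str]  # (user_id, task_id, file_path)


class PreviewCache:
    """Hypothetical in-memory TTL cache; safe only within a single process."""

    def __init__(
        self,
        ttl_seconds: float = 600.0,  # assumed default
        max_per_task: int = 20,      # stands in for MAX_PREVIEWS_PER_TASK
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._max_per_task = max_per_task
        self._clock = clock
        self._entries: dict[PreviewKey, tuple[float, Any]] = {}

    def _purge_expired(self) -> None:
        # Drop every entry whose expiry time has passed.
        now = self._clock()
        for key in [k for k, (expires, _) in self._entries.items() if expires <= now]:
            del self._entries[key]

    def store(self, user_id: str, task_id: str, file_path: str, doc: Any) -> bool:
        """Cache a parsed document; return False if the task is at its cap."""
        self._purge_expired()
        key = (user_id, task_id, file_path)
        in_task = sum(1 for (u, t, _) in self._entries if (u, t) == (user_id, task_id))
        if key not in self._entries and in_task >= self._max_per_task:
            return False
        self._entries[key] = (self._clock() + self._ttl, doc)
        return True

    def get(self, user_id: str, task_id: str, file_path: str) -> Optional[Any]:
        """Return the cached document, or None if missing or expired."""
        self._purge_expired()
        entry = self._entries.get((user_id, task_id, file_path))
        return entry[1] if entry is not None else None
```

Because the key includes file_path, each file in a multi-file or folder upload holds its own entry, and a `None` from `get` is what would let the UI report that the live window has ended.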

Entry points

  • Onboarding upload
  • Knowledge dropdown (single + multi-file / folder)
  • Cloud picker (Google Drive / OneDrive / SharePoint)
  • Connector file browser — explicit file selection only (bucket-level and sync-all intentionally excluded to avoid bulk preview load)

Scope & safeguards

  • Entire feature gated behind the ingest-preview run-mode flag.
  • Per-task preview cache cap (MAX_PREVIEWS_PER_TASK) so large folder/connector syncs can't grow the cache unbounded.
  • Bounded client polling (MAX_PREVIEW_POLLS) so a file that never produces a preview can't poll forever.
  • Connector files have no browser File object, so they show a spinner until Docling returns (no local pre-parse preview); binary office formats likewise show a placeholder until parsing finishes.
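
The bounded-polling safeguard can be illustrated with a short sketch. The actual polling lives in the frontend; Python is used here only to show the idea. MAX_PREVIEW_POLLS is the name used in this PR, but its value, the `poll_preview` helper, and the `fetch` callback are assumptions.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

MAX_PREVIEW_POLLS = 30  # name from the PR; the value here is assumed


def poll_preview(
    fetch: Callable[[], Optional[T]],
    max_polls: int = MAX_PREVIEW_POLLS,
    interval_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call `fetch` until it returns a preview or the poll budget runs out.

    `fetch` is a hypothetical callback that returns None while the preview
    is not ready. Returns None once `max_polls` attempts have been made.
    """
    for attempt in range(max_polls):
        result = fetch()
        if result is not None:
            return result
        if attempt < max_polls - 1:
            sleep(interval_seconds)  # wait before the next attempt
    return None
```

A caller would treat a `None` result as "no preview available" and stop showing the spinner, rather than polling forever for a file that never produces a preview.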

Out of scope (follow-ups)

  • User-facing parsing/chunking customization (re-chunk against the cached parse, structure-aware chunking, presets).
  • Splitting the connector-preview and knowledge-dropdown entry points into separate PRs if reviewers prefer smaller units.

Testing

  • Tests cover the IngestPreviewService cache (store/get, per-file, TTL expiry, per-task cap), the preview endpoints, Docling preview options, Langflow preview caching, and router/task-service preview wiring.
  • Backend: affected unit tests pass; ruff clean on touched files.
  • Frontend: biome clean on touched files; tsc shows only pre-existing errors (@docling/docling-components types, test-runner globals).

Notes for reviewers

  • New dependency: @docling/docling-components (renders the page-image box overlay for paged formats).
  • IngestPreviewService is per-process (TTLCache), consistent with the repo's single-worker constraint; a Redis-backed cache would be required to scale horizontally.
  • docling_service.fetch_task_result is unchanged in shape (an earlier HTML-export path was dropped to keep the blast radius small).

coderabbitai Bot commented Jun 30, 2026

Contributor

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2cb11865-f967-4a31-a19d-777fdc17662a

github-actions Bot added the frontend 🟨, backend 🔷, tests, and enhancement 🔵 labels and removed the enhancement 🔵 label Jun 30, 2026
Comment thread frontend/components/ingest-preview.tsx Fixed
if (file.type.startsWith("image/") && objectUrl) {
return (
<img
src={objectUrl}
@Wallgau Wallgau force-pushed the visual-parser-spike branch from 67a0d1a to b0ff919 Compare June 30, 2026 18:02
github-actions Bot added the enhancement 🔵 label and removed the enhancement 🔵 label Jun 30, 2026
github-actions Bot added the enhancement 🔵 label and removed the enhancement 🔵 label Jun 30, 2026
// won't display PDFs inside an <iframe>.
return (
<object
data={objectUrl}
Labels

  • backend 🔷: Issues related to backend services (OpenSearch, Langflow, APIs)
  • enhancement 🔵: New feature or request
  • frontend 🟨: Issues related to the UI/UX
  • tests

2 participants