feat: Visual parser spike by Wallgau · Pull Request #1990 · langflow-ai/openrag

Wallgau · 2026-06-30T15:47:06Z

Summary

Adds an ephemeral parse preview so users can see exactly how a document is parsed and indexed as it ingests, instead of waiting blindly for it to land in Knowledge.

After a preview-mode ingest, a dialog shows:

Parsed layout — Docling's output rendered two ways:
- paged formats (PDF/images): the real page raster with Docling's bounding-box overlay
- office/non-paged formats (docx, pptx, xlsx, html, csv, md…): each region Docling detected, rendered as a colored, labeled block (Title / Heading / List / Table / Figure), since these formats have no page raster to draw boxes on
Search index — a live step pipeline (layout parsed → chunks created → embeddings → stored in OpenSearch), driven by the existing /tasks/enhanced data, with failures landing on the right step.
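One way to picture "failures landing on the right step" is the sketch below. It is illustrative only: the PR does not show the shape of the /tasks/enhanced payload, so the `completed` and `failed` inputs and the status names are assumptions, not the actual frontend logic.

```python
# Hypothetical sketch: derive a status per pipeline step.
# Step names follow the PR description; inputs and status labels are assumed.
STEPS = ["layout parsed", "chunks created", "embeddings", "stored in OpenSearch"]


def step_statuses(completed: int, failed: bool = False) -> list[tuple[str, str]]:
    """Return (step, status) pairs given how many steps have finished.

    A failure is attributed to the first step that has not finished.
    """
    completed = max(0, min(completed, len(STEPS)))
    statuses = []
    for index, step in enumerate(STEPS):
        if index < completed:
            status = "done"
        elif index == completed:
            status = "failed" if failed else "active"
        else:
            status = "pending"
        statuses.append((step, status))
    return statuses
```

For example, a task that failed after chunking would mark "embeddings" as failed and leave "stored in OpenSearch" pending.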

The preview is opt-in and ephemeral: results live in an in-memory TTL cache and the UI clearly communicates when the live window has ended (the document remains indexed and searchable).

How it works

IngestPreviewService — in-memory TTL cache (single-worker safe), keyed per (user_id, task_id, file_path) so multi-file and folder uploads each hold their own parse preview.
Docling is invoked in preview mode (image_export_mode=embedded, include_page_images, include_images); on success the DoclingDocument JSON is cached.
Two new endpoints:
- GET /api/ingest/preview/{task_id}/docling?file=… → cached Docling JSON
- GET /api/ingest/preview/{task_id}/index-proof?file=… → chunk/embedding proof
Frontend polls per-file (bounded), renders boxes (paged) or labeled blocks (non-paged), and falls back to a browser-native preview before Docling returns.
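The per-file keying described above can be sketched roughly as follows. This is a stdlib-only illustration, not the PR's IngestPreviewService: the class name `PreviewCache`, its method names, and the injectable `clock` parameter are assumptions made for the example.

```python
import time
from typing import Any, Callable, Optional

# (user_id, task_id, file_path), as described in the PR
PreviewKey = tuple[str, str, str]


class PreviewCache:
    """Hypothetical in-memory TTL cache; single-process only."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: dict[PreviewKey, tuple[float, Any]] = {}

    def store(self, user_id: str, task_id: str, file_path: str,
              docling_json: Any) -> None:
        expires_at = self._clock() + self._ttl
        self._entries[(user_id, task_id, file_path)] = (expires_at, docling_json)

    def get(self, user_id: str, task_id: str, file_path: str) -> Optional[Any]:
        key = (user_id, task_id, file_path)
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # The live window has ended: drop the entry and report a miss.
            del self._entries[key]
            return None
        return value
```

Because the key includes the file path, two files uploaded under the same task never overwrite each other's preview.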

Entry points

Onboarding upload
Knowledge dropdown (single + multi-file / folder)
Cloud picker (Google Drive / OneDrive / SharePoint)
Connector file browser — explicit file selection only (bucket-level and sync-all intentionally excluded to avoid bulk preview load)

Scope & safeguards

Entire feature gated behind the ingest-preview run-mode flag.
Per-task preview cache cap (MAX_PREVIEWS_PER_TASK) so large folder/connector syncs can't grow the cache unbounded.
Bounded client polling (MAX_PREVIEW_POLLS) so a file that never produces a preview can't poll forever.
Connector files have no browser File object, so they show a spinner until Docling returns (no local pre-parse preview); binary office formats likewise show a placeholder until parsing finishes.
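A minimal sketch of the per-task cap follows. The PR does not state the real value of MAX_PREVIEWS_PER_TASK or whether files over the cap are skipped or evict older entries, so both the value 3 and the skip behaviour here are assumptions; `CappedPreviewStore` is a hypothetical name.

```python
from collections import defaultdict
from typing import Any

MAX_PREVIEWS_PER_TASK = 3  # illustrative value, not taken from the PR


class CappedPreviewStore:
    """Hypothetical store that keeps at most N previews per (user, task)."""

    def __init__(self, max_per_task: int = MAX_PREVIEWS_PER_TASK) -> None:
        self._max = max_per_task
        self._by_task: dict[tuple[str, str], dict[str, Any]] = defaultdict(dict)

    def store(self, user_id: str, task_id: str, file_path: str,
              preview: Any) -> bool:
        """Store a preview; return False if the task is already at its cap."""
        files = self._by_task[(user_id, task_id)]
        if file_path not in files and len(files) >= self._max:
            return False  # assumed behaviour: skip new files once the cap is hit
        files[file_path] = preview  # re-storing a known file is always allowed
        return True

    def count(self, user_id: str, task_id: str) -> int:
        return len(self._by_task.get((user_id, task_id), {}))
```

With this shape, a folder sync of thousands of files holds at most `max_per_task` previews for its task, while ingestion itself is unaffected.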

Out of scope (follow-ups)

User-facing parsing/chunking customization (re-chunk against the cached parse, structure-aware chunking, presets).
Splitting the connector-preview and knowledge-dropdown entry points into separate PRs if reviewers prefer smaller units.

Testing

Tests cover the IngestPreviewService cache (store/get, per-file, TTL expiry, per-task cap), the preview endpoints, Docling preview options, Langflow preview caching, and the router/task-service preview wiring.
Backend: affected unit tests pass; ruff clean on touched files.
Frontend: biome clean on touched files; tsc shows only pre-existing errors (@docling/docling-components types, test-runner globals).

Notes for reviewers

New dependency: @docling/docling-components (renders the page-image box overlay for paged formats).
IngestPreviewService is per-process (TTLCache), consistent with the repo's single-worker constraint; a Redis-backed cache would be required to scale horizontally.
docling_service.fetch_task_result is unchanged in shape (an earlier HTML-export path was dropped to keep the blast radius small).


coderabbitai · 2026-06-30T15:47:17Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2cb11865-f967-4a31-a19d-777fdc17662a

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


github-actions Bot added the frontend, backend, tests, and enhancement labels and removed the enhancement label Jun 30, 2026

github-advanced-security AI found potential problems Jun 30, 2026


Comment thread frontend/components/ingest-preview.tsx Fixed

Comment thread frontend/components/ingest-preview.tsx

if (file.type.startsWith("image/") && objectUrl) {
  return (
    <img
      src={objectUrl}

fix XSS risk

b0ff919

Wallgau force-pushed the visual-parser-spike branch from 67a0d1a to b0ff919 Compare June 30, 2026 18:02

github-actions Bot added enhancement and removed enhancement labels Jun 30, 2026

style: ruff autofix (auto)

6bcbff4

github-actions Bot added enhancement and removed enhancement labels Jun 30, 2026

github-advanced-security AI found potential problems Jun 30, 2026


Comment thread frontend/components/ingest-preview.tsx

    // won't display PDFs inside an <iframe>.
    return (
      <object
        data={objectUrl}

Wallgau wants to merge 3 commits into main from visual-parser-spike
