
fix: enhance error handling for document processing and add OCR requirement check for image files (backport of #1859) #2017

Merged
ricofurtado merged 4 commits into main from cpd-backport-1859-ocr-error-handling
Jul 4, 2026

Conversation

@ricofurtado

@ricofurtado ricofurtado commented Jul 3, 2026

Collaborator

Backport of #1859 from release-cpd to main.

Merges two independent post-ingest verification improvements that had each evolved separately on release-cpd and main since they diverged:

Test plan

  • python3 -m py_compile on touched files
  • Full unit suite for touched areas passes

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Improved document deletion and verification reliability so newly ingested files are recognized even with indexing delays.
    • Fixed cases where a document could appear missing right after ingestion, reducing false failures during upload and delete checks.
    • Updated test coverage to use unique sample files, helping prevent leftover data from affecting results.

…rement check for image files (#1859)

* fix: enhance error handling for document processing and add OCR requirement check for image files

* style: ruff autofix (auto)

* remove clickhouse connect from langflow image (#1801)

* chore: fix ci (#1889)

* Update processors.py

* test fix

* fix ci

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Lucas Oliveira <[email protected]>
Co-authored-by: Edwin Jose <[email protected]>
(cherry picked from commit a286581)
@github-actions github-actions Bot added backend 🔷 Issues related to backend services (OpenSearch, Langflow, APIs) bug 🔴 Something isn't working. labels Jul 3, 2026
@coderabbitai

coderabbitai Bot commented Jul 3, 2026

Contributor

Review Change Stack

Walkthrough

Adds a wait_for_visibility retry option to TaskProcessor.check_document_exists and check_filename_exists to handle OpenSearch near-real-time refresh delays, wires this into Langflow/connector ingestion verification, and updates the SDK delete-document integration test to use a unique file to avoid interference from earlier ingests.

Changes

Visibility Retry and Ingestion Verification

| Layer / File(s) | Summary |
|---|---|
| check_document_exists retry logic (src/models/processors.py) | Adds kw-only wait_for_visibility parameter and retry-with-backoff when no hits are found, with docstring updates. |
| check_filename_exists retry logic (src/models/processors.py) | Adds kw-only wait_for_visibility parameter; resets alias candidates and retries after delay when no hits found across all aliases. |
| Ingestion verification wiring (src/models/processors.py) | ConnectorFileProcessor.process_item passes wait_for_visibility=True; LangflowFileProcessor.process_item switches from hash-based check_document_exists to check_filename_exists(original_filename, wait_for_visibility=True). |
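
The retry-with-backoff behavior summarized above can be illustrated with a minimal sketch. This is not the code from src/models/processors.py: the helper name `exists_with_visibility_retry`, the `search_hits` callable, and the `max_attempts` and `initial_delay` defaults are all assumptions made for illustration. Only the keyword-only `wait_for_visibility` flag comes from the walkthrough.

```python
import time
from typing import Callable


def exists_with_visibility_retry(
    search_hits: Callable[[], int],
    *,
    wait_for_visibility: bool = False,
    max_attempts: int = 4,         # hypothetical default, not from the PR
    initial_delay: float = 0.25,   # hypothetical default, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True as soon as the query reports at least one hit.

    `search_hits` stands in for the OpenSearch query and returns a hit count.
    Without `wait_for_visibility` the query runs exactly once, so callers that
    only want a quick dedupe check pay no extra latency.
    """
    attempts = max_attempts if wait_for_visibility else 1
    delay = initial_delay
    for attempt in range(attempts):
        if search_hits() > 0:
            return True
        # Sleep only when another attempt is still to come.
        if attempt < attempts - 1:
            sleep(delay)
            delay *= 2  # exponential backoff between attempts
    return False
```

A post-ingest verification caller would pass `wait_for_visibility=True`, since a document written moments ago may not be searchable until the next index refresh.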

SDK Integration Test

| Layer / File(s) | Summary |
|---|---|
| Unique file per delete test (sdks/typescript/tests/integration.test.ts) | should delete a document now creates, ingests, and deletes a uniquely named temporary file instead of reusing the shared test file, keeping existing assertion logic. |

Estimated code review effort: 3 (Moderate) | ~25 minutes

Sequence Diagram(s)

sequenceDiagram
  participant LangflowFileProcessor
  participant TaskProcessor
  participant OpenSearch

  LangflowFileProcessor->>TaskProcessor: check_filename_exists(original_filename, wait_for_visibility=True)
  TaskProcessor->>OpenSearch: query for filename
  OpenSearch-->>TaskProcessor: no hits
  TaskProcessor->>TaskProcessor: sleep with backoff, retry
  TaskProcessor->>OpenSearch: query for filename (retry)
  OpenSearch-->>TaskProcessor: hits found
  TaskProcessor-->>LangflowFileProcessor: exists = True

Possibly related PRs

  • langflow-ai/openrag#1695: Both PRs modify check_document_exists/OpenSearch existence check behavior in the same method.
  • langflow-ai/openrag#1851: Both PRs update post-ingestion verification logic in src/models/processors.py for Langflow/connector flows.
  • langflow-ai/openrag#1857: Both PRs address OpenSearch near-real-time visibility issues via retry-based existence checks in the same ingestion verification paths.

Suggested reviewers: edwinjosechittilappilly, lucaseduoli, mfortman11

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title matches the main document-processing fixes and correctly frames the change as a backport. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
@github-actions github-actions Bot added bug 🔴 Something isn't working. and removed bug 🔴 Something isn't working. labels Jul 3, 2026

@coderabbitai coderabbitai Bot left a comment

Contributor

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/models/processors.py (1)

159-237: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Add on_error handling to check_filename_exists
check_filename_exists still treats final OpenSearch errors as “missing”, so the post-ingest verification path can mark an already ingested file as FAILED on a transient outage. Mirror check_document_exists here, and pass on_error="assume_exists" at the verification call site; dedupe callers can keep the default assume_missing.
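
A minimal sketch of the suggested `on_error` switch follows. It assumes a simplified helper named `filename_exists` and a `search_hits` callable standing in for the OpenSearch query; both are hypothetical. Only the `on_error` values `"assume_exists"` and `"assume_missing"` are taken from the finding above, and the real signature in src/models/processors.py may differ.

```python
from typing import Callable, Literal

OnError = Literal["assume_missing", "assume_exists"]


def filename_exists(
    search_hits: Callable[[], int],
    *,
    on_error: OnError = "assume_missing",
) -> bool:
    """Report whether the query finds the filename.

    `search_hits` is a placeholder for the OpenSearch lookup. If it raises,
    the result depends on `on_error` instead of always being False.
    """
    try:
        return search_hits() > 0
    except Exception:
        # Final failure path: dedupe callers keep the "assume_missing" default,
        # while post-ingest verification can pass "assume_exists" so a
        # transient outage does not mark an ingested file as FAILED.
        return on_error == "assume_exists"
```

Note that a successful query with zero hits still returns False under either setting; `on_error` only changes how a raised exception is interpreted.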

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/models/processors.py` around lines 159 - 237, `check_filename_exists`
currently treats final OpenSearch failures as missing, which can incorrectly
fail post-ingest verification; update this helper to support an `on_error`
option like `check_document_exists`, and in the final exception path honor
`on_error="assume_exists"` instead of always returning false. Then update the
verification call site that uses `check_filename_exists` to pass
`on_error="assume_exists"`, while leaving dedupe/default callers on the existing
`assume_missing` behavior.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7995aa04-520a-4e34-9a3a-64819cd6c5fd

📥 Commits

Reviewing files that changed from the base of the PR and between bdd873f and 027a6d0.

📒 Files selected for processing (2)
  • sdks/typescript/tests/integration.test.ts
  • src/models/processors.py

@github-actions github-actions Bot added bug 🔴 Something isn't working. and removed bug 🔴 Something isn't working. labels Jul 3, 2026
@github-actions github-actions Bot added the lgtm label Jul 3, 2026
@github-actions github-actions Bot added bug 🔴 Something isn't working. and removed bug 🔴 Something isn't working. labels Jul 3, 2026
@ricofurtado ricofurtado merged commit 4c6acd8 into main Jul 4, 2026
20 checks passed
@github-actions github-actions Bot deleted the cpd-backport-1859-ocr-error-handling branch July 4, 2026 00:44
@mpawlow mpawlow removed their request for review July 4, 2026 19:53
ricofurtado added a commit that referenced this pull request Jul 5, 2026
…to release-saas-ga-0.6.2 (#2021)

* fix: error in skip duplicates functionality for folder ingest for connectors (backport of #1941) (#2018)

* fix: error in skip duplicates functionality for folder ingest for connectors 75616(#1842) (#1941)

* fix: enhance duplicate handling and indexing for connector file uploads

* style: ruff autofix (auto)

* fix: improve duplicate handling in connector sync functionality

* chore: trigger CodeRabbit

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
(cherry picked from commit effaf02)

* style: ruff autofix (auto)

* fix: remove duplicate test definitions from main merge

The automated merge-from-main on this branch concatenated two tests
(test_connector_processor_indexes_cleaned_filename,
test_langflow_connector_processor_uses_cleaned_filename) that already
existed under both names, tripping ruff F811. Keep one copy of each,
preferring the version wired with connector_id_exists=True to match
current check_document_exists behavior.

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
(cherry picked from commit 286e9db)

* fix: Ingestion of buckets not working right after connection is tested - CPD #75506 (#1942) (#2019)

* fix: update bucket ingestion messages for clarity and consistency

* fix: guard bucket pre-selection against connection ID mismatch

The defaults query returns the first S3/COS connection regardless of active
status. Only pre-select saved bucket_names when the defaults connection_id
matches the current connector.connectionId to avoid seeding the wrong
buckets if a stale connection exists alongside the active one.

* fix: prevent defaults from overwriting user bucket selections

Two issues in the initial-selection effect:
1. hasAppliedInitial was only set when valid buckets were found, leaving the
   effect live across bucket refreshes and allowing stale defaults to overwrite
   selections made after a refetch.
2. No guard against the async race where buckets loads before initialSelectedBuckets:
   user clicks buckets, defaults arrive later and silently overwrite them.

Fix: mark hasAppliedInitial unconditionally on first evaluation, and only apply
defaults when selectedBuckets is still empty (user hasn't acted yet).

---------

(cherry picked from commit f3823b8)

Co-authored-by: Claude Sonnet 4.6 <[email protected]>
(cherry picked from commit 36f115f)

* fix: handle orphan files and return appropriate responses when syncing connectors (#1785) (#2016)

* feat: handle orphan files and return appropriate responses when syncing connectors

* feat: refactor GoogleDriveOAuth to separate required scopes and add handling for missing optional group scopes in tests

* style: ruff autofix (auto)

* feat: add handling for skipped files and warnings in task processing and UI components

---------

(cherry picked from commit aca865a)

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
(cherry picked from commit 79b33f0)

* fix: enhance error handling for document processing and add OCR requirement check for image files (#1859) (#2017)

* fix: enhance error handling for document processing and add OCR requirement check for image files

* style: ruff autofix (auto)

* remove clickhouse connect from langflow image (#1801)

* chore: fix ci (#1889)

* Update processors.py

* test fix

* fix ci

---------

(cherry picked from commit a286581)

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Lucas Oliveira <[email protected]>
Co-authored-by: Edwin Jose <[email protected]>
(cherry picked from commit 4c6acd8)

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <[email protected]>
Co-authored-by: Lucas Oliveira <[email protected]>
Co-authored-by: Edwin Jose <[email protected]>
Labels

backend 🔷 Issues related to backend services (OpenSearch, Langflow, APIs) bug 🔴 Something isn't working. lgtm

2 participants