fix: enhance error handling for document processing and add OCR requirement check for image files (backport of #1859) #2017
…rement check for image files (#1859)
* fix: enhance error handling for document processing and add OCR requirement check for image files
* style: ruff autofix (auto)
* remove clickhouse connect from langflow image (#1801)
* chore: fix ci (#1889)
* Update processors.py
* test fix
* fix ci
---------
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Lucas Oliveira <[email protected]>
Co-authored-by: Edwin Jose <[email protected]>
(cherry picked from commit a286581)
Walkthrough
Adds a …
Changes
Visibility Retry and Ingestion Verification
SDK Integration Test
Estimated code review effort: 3 (Moderate) | ~25 minutes
Sequence Diagram(s)
sequenceDiagram
participant LangflowFileProcessor
participant TaskProcessor
participant OpenSearch
LangflowFileProcessor->>TaskProcessor: check_filename_exists(original_filename, wait_for_visibility=True)
TaskProcessor->>OpenSearch: query for filename
OpenSearch-->>TaskProcessor: no hits
TaskProcessor->>TaskProcessor: sleep with backoff, retry
TaskProcessor->>OpenSearch: query for filename (retry)
OpenSearch-->>TaskProcessor: hits found
TaskProcessor-->>LangflowFileProcessor: exists = True
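The retry loop in the diagram can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `src/models/processors.py`: it is written as a synchronous standalone function for brevity, and the `search` callable, the `max_attempts` and `base_delay` values, and the injectable `sleep` parameter are all hypothetical. Only the names `check_filename_exists` and `wait_for_visibility` come from the PR.

```python
import time
from typing import Callable


def check_filename_exists(
    filename: str,
    search: Callable[[str], int],  # hypothetical: returns the hit count for a filename
    wait_for_visibility: bool = False,
    max_attempts: int = 4,         # assumed value, not taken from the PR
    base_delay: float = 0.5,       # assumed value, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once the filename is visible in the index."""
    # Without wait_for_visibility the index is queried exactly once.
    attempts = max_attempts if wait_for_visibility else 1
    for attempt in range(attempts):
        if search(filename) > 0:
            return True
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * (2 ** attempt))
    return False
```

Keeping the single-query behavior as the default means callers that only want a quick dedupe check pay no extra latency; only the post-ingest verification path would opt in to waiting.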
🚥 Pre-merge checks | ✅ Passed checks (5 passed)
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/models/processors.py (1)
159-237: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Add `on_error` handling to `check_filename_exists`
`check_filename_exists` still treats final OpenSearch errors as “missing”, so the post-ingest verification path can mark an already ingested file as FAILED on a transient outage. Mirror `check_document_exists` here, and pass `on_error="assume_exists"` at the verification call site; dedupe callers can keep the default `assume_missing`.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/models/processors.py` around lines 159 - 237, `check_filename_exists` currently treats final OpenSearch failures as missing, which can incorrectly fail post-ingest verification; update this helper to support an `on_error` option like `check_document_exists`, and in the final exception path honor `on_error="assume_exists"` instead of always returning false. Then update the verification call site that uses `check_filename_exists` to pass `on_error="assume_exists"`, while leaving dedupe/default callers on the existing `assume_missing` behavior.
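A rough sketch of the suggested `on_error` policy, assuming a simplified synchronous helper. The `search` callable and the `OnError` type alias are hypothetical stand-ins for illustration; the real signature of `check_filename_exists` in `src/models/processors.py` may differ.

```python
from typing import Callable, Literal

# Hypothetical alias for the two policies named in the review comment.
OnError = Literal["assume_missing", "assume_exists"]


def check_filename_exists(
    filename: str,
    search: Callable[[str], int],  # hypothetical: returns the hit count, may raise
    on_error: OnError = "assume_missing",
) -> bool:
    """Return whether the filename is indexed, applying on_error if the query fails."""
    try:
        return search(filename) > 0
    except Exception:
        # Final failure path: the answer is unknown, so fall back to the
        # caller's policy instead of always reporting the file as missing.
        return on_error == "assume_exists"
```

With this shape, a post-ingest verification caller would pass `on_error="assume_exists"` so that a transient outage does not mark an already ingested file as FAILED, while dedupe callers keep the default.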
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7995aa04-520a-4e34-9a3a-64819cd6c5fd
📒 Files selected for processing (2)
sdks/typescript/tests/integration.test.ts
src/models/processors.py
…to release-saas-ga-0.6.2 (#2021)
* fix: error in skip duplicates functionality for folder ingest for connectors (backport of #1941) (#2018)
* fix: error in skip duplicates functionality for folder ingest for connectors 75616(#1842) (#1941)
* fix: enhance duplicate handling and indexing for connector file uploads
* style: ruff autofix (auto)
* fix: improve duplicate handling in connector sync functionality
* chore: trigger CodeRabbit
---------
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
(cherry picked from commit effaf02)
* style: ruff autofix (auto)
* fix: remove duplicate test definitions from main merge
The automated merge-from-main on this branch concatenated two tests (test_connector_processor_indexes_cleaned_filename, test_langflow_connector_processor_uses_cleaned_filename) that already existed under both names, tripping ruff F811. Keep one copy of each, preferring the version wired with connector_id_exists=True to match current check_document_exists behavior.
---------
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
(cherry picked from commit 286e9db)
* fix: Ingestion of buckets not working right after connection is tested - CPD #75506 (#1942) (#2019)
* fix: update bucket ingestion messages for clarity and consistency
* fix: guard bucket pre-selection against connection ID mismatch
The defaults query returns the first S3/COS connection regardless of active status. Only pre-select saved bucket_names when the defaults connection_id matches the current connector.connectionId to avoid seeding the wrong buckets if a stale connection exists alongside the active one.
* fix: prevent defaults from overwriting user bucket selections
Two issues in the initial-selection effect:
1. hasAppliedInitial was only set when valid buckets were found, leaving the effect live across bucket refreshes and allowing stale defaults to overwrite selections made after a refetch.
2. No guard against the async race where buckets loads before initialSelectedBuckets: user clicks buckets, defaults arrive later and silently overwrite them.
Fix: mark hasAppliedInitial unconditionally on first evaluation, and only apply defaults when selectedBuckets is still empty (user hasn't acted yet).
---------
(cherry picked from commit f3823b8)
Co-authored-by: Claude Sonnet 4.6 <[email protected]>
(cherry picked from commit 36f115f)
* fix: handle orphan files and return appropriate responses when syncing connectors (#1785) (#2016)
* feat: handle orphan files and return appropriate responses when syncing connectors
* feat: refactor GoogleDriveOAuth to separate required scopes and add handling for missing optional group scopes in tests
* style: ruff autofix (auto)
* feat: add handling for skipped files and warnings in task processing and UI components
---------
(cherry picked from commit aca865a)
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
(cherry picked from commit 79b33f0)
* fix: enhance error handling for document processing and add OCR requirement check for image files (#1859) (#2017)
* fix: enhance error handling for document processing and add OCR requirement check for image files
* style: ruff autofix (auto)
* remove clickhouse connect from langflow image (#1801)
* chore: fix ci (#1889)
* Update processors.py
* test fix
* fix ci
---------
(cherry picked from commit a286581)
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Lucas Oliveira <[email protected]>
Co-authored-by: Edwin Jose <[email protected]>
(cherry picked from commit 4c6acd8)
---------
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <[email protected]>
Co-authored-by: Lucas Oliveira <[email protected]>
Co-authored-by: Edwin Jose <[email protected]>
Backport of #1859 from `release-cpd` to `main`.
Merges two independent post-ingest verification improvements that had each evolved separately on `release-cpd` and `main` since they diverged:
- `main`'s `on_error="assume_exists"` handling (don't fail a file on transient OpenSearch errors after Langflow already reported success) is combined with #1859's `wait_for_visibility` retry (ride out OpenSearch's near-real-time refresh window).
- … `check_filename_exists`, not by `hash_id(item)`; Langflow assigns its own document_id there, so the hash was never actually the stored id and the check was effectively a no-op.
- … `task_service.py` and a flaky-test fix in the TS SDK integration suite.
Test plan
- `python3 -m py_compile` on touched files
🤖 Generated with Claude Code
Summary by CodeRabbit