Skip to content

Cio shared files 3#2012

Open
ricofurtado wants to merge 39 commits into
mainfrom
cio-shared-files-3
Open

Cio shared files 3#2012
ricofurtado wants to merge 39 commits into
mainfrom
cio-shared-files-3

Conversation

@ricofurtado

@ricofurtado ricofurtado commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator

s

Summary by CodeRabbit

  • New Features
    • Added support for shared bucket imports in the file browser flow.
    • Shared status is now carried through during ingestion so selected files can be handled appropriately.
  • Bug Fixes
    • Improved file sync handling so previously imported items are classified more accurately when shared buckets are involved.

Wallgau and others added 30 commits June 25, 2026 09:18
* feat: add SaaS workspace connector permissions with RBAC gates
Introduce admin Connectors Permission settings (cloud-only) backed by
workspace_config, enforce availability on connector APIs and OAuth init,
and sync the new connectors:manage:access permission at startup without
an Alembic migration. OSS/dev OSS brand bypasses policy; add dev role
toggle and brand-aware proxy headers for local RBAC testing.

* style: ruff autofix (auto)

* address CodeRabbit comments and fix backend lint unused imports

* fix backend lint

* feat: workspace connector permissions with aligned SaaS policy guards
Add admin Connectors Permission (workspace_config-backed) with RBAC
connectors:manage:access, enforce policy on connector APIs and OAuth init,
and keep the permission list independent of the live connectors tab.
Consolidate frontend connector-access hooks into useGetConnectorsQuery and
settings tab helpers into brand.ts. Fix Connectors Permission tab redirects
(RSC dev-brand default, wait for permissionsResolved). Let explicitly
enabled connectors override deployment visibility filters; add
OPENRAG_DEV_CONNECTOR_POLICY for local OSS dev backend enforcement Snapshot and restore all cached connector query keys in connect/disconnect
mutations so optimistic updates stay correct when policy context changes
mid-mutation (brand toggle, permissions resolving).

* style: ruff autofix (auto)

* addressed coderabbit comments

* remove mentions of dev only envs in the env.example

---------

Co-authored-by: Olfa Maslah <[email protected]>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Edwin Jose <[email protected]>
* Update connector.py

* update env in settings

* fix webhook url of sharepoint
…bscriptions, change discovery) (#1867)

* fix: alias legacy /connectors/google/webhook to google_drive and reload connections store on webhook channel lookup miss

* Fix Sharepoint and Google Drive

* fix: propagate deleted items from SharePoint/OneDrive webhook delta query

Main's delete-event coverage (#1852) expects handle_webhook to return
deleted file ids so sync_specific_files can run its deleted-at-source
cleanup (get_file_content -> 404 -> delete indexed chunks). The delta
implementation skipped deleted items; now deleted files propagate and
deleted folders stay excluded.

* use logging _config

* Update test_webhook_type_alias.py
…nRAG-API-JWT header (#1874)

* use fallback header

* Notes on new header

* raises http exception for missing jwt
Improving error return

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
* no derivied state

* coderabbit comments
What changed:

  - Renamed the env var read by the backend to OPENSEARCH_NODE_COUNT_CHECK_ENABLED in src/config/settings.py.
  - Updated Docker Compose to pass OPENSEARCH_NODE_COUNT_CHECK_ENABLED in docker-compose.yml.
  - Added the canonical flag to the example env file in .env.example.
  - Documented the flag in the OpenSearch config table in docs/docs/reference/configuration.mdx.
  - Updated the existing readiness test wording in tests/unit/test_opensearch_wait_node_count.py.
  - Added a regression test that loads config.settings with OPENSEARCH_NODE_COUNT_CHECK_ENABLED=false and verifies the canonical flag is what the code reads in tests/unit/config/
    test_opensearch_node_count_check_enabled.py.
* chore: Segment improvements (#1963)

* Update static properties

* and knowledge/setting actions

* avoid duplicate events

* coderabbit comments
* fix the cleanup timeout issue for the embedding step

* final step fix
* fix the cleanup timeout issue for the embedding step

* final step fix
lucaseduoli and others added 8 commits June 30, 2026 15:09
* fix max tokens when splitting for langflow-less ingestion

* fix coderabbit picks

* style: ruff autofix (auto)

* added non lf ingestion test

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
* upgrade OpenSearch to 3.6.0

* fixed PR review comments

* fix nitpicks
What changed:

  - Renamed the env var read by the backend to OPENSEARCH_NODE_COUNT_CHECK_ENABLED in src/config/settings.py.
  - Updated Docker Compose to pass OPENSEARCH_NODE_COUNT_CHECK_ENABLED in docker-compose.yml.
  - Added the canonical flag to the example env file in .env.example.
  - Documented the flag in the OpenSearch config table in docs/docs/reference/configuration.mdx.
  - Updated the existing readiness test wording in tests/unit/test_opensearch_wait_node_count.py.
  - Added a regression test that loads config.settings with OPENSEARCH_NODE_COUNT_CHECK_ENABLED=false and verifies the canonical flag is what the code reads in tests/unit/config/
    test_opensearch_node_count_check_enabled.py.
* fix the cleanup timeout issue for the embedding step

* final step fix
* ascii support for ingest files

* update the flow json

* style: ruff autofix (auto)

* Update langflow_headers.py

* Update test_langflow_ingest_callback.py

* Update test-ci.yml

* Update test_onboarding_sample_docs.py

* style: ruff autofix (auto)

* Update test_ascii_safe_header_value.py

* Update langflow_file_service.py

* re trigger commit

* fix: purge config_manager singleton in non-langflow ingestion test

The _purge_modules() list was missing "config.config_manager", so the
ConfigManager singleton retained its cached _config from a previous test.
Subsequent calls to get_openrag_config() returned the old config where
disable_ingest_with_langflow=False, causing the router to use the Langflow
path even though DISABLE_INGEST_WITH_LANGFLOW=true was set in the environment.
Adding config.config_manager to the purge list forces a fresh load on the
next import, picking up the correct env-var values.

* fix: use hash_id to derive expected document_id in CSV ingestion test

DocumentFileProcessor ignores Docling's binary_hash field and computes
document_id = hash_id(file_path) from the actual file content. The test
was asserting against the literal mock value "sha-csv-integration-123"
which never appears in the indexed documents, giving 0 hits and a false
failure. Now the test derives expected_document_id via hash_id(csv_path)
to match what the production code actually indexes.

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
@github-actions github-actions Bot added frontend 🟨 Issues related to the UI/UX backend 🔷 Issues related to backend services (OpenSearch, Langflow, APIs) ci ⬛ CI/CD, build, and infrastructure issues tests labels Jul 3, 2026
@coderabbitai

coderabbitai Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Walkthrough

This PR threads an optional shared boolean through the connector sync flow: FileBrowserDialog accepts a shared prop and conditionally includes it in the sync mutation payload, SharedBucketView forwards it based on ingest settings, and the backend connector_sync endpoint passes ingest_settings and shared to get_synced_file_ids_for_connector.

Changes

Shared flag propagation for connector sync

Layer / File(s) Summary
FileBrowserDialog shared prop and mutation payload
frontend/components/file-browser-dialog.tsx, frontend/components/connectors/shared-bucket-view.tsx
Adds optional shared?: boolean to FileBrowserDialogProps, destructures it, conditionally adds it to the syncMutation.mutateAsync payload, and SharedBucketView forwards ingestSettings.shared ?? false (or undefined) based on showShared.
Backend connector_sync forwards shared flag
src/api/connectors.py
connector_sync passes ingest_settings=body.settings and shared=body.shared into get_synced_file_ids_for_connector(...) during bucket-filter reconciliation.

Estimated code review effort: 1 (Trivial) | ~5 minutes

Suggested labels: enhancement

Suggested reviewers: edwinjosechittilappilly, lucaseduoli, mfortman11

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Title check ❓ Inconclusive The title is related to shared-file work, but it is too vague and branch-like to clearly describe the change. Rename it to a concise description of the main change, such as supporting shared files in connector sync and the file browser.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch cio-shared-files-3

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@github-actions

github-actions Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

React Doctor found 1 new issue in 1 file · 1 warning · score 87 / 100 (Great) · 0 fixed · vs main

1 warning

components/file-browser-dialog.tsx

  • ⚠️ L157 Missing effect dependencies exhaustive-deps

Reviewed by React Doctor for commit 6cd0631. See inline comments for fixes.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/models/processors.py (1)

268-315: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

_delete_connector_chunks should match both shared and private ownership states for the same stable file id.

resolve_shared_owner_fields writes connector chunks with owner=None when shared=True, but this cleanup only targets one ownership shape at a time. If the same connector file is re-synced after its shared flag changes, deleted-at-source and rename cleanup can leave stale chunks behind under the previous ownership state, causing duplicate hits.

Use the same owner == user OR owner missing filter here, as in build_replace_filename_query, so the stable file_id is cleaned up regardless of the prior mode.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/models/processors.py` around lines 268 - 315, The ownership filter in
_delete_connector_chunks only matches one state at a time, so stale chunks can
survive when a connector file switches between shared and private modes. Update
the query-building logic in _delete_connector_chunks to use the same
owner-matching behavior as build_replace_filename_query: for a given stable
file_id, match both owner == owner_user_id and owner missing/None so cleanup
removes chunks regardless of the prior shared flag. Keep the existing
document_id/connector_file_id matching, but make the owner filter cover both
ownership shapes.
🧹 Nitpick comments (2)
src/api/v1/documents.py (1)

185-212: 🗄️ Data Integrity & Integration | 🔵 Trivial | 🏗️ Heavy lift

Type the new aggregate response outside the route handler.

This /v1 handler now constructs a new per-file response envelope inline. Please move the aggregation shape into a core/service helper and expose it through a Pydantic response model so the stable API schema/SDK contract cannot drift from the implementation. As per path instructions, "src/api/**/*.py: FastAPI routes. Public/stable endpoints belong under src/api/v1/. Verify dependency injection via src/dependencies.py, response model typing, and correct HTTP status codes. No business logic in route handlers."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/api/v1/documents.py` around lines 185 - 212, The /v1 documents route is
building a new aggregate response inline in the handler, which should be moved
out of the FastAPI route and into a core/service helper with a Pydantic response
model. Update the delete-by-filename flow around
delete_documents_by_filename_core and the aggregation in the route to return a
typed response object instead of assembling JSONResponse manually, so the public
schema and SDK contract stay stable. Keep the route thin by delegating the
per-file aggregation and status calculation to the service layer, and ensure the
handler uses the declared response model and correct HTTP status code.

Source: Path instructions

tests/unit/test_document_delete_by_id.py (1)

140-179: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Assert the exact combined-scope delete IDs.

This test proves the query and count, but not that both returned IDs are actually passed to delete. Add the same delete_calls assertion used in the anonymous-only test to lock down the owned + ownerless path.

Proposed test assertion
     assert backend_opensearch_client.search_calls[0]["body"]["query"] == {
         "bool": {
             "filter": [
                 {"term": {"filename": "shared.pdf"}},
@@
             ]
         }
     }
+    assert backend_opensearch_client.delete_calls == [
+        {"index": "documents", "id": "owned-chunk", "refresh": True},
+        {"index": "documents", "id": "anonymous-chunk", "refresh": True},
+    ]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_document_delete_by_id.py` around lines 140 - 179, The test
for delete_documents_by_filename_core should also verify that both matched
document IDs are actually deleted, not just counted. Extend
test_delete_documents_by_filename_combines_owned_and_anonymous_scopes to assert
backend_opensearch_client.delete_calls contains the expected owned and anonymous
chunk IDs, using the same delete_calls pattern already used in the
anonymous-only test, so the combined-scope path is fully covered.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@flows/ingestion_flow.json`:
- Line 3754: The embedding model list now only includes text-embedding-3-large,
which makes search() skip existing text-embedding-3-small vectors during query
planning. Update the embedding configuration in the ingestion flow to keep
text-embedding-3-small available alongside text-embedding-3-large until old
vectors are migrated, so the matching embedding object still exists for
previously indexed documents. Keep the change localized to the embedding/model
mapping used by search() and the ingestion flow config.

In `@tests/unit/test_shared_flag.py`:
- Around line 181-237: The two shared-flag tests are failing before the guard is
exercised because `connector_sync` is invoked without its required `request`
argument. Update both calls in `test_non_cos_connector_rejects_shared_true` and
`test_ibm_cos_shared_true_does_not_hit_guard` to pass a mock `Request` object,
using the same `connector_sync` setup already in the tests so the shared-flag
logic can be reached.

---

Outside diff comments:
In `@src/models/processors.py`:
- Around line 268-315: The ownership filter in _delete_connector_chunks only
matches one state at a time, so stale chunks can survive when a connector file
switches between shared and private modes. Update the query-building logic in
_delete_connector_chunks to use the same owner-matching behavior as
build_replace_filename_query: for a given stable file_id, match both owner ==
owner_user_id and owner missing/None so cleanup removes chunks regardless of the
prior shared flag. Keep the existing document_id/connector_file_id matching, but
make the owner filter cover both ownership shapes.

---

Nitpick comments:
In `@src/api/v1/documents.py`:
- Around line 185-212: The /v1 documents route is building a new aggregate
response inline in the handler, which should be moved out of the FastAPI route
and into a core/service helper with a Pydantic response model. Update the
delete-by-filename flow around delete_documents_by_filename_core and the
aggregation in the route to return a typed response object instead of assembling
JSONResponse manually, so the public schema and SDK contract stay stable. Keep
the route thin by delegating the per-file aggregation and status calculation to
the service layer, and ensure the handler uses the declared response model and
correct HTTP status code.

In `@tests/unit/test_document_delete_by_id.py`:
- Around line 140-179: The test for delete_documents_by_filename_core should
also verify that both matched document IDs are actually deleted, not just
counted. Extend
test_delete_documents_by_filename_combines_owned_and_anonymous_scopes to assert
backend_opensearch_client.delete_calls contains the expected owned and anonymous
chunk IDs, using the same delete_calls pattern already used in the
anonymous-only test, so the combined-scope path is fully covered.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8915c68e-2a40-4582-99bd-88f0ee53fd78

📥 Commits

Reviewing files that changed from the base of the PR and between 4f104ec and 65c2c56.

📒 Files selected for processing (35)
  • .github/workflows/test-ci.yml
  • alembic/versions/0007_add_knowledge_delete_anonymous.py
  • flows/ingestion_flow.json
  • frontend/app/api/mutations/useSyncConnector.ts
  • frontend/components/cloud-picker/ingest-settings.tsx
  • frontend/components/cloud-picker/types.ts
  • frontend/components/connectors/shared-bucket-view.tsx
  • frontend/components/file-browser-dialog.tsx
  • frontend/components/knowledge-actions-dropdown.tsx
  • frontend/components/knowledge-batch-actions-bar.tsx
  • frontend/enhancements/connectors/ibm-cos/components/bucket-view.tsx
  • src/api/connectors.py
  • src/api/documents.py
  • src/api/v1/documents.py
  • src/config/settings.py
  • src/connectors/service.py
  • src/db/seed.py
  • src/dependencies.py
  • src/models/processors.py
  • src/services/document_index_writer.py
  • src/services/langflow_file_service.py
  • src/utils/langflow_headers.py
  • src/utils/opensearch_queries.py
  • tests/integration/core/test_non_langflow_ingestion.py
  • tests/integration/core/test_onboarding_sample_docs.py
  • tests/integration/core/test_shared_flag_dls.py
  • tests/unit/db/migrations/test_add_anonymous_delete_permission.py
  • tests/unit/db/test_rbac_seed_idempotency.py
  • tests/unit/dependencies/test_rbac_kill_switch.py
  • tests/unit/dependencies/test_require_api_key_permission.py
  • tests/unit/dependencies/test_require_permission.py
  • tests/unit/test_ascii_safe_header_value.py
  • tests/unit/test_document_delete_by_id.py
  • tests/unit/test_langflow_ingest_callback.py
  • tests/unit/test_shared_flag.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/models/processors.py (1)

268-315: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

_delete_connector_chunks should match both shared and private ownership states for the same stable file id.

resolve_shared_owner_fields writes connector chunks with owner=None when shared=True, but this cleanup only targets one ownership shape at a time. If the same connector file is re-synced after its shared flag changes, deleted-at-source and rename cleanup can leave stale chunks behind under the previous ownership state, causing duplicate hits.

Use the same owner == user OR owner missing filter here, as in build_replace_filename_query, so the stable file_id is cleaned up regardless of the prior mode.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/models/processors.py` around lines 268 - 315, The ownership filter in
_delete_connector_chunks only matches one state at a time, so stale chunks can
survive when a connector file switches between shared and private modes. Update
the query-building logic in _delete_connector_chunks to use the same
owner-matching behavior as build_replace_filename_query: for a given stable
file_id, match both owner == owner_user_id and owner missing/None so cleanup
removes chunks regardless of the prior shared flag. Keep the existing
document_id/connector_file_id matching, but make the owner filter cover both
ownership shapes.
🧹 Nitpick comments (2)
src/api/v1/documents.py (1)

185-212: 🗄️ Data Integrity & Integration | 🔵 Trivial | 🏗️ Heavy lift

Type the new aggregate response outside the route handler.

This /v1 handler now constructs a new per-file response envelope inline. Please move the aggregation shape into a core/service helper and expose it through a Pydantic response model so the stable API schema/SDK contract cannot drift from the implementation. As per path instructions, "src/api/**/*.py: FastAPI routes. Public/stable endpoints belong under src/api/v1/. Verify dependency injection via src/dependencies.py, response model typing, and correct HTTP status codes. No business logic in route handlers."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/api/v1/documents.py` around lines 185 - 212, The /v1 documents route is
building a new aggregate response inline in the handler, which should be moved
out of the FastAPI route and into a core/service helper with a Pydantic response
model. Update the delete-by-filename flow around
delete_documents_by_filename_core and the aggregation in the route to return a
typed response object instead of assembling JSONResponse manually, so the public
schema and SDK contract stay stable. Keep the route thin by delegating the
per-file aggregation and status calculation to the service layer, and ensure the
handler uses the declared response model and correct HTTP status code.

Source: Path instructions

tests/unit/test_document_delete_by_id.py (1)

140-179: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Assert the exact combined-scope delete IDs.

This test proves the query and count, but not that both returned IDs are actually passed to delete. Add the same delete_calls assertion used in the anonymous-only test to lock down the owned + ownerless path.

Proposed test assertion
     assert backend_opensearch_client.search_calls[0]["body"]["query"] == {
         "bool": {
             "filter": [
                 {"term": {"filename": "shared.pdf"}},
@@
             ]
         }
     }
+    assert backend_opensearch_client.delete_calls == [
+        {"index": "documents", "id": "owned-chunk", "refresh": True},
+        {"index": "documents", "id": "anonymous-chunk", "refresh": True},
+    ]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_document_delete_by_id.py` around lines 140 - 179, The test
for delete_documents_by_filename_core should also verify that both matched
document IDs are actually deleted, not just counted. Extend
test_delete_documents_by_filename_combines_owned_and_anonymous_scopes to assert
backend_opensearch_client.delete_calls contains the expected owned and anonymous
chunk IDs, using the same delete_calls pattern already used in the
anonymous-only test, so the combined-scope path is fully covered.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@flows/ingestion_flow.json`:
- Line 3754: The embedding model list now only includes text-embedding-3-large,
which makes search() skip existing text-embedding-3-small vectors during query
planning. Update the embedding configuration in the ingestion flow to keep
text-embedding-3-small available alongside text-embedding-3-large until old
vectors are migrated, so the matching embedding object still exists for
previously indexed documents. Keep the change localized to the embedding/model
mapping used by search() and the ingestion flow config.

In `@tests/unit/test_shared_flag.py`:
- Around line 181-237: The two shared-flag tests are failing before the guard is
exercised because `connector_sync` is invoked without its required `request`
argument. Update both calls in `test_non_cos_connector_rejects_shared_true` and
`test_ibm_cos_shared_true_does_not_hit_guard` to pass a mock `Request` object,
using the same `connector_sync` setup already in the tests so the shared-flag
logic can be reached.

---

Outside diff comments:
In `@src/models/processors.py`:
- Around line 268-315: The ownership filter in _delete_connector_chunks only
matches one state at a time, so stale chunks can survive when a connector file
switches between shared and private modes. Update the query-building logic in
_delete_connector_chunks to use the same owner-matching behavior as
build_replace_filename_query: for a given stable file_id, match both owner ==
owner_user_id and owner missing/None so cleanup removes chunks regardless of the
prior shared flag. Keep the existing document_id/connector_file_id matching, but
make the owner filter cover both ownership shapes.

---

Nitpick comments:
In `@src/api/v1/documents.py`:
- Around line 185-212: The /v1 documents route is building a new aggregate
response inline in the handler, which should be moved out of the FastAPI route
and into a core/service helper with a Pydantic response model. Update the
delete-by-filename flow around delete_documents_by_filename_core and the
aggregation in the route to return a typed response object instead of assembling
JSONResponse manually, so the public schema and SDK contract stay stable. Keep
the route thin by delegating the per-file aggregation and status calculation to
the service layer, and ensure the handler uses the declared response model and
correct HTTP status code.

In `@tests/unit/test_document_delete_by_id.py`:
- Around line 140-179: The test for delete_documents_by_filename_core should
also verify that both matched document IDs are actually deleted, not just
counted. Extend
test_delete_documents_by_filename_combines_owned_and_anonymous_scopes to assert
backend_opensearch_client.delete_calls contains the expected owned and anonymous
chunk IDs, using the same delete_calls pattern already used in the
anonymous-only test, so the combined-scope path is fully covered.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8915c68e-2a40-4582-99bd-88f0ee53fd78

📥 Commits

Reviewing files that changed from the base of the PR and between 4f104ec and 65c2c56.

📒 Files selected for processing (35)
  • .github/workflows/test-ci.yml
  • alembic/versions/0007_add_knowledge_delete_anonymous.py
  • flows/ingestion_flow.json
  • frontend/app/api/mutations/useSyncConnector.ts
  • frontend/components/cloud-picker/ingest-settings.tsx
  • frontend/components/cloud-picker/types.ts
  • frontend/components/connectors/shared-bucket-view.tsx
  • frontend/components/file-browser-dialog.tsx
  • frontend/components/knowledge-actions-dropdown.tsx
  • frontend/components/knowledge-batch-actions-bar.tsx
  • frontend/enhancements/connectors/ibm-cos/components/bucket-view.tsx
  • src/api/connectors.py
  • src/api/documents.py
  • src/api/v1/documents.py
  • src/config/settings.py
  • src/connectors/service.py
  • src/db/seed.py
  • src/dependencies.py
  • src/models/processors.py
  • src/services/document_index_writer.py
  • src/services/langflow_file_service.py
  • src/utils/langflow_headers.py
  • src/utils/opensearch_queries.py
  • tests/integration/core/test_non_langflow_ingestion.py
  • tests/integration/core/test_onboarding_sample_docs.py
  • tests/integration/core/test_shared_flag_dls.py
  • tests/unit/db/migrations/test_add_anonymous_delete_permission.py
  • tests/unit/db/test_rbac_seed_idempotency.py
  • tests/unit/dependencies/test_rbac_kill_switch.py
  • tests/unit/dependencies/test_require_api_key_permission.py
  • tests/unit/dependencies/test_require_permission.py
  • tests/unit/test_ascii_safe_header_value.py
  • tests/unit/test_document_delete_by_id.py
  • tests/unit/test_langflow_ingest_callback.py
  • tests/unit/test_shared_flag.py
🛑 Comments failed to post (2)
flows/ingestion_flow.json (1)

3754-3754: 🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Map the relevant file and locate the referenced nodes/strings.
git ls-files flows/ingestion_flow.json
rg -n '"name": "text-embedding-3-(small|large)"|EmbeddingModel-|chunk_embedding_|OpenSearch' flows/ingestion_flow.json

# Inspect the relevant line ranges around the three embedding model nodes.
sed -n '3720,3785p' flows/ingestion_flow.json
sed -n '4250,4325p' flows/ingestion_flow.json
sed -n '4775,4845p' flows/ingestion_flow.json

Repository: langflow-ai/openrag

Length of output: 50376


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find the search-time logic that decides which embedding fields are queried.
rg -n "chunk_embedding_|get_embedding_field_name|_get_index_properties|_is_knn_vector_field|search_documents|raw_search|embedding_model_name|detect.*model|available models|multi-model" flows/ingestion_flow.json

# Read the search-related methods in smaller slices.
sed -n '3000,3450p' flows/ingestion_flow.json
sed -n '3450,3950p' flows/ingestion_flow.json

Repository: langflow-ai/openrag

Length of output: 50376


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate the search implementation and index-mapping helpers.
rg -n "def search_documents|def _get_index_properties|def _get_field_dimension|def _is_knn_vector_field|chunk_embedding_" flows/ingestion_flow.json

# Read only the relevant function bodies/slices.
sed -n '3950,4700p' flows/ingestion_flow.json
sed -n '4700,5400p' flows/ingestion_flow.json

Repository: langflow-ai/openrag

Length of output: 50376


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Tight, context-rich search for the exact helper/function definitions.
rg -n -C 40 "def search_documents|def _get_index_properties|def _get_field_dimension|def _is_knn_vector_field|def _get_available_embedding_fields|def _resolve_embedding_field|chunk_embedding_" flows/ingestion_flow.json

# If search_documents exists, print a slightly larger window around it using line numbers from the file.
python3 - <<'PY'
from pathlib import Path
text = Path("flows/ingestion_flow.json").read_text()
for needle in ["def search_documents", "def _get_index_properties", "def _get_field_dimension", "def _is_knn_vector_field", "def _get_available_embedding_fields", "def _resolve_embedding_field"]:
    idx = text.find(needle)
    if idx != -1:
        start = max(0, idx - 1500)
        end = min(len(text), idx + 7000)
        print(f"\n===== {needle} =====\n")
        print(text[start:end])
PY

Repository: langflow-ai/openrag

Length of output: 50376


🏁 Script executed:

#!/bin/bash
set -euo pipefail

python3 - <<'PY'
import json
from pathlib import Path

obj = json.loads(Path("flows/ingestion_flow.json").read_text())
code = None

# Find the embedded code string in the JSON tree.
def walk(x):
    global code
    if code is not None:
        return
    if isinstance(x, dict):
        for k, v in x.items():
            if k == "code" and isinstance(v, str) and "def search_documents" in v:
                code = v
                return
            walk(v)
    elif isinstance(x, list):
        for v in x:
            walk(v)

walk(obj)
if code is None:
    raise SystemExit("embedded code not found")

needles = [
    "def search_documents",
    "def _get_index_properties",
    "def _get_field_dimension",
    "def _is_knn_vector_field",
    "def _get_available_embedding_fields",
    "def _resolve_embedding_field",
    "chunk_embedding_",
]
for needle in needles:
    idx = code.find(needle)
    print(f"\n===== {needle} =====")
    if idx == -1:
        print("NOT FOUND")
        continue
    start = max(0, idx - 1500)
    end = min(len(code), idx + 5000)
    print(code[start:end])
PY

Repository: langflow-ai/openrag

Length of output: 181


🏁 Script executed:

python3 - <<'PY'
import json
from pathlib import Path

obj = json.loads(Path("flows/ingestion_flow.json").read_text())

# Show the top-level shape so we can target the embedded code path.
print(type(obj).__name__)
if isinstance(obj, dict):
    print(sorted(obj.keys()))

# Recursively find any string values containing these markers and print their JSON path.
needles = ["search_documents", "_get_index_properties", "_get_field_dimension", "chunk_embedding_"]
hits = []

def walk(x, path="$"):
    if isinstance(x, dict):
        for k, v in x.items():
            walk(v, f"{path}.{k}")
    elif isinstance(x, list):
        for i, v in enumerate(x):
            walk(v, f"{path}[{i}]")
    elif isinstance(x, str):
        for needle in needles:
            if needle in x:
                hits.append((path, needle, x[:120].replace("\n", "\\n")))

walk(obj)
for path, needle, preview in hits[:100]:
    print(f"{needle} @ {path} :: {preview}")
print("hits:", len(hits))
PY

Repository: langflow-ai/openrag

Length of output: 1703


🏁 Script executed:

python3 - <<'PY'
import json
from pathlib import Path

obj = json.loads(Path("flows/ingestion_flow.json").read_text())
code = obj["data"]["nodes"][6]["data"]["node"]["template"]["code"]["value"]

needles = [
    "def search_documents",
    "def _get_index_properties",
    "def _get_field_dimension",
    "def _is_knn_vector_field",
    "def _get_available_embedding_fields",
    "def _resolve_embedding_field",
    "def _get_embedding_model_name",
    "get_embedding_field_name",
    "chunk_embedding_",
]
lines = code.splitlines()

for needle in needles:
    matches = [i for i, line in enumerate(lines, 1) if needle in line]
    print(f"\n===== {needle} =====")
    if not matches:
        print("NOT FOUND")
        continue
    for lineno in matches[:5]:
        start = max(1, lineno - 20)
        end = min(len(lines), lineno + 140)
        print(f"\n--- lines {start}-{end} ---")
        for i in range(start, end + 1):
            print(f"{i:5d}: {lines[i-1]}")
PY

Repository: langflow-ai/openrag

Length of output: 50375


🏁 Script executed:

python3 - <<'PY'
import json
from pathlib import Path

obj = json.loads(Path("flows/ingestion_flow.json").read_text())
code = obj["data"]["nodes"][6]["data"]["node"]["template"]["code"]["value"]
lines = code.splitlines()

# Print focused windows around the relevant ingestion and search logic.
windows = [
    (1370, 1515),  # ingestion field selection / dimensions
    (1680, 1845),  # model detection
    (1992, 2325),  # query embedding / field selection / dis_max search
    (2325, 2405),  # result handling
]
for start, end in windows:
    print(f"\n===== lines {start}-{end} =====")
    for i in range(start, min(end, len(lines)) + 1):
        print(f"{i:5d}: {lines[i-1]}")
PY

Repository: langflow-ai/openrag

Length of output: 39125


Keep text-embedding-3-small available until old vectors are migrated

search() only builds KNN queries for models that still have a matching embedding object. With only text-embedding-3-large wired here, documents already indexed as text-embedding-3-small will be detected in the corpus but skipped at query time, so they become effectively unsearchable unless you backfill/reindex or keep the small model available during the transition. text-embedding-3-large also doubles vector width, so the index footprint and embedding cost go up.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@flows/ingestion_flow.json` at line 3754, The embedding model list now only
includes text-embedding-3-large, which makes search() skip existing
text-embedding-3-small vectors during query planning. Update the embedding
configuration in the ingestion flow to keep text-embedding-3-small available
alongside text-embedding-3-large until old vectors are migrated, so the matching
embedding object still exists for previously indexed documents. Keep the change
localized to the embedding/model mapping used by search() and the ingestion flow
config.
tests/unit/test_shared_flag.py (1)

181-237: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -n "async def connector_sync" -A 12 src/api/connectors.py
rg -n "await connector_sync(" -A 8 tests/unit/test_shared_flag.py

Repository: langflow-ai/openrag

Length of output: 1534


🏁 Script executed:

#!/bin/bash
sed -n '1,220p' src/api/connectors.py | sed -n '1,220p'

Repository: langflow-ai/openrag

Length of output: 8063


🏁 Script executed:

#!/bin/bash
rg -n "def _connector_access_denied|async def _connector_access_denied|_connector_access_denied\(" src/api/connectors.py -A 40 -B 20

Repository: langflow-ai/openrag

Length of output: 23339


Pass request into both connector_sync calls
tests/unit/test_shared_flag.py:181-237connector_sync requires a request argument, so these tests raise TypeError before reaching the shared-flag branch. Add a mock Request to both calls.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_shared_flag.py` around lines 181 - 237, The two shared-flag
tests are failing before the guard is exercised because `connector_sync` is
invoked without its required `request` argument. Update both calls in
`test_non_cos_connector_rejects_shared_true` and
`test_ibm_cos_shared_true_does_not_hit_guard` to pass a mock `Request` object,
using the same `connector_sync` setup already in the tests so the shared-flag
logic can be reached.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/api/connectors.py (1)

72-77: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

get_synced_file_ids_for_connector must accept the new kwargs here

src/api/connectors.py:72-77 still defines get_synced_file_ids_for_connector() without ingest_settings or shared, so the bucket-filter sync path at src/api/connectors.py:984-991 raises TypeError on every request and falls through to the generic 500 handler. Update the function signature (and any downstream filtering if needed) to match the caller.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/api/connectors.py` around lines 72 - 77,
`get_synced_file_ids_for_connector` is missing the new `ingest_settings` and
`shared` kwargs, which causes the bucket-filter sync path to fail with a
TypeError when called from the sync flow. Update the function signature in
`get_synced_file_ids_for_connector` to accept these parameters, and thread them
through any related filtering logic or helper calls so the caller in the sync
path can pass them without error.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/api/connectors.py`:
- Around line 72-77: `get_synced_file_ids_for_connector` is missing the new
`ingest_settings` and `shared` kwargs, which causes the bucket-filter sync path
to fail with a TypeError when called from the sync flow. Update the function
signature in `get_synced_file_ids_for_connector` to accept these parameters, and
thread them through any related filtering logic or helper calls so the caller in
the sync path can pass them without error.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f97ac426-86db-4987-8f34-43c1a0b1fd3d

📥 Commits

Reviewing files that changed from the base of the PR and between 65c2c56 and 6cd0631.

📒 Files selected for processing (3)
  • frontend/components/connectors/shared-bucket-view.tsx
  • frontend/components/file-browser-dialog.tsx
  • src/api/connectors.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • frontend/components/file-browser-dialog.tsx

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend 🔷 Issues related to backend services (OpenSearch, Langflow, APIs) ci ⬛ CI/CD, build, and infrastructure issues frontend 🟨 Issues related to the UI/UX tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants