Collections: one batch processing per task #786

Open
nishika26 wants to merge 19 commits into main from enhancement/collection_batching

Conversation

@nishika26 (Collaborator) commented Apr 24, 2026

Summary

Target issues are #798 and #768

Notes

  • New Features

    • Batch-driven collection creation and upload orchestration; added batch-tracking fields and provider file-id support to avoid redundant uploads.
  • Bug Fixes

    • Batching behavior changed: documents with missing file-size now raise an error instead of being treated as zero.
  • Documentation

    • Upload docs updated to state a 25 MB maximum file size.
  • Tests

    • Expanded coverage for batching, upload/file-id handling, provider interactions, and collection job workflows.

@coderabbitai Bot commented Apr 24, 2026

📝 Walkthrough

This PR refactors collection creation into a two-phase orchestration: a setup job uploads documents and computes batches; batch jobs process each batch (with is_final semantics) until the final Collection is created and documents linked. Provider APIs and DB schema were updated to support batch tracking and pre-uploaded file IDs.
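The two-phase chain described above can be sketched as follows. This is a minimal illustration of the control flow only; the callables (`batch_documents`, `process`, `enqueue_batch`, `finalize`) are stand-ins, not the service's actual APIs.

```python
def execute_setup_job(docs, batch_documents, enqueue_batch):
    """Phase 1: compute batches up front and enqueue the first batch job."""
    batches = batch_documents(docs)
    enqueue_batch(batch_number=0, batches=batches)

def execute_batch_job(batches, batch_number, process, enqueue_batch, finalize):
    """Phase 2: process one batch; chain the next, or finalize on the last."""
    is_final = batch_number == len(batches) - 1
    process(batches[batch_number], is_final=is_final)
    if is_final:
        finalize()  # create the Collection and link documents
    else:
        enqueue_batch(batch_number=batch_number + 1, batches=batches)
```

In the real PR each `enqueue_batch` is a Celery task dispatch, so every batch runs in its own task with its own soft time limit.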

Changes

Batch-Driven Collection Creation

Layer / File(s) | Summary
Data Models
backend/app/models/document.py, backend/app/models/collection_job.py
Document adds openai_file_id; CollectionJob and CollectionJobUpdate add total_batches, current_batch_number, and documents_uploaded, and the documents field is now stored as JSON.
Database Migration
backend/app/alembic/versions/058_add_batch_tracking_to_collections_jobs.py
Adds collection_jobs.total_batches, collection_jobs.current_batch_number, collection_jobs.documents_uploaded, and document.openai_file_id.
Provider Interface
backend/app/services/collections/providers/base.py
BaseProvider adds upload_files, changes create to accept docs + vector_store_id/is_final, adds delete and get_existing_file_id.
OpenAI Provider
backend/app/services/collections/providers/openai.py
Implements get_existing_file_id and upload_files to upload/persist file IDs/sizes; create now accepts docs and optional vector_store_id, and supports non-final early return.
OpenAI Vector Store CRUD
backend/app/crud/rag/open_ai.py
update refactored from per-file upload/yield to batch attach using pre-existing openai_file_id values and upload_and_poll.
Batch Helpers
backend/app/services/collections/helpers.py
batch_documents now uses doc.file_size_kb directly (no fallback), affecting batch boundary behavior when size is missing.
Collection Service - Entry Point & Payloads
backend/app/services/collections/create_collection.py
start_job persists trace_id from correlation_id and schedules setup; payload builders updated to use public serialization and standardized failure payloads.
Collection Service - Setup Phase
backend/app/services/collections/create_collection.py
execute_setup_job uploads all docs via provider, computes batches, marks job PROCESSING, and enqueues first batch; handles timeouts/exceptions by marking job FAILED.
Collection Service - Batch Phase
backend/app/services/collections/create_collection.py
execute_batch_job processes one batch, calls provider.create with is_final, checkpoints progress, enqueues next batch if needed, and on final batch creates Collection, links documents, marks SUCCESSFUL, and sends callbacks.
Celery Tasks
backend/app/celery/tasks/job_execution.py
Added run_collection_setup_job and run_collection_batch_job tasks with gevent soft limits, correlation-id tracing and OpenTelemetry parent-context extraction; removed legacy run_create_collection_job.
Celery Utilities
backend/app/celery/utils.py
Added start_collection_setup_job and start_collection_batch_job starters; removed start_create_collection_job; updated gevent_timeout typing and timeout-instance handling.
Tests - Helpers
backend/app/tests/services/collections/test_helpers.py
Added tests for batch edge cases: zero-size files batch by count, exact-size boundary, and file_size_kb=None now raises TypeError.
Tests - OpenAI Provider
backend/app/tests/services/collections/providers/test_openai_provider.py
Updated provider tests to new create signature; added comprehensive upload_files tests and OpenAIVectorStoreCrud.update unit tests for success and failure scenarios.
Tests - Collection Service
backend/app/tests/services/collections/test_create_collection.py
Refactored to test start_job, execute_setup_job, and execute_batch_job behaviors with session/provider patching; covers scheduling, state transitions, callbacks, cleanup, and timeout handling.
Documentation
backend/app/api/docs/documents/upload.md
Updated upload documentation to state maximum file size is 25 MB.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested reviewers

  • Prajna1999
  • vprashrex
  • kartpop

Poem

🐰 I hopped through code with joyful paws,
Split work in two, obeying laws,
Setup packs files, batches march along,
Each job hums out its tidy song,
Final collection springs—hooray!—from this small cause.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 30.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check: ✅ Passed. The title 'Collections: one batch processing per task' directly corresponds to the main architectural change: refactoring collection creation to process documents in batches via separate tasks instead of a monolithic single-task execution.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Description check: ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@nishika26 changed the title from "Collections: batch per task and gevent timeout" to "Collections: one batch processing per task" on Apr 29, 2026
@coderabbitai Bot left a comment

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
backend/app/services/collections/helpers.py (1)

84-99: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Replace the implicit TypeError with an explicit validation error.

Removing the `or 0` fallback means a document with `file_size_kb=None` now crashes inside the batching loop with an opaque `unsupported operand type(s) for +: 'int' and 'NoneType'` mid-iteration. Callers cannot tell which document is invalid, and any batches accumulated up to that point are discarded. A pre-loop validation (or an explicit per-doc check) yields a clear message and a deterministic failure point.

🛡️ Proposed fix
     for doc in documents:
-        doc_size_kb = doc.file_size_kb
+        if doc.file_size_kb is None:
+            raise ValueError(
+                f"[batch_documents] Document {doc.id} has no file_size_kb; "
+                "sizes must be backfilled before batching."
+            )
+        doc_size_kb = doc.file_size_kb
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/services/collections/helpers.py` around lines 84 - 99, The
batching loop in batch_documents (the for doc in documents loop using
current_batch and current_batch_size_kb) can raise an opaque TypeError when
doc.file_size_kb is None; add explicit validation for each doc before using it
(either a pre-loop scan or a per-doc check) that verifies file_size_kb is not
None and is a numeric type, and if invalid raise a clear ValueError that
includes an identifier (e.g., doc.id or doc.name) so callers know which document
failed; perform this validation before updating current_batch_size_kb so
existing batches are preserved and add a short logger.warning or logger.error
with the same diagnostic information when raising.
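The validated batching the prompt asks for can be sketched as a runnable function. The signature, default thresholds, and the `SimpleNamespace` documents used in testing are illustrative assumptions, not the repo's actual helper.

```python
from types import SimpleNamespace

def batch_documents(documents, max_batch_size_kb=500.0, max_batch_count=100):
    """Split documents into batches bounded by total size and count.

    Validates file_size_kb up front so an invalid document fails fast with
    a clear identifier instead of an opaque mid-loop TypeError, and no
    partially built batches are silently discarded.
    """
    for doc in documents:
        if doc.file_size_kb is None:
            raise ValueError(
                f"[batch_documents] Document {doc.id} has no file_size_kb; "
                "sizes must be backfilled before batching."
            )
    batches, current, current_size_kb = [], [], 0.0
    for doc in documents:
        over_size = current_size_kb + doc.file_size_kb > max_batch_size_kb
        over_count = len(current) >= max_batch_count
        if current and (over_size or over_count):
            batches.append(current)
            current, current_size_kb = [], 0.0
        current.append(doc)
        current_size_kb += doc.file_size_kb
    if current:
        batches.append(current)
    return batches
```

With this shape, the test case noted in the walkthrough ("zero-size files batch by count") falls out of the `over_count` branch, and the `None` case becomes a `ValueError` naming the offending document rather than a `TypeError`.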
backend/app/crud/rag/open_ai.py (1)

119-151: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove the unused update method from OpenAIVectorStoreCrud.

This method is not called anywhere in the codebase and has been replaced by update_batch. Additionally, it's missing a return type hint, which violates the coding guideline requiring type hints on all function return values. Removing it eliminates redundant code and the maintenance burden of two divergent upload flows.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/crud/rag/open_ai.py` around lines 119 - 151, Delete the unused
OpenAIVectorStoreCrud.update method (the entire function) since upload logic is
now handled by update_batch; after removal, run a quick search for any remaining
references to OpenAIVectorStoreCrud.update and remove them, and clean up any
now-unused imports or symbols used only by that method (e.g., BytesIO, Document,
CloudStorage) to avoid lints and type-hint violations.
backend/app/services/collections/create_collection.py (2)

174-303: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Add type hints for task_instance (and tighten helper hints).

Per the coding guidelines, all function parameters and return values must have type hints. The following are missing/loose:

  • execute_setup_job(... task_instance, ...) -> Nonetask_instance lacks a type
  • execute_batch_job(... task_instance, ...) -> None — same
  • _persist_succeeded_docs(succeeded: list, ...) — should be list[Document]
  • _retry_failed_uploads(vector_store_crud, ..., failed_docs: list, ...)vector_store_crud lacks a type, failed_docs should be list[Document]

task_instance can be typed as celery.Task (or kept as Any from typing if you want to avoid the dependency leak).

As per coding guidelines, "Always add type hints to all function parameters and return values in Python code".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/services/collections/create_collection.py` around lines 174 -
303, The functions are missing/loose type hints: add an explicit type for
task_instance in both execute_setup_job and execute_batch_job (use celery.Task
or typing.Any if you want to avoid importing Celery), and tighten helper
signatures so _persist_succeeded_docs uses succeeded: list[Document] and
_retry_failed_uploads uses failed_docs: list[Document] and type-hint
vector_store_crud to the actual CRUD class (e.g., VectorStoreCrud) or typing.Any
if that class isn't accessible; also import any needed names (Document, Any,
celery.Task) and update return annotations if necessary.

39-66: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Return type mismatch: declared -> str but returns a UUID.

collection_job_id is a UUID (per the parameter annotation on line 43); returning it directly violates the declared -> str. Cast or change the annotation.

🐛 Proposed fix
-    return collection_job_id
+    return str(collection_job_id)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/services/collections/create_collection.py` around lines 39 - 66,
The function start_job currently declares a return type of -> str but returns
collection_job_id which is a UUID; fix by either changing the function signature
to return -> UUID or converting the returned value to a string with return
str(collection_job_id). Update any imports/annotations if you choose UUID (e.g.,
ensure UUID is imported) and keep the rest of the logic (calls to
CollectionJobCrud.update and start_create_collection_job) unchanged.
backend/app/services/collections/providers/openai.py (1)

23-28: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Update test calls to match new create signature.

The test suite in backend/app/tests/services/collections/providers/test_openai_provider.py has three test functions that call provider.create() with the old three-argument signature:

  • test_create_openai_vector_store_only() (line 40)
  • test_create_openai_with_assistant() (line 79)
  • test_create_propagates_exception() (line 143)

All three pass storage as the second argument and a documents list as the third, but the updated signature is create(collection_request, docs, vector_store_id=None, is_final=False). The tests need to pass the documents list as the second argument, not storage:

  • Change from: provider.create(collection_request, storage, documents)
  • Change to: provider.create(collection_request, documents) (with vector_store_id as named argument if needed)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/services/collections/providers/openai.py` around lines 23 - 28,
Update the three failing tests so they call the new create signature: replace
calls to provider.create(collection_request, storage, documents) with
provider.create(collection_request, documents) and, if a vector_store_id or
is_final was intended, pass those as named args (e.g.
provider.create(collection_request, documents, vector_store_id=...,
is_final=...)); modify the three test functions in
backend/app/tests/services/collections/providers/test_openai_provider.py
(test_create_openai_vector_store_only, test_create_openai_with_assistant,
test_create_propagates_exception) to pass the documents list as the second
parameter and remove the positional storage argument.
🧹 Nitpick comments (4)
backend/app/models/document.py (1)

49-53: ⚡ Quick win

Align column comment between model and migration.

Migration 055 sets the column comment to "File ID assigned by the LLM provider (e.g. OpenAI file ID) to avoid re-uploading", but the model declares it as "File ID assigned by OpenAI (avoid re-uploading)". Future alembic revision --autogenerate runs may flag this drift as an unintended schema change. Pick one wording and keep both in sync.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/models/document.py` around lines 49 - 53, The model field
openai_file_id's sa_column_kwargs comment string mismatches the migration;
update the Field definition for openai_file_id in the Document model to use the
exact comment used in migration 055 ("File ID assigned by the LLM provider (e.g.
OpenAI file ID) to avoid re-uploading") so the sa_column_kwargs comment and the
migration stay in sync and prevent autogenerate diffs.
backend/app/alembic/versions/055_add_batch_tracking_to_collections_jobs.py (1)

47-55: 💤 Low value

Migration name only mentions collection_jobs, but it also alters document.

The filename and revision message refer to collection_jobs only, while the upgrade also adds document.openai_file_id. Consider splitting into two migrations or renaming/updating the message so the change scope is discoverable from the migration filename and history.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/alembic/versions/055_add_batch_tracking_to_collections_jobs.py`
around lines 47 - 55, The migration
'055_add_batch_tracking_to_collections_jobs.py' declares changes for
collection_jobs but also adds a column to document (op.add_column adding
document.openai_file_id); either split the document change into a separate
migration or rename/update this migration's filename and revision message to
reflect both changes (and update the upgrade/revision docstring) so the history
accurately describes the addition of document.openai_file_id alongside the
collection_jobs alterations.
backend/app/services/collections/providers/openai.py (1)

47-52: ⚡ Quick win

Open one DB session for the whole batch, not one per document.

The current code opens a fresh Session(engine) and constructs a DocumentCrud for every successful upload. For a collection with hundreds/thousands of docs this multiplies connection overhead unnecessarily. A single session outside the loop with per-doc commits (or a single commit at the end if you don't need partial-progress durability) is cleaner.

♻️ Proposed refactor
-    def upload_files(
+    def upload_files(
         self,
         storage: CloudStorage,
         docs: list[Document],
         project_id: int,
     ) -> None:
-        for doc in docs:
-            if self.get_existing_file_id(doc):
-                continue
-            try:
-                content = storage.get(doc.object_store_url)
-                if doc.file_size_kb is None:
-                    doc.file_size_kb = round(len(content) / 1024, 2)
-                f_obj = BytesIO(content)
-                f_obj.name = doc.fname
-                uploaded = self.client.files.create(file=f_obj, purpose="assistants")
-                doc.openai_file_id = uploaded.id
-                with Session(engine) as session:
-                    document_crud = DocumentCrud(session, project_id)
-                    db_doc = document_crud.read_one(doc.id)
-                    db_doc.openai_file_id = uploaded.id
-                    db_doc.file_size_kb = doc.file_size_kb
-                    document_crud.update(db_doc)
-            except Exception as err:
-                ...
+        with Session(engine) as session:
+            document_crud = DocumentCrud(session, project_id)
+            for doc in docs:
+                if self.get_existing_file_id(doc):
+                    continue
+                content = storage.get(doc.object_store_url)
+                if doc.file_size_kb is None:
+                    doc.file_size_kb = round(len(content) / 1024, 2)
+                f_obj = BytesIO(content)
+                f_obj.name = doc.fname
+                uploaded = self.client.files.create(file=f_obj, purpose="assistants")
+                doc.openai_file_id = uploaded.id
+                db_doc = document_crud.read_one(doc.id)
+                db_doc.openai_file_id = uploaded.id
+                db_doc.file_size_kb = doc.file_size_kb
+                document_crud.update(db_doc)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/services/collections/providers/openai.py` around lines 47 - 52,
The code currently creates a new Session(engine) and DocumentCrud for every
uploaded document; instead open a single Session(engine) outside the upload loop
and reuse it (and a DocumentCrud instance per project_id) for each doc, calling
document_crud.read_one(doc.id), updating db_doc.openai_file_id and
db_doc.file_size_kb, and then document_crud.update(db_doc) inside the loop;
perform either a session.commit() per document for partial durability or one
commit after the loop, and ensure the session is closed once after processing
the entire batch.
backend/app/services/collections/create_collection.py (1)

475-491: ⚡ Quick win

Change except BaseException to except Exception.

BaseException catches KeyboardInterrupt, SystemExit, and GeneratorExit, which should normally be allowed to propagate. Additionally, gevent's Timeout deliberately inherits from BaseException (not Exception), so this generic handler will swallow timeouts that escape the dedicated except Timeout handler above and incorrectly mark the job as failed. Use except Exception instead.

♻️ Proposed change
-    except BaseException as err:
+    except Exception as err:
         logger.error(
             "[create_collection.execute_batch_job] Batch %d failed | job_id=%s, error=%s",
             ...
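The hierarchy point is easy to demonstrate without gevent installed. `FakeTimeout` below is a stand-in that mirrors only `gevent.Timeout`'s position in the exception hierarchy (it subclasses `BaseException`, not `Exception`), which is exactly why a broad `except Exception` lets it propagate.

```python
class FakeTimeout(BaseException):
    """Stand-in mirroring gevent.Timeout's BaseException lineage."""

def run_with_handler(exc, catch_base):
    """Raise exc inside a handler; catch_base picks the handler's breadth."""
    try:
        raise exc
    except (BaseException if catch_base else Exception):
        return "caught"
```

An `except Exception` handler catches ordinary errors but lets the timeout escape to the dedicated handler above it, which is the behavior the comment recommends.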
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/services/collections/create_collection.py` around lines 475 -
491, The catch-all in create_collection.execute_batch_job currently uses "except
BaseException as err" which improperly catches KeyboardInterrupt/SystemExit and
gevent Timeouts; change that handler to "except Exception as err" so only
regular exceptions are caught (leaving the earlier "except Timeout" and
system-exiting signals to propagate), and keep the existing logging,
_mark_job_failed, and callback logic unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/app/celery/tasks/job_execution.py`:
- Around line 74-105: The gevent_timeout decorator currently raises TimeoutError
unconditionally in its finally block causing tasks like
run_create_collection_job and run_collection_batch_job to always fail; modify
gevent_timeout (the decorator implementation) so that the Timeout exception is
raised only inside the except Timeout: handler and the finally: block only calls
timeout.cancel() (no raise), ensuring timeout.cancel() is reachable and
successful task completions do not raise TimeoutError.
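The corrected decorator shape the prompt describes can be sketched like this. The `Timeout` class below is a minimal stand-in mimicking `gevent.Timeout`'s interface and `BaseException` lineage so the pattern is runnable without gevent; the real implementation would use gevent's own class.

```python
import functools

class Timeout(BaseException):
    """Minimal stand-in for gevent.Timeout (also a BaseException subclass)."""
    def __init__(self, seconds=None):
        self.seconds = seconds
    def start(self):
        pass  # real gevent arms a greenlet timer here
    def cancel(self):
        self.cancelled = True  # real gevent disarms the timer

def gevent_timeout(seconds, task_name=""):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            timeout = Timeout(seconds)
            timeout.start()
            try:
                return func(*args, **kwargs)  # success path: return, no raise
            except Timeout:
                # raise ONLY when the timeout actually fired
                raise TimeoutError(
                    f"{task_name or func.__name__} timed out after {seconds}s"
                )
            finally:
                timeout.cancel()  # always disarm; never raise from finally
        return wrapper
    return decorator

@gevent_timeout(5, task_name="demo_task")
def quick_task():
    return "done"
```

The key change versus the buggy version: the `finally` block only cancels, so a task that finishes in time returns its result instead of unconditionally raising `TimeoutError`.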

In `@backend/app/celery/utils.py`:
- Around line 185-208: gevent_timeout currently always raises TimeoutError and
never cancels the gevent Timeout; fix wrapper in gevent_timeout by tracking
whether the gevent Timeout fired (e.g., timed_out flag and optionally store
result/exception), don't unconditionally raise in finally, always call
timeout.cancel() in the finally block, and only raise TimeoutError (or re-raise
the stored Timeout) after timeout.cancel() if timed_out is true; reference
wrapper, Timeout, timeout.cancel(), task_name and func.__name__ to locate where
to apply the change.

In `@backend/app/crud/rag/open_ai.py`:
- Around line 158-163: The docstring for the batch upload method incorrectly
refers to provider_file_id; update it to reference the actual Document attribute
used in the code (doc.openai_file_id) so the docstring matches the
implementation (see the method that calls upload_and_poll / the loop that reads
doc.openai_file_id). Ensure the sentence now states that all docs must have
openai_file_id set before calling this method and return description remains
unchanged.
- Around line 182-190: In OpenAIVectorStoreCrud.update_batch, when
batch.file_counts.failed > 0, don't mark all docs for retry; call the OpenAI
helper client.beta.vector_stores.file_batches.list_files(batch_id=batch.id,
vector_store_id=vector_store_id, filter="failed") to get only failed file
entries, map those failed file identifiers back to the input docs list (using
the same file id/key used when building docs), and extend the failed list with
only those docs so upload_and_poll() is retried only for genuinely failed files
instead of the entire batch.
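A sketch of that failed-only retry mapping is below. The SDK call path (`client.beta.vector_stores.file_batches.list_files`) follows the review comment and may differ between openai-python versions; the helper itself is a hypothetical name.

```python
def collect_failed_docs(client, vector_store_id, batch, docs):
    """Map a batch's failed provider files back to the input docs.

    Assumes each doc carries the openai_file_id it was attached with, so
    failed file IDs reported by the provider can be matched to docs and
    only those docs are retried.
    """
    if batch.file_counts.failed == 0:
        return []
    failed_files = client.beta.vector_stores.file_batches.list_files(
        batch_id=batch.id,
        vector_store_id=vector_store_id,
        filter="failed",
    )
    failed_ids = {f.id for f in failed_files}
    return [doc for doc in docs if doc.openai_file_id in failed_ids]
```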

In `@backend/app/services/collections/create_collection.py`:
- Around line 122-172: The two helper functions _persist_succeeded_docs and
_retry_failed_uploads (and the stale docstring reference to
_upload_batch_with_retry) are dead code and OpenAIVectorStoreCrud is unused;
either wire them into the batch path (execute_setup_job / execute_batch_job) or
remove them. Fix by removing the unused helpers _persist_succeeded_docs and
_retry_failed_uploads and the OpenAIVectorStoreCrud import, and update the
execute_batch_job docstring to not reference _upload_batch_with_retry;
alternatively, if you intend to keep retry logic, add calls from
execute_batch_job/execute_setup_job to _retry_failed_uploads (and ensure
vector_store_crud is passed) and implement or rename _upload_batch_with_retry
accordingly so the docstring matches the implemented function.
- Around line 304-311: Update the Phase 2 docstring to remove the reference to
the non-existent _upload_batch_with_retry and instead describe the actual
behavior: that the code calls provider.create(...) which delegates to
vector_store_crud.update_batch, and that inline retries are handled by
_retry_failed_uploads (if used) or by the underlying vector_store_crud; ensure
the docstring accurately states that failed items are retried via
_retry_failed_uploads or the vector_store_crud retry semantics, and that the
function still checkpoints progress, queues next batch, and finalizes the
collection on the last batch.
- Around line 215-220: The log call in create_collection.execute_setup_job uses
four format specifiers but only passes job_id and len(flat_docs), causing a
runtime TypeError; update the logger.info call to either (A) reduce the format
string to match the two provided args (e.g., remove failed and duration_s
placeholders) or (B) compute and supply the missing values by timing the
upload_files call and getting a failed count (modify upload_files to return a
result struct with failed_count and have execute_setup_job measure duration_s
and pass job_id, len(flat_docs), failed_count, duration_s into logger.info).
Ensure the change references logger.info and the upload_files/flat_docs
variables so the log formatting and values are consistent.
- Around line 243-253: The first batch enqueue call to
start_collection_batch_job is missing the required vector_store_id expected by
execute_batch_job, causing a TypeError; fix it by passing vector_store_id=None
in the start_collection_batch_job invocation (where project_id/job_id/trace_id
are passed) so execute_batch_job receives the argument, or alternatively add a
default vector_store_id: Optional[...] = None to execute_batch_job's signature;
reference start_collection_batch_job and execute_batch_job when making the
change.

In `@backend/app/services/collections/providers/openai.py`:
- Around line 30-59: The upload_files loop in OpenAIProvider.upload_files
currently logs per-document exceptions and continues, leaving docs with None
file_size_kb/openai_file_id and causing downstream TypeError or silent failures;
modify upload_files to either (A) fail-fast by re-raising the caught exception
after logging so callers (e.g., create_collection.execute_setup_job) can stop
and surface the real error, or (B) accumulate per-doc failures into a structured
result (e.g., list of successes and failures) and return that to callers so they
can decide (and avoid passing docs without openai_file_id to
vector_store_crud.update_batch); update the function signature and callers
accordingly (refer to upload_files, create_collection.execute_setup_job, and
vector_store_crud.update_batch) so callers handle the returned error info or the
propagated exception.
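Option (B) from the comment, returning a structured per-doc result, can be sketched as follows. `UploadOutcome`, `storage.get`, and `create_provider_file` are stand-ins for the real CloudStorage fetch and OpenAI `files.create` call, not the repo's actual names.

```python
from dataclasses import dataclass, field

@dataclass
class UploadOutcome:
    """Structured upload result so callers see exactly which docs failed."""
    succeeded: list = field(default_factory=list)
    failed: list = field(default_factory=list)  # (doc, exception) pairs

def upload_files(storage, docs, create_provider_file):
    """Upload each doc, accumulating per-doc failures instead of swallowing them."""
    outcome = UploadOutcome()
    for doc in docs:
        try:
            content = storage.get(doc.object_store_url)
            doc.openai_file_id = create_provider_file(content, doc.fname)
            outcome.succeeded.append(doc)
        except Exception as err:
            outcome.failed.append((doc, err))
    return outcome
```

A caller such as the setup job can then decide whether to fail the job, retry only `outcome.failed`, or proceed with `outcome.succeeded`, and it never passes a doc without an `openai_file_id` downstream.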

---

Outside diff comments:
In `@backend/app/crud/rag/open_ai.py`:
- Around line 119-151: Delete the unused OpenAIVectorStoreCrud.update method
(the entire function) since upload logic is now handled by update_batch; after
removal, run a quick search for any remaining references to
OpenAIVectorStoreCrud.update and remove them, and clean up any now-unused
imports or symbols used only by that method (e.g., BytesIO, Document,
CloudStorage) to avoid lints and type-hint violations.

In `@backend/app/services/collections/create_collection.py`:
- Around line 174-303: The functions are missing/loose type hints: add an
explicit type for task_instance in both execute_setup_job and execute_batch_job
(use celery.Task or typing.Any if you want to avoid importing Celery), and
tighten helper signatures so _persist_succeeded_docs uses succeeded:
list[Document] and _retry_failed_uploads uses failed_docs: list[Document] and
type-hint vector_store_crud to the actual CRUD class (e.g., VectorStoreCrud) or
typing.Any if that class isn't accessible; also import any needed names
(Document, Any, celery.Task) and update return annotations if necessary.
- Around line 39-66: The function start_job currently declares a return type of
-> str but returns collection_job_id which is a UUID; fix by either changing the
function signature to return -> UUID or converting the returned value to a
string with return str(collection_job_id). Update any imports/annotations if you
choose UUID (e.g., ensure UUID is imported) and keep the rest of the logic
(calls to CollectionJobCrud.update and start_create_collection_job) unchanged.

In `@backend/app/services/collections/helpers.py`:
- Around line 84-99: The batching loop in batch_documents (the for doc in
documents loop using current_batch and current_batch_size_kb) can raise an
opaque TypeError when doc.file_size_kb is None; add explicit validation for each
doc before using it (either a pre-loop scan or a per-doc check) that verifies
file_size_kb is not None and is a numeric type, and if invalid raise a clear
ValueError that includes an identifier (e.g., doc.id or doc.name) so callers
know which document failed; perform this validation before updating
current_batch_size_kb so existing batches are preserved and add a short
logger.warning or logger.error with the same diagnostic information when
raising.

In `@backend/app/services/collections/providers/openai.py`:
- Around line 23-28: Update the three failing tests so they call the new create
signature: replace calls to provider.create(collection_request, storage,
documents) with provider.create(collection_request, documents) and, if a
vector_store_id or is_final was intended, pass those as named args (e.g.
provider.create(collection_request, documents, vector_store_id=...,
is_final=...)); modify the three test functions in
backend/app/tests/services/collections/providers/test_openai_provider.py
(test_create_openai_vector_store_only, test_create_openai_with_assistant,
test_create_propagates_exception) to pass the documents list as the second
parameter and remove the positional storage argument.

---

Nitpick comments:
In `@backend/app/alembic/versions/055_add_batch_tracking_to_collections_jobs.py`:
- Around line 47-55: The migration
'055_add_batch_tracking_to_collections_jobs.py' declares changes for
collection_jobs but also adds a column to document (op.add_column adding
document.openai_file_id); either split the document change into a separate
migration or rename/update this migration's filename and revision message to
reflect both changes (and update the upgrade/revision docstring) so the history
accurately describes the addition of document.openai_file_id alongside the
collection_jobs alterations.

In `@backend/app/models/document.py`:
- Around line 49-53: The model field openai_file_id's sa_column_kwargs comment
string mismatches the migration; update the Field definition for openai_file_id
in the Document model to use the exact comment used in migration 055 ("File ID
assigned by the LLM provider (e.g. OpenAI file ID) to avoid re-uploading") so
the sa_column_kwargs comment and the migration stay in sync and prevent
autogenerate diffs.

In `@backend/app/services/collections/create_collection.py`:
- Around line 475-491: The catch-all in create_collection.execute_batch_job
currently uses "except BaseException as err" which improperly catches
KeyboardInterrupt/SystemExit and gevent Timeouts; change that handler to "except
Exception as err" so only regular exceptions are caught (leaving the earlier
"except Timeout" and system-exiting signals to propagate), and keep the existing
logging, _mark_job_failed, and callback logic unchanged.
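The distinction matters because `BaseException` also covers `KeyboardInterrupt` and `SystemExit`. A minimal sketch of the narrower handler, where `work` and `mark_failed` are hypothetical stand-ins for the job body and `_mark_job_failed`:

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def execute_batch_job(work: Callable[[], None], mark_failed: Callable[[str], None]) -> None:
    """Run one batch job, marking it failed only on ordinary exceptions."""
    try:
        work()
    except Exception as err:  # not BaseException: KeyboardInterrupt/SystemExit propagate
        logger.error("Batch job failed: %s", err)
        mark_failed(str(err))
```

An ordinary `ValueError` is logged and recorded, while a `KeyboardInterrupt` raised inside `work` passes straight through to the worker, which is the behavior the comment asks to preserve.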

In `@backend/app/services/collections/providers/openai.py`:
- Around line 47-52: The code currently creates a new Session(engine) and
DocumentCrud for every uploaded document; instead open a single Session(engine)
outside the upload loop and reuse it (and a DocumentCrud instance per
project_id) for each doc, calling document_crud.read_one(doc.id), updating
db_doc.openai_file_id and db_doc.file_size_kb, and then
document_crud.update(db_doc) inside the loop; perform either a session.commit()
per document for partial durability or one commit after the loop, and ensure the
session is closed once after processing the entire batch.
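The shape of that refactor — one session opened outside the loop, one crud instance per batch, per-document commits inside — can be sketched with hypothetical stand-ins; `session_factory`, `crud_factory`, and `upload` are illustrations, not the repo's real `Session`/`DocumentCrud` API:

```python
from typing import Any, Callable


def record_uploads(
    docs: list[dict[str, Any]],
    upload: Callable[[dict[str, Any]], str],
    session_factory: Callable[[], Any],
    crud_factory: Callable[[Any], Any],
) -> None:
    """Reuse one session for the whole batch instead of opening one per document."""
    session = session_factory()  # opened once, outside the loop
    try:
        crud = crud_factory(session)
        for doc in docs:
            file_id = upload(doc)
            db_doc = crud.read_one(doc["id"])
            db_doc["openai_file_id"] = file_id
            crud.update(db_doc)
            session.commit()  # per-document commit for partial durability
    finally:
        session.close()  # closed once, after the entire batch
```

Committing per document trades a little throughput for durability of already-uploaded file ids; a single commit after the loop is the alternative the comment mentions.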
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 14146497-9eeb-46d5-94d9-fe7751afed6a

📥 Commits

Reviewing files that changed from the base of the PR and between a9e2ac5 and 2a2e268.

📒 Files selected for processing (12)
  • backend/app/alembic/versions/055_add_batch_tracking_to_collections_jobs.py
  • backend/app/api/docs/documents/upload.md
  • backend/app/celery/tasks/job_execution.py
  • backend/app/celery/utils.py
  • backend/app/crud/rag/open_ai.py
  • backend/app/models/collection_job.py
  • backend/app/models/document.py
  • backend/app/services/collections/create_collection.py
  • backend/app/services/collections/helpers.py
  • backend/app/services/collections/providers/base.py
  • backend/app/services/collections/providers/openai.py
  • backend/app/tests/services/collections/test_helpers.py

Comment thread backend/app/celery/tasks/job_execution.py
Comment thread backend/app/celery/utils.py Outdated
Comment thread backend/app/crud/rag/open_ai.py Outdated
Comment thread backend/app/crud/rag/open_ai.py Outdated
Comment thread backend/app/services/collections/create_collection.py Outdated
Comment thread backend/app/services/collections/create_collection.py Outdated
Comment thread backend/app/services/collections/create_collection.py Outdated
Comment thread backend/app/services/collections/create_collection.py
Comment thread backend/app/services/collections/providers/openai.py
@codecov


codecov Bot commented May 10, 2026

Codecov Report

❌ Patch coverage is 97.32771% with 19 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| backend/app/celery/tasks/job_execution.py | 0.00% | 9 Missing ⚠️ |
| backend/app/celery/utils.py | 28.57% | 5 Missing ⚠️ |
| backend/app/services/collections/providers/base.py | 62.50% | 3 Missing ⚠️ |
| ...kend/app/services/collections/create_collection.py | 97.72% | 2 Missing ⚠️ |

📢 Thoughts on this report? Let us know!

@nishika26 nishika26 self-assigned this May 11, 2026
@nishika26 nishika26 requested review from Prajna1999 and vprashrex May 11, 2026 03:43
@nishika26 nishika26 added the ready-for-review and enhancement (New feature or request) labels May 11, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/services/collections/create_collection.py (1)

42-69: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Align start_job's return value with its signature.

The function is annotated as returning str, but it currently returns a UUID. That mismatch leaks into callers at the service boundary and is easy to fix by either returning str(collection_job_id) or changing the annotation to UUID.

Suggested change
-    return collection_job_id
+    return str(collection_job_id)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/services/collections/create_collection.py` around lines 42 - 69,
The function start_job currently has a return type annotation of str but returns
a UUID (collection_job_id); update the function so the return type matches the
actual value by either converting collection_job_id to a string before returning
(return str(collection_job_id)) or changing the return annotation to UUID;
adjust the signature and any callers if needed to keep types consistent (refer
to start_job and collection_job_id to locate the change).
🧹 Nitpick comments (2)
backend/app/alembic/versions/058_add_batch_tracking_to_collections_jobs.py (1)

19-19: ⚡ Quick win

Add return annotations to the migration hooks.

upgrade and downgrade are new functions, but both are missing return types. Please add -> None so the migration stays consistent with the repo-wide typing rule.

Suggested change
-def upgrade():
+def upgrade() -> None:
     op.add_column(
         "collection_jobs",
         sa.Column(
@@
-def downgrade():
+def downgrade() -> None:
     op.drop_column("collection_jobs", "total_batches")
     op.drop_column("collection_jobs", "current_batch_number")
     op.drop_column("collection_jobs", "documents_uploaded")
     op.drop_column("document", "openai_file_id")

As per coding guidelines, **/*.py: Always add type hints to all function parameters and return values in Python code.

Also applies to: 58-58

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/alembic/versions/058_add_batch_tracking_to_collections_jobs.py`
at line 19, The migration hooks upgrade and downgrade are missing return type
annotations; update the function signatures for both upgrade() and downgrade()
to include -> None (e.g., def upgrade() -> None:) so they conform to the
repo-wide typing rule and maintain consistency across migrations.
backend/app/celery/tasks/job_execution.py (1)

133-150: ⚡ Quick win

Annotate the new Celery task entrypoints.

run_collection_setup_job and run_collection_batch_job are newly added without return annotations. These wrappers currently return None, so please add -> None on both.

Suggested change
 def run_collection_setup_job(
     self, project_id: int, job_id: str, trace_id: str, **kwargs
-):
+) -> None:
@@
 def run_collection_batch_job(
     self, project_id: int, job_id: str, trace_id: str, **kwargs
-):
+) -> None:

As per coding guidelines, **/*.py: Always add type hints to all function parameters and return values in Python code.

Also applies to: 153-170

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/celery/tasks/job_execution.py` around lines 133 - 150, The new
Celery task wrappers run_collection_setup_job and run_collection_batch_job
currently lack return type annotations and by design return None; update both
function signatures to include an explicit return type of -> None (e.g., def
run_collection_setup_job(...) -> None:) so they conform to the project's typing
guideline; locate the two task definitions (run_collection_setup_job and
run_collection_batch_job) and add the -> None annotation to each signature
without changing the function bodies.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/app/celery/utils.py`:
- Around line 123-125: The log message in start_collection_batch_job is using
the wrong prefix ("[start_collection_setup_job]"); update the logger.info call
inside the start_collection_batch_job function to use the correct prefix
"[start_collection_batch_job]" and follow the project's logging guideline format
(e.g., logger.info(f"[start_collection_batch_job] Started job
{mask_string(job_id)} with Celery task {mask_string(task_id)}")) so batch
enqueues are distinguishable in worker logs.

In `@backend/app/crud/rag/open_ai.py`:
- Around line 119-140: The update currently swallows OpenAIError and logs batch
failures instead of stopping the flow; change OpenAIVectorStoreCrud.update so
that any exception from self.client.vector_stores.file_batches.upload_and_poll
(and any case where batch.file_counts.failed > 0 after upload_and_poll) is
propagated as an exception instead of just logging: remove/replace the
logger.error handling that swallows OpenAIError (re-raise or raise a new
descriptive exception including the original err) and add a check after
upload_and_poll that raises a descriptive error when batch.file_counts.failed >
0 (including vector_store_id and failed count) so callers of
OpenAIVectorStoreCrud.update (e.g., OpenAIProvider.create and execute_batch_job)
will abort on upload failures.
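A hedged sketch of the fail-fast behavior the comment asks for, written against the same `upload_and_poll` call shape the surrounding code uses; the `RuntimeError` type and message wording are assumptions:

```python
from typing import Any


def attach_files_or_fail(client: Any, vector_store_id: str, file_ids: list[str]) -> Any:
    """Attach pre-uploaded files to a vector store; raise instead of logging on failure."""
    try:
        batch = client.vector_stores.file_batches.upload_and_poll(
            vector_store_id=vector_store_id, files=[], file_ids=file_ids
        )
    except Exception as err:
        # Re-raise with context instead of swallowing the provider error.
        raise RuntimeError(
            f"Vector store {vector_store_id} batch upload failed: {err}"
        ) from err
    if batch.file_counts.failed > 0:
        # upload_and_poll can report per-file failures without raising.
        raise RuntimeError(
            f"{batch.file_counts.failed} file(s) failed to attach to {vector_store_id}"
        )
    return batch
```

Because both the raised exception and a non-zero `file_counts.failed` now surface as errors, callers such as `OpenAIProvider.create` and `execute_batch_job` abort rather than continuing past a partially attached batch.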

In `@backend/app/services/collections/providers/openai.py`:
- Around line 45-60: After files.create succeeds but DocumentCrud.update fails,
we must delete the orphaned provider file to avoid quota/storage leaks: in the
upload_files flow where uploaded = self.client.files.create(...) and
db_doc.openai_file_id/db_doc.file_size_kb are set before calling
document_crud.update(db_doc), catch exceptions from DocumentCrud.update and call
self.client.files.delete(uploaded.id) (wrap delete in its own try/except and log
any deletion failure using logger.error) before re-raising the original error so
the provider file is rolled back; reference the uploaded variable,
DocumentCrud.update, and self.client.files.delete when adding this cleanup.
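The rollback shape the comment describes — delete the provider file when the DB update fails — can be sketched as follows; the `client`/`crud` call shapes mirror the review text, and the names are illustrative rather than the repo's real API:

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)


def upload_and_record(client: Any, crud: Any, doc: Any) -> str:
    """Upload a file and persist its id; delete the orphan if persistence fails."""
    uploaded = client.files.create(file=doc)
    try:
        crud.update(doc, uploaded.id)
    except Exception:
        try:
            client.files.delete(uploaded.id)  # roll back the provider file
        except Exception as cleanup_err:
            # Deletion failure is logged but must not mask the original error.
            logger.error(
                "Failed to delete orphaned file %s: %s", uploaded.id, cleanup_err
            )
        raise  # re-raise the original DB error
    return uploaded.id
```

Wrapping the delete in its own try/except keeps the original exception as the one callers see, while still attempting to reclaim the provider-side quota.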

In `@backend/app/tests/services/collections/test_create_collection.py`:
- Around line 37-56: Add explicit type hints to the test helpers: annotate
_mock_provider_with_size(llm_service_id: str, llm_service_name: str) -> Mock
(import Mock from unittest.mock), annotate the nested helper def
_set_file_size(storage: Any, docs: Iterable[Any], project_id: Any) -> None
(import Any, Iterable from typing), and annotate _patch_session(db: Session) ->
Any (or unittest.mock._patch) to reflect the patcher return; also add the
necessary imports (typing and unittest.mock) and keep references to
get_mock_provider and the patched create_collection.Session to locate the
functions.
- Line 7: The tests import SoftTimeLimitExceeded but never exercise it; add
mirror test cases that raise celery.exceptions.SoftTimeLimitExceeded where the
current tests raise gevent.Timeout so the code paths handled in production (the
except tuple (Timeout, SoftTimeLimitExceeded)) are covered. Locate the existing
tests in test_create_collection.py that currently use gevent.Timeout (the same
test functions/assertions) and add equivalent subtests or parametrize them to
also raise SoftTimeLimitExceeded, asserting the same behavior and cleanup as for
gevent.Timeout.

---

Outside diff comments:
In `@backend/app/services/collections/create_collection.py`:
- Around line 42-69: The function start_job currently has a return type
annotation of str but returns a UUID (collection_job_id); update the function so
the return type matches the actual value by either converting collection_job_id
to a string before returning (return str(collection_job_id)) or changing the
return annotation to UUID; adjust the signature and any callers if needed to
keep types consistent (refer to start_job and collection_job_id to locate the
change).

---

Nitpick comments:
In `@backend/app/alembic/versions/058_add_batch_tracking_to_collections_jobs.py`:
- Line 19: The migration hooks upgrade and downgrade are missing return type
annotations; update the function signatures for both upgrade() and downgrade()
to include -> None (e.g., def upgrade() -> None:) so they conform to the
repo-wide typing rule and maintain consistency across migrations.

In `@backend/app/celery/tasks/job_execution.py`:
- Around line 133-150: The new Celery task wrappers run_collection_setup_job and
run_collection_batch_job currently lack return type annotations and by design
return None; update both function signatures to include an explicit return type
of -> None (e.g., def run_collection_setup_job(...) -> None:) so they conform to
the project's typing guideline; locate the two task definitions
(run_collection_setup_job and run_collection_batch_job) and add the -> None
annotation to each signature without changing the function bodies.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2f5f5939-4016-41b6-bcbb-d8df33c6244f

📥 Commits

Reviewing files that changed from the base of the PR and between 2a2e268 and cb97654.

📒 Files selected for processing (9)
  • backend/app/alembic/versions/058_add_batch_tracking_to_collections_jobs.py
  • backend/app/celery/tasks/job_execution.py
  • backend/app/celery/utils.py
  • backend/app/crud/rag/open_ai.py
  • backend/app/services/collections/create_collection.py
  • backend/app/services/collections/providers/openai.py
  • backend/app/tests/services/collections/providers/test_openai_provider.py
  • backend/app/tests/services/collections/test_create_collection.py
  • backend/app/tests/services/collections/test_helpers.py

Comment thread backend/app/celery/utils.py
Comment thread backend/app/crud/rag/open_ai.py
Comment thread backend/app/services/collections/providers/openai.py
Comment thread backend/app/tests/services/collections/test_create_collection.py
Comment thread backend/app/tests/services/collections/test_create_collection.py

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
backend/app/tests/services/collections/providers/test_openai_provider.py (1)

132-146: ⚡ Quick win

Annotate the new test helpers.

_make_doc() and _patch_session_and_crud() were added without full type annotations, so this new test code is out of sync with the repo’s Python typing rule.

Suggested patch
+from typing import Any
+from types import SimpleNamespace
+
-
-def _make_doc(*, openai_file_id=None, file_size_kb=None):
+def _make_doc(
+    *, openai_file_id: str | None = None, file_size_kb: float | None = None
+) -> SimpleNamespace:
     return SimpleNamespace(
         id=uuid4(),
         fname="test.md",
         object_store_url="s3://bucket/test.md",
         openai_file_id=openai_file_id,
         file_size_kb=file_size_kb,
     )
 
 
-def _patch_session_and_crud():
+def _patch_session_and_crud() -> tuple[Any, Any]:
     """Patches Session and DocumentCrud used inside upload_files."""
     session_patcher = patch("app.services.collections.providers.openai.Session")
     crud_patcher = patch("app.services.collections.providers.openai.DocumentCrud")
     return session_patcher, crud_patcher

As per coding guidelines, "Always add type hints to all function parameters and return values in Python code" and "Use Python 3.11+ with type hints throughout the codebase".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/tests/services/collections/providers/test_openai_provider.py`
around lines 132 - 146, The two test helper functions lack type annotations; add
Python 3.11+ type hints: annotate _make_doc(openai_file_id: Optional[str] =
None, file_size_kb: Optional[int] = None) -> SimpleNamespace (import Optional
from typing and SimpleNamespace from types) and annotate
_patch_session_and_crud() -> tuple[unittest.mock._patch, unittest.mock._patch]
(or -> tuple[Any, Any] with Any imported from typing if you prefer public
types), and update imports accordingly so the new helpers (_make_doc and
_patch_session_and_crud) comply with the repo's typing rules.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/app/crud/rag/open_ai.py`:
- Around line 116-123: The code calls
self.client.vector_stores.file_batches.upload_and_poll with file_ids built from
[doc.openai_file_id for doc in docs] without validating openai_file_id; change
the logic in the method that constructs batch/upload (the block using docs,
openai_file_id, vector_store_id, and upload_and_poll) to filter out any docs
where doc.openai_file_id is falsy before building file_ids, and if any docs were
dropped either log a warning (including identifiers like doc.id or index) or
raise a clear error; after filtering, if the resulting file_ids list is empty,
return early instead of calling upload_and_poll.

---

Nitpick comments:
In `@backend/app/tests/services/collections/providers/test_openai_provider.py`:
- Around line 132-146: The two test helper functions lack type annotations; add
Python 3.11+ type hints: annotate _make_doc(openai_file_id: Optional[str] =
None, file_size_kb: Optional[int] = None) -> SimpleNamespace (import Optional
from typing and SimpleNamespace from types) and annotate
_patch_session_and_crud() -> tuple[unittest.mock._patch, unittest.mock._patch]
(or -> tuple[Any, Any] with Any imported from typing if you prefer public
types), and update imports accordingly so the new helpers (_make_doc and
_patch_session_and_crud) comply with the repo's typing rules.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9446da95-b118-41f1-a29c-08d27ece525b

📥 Commits

Reviewing files that changed from the base of the PR and between cb97654 and fd37d14.

📒 Files selected for processing (5)
  • backend/app/celery/utils.py
  • backend/app/crud/rag/open_ai.py
  • backend/app/services/collections/providers/openai.py
  • backend/app/tests/services/collections/providers/test_openai_provider.py
  • backend/app/tests/services/collections/test_create_collection.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/app/celery/utils.py

Comment on lines +116 to +123
        if not docs:
            return

        try:
            batch = self.client.vector_stores.file_batches.upload_and_poll(
                vector_store_id=vector_store_id,
-               files=files,
+               files=[],
                file_ids=[doc.openai_file_id for doc in docs],

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate openai_file_id before calling the batch attach API.

This method now assumes every Document has already been uploaded, but Line 123 still forwards None values straight into file_ids. That turns a local contract violation into a provider-side failure with much worse debugging context.

Suggested patch
     ) -> None:
         if not docs:
             return
+
+        missing_file_ids = [str(doc.id) for doc in docs if not doc.openai_file_id]
+        if missing_file_ids:
+            raise ValueError(
+                "All documents must have openai_file_id before vector store attach: "
+                + ", ".join(missing_file_ids)
+            )
 
         try:
             batch = self.client.vector_stores.file_batches.upload_and_poll(
                 vector_store_id=vector_store_id,
                 files=[],
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/crud/rag/open_ai.py` around lines 116 - 123, The code calls
self.client.vector_stores.file_batches.upload_and_poll with file_ids built from
[doc.openai_file_id for doc in docs] without validating openai_file_id; change
the logic in the method that constructs batch/upload (the block using docs,
openai_file_id, vector_store_id, and upload_and_poll) to filter out any docs
where doc.openai_file_id is falsy before building file_ids, and if any docs were
dropped either log a warning (including identifiers like doc.id or index) or
raise a clear error; after filtering, if the resulting file_ids list is empty,
return early instead of calling upload_and_poll.


Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Collection: Split Celery tasks for uploads
Data Presentation: Improve empty file size

1 participant