Problem
sync_notes() and the ingest pipeline add whole documents to ChromaDB
(collection.add(documents=[content], ...)). Long notes become a single oversized
vector, diluting semantic precision and hurting retrieval.
Proposal
Split note content into overlapping chunks before embedding, so retrieval returns
focused passages instead of whole files.
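A minimal sketch of the splitting step, assuming a simple character-based sliding window. The function name chunk_text and the default chunk_size and overlap values are illustrative assumptions, not existing project code; a sentence- or token-aware splitter may turn out to be a better fit.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    chunk_size and overlap are hypothetical defaults, measured in characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text,
        # so no trailing chunk is fully contained in the previous one.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each chunk shares its first `overlap` characters with the end of the previous chunk, so a passage that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.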
Tasks
- Chunk content in src/database/vector_db.py (sync) and src/pipeline/orchestrator.py (ingest)
- Store source + chunk index in metadata so sources still map back to the origin note
- Update sync_notes() delete/update logic to handle multiple chunk-ids per file
- [...] .env
Acceptance
A long note is indexed as multiple chunks; a query returns the relevant passage, and sources still resolve to the correct file.
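One way the metadata and multi-id handling could look, as a sketch only. The helper name build_chunk_records, the "::chunk-N" id scheme, and the "chunk_index" metadata key are assumptions made for illustration; only the "source" key and the idea of a chunk index come from the tasks above.

```python
def build_chunk_records(source: str, chunks: list[str]) -> dict:
    """Build the ids, documents, and metadatas for one note's chunks.

    The id format and the "chunk_index" key are hypothetical choices.
    """
    return {
        # One deterministic id per chunk, derived from the file path.
        "ids": [f"{source}::chunk-{i}" for i in range(len(chunks))],
        "documents": list(chunks),
        # Every chunk carries the origin file, so a hit maps back to the note.
        "metadatas": [
            {"source": source, "chunk_index": i} for i in range(len(chunks))
        ],
    }


def unique_sources(metadatas: list[dict]) -> list[str]:
    """Collapse chunk-level hits to distinct source files, keeping order."""
    seen: dict[str, None] = {}
    for meta in metadatas:
        seen.setdefault(meta["source"], None)
    return list(seen)
```

Assuming `collection` is the project's existing ChromaDB collection, re-indexing a changed note could then call `collection.delete(where={"source": path})` to drop all of its old chunks, followed by `collection.add(**build_chunk_records(path, chunks))`. Deleting by metadata filter avoids having to track how many chunk ids a file had previously.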