Chunk long notes before embedding #2

Description

@BeastOfShadow

Problem

sync_notes() and the ingest pipeline add each document to ChromaDB whole
(collection.add(documents=[content], ...)). A long note is therefore embedded as a
single vector covering its entire text, which dilutes semantic precision and hurts retrieval.

Proposal

Split note content into overlapping chunks before embedding, so retrieval returns
focused passages instead of whole files.
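
A minimal sketch of such a chunker, to make the proposal concrete. It assumes that whitespace-separated words are an acceptable stand-in for tokens (the real embedding tokenizer will count differently) and that notes are Markdown-like, with blank lines between paragraphs and `#` headings. The name `chunk_text` and its parameters are hypothetical, not existing code in the repo.

```python
import re


def chunk_text(text: str, max_tokens: int = 600, overlap_ratio: float = 0.12) -> list[str]:
    """Split text into overlapping chunks, preferring paragraph/heading boundaries.

    Tokens are approximated by whitespace-separated words, and whitespace
    inside a chunk is normalized to single spaces.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(max_tokens * overlap_ratio)

    # Split on blank lines, and on a newline that is followed by a heading.
    blocks = [b.strip() for b in re.split(r"\n\s*\n|\n(?=#{1,6} )", text) if b.strip()]

    chunks: list[str] = []
    current: list[str] = []  # words of the chunk being built
    has_new = False          # True once `current` holds more than carried-over overlap

    def emit() -> None:
        nonlocal current
        chunks.append(" ".join(current))
        # Carry the tail of the emitted chunk into the next one as overlap.
        current = current[-overlap:] if overlap else []

    for block in blocks:
        words = block.split()
        # Close the current chunk if this block would overflow it.
        if has_new and len(current) + len(words) > max_tokens:
            emit()
            has_new = False
        # A block larger than the remaining budget is cut into word windows.
        while len(current) + len(words) > max_tokens:
            room = max_tokens - len(current)
            current.extend(words[:room])
            words = words[room:]
            emit()
        current.extend(words)
        has_new = True

    if has_new:
        chunks.append(" ".join(current))
    return chunks
```

Paragraphs are packed greedily, so a short note still comes back as a single chunk, and only a paragraph that exceeds the budget on its own is cut mid-paragraph.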

Tasks

  • Add a chunker (e.g. ~500–800 tokens, ~10–15% overlap; split on headings/paragraphs first)
  • Apply it in src/database/vector_db.py (sync) and src/pipeline/orchestrator.py (ingest)
  • Store source + chunk index in metadata so sources still map back to the origin note
  • Update sync_notes() delete/update logic to handle multiple chunk IDs per file
  • Make chunk size / overlap configurable via .env
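
The metadata, delete/update, and configuration tasks above could look roughly like the sketch below. Everything named here is an assumption for illustration: the `CHUNK_SIZE` / `CHUNK_OVERLAP` variable names, the `source::chunk-N` ID scheme, and the helper functions. It also assumes the project already loads `.env` into the process environment, and that `collection` is a ChromaDB collection, whose `delete(where=...)` and `add(ids=..., documents=..., metadatas=...)` methods are the only calls used.

```python
import os


def load_chunk_config() -> tuple[int, float]:
    """Read chunk settings from the environment (variable names are hypothetical)."""
    size = int(os.getenv("CHUNK_SIZE", "600"))
    overlap = float(os.getenv("CHUNK_OVERLAP", "0.12"))
    return size, overlap


def build_chunk_records(source: str, chunks: list[str]) -> dict:
    """Build ids, documents, and metadatas for one note's chunks."""
    indices = range(len(chunks))
    return {
        # Deterministic IDs, so re-syncing the same note yields the same keys.
        "ids": [f"{source}::chunk-{i}" for i in indices],
        "documents": list(chunks),
        # `source` lets a retrieved chunk map back to its origin note.
        "metadatas": [{"source": source, "chunk_index": i} for i in indices],
    }


def replace_note_chunks(collection, source: str, chunks: list[str]) -> None:
    """Drop every stored chunk for `source`, then add the new ones."""
    # Deleting by metadata avoids having to know how many chunks existed before.
    collection.delete(where={"source": source})
    if chunks:  # skip the add entirely when the note has no content
        collection.add(**build_chunk_records(source, chunks))
```

Deleting by the `source` metadata field rather than by ID means a note that shrinks from five chunks to two leaves no stale chunks behind. The same call with no follow-up `add` would cover a note that was deleted from disk.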

Acceptance

A long note is indexed as multiple chunks; a query returns the relevant passage, and sources still resolve to the correct file.
