fix(chunker): stop overlap-stride crawl that shatters long-prose files #41
Open
jdubdevs wants to merge 1 commit into
smart_chunk applied the overlap window even when the emitted chunk was smaller than the overlap. cut_offset - overlap_chars then landed before the chunk's own start, and the .max(start_offset + 1) guard advanced the start by a single character, re-selecting the same nearby high-score break point and crawling forward one char at a time.

Long, heading-dense prose files were shattered into hundreds of near-duplicate empty-heading micro-chunks (a 4.5k-word note produced 923 chunks; 907 of them 1-char-offset shrapnel), which (a) made the file unretrievable, because its signal split below threshold so it never entered the candidate set, and (b) bloated the index ~10x (451 files held 91% of all chunks).

Fix: only step back by the overlap window when the chunk is larger than it; otherwise advance fully to the cut. Guarantees forward progress, no crawl.

Validated: the note re-chunks 923 -> 28; all chunks vectorize; a unique-phrase search returns it at rank 1. Adds test_smart_chunk_no_overlap_crawl regression.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Problem
smart_chunk applies the overlap window even when the chunk it just emitted is smaller than the overlap. cut_offset - overlap_chars then lands before the chunk's own start, and the .max(start_offset + 1) guard advances the start by a single character, which re-selects the same nearby high-score break point and crawls forward one char at a time.

On long, heading-dense prose this shatters a file into hundreds of near-duplicate empty-heading micro-chunks. Observed on a real 4.5k-word note: 923 chunks, 907 of them 1-character-offset shrapnel (e.g. "…ystem currently authorizing…", "…stem currently authorizing…", "…tem currently authorizing…").

Impact: (a) the file became unretrievable, because its signal split below threshold so it never entered the candidate set, and (b) the index bloated ~10x (451 files held 91% of all chunks).
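To make the crawl concrete, here is a minimal sketch of the pre-fix stride step as the description above reads. It is a reconstruction, not the actual smart_chunk code: the function names next_start_buggy and count_crawl_steps are hypothetical, and saturating_sub is an assumption standing in for however the real code handles cut_offset - overlap_chars going below zero.

```rust
/// Hypothetical reconstruction of the pre-fix stride step.
/// Parameter names follow the PR text; the function itself is illustrative.
fn next_start_buggy(start_offset: usize, cut_offset: usize, overlap_chars: usize) -> usize {
    // Assumption: saturating_sub stands in for the real underflow handling.
    // When the chunk is shorter than the overlap, the stepped-back position
    // is at or before start_offset, so only the `+ 1` guard moves us forward.
    cut_offset
        .saturating_sub(overlap_chars)
        .max(start_offset + 1)
}

/// Counts how many chunks get emitted if every chunk is cut at the same
/// nearby break point `break_at` until the start finally reaches it.
fn count_crawl_steps(mut start: usize, break_at: usize, overlap_chars: usize) -> usize {
    let mut steps = 0;
    while start < break_at {
        start = next_start_buggy(start, break_at, overlap_chars);
        steps += 1;
    }
    steps
}
```

With an overlap of 200 and a break point 50 characters ahead, count_crawl_steps(0, 50, 200) takes 50 one-character steps, one emitted chunk per step. When the chunk is longer than the overlap, e.g. next_start_buggy(0, 1000, 200), the step lands at 800 and behaves as intended.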
Fix
Only step back by the overlap window when the emitted chunk is larger than that window; otherwise advance fully to the cut. This guarantees forward progress and eliminates the crawl. ~6 lines in smart_chunk.
Validation
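The note re-chunks 923 -> 28; all chunks vectorize; a unique-phrase search returns it at rank 1. Adds regression test test_smart_chunk_no_overlap_crawl (bounded chunk count + no degenerate micro-chunks). All 20 chunker tests pass.

🤖 Generated with Claude Code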
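For reviewers, the rule described under Fix can be sketched as a standalone step function. This is an illustration under stated assumptions, not the diff itself: next_start_fixed is a hypothetical name, and the sketch assumes cut_offset is always strictly greater than start_offset (every emitted chunk is non-empty).

```rust
/// Hypothetical sketch of the post-fix stride step described under "Fix".
/// Assumes cut_offset > start_offset, i.e. the emitted chunk is non-empty.
fn next_start_fixed(start_offset: usize, cut_offset: usize, overlap_chars: usize) -> usize {
    let chunk_len = cut_offset.saturating_sub(start_offset);
    if chunk_len > overlap_chars {
        // Chunk is larger than the overlap window: step back by the overlap.
        // The result is still strictly greater than start_offset.
        cut_offset - overlap_chars
    } else {
        // Chunk is no larger than the overlap: skip the overlap and
        // advance fully to the cut so the next chunk starts past this one.
        cut_offset
    }
}
```

Under that assumption both branches return a position strictly past start_offset, which is the forward-progress property the PR relies on. For example, next_start_fixed(100, 150, 200) returns 150 instead of crawling to 101, while next_start_fixed(0, 1000, 200) still returns 800, so normal-sized chunks keep their overlap.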