fix(context): prevent OOM crash on large llms-full.txt files #99
wbisschoff13 wants to merge 1 commit into
…s-full.txt files
Large markdown files (>1MB) like Cloudflare's llms-full.txt previously caused Node.js heap OOM because remark-parse built a full AST of the entire document. Now they are pre-split by `##` headings so each chunk is independently parseable with minimal memory.
Fixes: neuledge#99
🦋 Changeset detected
Latest commit: 84864dd
The changes in this PR will be included in the next version bump. This PR includes changesets to release 2 packages.
moshest left a comment:
Nice fix. This solves a real crash and the tests are easy to follow.
A few things before merge:
- CI is red, but it's only a formatting nitpick in the test file. Run `pnpm fix` and it should go green.
- Left one inline note about splitting on `##` inside code blocks.
- Heads up: a single huge section, or a file with no `##` headings at all, would still crash. That's fine for now, but maybe add a quick comment in the code so it's clear this isn't a full fix.
Changeset is already included, so that's covered.
Generated by Claude Code
    let current: string[] = [];

    for (const line of file.content.split("\n")) {
      if (line.startsWith("## ")) {
One thing to watch here. If a line inside a code block starts with `## `, it gets treated as a heading and the file is split in the wrong place. That could break code samples in exactly the big files this targets. Might be worth skipping lines inside fenced ``` blocks.
Generated by Claude Code
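As an illustration of the fence-skipping idea in the inline note above, here is a minimal sketch. The function name `splitByHeadingsOutsideFences` is hypothetical and this is not the PR's `splitMarkdownByHeadings()` implementation; it assumes fences are opened by three or more backticks or tildes with at most three spaces of indentation, and it does not handle indented code blocks.

```typescript
// Hypothetical helper (not the PR's code): split on "## " headings,
// but ignore lines that sit inside a fenced code block.
function splitByHeadingsOutsideFences(content: string): string[] {
  const chunks: string[] = [];
  let current: string[] = [];
  // Marker that opened the current fence, or null when outside a fence.
  let fence: string | null = null;

  for (const line of content.split("\n")) {
    const marker = /^ {0,3}(`{3,}|~{3,})/.exec(line)?.[1];
    if (marker !== undefined) {
      if (fence === null) {
        fence = marker; // opening fence
      } else if (
        marker[0] === fence[0] &&
        marker.length >= fence.length &&
        line.trim() === marker
      ) {
        fence = null; // closing fence: same character, at least as long, nothing after it
      }
    }

    // Start a new chunk only at a heading that is outside any fence.
    if (fence === null && line.startsWith("## ") && current.length > 0) {
      chunks.push(current.join("\n"));
      current = [];
    }
    current.push(line);
  }

  if (current.length > 0) chunks.push(current.join("\n"));
  return chunks;
}
```

Joining the returned chunks with `"\n"` gives back the original content, so the split only changes where parsing boundaries fall, not the text itself.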
moshest left a comment:
Following up on my earlier comment: I want to correct myself. I said the remaining crash cases were "fine for now." They're not.
The goal here is no crashes. Right now a file with no ## headings, or a single section that's still too big, will still run out of memory. So the same bug is still reachable, just less often. That shouldn't ship.
Can we make this handle any large file safely? A size-based fallback (split by lines once a chunk is still too big, and for files with no ## at all) would close the gap. Then no input can crash the build.
Splitting on `##` is a good start; it just needs to cover the cases it currently misses.
Generated by Claude Code
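The size-based fallback requested above could look roughly like this. It is a sketch under stated assumptions: `splitBySize` and `maxChars` are hypothetical names, size is measured in string length rather than bytes, and a single line longer than the limit is still kept whole, so that case would need extra handling.

```typescript
// Hypothetical fallback (not the PR's code): if a chunk is still too large,
// or the file has no "## " headings at all, split it on line boundaries.
function splitBySize(text: string, maxChars: number): string[] {
  if (text.length <= maxChars) return [text];

  const chunks: string[] = [];
  let current: string[] = [];
  let size = 0;

  for (const line of text.split("\n")) {
    const cost = line.length + 1; // +1 for the newline that join() restores
    // Flush before the limit is exceeded, but never emit an empty chunk.
    if (current.length > 0 && size + cost > maxChars) {
      chunks.push(current.join("\n"));
      current = [];
      size = 0;
    }
    current.push(line);
    size += cost;
  }

  if (current.length > 0) chunks.push(current.join("\n"));
  return chunks;
}
```

Applied to every chunk produced by the heading split, this bounds the size of what reaches the parser regardless of how the file is structured, apart from the overlong-line case noted above. Splitting mid-section can cut through a list or code block, so the right limit is a trade-off between memory and parse fidelity.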
Large markdown files (>1MB) like Cloudflare's llms-full.txt previously caused Node.js heap OOM because remark-parse built a full AST of the entire document.
Now they are pre-split by `##` headings before AST parsing, so each chunk stays small and is independently parseable.

Changes
- `packages/context/src/package-builder.ts`: Added `splitMarkdownByHeadings()` function and pre-processing in `buildPackage()` that splits large `.md`/`.mdx`/`.txt` files by `##` headings
- `packages/context/src/package-builder.test.ts`: Added 7 tests covering splitting behavior

Verification