feat: import pipeline, search benchmark, asymmetric query prefixes by jp-cruz · Pull Request #4 · LegionForge/jeli

jp-cruz · 2026-07-03T05:54:20Z

Overnight build, validated against a real 2,328-memory corpus in a staging database.

scripts/import_markdown.py — generic bulk importer; every record passes through capture_memory (an importer bypassing the scoped write path would be a second, unguarded door). Section chunking, path→type classification, hash dedupe, idempotent re-runs, report output.
Asymmetric retrieval fix — embed_query() with per-model prefixes (arctic query: ); semantic search uses it.
scripts/bench_search.py — p50/p95 + verify_chain at scale.

Real-corpus results (2,328 memories): chain valid at full count · semantic p50 161ms end-to-end (HNSW itself ≈15ms — query embedding dominates; keep-alive/warm-model is the lever) · fts p50 14.5ms · verify_chain <100ms · top-1 retrieval quality confirmed on live queries.

Field finding for the judicial queue: injection regexes flag ~7% of security-engineering prose (notes that describe attacks trip the patterns). Documented in the import report; refinement belongs to the warn-daemon work, not this PR.

🤖 Generated with Claude Code

…efixes - scripts/import_markdown.py: generic, config-driven bulk importer — every record goes through capture_memory (chained, audited, injection-screened); section-level chunking, path-based type classification, content-hash dedupe (idempotent re-runs), markdown report. No personal paths baked in. - embedding: embed_query() with per-model retrieval prefixes (arctic-embed 'query: ', nomic 'search_query: ') — asymmetric models need query-side conditioning; semantic search now uses it. - scripts/bench_search.py: p50/p95 latency for semantic+fts and verify_chain runtime at store scale. Validated against a real 2,328-memory corpus (staging): chain valid at full count; semantic p50 161ms end-to-end (≈15ms of that is HNSW — query embedding dominates); verify_chain <100ms; retrieval quality sanity-checked. Field finding: injection heuristics flag ~7% of security-engineering prose — documented for the judicial-review queue. Co-Authored-By: Claude Fable 5 <[email protected]>

CI's live-Postgres concurrency test caught it: now() is transaction-START time, and chain writes serialize on the advisory lock AFTER their transactions begin — concurrent writers tie on created_at, so both the find-latest query and the verify walk saw an order that wasn't the chain. - alembic 005: chain_seq (sequence default, claimed inside the locked insert => strictly increasing in true chain order), unique, backfilled for pre-005 rows in (created_at, id) order (sequential data, safe) - prev-hash lookup + verify walk now order by chain_seq - integration stub: embed_query = embed (symmetric) so exact-match distance assertions survive the asymmetric prefix change 131 unit + 5 live integration green; 005 applied to prod+staging; staging chain re-verified valid at 2,328 records post-migration. Co-Authored-By: Claude Fable 5 <[email protected]>

jp-cruz and others added 2 commits July 3, 2026 00:54

jp-cruz merged commit 3daec02 into main Jul 3, 2026
11 checks passed

jp-cruz deleted the feat/import-pipeline branch July 3, 2026 05:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: import pipeline, search benchmark, asymmetric query prefixes#4

feat: import pipeline, search benchmark, asymmetric query prefixes#4
jp-cruz merged 2 commits into
mainfrom
feat/import-pipeline

jp-cruz commented Jul 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

jp-cruz commented Jul 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant