harden untrusted-input, offline embedders, model_hash gate, CI & provenance #59
Merged
…enance

Remediates the high/medium findings from the security/privacy/compliance audit. Each fix ships with a test; full Rust + Python suites green.

security
- HNSW decode: range-check neighbour ids, entry_point and max_level against n_nodes so a hostile .nest can no longer index out of bounds (OOB panic/DoS) on any search path (ann/codec.rs).
- zstd sections: cap decompression at max(64 MiB, 128x compressed) and reject frames declaring more, killing the decompression-bomb OOM at open (zstd_codec.rs).
- BM25 v1/v2 decode: bound with_capacity hints from attacker-controlled counts (n_docs/n_terms/df), killing the allocation-bomb abort (bm25/codec.rs).
- model_hash honesty gate on the python surface: NestFile exposes model_hash, retrieve accepts expected_model_hash and raises on mismatch, and the flagship forge/retrieve.py verifies by default (was CLI-only; the advertised binding silently lost it).
- AVX2 int8 kernel: replace the unaligned *const i64 deref with read_unaligned (was UB on the exact-rerank hot path).

privacy
- force HF_HUB_OFFLINE/TRANSFORMERS_OFFLINE by default in every sentence-transformers entry point (embed_query, model_fingerprint, nest_build_corpus, convert_legacy); network is opt-in via NEST_ALLOW_DOWNLOAD. Stops an unexpected model fetch mid-run while handling data.
- resolve the HF snapshot the loaded revision points to (refs/main), not the alphabetical sorted()[0], so the model fingerprint attests the real model.
- EmbeddingCache is keyed by (chunk_id, model): re-embedding a corpus with a different model no longer reuses stale vectors under a new model_hash.

compliance / governance
- add CI (.github/workflows/ci.yml): build/test/clippy/fmt --locked + python tests + cargo-audit, so the gate is not laptop-only.
- track Cargo.lock (pin the rust dep set for reproducibility + CVE triage).
- add --version to the CLI; correct SECURITY.md (supported versions, unsafe scope, release provenance) and the README tier-1 citation wording.
- doc/data-governance.md: erasure/rotation posture for an immutable distributed .nest, provenance gaps, and encryption-at-rest requirement for sensitive data.
- dat/demo/README.md: per-source corpus license bill of materials.
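The model_hash gate described above (retrieve accepts expected_model_hash and raises on mismatch) might be sketched as follows. This is a minimal illustration, not the binding's actual code: the helper name check_model_hash, the exception type ModelHashMismatch, and the choice to treat None as "no expectation pinned" are all assumptions made for the example.

```python
from typing import Optional


class ModelHashMismatch(ValueError):
    """Hypothetical error: the file's model_hash differs from what the caller pinned."""


def check_model_hash(file_model_hash: str, expected_model_hash: Optional[str]) -> None:
    """Fail closed when a pinned hash does not match the file's recorded hash.

    Passing None skips the check; that opt-out is an assumption of this sketch.
    """
    if expected_model_hash is None:
        return
    if file_model_hash != expected_model_hash:
        raise ModelHashMismatch(
            f"model_hash mismatch: file has {file_model_hash!r}, "
            f"expected {expected_model_hash!r}"
        )


# Usage: a caller that embeds queries with a known model pins its hash up front,
# so a .nest built with a different embedder is rejected before any search runs.
check_model_hash("sha256:aaa", "sha256:aaa")  # matches, returns silently
```

Raising rather than warning keeps the failure loud: vectors from a different embedder would otherwise be compared in the wrong space without any visible error.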
brennercruvinel
approved these changes
Jul 1, 2026
Audit remediation — security / privacy / compliance (high + medium)
Remediates the high and medium findings from the security/privacy/compliance audit of HEAD (v0.3.0). Every code fix ships with a test; the full Rust and Python suites are green locally. The verdict of the audit stands: the core architecture is sound; the risk was in the operational perimeter (hostile-input hardening, process/CI, provenance), which this PR closes.

Security
| Finding | Fix | Files |
|---|---|---|
| HNSW decode: neighbour ids, entry_point/max_level never range-checked → OOB panic (DoS) on a hostile .nest, any search path | Range-check neighbour ids and entry_point < n_nodes and cap max_level at decode; typed error, mirroring the graph CSR parser | crates/nest-runtime/src/ann/codec.rs |
| zstd sections: uncapped decompression → decompression-bomb OOM at open() | Cap decompression at max(64 MiB, 128× compressed); reject an over-declaring frame before allocating | crates/nest-format/src/encoding/zstd_codec.rs |
| BM25 v1/v2 decode: with_capacity from unvalidated n_docs/n_terms/df → allocation-bomb abort at open() | Bound the preallocation hints (MAX_PREALLOC), grow on demand | crates/nest-runtime/src/bm25/codec.rs |
| model_hash honesty gate lived only in the CLI; the Python NestFile.retrieve flagship silently lost it | Expose NestFile.model_hash, add expected_model_hash to retrieve (raises on mismatch), and verify by default in forge/retrieve.py | nest-python/src/{nest_file,retrieve_fn}.rs, nest-runtime/src/mmap_file.rs, python/forge/retrieve.py |
| search-text could fetch a model from HuggingFace before validation (network egress mid-run) | Force HF_HUB_OFFLINE/TRANSFORMERS_OFFLINE by default; network is opt-in via NEST_ALLOW_DOWNLOAD=1 | python/embed_query.py + build entry points |
| AVX2 int8 kernel: unaligned deref of *const i64 (UB on the rerank hot path) | ptr::read_unaligned + a SAFETY note | crates/nest-runtime/src/simd/avx2.rs |
| HF snapshot resolved via sorted()[0] (alphabetical), not the loaded revision → fingerprint can attest the wrong model | Resolve the snapshot refs/main points to; fail closed if ambiguous | python/model_fingerprint.py, embed_query.py |

Privacy
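The offline-by-default behaviour (HF_HUB_OFFLINE/TRANSFORMERS_OFFLINE forced unless NEST_ALLOW_DOWNLOAD=1) could look roughly like the sketch below. The function name force_offline_unless_opted_in and its boolean return value are hypothetical; only the three environment variable names come from this PR. Because the Hugging Face libraries typically read these switches when they are imported, a guard like this would have to run before those imports.

```python
import os
from typing import MutableMapping


def force_offline_unless_opted_in(env: MutableMapping[str, str] = os.environ) -> bool:
    """Set the Hugging Face offline switches unless the caller opted in to downloads.

    Returns True when offline mode was forced, False when NEST_ALLOW_DOWNLOAD=1
    left the environment untouched. The helper itself is illustrative only.
    """
    if env.get("NEST_ALLOW_DOWNLOAD") == "1":
        return False
    env["HF_HUB_OFFLINE"] = "1"
    env["TRANSFORMERS_OFFLINE"] = "1"
    return True


# Usage: call at the top of an entry point, before importing the embedder stack.
forced = force_offline_unless_opted_in()
```

Taking the environment as a parameter keeps the guard testable with a plain dict, which is the kind of check a file such as test_offline_guard.py would need.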
- Offline by default on the search-text + build paths.
- EmbeddingCache keyed on chunk_id only → re-embedding with a different model reused stale vectors under the new model_hash. Now keyed by (chunk_id, model) (new embeddings_v2 table; old cache is re-embedded, never misread). File: python/builder.py

Compliance / governance
- Add CI (.github/workflows/ci.yml): build/test/clippy/fmt --locked + the Python suite + cargo-audit. The quality gate is no longer laptop-only.
- Track Cargo.lock (pin the Rust dep set for reproducibility + CVE triage) and build --locked in CI. (GitHub flags 2 pre-existing dependabot advisories on the dep tree; cargo-audit now surfaces them in CI; triaging/pinning those specific deps is a follow-up.)
- Correct the release-provenance statement in SECURITY.md (releases/tags not yet signed, no SBOM) + add nest --version so a deployed binary is identifiable.
- doc/data-governance.md: erasure/rotation posture for an immutable, distributed .nest, and encryption-at-rest as a required control for sensitive data.
- dat/demo/README.md: per-source corpus license bill of materials + redistribution obligations.
- Corrected the SECURITY.md unsafe-scope overclaim (C7) and documented the unkeyed-checksum threat model (S9); the tree is now ruff check-clean.

Testing
- Hostile-input tests: out-of-range entry_point; zstd bomb rejected; BM25 n_docs/df bombs.
- test_offline_guard.py (offline forced by default + opt-in respected); test_builder.py gains the model-keyed-cache and the retrieve model_hash-gate tests.
- Green locally: cargo test --workspace, tests/test_e2e.py, tests/test_builder.py, tests/test_search_text_model_hash.py, clippy/fmt clean. The shipped 119 MB corpus still validates (the zstd cap does not touch it; all sections are raw).

Out of scope (documented follow-ups: the audit's low/info items)
-- argv separator in the embedder shell-out (S8), query-via-stdin to keep PHI out of ps (P4), auto-installed PHI pre-commit hook (P5), tool scratch off /tmp (P6), the stray -l doc-comment sweep (P7), signed release tags + SBOM generation, and triaging the pre-existing dependabot advisories.

No Co-Authored-By/attribution footers, per repo convention.
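As an illustration of the (chunk_id, model) cache key described under Privacy, the sketch below shows why a composite key prevents stale reuse. It assumes a SQLite-backed cache; only the table name embeddings_v2 comes from this PR, while the column names and the helpers open_cache, put_vector and get_vector are hypothetical.

```python
import sqlite3
from typing import Optional


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open the cache and create the table with a composite primary key."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS embeddings_v2 ("
        " chunk_id TEXT NOT NULL,"
        " model TEXT NOT NULL,"
        " vector BLOB NOT NULL,"
        " PRIMARY KEY (chunk_id, model))"
    )
    return conn


def put_vector(conn: sqlite3.Connection, chunk_id: str, model: str, vector: bytes) -> None:
    """Store one vector under its (chunk_id, model) pair."""
    conn.execute(
        "INSERT OR REPLACE INTO embeddings_v2 (chunk_id, model, vector) VALUES (?, ?, ?)",
        (chunk_id, model, vector),
    )


def get_vector(conn: sqlite3.Connection, chunk_id: str, model: str) -> Optional[bytes]:
    """Return the cached vector for this exact model, or None on a miss."""
    row = conn.execute(
        "SELECT vector FROM embeddings_v2 WHERE chunk_id = ? AND model = ?",
        (chunk_id, model),
    ).fetchone()
    return row[0] if row else None


# Usage: a vector cached for one model is a miss for any other model,
# so switching embedders forces a re-embed instead of reusing old vectors.
conn = open_cache()
put_vector(conn, "chunk-1", "model-a", b"\x01\x02")
```

With chunk_id alone as the key, the lookup for a second model would have returned the first model's bytes, which is the stale-vector behaviour this PR removes.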