harden untrusted-input, offline embedders, model_hash gate, CI & provenance #59

Merged: brennercruvinel merged 4 commits into main from hardening/audit-remediation on Jul 1, 2026

han-hoff (Collaborator) commented on Jul 1, 2026

Audit remediation — security / privacy / compliance (high + medium)

Remediates the high and medium findings from the security/privacy/compliance audit of HEAD (v0.3.0). Every code fix ships with a test, and the full Rust and Python suites are green locally. The audit's verdict stands: the core architecture is sound, and the risk was in the operational perimeter (hostile-input hardening, process/CI, provenance), which this PR closes.

Security

| # | Finding | Fix | Where |
|---|---|---|---|
| S1 (high) | HNSW neighbour ids / entry_point / max_level never range-checked → OOB panic (DoS) on a hostile .nest, any search path | range-check every id + entry_point < n_nodes and cap max_level at decode; typed error, mirroring the graph CSR parser | crates/nest-runtime/src/ann/codec.rs |
| S2 (high) | zstd sections had no decompressed-size cap → decompression bomb OOM at open() | cap at max(64 MiB, 128× compressed); reject an over-declaring frame before allocating | crates/nest-format/src/encoding/zstd_codec.rs |
| S3 (high) | BM25 v1/v2 with_capacity from unvalidated n_docs/n_terms/df → allocation-bomb abort at open() | bound every capacity hint (MAX_PREALLOC), grow on demand | crates/nest-runtime/src/bm25/codec.rs |
| S4 (high) | model_hash honesty gate lived only in the CLI; the Python NestFile.retrieve flagship silently lost it | expose NestFile.model_hash, add expected_model_hash to retrieve (raises on mismatch), and verify by default in forge/retrieve.py | nest-python/src/{nest_file,retrieve_fn}.rs, nest-runtime/src/mmap_file.rs, python/forge/retrieve.py |
| S5 (med) | search-text could fetch a model from HuggingFace before validation (network egress mid-run) | force HF_HUB_OFFLINE/TRANSFORMERS_OFFLINE by default; network is opt-in via NEST_ALLOW_DOWNLOAD=1 | python/embed_query.py + build entry points |
| S6 (med) | AVX2 int8 kernel dereferenced an unaligned *const i64 (UB on the rerank hot path) | ptr::read_unaligned + a SAFETY note | crates/nest-runtime/src/simd/avx2.rs |
| S7 (med) | model snapshot picked via sorted()[0] (alphabetical), not the loaded revision → fingerprint can attest the wrong model | resolve the revision refs/main points to; fail closed if ambiguous | python/model_fingerprint.py, embed_query.py |
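
The three decode-time fixes (S1 to S3) share one pattern: every count, id, and size read from a `.nest` is treated as attacker-controlled and bounded before it is used to index or allocate. The real fixes are in Rust; what follows is only a Python sketch of the arithmetic. All function names are hypothetical, and the `MAX_PREALLOC` value is an assumption, since the PR names the constant but not its size.

```python
MIB = 1024 * 1024
MIN_CAP = 64 * MIB        # floor from S2: 64 MiB
RATIO_CAP = 128           # multiplier from S2: 128x the compressed size
MAX_PREALLOC = 1 << 20    # assumed value; the PR does not state the real one


def decompressed_cap(compressed_len: int) -> int:
    """Largest decompressed size accepted: max(64 MiB, 128 x compressed)."""
    return max(MIN_CAP, RATIO_CAP * compressed_len)


def check_declared_size(declared_len: int, compressed_len: int) -> None:
    """Reject an over-declaring frame before any buffer is allocated."""
    cap = decompressed_cap(compressed_len)
    if declared_len > cap:
        raise ValueError(f"declared size {declared_len} exceeds cap {cap}")


def capacity_hint(untrusted_count: int) -> int:
    """Bound a pre-allocation hint; the container still grows on demand."""
    return min(max(untrusted_count, 0), MAX_PREALLOC)


def check_node_id(node_id: int, n_nodes: int) -> int:
    """Range-check a neighbour id or entry point read from the file."""
    if not 0 <= node_id < n_nodes:
        raise ValueError(f"node id {node_id} out of range for {n_nodes} nodes")
    return node_id
```

Under these assumptions a 1 KiB section may still expand up to the 64 MiB floor, and a 1 MiB section may expand to 128 MiB; a frame declaring 1 GiB against either is rejected before allocation.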

Privacy

| # | Finding | Fix |
|---|---|---|
| P1 (med) | offline / "never leaves the box" claim broke on the search-text + build paths | same offline-by-default guard as S5, across all sentence-transformers entry points |
| P2 (med) | EmbeddingCache keyed on chunk_id only → re-embedding with a different model reused stale vectors under the new model_hash | key the cache on (chunk_id, model) (new embeddings_v2 table; old cache is re-embedded, never misread); python/builder.py |
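
The P2 fix comes down to the cache key. Below is a minimal sketch using the standard-library `sqlite3` module, assuming a two-column primary key. The `embeddings_v2` table name comes from this PR; the column names, the `put`/`get` helpers, and the model labels are hypothetical and not taken from `python/builder.py`.

```python
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE embeddings_v2 (
        chunk_id TEXT NOT NULL,
        model    TEXT NOT NULL,
        vector   BLOB NOT NULL,
        PRIMARY KEY (chunk_id, model)  -- the model is part of the key
    )
    """
)


def put(chunk_id: str, model: str, vector: bytes) -> None:
    """Store a vector under the (chunk_id, model) pair."""
    conn.execute(
        "INSERT OR REPLACE INTO embeddings_v2 (chunk_id, model, vector) "
        "VALUES (?, ?, ?)",
        (chunk_id, model, vector),
    )


def get(chunk_id: str, model: str) -> Optional[bytes]:
    """Return the cached vector for this exact pair, or None on a miss."""
    row = conn.execute(
        "SELECT vector FROM embeddings_v2 WHERE chunk_id = ? AND model = ?",
        (chunk_id, model),
    ).fetchone()
    return row[0] if row else None


put("chunk-1", "model-a", b"\x01\x02")
# Same chunk, different model: a miss, so the caller re-embeds
# instead of reusing the vector produced by model-a.
assert get("chunk-1", "model-b") is None
assert get("chunk-1", "model-a") == b"\x01\x02"
```

Keying on the pair means a model change can only ever produce cache misses, never a vector from the wrong model.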

Compliance / governance

  • C3: add CI (.github/workflows/ci.yml) running build/test/clippy/fmt --locked, the Python suite, and cargo-audit. The quality gate is no longer laptop-only.
  • C4: track Cargo.lock (pin the Rust dep set for reproducibility + CVE triage) and build --locked in CI. (GitHub flags 2 pre-existing dependabot advisories on the dep tree; cargo-audit now surfaces them in CI, and triaging/pinning those specific deps is a follow-up.)
  • C5: honest provenance in SECURITY.md (releases/tags not yet signed, no SBOM) + add nest --version so a deployed binary is identifiable.
  • C6: fix the README citation contradiction ("reopen the exact byte span" → tier-1 stored text + offsets + hashes, matching the code).
  • C1: doc/data-governance.md documents the erasure/rotation posture for an immutable, distributed .nest, and encryption-at-rest as a required control for sensitive data.
  • C2: dat/demo/README.md lists the per-source corpus license bill of materials + redistribution obligations.
  • Bonus (touched-file hygiene): corrected the SECURITY.md unsafe-scope overclaim (C7) and documented the unkeyed-checksum threat model (S9); the tree is now ruff check-clean.

Testing

  • New Rust negative tests (all green): HNSW out-of-range id v1/v2 + entry_point; zstd bomb rejected; BM25 n_docs/df bombs.
  • New Python tests: test_offline_guard.py (offline forced by default + opt-in respected); test_builder.py gains the model-keyed-cache and the retrieve model_hash-gate tests.
  • Existing suites unchanged and green: cargo test --workspace, tests/test_e2e.py, tests/test_builder.py, tests/test_search_text_model_hash.py, clippy/fmt clean. The shipped 119 MB corpus still validates (zstd cap does not touch it — all sections are raw).
  • S6 (AVX2) is x86_64-gated and validated by CI (ubuntu x86_64); not compiled on the ARM dev host.
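
The two cases that test_offline_guard.py is described as covering (offline forced by default, opt-in respected) can be sketched as follows. `enforce_offline` is a hypothetical helper, not the function in python/embed_query.py. The environment variable names are the ones given in S5, and passing the environment in as a parameter is an assumption made here to keep the example testable.

```python
import os
from typing import MutableMapping


def enforce_offline(env: MutableMapping[str, str] = os.environ) -> bool:
    """Force the Hugging Face libraries offline unless NEST_ALLOW_DOWNLOAD=1.

    Returns True when offline mode was forced. A guard like this should run
    before the embedding libraries are imported, because they may read these
    variables at import time.
    """
    if env.get("NEST_ALLOW_DOWNLOAD") == "1":
        return False  # explicit opt-in: leave the network settings alone
    env["HF_HUB_OFFLINE"] = "1"
    env["TRANSFORMERS_OFFLINE"] = "1"
    return True


# Default: offline is forced.
default_env = {}
assert enforce_offline(default_env) is True
assert default_env == {"HF_HUB_OFFLINE": "1", "TRANSFORMERS_OFFLINE": "1"}

# Opt-in: nothing is set.
opt_in_env = {"NEST_ALLOW_DOWNLOAD": "1"}
assert enforce_offline(opt_in_env) is False
assert "HF_HUB_OFFLINE" not in opt_in_env
```

Overwriting the two variables rather than using `setdefault` is a deliberate choice in this sketch: a stray `HF_HUB_OFFLINE=0` inherited from the shell cannot re-enable the network without the explicit opt-in.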

Out of scope (documented follow-ups — the audit's low/info items)

`--` argv separator in the embedder shell-out (S8), query-via-stdin to keep PHI out of ps (P4), auto-installed PHI pre-commit hook (P5), tool scratch off /tmp (P6), the stray-l doc-comment sweep (P7), signed release tags + SBOM generation, and triaging the pre-existing dependabot advisories.

No Co-Authored-By/attribution footers, per repo convention.

…enance

Remediates the high/medium findings from the security/privacy/compliance
audit. Each fix ships with a test; full Rust + Python suites green.

security
- HNSW decode: range-check neighbour ids, entry_point and max_level against
  n_nodes so a hostile .nest can no longer index out of bounds (OOB panic/DoS)
  on any search path (ann/codec.rs).
- zstd sections: cap decompression at max(64 MiB, 128x compressed) and reject
  frames declaring more, killing the decompression-bomb OOM at open
  (zstd_codec.rs).
- BM25 v1/v2 decode: bound with_capacity hints from attacker-controlled
  counts (n_docs/n_terms/df), killing the allocation-bomb abort (bm25/codec.rs).
- model_hash honesty gate on the python surface: NestFile exposes model_hash,
  retrieve accepts expected_model_hash and raises on mismatch, and the flagship
  forge/retrieve.py verifies by default (was CLI-only; the advertised binding
  silently lost it).
- AVX2 int8 kernel: replace the unaligned *const i64 deref with
  read_unaligned (was UB on the exact-rerank hot path).

privacy
- force HF_HUB_OFFLINE/TRANSFORMERS_OFFLINE by default in every
  sentence-transformers entry point (embed_query, model_fingerprint,
  nest_build_corpus, convert_legacy); network is opt-in via NEST_ALLOW_DOWNLOAD.
  Stops an unexpected model fetch mid-run while handling data.
- resolve the HF snapshot the loaded revision points to (refs/main), not the
  alphabetical sorted()[0], so the model fingerprint attests the real model.
- EmbeddingCache is keyed by (chunk_id, model): re-embedding a corpus with a
  different model no longer reuses stale vectors under a new model_hash.

compliance / governance
- add CI (.github/workflows/ci.yml): build/test/clippy/fmt --locked + python
  tests + cargo-audit, so the gate is not laptop-only.
- track Cargo.lock (pin the rust dep set for reproducibility + CVE triage).
- add --version to the CLI; correct SECURITY.md (supported versions, unsafe
  scope, release provenance) and the README tier-1 citation wording.
- doc/data-governance.md: erasure/rotation posture for an immutable distributed
  .nest, provenance gaps, and encryption-at-rest requirement for sensitive data.
- dat/demo/README.md: per-source corpus license bill of materials.

han-hoff requested a review from brennercruvinel on July 1, 2026 13:51
brennercruvinel merged commit 9fc2d00 into main on Jul 1, 2026
7 checks passed
brennercruvinel deleted the hardening/audit-remediation branch on July 1, 2026 17:53