harden untrusted-input, offline embedders, model_hash gate, CI & provenance by han-hoff · Pull Request #59 · hoffresearch/nest

han-hoff · 2026-07-01T13:47:05Z

Audit remediation — security / privacy / compliance (high + medium)

Remediates the high and medium findings from the security/privacy/compliance audit of HEAD (v0.3.0). Every code fix ships with a test; the full Rust and Python suites are green locally. The audit's verdict stands: the core architecture is sound, and the risk was in the operational perimeter (hostile-input hardening, process/CI, provenance), which this PR closes.

Security

| # | Finding | Fix | Where |
|---|---|---|---|
| S1 (high) | HNSW neighbour ids / `entry_point` / `max_level` never range-checked → OOB panic (DoS) on a hostile `.nest`, any search path | range-check every id + `entry_point` < `n_nodes` and cap `max_level` at decode; typed error, mirroring the graph CSR parser | `crates/nest-runtime/src/ann/codec.rs` |
| S2 (high) | zstd sections had no decompressed-size cap → decompression bomb OOM at `open()` | cap at `max(64 MiB, 128× compressed)`; reject an over-declaring frame before allocating | `crates/nest-format/src/encoding/zstd_codec.rs` |
| S3 (high) | BM25 v1/v2 `with_capacity` from unvalidated `n_docs`/`n_terms`/`df` → allocation-bomb abort at `open()` | bound every capacity hint (`MAX_PREALLOC`), grow on demand | `crates/nest-runtime/src/bm25/codec.rs` |
| S4 (high) | `model_hash` honesty gate lived only in the CLI; the Python `NestFile.retrieve` flagship silently lost it | expose `NestFile.model_hash`, add `expected_model_hash` to `retrieve` (raises on mismatch), and verify by default in `forge/retrieve.py` | `nest-python/src/{nest_file,retrieve_fn}.rs`, `nest-runtime/src/mmap_file.rs`, `python/forge/retrieve.py` |
| S5 (med) | `search-text` could fetch a model from HuggingFace before validation (network egress mid-run) | force `HF_HUB_OFFLINE`/`TRANSFORMERS_OFFLINE` by default; network is opt-in via `NEST_ALLOW_DOWNLOAD=1` | `python/embed_query.py` + build entry points |
| S6 (med) | AVX2 int8 kernel dereferenced an unaligned `*const i64` (UB on the rerank hot path) | `ptr::read_unaligned` + a SAFETY note | `crates/nest-runtime/src/simd/avx2.rs` |
| S7 (med) | model snapshot picked via `sorted()[0]` (alphabetical), not the loaded revision → fingerprint can attest the wrong model | resolve the revision `refs/main` points to; fail closed if ambiguous | `python/model_fingerprint.py`, `embed_query.py` |
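
To make the S1 and S2 decode-time checks concrete, here is a minimal Python sketch of the two rules as the table states them. It is an illustration only, not the Rust code in `ann/codec.rs` or `zstd_codec.rs`: the function names, the `ValueError` error type, and the constant names are assumptions made for this example, and the `max_level` cap is left out because the PR does not state its value.

```python
MIB = 1024 * 1024
MIN_CAP = 64 * MIB   # floor taken from the S2 row
RATIO_CAP = 128      # multiplier taken from the S2 row


def max_decompressed_size(compressed_len: int) -> int:
    """Largest decompressed size accepted: max(64 MiB, 128 x compressed)."""
    return max(MIN_CAP, RATIO_CAP * compressed_len)


def check_declared_size(declared: int, compressed_len: int) -> None:
    """Reject an over-declaring frame before any buffer is allocated."""
    cap = max_decompressed_size(compressed_len)
    if declared > cap:
        raise ValueError(f"declared size {declared} exceeds cap {cap}")


def check_hnsw_ids(neighbour_ids, entry_point: int, n_nodes: int) -> None:
    """Range-check the entry point and every neighbour id against n_nodes."""
    if not 0 <= entry_point < n_nodes:
        raise ValueError(f"entry_point {entry_point} out of range (n_nodes={n_nodes})")
    for nid in neighbour_ids:
        if not 0 <= nid < n_nodes:
            raise ValueError(f"neighbour id {nid} out of range (n_nodes={n_nodes})")
```

Both checks run on header values alone, so a hostile file is rejected with a typed error before the decoder allocates memory or indexes into the graph.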

Privacy

| # | Finding | Fix |
|---|---|---|
| P1 (med) | offline / "never leaves the box" claim broke on the `search-text` + build paths | same offline-by-default guard as S5, across all sentence-transformers entry points |
| P2 (med) | `EmbeddingCache` keyed on `chunk_id` only → re-embedding with a different model reused stale vectors under the new `model_hash` | key the cache on `(chunk_id, model)` (new `embeddings_v2` table; old cache is re-embedded, never misread), in `python/builder.py` |
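
The P2 fix comes down to a composite cache key. The sketch below shows the idea with an in-memory SQLite table; only the table name `embeddings_v2` and the `(chunk_id, model)` key come from this PR. The column names, the helper functions, and the use of SQLite at all are assumptions for illustration, not the actual `EmbeddingCache` in `python/builder.py`.

```python
import sqlite3


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open a cache whose primary key is (chunk_id, model), not chunk_id alone."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS embeddings_v2 ("
        " chunk_id TEXT NOT NULL,"
        " model TEXT NOT NULL,"
        " vector BLOB NOT NULL,"
        " PRIMARY KEY (chunk_id, model))"
    )
    return conn


def put(conn: sqlite3.Connection, chunk_id: str, model: str, vector: bytes) -> None:
    """Store a vector under the (chunk_id, model) pair."""
    conn.execute(
        "INSERT OR REPLACE INTO embeddings_v2 (chunk_id, model, vector) VALUES (?, ?, ?)",
        (chunk_id, model, vector),
    )


def get(conn: sqlite3.Connection, chunk_id: str, model: str):
    """Return the cached vector for this exact model, or None on a miss."""
    row = conn.execute(
        "SELECT vector FROM embeddings_v2 WHERE chunk_id = ? AND model = ?",
        (chunk_id, model),
    ).fetchone()
    return None if row is None else row[0]
```

With this key, a lookup for the same chunk under a different model is a miss, so the chunk is re-embedded instead of reusing a stale vector.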

Compliance / governance

C3 — add CI (.github/workflows/ci.yml): build/test/clippy/fmt --locked + the Python suite + cargo-audit. The quality gate is no longer laptop-only.
C4 — track Cargo.lock (pin the Rust dep set for reproducibility + CVE triage) and build --locked in CI. (GitHub flags 2 pre-existing dependabot advisories on the dep tree; cargo-audit now surfaces them in CI — triaging/pinning those specific deps is a follow-up.)
C5 — honest provenance in SECURITY.md (releases/tags not yet signed, no SBOM) + add nest --version so a deployed binary is identifiable.
C6 — fix the README citation contradiction ("reopen the exact byte span" → tier-1 stored text + offsets + hashes, matching the code).
C1 — doc/data-governance.md: erasure/rotation posture for an immutable, distributed .nest, and encryption-at-rest as a required control for sensitive data.
C2 — dat/demo/README.md: per-source corpus license bill of materials + redistribution obligations.
Bonus (touched-file hygiene): corrected the SECURITY.md unsafe-scope overclaim (C7) and documented the unkeyed-checksum threat model (S9); the tree is now ruff check-clean.

Testing

New Rust negative tests (all green): HNSW out-of-range id v1/v2 + entry_point; zstd bomb rejected; BM25 n_docs/df bombs.
New Python tests: test_offline_guard.py (offline forced by default + opt-in respected); test_builder.py gains the model-keyed-cache and the retrieve model_hash-gate tests.
Existing suites unchanged and green: cargo test --workspace, tests/test_e2e.py, tests/test_builder.py, tests/test_search_text_model_hash.py, clippy/fmt clean. The shipped 119 MB corpus still validates (zstd cap does not touch it — all sections are raw).
S6 (AVX2) is x86_64-gated and validated by CI (ubuntu x86_64); not compiled on the ARM dev host.
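
The offline-by-default behaviour that `test_offline_guard.py` covers could be sketched roughly as below. The environment variable names are the ones this PR lists; the function name `apply_offline_guard`, its return value, and the injectable `env` mapping are hypothetical, chosen here so the behaviour can be checked without touching the real process environment.

```python
import os


def apply_offline_guard(env=None) -> bool:
    """Force Hugging Face offline mode unless NEST_ALLOW_DOWNLOAD=1 is set.

    Returns True when offline mode was forced, False when the opt-in was honoured.
    """
    env = os.environ if env is None else env
    if env.get("NEST_ALLOW_DOWNLOAD") == "1":
        return False  # explicit opt-in: leave network settings untouched
    env["HF_HUB_OFFLINE"] = "1"
    env["TRANSFORMERS_OFFLINE"] = "1"
    return True
```

A guard of this kind would need to run before the embedding libraries are imported, since they generally read these variables at import time.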

Out of scope (documented follow-ups — the audit's low/info items)

The `--` argv separator in the embedder shell-out (S8), query-via-stdin to keep PHI out of `ps` (P4), auto-installed PHI pre-commit hook (P5), tool scratch off `/tmp` (P6), the stray-l doc-comment sweep (P7), signed release tags + SBOM generation, and triaging the pre-existing dependabot advisories.

No Co-Authored-By/attribution footers, per repo convention.

…enance

Remediates the high/medium findings from the security/privacy/compliance audit. Each fix ships with a test; full Rust + Python suites green.

security
- HNSW decode: range-check neighbour ids, entry_point and max_level against n_nodes so a hostile .nest can no longer index out of bounds (OOB panic/DoS) on any search path (ann/codec.rs).
- zstd sections: cap decompression at max(64 MiB, 128x compressed) and reject frames declaring more, killing the decompression-bomb OOM at open (zstd_codec.rs).
- BM25 v1/v2 decode: bound with_capacity hints from attacker-controlled counts (n_docs/n_terms/df), killing the allocation-bomb abort (bm25/codec.rs).
- model_hash honesty gate on the python surface: NestFile exposes model_hash, retrieve accepts expected_model_hash and raises on mismatch, and the flagship forge/retrieve.py verifies by default (was CLI-only; the advertised binding silently lost it).
- AVX2 int8 kernel: replace the unaligned *const i64 deref with read_unaligned (was UB on the exact-rerank hot path).

privacy
- force HF_HUB_OFFLINE/TRANSFORMERS_OFFLINE by default in every sentence-transformers entry point (embed_query, model_fingerprint, nest_build_corpus, convert_legacy); network is opt-in via NEST_ALLOW_DOWNLOAD. Stops an unexpected model fetch mid-run while handling data.
- resolve the HF snapshot the loaded revision points to (refs/main), not the alphabetical sorted()[0], so the model fingerprint attests the real model.
- EmbeddingCache is keyed by (chunk_id, model): re-embedding a corpus with a different model no longer reuses stale vectors under a new model_hash.

compliance / governance
- add CI (.github/workflows/ci.yml): build/test/clippy/fmt --locked + python tests + cargo-audit, so the gate is not laptop-only.
- track Cargo.lock (pin the rust dep set for reproducibility + CVE triage).
- add --version to the CLI; correct SECURITY.md (supported versions, unsafe scope, release provenance) and the README tier-1 citation wording.
- doc/data-governance.md: erasure/rotation posture for an immutable distributed .nest, provenance gaps, and encryption-at-rest requirement for sensitive data.
- dat/demo/README.md: per-source corpus license bill of materials.


han-hoff requested a review from brennercruvinel July 1, 2026 13:51

brennercruvinel added 3 commits July 1, 2026 14:24

cargo fmt: wrap two over-width lines to satisfy fmt --check gate

04d07e5

avx2: allow(unused_unsafe) on the register-only srai helper (clippy -D warnings gate)

9ecf82c

Merge branch 'main' into hardening/audit-remediation

d9b35c8

brennercruvinel approved these changes Jul 1, 2026


brennercruvinel merged commit 9fc2d00 into main Jul 1, 2026
7 checks passed

brennercruvinel deleted the hardening/audit-remediation branch July 1, 2026 17:53

han-hoff mentioned this pull request Jul 1, 2026

hardening: audit remediation — hostile-input bounds, model_hash gate, dependency advisories, CI gates #60

Merged

brennercruvinel merged 4 commits into main from hardening/audit-remediation
