nightshift: doc-drift — Documentation Drift Analysis
Repo: Microck/jarspect
Date: 2026-04-25
Docs analyzed: README.md, docs/benchmarking.md, docs/corpus-calibration.md, docs/false-positives.md
Source files checked: src/ (30 Rust files), Cargo.toml, scripts/
Summary
The README and supporting docs have drifted from the codebase in several measurable ways. The most significant drift is the capability detector count: the README says "8 detectors" in four places, but src/detectors/ now contains 11. Other findings include three undocumented detectors, a pipeline diagram that omits an opt-in branch, a hardcoded model name, a missing toolchain requirement, and undocumented scripts.
🔴 HIGH SEVERITY — Factual Errors
1. Detector Count: "8" → 11
Files affected: README.md (lines 51, 75, 95, 164)
The README repeatedly states "8 capability detectors" but the codebase has 11:
| Listed in README (8) | Actually in code (11) |
|---|---|
| DETC-01: Process execution (exec) | ✅ capability_exec.rs |
| DETC-02: Network I/O (network) | ✅ capability_network.rs |
| DETC-03: Dynamic class loading (dynamic_load) | ✅ capability_dynamic_load.rs |
| DETC-04: Filesystem/JAR modification (fs_modify) | ✅ capability_fs_modify.rs |
| DETC-05: Persistence (persistence) | ✅ capability_persistence.rs |
| DETC-06: Unsafe deserialization (deser) | ✅ capability_deser.rs |
| DETC-07: Native/JNI loading (native) | ✅ capability_native.rs |
| DETC-08: Credential theft (cred_theft) | ✅ capability_cred_theft.rs |
| (not listed) | ❌ Missing from docs: capability_base64_stager.rs (base64 stager detection) |
| (not listed) | ❌ Missing from docs: capability_discord_webhook.rs (Discord webhook exfiltration) |
| (not listed) | ❌ Missing from docs: capability_remote_code_load.rs (remote code loading) |
Evidence:
$ ls src/detectors/capability_*.rs | wc -l
11
Locations in README:
- Line 51:
+-- Capability detectors 8 detectors (exec, network, dynamic load, fs/jar modify,
- Line 75:
- **8 capability detectors** -- each uses an evidence index with class-scoped correlation gates
- Line 95:
Eight detectors run against an EvidenceIndex built from the extracted bytecode evidence.
- Line 164:
the full bytecode analysis runs: archive traversal, class parsing, YARA scanning, and 8 capability detectors.
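The count comparison above could be automated so the drift is caught in CI. The following is a minimal sketch, not part of jarspect: the function names are hypothetical, and it assumes the README states the count as a number (or the word "Eight") directly before "detectors" or "capability detectors", as in the four lines quoted above.

```rust
/// Count file names that look like capability detectors (capability_*.rs).
fn count_capability_detectors(file_names: &[&str]) -> usize {
    file_names
        .iter()
        .filter(|name| name.starts_with("capability_") && name.ends_with(".rs"))
        .count()
}

/// Parse a token such as "8", "**8" or "Eight" into a count.
/// Only the spellings seen in the README excerpts are handled (assumption).
fn parse_count(token: &str) -> Option<usize> {
    let cleaned = token
        .trim_matches(|c: char| !c.is_alphanumeric())
        .to_lowercase();
    match cleaned.as_str() {
        "eight" => Some(8),
        other => other.parse().ok(),
    }
}

/// Return (1-based line number, claimed count) for every README line that
/// states a detector count.
fn claimed_detector_counts(readme: &str) -> Vec<(usize, usize)> {
    let mut claims = Vec::new();
    for (idx, line) in readme.lines().enumerate() {
        let tokens: Vec<&str> = line.split_whitespace().collect();
        for pair in tokens.windows(2) {
            let next = pair[1]
                .trim_matches(|c: char| !c.is_alphanumeric())
                .to_lowercase();
            if next == "detectors" || next == "capability" {
                if let Some(n) = parse_count(pair[0]) {
                    claims.push((idx + 1, n));
                }
            }
        }
    }
    claims
}

/// Line numbers whose claimed count differs from the actual detector count.
fn stale_claim_lines(readme: &str, actual: usize) -> Vec<usize> {
    claimed_detector_counts(readme)
        .into_iter()
        .filter(|&(_, claimed)| claimed != actual)
        .map(|(line, _)| line)
        .collect()
}
```

Feeding it the directory listing of src/detectors/ and the README text would flag every line that still says 8 while the directory holds 11 detector files.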
2. Detector Table Missing 3 Entries
File: README.md lines 97-106
The capability detectors table (DETC-01 through DETC-08) is missing:
- DETC-09: Base64 stager: detects new String(Base64.getDecoder().decode(...)) patterns used by stagers to hide URLs/class names (relevant to fractureiser Stage 0 variants)
- DETC-10: Discord webhook: detects Discord webhook URLs (discord.com/api/webhooks/), which are a common exfiltration channel for game malware
- DETC-11: Remote code load: detects URLClassLoader + remote class loading patterns that go beyond the generic dynamic_load detector (more targeted at remote-JAR injection specifically)
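To make the DETC-10 description concrete for the README table, here is a rough illustration of the kind of string check it implies. This is a hypothetical sketch, not the logic in capability_discord_webhook.rs: the function name is invented, and it assumes the detector sees string constants extracted from class files.

```rust
/// URL fragment named in the report as the Discord webhook indicator.
const DISCORD_WEBHOOK_MARKER: &str = "discord.com/api/webhooks/";

/// Return the string constants that contain the webhook marker,
/// compared case-insensitively. Hypothetical helper for illustration only.
fn find_webhook_strings<'a>(constants: &[&'a str]) -> Vec<&'a str> {
    constants
        .iter()
        .copied()
        .filter(|s| s.to_ascii_lowercase().contains(DISCORD_WEBHOOK_MARKER))
        .collect()
}
```

A real detector would likely correlate such a hit with network evidence in the same class before raising its severity, which is what the README's "class-scoped correlation gates" wording suggests.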
3. false-positives.md File References Verified; Case Studies Undated
File: docs/false-positives.md
The doc cites two source files:
- "Prompt changes in src/verdict.rs"
- "Detector/prompt framing changes: Capability presence is only set by medium/high detector signals (low-only detector hits get routed to low_signal_indicators) (see src/profile.rs)"
Both references are still accurate: verdict.rs does contain the prompt, and profile.rs has low_signal_indicators at line 27 and the routing logic at lines 83-109. ✅
However, the OptiFine case study references patterns that may have evolved since the doc was written. The doc should include a date stamp for each case study.
🟡 MEDIUM SEVERITY — Inconsistencies
4. README Pipeline Diagram vs Actual scan.rs Logic
File: README.md lines 39-64
The README's pipeline diagram shows:
POST /upload (multipart .jar)
|
POST /scan (upload_id)
In scan.rs, the actual flow has more nuance:
- MalwareBazaarMatchMode has two variants: ShortCircuit (default) and ContinueStaticAnalysis (opt-in via env var)
- When ContinueStaticAnalysis is enabled, the MalwareBazaar match doesn't short-circuit; it continues through bytecode analysis for artifact collection
- The README mentions JARSPECT_MB_MATCH_CONTINUE_ANALYSIS=1 at line 42 but the diagram doesn't reflect this branch
Recommendation: Update the ASCII diagram to show the optional continue-analysis branch explicitly.
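The branch the diagram should show can be sketched as follows. The enum and variant names come from the report; everything else is an assumption for illustration, including the helper names, the rule that only the value "1" opts in, and the stage labels (borrowed from the README sentence quoted in finding 1). The actual code in scan.rs may differ.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MalwareBazaarMatchMode {
    /// Default: a hash match ends the scan early.
    ShortCircuit,
    /// Opt-in: keep running static analysis to collect artifacts.
    ContinueStaticAnalysis,
}

impl MalwareBazaarMatchMode {
    /// Map the raw value of JARSPECT_MB_MATCH_CONTINUE_ANALYSIS to a mode.
    /// Assumption: only "1" enables the opt-in branch.
    fn from_env_value(value: Option<&str>) -> Self {
        match value.map(str::trim) {
            Some("1") => Self::ContinueStaticAnalysis,
            _ => Self::ShortCircuit,
        }
    }
}

/// Stages that run after a MalwareBazaar hash match, per mode.
/// Stage names are illustrative labels, not identifiers from scan.rs.
fn stages_after_hash_match(mode: MalwareBazaarMatchMode) -> Vec<&'static str> {
    match mode {
        MalwareBazaarMatchMode::ShortCircuit => vec!["verdict"],
        MalwareBazaarMatchMode::ContinueStaticAnalysis => vec![
            "archive traversal",
            "class parsing",
            "YARA scanning",
            "capability detectors",
            "verdict",
        ],
    }
}
```

In the diagram, this amounts to one extra arrow from the hash-match node back into the bytecode analysis stages, labelled with the environment variable.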
5. README References "Azure OpenAI (gpt-4o)" — May Be Configurable
File: README.md line 67
AI-first -- Azure OpenAI (gpt-4o) analyzes the full capability profile
The AiConfig struct in verdict.rs suggests the model is configurable (it's loaded from environment config, not hardcoded). Hardcoding "gpt-4o" in the README may become stale if the model changes. Consider phrasing as "Azure OpenAI (configurable model)" or noting the specific model as a default.
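If the README wants to describe "gpt-4o" as a default rather than a fixed choice, the behaviour it would be documenting looks roughly like this. This is a sketch under assumptions: the report does not show the fields of AiConfig, so the function name, the idea that "gpt-4o" is the fallback, and the environment variable in the usage note are all hypothetical.

```rust
/// Fallback model name. Assumption: taken from the README wording,
/// not confirmed against verdict.rs.
const DEFAULT_MODEL: &str = "gpt-4o";

/// Pick the configured model name, falling back to the default when the
/// setting is missing or blank.
fn resolve_model(configured: Option<&str>) -> String {
    match configured.map(str::trim) {
        Some(name) if !name.is_empty() => name.to_string(),
        _ => DEFAULT_MODEL.to_string(),
    }
}
```

A caller could pass something like `std::env::var("JARSPECT_AI_MODEL").ok().as_deref()`, where the variable name is a placeholder, not one the report confirms.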
6. Cargo.toml Edition Claim
File: Cargo.toml line 4
Cargo.toml sets Rust edition 2024, which requires Rust 1.85 or newer. The README doesn't mention any toolchain requirement. This isn't strictly doc drift, but a "Building" section should state the minimum toolchain version.
🟢 LOW SEVERITY — Minor Gaps
7. Scripts Referenced in Docs vs Actual Scripts
File: docs/benchmarking.md
The benchmarking doc references:
scripts/modrinth-top-50-scan.sh — ✅ exists
scripts/scan-local-dir.sh — ✅ exists
scripts/select-malwarebazaar-dataset.ts — ✅ exists (referenced in corpus-calibration.md)
scripts/malwarebazaar-download.sh — ✅ exists
However, the actual scripts/ directory also contains:
scripts/aggregate-run.ts — not documented
scripts/render-benchmark-figures.ts — not documented
scripts/demo_run.sh — not documented
scripts/fetch-malwarebazaar-minecraft.sh — not documented
Recommendation: Add a "Scripts Reference" section to docs/benchmarking.md listing all scripts with one-line descriptions.
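The documented-versus-actual comparison above is a set difference, which could be scripted to keep the new section honest. A minimal sketch, with a hypothetical function name and the two lists supplied by the caller:

```rust
use std::collections::BTreeSet;

/// Return the scripts present on disk that no doc mentions, sorted by path.
fn undocumented_scripts<'a>(on_disk: &[&'a str], documented: &[&str]) -> Vec<&'a str> {
    let documented: BTreeSet<&str> = documented.iter().copied().collect();
    let mut missing: Vec<&'a str> = on_disk
        .iter()
        .copied()
        .filter(|path| !documented.contains(path))
        .collect();
    missing.sort();
    missing
}
```

With the eight scripts listed in this finding as `on_disk` and the four referenced ones as `documented`, the result is the four undocumented paths.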
8. corpus-calibration.md Dated March 2026
File: docs/corpus-calibration.md line 3
Date: 2026-03-05 (updated from 2026-03-03 initial calibration)
This is now ~7 weeks old. If the corpus has been expanded or detector sensitivity has changed (which the addition of 3 new detectors suggests it has), the calibration report should be re-run and updated.
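The "~7 weeks" figure follows from the two dates: 2026-03-05 to 2026-04-25 is 51 days, or 7 weeks and 2 days. A small sketch of that arithmetic, using a standard civil-date-to-day-number conversion; the function names and the idea of a staleness threshold are illustrative assumptions, not something the repo defines.

```rust
/// Days since 1970-01-01 for a proleptic Gregorian date.
fn days_from_civil(year: i64, month: i64, day: i64) -> i64 {
    // Treat January and February as months 11 and 12 of the previous year
    // so the leap day falls at the end of the shifted year.
    let y = if month <= 2 { year - 1 } else { year };
    let era = y.div_euclid(400);
    let year_of_era = y.rem_euclid(400);
    let shifted_month = (month + 9) % 12; // March = 0, ..., February = 11
    let day_of_year = (153 * shifted_month + 2) / 5 + day - 1;
    let day_of_era = year_of_era * 365 + year_of_era / 4 - year_of_era / 100 + day_of_year;
    era * 146_097 + day_of_era - 719_468
}

/// Whole days elapsed between two (year, month, day) dates.
fn age_in_days(from: (i64, i64, i64), to: (i64, i64, i64)) -> i64 {
    days_from_civil(to.0, to.1, to.2) - days_from_civil(from.0, from.1, from.2)
}

/// True when a report is older than the allowed age (threshold is hypothetical).
fn is_stale(age_days: i64, max_age_days: i64) -> bool {
    age_days > max_age_days
}
```

Whether 51 days counts as stale depends on the threshold the project picks; the addition of three detectors is the stronger reason to re-run the calibration.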
9. README Missing "Building/Running" Section
The README has detailed architecture documentation but lacks a practical "Getting Started" section with:
- Rust toolchain version requirement (2024 edition needs Rust 1.85+)
- Required environment variables (.env setup)
- How to configure YARA rulepacks (JARSPECT_RULEPACKS)
- How to configure the AI verdict layer
Action Items
| Priority | Issue | Effort |
|---|---|---|
| 🔴 High | Update detector count from 8 → 11 in README (4 locations) | 10 min |
| 🔴 High | Add DETC-09, DETC-10, DETC-11 to capability detectors table | 15 min |
| 🟡 Medium | Update pipeline diagram to show ContinueStaticAnalysis branch | 10 min |
| 🟡 Medium | Generalize "gpt-4o" reference to "configurable AI model" | 5 min |
| 🟡 Medium | Add "Getting Started" / "Building" section to README | 30 min |
| 🟢 Low | Document all scripts in scripts/ directory | 15 min |
| 🟢 Low | Re-run corpus calibration and update date | 1-2 hours |
Generated by nightshift — doc-drift analysis