60-second version (for reviewers). An autonomous Windows-memory DFIR agent that is forensically defensible:
- It cannot alter evidence: destructive actions don't exist in its tool server (siftpp-spoliation-test: 14/14 attacks refused, evidence hash unchanged).
- It catches its own hallucinations: a Skeptic agent independently reruns tools to refute each finding (confirmed/inferred/refuted) and sends weak ones back for re-investigation. (An independent re-run even refuted its own "DKOM rootkit" confirmation, correctly flagging it as a Volatility symbol artifact; see docs/examples/srl-2018-linux/.)
- It keeps a tamper-evident chain of custody: a hash-chained audit log in which editing any one record is detected (siftpp-tamper-test).

Windows run on a SANS APT memory image: 4 confirmed of 10 findings, 2 self-corrections, evidence integrity verified, 302-record audit (hash chain OK). An independent Linux reproduction refuted one of those confirmed findings (DKOM/rootkit) as a tool artifact, leaving 3 retained confirmed findings: precision 1.00, recall 0.75, F1 0.86 under the manual-review proxy.

Second independent public Windows memory case (DigitalCorpora M57 / Pat): 4 confirmed of 9 findings, 2 self-corrections, 265-record audit (hash chain OK). The confirmed findings match the documented "Advanced Keylogger" (precision/recall 1.00 against a public answer key).

The 5-minute demo video is a live Linux-terminal screencast. The investigation recorded on camera is committed in full at docs/examples/srl-2018-live/ (4 confirmed of 8 findings, 1 self-correction iteration, 230-record chain). Replay it: uv run siftpp-trace docs/examples/srl-2018-live/audit.jsonl --replay

See the architecture diagram and the real report + logs. See it with no API key: uv run siftpp-demo · attack it: uv run siftpp-spoliation-test
Most autonomous-IR agents compete on speed or breadth. Few can prove they never touched the evidence, fewer still catch their own hallucinations, and almost none publish their accuracy, including their misses. Protocol SIFT++ does all three.
A self-verifying, autonomous DFIR analyst for SANS FIND EVIL! 2026, built around read-only forensic tools, adversarial verification, and tamper-evident audit logs.
Protocol SIFT++ builds on the Protocol SIFT idea and adds the missing accuracy loop: an Investigator proposes findings, a Skeptic independently tries to refute them, and weak findings are sent back for automatic reinvestigation.
What makes it different from a typical "AI finds evil" agent: every safety claim
is provable by attacking it. The agent is architecturally incapable of altering
evidence (14/14 destructive attempts refused, evidence SHA-256 unchanged), its
audit log is tamper-evident (edit one record and verify_chain fails), and every
finding cites the exact tool command plus its output hash. Forensic defensibility,
not a prompt that says "be careful."
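Why editing one record breaks verification can be shown with a minimal hash chain. This is a sketch under stated assumptions, not the project's protocol_siftpp.audit code: the record fields (prev, event, hash), the canonical-JSON hashing, and the helper names append_record and verify_chain_sketch are hypothetical. Each record stores the SHA-256 of its predecessor, so altering any record invalidates its own hash and every link after it. The (ok, count) return shape mirrors the (True, 302) output of verify_chain shown later.

```python
import hashlib
import json
from pathlib import Path

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first record


def _digest(body: dict) -> str:
    # Canonical JSON (sorted keys, no extra whitespace) so the hash is stable.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_record(path: Path, event: dict) -> None:
    """Append one event to a JSONL log, linking it to the previous record's hash."""
    prev = GENESIS
    if path.exists():
        lines = path.read_text(encoding="utf-8").splitlines()
        if lines:
            prev = json.loads(lines[-1])["hash"]
    body = {"prev": prev, "event": event}
    record = {**body, "hash": _digest(body)}
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")


def verify_chain_sketch(path: Path) -> tuple[bool, int]:
    """Return (ok, records_verified); stops at the first broken link."""
    prev = GENESIS
    count = 0
    for line in path.read_text(encoding="utf-8").splitlines():
        record = json.loads(line)
        body = {"prev": record["prev"], "event": record["event"]}
        # A record is valid only if it points at the prior hash and its own hash recomputes.
        if record["prev"] != prev or record["hash"] != _digest(body):
            return False, count
        prev = record["hash"]
        count += 1
    return True, count
```

Because each hash covers the previous one, an attacker who edits a record would have to rewrite every later record too, and the final hash would no longer match any copy held elsewhere.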
| FIND EVIL! criterion | What Protocol SIFT++ does | Verify it yourself |
|---|---|---|
| Autonomous + real-time self-correction (tiebreaker) | Investigator/Skeptic loop, no human in the loop; SANS and M57 runs each forced 2 corrections, and an independent re-run refuted its own confirmed "DKOM rootkit" as a tool artifact | uv run siftpp-trace docs/examples/srl-2018-live/audit.jsonl --replay (the on-camera run); docs/examples/srl-2018-linux/report.md -> Refuted; uv run siftpp-demo (no key) |
| IR accuracy / catches its own hallucinations | Skeptic re-runs tools to refute each finding -> confirmed/inferred/refuted; cross-run correction removed DKOM FP; SANS manual-review proxy F1 0.86; public answer key (M57): confirmed precision/recall/F1 = 1.00 | docs/ACCURACY_REPORT.md |
| Depth > breadth | Primary SANS APT case, every claim verified, reproduced on Windows and Linux with byte-identical evidence; also reproduced on a second independent public Windows memory case | docs/examples/srl-2018-base-file-memory/ + docs/examples/srl-2018-linux/ + docs/examples/m57-pat-2009-12-05/ |
| Architectural (not prompt) guardrails | Read-only MCP server: no shell, no dump/write/network tool exists - spoliation is impossible by construction | uv run siftpp-spoliation-test -> 14/14 refused, evidence unchanged |
| Audit trail to specific tool executions | Hash-chained append-only log (tamper-evident) + every finding cites command + output SHA-256 | uv run siftpp-tamper-test -> edit detected; verify_chain(...) -> (True, 302) |
| Usability / docs | One command, no API key, runs on Windows + Linux/SIFT; full docs | uv run siftpp-demo |
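The "command + output SHA-256" citation in the audit-trail row can be pictured as a small record that binds a tool invocation to a hash of exactly what it printed. A minimal sketch, assuming a simple Citation shape; the class and the cite / citation_matches helpers are hypothetical, not the project's report.json schema.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Hypothetical evidence citation attached to one finding."""
    command: str        # exact tool invocation, e.g. "windows.pslist"
    output_sha256: str  # SHA-256 of the raw tool output captured at run time


def cite(command: str, output: bytes) -> Citation:
    # Bind the command to a hash of the bytes it produced.
    return Citation(command, hashlib.sha256(output).hexdigest())


def citation_matches(citation: Citation, output: bytes) -> bool:
    # Anyone holding the saved output can later confirm it was not altered.
    return hashlib.sha256(output).hexdigest() == citation.output_sha256
```

A reviewer who re-runs the cited command on the same image, or opens the logged output, can recompute the hash instead of trusting the finding's prose.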
What most autonomous-IR agents lack — and SIFT++ has: an adversarial Skeptic that challenges every finding, a cryptographic chain of custody (hash-chained audit + per-finding evidence hashes), and safety that's provable by attacking it. A broad agent that can't verify itself just produces more unverified claims, faster.
- Depth over breadth (criterion #3). One primary competition case, fully verified and reproduced, then one second independent public Windows memory case to prove the loop is not case-specific. The loop is tool-, model-, and OS-agnostic (Volatility 3 today; Windows + Linux/SIFT; DeepSeek or Anthropic), so breadth is configuration, not a redesign.
- Platform & framework (per the rules). Built end to end with Claude Code (under OpenClaw); the runtime is MCP (Claude Code's native tool protocol) plus an Anthropic-SDK agent loop, i.e. the "comparable agentic architecture" the rules permit. Runs on the SANS SIFT Workstation: the Linux path is verified on Ubuntu 22.04 (SIFT's base OS), and the demo screencast is recorded in that Linux terminal. SIFT quick start: docs/TRY_IT_OUT.md.
- Verification > volume. We chose one primary case, with every claim adversarially checked and reproduced across operating systems, because the rubric explicitly rewards depth over breadth.
- Terminal- and artifact-native, not a dashboard. Output is structured (report.json) and tamper-evident (audit.jsonl), so it feeds downstream tooling and stays court-defensible: the forensic idiom, not a demo UI.
- Honest accuracy. No public answer key exists for this SANS sample, so instead of a self-graded score we use adversarial verification, cross-run reproduction, and disclosed misses.
Selected SANS sample:
SRL-2018 Compromised Enterprise Network / base-file-memory.7z
Final DeepSeek run on the extracted memory image:
4 confirmed of 10 findings; 2 self-correction iteration(s); evidence integrity verified.
audit log: 302 records, hash chain OK
Cross-platform correction: an independent Linux re-run refuted the Windows
DKOM/rootkit confirmation as a Volatility symbol/KDBG artifact. The corrected
confirmed set used for the accuracy table is therefore 3 true positives, 0 false
positives, and 1 false negative: precision 1.00, recall 0.75, F1 0.86.
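These scores follow from the counts via the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 as their harmonic mean. A quick check of the arithmetic:

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall and F1 from confusion counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


p, r, f = prf1(tp=3, fp=0, fn=1)
print(f"precision {p:.2f}, recall {r:.2f}, F1 {f:.2f}")
# prints: precision 1.00, recall 0.75, F1 0.86
```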
The key corrected finding involved ngentask.exe: the Investigator initially
overstated the malware attribution, the Skeptic downgraded it twice, and the
system converged on a narrower confirmed behavioral claim tied to psscan and
netscan evidence.
Second independent public memory case:
DigitalCorpora M57 Pat / pat-2009-12-05.winddramimage
4 confirmed of 9 findings; 2 self-correction iteration(s); evidence integrity verified.
audit log: 265 records, hash chain OK
The M57 run shows the same correction behavior on a separate XP memory image:
the Skeptic downgraded over-strong "exfiltration" and "persistence" claims, and
the system converged on narrower confirmed facts such as ToolKeyloggerDLL.dll
loaded into explorer.exe, ToolKeylogger.exe as an explorer.exe child, and
matching pslist/psscan process sets.
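Comparing pslist (which walks the kernel's active-process list) with psscan (which carves process structures from memory by pool scanning) is a standard cross-view check. A PID that psscan finds but pslist does not may belong to an exited process or to one unlinked via DKOM, so a mismatch is a lead to verify, not proof of a rootkit. A minimal sketch, assuming each plugin's output has already been parsed into (PID, image name) pairs; cross_view is a hypothetical helper and the sample PIDs are made up.

```python
from typing import Dict, Set, Tuple

Proc = Tuple[int, str]  # (PID, image name); a simplification of full plugin rows


def cross_view(pslist: Set[Proc], psscan: Set[Proc]) -> Dict[str, Set[Proc]]:
    """Cross-view diff of two process listings taken from the same memory image."""
    return {
        # Carved by psscan but absent from the active list: exited, or unlinked (e.g. DKOM).
        "psscan_only": psscan - pslist,
        # On the active list but not carved by psscan: unusual, worth re-checking.
        "pslist_only": pslist - psscan,
    }


# Made-up PIDs for illustration only.
active = {(4, "System"), (1234, "explorer.exe")}
carved = active | {(2468, "unlinked.exe")}
print(cross_view(active, carved)["psscan_only"])
# prints {(2468, 'unlinked.exe')}
```

Two empty sets correspond to "matching pslist/psscan process sets" as reported for the M57 image.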
AI-assisted attackers can move quickly, but autonomous responders can also hallucinate. Protocol SIFT++ targets both of FIND EVIL!'s top scoring areas:
- Autonomous execution with real-time self-correction.
- IR accuracy and hallucination catching.
The project is intentionally narrow: one primary SANS Windows memory case plus one public independent Windows memory case, a curated Volatility 3 toolset, strong evidence citations, and a visible correction loop.
flowchart LR
E["Evidence image"] --> G["EvidenceGuard: sha256, size, mtime"]
G --> M["Read-only MCP server"]
M --> V["Volatility 3 allowlist"]
V --> T["Tool output"]
O["Orchestrator"] --> I["Investigator"]
O --> S["Skeptic"]
I --> M
S --> M
I --> F["Draft findings"]
F --> S
S --> R["confirmed / inferred / refuted"]
R -->|weak or refuted| I
O --> A["Audit JSONL hash chain"]
O --> P["report.md + report.json"]
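The Orchestrator/Investigator/Skeptic cycle in the diagram reduces to a bounded loop. A minimal sketch, assuming the agents can be modeled as plain callables and that any finding not confirmed counts as "weak"; run_loop, review and revise are hypothetical names, and max_iterations mirrors the --max-iterations flag used for the real run.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

CONFIRMED, INFERRED, REFUTED = "confirmed", "inferred", "refuted"


@dataclass
class Finding:
    claim: str
    verdict: str  # CONFIRMED, INFERRED or REFUTED, as ruled by the Skeptic


def run_loop(
    drafts: List[str],
    review: Callable[[str], str],       # hypothetical Skeptic: re-run tools, return a verdict
    revise: Callable[[str, str], str],  # hypothetical Investigator: narrow a weak claim
    max_iterations: int = 3,
) -> Tuple[List[Finding], int]:
    """Return the final findings and the number of self-correction rounds used."""
    findings = [Finding(c, review(c)) for c in drafts]
    rounds = 0
    while rounds < max_iterations:
        # Assumption: anything not confirmed is weak and goes back for reinvestigation.
        weak = [f for f in findings if f.verdict != CONFIRMED]
        if not weak:
            break
        rounds += 1
        kept = [f for f in findings if f.verdict == CONFIRMED]
        for f in weak:
            narrower = revise(f.claim, f.verdict)
            kept.append(Finding(narrower, review(narrower)))
        findings = kept
    return findings, rounds
```

Findings still inferred or refuted when the budget runs out are returned with their verdicts rather than dropped, which is how a report can show, for example, 4 confirmed of 10 findings.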
The agents never receive a generic shell. The MCP server exposes only curated read-only Volatility tools and checks evidence integrity around every tool call.
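The allowlist and the integrity check around each call can be sketched together. This assumes the fingerprint named in the diagram (SHA-256, size, mtime) and models tools as a dictionary of callables; snapshot, guarded_call and the tool names are hypothetical stand-ins for the real EvidenceGuard and MCP tool wrappers.

```python
import hashlib
import os
from typing import Callable, Dict, Tuple

Snapshot = Tuple[str, int, int]  # (sha256 hex, size in bytes, mtime in ns)


def snapshot(path: str) -> Snapshot:
    """Fingerprint the evidence file: streamed content hash plus size and mtime."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    st = os.stat(path)
    return h.hexdigest(), st.st_size, st.st_mtime_ns


def guarded_call(tool: str, allowlist: Dict[str, Callable[[str], str]], evidence: str) -> str:
    """Run only an allowlisted tool, and fail loudly if the evidence changed."""
    if tool not in allowlist:
        # Tools that are not exposed simply do not exist for the agent.
        raise PermissionError(f"tool not exposed: {tool}")
    before = snapshot(evidence)
    output = allowlist[tool](evidence)
    if snapshot(evidence) != before:
        raise RuntimeError("evidence integrity check failed")
    return output
```

Re-hashing the whole image around every call is the simplest correct version but is slow for multi-gigabyte images; a real guard may schedule or cache this differently.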
Install dependencies with uv, then run the deterministic local demo:
uv run siftpp-demo

Prove the forensic guardrails by attacking them (no key needed):
uv run siftpp-spoliation-test
uv run siftpp-tamper-test

Replay the investigation shown in the demo video, straight from its tamper-evident audit log (no key needed):
uv run siftpp-trace docs/examples/srl-2018-live/audit.jsonl --replay

Download the selected SANS case:
uv run siftpp-download-case

Run the real investigation with DeepSeek (PowerShell syntax; the trailing backticks are line continuations):
uv run siftpp-investigate `
--provider deepseek `
--evidence evidence\srl-2018-base-file-memory\extracted\base-file-memory.img `
--out analysis\srl-2018-base-file-memory `
--case-id srl-2018-base-file-memory `
--offline `
  --max-iterations 3

Set DEEPSEEK_API_KEY in the environment or in a local .env file that is excluded from version control. Do not commit API keys.
The real run writes:
- analysis/srl-2018-base-file-memory/report.md
- analysis/srl-2018-base-file-memory/report.json
- analysis/srl-2018-base-file-memory/audit.jsonl
- analysis/srl-2018-base-file-memory/mcp-server.jsonl
Verify the audit chain:
uv run python -c `
  "from protocol_siftpp.audit import verify_chain; print(verify_chain('analysis/srl-2018-base-file-memory/audit.jsonl'))"

Expected:
(True, 302)
- Try-it-out instructions
- Architecture and security boundary
- Dataset documentation
- Accuracy and integrity report
- 5-minute demo script
- Agent execution log summary
- Devpost story draft
- Submission checklist
uv run pytest
uv run ruff check .

License: MIT.
