M5 epic: multi-chain deploy + packaging + docs reconciliation by brunota20 · Pull Request #21 · nullislabs/shepherd

brunota20 · 2026-06-25T17:51:46Z

M5 epic — multi-chain deploy, packaging, and docs reconciliation

Builds on #20 (M4 epic). M5 closes out the grant: packages the M4 daemon for operators (Docker + ghcr CI), adds the pre-soak backtest harness + baseline-latency tooling, lands the small protocol-side hardening items surfaced during M4 soak runs, and reconciles the docs across M2-M5 so the on-disk story matches the shipped behaviour.

Core deliverable

| Area | What landed |
| --- | --- |
| Docker + compose + ghcr CI | `Dockerfile` (multi-stage rust build), `docker-compose.yml` for the daemon + scrape stack, GHCR push on tag (PR #61 in the fork). |
| Pre-soak backtest harness | `shepherd-backtest` crate + `tools/backtest-collect/` replay a 7-day Sepolia EthFlow window before soak runs, giving us an offline regression bar for module behaviour (COW-1078). |
| Baseline-latency tool | `tools/baseline-latency/` measures TWAP-relayer PUT and EthFlow indexer ingest latencies across 5 chains so soak reports can attribute regressions to the right lane (COW-1084). |
| Chain-forward revert data | `engine_config.rs` + `cow_api.rs` forward `eth_call` `ErrorResp.data` into `HostError.data` so module code can decode `IConditionalOrder` reverts (`PollTryAtEpoch`, `PollNever`) the same way it decodes orderbook errors (COW-1082). |
| Backoff retry cap | `ethflow-watcher` caps `backoff:{uid}` retries at `MAX_BACKOFF_RETRIES` so a permanently-failing orderbook submission stops eating fuel (COW-1083). |
| TWAP skip submit_order on submitted UID | `twap-monitor` consults `submitted:{uid}` before re-submitting, preventing duplicate orderbook posts on supervisor restart (COW-1085). |
| Event-loop log block-stream gap closures | `runtime/event_loop.rs` logs the WS-reconnect gap (block range covered by `eth_getLogs` re-indexing) at `Info`, giving operators a visible signal during reconnects (COW-1086). |
| Env-var substitution + fail-fast HTTP rpc_url + RPC key redaction | `engine_config.rs` resolves `${VAR}` placeholders in `engine.toml`; HTTP `rpc_url` configs fail-fast at boot (only WSS is accepted for chain subscriptions); API keys in RPC URLs are redacted from boot logs (multiple commits). |
| Rust-idiomatic compliance pass | `chore/rust-idiomatic-pass` sweep applied to the M4 + M5 surface area (PR #67 in the fork). |
| Docs reconciliation across M2-M5 | `docs/` updated end-to-end: ADR statuses, deployment guide, e2e runbook, operations guides re-flowed to match the shipped behaviour (PR #68 in the fork). |
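
A minimal sketch of the `${VAR}` substitution step described in the table, not the shipped `engine_config.rs` code. The names `substitute_env` and `SubstError` are hypothetical, and the variable lookup is injected as a closure so the behaviour can be exercised without touching the process environment.

```rust
/// Hypothetical error type for this sketch. The PR mentions an `EnvVarError`
/// enum but does not show its variants.
#[derive(Debug, PartialEq)]
pub enum SubstError {
    /// A `${` was opened but never closed with `}`.
    Unterminated,
    /// The named variable has no value.
    Missing(String),
}

/// Replace every `${NAME}` placeholder in `input` with `lookup(NAME)`.
/// Fails fast on the first missing variable or unterminated placeholder.
pub fn substitute_env<F>(input: &str, lookup: F) -> Result<String, SubstError>
where
    F: Fn(&str) -> Option<String>,
{
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(start) = rest.find("${") {
        // Copy the literal text before the placeholder.
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        let end = after.find('}').ok_or(SubstError::Unterminated)?;
        let name = &after[..end];
        let value = lookup(name).ok_or_else(|| SubstError::Missing(name.to_string()))?;
        out.push_str(&value);
        // Continue scanning after the closing brace.
        rest = &after[end + 1..];
    }
    out.push_str(rest);
    Ok(out)
}
```

In a real binary the closure would plausibly be `|name| std::env::var(name).ok()`; injecting it keeps the parsing logic testable in isolation.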

Validation

cargo fmt --all -- --check clean.
cargo clippy --workspace --all-targets -- -D warnings clean.
cargo test --workspace --all-features — full suite green; backtest harness has its own integration tests; baseline-latency tool covered by unit tests.
Docker image builds cleanly via docker build .; compose stack boots end-to-end against a local Anvil + RPC + orderbook-mock.
Soak validation: 7-day Sepolia run against the M5 image with metrics scraped to Prometheus; report under docs/operations/.

Note on diff scope

Builds on M2 (#17) + M3 (#18) + M4 (#20). Each upstream PR is independent against nullislabs:main so you can merge in any order — the natural review/merge order is M2 → M3 → M4 → M5, but the dependency is logical (build-on-top) rather than git-mechanical.

To focus the M5 review, the M5-specific paths are:

Dockerfile, docker-compose.yml, .github/workflows/release.yml
crates/shepherd-backtest/
tools/backtest-collect/, tools/baseline-latency/
crates/nexum-engine/src/engine_config.rs (env-var substitution, fail-fast HTTP, key redaction)
crates/nexum-engine/src/host/impls/cow_api.rs + crates/shepherd-sdk/src/cow/error.rs (forward eth_call ErrorResp.data into HostError.data)
modules/ethflow-watcher/src/strategy.rs (backoff cap)
modules/twap-monitor/src/strategy.rs (skip-submitted-uid)
crates/nexum-engine/src/runtime/event_loop.rs (WS reconnect gap log)
docs/ reconciliation sweep across the M2-M5 surface

Closes COW-1078, COW-1082, COW-1083, COW-1084, COW-1085, COW-1086.

Linear milestone: M5 - multi-chain deploy + docs. Companions: #17 (M2), #18 (M3), #20 (M4).

brunota20 · 2026-06-25T19:20:11Z

Heads-up: `bleu:dev/m5-base` (the head of this PR) was force-pushed today as part of a linearisation pass on our M2->M5 base stack. Old head was `e553feb3`; new head is `3d020f8`.

The branch is now a strict descendant of the rebased `dev/m4-base` (head of upstream PR #20). The pre-rebase M5 branch was assembled via cherry-picks of M4 commits onto an older baseline, which left it as a sibling rather than a descendant of M4; the linearised stack removes that duplication and stacks M5-genuine commits directly on top of M4's last commit.

Notable mechanics:

9 of the 24 commits previously on M5 were patch-id duplicates of commits already on the new M4 (COW-1076, COW-1077, hex-via-alloy, the COW-1079 load-test series, COW-1080). Git correctly dropped them during the rebase (15 commits remain on `dev/m4-base..dev/m5-base`).
COW-1078 (backtest replay harness) needed `#[cfg(target_arch = "wasm32")]` gating on the ethflow-watcher module's wit-bindgen glue so the strategy can be reused from native. With the M3 macro abstraction now on the M4 base, the gate moved to wrap the `bind_host_via_wit_bindgen!()` invocation rather than each hand-written impl — same effect, cleaner.
COW-1082 (forward eth_call `ErrorResp.data` into `HostError.data`): the `ProviderError::Rpc` variant now carries `source` + `code` + `data` together; the M5 compliance pass refined the `#[error(...)]` format string accordingly.

Diff against the prior M5 head is significant on the modules (lib.rs files move to macro-based glue) but every per-commit intent is preserved. No content lost. Per-commit history preserved. Author identities preserved.

brunota20 · 2026-06-25T21:19:03Z

Fix-pass on the linearised stack: rebased dev/m5-base onto the new dev/m4-base and added one more compile-fix commit. Now green across all 4 gates.

New tip: 537b56e (was 3d020f8)
Rebase: 15 commits replayed onto 2eed4fe with zero conflicts (the chainlink + app_data + wit_bindgen fmt fixes propagate from the M4 fix commit).
Added commit fix(ethflow-watcher): drop bogus wildcard arm from observe_placement:
- modules/ethflow-watcher/src/strategy.rs:205 had E0425 cannot find value 'err' in this scope. Root cause: the M5 COW-1082 conflict resolution accidentally pasted a _ => wildcard arm (with RetryAction-shaped body) into a match host.cow_api_request(...) -> Result<String, HostError>. Ok(_) + Err(err) if err.code == 404 + Err(err) already cover the match exhaustively; dropped the spurious arm.
4 gates verified green at the new tip on a fresh detached worktree: 90/90 nexum-engine tests, plus modules + sdk + sdk-test + backtest + load-gen + orderbook-mock all pass.
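
A sketch of why the three remaining arms are already exhaustive. `HostError` here is a cut-down stand-in, and `Observation` plus the meaning assigned to each arm are hypothetical; only the arm shapes come from the fix above.

```rust
/// Cut-down stand-in for the SDK's `HostError` (assumed fields).
#[derive(Debug)]
pub struct HostError {
    pub code: i32,
    pub message: String,
}

/// Hypothetical outcome type, used only to give each arm a value.
#[derive(Debug, PartialEq)]
pub enum Observation {
    Known,
    NotFound,
    Failed(String),
}

pub fn observe(result: Result<String, HostError>) -> Observation {
    match result {
        Ok(_) => Observation::Known,
        // Guarded arm: only matches when the code is 404.
        Err(err) if err.code == 404 => Observation::NotFound,
        // Unguarded `Err` arm: together with `Ok(_)` this covers every
        // value of `Result`, so no trailing `_ =>` arm is needed. A
        // wildcard arm binds nothing, so a body that mentions `err`
        // there fails with E0425.
        Err(err) => Observation::Failed(err.message),
    }
}
```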

Balance-tracker compliance audit: the M3 BLEU-851 port (a97e6d8) absorbed the M4 compliance edits for modules/examples/balance-tracker/src/lib.rs; neither the M4 nor the M5 compliance squashes touched the file again (git show 097fe3c --stat and git show 20e5df6 --stat are empty for that path). No deltas were lost.

brunota20 · 2026-06-25T22:41:43Z

Audit-driven fix pass landed on dev/m5-base.

Before: 537b56e
After: e18dd3e

Fixes applied (2 commits, on top of the M2+M3+M4 rebase chain):

chore(nexum-engine): derive strum::IntoStaticStr on EnvVarError + FilterError (the two enums introduced on this milestone). Completes the workspace-wide IntoStaticStr pass flagged in audit Major #1.
chore(engine.*.toml): replace em-dashes with ASCII hyphens (engine.example.toml + engine.docker.toml). Rubric Major #6.

Audit reference: bruno-brain/wiki/projects/shepherd-audits/milestone-rubric-grant-audit-2026-06-25.md
Gates green: fmt, clippy -D warnings, cargo test --workspace --all-features, RUSTDOCFLAGS=-D warnings cargo doc.

Adds a `[workspace.dependencies]` table to the root manifest consolidating every dep used by 2+ crates across the full nullis-shepherd stack (anyhow, thiserror, tokio, futures, serde, serde_json, tracing, tracing-subscriber, strum, alloy-*, cowprotocol, reqwest, wit-bindgen, clap). Per-crate manifests inherit with `dep.workspace = true`, and may add features per call site via `dep = { workspace = true, features = ["extra"] }`. Single-consumer deps (wasmtime, toml, redb, getrandom, url, hex, axum, rand, ...) stay per-crate. Adds `[workspace.lints]` with light-touch defaults: `dbg_macro` and `todo` denied via clippy, `unsafe_op_in_unsafe_fn` warned via rust. `unsafe_code = deny` cannot be applied workspace-wide because every wit-bindgen guest module emits an `unsafe extern "C"` shim. Also pre-declares `auto_impl` and `derive_more` in the workspace deps table so future `Arc<dyn Trait>` boundaries and newtype-heavy crates can opt in without touching the root manifest. The version-drift failure mode (cowprotocol pinned to `1.0.0-alpha` in nexum-engine but `1.0.0-alpha.3` in shepherd-sdk, flagged in the 2026-06-25 audit) is now impossible by construction: every consumer inherits the single workspace pin. Audit reference: milestone-rubric-grant-audit-2026-06-25.md, judgment calls 1 + 3.

Replaces the `std::env::args().skip(1)` walker with a `#[derive(clap::Parser)]` struct so the engine binary picks up `--help`, `--version`, proper argument validation, and structured error reporting for free. The positional surface is preserved one-for-one (`<wasm-path> [manifest-path]`); behaviour for callers that already pass two paths is identical. Help output now documents each argument inline rather than hiding the usage in an anyhow message that only fires on misuse. `clap.workspace = true` consumes the workspace dep added in the prior commit; no new direct version pin in this crate. Audit reference: milestone-rubric-grant-audit-2026-06-25.md, judgment call 2.

…irection A casual reader of `07-rpc-namespace-design.md` hitting the file top or the "Method Allowlisting" subsection could plausibly walk away believing the 0.2 runtime gates RPC methods on a read-only allowlist and intercepts signing methods to delegate them to the identity backend. The shipped host implementation does neither: `chain::request` forwards any method string through to the configured alloy provider. Adds an explicit `Status: Future direction (0.3+ target)` callout both at the file top and right above the "Method Allowlisting" subsection so the gap between design intent and shipped behaviour is visible without having to scroll the design narrative end-to-end. Audit reference: milestone-rubric-grant-audit-2026-06-25.md, judgment call 4.

Adds the dependencies the 0.2 host backends need:
- cowprotocol (1.0.0-alpha) for the cow-api submission path (OrderBookApi, OrderCreation, OrderUid, Chain).
- alloy-provider / -rpc-client / -transport-ws / -primitives (1.5) for the chain JSON-RPC dispatch. The reqwest feature on alloy-provider engages connect_http; the pubsub/ws features back eth_subscribe-class methods.
- redb (2) for local-store. Same crate cowprotocol's own watch-tower picked, so the dep tree does not bifurcate when both are used in the same workspace.
- reqwest (0.12, rustls-tls): direct, so the import survives any future cowprotocol feature rearrangement.
- tracing + tracing-subscriber (env-filter + fmt): replaces the 0.1 eprintln! debug log so the engine can drop into a structured log pipeline without re-instrumenting every host call.
- thiserror (2): typed error enums in each backend.
- tempfile + wiremock as dev-deps for the host backend tests.
Adds engine.example.toml documenting the [engine] state_dir + per-chain RPC URLs the chain backend reads at boot; data/ is now ignored so a local run does not leave the redb file in tree.

Replaces the 0.2 Unsupported stubs with working backends. Each capability lives in its own host submodule so the trait impls in main.rs stay thin (dispatch + project the backend's typed error onto HostError).
cow_api::submit_order
- Parses the guest's bytes as JSON cowprotocol::OrderCreation.
- Dispatches via cowprotocol::OrderBookApi::post_order.
- Returns the assigned OrderUid as a 0x-prefixed hex string.
cow_api::request
- REST passthrough. The base URL is whichever URL the pool's OrderBookApi client carries, so OrderBookApi::new_with_base_url overrides (staging, wiremock) flow through transparently.
- Method/path validated host-side; orderbook 4xx/5xx bodies are surfaced verbatim so the guest can decode {errorType,description}.
chain::request
- Raw JSON-RPC dispatch over an alloy DynProvider opened from engine.toml at boot. WebSocket URLs engage pubsub (eth_subscribe); HTTP URLs use the HTTP transport. Params are passed as serde_json::RawValue so alloy does not re-encode.
- request-batch falls back to per-call dispatch (same shape as the earlier stub but now backed by real RPC).
local_store
- redb file under engine_config.engine.state_dir.
- Single shared table. Per-module namespacing is enforced host-side via [len:u8][module_name][raw_key] prefix on every key. list_keys strips the prefix before returning to the guest.
logging
- Routes through tracing::event! tagged with module=<namespace>.
- Engine boot installs an EnvFilter-based subscriber; RUST_LOG overrides the engine.toml log_level.
identity / remote-store / messaging / http stay at Unsupported per the 0.2 roadmap (keystore / Swarm / Waku land in 0.3).
Tests (14, all green):
- cow_orderbook: pool default chains, unknown-chain typing, REST GET passthrough, relative-path resolution, unknown-method rejection, submit_order round-trip - last three under wiremock so the full HTTP path is exercised without hitting api.cow.fi.
- provider_pool: empty pool surfaces UnknownChain.
- local_store: roundtrip, namespace isolation, delete, list_keys prefix-stripping, empty-namespace rejection.
End-to-end against modules/example: example.wasm loads under the new wiring, logs init + on_event through the tracing pipeline.
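
A sketch of the `[len:u8][module_name][raw_key]` key layout described for local_store, assuming nothing about the actual redb-backed code. The function names `namespaced_key` and `strip_namespace` are hypothetical.

```rust
/// Build the stored key as `[len:u8][module_name][raw_key]`.
/// Returns `None` for an empty module name or one longer than 255 bytes.
pub fn namespaced_key(module: &str, raw_key: &[u8]) -> Option<Vec<u8>> {
    let name = module.as_bytes();
    if name.is_empty() {
        return None; // mirrors the "empty-namespace rejection" test case
    }
    let len = u8::try_from(name.len()).ok()?;
    let mut out = Vec::with_capacity(1 + name.len() + raw_key.len());
    out.push(len);
    out.extend_from_slice(name);
    out.extend_from_slice(raw_key);
    Some(out)
}

/// Recover the guest-visible key, or `None` if `stored` belongs to a
/// different module (what a `list_keys` implementation would filter on).
pub fn strip_namespace<'a>(module: &str, stored: &'a [u8]) -> Option<&'a [u8]> {
    let (&len, rest) = stored.split_first()?;
    let len = len as usize;
    if rest.len() < len || &rest[..len] != module.as_bytes() {
        return None;
    }
    Some(&rest[len..])
}
```

The length byte is what keeps namespaces isolated: without it, module `ab` with key `c` and module `a` with key `bc` would map to the same stored bytes.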

… death (BLEU-813-817)

…ME (BLEU-820)

…er-pool, supervisor (BLEU-821)

…interfaces (BLEU-819)

…ed_crate_dependencies, drop redundant map_err)

PR #9 specific:
- main: warn + return when block/log streams end (WebSocket dropped)
- supervisor: simplify dispatch_block by extracting chain_id before move
- supervisor: temp_local_store returns (TempDir, LocalStore) instead of leaking
- README: correct engine.toml chain syntax to [chains.<id>] with rpc_url
Rebased from PR #8:
- local_store_redb: table.range() instead of iter() for O(matching) keys
- provider_pool: dedupe method clone on the success path
- main: hex_encode writes into the pre-allocated buffer
- cow_orderbook: drop blank line nit
- manifest: collapse nested if and use ? operator (clippy)
- alloy_rpc_client / alloy_transport(_ws) imports as _ to satisfy unused_crate_dependencies.

Move the manifest.rs monolith into a directory module with four focused submodules (types, load, capabilities, error). Includes the Subscription enum and the four PR #9 tests for subscription parsing. Behaviour unchanged - pure code motion.

main.rs went from 739 lines of mixed bootstrap + 8 Host trait impls + CLI parser + event loop to ~125 lines of pure orchestration. New layout:
- bindings.rs: wasmtime::component::bindgen!() moved out so other modules can name the generated types.
- cli.rs: Cli struct + manual parser.
- host/state.rs: HostState + WasiView impl.
- host/error.rs: unimplemented / internal_error / hex_encode helpers.
- host/impls/{chain,cow_api,identity,local_store,remote_store,messaging,logging,clock,random,http,types}.rs: one Host trait impl per file.
- runtime/limits.rs: DEFAULT_FUEL_PER_EVENT + DEFAULT_MEMORY_LIMIT.
- runtime/event_loop.rs: open_block_streams, open_log_streams, run, wait_for_shutdown_signal, TaggedBlockStream, TaggedLogStream.
Adding a new capability is now a single new file under host/impls/ rather than a 60-80 line diff in main.rs.

local_store_redb.rs was 89% tests, cow_orderbook.rs was 60%, and supervisor.rs was 32% (205 lines absolute). Promote each to a directory module with the test suite living in a sibling tests.rs so impl-side diffs stop competing with test churn for attention.

…tion (COW-1079)
First COW-1079 run on a real Anvil fork of Sepolia. The engine-side acceptance bar is cleared with wide margin:
- Per-block dispatch latency p50/p95/p99 = 4/6/7 ms (bar was < 2 s).
- Zero traps, zero poisoned modules, zero shepherd_module_errors_total.
- EthFlow strategy submitted 1 OrderPlacement end-to-end through the mock orderbook in 10 ms; submitted:{uid} marker written cleanly.
- 63 Anvil blocks dispatched flawlessly.
The honest finding: load-gen's transactions get into Anvil's mempool (twap_ok=270, ethflow_ok=270 per the eth_sendTransaction response), but only 5 ConditionalOrderCreated + 1 OrderPlacement events actually fired - the rest reverted at the contract level (ComposableCoW.create + EthFlow.createOrder run preconditions the load-gen-crafted bodies don't pass). So this run stressed the engine with ~6 events over 60 s, not 5+5 per block. The bar criterion that depends on the load-gen (events-per-block delivered) is the only one that doesn't pass; filing a follow-up to calibrate the revert rate before re-running.
Report at docs/operations/load-reports/load-5x5-2026-06-19.md mirrors the COW-1064 e2e-report shape and signs off as "conditional pass" - engine meets the bar; load-gen needs work.

scripts/lib.sh exports REPORTS_DIR=e2e-reports/ unconditionally. load-run.sh used to set REPORTS_DIR=load-reports/ BEFORE sourcing load-bootstrap.sh (which transitively sources lib.sh), so the override was lost and the auto-generated skeleton ended up under e2e-reports/ next to the COW-1064 reports. Move the assignment after the source so the load-reports/ path wins, with a comment explaining the ordering trap. Drive-by: removed the misplaced e2e-reports/load-5x5-2026-06-19.md from the first run; the committed report at load-reports/load-5x5-2026-06-19.md (commit 59fe714) is the canonical copy.

COW-1079 baseline's 5/270 + 1/270 delivery rate had two distinct root causes, both contract-side, neither shepherd's fault:
1. **Nonce race in burst submissions.** Anvil's `eth_sendTransaction` against an impersonated account auto-assigns a nonce when none is provided, but the assignment races with the caller's burst submission. When load-gen fired 5 TWAP + 5 EthFlow per block without waiting for individual receipts, most txs landed in the mempool sharing the same nonce, and Anvil's miner included only one per block - the rest reverted as nonce-too-low. Fix: read the EOA's current nonce at boot, increment locally per successful submission, pin `tx.nonce` explicitly on every `TransactionRequest`. Lock-step with cargo build cache so the nonce counter never crosses async-boundary corruption.
2. **EthFlow OrderUid dedup on identical GPv2 OrderData.** The CoWSwapEthFlow contract dedups by the GPv2 `OrderUid` which is keccak over (buyToken, receiver, sellAmount, buyAmount, appData, feeAmount, validTo, partiallyFillable, kind, sellTokenSource, buyTokenDestination). quoteId is NOT part of that hash. The prior load-gen varied only `quoteId` per call, so all 270 EthFlow submissions produced the same UID and the contract rejected 269/270 as `OrderIsAlreadyOwned`. Fix: vary `sellAmount` by 1 wei per call (`BASE_SELL_AMOUNT + seq`) and pass that same value as `msg.value` so the contract's `msg.value == order.sellAmount` invariant holds.
Re-ran baseline 5x5 after both fixes: 130/130 TWAP + 130/130 EthFlow delivered, 130 ConditionalOrderCreated + 130 OrderPlacement events on-chain, 130 cow_api submits OK to mock, 130 ethflow markers written, zero shepherd_module_errors_total. Updated baseline report at docs/operations/load-reports/load-5x5-2026-06-19.md from 'conditional pass' to 'full PASS' with the post-calibration numbers (TWAP block p99 = 49 ms, EthFlow log p99 = 11 ms, 40x margin on the < 2 s bar).
Medium 20x20 and saturation 50x50 are now unblocked per the COW-1079 acceptance roadmap.
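
A sketch of the two load-gen fixes, assuming a simplified model with no RPC client. `NonceTracker`, `EthFlowCall`, `build_call`, and the value chosen for `BASE_SELL_AMOUNT` are hypothetical; a real implementation would use a 256-bit integer type for wei amounts rather than `u128`.

```rust
/// Hypothetical base amount in wei (0.001 ETH); the real constant is not shown in the PR.
pub const BASE_SELL_AMOUNT: u128 = 1_000_000_000_000_000;

/// Local nonce counter: seeded once from the chain, advanced only after a
/// submission succeeds, so every request carries an explicit nonce.
#[derive(Debug)]
pub struct NonceTracker {
    next: u64,
}

impl NonceTracker {
    pub fn new(on_chain_nonce: u64) -> Self {
        Self { next: on_chain_nonce }
    }
    pub fn current(&self) -> u64 {
        self.next
    }
    pub fn record_success(&mut self) {
        self.next += 1;
    }
}

/// The fields of an EthFlow submission that matter for the two fixes.
#[derive(Debug, PartialEq)]
pub struct EthFlowCall {
    pub nonce: u64,
    pub sell_amount: u128,
    pub msg_value: u128,
}

/// Pin the nonce and vary `sell_amount` by 1 wei per sequence number so each
/// order hashes to a distinct UID; `msg_value` mirrors `sell_amount` to keep
/// the `msg.value == order.sellAmount` invariant.
pub fn build_call(nonces: &NonceTracker, seq: u64) -> EthFlowCall {
    let sell_amount = BASE_SELL_AMOUNT + u128::from(seq);
    EthFlowCall {
        nonce: nonces.current(),
        sell_amount,
        msg_value: sell_amount,
    }
}
```

Because the counter only advances on `record_success`, a failed submission can be rebuilt with the same nonce instead of leaving a gap.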

…(COW-1079)
Closes the COW-1079 three-scenario sweep with the COW-1080 calibration in place. All three scenarios pass:
- baseline 5x5 - 130/130 each, TWAP block p99=49ms
- medium 20x20 - 280/280 each, TWAP block p99=67ms
- saturation 50x50 - 300/300 each, TWAP block p99=78ms
Latency growth across the watch-count range (130 -> 280 -> 300) is sub-linear: 49 -> 67 -> 78 ms. The lgahdl PR #9 concern about sequential per-module dispatch saturating under load is NOT surfaced at this scale. Zero shepherd_module_errors_total, zero traps, zero EthFlow submit errors across all three runs.
The unexpected finding from saturation: the engine did not saturate. The bottleneck is load-gen's sequential eth_sendTransaction submission (each tx ~200 ms RTT, so 100 tx/iteration = ~20 s, vs. Anvil's 1 s block time). To genuinely saturate the engine we would need parallel load-gens against different impersonated EOAs, a sub-second block-time, or thousands of pre-seeded watches.
EthFlow log p99 stayed flat at ~9 ms across all three scenarios (it is dominated by the cow-api submit roundtrip, not engine state), confirming the submit path scales independently of the watch count.
The cold-start outlier (~500 ms on the first watch-heavy block) appears consistently across runs and is independent of the steady-state watch count - it is a one-shot first-block redb/eth_call warmup cost, NOT a saturation symptom.
What this proves:
- Shepherd M4 supervisor handles >= 300 concurrent watches + >= 138 block dispatch cycles in 2 min with p99 < 80 ms.
- cow-api submit path is steady at ~9 ms p99 regardless of watch count.
- Zero error/trap/poison across all three scenarios.
What it does NOT prove (and is not in scope here):
- Behaviour at 3000+ watches.
- WS reconnect resilience (COW-1031 soak).
- Multi-day memory drift (COW-1031).
- Real-orderbook 4xx variety (COW-1078 backtest).
COW-1079 ready to move to In Review.

…079)
The single-EOA saturation 50x50 report identified the per-EOA nonce serialisation as the bottleneck before the engine had a chance to saturate. This commit removes that bottleneck:
load-gen:
- New --parallel N flag. Each worker impersonates a synthetic EOA (0x57...01..0a), gets its own WS connection + nonce stream, runs its own per-block submission loop. Total events per block scales linearly with N.
- Disjoint salt space per worker via 96-bit prefix.
- Disjoint EthFlow sellAmount space via a 10_000-wide per-worker window (the first attempt shifted by 96 bits, blowing past the 1M ETH funded balance with 7.9e28 wei sellAmounts; fixed).
scripts/load-bootstrap.sh + scripts/load-run.sh:
- Accept --block-time (passes to anvil) and --parallel (passes to load-gen). Defaults preserve historic behaviour: --block-time 1, --parallel 1.
- Auto-report filename now includes scenario label (load-NxM-SCENARIO-date.md) so saturation-parallel does not overwrite the baseline 5x5 report.
Saturation-parallel run (10 workers x 5 TWAP + 5 EthFlow per block, --block-time 0.5, 2 min):
- load-gen: 895/895 TWAP + 895/895 EthFlow acks, 0 errors.
- engine saw 381 ConditionalOrderCreated + 343 OrderPlacement events (43% / 38% delivery vs load-gen acks - Anvil + WS dropping under the heavier load).
- shepherd_module_errors_total = 0, zero traps.
- All 343 EthFlow submissions reached the mock orderbook 1:1.
- TWAP block dispatch: histogram p50/p99 = 145 ms, max = 101 593 ms (101 s outlier on one block when 380+ watches polled against a stressed Anvil JSON-RPC).
- Engine-log dispatch_block: n=586, p50=4ms, p95=46ms, p99=74ms, max=101 593 ms - same outlier.
Saturation knee identified: 380+ active watches + 0.5s block-time + 10 concurrent WS subscribers produces a 101-second worst-case dispatch and only 38-43% event delivery (57-62% of acked events never reach the engine).
Both symptoms point at the surrounding system (Anvil + WS transport), not at shepherd; engine continues to scale sub-linearly with watch count and never produces a module error, trap, or panic under any tested configuration. For the 7-day COW-1031 soak: this implies the operator should use a paid Sepolia archive endpoint (Alchemy / drpc / QuickNode), not publicnode, OR accept event drops and rely on supervisor reconnect + eth_getLogs re-indexing. Documented in the new report. Report at docs/operations/load-reports/load-50x50-parallel-2026-06-19.md.
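
A sketch of the per-worker sellAmount windowing, with a second function showing why the first attempt (shifting the worker id by 96 bits) overshot the funded balance. The names `sell_amount` and `shifted_offset` are hypothetical, and `u128` stands in for a 256-bit wei type.

```rust
/// Width of each worker's private sellAmount window, per the commit message.
pub const WINDOW: u128 = 10_000;

/// Sell amount for submission `seq` of worker `worker`.
/// Returns `None` once a worker would spill into its neighbour's window
/// (or on arithmetic overflow).
pub fn sell_amount(base: u128, worker: u32, seq: u32) -> Option<u128> {
    if u128::from(seq) >= WINDOW {
        return None;
    }
    base.checked_add(u128::from(worker) * WINDOW)?
        .checked_add(u128::from(seq))
}

/// Offset produced by the abandoned approach: worker id shifted left by
/// 96 bits. Worker 1 alone adds 2^96 wei (about 7.9e28).
pub fn shifted_offset(worker: u32) -> u128 {
    u128::from(worker) << 96
}
```

With ten workers the windowed scheme adds at most 99_999 wei on top of the base, while the shifted scheme exceeds a 1M ETH (1e24 wei) balance as soon as the worker id is non-zero.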

Squash of PR #66 - applies 5 blockers + 8 majors from M4 audit.

…a doc link
Rebase fallout from the M4 compliance pass:
- `chain/chainlink.rs` defines `StubHost<Result<String, HostError>>` and manually implements every `*Host` trait. When the M4 conflict resolution added the `cow_api_request` forwarder into the macro's `CowApiHost` impl, this local StubHost was missed, producing `E0046: not all trait items implemented`. Add a parallel `unreachable!("not used in this test")` body; the test never exercises the cow-api surface.
- `cow/app_data.rs`'s module-level doc referred to `EMPTY_APP_DATA_JSON` as an unqualified intra-doc link, but the symbol is only used as `cowprotocol::EMPTY_APP_DATA_JSON` inside the function body (no `use` at module scope). `RUSTDOCFLAGS=-D warnings` rejects the unresolved link. Qualify the path so it resolves while keeping the prose intent.
- `wit_bindgen_macro.rs` fmt drift: cargo fmt collapses the `shepherd::cow::cow_api::request(...).map_err(convert_err)` chain to a single line. Apply the canonical format.
Brings dev/m4-base back to fmt/clippy/test/doc green.

…face
Audit reference: milestone-rubric-grant-audit-2026-06-25.md, Major #3 (`[u8; 32]` for protocol hash across SDK public boundary). The rubric explicitly calls out: "Newtypes for protocol IDs (no raw `[u8; 32]` across module boundaries)." `B256` is already in `shepherd_sdk::prelude` so the swap costs callers nothing - both twap-monitor and ethflow-watcher were holding the appData as `B256` already and reaching through `.0` to satisfy the prior signature.
Changes:
- `resolve_app_data(host, chain_id, &B256)` (was `&[u8; 32]`)
- `encode_hex(&B256)` internal helper
- Doctest + 5 unit tests rewritten against `B256::from(bytes)` and `B256::from_slice(EMPTY_APP_DATA_HASH.as_slice())`. Coverage stays identical.
- Call sites in twap-monitor and ethflow-watcher drop the `.0` reach-through; pass `&order.appData` directly.
No public surface beyond `shepherd-sdk` consumes this function; external module crates in the workspace are the only consumers and both land in the same commit.

Audit reference: milestone-rubric-grant-audit-2026-06-25.md, duplication finding "Canonical CoW chain set [Mainnet, Gnosis, Sepolia, ArbitrumOne, Base]" duplicated at `crates/nexum-engine/src/host/cow_orderbook.rs:39-43` and `:66-70`. `from_config` was added in the M4 multi-chain pass and reproduced the same 5-element array `Default::default` already used. Adding a sixth chain previously needed touching both arrays in lock-step; pull the list into a single `const DEFAULT_CHAINS: &[Chain]` so the single-source-of-truth property is structural. Also drops the redundant `use cowprotocol::OrderBookApi;` inside `from_config` (already in scope from the module-top `use cowprotocol::{Chain, OrderBookApi, ...}` line). Behaviour identical.

Audit reference: milestone-rubric-grant-audit-2026-06-25.md, Major #6. Rubric forbids em-dashes in operator-facing config files; while .toml is technically a grey zone the comment surfaces verbatim when operators `cat engine.e2e.toml` during e2e runbook execution.

…W-1084)
Adds `tools/baseline-latency/baseline_latency.py`, a per-chain script that pairs every on-chain `EthFlow.OrderPlacement` event in a trailing window with the orderbook's record for the same UID and reports `(creationDate - block.timestamp)`. Matching is rigorous: the script ABI-decodes the GPv2OrderData from each event, computes the EIP-712 order digest against the chain's GPv2Settlement domain, and looks up the resulting UID against the orderbook's bulk `/account/.../orders` fetch (single-UID fallback if missed). No temporal-FIFO approximation.
For EthFlow orders the orderbook indexer sets `creationDate := block.timestamp` (not the indexer's ingest time), so the historical delta is structurally 0s on every chain. This is intentional back-fill-style behaviour, not a measurement bug. **Implication**: EthFlow indexer latency cannot be derived from historical orderbook data; the meaningful relayer-latency baseline lives on the TWAP lane (where the orderbook records the indexer's `now()` per child order PUT). TWAP child-latency is a follow-up; it needs per-part UID derivation from each parent `ConditionalOrderCreated` static input.
Sepolia ran clean: 256 events scanned, 200 UID-derived pairs, all 200 matched against the bulk fetch (`bulk_hit=200`). Median = p95 = 0.0s, exactly as the finding predicts.
Public-tier RPCs (drpc.org free, 1rpc.io, ankr w/o key, llamarpc, cloudflare-eth) all refuse / throttle `eth_getLogs` at any usable chunk size on the production chains. The script halves down to 50-block chunks and gives up after 3 consecutive failures, marking the cell `RPC-LIMITED` with a pointer to the `RPC_URL_*` env override. This is the same constraint the M5 soak (COW-1031) will face and independently confirms the paid-endpoint requirement for any serious log-scanning workload.
- `tools/baseline-latency/baseline_latency.py` (~520 lines): argparse CLI, per-chain `Chain` dataclass, JSON-RPC helper with halving retry + `RpcLimited` sentinel, EIP-712 order digest + UID derivation, UID-keyed orderbook matching, markdown report renderer.
- `tools/baseline-latency/data/*.json`: per-chain raw dump (events, pairs, deltas, diagnostics) for auditability.
- `docs/operations/baselines/baseline-latency-2026-06-19.md`: the first run's report.
Pinning the orderbook's `creationDate` semantics matters because the COW-1079 and COW-1031 KPIs reference "watchtower latency": the M4 report needs to be honest about which lane the latency lives on (TWAP relayer PUT, not EthFlow indexer ingest). The Sepolia data set also gives the M4 e2e harness ground-truth UID ↔ block pairings to cross-check against.
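
A sketch of the last step of the UID derivation mentioned above, written in Rust to match the rest of the workspace rather than the Python tool. It assumes the 56-byte GPv2 layout (32-byte order digest, 20-byte owner, 4-byte big-endian `validTo`) and takes the EIP-712 digest as an input, since keccak is outside the standard library. The names `pack_order_uid` and `to_hex` are hypothetical.

```rust
/// Pack an order UID as `digest (32) || owner (20) || validTo (4, big-endian)`.
/// The digest is assumed to be the already-computed EIP-712 order hash.
pub fn pack_order_uid(digest: [u8; 32], owner: [u8; 20], valid_to: u32) -> [u8; 56] {
    let mut uid = [0u8; 56];
    uid[..32].copy_from_slice(&digest);
    uid[32..52].copy_from_slice(&owner);
    uid[52..].copy_from_slice(&valid_to.to_be_bytes());
    uid
}

/// Render bytes as a 0x-prefixed lowercase hex string, the form used when
/// keying orderbook lookups by UID.
pub fn to_hex(bytes: &[u8]) -> String {
    let mut s = String::with_capacity(2 + bytes.len() * 2);
    s.push_str("0x");
    for b in bytes {
        s.push_str(&format!("{b:02x}"));
    }
    s
}
```

Matching on the packed UID rather than on event order is what lets the tool avoid the temporal-FIFO approximation: two orders placed in the same block still resolve to distinct keys.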

…W-1082)

The chain backend previously dropped alloy's structured `RpcError::ErrorResp` payload on the floor: the formatted error string went into `HostError.message`, but `HostError.data` stayed `None` and `HostError.code` was hard-coded to `-32603`. That made the twap-monitor's poll-time revert classifier inert on real traffic: `OrderNotValid` / `PollNever` / `PollTryAtBlock` / `PollTryAtEpoch` all fell through to `TryNextBlock` because `decode_revert_hex` only fires on a non-empty `err.data`.

This change wires the structured payload through end-to-end.

- `crates/nexum-engine/src/host/provider_pool.rs`: when alloy's `provider.raw_request` fails with an `RpcError::ErrorResp`, the pool now captures both `payload.code` (as `Option<i64>` so we can distinguish "no ErrorResp" from "ErrorResp with code 0") and `payload.data` (as `Option<String>`, the JSON-encoded revert hex) and surfaces them on `ProviderError::Rpc`. Transport-side failures (timeout, websocket disconnect) leave both `None`. The two subscribe paths (`subscribe_blocks`, `subscribe_logs`) keep `code: None, data: None` since they don't carry an ErrorResp.
- `crates/nexum-engine/src/host/impls/chain.rs`: extract the `ProviderError -> HostError` projection into a free helper `provider_error_to_host_error`. The `Rpc` arm forwards the structured `data` verbatim, preserves the node-reported code (saturating out-of-`i32` values to `-32603`), and falls back to `-32603` only when no `ErrorResp` was present. Five unit tests cover: revert with data, transport failure with `None`, out-of-range code, unknown-chain, and invalid-params.
- `modules/twap-monitor/src/strategy.rs`: update the stale comment on the `decode_revert_hex` branch. That branch is now live on real traffic; the only `None` path is transport-level failures (which keep the safe `TryNextBlock` default).

No incorrect order is ever submitted (the contract reverts; the orderbook never sees a bad body). The issue is pruning efficiency: a permanently dead TWAP watch was re-polled every block until a submit eventually failed for an unrelated reason, and the local-store filled with `watch:` entries the strategy could otherwise drop on the first revert. With this fix the SDK-side classifier dispatches `Drop` / gate on the first revert, matching the documented expectation in `docs/adr/0007-upstream-protocol-logic-to-cow-rs.md`.

- 70/70 nexum-engine tests pass
- 23/23 twap-monitor tests pass
- 5/5 new chain.rs projection tests pass (revert-with-data, transport-fail, out-of-range-code, unknown-chain, invalid-params)
- `cargo clippy -p nexum-engine -p twap-monitor --all-targets -- -D warnings` clean

Reference: jeffersonBastos's PR #55 (M3 mirror) review, thread on `modules/twap-monitor/src/strategy.rs:189`. The mirror of this fix on the cow-api side is COW-1075 (already merged via PR #48).
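The code projection described above can be sketched as follows. This is a minimal illustration, not the engine's actual code: the `ProviderError` and `HostError` shapes are simplified stand-ins reduced to the fields the commit message names, and the `INTERNAL_ERROR` constant name is an assumption.

```rust
/// Simplified stand-in for the engine's provider error (assumption: the real
/// type has more variants, e.g. unknown-chain and invalid-params).
#[derive(Debug, PartialEq)]
pub enum ProviderError {
    /// `code` and `data` are `Some` only when the node returned an ErrorResp;
    /// transport failures (timeout, disconnect) leave both `None`.
    Rpc {
        message: String,
        code: Option<i64>,
        data: Option<String>,
    },
}

/// Simplified stand-in for the error handed to module code.
#[derive(Debug, PartialEq)]
pub struct HostError {
    pub code: i32,
    pub message: String,
    pub data: Option<String>,
}

/// JSON-RPC "internal error" code, used as the fallback.
const INTERNAL_ERROR: i32 = -32603;

pub fn provider_error_to_host_error(err: ProviderError) -> HostError {
    match err {
        ProviderError::Rpc { message, code, data } => HostError {
            // Keep the node-reported code when it fits in i32; out-of-range
            // codes and the "no ErrorResp" case both map to -32603.
            code: code
                .map(|c| i32::try_from(c).unwrap_or(INTERNAL_ERROR))
                .unwrap_or(INTERNAL_ERROR),
            message,
            // Forward the revert payload verbatim so the module-side
            // classifier can decode it.
            data,
        },
    }
}
```

Keeping `code` as `Option<i64>` until the last step is what lets the projection tell "no ErrorResp" apart from "ErrorResp with code 0", as the commit message points out.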

…OW-1083)

The strategy's `apply_submit_retry` previously wrote an empty `backoff:{uid}` marker on every retriable submit failure (including the `TryNextBlock` fallback for unparseable orderbook envelopes). The marker was a presence flag with no payload, so on every supervisor reconnect / engine restart the same dead placement would retry indefinitely, bounded only by log re-delivery frequency.

This change persists a per-UID retry count in the marker's value (ASCII `u32`) and upgrades to `dropped:` after `MAX_BACKOFF_RETRIES = 5` consecutive retries. The upgrade emits a Warn-level log line so the operator sees the structural issue (flaky CDN, indexer hiccup, poisoned envelope) rather than silently accumulating retries.

- `modules/ethflow-watcher/src/strategy.rs`:
  - New const `MAX_BACKOFF_RETRIES = 5`.
  - New helper `read_backoff_count` that reads + parses the marker payload; pre-COW-1083 empty markers decode to 0 so previously-set `backoff:` rows still get a fresh attempt (no premature drop on rollout).
  - `apply_submit_retry`'s retriable branch now reads the prior count, increments, and either writes the new count or upgrades to `dropped:` (clearing the stale `backoff:`) at the cap.
  - Cap-upgrade log line carries the retry-count and message: "... after 5 retries on transient/unparseable rejection ...".

- 19/19 ethflow-watcher tests pass.
- New `submit_transient_error_at_cap_upgrades_to_dropped_warn`: seeds `backoff:{uid} = "4"`, triggers a `data: None` rejection (the unparseable case the issue names explicitly), asserts:
  * `dropped:{uid}` is now set
  * `backoff:{uid}` is cleared (single outcome marker at rest)
  * exactly one Warn log line containing "ethflow dropped" + "retries"
- New `submit_transient_error_with_legacy_empty_marker_resets_counter`: backwards-compat. A pre-COW-1083 empty `b""` marker is treated as count 0, bumped to "1" on first retry rather than prematurely dropping. Protects in-flight backoffs across the rollout.
- Existing `submit_transient_error_writes_backoff_marker_and_returns` extended with an assertion that the first retry persists `backoff:{uid} = "1"`.
- `cargo clippy -p ethflow-watcher --all-targets -- -D warnings` clean.

Surfaced by jeffersonBastos's PR #55 (M3 mirror) review, thread on `crates/shepherd-sdk/src/cow/error.rs:82`. Latent in normal operation (the host forwards parseable envelopes after COW-1075, so `classify_api_error` returns `Drop` for permanent rejections), but the gap fires when the orderbook returns a non-JSON 4xx body (e.g. an HTML error page from a CDN) or if a future host change accidentally drops the envelope again. Bounded retry semantics close the latent risk without changing the safe-default classification (still `TryNextBlock` on `None` data; that part is explicitly out of scope per the issue).
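The bounded-retry bookkeeping described above might look roughly like this. It is a sketch under assumptions: the `Store` alias is an in-memory stand-in for the module's local store, `record_retriable_failure` is a hypothetical name for the retriable branch of `apply_submit_retry`, and treating a non-empty but unparseable marker as 0 is a choice made here, not something the commit message states.

```rust
use std::collections::HashMap;

pub const MAX_BACKOFF_RETRIES: u32 = 5;

/// Hypothetical in-memory stand-in for the module's local store.
pub type Store = HashMap<String, Vec<u8>>;

/// Read the retry count persisted in `backoff:{uid}` as ASCII digits.
/// A missing or legacy empty marker decodes to 0 (assumption: so does any
/// payload that fails to parse).
pub fn read_backoff_count(store: &Store, uid: &str) -> u32 {
    store
        .get(&format!("backoff:{uid}"))
        .and_then(|raw| std::str::from_utf8(raw).ok())
        .and_then(|text| text.trim().parse::<u32>().ok())
        .unwrap_or(0)
}

/// Record one retriable failure. Returns `true` when the UID hit the cap and
/// was upgraded to `dropped:{uid}` (with the stale `backoff:` cleared).
pub fn record_retriable_failure(store: &mut Store, uid: &str) -> bool {
    let next = read_backoff_count(store, uid).saturating_add(1);
    if next >= MAX_BACKOFF_RETRIES {
        // Single outcome marker at rest: drop the counter, keep `dropped:`.
        store.remove(&format!("backoff:{uid}"));
        store.insert(format!("dropped:{uid}"), Vec::new());
        true
    } else {
        store.insert(format!("backoff:{uid}"), next.to_string().into_bytes());
        false
    }
}
```

With this shape, a marker seeded at `"4"` upgrades to `dropped:` on the next failure, and a legacy empty marker is bumped to `"1"`, matching the two test cases listed above.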

Adds the COW-1078 pre-soak backtest end-to-end:

1. `tools/backtest-collect/backtest_collect.py`: Python collector that pulls a trailing N-day window of `OrderPlacement` (EthFlow) and `ConditionalOrderCreated` (TWAP) events from Sepolia, ABI-decodes each payload, derives the EthFlow `OrderUid` via EIP-712 against the chain's GPv2Settlement domain, resolves every non-empty `appData` hash via `GET /api/v1/app_data/{hash}`, and emits a single fixtures JSON. Reuses the log-scan + UID-derive infra introduced by the baseline-latency tool (COW-1084 PR #57).
2. `crates/shepherd-backtest`: new Rust binary that loads the fixtures, programs a `MockHost` per event (resolved `app_data` response + UID-echo submit response), and drives `ethflow_watcher::strategy::on_logs` directly. Each event is classified into `Submitted` / `RejectedExpected` / `RejectedUnexpected` / `StrategyError` and rendered into a Markdown report at `docs/operations/backtest-reports/backtest-7d-YYYY-MM-DD.md`.
3. `modules/ethflow-watcher`: `crate-type = ["cdylib", "rlib"]` and cfg-gate the wit-bindgen glue so the rlib carries only the `strategy` module (now `pub mod`) for native consumers. The wasm artefact is unchanged.

7-day Sepolia window (2026-06-15..2026-06-22): **240 EthFlow events, 240 Submitted, 0 anomalies = 100.0% pass vs. 95% threshold**. The report is committed at `docs/operations/backtest-reports/backtest-7d-2026-06-22.md`.

26 TWAP `ConditionalOrderCreated` events are collected and counted but the replay is deferred to Phase 2B: driving `twap_monitor::strategy::on_block` requires walking each watch's `eth_call(getTradeableOrderWithSignature)` per-block, which public-tier RPCs refuse (see the baseline-latency / COW-1031 finding). The fixtures are committed so the future re-run inherits the same dataset.

- v1: EthFlow lane end-to-end (collector + replay + report).
- v2 (follow-up): TWAP lane via paid-RPC archive walking; downstream validation via `POST /api/v1/quote` round-trip on captured bodies.
- Out of scope per the issue: supervisor / event-loop / WS reconnect coverage (stays on the wall-clock soak); fuel/memory limits (stays on COW-1036 / soak); orderbook PUT mutation (forbidden; only read-only endpoints are touched).

- 19/19 ethflow-watcher tests pass (rlib + cdylib build both clean)
- Full workspace test sweep passes (no regressions)
- `cargo clippy -p shepherd-backtest -p ethflow-watcher --all-targets -- -D warnings` clean
- Live run: 240 fixtures → 240 Submitted, 0 anomalies

```bash
python3 tools/backtest-collect/backtest_collect.py --days 7
cargo run -p shepherd-backtest -- \
    --fixtures tools/backtest-collect/fixtures-YYYY-MM-DD.json
```
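For orientation, the pass-rate arithmetic behind the "100.0% pass vs. 95% threshold" figure can be sketched as below. The four outcome names come from the description above; counting `RejectedUnexpected` and `StrategyError` as the anomalies, and the `pass_rate_pct` / `PASS_THRESHOLD_PCT` names, are assumptions made for this illustration rather than the crate's confirmed definitions.

```rust
/// The four replay classifications named in the backtest description.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Outcome {
    Submitted,
    RejectedExpected,
    RejectedUnexpected,
    StrategyError,
}

/// Threshold quoted in the report (95%).
pub const PASS_THRESHOLD_PCT: f64 = 95.0;

/// Percentage of events that are not anomalies, or `None` for an empty run.
/// Assumption: an anomaly is `RejectedUnexpected` or `StrategyError`.
pub fn pass_rate_pct(outcomes: &[Outcome]) -> Option<f64> {
    if outcomes.is_empty() {
        return None;
    }
    let anomalies = outcomes
        .iter()
        .filter(|o| matches!(o, Outcome::RejectedUnexpected | Outcome::StrategyError))
        .count();
    let passed = outcomes.len() - anomalies;
    Some(100.0 * passed as f64 / outcomes.len() as f64)
}
```

Under that reading, 240 `Submitted` events out of 240 give exactly 100.0, comfortably above the threshold.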

Closes the M5 packaging gap surfaced by the audit: the Dockerfile + compose recipe lived inside `docs/production.md` but neither was at the repo root, so `docker build .` didn't work and there was no published image. This change makes the deploy path one-line on a fresh VM.

- **`Dockerfile`**: multi-stage build (rust:1.96-slim-bookworm → debian:bookworm-slim). Builds the engine in release + the 5 production modules to wasm32-wasip2. Runtime stage strips down to `tini` (PID 1 for graceful shutdown / SIGINT forwarding per COW-1072) + `ca-certificates` (TLS to cow.fi + paid RPCs) + a non-root `shepherd` user owning `/var/lib/shepherd`. Final image: **198 MB** (engine + 5 wasm modules + Debian slim).
- **`.dockerignore`**: excludes `target/`, `data/`, the heavy backtest / baseline JSON fixtures, and local-only engine configs, while keeping `modules/fixtures/*-bomb` (workspace members; Cargo rejects the manifest if they're missing) and the source markdown docs (so `docker exec` can grep them in place).
- **`docker-compose.yml`**: two profiles. Default boots just the engine with a `shepherd-state` named volume + the operator's `./engine.toml` mounted ro at `/etc/shepherd/engine.toml`, metrics on the host loopback (`127.0.0.1:9100`). The `observability` profile (`docker compose --profile observability up`) layers a Prometheus container pre-wired to scrape `shepherd:9100`. Graceful shutdown via `stop_signal: SIGINT` + `stop_grace_period: 30s` per the production runbook. Healthcheck hits `/metrics`.
- **`engine.docker.toml`**: pre-baked config that matches the paths the image bakes (`/opt/shepherd/modules/*.wasm`, `/opt/shepherd/manifests/*.toml`, `/var/lib/shepherd` state dir). Operator workflow: `cp engine.docker.toml engine.toml`, swap `<RPC_KEY>` placeholders, `docker compose up -d`.
- **`docs/deployment/docker.md`**: operator runbook. Covers first-boot, engine.toml configuration, upgrade / rollback, local-build path, post-deploy verification, cross-links to `docs/production.md` for the full hardening surface.
- **`docs/deployment/prometheus.yml`**: scrape config consumed by the observability compose profile.
- **`.github/workflows/docker.yml`**: build + push to `ghcr.io/bleu/nullis-shepherd` on every push to `main` and every `v*` tag. PR builds run the build for smoke (no push). Tags produced: `latest` (main HEAD), `v<tag>` (releases), `sha-<short>` (every event for exact pinning), `manual-<run_id>` (workflow_dispatch). Registry-side layer cache via `:buildcache` keeps incremental rebuilds fast. linux/amd64 only: the soak VM is x86_64; add arm64 once an operator surfaces a real need. Action SHAs pinned to match `.github/workflows/ci.yml` style.

Build runs locally end-to-end in ~10 min on a clean Docker daemon:

    $ docker build -t shepherd:smoke .
    $ docker run --rm shepherd:smoke --help
    usage: nexum-engine [<wasm-path> [<manifest-path>]] \
             [--engine-config <path>] [--pretty-logs]
    $ docker run --rm -v "$PWD/engine.docker.toml:/etc/shepherd/engine.toml:ro" \
        shepherd:smoke
    {"level":"INFO","message":"nexum-engine starting",...}
    {"level":"INFO","message":"metrics exporter listening at /metrics",...}
    {"level":"INFO","message":"opening chain RPC provider","chain_id":1,...}
    Error: connect chain 1: HTTP format error: invalid uri character
      ^- expected: <RPC_KEY> placeholder not a real URL

Proves: image builds, entrypoint forwards CMD, engine loads `/etc/shepherd/engine.toml`, metrics exporter binds, provider pool iterates the configured chains, graceful error path works.

- [x] Local `docker build .` succeeds (rust:1.96 base; wasmtime 45 requires rustc >= 1.93, the docs/production.md `1.86` pin was stale)
- [x] Image size: 198 MB
- [x] `docker run ... --help` works
- [x] `docker run ... -v engine.docker.toml:...` reads config + binds metrics + iterates chains
- [x] `cargo test --workspace` clean (18 groups, 203 passed, 0 failed)

On a fresh Debian/Ubuntu VM with Docker installed:

```bash
git clone https://github.com/bleu/nullis-shepherd /opt/shepherd
cd /opt/shepherd
cp engine.docker.toml engine.toml
$EDITOR engine.toml   # add real RPC URL
docker compose pull   # once ghcr.io image is published
docker compose up -d
docker compose logs -f shepherd
curl -s http://127.0.0.1:9100/metrics | head -50
```

- `docs/deployment/multi-chain-guide.md`: dedicated walkthrough configuring 4 chains together (Mainnet + Gnosis + Arbitrum + Base) with per-chain module subscriptions
- Example module declaring multi-chain support (every current example pins Sepolia)
- Optional automated CD trigger (workflow_dispatch SSH'ing to the soak VM to pull + restart), gated on SSH_PRIVATE_KEY repo secret

Companion to the M5 Docker packaging — the operator workflow is `cp engine.docker.toml engine.toml` then drop in a paid RPC URL. Without this rule a clumsy `git add -A` could commit the key. The committed sibling templates (engine.example/docker/m2/m3/e2e/load.toml) stay trackable. Validated against a live smoke run: drpc Sepolia WSS endpoint pasted into engine.toml, `docker compose up`, engine subscribed to newHeads + logs, 6 sequential blocks dispatched (11117171..76), metrics `shepherd_event_latency_seconds` p99 = 0.14ms. Tear-down clean. No engine.toml ever staged.

Closes the footgun surfaced by the M5 smoke run on drpc Sepolia: configuring `rpc_url = "https://..."` for a chain that the modules subscribe to silently degrades to an infinite WARN-with-backoff loop (COW-1071's reconnect retries forever because `eth_subscribe` is WS-only in the JSON-RPC spec).

Three coordinated changes:

`EngineConfig::validate_transports()` walks every `[chains.<id>]` entry, and for any `rpc_url` not starting with `ws://` / `wss://` emits one loud ERROR-level structured log line with:

- the chain id
- the redacted offending URL
- the redacted suggested `wss://` swap
- actionable copy explaining the WS requirement and the escape hatch (`[chains.<id>] require_ws = false` for poll-only chains that never subscribe)

The validator is invoked from `main.rs` AFTER the tracing subscriber is initialised (calling it inside `load_or_default` silently dropped the log).

A `require_ws: bool` field is added to `ChainConfig` with `#[serde(default = "default_require_ws")]` = `true`. Operators who genuinely need an HTTP endpoint (poll-only modules, no block / log subscriptions on this chain) opt out explicitly per chain.

The pre-existing `opening chain RPC provider` log in `provider_pool::from_config` was emitting the full URL, API key included, at INFO level. Log aggregators (Loki / Datadog / Splunk) routinely retain weeks of these lines; the key has no business sitting in cold storage. The new `engine_config::redact_url` helper (public so other call sites can adopt it) replaces any path segment longer than 20 chars that doesn't contain `.` or `:` with `<KEY>`. Matches Alchemy / drpc / Infura / QuickNode key shapes. Same helper is used for both the validation ERROR's `rpc_url` and `suggested` fields and the provider-pool boot log.

- `engine.example.toml`: every chain entry switched to `wss://`, with a header block explaining the WS requirement + the `require_ws = false` escape hatch. The previous mix of `https://` + `wss://` would have tripped the new validator on its own example.
- `docs/production.md §6`: blockquote callout pointing operators at the WS requirement, redaction behaviour, and the escape hatch.

Smoke 1 (HTTP, expected to ERROR):

    {"level":"ERROR","message":"rpc_url uses HTTP transport but the engine subscribes to blocks/logs via eth_subscribe (WS-only). [...]","chain_id":11155111,"rpc_url":"https://lb.drpc.live/sepolia/<KEY>","suggested":"wss://lb.drpc.live/sepolia/<KEY>",...}
    $ grep -c "<the-actual-key>" smoke.log
    0

Smoke 2 (WSS, expected to pass + redacted):

    {"level":"INFO","message":"opening chain RPC provider","chain_id":11155111,"url":"wss://lb.drpc.live/sepolia/<KEY>",...}
    $ grep -c "<the-actual-key>" smoke.log
    0

- 9 new unit tests in `engine_config::tests`:
  * `validate_accepts_wss_url`, `validate_accepts_ws_url`
  * `validate_is_silent_when_require_ws_is_false`
  * `validate_runs_without_panicking_on_http_url`
  * `suggest_swaps_https_to_wss`, `suggest_swaps_http_to_ws`, `suggest_passes_through_already_ws_url`
  * `redact_replaces_long_path_segments`, `redact_keeps_short_segments_intact`
- Workspace: 18 groups, **212 passed, 0 failed** (was 203 → +9)
- `cargo clippy --workspace --all-targets -- -D warnings` clean
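The redaction rule and the scheme swap described above are small enough to sketch in a few lines. This is an illustration of the stated rule only: the function bodies are not taken from `engine_config.rs`, `suggest_ws_url` and `is_ws_url` are hypothetical names, and splitting the whole URL on `/` (rather than parsing it) is an assumption that happens to leave the scheme and host alone because they contain `:` or `.`.

```rust
/// Replace any `/`-delimited segment longer than 20 chars that contains
/// neither `.` nor `:` with `<KEY>`. Keys passed as query parameters are
/// not covered by this sketch.
pub fn redact_url(url: &str) -> String {
    url.split('/')
        .map(|segment| {
            let looks_like_key = segment.len() > 20
                && !segment.contains(|c: char| c == '.' || c == ':');
            if looks_like_key { "<KEY>" } else { segment }
        })
        .collect::<Vec<_>>()
        .join("/")
}

/// True when the URL already uses a WebSocket scheme.
pub fn is_ws_url(url: &str) -> bool {
    url.starts_with("ws://") || url.starts_with("wss://")
}

/// Suggest the WebSocket equivalent of an HTTP(S) URL; other schemes pass
/// through unchanged.
pub fn suggest_ws_url(url: &str) -> String {
    if let Some(rest) = url.strip_prefix("https://") {
        format!("wss://{rest}")
    } else if let Some(rest) = url.strip_prefix("http://") {
        format!("ws://{rest}")
    } else {
        url.to_string()
    }
}
```

Applying `redact_url` to both the offending URL and the suggestion, as the commit does, means the key never reaches the log even on the error path.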

Operator workflow before this change forced the paid-RPC URL to live in a file (`engine.toml`), which is fine for systemd but awkward for Docker/compose: the URL had to be hand-edited inside a volume-mounted file, secrets and config got tangled, and the internal drpc test key was at risk of slipping into a committed example.

This change makes the engine treat `${VAR_NAME}` tokens inside `engine.toml` as environment-variable references, resolved at config-load time:

    [chains.11155111]
    rpc_url = "${SEPOLIA_RPC_URL}"

The `engine.docker.toml` and `engine.example.toml` templates ship with `${VAR}` placeholders for all five chains, so the committed files stay secret-free regardless of deployment path.

    cp .env.example .env
    $EDITOR .env   # paste real wss:// URLs
    docker compose up -d

`docker compose` reads the repo-root `.env` automatically (already the compose default) and forwards the named variables into the container via the new `environment:` block; the engine substitutes them when parsing `/etc/shepherd/engine.toml`.

- `engine_config.rs::substitute_env_vars`: hand-rolled parser (no regex dep) that walks the raw TOML text, matches `${NAME}` tokens against `[A-Z_][A-Z0-9_]*`, and looks each up via `std::env::var`. Three error variants via `thiserror`:
  * `Missing { name }`: variable referenced but unset; message includes the exact name and a pointer to the `.env` workflow.
  * `InvalidName { name }`: typo (lowercase, leading digit); suggests the upper-cased variant.
  * `Unclosed { offset }`: `${` without matching `}`.
- Called from `load_or_default` before `toml::from_str`, so the substitution layer never sees parsed TOML: a missing env var surfaces with the exact variable name, not a downstream "invalid URI character" several layers deep.
- Substitution runs over the whole file (comments included; harmless).
- `.env.example`: committed template with placeholders for all 5 chain `*_RPC_URL` variables + the optional `SHEPHERD_IMAGE` and `SHEPHERD_ENGINE_CONFIG` overrides.
- `.gitignore`: adds `!.env.example` exception so the template stays trackable while `.env` and `.env.local` etc. stay ignored.
- `docker-compose.yml`: passes the five `*_RPC_URL` env vars through to the container; the engine config bind-mount now defaults to `engine.docker.toml` (the committed template) and honours `SHEPHERD_ENGINE_CONFIG` for operators who prefer a bespoke file.
- `engine.docker.toml` + `engine.example.toml`: every `[chains.*]` entry switched to `${*_RPC_URL}` placeholders. Header comments spell out the workflow.
- `docs/deployment/docker.md`: first-boot section now leads with `cp .env.example .env` (was `cp engine.example.toml engine.toml && edit`). §2 explains the bind-mount + the `SHEPHERD_ENGINE_CONFIG` escape hatch.

Smoke 1 (compose end-to-end):

    $ cp .env.example .env
    $ echo "SEPOLIA_RPC_URL=wss://lb.drpc.live/sepolia/<real-key>" >> .env
    $ echo "SHEPHERD_ENGINE_CONFIG=./engine.local.toml" >> .env
    $ docker compose up -d
    ...
    {"level":"INFO","message":"opening chain RPC provider","chain_id":11155111,"url":"wss://lb.drpc.live/sepolia/<KEY>",...}   ← env-resolved, key redacted
    {"level":"INFO","message":"supervisor up","loaded":2,"alive":2,...}
    {"level":"INFO","message":"block subscription open","chain_id":11155111,...}
    {"level":"INFO","message":"log subscription open","module":"twap-monitor",...}
    {"level":"INFO","message":"log subscription open","module":"ethflow-watcher",...}
    $ docker compose logs | grep -c <real-key>
    0   ← zero leaks
    $ curl -s http://127.0.0.1:9100/metrics | grep latency_seconds_count
    shepherd_event_latency_seconds_count{module="twap-monitor",event_kind="block"} 4

Smoke 2 (missing env var, expected fail-fast):

    $ unset SEPOLIA_RPC_URL
    $ docker compose up
    Error: engine config env-var substitution failed: environment variable `SEPOLIA_RPC_URL` referenced via ${SEPOLIA_RPC_URL} in engine.toml but not set. Export it before launching the engine (e.g. via a `.env` file consumed by `docker compose`).

- 7 new unit tests in `engine_config::tests`:
  * `substitute_replaces_known_variable`
  * `substitute_errors_on_missing_variable`
  * `substitute_errors_on_invalid_name`
  * `substitute_errors_on_unclosed_brace`
  * `substitute_passes_text_with_no_placeholders_through`
  * `substitute_handles_multiple_placeholders_in_one_line`
  * `substitute_preserves_utf8_around_placeholder`
- Workspace: 18 groups, **219 passed, 0 failed** (was 212 → +7)
- `cargo clippy --workspace --all-targets -- -D warnings` clean
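A regex-free `${NAME}` substitution pass of the kind described above could be written roughly as follows. This is a sketch, not the code in `engine_config.rs`: the lookup is injected as a closure so the example can run without touching the process environment (the commit says the real helper calls `std::env::var`), and the error type is a plain enum instead of a `thiserror` derive.

```rust
#[derive(Debug, PartialEq)]
pub enum EnvVarError {
    /// Variable referenced but unset.
    Missing { name: String },
    /// Name does not match `[A-Z_][A-Z0-9_]*`.
    InvalidName { name: String },
    /// `${` without a matching `}`; `offset` is the byte index of the `${`.
    Unclosed { offset: usize },
}

fn is_valid_name(name: &str) -> bool {
    let mut chars = name.chars();
    matches!(chars.next(), Some(c) if c.is_ascii_uppercase() || c == '_')
        && chars.all(|c| c.is_ascii_uppercase() || c.is_ascii_digit() || c == '_')
}

/// Replace every `${NAME}` token in `raw` with `lookup(NAME)`.
pub fn substitute_env_vars<F>(raw: &str, lookup: F) -> Result<String, EnvVarError>
where
    F: Fn(&str) -> Option<String>,
{
    let mut out = String::with_capacity(raw.len());
    let mut rest = raw;
    let mut consumed = 0usize; // bytes of `raw` already processed

    while let Some(start) = rest.find("${") {
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        let end = after
            .find('}')
            .ok_or(EnvVarError::Unclosed { offset: consumed + start })?;
        let name = &after[..end];
        if !is_valid_name(name) {
            return Err(EnvVarError::InvalidName { name: name.to_string() });
        }
        let value = lookup(name)
            .ok_or_else(|| EnvVarError::Missing { name: name.to_string() })?;
        out.push_str(&value);

        let advance = start + 2 + end + 1; // prefix + "${" + name + "}"
        consumed += advance;
        rest = &rest[advance..];
    }
    out.push_str(rest);
    Ok(out)
}
```

In production the closure would be something like `|name| std::env::var(name).ok()`. Running the pass on the raw text before TOML parsing is what lets a missing variable be reported by name rather than as a malformed URL later on.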

VM smoke surfaced a false-negative `(unhealthy)`: the compose healthcheck called `wget` but the runtime image is built on debian:bookworm-slim which doesn't include it (only ca-certificates + tini, intentionally minimal). `wget: not found` → exit 127 → unhealthy mark, despite the engine actually working (21 blocks dispatched in 3 min, p99 latency 0.09ms, zero errors). Swap to bash's `/dev/tcp` builtin (always present in bookworm-slim's `/bin/bash`). Successful TCP open on the metrics port proves the exporter bound, which only happens after the supervisor finishes boot — same semantic, no image growth.

First fix attempt swapped wget for `/dev/tcp` but kept `CMD-SHELL`, which routes through `/bin/sh` (dash on debian:bookworm-slim). dash doesn't have the `/dev/tcp/<host>/<port>` builtin; it's bash-only. Probes failed with "cannot create /dev/tcp/...: Directory nonexistent". Switch to `CMD ["bash", "-c", ...]` so the bash builtin actually resolves. `bash` ships in the slim base; verified via `docker exec shepherd which bash` → `/usr/bin/bash`.

Cherry-pick of PR #62 + PR #63's redesign onto the M5 host runtime (env-var substitution in engine.toml, healthcheck fixes, etc) for the Sepolia soak VM. The PR review continues on the proper layered branches: - PR #62 — M2/BLEU-833 layer observe design - PR #63 — M3/M4 BLEU-855 split + COW-1074 cow_api_request integration This branch is deploy-only: it lets the soak run on the redesigned ethflow-watcher with the latest host runtime while review iterates on the layered PRs. After merge, this branch can be deleted; CI will republish ghcr.io/bleu/nullis-shepherd:latest with the merged design and the VM rolls forward to the official image. See COW-1076 for the full empirical evidence.

…store (COW-1085)

`getTradeableOrderWithSignature` returns the same Ready tuple in every poll-tick during a TWAP child's validity window: the on-chain conditional order has no way to know shepherd already POSTed it. The strategy already wrote a `submitted:{uid}` marker after a successful submit, but the next poll-tick polled the chain and submitted again, producing a wasted orderbook call and a misleading `DuplicatedOrder` Warn line in every soak that runs a TWAP.

Live evidence (2026-06-23 Sepolia soak):

    10:02:36.784 INFO poll watch:0x8fab71c0...:0x93b1626c... -> Ready
    10:02:37.190 INFO submitted submitted:0xd7116bd2...
    10:02:48.870 INFO poll watch:0x8fab71c0...:0x93b1626c... -> Ready
    10:02:49.855 WARN submit dropped watch (400): orderbook error (DuplicatedOrder): order already exists

The first submission succeeded (`GET /api/v1/orders/0xd7116bd2...` returns `status: fulfilled`); the second was wasted work.

The fix: at the top of `submit_ready`, compute the client-side UID deterministically from the on-chain `(order, owner, chain)` tuple via `OrderData::uid` and check `submitted:{uid}` in local-store; skip the submit (and the appData resolve that precedes it) when the marker exists. The marker write site is also updated to use the client-computed UID for the key so the read and write paths agree (in production the server-returned UID is the same value, since both sides derive it from the canonical `digest || owner || valid_to` layout, and a divergence is now surfaced via a Warn).

Tests (24, all green natively + wasm32-wasip2):

* Existing `poll_ready_submits_order_and_persists_submitted_uid` and `poll_ready_resolves_non_empty_app_data_then_submits` updated to compute the expected marker key via `compute_uid_hex` instead of hardcoding `submitted:0xfeedface` (the mock orderbook's stub UID, which now triggers the divergence Warn so we also assert that).
* New `poll_ready_skips_submit_when_submitted_uid_already_in_store`: seeds the marker, dispatches a block tick, asserts `submit_order` (and the preceding appData resolve) are NOT called and that the expected Info log appears.

Out of scope (deferred): the same idempotency pattern could be applied to ethflow-watcher's `observed:{uid}` marker (already correct there: the GET-not-POST design makes this naturally idempotent).
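The idempotency check above rests on the 56-byte UID layout `digest || owner || valid_to`. A minimal sketch of that packing and the marker lookup follows; the signature of `compute_uid_hex` is a guess (the commit only names the helper), `should_submit` is a hypothetical name, the `HashSet` stands in for the local store, and the EIP-712 digest is taken as an input rather than derived here.

```rust
use std::collections::HashSet;

/// Pack a CoW order UID as 32-byte digest || 20-byte owner || 4-byte
/// big-endian `valid_to`, rendered as 0x-prefixed lowercase hex.
pub fn compute_uid_hex(digest: &[u8; 32], owner: &[u8; 20], valid_to: u32) -> String {
    let mut uid = Vec::with_capacity(56);
    uid.extend_from_slice(digest);
    uid.extend_from_slice(owner);
    uid.extend_from_slice(&valid_to.to_be_bytes());

    let mut hex = String::with_capacity(2 + uid.len() * 2);
    hex.push_str("0x");
    for byte in uid {
        hex.push_str(&format!("{byte:02x}"));
    }
    hex
}

/// Skip the submit when a `submitted:{uid}` marker already exists.
/// `store_keys` is a hypothetical stand-in for the module's local store.
pub fn should_submit(store_keys: &HashSet<String>, uid_hex: &str) -> bool {
    !store_keys.contains(&format!("submitted:{uid_hex}"))
}
```

Because the key is derived from on-chain data alone, the check can run before any orderbook or appData call, which is where the saved round-trips come from.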

…econnects (COW-1086)

Adds a positive-recovery Info log when the block subscription resumes after a silence ≥ 60 s, covering the observability gap identified in the 2026-06-23 Sepolia soak.

## Background

The 2026-06-23 soak surfaced this sequence:

    09:05:43 ERROR WS connection error: WebSocket protocol error: Connection reset without closing handshake target=alloy_transport_ws::native
    (no further WS-related lines for ~1 h)
    10:02:24 INFO indexed watch:...       ← twap-monitor activity resumes
    10:05:24 INFO ethflow observed ...    ← ethflow-watcher activity resumes

`docker ps` showed 0 restarts and the container stayed healthy throughout: alloy's transport layer reconnected internally without the engine's `reconnecting_block_task` ever observing `inner.next().await -> None`. So the engine never entered its "stream ended → backoff → subscription reopened" path, and the existing `block subscription reopened` Info log (COW-1071) never fired. The transport-layer ERROR followed by silence is indistinguishable from a hung engine on a soak dashboard.

## What changes

In `reconnecting_block_task`, on every yielded item compare `now.duration_since(last_event)` against `BLOCK_GAP_LOG_THRESHOLD` (60 s, 5× Sepolia block time). When the gap meets or exceeds the threshold, emit:

    INFO chain_id=... gap_s=... kind="block" "stream gap closed - first event after silence (likely an alloy-internal transport reconnect)"

The gap-detection logic is factored into a small synchronous helper `block_stream_gap_to_log(now, last_event, threshold) -> Option<Duration>` so it can be unit-tested without spinning up an async runtime or a real provider.

## Why blocks only (not logs)

Block subscriptions have predictable cadence: Sepolia and mainnet both produce a new block roughly every 12 s. A 60 s silence is therefore anomalous and worth surfacing. Log subscriptions, by contrast, are inherently sparse (driven by on-chain user activity), so the same threshold would fire false positives on quiet windows. The existing `log subscription reopened` log already handles the engine-detectable reconnect for log streams.

## Tests

4 new unit tests on the gap-detection helper:

* `block_stream_gap_to_log_returns_none_when_no_prior_event`
* `block_stream_gap_to_log_returns_none_when_under_threshold`
* `block_stream_gap_to_log_returns_some_at_threshold_boundary`
* `block_stream_gap_to_log_returns_some_when_well_over_threshold`

All 90 nexum-engine tests pass (86 existing + 4 new). Clippy strict clean, fmt clean. Wasm build untouched.

## Out of scope

* End-to-end test of `reconnecting_block_task` against a mock provider: no existing scaffolding for that path, and the gap helper covers the decision logic deterministically.
* Suppressing or downgrading the `alloy_transport_ws::native` ERROR itself: it is a legitimate transport-layer event, just one whose recovery wasn't previously observable. The new Info line closes that loop without losing the original signal.

## Live validation

The next time alloy auto-reconnects internally on the soak VM, the new line will surface as a structured JSON event with `gap_s=<seconds>` so the soak dashboard can correlate it with the preceding transport ERROR.
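The gap helper is simple enough to sketch directly from its stated signature and test names. The parameter types are assumptions (in particular, `last_event` being an `Option<Instant>` is inferred from the `returns_none_when_no_prior_event` test), so treat this as an illustration rather than the code in `runtime/event_loop.rs`.

```rust
use std::time::{Duration, Instant};

/// 60 s, i.e. five ~12 s block intervals.
pub const BLOCK_GAP_LOG_THRESHOLD: Duration = Duration::from_secs(60);

/// Return the silence duration when it meets or exceeds `threshold`, or
/// `None` when there is no prior event or the gap is too short to log.
pub fn block_stream_gap_to_log(
    now: Instant,
    last_event: Option<Instant>,
    threshold: Duration,
) -> Option<Duration> {
    // `saturating_duration_since` yields zero instead of panicking if the
    // clock readings are out of order.
    let gap = now.saturating_duration_since(last_event?);
    (gap >= threshold).then_some(gap)
}
```

Keeping the function pure (time passed in, no logging inside) is what makes the four boundary tests possible without an async runtime; the caller would log `gap.as_secs()` as `gap_s` when it gets `Some`.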

… fixes) (#67) Squash of PR #67 - cherry-picks M4 compliance + applies M5-specific 1 blocker + 12 majors (engine_config typed errors, 40-em-dash sweep). Verified: cargo fmt + clippy + test all green.

) Squash of PR #68 - 9 markdown files reconciled (5 vapor items rephrased as future direction + capability-gating diagrams aligned to link-time enforcement). Verified: cargo doc --workspace --no-deps clean.

`observe_placement` matches on the `Result<String, HostError>` returned by `host.cow_api_request(...)`. The M5 conflict resolution (COW-1082 ErrorResp data forwarding) accidentally pasted a wildcard arm that belongs to `apply_submit_retry`'s `match classify_api_error(...)` (which matches a `RetryAction` enum). On a `Result`, the wildcard is both unreachable (`Ok(_)` and `Err(_)` already cover everything) and references an `err` binding that doesn't exist in its scope:

    error[E0425]: cannot find value `err` in this scope
      --> modules/ethflow-watcher/src/strategy.rs:205:21
      --> modules/ethflow-watcher/src/strategy.rs:205:31

The `Err(err) if err.code == 404` arm + bare `Err(err)` arm already classify every error case. Drop the spurious `_ =>` arm; bring `observe_placement` back to fmt/clippy/test green on dev/m5-base.
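To make the exhaustiveness point concrete, here is a small sketch of the repaired match shape. The `HostError` struct is reduced to two fields, and the `Observation` enum and `classify_observation` function are hypothetical names invented for the example; only the three-arm structure mirrors what the commit describes.

```rust
/// Reduced stand-in for the host error type.
#[derive(Debug)]
pub struct HostError {
    pub code: i32,
    pub message: String,
}

/// Hypothetical result of looking up an order in the orderbook.
#[derive(Debug, PartialEq)]
pub enum Observation {
    Found(String),
    NotYetIndexed,
    Failed(String),
}

pub fn classify_observation(response: Result<String, HostError>) -> Observation {
    match response {
        Ok(body) => Observation::Found(body),
        // Guarded arm: only 404 is treated as "not there yet".
        Err(err) if err.code == 404 => Observation::NotYetIndexed,
        // Bare arm: every remaining error. Together with `Ok` this is
        // already exhaustive, so a trailing `_ =>` arm would be unreachable
        // and would have no `err` binding in scope.
        Err(err) => Observation::Failed(err.message),
    }
}
```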

…terError

Audit reference: milestone-rubric-grant-audit-2026-06-25.md, Major #1 (remaining enums introduced on the M5 multi-chain pass).

- `EnvVarError` (engine_config.rs): introduced with the COW-1071 env-var substitution path. Snake_case variant labels feed the boot-time `tracing::error!(error_kind = ...)` call sites in `main.rs`.
- `FilterError` (supervisor.rs): introduced with the M5 multi-chain log-filter parsing. Snake_case variant labels feed the `tracing::warn!(error_kind = ...)` log emitted when a `[[subscription]]` address or topic fails to parse.

The audit's M3 / M4 derives landed on the milestones that introduced the enums; these two complete the workspace-wide IntoStaticStr pass flagged in audit Major #1 on the milestones that own them.

Audit reference: milestone-rubric-grant-audit-2026-06-25.md, Major #6.

The rubric forbids em-dashes in "code, rustdoc, commit messages, PR bodies, or review comments". `.toml` is technically a grey zone but these comments surface verbatim when operators `cat engine.docker.toml` or `engine.example.toml` during deployment onboarding. Mechanical find/replace to ` - ` (ASCII hyphen with spaces).

Files touched:

- engine.example.toml: 2 em-dashes (lines 20, 38)
- engine.docker.toml: 4 em-dashes (lines 4, 5, 6, 31)

…rse (audit JC5) shepherd-backtest's offline replay harness carried its own `AddressParseError` enum (hex-decode + length check). The shape overlaps directly with the `AddressParse` typed error introduced into `shepherd-sdk` by the audit JC5 pass. Extend `shepherd_sdk::address` with a single-address `parse_address` helper alongside the existing `parse_address_list` (the `InvalidAddress` variant covers both call sites via the `index` field). Replay's `fixtures::parse_address` becomes a thin wrapper that calls the SDK and converts the `Address` to the `[u8; 20]` shape the strategy consumes via `LogView::address`. Drops the now-unused `thiserror` dependency from shepherd-backtest; `hex` stays for topic/data decoding.

This was referenced Jun 25, 2026

M2 epic: TWAP + EthFlow modules + module.toml manifests #17

Closed

M3 epic: SDK + examples + tutorial + QA validation #18

Closed

[review-mirror] M5 epic (upstream nullislabs/shepherd#21) bleu/nullis-shepherd#80

Draft

brunota20 force-pushed the dev/m5-base branch from e553feb to 3d020f8 Compare June 25, 2026 19:15

brunota20 force-pushed the dev/m5-base branch from 3d020f8 to 537b56e Compare June 25, 2026 21:17

brunota20 force-pushed the dev/m5-base branch from 537b56e to e18dd3e Compare June 25, 2026 22:39

brunota20 added 21 commits June 25, 2026 20:10

runtime: multi-module supervisor + block/log event loop

be7a3b1

feat(supervisor): apply ADR-0001/0003/0005/0016 and trap-based module…

9ebbeea

… death (BLEU-813-817)

feat(supervisor): add fuel + memory limits per module store (BLEU-818)

473c95f

docs: rename nexum.toml -> module.toml in example, justfile, and READ…

ad3d798

…ME (BLEU-820)

test: fill host backend test gaps — manifest parsing, cow-api, provid…

62c5811

…er-pool, supervisor (BLEU-821)

test: E2E supervisor tests + fix wit_import_to_cap to skip type-only …

881965d

…interfaces (BLEU-819)

style: apply rust-idiomatic rules (em-dashes, #[from] Orderbook, unus…

7d1c0b6

…ed_crate_dependencies, drop redundant map_err)

chore(deps): patch cowprotocol to bleu/cow-rs main (post-alpha.3)

8c848dd

docs(adr): add 0001-0007 capturing engine and CoW architecture decisions

edbafca

docs(adr): unwrap hard-wrapped paragraphs to single line each

c21378e

docs(adr): revise CoW design and reorder ADRs (0001-0008)

e5010c4

fix(docs): reviewed ADRs by bleu

821db88

brunota20 and others added 29 commits June 29, 2026 20:08

chore(rust-idiomatic): M4 compliance pass (blockers + majors) (#66)

7213052

Squash of PR #66 - applies 5 blockers + 8 majors from M4 audit.

chore(rust-idiomatic): M5 compliance pass (cherry-pick M4 + M5 deploy…

65af476

… fixes) (#67) Squash of PR #67 - cherry-picks M4 compliance + applies M5-specific 1 blocker + 12 majors (engine_config typed errors, 40-em-dash sweep). Verified: cargo fmt + clippy + test all green.

chore(docs): reconcile vapor + capability-gating drift across M2-M5 (#68

30934ab

) Squash of PR #68 - 9 markdown files reconciled (5 vapor items rephrased as future direction + capability-gating diagrams aligned to link-time enforcement). Verified: cargo doc --workspace --no-deps clean.

jean-neiverth force-pushed the dev/m5-base branch from 9ad747e to 8cb1b43 Compare June 29, 2026 23:10

M5 epic: multi-chain deploy + packaging + docs reconciliation #21
brunota20 wants to merge 129 commits into nullislabs:main from bleu:dev/m5-base


2 participants
