Skip to content

M5 epic: multi-chain deploy + packaging + docs reconciliation#21

Draft
brunota20 wants to merge 129 commits into
nullislabs:mainfrom
bleu:dev/m5-base
Draft

M5 epic: multi-chain deploy + packaging + docs reconciliation#21
brunota20 wants to merge 129 commits into
nullislabs:mainfrom
bleu:dev/m5-base

Conversation

@brunota20

Copy link
Copy Markdown

M5 epic — multi-chain deploy, packaging, and docs reconciliation

Builds on #20 (M4 epic). M5 closes out the grant: packages the M4 daemon for operators (Docker + ghcr CI), adds the pre-soak backtest harness + baseline-latency tooling, lands the small protocol-side hardening items surfaced during M4 soak runs, and reconciles the docs across M2-M5 so the on-disk story matches the shipped behaviour.

Core deliverable

Area What landed
Docker + compose + ghcr CI Dockerfile (multi-stage rust build), docker-compose.yml for the daemon + scrape stack, GHCR push on tag (PR #61 in the fork).
Pre-soak backtest harness shepherd-backtest crate + tools/backtest-collect/ replay a 7-day Sepolia EthFlow window before soak runs, giving us an offline regression bar for module behaviour (COW-1078).
Baseline-latency tool tools/baseline-latency/ measures TWAP-relayer PUT and EthFlow indexer ingest latencies across 5 chains so soak reports can attribute regressions to the right lane (COW-1084).
Chain-forward revert data engine_config.rs + cow_api.rs forward eth_call ErrorResp.data into HostError.data so module code can decode IConditionalOrder reverts (PollTryAtEpoch, PollNever) the same way it decodes orderbook errors (COW-1082).
Backoff retry cap ethflow-watcher caps backoff:{uid} retries at MAX_BACKOFF_RETRIES so a permanently-failing orderbook submission stops eating fuel (COW-1083).
TWAP skip submit_order on submitted UID twap-monitor consults submitted:{uid} before re-submitting, preventing duplicate orderbook posts on supervisor restart (COW-1085).
Event-loop log block-stream gap closures runtime/event_loop.rs logs the WS-reconnect gap (block range covered by eth_getLogs re-indexing) at Info, giving operators a visible signal during reconnects (COW-1086).
Env-var substitution + fail-fast HTTP rpc_url + RPC key redaction engine_config.rs resolves ${VAR} placeholders in engine.toml; HTTP rpc_url configs fail-fast at boot (only WSS is accepted for chain subscriptions); API keys in RPC URLs are redacted from boot logs (multiple commits).
Rust-idiomatic compliance pass chore/rust-idiomatic-pass sweep applied to the M4 + M5 surface area (PR #67 in the fork).
Docs reconciliation across M2-M5 docs/ updated end-to-end: ADR statuses, deployment guide, e2e runbook, operations guides re-flowed to match the shipped behaviour (PR #68 in the fork).

Validation

  • cargo fmt --all -- --check clean.
  • cargo clippy --workspace --all-targets -- -D warnings clean.
  • cargo test --workspace --all-features — full suite green; backtest harness has its own integration tests; baseline-latency tool covered by unit tests.
  • Docker image builds cleanly via docker build .; compose stack boots end-to-end against a local Anvil + RPC + orderbook-mock.
  • Soak validation: 7-day Sepolia run against the M5 image with metrics scraped to Prometheus; report under docs/operations/.

Note on diff scope

Builds on M2 (#17) + M3 (#18) + M4 (#20). Each upstream PR is independent against nullislabs:main so you can merge in any order — the natural review/merge order is M2 → M3 → M4 → M5, but the dependency is logical (build-on-top) rather than git-mechanical.

To focus the M5 review, the M5-specific paths are:

  • Dockerfile, docker-compose.yml, .github/workflows/release.yml
  • crates/shepherd-backtest/
  • tools/backtest-collect/, tools/baseline-latency/
  • crates/nexum-engine/src/engine_config.rs (env-var substitution, fail-fast HTTP, key redaction)
  • crates/nexum-engine/src/host/impls/cow_api.rs + crates/shepherd-sdk/src/cow/error.rs (forward eth_call ErrorResp.data into HostError.data)
  • modules/ethflow-watcher/src/strategy.rs (backoff cap)
  • modules/twap-monitor/src/strategy.rs (skip-submitted-uid)
  • crates/nexum-engine/src/runtime/event_loop.rs (WS reconnect gap log)
  • docs/ reconciliation sweep across the M2-M5 surface

Closes COW-1078, COW-1082, COW-1083, COW-1084, COW-1085, COW-1086.

Linear milestone: M5 - multi-chain deploy + docs. Companions: #17 (M2), #18 (M3), #20 (M4).

@brunota20

Copy link
Copy Markdown
Author

Heads-up: `bleu:dev/m5-base` (the head of this PR) was force-pushed today as part of a linearisation pass on our M2->M5 base stack. Old head was `e553feb3`; new head is `3d020f8`.

The branch is now a strict descendant of the rebased `dev/m4-base` (head of upstream PR #20). The pre-rebase M5 branch was assembled via cherry-picks of M4 commits onto an older baseline, which left it as a sibling rather than a descendant of M4; the linearised stack removes that duplication and stacks M5-genuine commits directly on top of M4's last commit.

Notable mechanics:

  • 9 of the 24 commits previously on M5 were patch-id duplicates of commits already on the new M4 (COW-1076, COW-1077, hex-via-alloy, the COW-1079 load-test series, COW-1080). Git correctly dropped them during the rebase (15 commits remain on `dev/m4-base..dev/m5-base`).
  • COW-1078 (backtest replay harness) needed `#[cfg(target_arch = "wasm32")]` gating on the ethflow-watcher module's wit-bindgen glue so the strategy can be reused from native. With the M3 macro abstraction now on the M4 base, the gate moved to wrap the `bind_host_via_wit_bindgen!()` invocation rather than each hand-written impl — same effect, cleaner.
  • COW-1082 (forward eth_call `ErrorResp.data` into `HostError.data`): the `ProviderError::Rpc` variant now carries `source` + `code` + `data` together; the M5 compliance pass refined the `#[error(...)]` format string accordingly.

Diff against the prior M5 head is significant on the modules (lib.rs files move to macro-based glue) but every per-commit intent is preserved. No content lost. Per-commit history preserved. Author identities preserved.

@brunota20

Copy link
Copy Markdown
Author

Fix-pass on the linearised stack: rebased dev/m5-base onto the new dev/m4-base and added one more compile-fix commit. Now green across all 4 gates.

  • New tip: 537b56e (was 3d020f8)
  • Rebase: 15 commits replayed onto 2eed4fe with zero conflicts (the chainlink + app_data + wit_bindgen fmt fixes propagate from the M4 fix commit).
  • Added commit fix(ethflow-watcher): drop bogus wildcard arm from observe_placement:
    • modules/ethflow-watcher/src/strategy.rs:205 had E0425 cannot find value 'err' in this scope. Root cause: the M5 COW-1082 conflict resolution accidentally pasted a _ => wildcard arm (with RetryAction-shaped body) into a match host.cow_api_request(...) -> Result<String, HostError>. Ok(_) + Err(err) if err.code == 404 + Err(err) already cover the match exhaustively; dropped the spurious arm.
  • 4 gates verified green at the new tip on a fresh detached worktree: 90/90 nexum-engine tests, plus modules + sdk + sdk-test + backtest + load-gen + orderbook-mock all pass.

Balance-tracker compliance audit: the M3 BLEU-851 port (a97e6d8) absorbed the M4 compliance edits for modules/examples/balance-tracker/src/lib.rs; neither the M4 nor the M5 compliance squashes touched the file again (git show 097fe3c --stat and git show 20e5df6 --stat are empty for that path). No deltas were lost.

@brunota20

Copy link
Copy Markdown
Author

Audit-driven fix pass landed on dev/m5-base.

Before: 537b56e
After: e18dd3e

Fixes applied (2 commits, on top of the M2+M3+M4 rebase chain):

Audit reference: bruno-brain/wiki/projects/shepherd-audits/milestone-rubric-grant-audit-2026-06-25.md
Gates green: fmt, clippy -D warnings, cargo test --workspace --all-features, RUSTDOCFLAGS=-D warnings cargo doc.

brunota20 added 21 commits June 25, 2026 20:10
Adds a `[workspace.dependencies]` table to the root manifest
consolidating every dep used by 2+ crates across the full nullis-
shepherd stack (anyhow, thiserror, tokio, futures, serde, serde_json,
tracing, tracing-subscriber, strum, alloy-*, cowprotocol, reqwest,
wit-bindgen, clap). Per-crate manifests inherit with `dep.workspace
= true`, and may add features per call site via `dep = { workspace
= true, features = ["extra"] }`. Single-consumer deps (wasmtime,
toml, redb, getrandom, url, hex, axum, rand, ...) stay per-crate.

Adds `[workspace.lints]` with light-touch defaults: `dbg_macro` and
`todo` denied via clippy, `unsafe_op_in_unsafe_fn` warned via rust.
`unsafe_code = deny` cannot be applied workspace-wide because every
wit-bindgen guest module emits an `unsafe extern "C"` shim.

Also pre-declares `auto_impl` and `derive_more` in the workspace deps
table so future `Arc<dyn Trait>` boundaries and newtype-heavy crates
can opt in without touching the root manifest.

The version-drift failure mode (cowprotocol pinned to `1.0.0-alpha`
in nexum-engine but `1.0.0-alpha.3` in shepherd-sdk, flagged in the
2026-06-25 audit) is now impossible by construction: every consumer
inherits the single workspace pin.

Audit reference: milestone-rubric-grant-audit-2026-06-25.md, judgment
calls 1 + 3.
Replaces the `std::env::args().skip(1)` walker with a `#[derive(clap::
Parser)]` struct so the engine binary picks up `--help`, `--version`,
proper argument validation, and structured error reporting for free.

The positional surface is preserved one-for-one (`<wasm-path>
[manifest-path]`); behaviour for callers that already pass two paths
is identical. Help output now documents each argument inline rather
than hiding the usage in an anyhow message that only fires on misuse.

`clap.workspace = true` consumes the workspace dep added in the
prior commit; no new direct version pin in this crate.

Audit reference: milestone-rubric-grant-audit-2026-06-25.md, judgment
call 2.
…irection

A casual reader of `07-rpc-namespace-design.md` hitting the file
top or the "Method Allowlisting" subsection could plausibly walk
away believing the 0.2 runtime gates RPC methods on a read-only
allowlist and intercepts signing methods to delegate them to the
identity backend. The shipped host implementation does neither:
`chain::request` forwards any method string through to the
configured alloy provider.

Adds an explicit `Status: Future direction (0.3+ target)` callout
both at the file top and right above the "Method Allowlisting"
subsection so the gap between design intent and shipped behaviour
is visible without having to scroll the design narrative end-to-end.

Audit reference: milestone-rubric-grant-audit-2026-06-25.md, judgment
call 4.
Adds the dependencies the 0.2 host backends need:

- cowprotocol (1.0.0-alpha) for the cow-api submission path
  (OrderBookApi, OrderCreation, OrderUid, Chain).
- alloy-provider / -rpc-client / -transport-ws / -primitives (1.5)
  for the chain JSON-RPC dispatch. The reqwest feature on
  alloy-provider engages connect_http; the pubsub/ws features back
  eth_subscribe-class methods.
- redb (2) for local-store. Same crate cowprotocol's own watch-tower
  picked, so the dep tree does not bifurcate when both are used in
  the same workspace.
- reqwest (0.12, rustls-tls) — direct, so the import survives any
  future cowprotocol feature rearrangement.
- tracing + tracing-subscriber (env-filter + fmt) — replaces the 0.1
  eprintln! debug log so the engine can drop into a structured log
  pipeline without re-instrumenting every host call.
- thiserror (2) — typed error enums in each backend.
- tempfile + wiremock as dev-deps for the host backend tests.

Adds engine.example.toml documenting the [engine] state_dir + per-
chain RPC URLs the chain backend reads at boot; data/ is now
ignored so a local run does not leave the redb file in tree.
Replaces the 0.2 Unsupported stubs with working backends. Each
capability lives in its own host submodule so the trait impls in
main.rs stay thin (dispatch + project the backend's typed error
onto HostError).

cow_api::submit_order
  - Parses the guest's bytes as JSON cowprotocol::OrderCreation.
  - Dispatches via cowprotocol::OrderBookApi::post_order.
  - Returns the assigned OrderUid as a 0x-prefixed hex string.

cow_api::request
  - REST passthrough. The base URL is whichever URL the pool's
    OrderBookApi client carries — so OrderBookApi::new_with_base_url
    overrides (staging, wiremock) flow through transparently.
  - Method/path validated host-side; orderbook 4xx/5xx bodies are
    surfaced verbatim so the guest can decode {errorType,description}.

chain::request
  - Raw JSON-RPC dispatch over an alloy DynProvider opened from
    engine.toml at boot. WebSocket URLs engage pubsub (eth_subscribe);
    HTTP URLs use the HTTP transport. Params are passed as
    serde_json::RawValue so alloy does not re-encode.
  - request-batch falls back to per-call dispatch (same shape as the
    earlier stub but now backed by real RPC).

local_store
  - redb file under engine_config.engine.state_dir.
  - Single shared table. Per-module namespacing is enforced
    host-side via [len:u8][module_name][raw_key] prefix on every
    key. list_keys strips the prefix before returning to the guest.

logging
  - Routes through tracing::event! tagged with module=<namespace>.
  - Engine boot installs an EnvFilter-based subscriber; RUST_LOG
    overrides the engine.toml log_level.

identity / remote-store / messaging / http stay at Unsupported per
the 0.2 roadmap (keystore / Swarm / Waku land in 0.3).

Tests (14, all green):
  - cow_orderbook: pool default chains, unknown-chain typing, REST
    GET passthrough, relative-path resolution, unknown-method
    rejection, submit_order round-trip — last three under wiremock
    so the full HTTP path is exercised without hitting api.cow.fi.
  - provider_pool: empty pool surfaces UnknownChain.
  - local_store: roundtrip, namespace isolation, delete, list_keys
    prefix-stripping, empty-namespace rejection.

End-to-end against modules/example: example.wasm loads under the
new wiring, logs init + on_event through the tracing pipeline.
…ed_crate_dependencies, drop redundant map_err)
PR #9 specific:
- main: warn + return when block/log streams end (WebSocket dropped)
- supervisor: simplify dispatch_block by extracting chain_id before move
- supervisor: temp_local_store returns (TempDir, LocalStore) instead of leaking
- README: correct engine.toml chain syntax to [chains.<id>] with rpc_url

Rebased from PR #8:
- local_store_redb: table.range() instead of iter() for O(matching) keys
- provider_pool: dedupe method clone on the success path
- main: hex_encode writes into the pre-allocated buffer
- cow_orderbook: drop blank line nit
- manifest: collapse nested if and use ? operator (clippy)
- alloy_rpc_client / alloy_transport(_ws) imports as _ to satisfy
  unused_crate_dependencies.
Move the manifest.rs monolith into a directory module with four
focused submodules (types, load, capabilities, error). Includes the
Subscription enum and the four PR #9 tests for subscription parsing.

Behaviour unchanged - pure code motion.
main.rs went from 739 lines of mixed bootstrap + 8 Host trait impls +
CLI parser + event loop to ~125 lines of pure orchestration. New
layout:

- bindings.rs: wasmtime::component::bindgen!() moved out so other
  modules can name the generated types.
- cli.rs: Cli struct + manual parser.
- host/state.rs: HostState + WasiView impl.
- host/error.rs: unimplemented / internal_error / hex_encode helpers.
- host/impls/{chain,cow_api,identity,local_store,remote_store,messaging,
  logging,clock,random,http,types}.rs: one Host trait impl per file.
- runtime/limits.rs: DEFAULT_FUEL_PER_EVENT + DEFAULT_MEMORY_LIMIT.
- runtime/event_loop.rs: open_block_streams, open_log_streams, run,
  wait_for_shutdown_signal, TaggedBlockStream, TaggedLogStream.

Adding a new capability is now a single new file under host/impls/
rather than a 60-80 line diff in main.rs.
local_store_redb.rs was 89% tests, cow_orderbook.rs was 60%, and
supervisor.rs was 32% (205 lines absolute). Promote each to a directory
module with the test suite living in a sibling tests.rs so impl-side
diffs stop competing with test churn for attention.
brunota20 and others added 29 commits June 29, 2026 20:08
…tion (COW-1079)

First COW-1079 run on a real Anvil fork of Sepolia. The engine-side
acceptance bar is cleared with wide margin:

- Per-block dispatch latency p50/p95/p99 = 4/6/7 ms (bar was < 2 s).
- Zero traps, zero poisoned modules, zero shepherd_module_errors_total.
- EthFlow strategy submitted 1 OrderPlacement end-to-end through the
  mock orderbook in 10 ms; submitted:{uid} marker written cleanly.
- 63 Anvil blocks dispatched flawlessly.

The honest finding: load-gen's transactions get into Anvil's mempool
(twap_ok=270, ethflow_ok=270 per the eth_sendTransaction response),
but only 5 ConditionalOrderCreated + 1 OrderPlacement events
actually fired - the rest reverted at the contract level
(ComposableCoW.create + EthFlow.createOrder run preconditions the
load-gen-crafted bodies don't pass).

So this run stressed the engine with ~6 events over 60 s, not
5+5 per block. The bar criterion that depends on the load-gen
(events-per-block delivered) is the only one that doesn't pass;
filing a follow-up to calibrate the revert rate before re-running.

Report at docs/operations/load-reports/load-5x5-2026-06-19.md
mirrors the COW-1064 e2e-report shape and signs off as
"conditional pass" - engine meets the bar; load-gen needs work.
scripts/lib.sh exports REPORTS_DIR=e2e-reports/ unconditionally.
load-run.sh used to set REPORTS_DIR=load-reports/ BEFORE sourcing
load-bootstrap.sh (which transitively sources lib.sh), so the
override was lost and the auto-generated skeleton ended up under
e2e-reports/ next to the COW-1064 reports.

Move the assignment after the source so the load-reports/ path
wins, with a comment explaining the ordering trap.

Drive-by: removed the misplaced e2e-reports/load-5x5-2026-06-19.md
from the first run; the committed report at
load-reports/load-5x5-2026-06-19.md (commit 59fe714) is the
canonical copy.
COW-1079 baseline's 5/270 + 1/270 revert rate had two distinct
root causes, both contract-side, neither shepherd's fault:

1. **Nonce race in burst submissions.** Anvil's `eth_sendTransaction`
   against an impersonated account auto-assigns a nonce when none
   is provided, but the assignment racts with the caller's burst
   submission. When load-gen fired 5 TWAP + 5 EthFlow per block
   without waiting for individual receipts, most txs landed in the
   mempool sharing the same nonce, and Anvil's miner included only
   one per block - the rest reverted as nonce-too-low.
   Fix: read the EOA's current nonce at boot, increment locally per
   successful submission, pin `tx.nonce` explicitly on every
   `TransactionRequest`. Lock-step with cargo build cache so the
   nonce counter never crosses async-boundary corruption.

2. **EthFlow OrderUid dedup on identical GPv2 OrderData.** The
   CoWSwapEthFlow contract dedups by the GPv2 `OrderUid` which is
   keccak over (buyToken, receiver, sellAmount, buyAmount, appData,
   feeAmount, validTo, partiallyFillable, kind, sellTokenSource,
   buyTokenDestination). quoteId is NOT part of that hash. The
   prior load-gen varied only `quoteId` per call, so all 270 EthFlow
   submissions produced the same UID and the contract rejected
   269/270 as `OrderIsAlreadyOwned`.
   Fix: vary `sellAmount` by 1 wei per call (`BASE_SELL_AMOUNT + seq`)
   and pass that same value as `msg.value` so the contract's
   `msg.value == order.sellAmount` invariant holds.

Re-ran baseline 5x5 after both fixes: 130/130 TWAP + 130/130
EthFlow delivered, 130 ConditionalOrderCreated + 130 OrderPlacement
events on-chain, 130 cow_api submits OK to mock, 130 ethflow markers
written, zero shepherd_module_errors_total. Updated baseline report
at docs/operations/load-reports/load-5x5-2026-06-19.md from
'conditional pass' to 'full PASS' with the post-calibration
numbers (TWAP block p99 = 49 ms, EthFlow log p99 = 11 ms, 40x margin
on the < 2 s bar).

Medium 20x20 and saturation 50x50 are now unblocked per the
COW-1079 acceptance roadmap.
…(COW-1079)

Closes the COW-1079 three-scenario sweep with the COW-1080 calibration
in place. All three scenarios pass:

  baseline 5x5  - 130/130 each, TWAP block p99=49ms
  medium 20x20  - 280/280 each, TWAP block p99=67ms
  saturation 50x50 - 300/300 each, TWAP block p99=78ms

Latency growth across the watch-count range (130 -> 280 -> 300) is
sub-linear: 49 -> 67 -> 78 ms. The lgahdl PR #9 concern about
sequential per-module dispatch saturating under load is NOT surfaced
at this scale.

Zero shepherd_module_errors_total, zero traps, zero EthFlow submit
errors across all three runs.

The unexpected finding from saturation: the engine did not saturate.
The bottleneck is load-gen's sequential eth_sendTransaction
submission (each tx ~200 ms RTT, so 100 tx/iteration = ~20 s, vs.
Anvil's 1 s block time). To genuinely saturate the engine we would
need parallel load-gens against different impersonated EOAs, a
sub-second block-time, or thousands of pre-seeded watches.

EthFlow log p99 stayed flat at ~9 ms across all three scenarios
(it is dominated by the cow-api submit roundtrip, not engine state),
confirming the submit path scales independently of the watch count.

The cold-start outlier (~500 ms on the first watch-heavy block)
appears consistently across runs and is independent of the steady-
state watch count - it is a one-shot first-block redb/eth_call
warmup cost, NOT a saturation symptom.

What this proves:
  - Shepherd M4 supervisor handles >= 300 concurrent watches +
    >= 138 block dispatch cycles in 2 min with p99 < 80 ms.
  - cow-api submit path is steady at ~9 ms p99 regardless of watch
    count.
  - Zero error/trap/poison across all three scenarios.

What it does NOT prove (and is not in scope here):
  - Behaviour at 3000+ watches.
  - WS reconnect resilience (COW-1031 soak).
  - Multi-day memory drift (COW-1031).
  - Real-orderbook 4xx variety (COW-1078 backtest).

COW-1079 ready to move to In Review.
…079)

The single-EOA saturation 50x50 report identified the per-EOA nonce
serialisation as the bottleneck before the engine had a chance to
saturate. This commit removes that bottleneck:

load-gen:
- New --parallel N flag. Each worker impersonates a synthetic EOA
  (0x57...01..0a), gets its own WS connection + nonce stream, runs
  its own per-block submission loop. Total events per block scales
  linearly with N.
- Disjoint salt space per worker via 96-bit prefix.
- Disjoint EthFlow sellAmount space via a 10_000-wide per-worker
  window (the first attempt shifted by 96 bits, blowing past the
  1M ETH funded balance with 7.9e28 wei sellAmounts; fixed).

scripts/load-bootstrap.sh + scripts/load-run.sh:
- Accept --block-time (passes to anvil) and --parallel (passes to
  load-gen). Defaults preserve historic behaviour: --block-time 1,
  --parallel 1.
- Auto-report filename now includes scenario label
  (load-NxM-SCENARIO-date.md) so saturation-parallel does not
  overwrite the baseline 5x5 report.

Saturation-parallel run (10 workers x 5 TWAP + 5 EthFlow per block,
--block-time 0.5, 2 min):
- load-gen: 895/895 TWAP + 895/895 EthFlow acks, 0 errors.
- engine saw 381 ConditionalOrderCreated + 343 OrderPlacement events
  (43% / 38% delivery vs load-gen acks - Anvil + WS dropping under
  the heavier load).
- shepherd_module_errors_total = 0, zero traps.
- All 343 EthFlow submissions reached the mock orderbook 1:1.
- TWAP block dispatch: histogram p50/p99 = 145 ms, max = 101 593 ms
  (101 s outlier on one block when 380+ watches polled against a
  stressed Anvil JSON-RPC).
- Engine-log dispatch_block: n=586, p50=4ms, p95=46ms, p99=74ms,
  max=101 593 ms - same outlier.

Saturation knee identified: 380+ active watches + 0.5s block-time +
10 concurrent WS subscribers produces a 101-second worst-case
dispatch + 38-43% event delivery loss. Both symptoms point at the
surrounding system (Anvil + WS transport), not at shepherd; engine
continues to scale sub-linearly with watch count and never produces
a module error, trap, or panic under any tested configuration.

For the 7-day COW-1031 soak: this implies the operator should use
a paid Sepolia archive endpoint (Alchemy / drpc / QuickNode), not
publicnode, OR accept event drops and rely on supervisor reconnect
+ eth_getLogs re-indexing. Documented in the new report.

Report at docs/operations/load-reports/load-50x50-parallel-2026-06-19.md.
Squash of PR #66 - applies 5 blockers + 8 majors from M4 audit.
…a doc link

Rebase fallout from the M4 compliance pass:

- `chain/chainlink.rs` defines `StubHost<Result<String, HostError>>` and
  manually implements every `*Host` trait. When the M4 conflict resolution
  added the `cow_api_request` forwarder into the macro's `CowApiHost`
  impl, this local StubHost was missed, producing `E0046: not all trait
  items implemented`. Add a parallel `unreachable!("not used in this
  test")` body; the test never exercises the cow-api surface.

- `cow/app_data.rs`'s module-level doc referred to `EMPTY_APP_DATA_JSON`
  as an unqualified intra-doc link, but the symbol is only used as
  `cowprotocol::EMPTY_APP_DATA_JSON` inside the function body (no `use`
  at module scope). `RUSTDOCFLAGS=-D warnings` rejects the unresolved
  link. Qualify the path so it resolves while keeping the prose intent.

- `wit_bindgen_macro.rs` fmt drift: cargo fmt collapses the
  `shepherd::cow::cow_api::request(...).map_err(convert_err)` chain to
  a single line. Apply the canonical format.

Brings dev/m4-base back to fmt/clippy/test/doc green.
…face

Audit reference: milestone-rubric-grant-audit-2026-06-25.md, Major #3
(`[u8; 32]` for protocol hash across SDK public boundary).

The rubric explicitly calls out: "Newtypes for protocol IDs (no raw
`[u8; 32]` across module boundaries)." `B256` is already in
`shepherd_sdk::prelude` so the swap costs callers nothing - both
twap-monitor and ethflow-watcher were holding the appData as `B256`
already and reaching through `.0` to satisfy the prior signature.

Changes:
- `resolve_app_data(host, chain_id, &B256)` (was `&[u8; 32]`)
- `encode_hex(&B256)` internal helper
- Doctest + 5 unit tests rewritten against `B256::from(bytes)` and
  `B256::from_slice(EMPTY_APP_DATA_HASH.as_slice())`. Coverage stays
  identical.
- Call sites in twap-monitor and ethflow-watcher drop the `.0`
  reach-through; pass `&order.appData` directly.

No public surface beyond `shepherd-sdk` consumes this function;
external module crates in the workspace are the only consumers and
both land in the same commit.
Audit reference: milestone-rubric-grant-audit-2026-06-25.md,
duplication finding "Canonical CoW chain set
[Mainnet, Gnosis, Sepolia, ArbitrumOne, Base]" duplicated at
`crates/nexum-engine/src/host/cow_orderbook.rs:39-43` and `:66-70`.

`from_config` was added in the M4 multi-chain pass and reproduced the
same 5-element array `Default::default` already used. Adding a sixth
chain previously needed touching both arrays in lock-step; pull the
list into a single `const DEFAULT_CHAINS: &[Chain]` so the
single-source-of-truth property is structural.

Also drops the redundant `use cowprotocol::OrderBookApi;` inside
`from_config` (already in scope from the module-top `use cowprotocol::
{Chain, OrderBookApi, ...}` line). Behaviour identical.
Audit reference: milestone-rubric-grant-audit-2026-06-25.md, Major #6.
Rubric forbids em-dashes in operator-facing config files; while
.toml is technically a grey zone the comment surfaces verbatim when
operators `cat engine.e2e.toml` during e2e runbook execution.
…W-1084)

Adds `tools/baseline-latency/baseline_latency.py`, a per-chain script
that pairs every on-chain `EthFlow.OrderPlacement` event in a trailing
window with the orderbook's record for the same UID and reports
`(creationDate - block.timestamp)`. Matching is rigorous: the script
ABI-decodes the GPv2OrderData from each event, computes the EIP-712
order digest against the chain's GPv2Settlement domain, and looks up
the resulting UID against the orderbook's bulk `/account/.../orders`
fetch (single-UID fallback if missed). No temporal-FIFO approximation.

For EthFlow orders the orderbook indexer sets
`creationDate := block.timestamp` (not the indexer's ingest time), so
the historical delta is structurally 0s on every chain. This is
intentional back-fill-style behaviour, not a measurement bug.
**Implication**: EthFlow indexer latency cannot be derived from
historical orderbook data — the meaningful relayer-latency baseline
lives on the TWAP lane (where the orderbook records the indexer's
`now()` per child order PUT). TWAP child-latency is a follow-up; it
needs per-part UID derivation from each parent
`ConditionalOrderCreated` static input.

Sepolia ran clean: 256 events scanned, 200 UID-derived pairs, all 200
matched against the bulk fetch (`bulk_hit=200`). Median = p95 = 0.0s,
exactly as the finding predicts.

Public-tier RPCs (drpc.org free, 1rpc.io, ankr w/o key, llamarpc,
cloudflare-eth) all refuse / throttle `eth_getLogs` at any usable
chunk size on the production chains. The script halves down to
50-block chunks and gives up after 3 consecutive failures, marking
the cell `RPC-LIMITED` with a pointer to the `RPC_URL_*` env override.
This is the same constraint the M5 soak (COW-1031) will face and
independently confirms the paid-endpoint requirement for any
serious log-scanning workload.

- `tools/baseline-latency/baseline_latency.py` (~520 lines):
  argparse CLI, per-chain `Chain` dataclass, JSON-RPC helper with
  halving retry + `RpcLimited` sentinel, EIP-712 order digest +
  UID derivation, UID-keyed orderbook matching, markdown report
  renderer.
- `tools/baseline-latency/data/*.json`: per-chain raw dump (events,
  pairs, deltas, diagnostics) for auditability.
- `docs/operations/baselines/baseline-latency-2026-06-19.md`: the
  first run's report.

Pinning the orderbook's `creationDate` semantics matters because the
COW-1079 and COW-1031 KPIs reference "watchtower latency" — the M4
report needs to be honest about which lane the latency lives on
(TWAP relayer PUT, not EthFlow indexer ingest). The Sepolia data set
also gives the M4 e2e harness ground-truth UID ↔ block pairings to
cross-check against.
…W-1082)

The chain backend previously dropped alloy's structured
`RpcError::ErrorResp` payload on the floor — the formatted error
string went into `HostError.message`, but `HostError.data` stayed
`None` and `HostError.code` was hard-coded to `-32603`. That made the
twap-monitor's poll-time revert classifier inert on real traffic:
`OrderNotValid` / `PollNever` / `PollTryAtBlock` / `PollTryAtEpoch`
all fell through to `TryNextBlock` because `decode_revert_hex` only
fires on a non-empty `err.data`.

This change wires the structured payload through end-to-end.

- `crates/nexum-engine/src/host/provider_pool.rs`: when alloy's
  `provider.raw_request` fails with an `RpcError::ErrorResp`, the
  pool now captures both `payload.code` (as `Option<i64>` so we can
  distinguish "no ErrorResp" from "ErrorResp with code 0") and
  `payload.data` (as `Option<String>`, the JSON-encoded revert hex)
  and surfaces them on `ProviderError::Rpc`. Transport-side failures
  (timeout, websocket disconnect) leave both `None`. The two
  subscribe paths (`subscribe_blocks`, `subscribe_logs`) keep
  `code: None, data: None` since they don't carry an ErrorResp.
- `crates/nexum-engine/src/host/impls/chain.rs`: extract the
  `ProviderError -> HostError` projection into a free helper
  `provider_error_to_host_error`. The `Rpc` arm forwards the
  structured `data` verbatim, preserves the node-reported code
  (saturating out-of-`i32` values to `-32603`), and falls back to
  `-32603` only when no `ErrorResp` was present. Five unit tests
  cover: revert with data, transport failure with `None`,
  out-of-range code, unknown-chain, and invalid-params.
- `modules/twap-monitor/src/strategy.rs`: update the stale comment
  on the `decode_revert_hex` branch — that branch is now live on
  real traffic, the only `None` path is transport-level failures
  (which keep the safe `TryNextBlock` default).

No incorrect order is ever submitted (the contract reverts; the
orderbook never sees a bad body). The issue is pruning efficiency:
a permanently dead TWAP watch was re-polled every block until a
submit eventually failed for an unrelated reason, and the
local-store filled with `watch:` entries the strategy could
otherwise drop on the first revert. With this fix the SDK-side
classifier dispatches `Drop` / gate on the first revert, matching
the documented expectation in `docs/adr/0007-upstream-protocol-logic-to-cow-rs.md`.

- 70/70 nexum-engine tests pass
- 23/23 twap-monitor tests pass
- 5/5 new chain.rs projection tests pass (revert-with-data,
  transport-fail, out-of-range-code, unknown-chain, invalid-params)
- `cargo clippy -p nexum-engine -p twap-monitor --all-targets
  -- -D warnings` clean

jeffersonBastos's PR #55 (M3 mirror) review, thread on
`modules/twap-monitor/src/strategy.rs:189`. The mirror of this fix
on the cow-api side is COW-1075 (already merged via PR #48).
…OW-1083)

The strategy's `apply_submit_retry` previously wrote an empty
`backoff:{uid}` marker on every retriable submit failure (including
the `TryNextBlock` fallback for unparseable orderbook envelopes). The
marker was a presence flag with no payload, so on every supervisor
reconnect / engine restart the same dead placement would retry
indefinitely — bounded only by log re-delivery frequency.

This change persists a per-UID retry count in the marker's value
(ASCII `u32`) and upgrades to `dropped:` after `MAX_BACKOFF_RETRIES =
5` consecutive retries. The upgrade emits a Warn-level log line so
the operator sees the structural issue (flaky CDN, indexer hiccup,
poisoned envelope) rather than silently accumulating retries.

- `modules/ethflow-watcher/src/strategy.rs`:
  - New const `MAX_BACKOFF_RETRIES = 5`.
  - New helper `read_backoff_count` that reads + parses the marker
    payload; pre-COW-1083 empty markers decode to 0 so previously-set
    backoff: rows still get a fresh attempt (no premature drop on
    rollout).
  - `apply_submit_retry`'s retriable branch now reads the prior
    count, increments, and either writes the new count or upgrades
    to `dropped:` (clearing the stale `backoff:`) at the cap.
  - Cap-upgrade log line carries the retry-count and message: "...
    after 5 retries on transient/unparseable rejection ...".

- 19/19 ethflow-watcher tests pass.
- New `submit_transient_error_at_cap_upgrades_to_dropped_warn`:
  seeds `backoff:{uid} = "4"`, triggers a `data: None` rejection
  (the unparseable case the issue names explicitly), asserts:
    * `dropped:{uid}` is now set
    * `backoff:{uid}` is cleared (single outcome marker at rest)
    * exactly one Warn log line containing "ethflow dropped" +
      "retries"
- New `submit_transient_error_with_legacy_empty_marker_resets_counter`:
  backwards-compat — a pre-COW-1083 empty `b""` marker is treated
  as count 0, bumped to "1" on first retry rather than prematurely
  dropping. Protects in-flight backoffs across the rollout.
- Existing `submit_transient_error_writes_backoff_marker_and_returns`
  extended with an assertion that the first retry persists
  `backoff:{uid} = "1"`.
- `cargo clippy -p ethflow-watcher --all-targets -- -D warnings`
  clean.

Surfaced by jeffersonBastos's PR #55 (M3 mirror) review, thread on
`crates/shepherd-sdk/src/cow/error.rs:82`. Latent in normal
operation (the host forwards parseable envelopes after COW-1075, so
`classify_api_error` returns `Drop` for permanent rejections), but
the gap fires when the orderbook returns a non-JSON 4xx body
(e.g. an HTML error page from a CDN) or if a future host change
accidentally drops the envelope again. Bounded retry semantics
close the latent risk without changing the safe-default
classification (still `TryNextBlock` on `None` data — that part is
explicitly out of scope per the issue).
Adds the COW-1078 pre-soak backtest end-to-end:

1. `tools/backtest-collect/backtest_collect.py` — Python collector
   that pulls a trailing N-day window of `OrderPlacement` (EthFlow)
   and `ConditionalOrderCreated` (TWAP) events from Sepolia,
   ABI-decodes each payload, derives the EthFlow `OrderUid` via
   EIP-712 against the chain's GPv2Settlement domain, resolves every
   non-empty `appData` hash via `GET /api/v1/app_data/{hash}`, and
   emits a single fixtures JSON. Reuses the log-scan + UID-derive
   infra introduced by the baseline-latency tool (COW-1084 PR #57).

2. `crates/shepherd-backtest` — new Rust binary that loads the
   fixtures, programs a `MockHost` per event (resolved `app_data`
   response + UID-echo submit response), and drives
   `ethflow_watcher::strategy::on_logs` directly. Each event is
   classified into `Submitted` / `RejectedExpected` /
   `RejectedUnexpected` / `StrategyError` and rendered into a
   Markdown report at `docs/operations/backtest-reports/
   backtest-7d-YYYY-MM-DD.md`.

3. `modules/ethflow-watcher` — `crate-type = ["cdylib", "rlib"]`
   and cfg-gate the wit-bindgen glue so the rlib carries only the
   `strategy` module (now `pub mod`) for native consumers. The
   wasm artefact is unchanged.

7-day Sepolia window (2026-06-15..2026-06-22): **240 EthFlow events,
240 Submitted, 0 anomalies = 100.0% pass vs. 95% threshold**. The
report is committed at
`docs/operations/backtest-reports/backtest-7d-2026-06-22.md`.

26 TWAP `ConditionalOrderCreated` events are collected and counted
but the replay is deferred to Phase 2B — driving
`twap_monitor::strategy::on_block` requires walking each watch's
`eth_call(getTradeableOrderWithSignature)` per-block, which
public-tier RPCs refuse (see the baseline-latency / COW-1031
finding). The fixtures are committed so the future re-run inherits
the same dataset.

- v1: EthFlow lane end-to-end (collector + replay + report).
- v2 (follow-up): TWAP lane via paid-RPC archive walking; downstream
  validation via `POST /api/v1/quote` round-trip on captured
  bodies.
- Out of scope per the issue: supervisor / event-loop / WS reconnect
  coverage (stays on the wall-clock soak); fuel/memory limits (stays
  on COW-1036 / soak); orderbook PUT mutation (forbidden — only
  read-only endpoints are touched).

- 19/19 ethflow-watcher tests pass (rlib + cdylib build both clean)
- Full workspace test sweep passes (no regressions)
- `cargo clippy -p shepherd-backtest -p ethflow-watcher --all-targets
  -- -D warnings` clean
- Live run: 240 fixtures → 240 Submitted, 0 anomalies

```bash
python3 tools/backtest-collect/backtest_collect.py --days 7
cargo run -p shepherd-backtest -- \
    --fixtures tools/backtest-collect/fixtures-YYYY-MM-DD.json
```
Closes the M5 packaging gap surfaced by the audit: the Dockerfile +
compose recipe lived inside `docs/production.md` but neither was at
the repo root, so `docker build .` didn't work and there was no
published image. This change makes the deploy path one-line on a
fresh VM.

- **`Dockerfile`** — multi-stage build (rust:1.96-slim-bookworm →
  debian:bookworm-slim). Builds the engine in release + the 5
  production modules to wasm32-wasip2. Runtime stage strips down to
  `tini` (PID 1 for graceful shutdown / SIGINT forwarding per
  COW-1072) + `ca-certificates` (TLS to cow.fi + paid RPCs) + a
  non-root `shepherd` user owning `/var/lib/shepherd`. Final image:
  **198 MB** (engine + 5 wasm modules + Debian slim).

- **`.dockerignore`** — excludes `target/`, `data/`, the heavy
  backtest / baseline JSON fixtures, and local-only engine configs,
  while keeping `modules/fixtures/*-bomb` (workspace members; Cargo
  rejects the manifest if they're missing) and the source markdown
  docs (so `docker exec` can grep them in place).

- **`docker-compose.yml`** — two profiles. Default boots just the
  engine with a `shepherd-state` named volume + the operator's
  `./engine.toml` mounted ro at `/etc/shepherd/engine.toml`, metrics
  on the host loopback (`127.0.0.1:9100`). The `observability`
  profile (`docker compose --profile observability up`) layers a
  Prometheus container pre-wired to scrape `shepherd:9100`. Graceful
  shutdown via `stop_signal: SIGINT` + `stop_grace_period: 30s` per
  the production runbook. Healthcheck hits `/metrics`.

- **`engine.docker.toml`** — pre-baked config that matches the
  paths the image bakes (`/opt/shepherd/modules/*.wasm`,
  `/opt/shepherd/manifests/*.toml`, `/var/lib/shepherd` state
  dir). Operator workflow: `cp engine.docker.toml engine.toml`,
  swap `<RPC_KEY>` placeholders, `docker compose up -d`.

- **`docs/deployment/docker.md`** — operator runbook. Covers
  first-boot, engine.toml configuration, upgrade / rollback,
  local-build path, post-deploy verification, cross-links to
  `docs/production.md` for the full hardening surface.

- **`docs/deployment/prometheus.yml`** — scrape config consumed by
  the observability compose profile.

- **`.github/workflows/docker.yml`** — build + push to
  `ghcr.io/bleu/nullis-shepherd` on every push to `main` and every
  `v*` tag. PR builds run the build for smoke (no push). Tags
  produced: `latest` (main HEAD), `v<tag>` (releases),
  `sha-<short>` (every event for exact pinning), `manual-<run_id>`
  (workflow_dispatch). Registry-side layer cache via
  `:buildcache` keeps incremental rebuilds fast. linux/amd64 only —
  the soak VM is x86_64; add arm64 once an operator surfaces a
  real need. Action SHAs pinned to match `.github/workflows/ci.yml`
  style.

Build runs locally end-to-end in ~10 min on a clean Docker daemon:

  $ docker build -t shepherd:smoke .
  $ docker run --rm shepherd:smoke --help
    usage: nexum-engine [<wasm-path> [<manifest-path>]] \
                       [--engine-config <path>] [--pretty-logs]
  $ docker run --rm -v "$PWD/engine.docker.toml:/etc/shepherd/engine.toml:ro" \
        shepherd:smoke
    {"level":"INFO","message":"nexum-engine starting",...}
    {"level":"INFO","message":"metrics exporter listening at /metrics",...}
    {"level":"INFO","message":"opening chain RPC provider","chain_id":1,...}
    Error: connect chain 1: HTTP format error: invalid uri character
                                  ^- expected: <RPC_KEY> placeholder not a real URL

Proves: image builds, entrypoint forwards CMD, engine loads
`/etc/shepherd/engine.toml`, metrics exporter binds, provider pool
iterates the configured chains, graceful error path works.

- [x] Local `docker build .` succeeds (rust:1.96 base — wasmtime 45
      requires rustc >= 1.93, the docs/production.md `1.86` pin was
      stale)
- [x] Image size: 198 MB
- [x] `docker run ... --help` works
- [x] `docker run ... -v engine.docker.toml:...` reads config + binds
      metrics + iterates chains
- [x] `cargo test --workspace` clean (18 groups, 203 passed, 0 failed)

On a fresh Debian/Ubuntu VM with Docker installed:

```bash
git clone https://github.com/bleu/nullis-shepherd /opt/shepherd
cd /opt/shepherd
cp engine.docker.toml engine.toml
$EDITOR engine.toml              # add real RPC URL
docker compose pull              # once ghcr.io image is published
docker compose up -d
docker compose logs -f shepherd
curl -s http://127.0.0.1:9100/metrics | head -50
```

- `docs/deployment/multi-chain-guide.md` — dedicated walkthrough
  configuring 4 chains together (Mainnet + Gnosis + Arbitrum + Base)
  with per-chain module subscriptions
- Example module declaring multi-chain support (every current
  example pins Sepolia)
- Optional automated CD trigger (workflow_dispatch SSH'ing to the
  soak VM to pull + restart) — gated on SSH_PRIVATE_KEY repo secret
Companion to the M5 Docker packaging — the operator workflow is `cp
engine.docker.toml engine.toml` then drop in a paid RPC URL. Without
this rule a clumsy `git add -A` could commit the key. The committed
sibling templates (engine.example/docker/m2/m3/e2e/load.toml) stay
trackable.

Validated against a live smoke run: drpc Sepolia WSS endpoint pasted
into engine.toml, `docker compose up`, engine subscribed to
newHeads + logs, 6 sequential blocks dispatched (11117171..76),
metrics `shepherd_event_latency_seconds` p99 = 0.14ms. Tear-down
clean. No engine.toml ever staged.
Closes the footgun surfaced by the M5 smoke run on drpc Sepolia:
configuring `rpc_url = "https://..."` for a chain that the modules
subscribe to silently degrades to an infinite WARN-with-backoff loop
(COW-1071's reconnect retries forever because `eth_subscribe` is
WS-only in the JSON-RPC spec). Three coordinated changes:

`EngineConfig::validate_transports()` walks every `[chains.<id>]`
entry, and for any `rpc_url` not starting with `ws://` / `wss://`
emits one loud ERROR-level structured log line with:
  - the chain id
  - the redacted offending URL
  - the redacted suggested `wss://` swap
  - actionable copy explaining the WS requirement and the escape
    hatch (`[chains.<id>] require_ws = false` for poll-only chains
    that never subscribe)

The validator is invoked from `main.rs` AFTER the tracing
subscriber is initialised (calling it inside `load_or_default`
silently dropped the log).

A `require_ws: bool` field is added to `ChainConfig` with
`#[serde(default = "default_require_ws")]` = `true`. Operators who
genuinely need an HTTP endpoint (poll-only modules, no block / log
subscriptions on this chain) opt out explicitly per chain.

The pre-existing `opening chain RPC provider` log in
`provider_pool::from_config` was emitting the full URL — API key
included — at INFO level. Log aggregators (Loki / Datadog / Splunk)
routinely retain weeks of these lines; the key has no business
sitting in cold storage. The new `engine_config::redact_url` helper
(public so other call sites can adopt it) replaces any path segment
longer than 20 chars that doesn't contain `.` or `:` with `<KEY>`.
Matches Alchemy / drpc / Infura / QuickNode key shapes.

Same helper is used for both the validation ERROR's `rpc_url` and
`suggested` fields and the provider-pool boot log.

- `engine.example.toml`: every chain entry switched to `wss://`,
  with a header block explaining the WS requirement + the
  `require_ws = false` escape hatch. The previous mix of `https://`
  + `wss://` would have tripped the new validator on its own example.
- `docs/production.md §6`: blockquote callout pointing operators at
  the WS requirement, redaction behaviour, and the escape hatch.

Smoke 1 (HTTP, expected to ERROR):
  {"level":"ERROR","message":"rpc_url uses HTTP transport but the engine subscribes to blocks/logs via eth_subscribe (WS-only). [...]","chain_id":11155111,"rpc_url":"https://lb.drpc.live/sepolia/<KEY>","suggested":"wss://lb.drpc.live/sepolia/<KEY>",...}
  $ grep -c "<the-actual-key>" smoke.log
  0

Smoke 2 (WSS, expected to pass + redacted):
  {"level":"INFO","message":"opening chain RPC provider","chain_id":11155111,"url":"wss://lb.drpc.live/sepolia/<KEY>",...}
  $ grep -c "<the-actual-key>" smoke.log
  0

- 9 new unit tests in `engine_config::tests`:
    * `validate_accepts_wss_url`, `validate_accepts_ws_url`
    * `validate_is_silent_when_require_ws_is_false`
    * `validate_runs_without_panicking_on_http_url`
    * `suggest_swaps_https_to_wss`, `suggest_swaps_http_to_ws`,
      `suggest_passes_through_already_ws_url`
    * `redact_replaces_long_path_segments`,
      `redact_keeps_short_segments_intact`
- Workspace: 18 groups, **212 passed, 0 failed** (was 203 → +9)
- `cargo clippy --workspace --all-targets -- -D warnings` clean
Operator workflow before this change forced the paid-RPC URL to
live in a file (`engine.toml`), which is fine for systemd but
awkward for Docker/compose: the URL had to be hand-edited inside a
volume-mounted file, secrets and config got tangled, and the
internal drpc test key was at risk of slipping into a committed
example. This change makes the engine treat `${VAR_NAME}` tokens
inside `engine.toml` as environment-variable references, resolved
at config-load time:

    [chains.11155111]
    rpc_url = "${SEPOLIA_RPC_URL}"

The `engine.docker.toml` and `engine.example.toml` templates ship
with `${VAR}` placeholders for all five chains, so the committed
files stay secret-free regardless of deployment path.

    cp .env.example .env
    $EDITOR .env                # paste real wss:// URLs
    docker compose up -d

`docker compose` reads the repo-root `.env` automatically (already
the compose default) and forwards the named variables into the
container via the new `environment:` block; the engine substitutes
them when parsing `/etc/shepherd/engine.toml`.

- `engine_config.rs::substitute_env_vars` — hand-rolled parser
  (no regex dep) that walks the raw TOML text, matches `${NAME}`
  tokens against `[A-Z_][A-Z0-9_]*`, and looks each up via
  `std::env::var`. Three error variants via `thiserror`:
    * `Missing { name }` — variable referenced but unset; message
      includes the exact name and a pointer to the `.env` workflow.
    * `InvalidName { name }` — typo (lowercase, leading digit);
      suggests the upper-cased variant.
    * `Unclosed { offset }` — `${` without matching `}`.
- Called from `load_or_default` before `toml::from_str`, so the
  substitution layer never sees parsed TOML — a missing env var
  surfaces with the exact variable name, not a downstream
  "invalid URI character" several layers deep.
- Substitution runs over the whole file (comments included; harmless).

- `.env.example` — committed template with placeholders for all 5
  chain `*_RPC_URL` variables + the optional `SHEPHERD_IMAGE` and
  `SHEPHERD_ENGINE_CONFIG` overrides.
- `.gitignore` — adds `!.env.example` exception so the template
  stays trackable while `.env` and `.env.local` etc. stay ignored.
- `docker-compose.yml` — passes the five `*_RPC_URL` env vars
  through to the container; the engine config bind-mount now
  defaults to `engine.docker.toml` (the committed template) and
  honours `SHEPHERD_ENGINE_CONFIG` for operators who prefer a
  bespoke file.
- `engine.docker.toml` + `engine.example.toml` — every `[chains.*]`
  entry switched to `${*_RPC_URL}` placeholders. Header comments
  spell out the workflow.
- `docs/deployment/docker.md` — first-boot section now leads with
  `cp .env.example .env` (was `cp engine.example.toml engine.toml
  && edit`). §2 explains the bind-mount + the
  `SHEPHERD_ENGINE_CONFIG` escape hatch.

Smoke 1 (compose end-to-end):
  $ cp .env.example .env
  $ echo "SEPOLIA_RPC_URL=wss://lb.drpc.live/sepolia/<real-key>" >> .env
  $ echo "SHEPHERD_ENGINE_CONFIG=./engine.local.toml" >> .env
  $ docker compose up -d
  ...
  {"level":"INFO","message":"opening chain RPC provider","chain_id":11155111,
   "url":"wss://lb.drpc.live/sepolia/<KEY>",...}      ← env-resolved, key redacted
  {"level":"INFO","message":"supervisor up","loaded":2,"alive":2,...}
  {"level":"INFO","message":"block subscription open","chain_id":11155111,...}
  {"level":"INFO","message":"log subscription open","module":"twap-monitor",...}
  {"level":"INFO","message":"log subscription open","module":"ethflow-watcher",...}

  $ docker compose logs | grep -c <real-key>
  0                                                       ← zero leaks

  $ curl -s http://127.0.0.1:9100/metrics | grep latency_seconds_count
  shepherd_event_latency_seconds_count{module="twap-monitor",event_kind="block"} 4

Smoke 2 (missing env var, expected fail-fast):
  $ unset SEPOLIA_RPC_URL
  $ docker compose up
  Error: engine config env-var substitution failed: environment variable
  `SEPOLIA_RPC_URL` referenced via ${SEPOLIA_RPC_URL} in engine.toml but
  not set. Export it before launching the engine (e.g. via a `.env`
  file consumed by `docker compose`).

- 7 new unit tests in `engine_config::tests`:
    * `substitute_replaces_known_variable`
    * `substitute_errors_on_missing_variable`
    * `substitute_errors_on_invalid_name`
    * `substitute_errors_on_unclosed_brace`
    * `substitute_passes_text_with_no_placeholders_through`
    * `substitute_handles_multiple_placeholders_in_one_line`
    * `substitute_preserves_utf8_around_placeholder`
- Workspace: 18 groups, **219 passed, 0 failed** (was 212 → +7)
- `cargo clippy --workspace --all-targets -- -D warnings` clean
VM smoke surfaced a false-negative `(unhealthy)`: the compose
healthcheck called `wget` but the runtime image is built on
debian:bookworm-slim which doesn't include it (only ca-certificates
+ tini, intentionally minimal). `wget: not found` → exit 127 →
unhealthy mark, despite the engine actually working (21 blocks
dispatched in 3 min, p99 latency 0.09ms, zero errors).

Swap to bash's `/dev/tcp` builtin (always present in
bookworm-slim's `/bin/bash`). Successful TCP open on the metrics
port proves the exporter bound, which only happens after the
supervisor finishes boot — same semantic, no image growth.
First fix attempt swapped wget for `/dev/tcp` but kept `CMD-SHELL`,
which routes through `/bin/sh` (dash on debian:bookworm-slim).
dash doesn't have the `/dev/tcp/<host>/<port>` builtin — it's bash-
only. Probes failed with "cannot create /dev/tcp/...: Directory
nonexistent".

Switch to `CMD ["bash", "-c", ...]` so the bash builtin actually
resolves. `bash` ships in the slim base; verified via
`docker exec shepherd which bash` → `/usr/bin/bash`.
Cherry-pick of PR #62 + PR #63's redesign onto the M5 host runtime
(env-var substitution in engine.toml, healthcheck fixes, etc) for the
Sepolia soak VM. The PR review continues on the proper layered
branches:

  - PR #62 — M2/BLEU-833 layer observe design
  - PR #63 — M3/M4 BLEU-855 split + COW-1074 cow_api_request integration

This branch is deploy-only: it lets the soak run on the redesigned
ethflow-watcher with the latest host runtime while review iterates on
the layered PRs. After merge, this branch can be deleted; CI will
republish ghcr.io/bleu/nullis-shepherd:latest with the merged design
and the VM rolls forward to the official image.

See COW-1076 for the full empirical evidence.
…store (COW-1085)

`getTradeableOrderWithSignature` returns the same Ready tuple in every
poll-tick during a TWAP child's validity window — the on-chain
conditional order has no way to know shepherd already POSTed it. The
strategy already wrote a `submitted:{uid}` marker after a successful
submit, but the next poll-tick polled the chain and submitted again,
producing a wasted orderbook call and a misleading
`DuplicatedOrder` Warn line in every soak that runs a TWAP.

Live evidence (2026-06-23 Sepolia soak):

    10:02:36.784  INFO  poll watch:0x8fab71c0...:0x93b1626c... -> Ready
    10:02:37.190  INFO  submitted submitted:0xd7116bd2...
    10:02:48.870  INFO  poll watch:0x8fab71c0...:0x93b1626c... -> Ready
    10:02:49.855  WARN  submit dropped watch (400): orderbook error
                       (DuplicatedOrder): order already exists

The first submission succeeded (`GET /api/v1/orders/0xd7116bd2...`
returns `status: fulfilled`); the second was wasted work.

The fix: at the top of `submit_ready`, compute the client-side UID
deterministically from the on-chain `(order, owner, chain)` tuple via
`OrderData::uid` and check `submitted:{uid}` in local-store; skip the
submit (and the appData resolve that precedes it) when the marker
exists. The marker write site is also updated to use the
client-computed UID for the key so the read and write paths agree
(in production the server-returned UID is the same value — both sides
derive it from the canonical `digest || owner || valid_to` layout —
and a divergence is now surfaced via a Warn).

Tests (24, all green natively + wasm32-wasip2):

  * Existing `poll_ready_submits_order_and_persists_submitted_uid` and
    `poll_ready_resolves_non_empty_app_data_then_submits` updated to
    compute the expected marker key via `compute_uid_hex` instead of
    hardcoding `submitted:0xfeedface` (the mock orderbook's stub UID,
    which now triggers the divergence Warn so we also assert that).
  * New `poll_ready_skips_submit_when_submitted_uid_already_in_store`:
    seeds the marker, dispatches a block tick, asserts
    `submit_order` (and the preceding appData resolve) are NOT called
    and that the expected Info log appears.

Out of scope (deferred): the same idempotency pattern could be
applied to ethflow-watcher's `observed:{uid}` marker (already correct
there — the GET-not-POST design makes this naturally idempotent).
…econnects (COW-1086)

Adds a positive-recovery Info log when the block subscription
resumes after a silence ≥ 60 s, covering the observability gap
identified in the 2026-06-23 Sepolia soak.

## Background

The 2026-06-23 soak surfaced this sequence:

    09:05:43  ERROR  WS connection error: WebSocket protocol error:
                     Connection reset without closing handshake
                     target=alloy_transport_ws::native
    (no further WS-related lines for ~1 h)
    10:02:24  INFO   indexed watch:...     ← twap-monitor activity resumes
    10:05:24  INFO   ethflow observed ...  ← ethflow-watcher activity resumes

`docker ps` showed 0 restarts and the container stayed healthy
throughout — alloy's transport layer reconnected internally without
the engine's `reconnecting_block_task` ever observing
`inner.next().await -> None`. So the engine never entered its
"stream ended → backoff → subscription reopened" path, and the
existing `block subscription reopened` Info log (COW-1071) never
fired. The transport-layer ERROR followed by silence is
indistinguishable from a hung engine on a soak dashboard.

## What changes

In `reconnecting_block_task`, on every yielded item compare
`now.duration_since(last_event)` against `BLOCK_GAP_LOG_THRESHOLD`
(60 s, 5× Sepolia block time). When the gap meets or exceeds the
threshold, emit:

    INFO  chain_id=... gap_s=... kind="block"
          "stream gap closed - first event after silence
           (likely an alloy-internal transport reconnect)"

The gap-detection logic is factored into a small synchronous helper
`block_stream_gap_to_log(now, last_event, threshold) -> Option<Duration>`
so it can be unit-tested without spinning up an async runtime or a
real provider.

## Why blocks only (not logs)

Block subscriptions have predictable cadence — Sepolia produces a
new block every ~12 s, mainnet every ~12 s. A 60 s silence is
therefore anomalous and worth surfacing. Log subscriptions, by
contrast, are inherently sparse (driven by on-chain user activity),
so the same threshold would fire false positives on quiet windows.
The existing `log subscription reopened` log already handles the
engine-detectable reconnect for log streams.

## Tests

4 new unit tests on the gap-detection helper:

  * `block_stream_gap_to_log_returns_none_when_no_prior_event`
  * `block_stream_gap_to_log_returns_none_when_under_threshold`
  * `block_stream_gap_to_log_returns_some_at_threshold_boundary`
  * `block_stream_gap_to_log_returns_some_when_well_over_threshold`

All 90 nexum-engine tests pass (86 existing + 4 new). Clippy strict
clean, fmt clean. Wasm build untouched.

## Out of scope

* End-to-end test of `reconnecting_block_task` against a mock
  provider — no existing scaffolding for that path, and the gap
  helper covers the decision logic deterministically.
* Suppressing or downgrading the `alloy_transport_ws::native` ERROR
  itself — it is a legitimate transport-layer event, just one whose
  recovery wasn't previously observable. The new Info line closes
  that loop without losing the original signal.

## Live validation

The next time alloy auto-reconnects internally on the soak VM, the
new line will surface as a structured JSON event with
`gap_s=<seconds>` so the soak dashboard can correlate it with the
preceding transport ERROR.
… fixes) (#67)

Squash of PR #67 - cherry-picks M4 compliance + applies M5-specific 1 blocker + 12 majors (engine_config typed errors, 40-em-dash sweep). Verified: cargo fmt + clippy + test all green.
)

Squash of PR #68 - 9 markdown files reconciled (5 vapor items rephrased as future direction + capability-gating diagrams aligned to link-time enforcement). Verified: cargo doc --workspace --no-deps clean.
`observe_placement` matches on the `Result<String, HostError>` returned
by `host.cow_api_request(...)`. The M5 conflict resolution (COW-1082
ErrorResp data forwarding) accidentally pasted a wildcard arm that
belongs to `apply_submit_retry`'s `match classify_api_error(...)` (which
matches a `RetryAction` enum). On a `Result`, the wildcard is both
unreachable (`Ok(_)` and `Err(_)` already cover everything) and
references an `err` binding that doesn't exist in its scope:

    error[E0425]: cannot find value `err` in this scope
       --> modules/ethflow-watcher/src/strategy.rs:205:21
       --> modules/ethflow-watcher/src/strategy.rs:205:31

The `Err(err) if err.code == 404` arm + bare `Err(err)` arm already
classify every error case. Drop the spurious `_ =>` arm; bring
`observe_placement` back to fmt/clippy/test green on dev/m5-base.
…terError

Audit reference: milestone-rubric-grant-audit-2026-06-25.md, Major #1
(remaining enums introduced on the M5 multi-chain pass).

- `EnvVarError` (engine_config.rs): introduced with the COW-1071 env-var
  substitution path. Snake_case variant labels feed the boot-time
  `tracing::error!(error_kind = ...)` call sites in `main.rs`.
- `FilterError` (supervisor.rs): introduced with the M5 multi-chain
  log-filter parsing. Snake_case variant labels feed the
  `tracing::warn!(error_kind = ...)` log emitted when a
  `[[subscription]]` address or topic fails to parse.

The audit's M3 / M4 derives landed on the milestones that introduced
the enums; these two complete the workspace-wide IntoStaticStr pass
flagged in audit Major #1 on the milestones that own them.
Audit reference: milestone-rubric-grant-audit-2026-06-25.md, Major #6.

The rubric forbids em-dashes in "code, rustdoc, commit messages, PR
bodies, or review comments". `.toml` is technically a grey zone but
these comments surface verbatim when operators `cat
engine.docker.toml` or `engine.example.toml` during deployment
onboarding. Mechanical find/replace to ` - ` (ASCII hyphen with
spaces).

Files touched:
- engine.example.toml: 2 em-dashes (lines 20, 38)
- engine.docker.toml: 4 em-dashes (lines 4, 5, 6, 31)
…rse (audit JC5)

shepherd-backtest's offline replay harness carried its own
`AddressParseError` enum (hex-decode + length check). The shape
overlaps directly with the `AddressParse` typed error introduced
into `shepherd-sdk` by the audit JC5 pass.

Extend `shepherd_sdk::address` with a single-address `parse_address`
helper alongside the existing `parse_address_list` (the `InvalidAddress`
variant covers both call sites via the `index` field). Replay's
`fixtures::parse_address` becomes a thin wrapper that calls the SDK
and converts the `Address` to the `[u8; 20]` shape the strategy
consumes via `LogView::address`.

Drops the now-unused `thiserror` dependency from shepherd-backtest;
`hex` stays for topic/data decoding.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants