
perf(stages): parallelize MerkleExecute for pipeline initial sync #23632

@joohhnnn

Description


Motivation

The pipeline MerkleStage::Execution computes per-account storage roots serially on the same thread that walks the account trie. On mainnet, a single XEN-class account can take 60+ seconds for its storage trie alone, blocking everything else.

reth already has ParallelStateRoot (used by the engine tree's payload processor) which dispatches per-account storage-root computation to a worker pool. Wiring it into the pipeline MerkleExecute lets that single-account work overlap with the rest of the chunk.

Bench

Hetzner i7-8700 (12 threads), mainnet pipeline sync. The size of the parallelism win scales with the weight of single-account storage tries in the workload — serial must compute every per-account storage root on the main thread, while parallel absorbs large accounts onto workers.

| Workload | Variant | MerkleExecute | Δ |
| --- | --- | --- | --- |
| 2000 blocks (24885314 → 24887313) | Baseline reth 2.0.0 serial | 38.0 s | |
| | ParallelMerkleExecutionStage (this proposal) | 30.2 s | −21 % |
| 7000 blocks (24885314 → 24892314) | Baseline reth 2.0.0 serial | 112.4 s | |
| | ParallelMerkleExecutionStage (this proposal) | 85.8 s | −24 % |
| 13811 blocks (24885313 → 24901124), includes a Heavy-class account with ~60 s storage trie | Baseline reth 2.0.0 serial | 814 s | |
| | ParallelMerkleExecutionStage (this proposal) | 494 s | −39 % |

The two smaller workloads come from isolated stage-replay runs (no engine API load, warm OS page cache). The 13811-block workload comes from a full pipeline sync with Lighthouse attached (cold cache); the absolute numbers are larger than the isolated runs due to cache state, but the relative speedup is what matters.

Strict isolated A/B (Lighthouse stopped, --debug.tip driven, 2000 blocks) of ParallelMerkleExecutionStage vs its closest serial-equivalent baseline matches within 0.006 % noise (160.406 s vs 160.415 s). The wall-clock win comes purely from parallelism, not an algorithm change.

Plan

Four PRs, intentionally small and independently mergeable. The first three are pre-existing bugs that benefit the serial path too; the fourth is the actual perf work and has a compile-time dependency on the second.

  • fix(trie): length-prefix StoredSubNode inner node in compact codec — heuristic-based decoder misreads state_mask values whose big-endian u16 encoding collides with a valid branch-node compact length (e.g. 6, 38, 70, 550), corrupting the wrapping MerkleCheckpoint's storage root payload. Add an explicit u16 length prefix.
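The collision is easy to see in miniature: a `state_mask` of 6 encodes big-endian as `0x00 0x06`, which a heuristic decoder can mistake for a 6-byte node. A minimal sketch of the fix, with hypothetical function names (the real codec lives in reth's trie crate and handles the full `StoredSubNode` layout):

```rust
// Hypothetical sketch: an explicit big-endian u16 length prefix removes any
// ambiguity between a state_mask and a node-length byte pattern.
fn encode_node(buf: &mut Vec<u8>, node: &[u8]) {
    // Write the length first so the decoder never has to guess.
    buf.extend_from_slice(&(node.len() as u16).to_be_bytes());
    buf.extend_from_slice(node);
}

fn decode_node(buf: &[u8]) -> (&[u8], &[u8]) {
    // Read the explicit length; the old heuristic instead inspected the
    // leading bytes, which collide with state_mask values like 6 or 38.
    let len = u16::from_be_bytes([buf[0], buf[1]]) as usize;
    let (node, rest) = buf[2..].split_at(len);
    (node, rest)
}
```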

  • fix(trie): persist should_check_walker_key across resume. TrieNodeIter::should_check_walker_key is set after yielding a branch so the next iteration knows not to re-emit it. The flag lived in memory only; a pause/resume across an IntermediateRootState checkpoint dropped it, so the resumed walker re-emitted the previous branch and panicked in HashBuilder::add_branch with key == self.key. Plumb the flag through IntermediateRootState, MerkleCheckpoint, and the Compact codec.
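A toy model of the pause/resume hazard (all names hypothetical, not reth's actual types): if the checkpoint carries the flag, the resumed walker skips the branch it already yielded; if the flag only lives in memory, resume defaults it to false and the branch is emitted twice.

```rust
// Hypothetical minimal model: the skip-flag must travel through the
// checkpoint, or the resumed walker re-emits the last yielded branch.
struct Checkpoint {
    last_key: Vec<u8>,
    // The fix: persist the flag instead of keeping it in memory only.
    should_check_walker_key: bool,
}

struct Walker {
    last_key: Vec<u8>,
    should_check_walker_key: bool,
}

impl Walker {
    fn pause(&self) -> Checkpoint {
        Checkpoint {
            last_key: self.last_key.clone(),
            should_check_walker_key: self.should_check_walker_key,
        }
    }

    fn resume(cp: Checkpoint) -> Self {
        Walker { last_key: cp.last_key, should_check_walker_key: cp.should_check_walker_key }
    }

    // Returns true if the walker would emit this key. With the flag
    // restored, the previously yielded branch is skipped, avoiding the
    // duplicate-key panic described above.
    fn emits(&self, key: &[u8]) -> bool {
        !(self.should_check_walker_key && key == self.last_key.as_slice())
    }
}
```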

  • fix(cli): reth stage dump commits merkle checkpoint per chunk. dump_merkle_stage's replay loop reused a single provider and never saved the per-chunk checkpoint, looping forever on the first chunk. Open a fresh provider per iteration, save the returned output.checkpoint, commit, then advance.
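The shape of the corrected loop, as a sketch with stand-in types (the real code drives the Merkle stage through a provider; `run_chunk` here just models one chunk advancing the checkpoint):

```rust
// Hypothetical sketch of the fixed replay loop: each iteration starts from
// the checkpoint saved by the previous one, so progress is monotonic.
struct Output {
    checkpoint: u64,
}

fn run_chunk(from: u64, chunk: u64, tip: u64) -> Output {
    // Stand-in for executing one MerkleExecute chunk via a fresh provider.
    Output { checkpoint: (from + chunk).min(tip) }
}

fn replay(mut checkpoint: u64, chunk: u64, tip: u64) -> u64 {
    let mut iterations = 0u64;
    while checkpoint < tip {
        // Fix: open a fresh provider per iteration and feed it the *saved*
        // checkpoint, instead of reusing stale state from iteration one.
        let output = run_chunk(checkpoint, chunk, tip);
        checkpoint = output.checkpoint; // save + commit, then advance
        iterations += 1;
        assert!(iterations <= tip); // guard against the old infinite loop
    }
    checkpoint
}
```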

  • perf(trie,stages): parallel MerkleExecute via rayon-dispatched ParallelStateRoot — adds ParallelMerkleExecutionStage driving ParallelStateRoot. Three changes to ParallelStateRoot::calculate:

    1. tokio::spawn_blocking → rayon::spawn so storage-root tasks ride the global work-stealing pool sized to CPU cores. Matches the parallelism style of sender_recovery and hashing_storage.
    2. Bounded frontier (max_in_flight_storage_root_tasks) instead of eager up-front spawn of every target.
    3. Fallback path uses the account's real storage prefix set instead of Default::default() (the empty-set fallback used to silently return the on-disk root and drop the chunk's storage updates when the frontier throttled the worker).
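The bounded-frontier idea in (2) can be sketched with std primitives; the real change uses rayon's pool, and the names and the fold below are stand-ins, not reth's code:

```rust
use std::sync::mpsc;
use std::thread;

// Sketch of a bounded frontier: at most `max_in_flight` storage-root tasks
// run at once; one completion is drained before the next target is
// dispatched, instead of eagerly spawning every target up front.
fn bounded_storage_roots(targets: Vec<u64>, max_in_flight: usize) -> u64 {
    let (tx, rx) = mpsc::channel();
    let mut in_flight = 0usize;
    let mut acc = 0u64; // stand-in for folding storage roots into the account trie

    for target in targets {
        if in_flight == max_in_flight {
            acc += rx.recv().unwrap(); // drain one completion before spawning more
            in_flight -= 1;
        }
        let tx = tx.clone();
        // rayon::spawn in the real change; a plain thread keeps this self-contained.
        thread::spawn(move || {
            let root = target * 2; // stand-in for an expensive storage-root computation
            tx.send(root).unwrap();
        });
        in_flight += 1;
    }
    drop(tx); // close our sender so the drain below terminates
    acc + rx.iter().sum::<u64>()
}
```

The cap keeps memory bounded on chunks with many modified accounts while still letting a single heavy account overlap with the rest of the frontier.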

    The new stage processes one incremental_threshold-block chunk per execute call and falls back to the serial MerkleStage::Execution when the range is empty, exceeds rebuild_threshold, starts at block 1, or modifies fewer than 1024 storage tries. Pipeline wiring via .set(...) in setup.rs replaces the default serial stage at the StageId::MerkleExecute slot. Unwind continues to use MerkleStage::Unwind unchanged.
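The fallback conditions compose into a single predicate; a hedged sketch (threshold names mirror the text above, not the actual reth configuration types):

```rust
// Hypothetical sketch of the parallel-vs-serial decision: the new stage
// only runs the parallel path when the chunk is non-empty, within the
// incremental window, not a genesis rebuild, and heavy enough to pay for
// the worker-pool dispatch overhead.
fn use_parallel(
    range_len: u64,
    start_block: u64,
    modified_storage_tries: usize,
    rebuild_threshold: u64,
) -> bool {
    range_len != 0
        && range_len <= rebuild_threshold
        && start_block != 1
        && modified_storage_tries >= 1024
}
```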

    Equivalence guarded by two regression tests: parallel_trie_updates_match_serial (single-shot; bit-for-bit identical (root, TrieUpdates)) and parallel_multi_chunk_trie_tables_match_serial (multi-chunk; final AccountsTrie / StoragesTrie contents match what serial MerkleStage::Execution produces on the same seed).

    Depends on the second fix above: the new stage's serial fallback path (MerkleStage::Execution, used when a chunk modifies fewer than 1024 storage tries) calls StorageRootMerkleCheckpoint::new with the new should_check_walker_key parameter, so this PR cannot compile without that fix applied. Will rebase and drop the duplicated commit once PR B merges.

Scope notes

  • The engine tree's live-sync Merkle path is unchanged. The new stage only replaces the pipeline's serial MerkleStage::Execution variant for the incremental path.
  • Unwind is unchanged. The new stage skips on unwind; the existing MerkleStage::Unwind continues to handle reverts.
  • ParallelStateRoot::root_with_progress (intermediate progress for long chunks) is intentionally not exposed by the new stage — benchmarking shows per-Progress overhead in parallel is roughly two orders of magnitude higher than serial, so the chunk runs single-shot.

