
perf(stages): parallelize MerkleExecute for pipeline initial sync #23632

@joohhnnn

Description


Motivation

The pipeline MerkleStage::Execution computes per-account storage roots serially on the same thread that walks the account trie. On mainnet, a single XEN-class account can take 60+ seconds for its storage trie alone, blocking everything else.

reth already has ParallelStateRoot (used by the engine tree's payload processor) which dispatches per-account storage-root computation to a worker pool. Wiring it into the pipeline MerkleExecute lets that single-account work overlap with the rest of the chunk.

Bench

Hetzner i7-8700 (12 threads), mainnet pipeline sync. The size of the parallelism win scales with the weight of single-account storage tries in the workload — serial must compute every per-account storage root on the main thread, while parallel absorbs large accounts onto workers.

| Workload | Variant | MerkleExecute | Δ |
| --- | --- | --- | --- |
| 2000 blocks (24885314 → 24887313) | Baseline reth 2.0.0 serial | 38.0 s | |
| | ParallelMerkleExecutionStage (this proposal) | 30.2 s | −21 % |
| 7000 blocks (24885314 → 24892314) | Baseline reth 2.0.0 serial | 112.4 s | |
| | ParallelMerkleExecutionStage (this proposal) | 85.8 s | −24 % |
| 13811 blocks (24885313 → 24901124), includes a Heavy-class account with ~60 s storage trie | Baseline reth 2.0.0 serial | 814 s | |
| | ParallelMerkleExecutionStage (this proposal) | 494 s | −39 % |

The two smaller workloads come from isolated stage-replay runs (no engine API load, warm OS page cache). The 13811-block workload comes from a full pipeline sync with Lighthouse attached (cold cache); the absolute numbers are larger than the isolated runs due to cache state, but the relative speedup is what matters.

Strict isolated A/B (Lighthouse stopped, --debug.tip driven, 2000 blocks) of ParallelMerkleExecutionStage vs its closest serial-equivalent baseline matches within 0.006 % noise (160.406 s vs 160.415 s). The wall-clock win comes purely from parallelism, not an algorithm change.

Plan

Four PRs, intentionally small and independently mergeable. The first three are pre-existing bugs that benefit the serial path too; the fourth is the actual perf work and has a compile-time dependency on the second.

  • fix(trie): length-prefix StoredSubNode inner node in compact codec — heuristic-based decoder misreads state_mask values whose big-endian u16 encoding collides with a valid branch-node compact length (e.g. 6, 38, 70, 550), corrupting the wrapping MerkleCheckpoint's storage root payload. Add an explicit u16 length prefix.
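The collision is easy to see in miniature: a `state_mask` of 6 encodes big-endian as `0x00 0x06`, which a heuristic decoder can mistake for a 6-byte node. A minimal sketch of the fix, with hypothetical function names (the real codec lives in reth's trie crate and handles the full `StoredSubNode` layout):

```rust
// Hypothetical sketch: an explicit big-endian u16 length prefix removes any
// ambiguity between a state_mask and a node-length byte pattern.
fn encode_node(buf: &mut Vec<u8>, node: &[u8]) {
    // Write the length first so the decoder never has to guess.
    buf.extend_from_slice(&(node.len() as u16).to_be_bytes());
    buf.extend_from_slice(node);
}

fn decode_node(buf: &[u8]) -> (&[u8], &[u8]) {
    // Read the explicit length; the old heuristic instead inspected the
    // leading bytes, which collide with state_mask values like 6 or 38.
    let len = u16::from_be_bytes([buf[0], buf[1]]) as usize;
    let (node, rest) = buf[2..].split_at(len);
    (node, rest)
}
```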

  • fix(trie): persist should_check_walker_key across resume. TrieNodeIter::should_check_walker_key is set after yielding a branch so the next iteration knows not to re-emit it. The flag lived in memory only; a pause/resume across an IntermediateRootState checkpoint dropped it, so the resumed walker re-emitted the previous branch and panicked in HashBuilder::add_branch with key == self.key. Plumb the flag through IntermediateRootState, MerkleCheckpoint, and the Compact codec.
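A toy model of the pause/resume hazard (all names hypothetical, not reth's actual types): if the checkpoint carries the flag, the resumed walker skips the branch it already yielded; if the flag only lives in memory, resume defaults it to false and the branch is emitted twice.

```rust
// Hypothetical minimal model: the skip-flag must travel through the
// checkpoint, or the resumed walker re-emits the last yielded branch.
struct Checkpoint {
    last_key: Vec<u8>,
    // The fix: persist the flag instead of keeping it in memory only.
    should_check_walker_key: bool,
}

struct Walker {
    last_key: Vec<u8>,
    should_check_walker_key: bool,
}

impl Walker {
    fn pause(&self) -> Checkpoint {
        Checkpoint {
            last_key: self.last_key.clone(),
            should_check_walker_key: self.should_check_walker_key,
        }
    }

    fn resume(cp: Checkpoint) -> Self {
        Walker { last_key: cp.last_key, should_check_walker_key: cp.should_check_walker_key }
    }

    // Returns true if the walker would emit this key. With the flag
    // restored, the previously yielded branch is skipped, avoiding the
    // duplicate-key panic described above.
    fn emits(&self, key: &[u8]) -> bool {
        !(self.should_check_walker_key && key == self.last_key.as_slice())
    }
}
```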

  • fix(cli): reth stage dump commits merkle checkpoint per chunk. dump_merkle_stage's replay loop reused a single provider and never saved the per-chunk checkpoint, looping forever on the first chunk. Open a fresh provider per iteration, save the returned output.checkpoint, commit, then advance.
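The shape of the corrected loop, as a sketch with stand-in types (the real code drives the Merkle stage through a provider; `run_chunk` here just models one chunk advancing the checkpoint):

```rust
// Hypothetical sketch of the fixed replay loop: each iteration starts from
// the checkpoint saved by the previous one, so progress is monotonic.
struct Output {
    checkpoint: u64,
}

fn run_chunk(from: u64, chunk: u64, tip: u64) -> Output {
    // Stand-in for executing one MerkleExecute chunk via a fresh provider.
    Output { checkpoint: (from + chunk).min(tip) }
}

fn replay(mut checkpoint: u64, chunk: u64, tip: u64) -> u64 {
    let mut iterations = 0u64;
    while checkpoint < tip {
        // Fix: open a fresh provider per iteration and feed it the *saved*
        // checkpoint, instead of reusing stale state from iteration one.
        let output = run_chunk(checkpoint, chunk, tip);
        checkpoint = output.checkpoint; // save + commit, then advance
        iterations += 1;
        assert!(iterations <= tip); // guard against the old infinite loop
    }
    checkpoint
}
```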

  • perf(trie,stages): parallel MerkleExecute via rayon-dispatched ParallelStateRoot — adds ParallelMerkleExecutionStage driving ParallelStateRoot. Three changes to ParallelStateRoot::calculate:

    1. tokio::spawn_blocking → rayon::spawn so storage-root tasks ride the global work-stealing pool sized to CPU cores. Matches the parallelism style of sender_recovery and hashing_storage.
    2. Bounded frontier (max_in_flight_storage_root_tasks) instead of eager up-front spawn of every target.
    3. Fallback path uses the account's real storage prefix set instead of Default::default() (the empty-set fallback used to silently return the on-disk root and drop the chunk's storage updates when the frontier throttled the worker).
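The bounded-frontier idea in (2) can be sketched with std primitives; the real change uses rayon's pool, and the names and the fold below are stand-ins, not reth's code:

```rust
use std::sync::mpsc;
use std::thread;

// Sketch of a bounded frontier: at most `max_in_flight` storage-root tasks
// run at once; one completion is drained before the next target is
// dispatched, instead of eagerly spawning every target up front.
fn bounded_storage_roots(targets: Vec<u64>, max_in_flight: usize) -> u64 {
    let (tx, rx) = mpsc::channel();
    let mut in_flight = 0usize;
    let mut acc = 0u64; // stand-in for folding storage roots into the account trie

    for target in targets {
        if in_flight == max_in_flight {
            acc += rx.recv().unwrap(); // drain one completion before spawning more
            in_flight -= 1;
        }
        let tx = tx.clone();
        // rayon::spawn in the real change; a plain thread keeps this self-contained.
        thread::spawn(move || {
            let root = target * 2; // stand-in for an expensive storage-root computation
            tx.send(root).unwrap();
        });
        in_flight += 1;
    }
    drop(tx); // close our sender so the drain below terminates
    acc + rx.iter().sum::<u64>()
}
```

The cap keeps memory bounded on chunks with many modified accounts while still letting a single heavy account overlap with the rest of the frontier.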

    The new stage processes one incremental_threshold-block chunk per execute call and falls back to the serial MerkleStage::Execution when the range is empty, exceeds rebuild_threshold, starts at block 1, or modifies fewer than 1024 storage tries. Pipeline wiring via .set(...) in setup.rs replaces the default serial stage at the StageId::MerkleExecute slot. Unwind continues to use MerkleStage::Unwind unchanged.
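The fallback conditions compose into a single predicate; a hedged sketch (threshold names mirror the text above, not the actual reth configuration types):

```rust
// Hypothetical sketch of the parallel-vs-serial decision: the new stage
// only runs the parallel path when the chunk is non-empty, within the
// incremental window, not a genesis rebuild, and heavy enough to pay for
// the worker-pool dispatch overhead.
fn use_parallel(
    range_len: u64,
    start_block: u64,
    modified_storage_tries: usize,
    rebuild_threshold: u64,
) -> bool {
    range_len != 0
        && range_len <= rebuild_threshold
        && start_block != 1
        && modified_storage_tries >= 1024
}
```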

    Equivalence guarded by two regression tests: parallel_trie_updates_match_serial (single-shot; bit-for-bit identical (root, TrieUpdates)) and parallel_multi_chunk_trie_tables_match_serial (multi-chunk; final AccountsTrie / StoragesTrie contents match what serial MerkleStage::Execution produces on the same seed).

    Depends on the second fix above: the new stage's serial fallback path (MerkleStage::Execution, used when a chunk modifies fewer than 1024 storage tries) calls StorageRootMerkleCheckpoint::new with the new should_check_walker_key parameter, so this PR cannot compile without that fix applied. Will rebase and drop the duplicated commit once PR B merges.

Scope notes

  • The engine tree's live-sync Merkle path is unchanged. The new stage only replaces the pipeline's serial MerkleStage::Execution variant for the incremental path.
  • Unwind is unchanged. The new stage skips on unwind; the existing MerkleStage::Unwind continues to handle reverts.
  • ParallelStateRoot::root_with_progress (intermediate progress for long chunks) is intentionally not exposed by the new stage — benchmarking shows per-Progress overhead in parallel is roughly two orders of magnitude higher than serial, so the chunk runs single-shot.

