The pipeline MerkleStage::Execution computes per-account storage roots serially on the same thread that walks the account trie. On mainnet, a single XEN-class account can take 60+ seconds for its storage trie alone, blocking everything else.
reth already has ParallelStateRoot (used by the engine tree's payload processor) which dispatches per-account storage-root computation to a worker pool. Wiring it into the pipeline MerkleExecute lets that single-account work overlap with the rest of the chunk.
## Bench
Hetzner i7-8700 (12 threads), mainnet pipeline sync. The size of the parallelism win scales with the weight of single-account storage tries in the workload — serial must compute every per-account storage root on the main thread, while parallel absorbs large accounts onto workers.
| Workload | Variant | MerkleExecute | Δ |
| --- | --- | --- | --- |
| 2000 blocks (24885314 → 24887313) | Baseline reth 2.0.0 serial | 38.0 s | — |
| | ParallelMerkleExecutionStage (this proposal) | 30.2 s | −21 % |
| 7000 blocks (24885314 → 24892314) | Baseline reth 2.0.0 serial | 112.4 s | — |
| | ParallelMerkleExecutionStage (this proposal) | 85.8 s | −24 % |
| 13811 blocks (24885313 → 24901124), includes a Heavy-class account with ~60 s storage trie | Baseline reth 2.0.0 serial | 814 s | — |
| | ParallelMerkleExecutionStage (this proposal) | 494 s | −39 % |
The two smaller workloads come from isolated stage-replay runs (no engine API load, warm OS page cache). The 13811-block workload comes from a full pipeline sync with Lighthouse attached (cold cache); the absolute numbers are larger than the isolated runs due to cache state, but the relative speedup is what matters.
Strict isolated A/B (Lighthouse stopped, --debug.tip driven, 2000 blocks) of ParallelMerkleExecutionStage vs its closest serial-equivalent baseline matches within 0.006 % noise (160.406 s vs 160.415 s). The wall-clock win comes purely from parallelism, not an algorithm change.
## Plan
Four PRs, intentionally small and independently mergeable. The first three are pre-existing bugs that benefit the serial path too; the fourth is the actual perf work and has a compile-time dependency on the second.
fix(trie): length-prefix StoredSubNode inner node in compact codec — heuristic-based decoder misreads state_mask values whose big-endian u16 encoding collides with a valid branch-node compact length (e.g. 6, 38, 70, 550), corrupting the wrapping MerkleCheckpoint's storage root payload. Add an explicit u16 length prefix.
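A minimal sketch of the fixed encoding, with hypothetical simplified functions (the real codec is the Compact implementation for StoredSubNode; these names and signatures are illustrative only): an explicit big-endian u16 length prefix makes the inner-node boundary unambiguous, so no state_mask byte pattern can be mistaken for a node length.

```rust
/// Illustrative encoder: the fix is the explicit u16 length prefix for
/// the optional inner node, instead of a decoder heuristic that guesses
/// where the node bytes end.
fn encode_subnode(state_mask: u16, inner_node: Option<&[u8]>) -> Vec<u8> {
    let mut out = state_mask.to_be_bytes().to_vec();
    match inner_node {
        Some(node) => {
            // Length prefix, then the node bytes themselves.
            out.extend_from_slice(&(node.len() as u16).to_be_bytes());
            out.extend_from_slice(node);
        }
        // A zero length unambiguously means "no inner node".
        None => out.extend_from_slice(&0u16.to_be_bytes()),
    }
    out
}

/// Illustrative decoder: reads the prefix instead of pattern-matching
/// on bytes that might collide with a state_mask encoding.
fn decode_subnode(buf: &[u8]) -> (u16, Option<Vec<u8>>) {
    let state_mask = u16::from_be_bytes([buf[0], buf[1]]);
    let len = u16::from_be_bytes([buf[2], buf[3]]) as usize;
    let inner = (len > 0).then(|| buf[4..4 + len].to_vec());
    (state_mask, inner)
}
```

With the prefix, a state_mask such as 6 or 550 round-trips regardless of whether its byte pattern happens to look like a valid compact length.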
fix(trie): persist should_check_walker_key across resume — TrieNodeIter::should_check_walker_key is set after yielding a branch so the next iteration knows not to re-emit it. The flag lived in memory only; pause/resume across an IntermediateRootState checkpoint dropped it, the resumed walker re-emitted the previous branch and panicked HashBuilder::add_branch with key == self.key. Plumb the flag through IntermediateRootState, MerkleCheckpoint, and the Compact codec.
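The flag's lifecycle can be modeled with a toy walker (all types below are simplified stand-ins, not the real TrieNodeIter or MerkleCheckpoint): because should_check_walker_key travels inside the checkpoint and is restored on resume, the branch yielded just before the pause is not emitted a second time.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Checkpoint {
    pos: usize,
    // The fix: persisted alongside the walker position, not left as
    // in-memory-only iterator state.
    should_check_walker_key: bool,
}

struct Walker {
    keys: Vec<u8>,
    pos: usize,
    should_check_walker_key: bool,
}

impl Walker {
    fn new(keys: Vec<u8>) -> Self {
        Walker { keys, pos: 0, should_check_walker_key: false }
    }

    /// Yield each branch key exactly once, even across pause/resume.
    fn next_branch(&mut self) -> Option<u8> {
        if self.should_check_walker_key {
            // The key at `pos` was already emitted before a pause: skip it
            // (re-emitting is what tripped the duplicate-key panic downstream).
            self.pos += 1;
            self.should_check_walker_key = false;
        }
        let key = self.keys.get(self.pos).copied()?;
        self.should_check_walker_key = true;
        Some(key)
    }

    fn checkpoint(&self) -> Checkpoint {
        Checkpoint { pos: self.pos, should_check_walker_key: self.should_check_walker_key }
    }

    fn resume(keys: Vec<u8>, cp: Checkpoint) -> Self {
        // A resume that dropped the flag (the old bug) would re-emit the
        // branch at `cp.pos`.
        Walker { keys, pos: cp.pos, should_check_walker_key: cp.should_check_walker_key }
    }
}
```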
fix(cli): reth stage dump commits merkle checkpoint per chunk — dump_merkle_stage's replay loop reused a single provider and never saved the per-chunk checkpoint, looping forever on the first chunk. Open a fresh provider per iteration, save the returned output.checkpoint, commit, then advance.
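The fixed loop shape, sketched against a hypothetical mock provider (illustrative only; the real code lives in dump_merkle_stage and uses reth's provider types): a fresh provider per iteration, save the returned checkpoint, commit, then advance.

```rust
// Mock stand-in for the provider; none of these types exist in reth.
struct MockProvider {
    saved_checkpoint: Option<u64>,
}

impl MockProvider {
    fn execute_chunk(&self, from: u64, chunk: u64, tip: u64) -> u64 {
        // Stand-in for one MerkleExecute chunk; returns the new
        // checkpoint block that the real stage reports in its output.
        (from + chunk).min(tip)
    }
    fn save_checkpoint(&mut self, block: u64) {
        self.saved_checkpoint = Some(block);
    }
    fn commit(self) -> u64 {
        self.saved_checkpoint.expect("checkpoint saved before commit")
    }
}

/// Fixed loop: without the save + commit, `from` never advances and the
/// loop replays the first chunk forever (the reported bug).
fn replay(mut from: u64, tip: u64, chunk: u64) -> u64 {
    while from < tip {
        // Fresh provider per iteration instead of one long-lived provider.
        let mut provider = MockProvider { saved_checkpoint: None };
        let checkpoint = provider.execute_chunk(from, chunk, tip);
        provider.save_checkpoint(checkpoint);
        from = provider.commit(); // advance to the committed checkpoint
    }
    from
}
```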
perf(trie,stages): parallel MerkleExecute via rayon-dispatched ParallelStateRoot — adds ParallelMerkleExecutionStage driving ParallelStateRoot. Three changes to ParallelStateRoot::calculate:
- tokio::spawn_blocking → rayon::spawn so storage-root tasks ride the global work-stealing pool sized to CPU cores. Matches the parallelism style of sender_recovery and hashing_storage.
- Bounded frontier (max_in_flight_storage_root_tasks) instead of eager up-front spawning of every target.
- Fallback path uses the account's real storage prefix set instead of Default::default() (the empty-set fallback used to silently return the on-disk root and drop the chunk's storage updates when the frontier throttled the worker).
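The bounded-frontier change can be sketched as follows. This is a minimal illustrative model, not the reth implementation: std::thread stands in for rayon::spawn, max_in_flight mirrors max_in_flight_storage_root_tasks, and the multiply-by-31 "root" is a stand-in for the per-account storage-root computation.

```rust
use std::collections::VecDeque;
use std::sync::mpsc;
use std::thread;

/// Keep at most `max_in_flight` storage-root tasks running and refill
/// one slot per completed result, instead of spawning every target up
/// front. Returns (account, root) pairs, sorted for determinism.
fn bounded_storage_roots(accounts: Vec<u64>, max_in_flight: usize) -> Vec<(u64, u64)> {
    let (tx, rx) = mpsc::channel();
    let mut pending: VecDeque<u64> = accounts.into();
    let mut in_flight = 0;
    let mut results = Vec::new();

    let spawn_one = |account: u64, tx: &mpsc::Sender<(u64, u64)>| {
        let tx = tx.clone();
        // rayon::spawn in the real change; a plain thread here.
        let _ = thread::spawn(move || {
            let root = account.wrapping_mul(31); // stand-in for real work
            let _ = tx.send((account, root));
        });
    };

    // Prime the frontier up to the cap.
    while in_flight < max_in_flight {
        let Some(acc) = pending.pop_front() else { break };
        spawn_one(acc, &tx);
        in_flight += 1;
    }
    // Each completion frees exactly one slot for the next pending account.
    while in_flight > 0 {
        results.push(rx.recv().expect("worker sends one result"));
        in_flight -= 1;
        if let Some(acc) = pending.pop_front() {
            spawn_one(acc, &tx);
            in_flight += 1;
        }
    }
    results.sort_unstable();
    results
}
```

The refill-on-completion loop is what keeps one giant account from forcing every other target to be queued eagerly behind it.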
The new stage processes one incremental_threshold-block chunk per execute call and falls back to the serial MerkleStage::Execution when the range is empty, exceeds rebuild_threshold, starts at block 1, or modifies fewer than 1024 storage tries. Pipeline wiring via .set(...) in setup.rs replaces the default serial stage at the StageId::MerkleExecute slot. Unwind continues to use MerkleStage::Unwind unchanged.
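The fallback conditions above reduce to a single predicate. The function below is illustrative only: the condition names and the 1024 cutoff come from the description, while the signature and parameter shapes are assumptions.

```rust
/// Hypothetical sketch: should this chunk run through the parallel
/// stage, or fall back to the serial MerkleStage::Execution path?
fn use_parallel_stage(
    range_len: u64,
    first_block: u64,
    rebuild_threshold: u64,
    modified_storage_tries: u64,
) -> bool {
    if range_len == 0 {
        return false; // empty range: nothing to parallelize
    }
    if range_len > rebuild_threshold {
        return false; // large ranges take the rebuild path instead
    }
    if first_block == 1 {
        return false; // syncing from genesis: serial
    }
    if modified_storage_tries < 1024 {
        return false; // too little storage work to amortize dispatch
    }
    true
}
```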
Equivalence guarded by two regression tests: parallel_trie_updates_match_serial (single-shot; bit-for-bit identical (root, TrieUpdates)) and parallel_multi_chunk_trie_tables_match_serial (multi-chunk; final AccountsTrie / StoragesTrie contents match what serial MerkleStage::Execution produces on the same seed).
Depends on the second fix above: the new stage's serial fallback path (MerkleStage::Execution, used when a chunk modifies fewer than 1024 storage tries) calls StorageRootMerkleCheckpoint::new with the new should_check_walker_key parameter, so this PR cannot compile until that fix lands. The duplicated commit will be rebased out once that fix merges.
## Scope notes
- The engine tree's live-sync Merkle path is unchanged. The new stage only replaces the pipeline's serial MerkleStage::Execution variant for the incremental path.
- Unwind is unchanged. The new stage skips on unwind; the existing MerkleStage::Unwind continues to handle reverts.
- ParallelStateRoot::root_with_progress (intermediate progress for long chunks) is intentionally not exposed by the new stage — benchmarking shows per-Progress overhead in parallel is roughly two orders of magnitude higher than serial, so the chunk runs single-shot.