--full --prune.sender-recovery.distance causes deterministic static-file inconsistency panic ~25 min after snapshot import on Reth v2.0.0 — fix already exists in closed-unmerged PR #21636 #23463

@All-The-New

Description

TL;DR

Reth v2.0.0 with --full --storage.v2 --prune.sender-recovery.distance 2000000 deterministically panics on check_consistency for the TransactionSenders segment after a fresh snapshot import. PR #21918 (merged 2026-02-06) fixed this for --prune.sender-recovery.full but did not cover the Distance variant. PR #21636 was authored specifically for this case by @gakonst, closed as stale 2026-02-25 without merging, and implements the exact fix. I extracted its 29-line logic change, adapted it to v2.0.0's current ensure_invariants function, rebuilt reth locally, and confirmed the node starts cleanly and runs the full sync pipeline (headers → bodies → SenderRecovery → execution → merkle → ... → finish) without any panic. The fix works. Please either reopen #21636 or cherry-pick its logic into main.

What happens

Reth v2.0.0 with --full --storage.v2 --prune.sender-recovery.distance 2000000 panics deterministically on the TransactionSenders consistency check approximately 25 minutes after startup. The panic occurs every time, including after a full wipe-and-re-download from a fresh snapshot. The node cannot stay running.

The panic site is crates/node/builder/src/launch/common.rs:533:13:

thread 'main' panicked at crates/node/builder/src/launch/common.rs:533:13:
assertion `left != right` failed: A static file inconsistency was found that would trigger an unwind to block 0
  left: 0
 right: 0

The same assertion fires at crates/cli/commands/src/common.rs:214:13 for any reth db or reth stage subcommand that touches the database — there is no escape hatch short of reth db stats --skip-consistency-checks.

Reproducer

  1. Perform a fresh snapshot import on a clean datadir:
    reth download --chain mainnet --full --storage.v2 -y
  2. Start reth with:
    reth node \
      --datadir /path/to/datadir \
      --storage.v2 \
      --full \
      --prune.block-interval 100 \
      --prune.sender-recovery.distance 2000000 \
      --prune.transaction-lookup.distance 2000000 \
      --prune.receipts.distance 2000000 \
      --prune.account-history.distance 2000000 \
      --prune.storage-history.distance 2000000 \
      --prune.bodies.distance 2000000
  3. Let a connected consensus client (Lighthouse, in this case) push forkchoice updates.
  4. After approximately 25 minutes, reth emits a Failed to load static file jar warning for the transaction-senders segment covering the current tip, then panics.
  5. On the next startup (or on any reth db/reth stage invocation), the consistency check finds the 0-byte placeholder files Reth wrote during the previous startup, computes unwind_target=0, and panics again. The node is now in a permanent crash loop that survives across full wipes.

This has been reproduced twice on identical hardware from two independent snapshot downloads — once on 2026-04-09 after upgrading to v2.0.0, and again on 2026-04-10 after a full wipe-and-redownload recovery attempt.

Consistency check output at panic time

In read-only mode (e.g., reth db stats), the check logs rather than panics. The output is:

INFO check_consistency{read_only=true}: Verifying storage consistency.
INFO check_consistency{read_only=true}: Checking consistency for segment{segment=TransactionSenders}:
INFO   ensure_invariants{
         highest_static_file_entry=None
         highest_static_file_block=None
         table="TransactionSenders"
       }: Setting unwind target. checkpoint_block_number=24850381 unwind_target=0
WARN Inconsistent storage. Restart node to heal. unwind_target=Unwind(0)

In write mode (reth node, reth stage unwind, reth db stage-checkpoints set), the same code path escalates to the assertion panic shown above.

On-disk state at crash

Static files directory listing (ls -la /path/to/datadir/static_files/static_file_transaction-senders*):

-rw-r--r--  1 _unknown  _unknown   0  static_file_transaction-senders_0_49999
-rw-r--r--  1 _unknown  _unknown  55  static_file_transaction-senders_0_49999.conf
-rw-r--r--  1 _unknown  _unknown   9  static_file_transaction-senders_0_49999.off
-rw-r--r--  1 _unknown  _unknown   0  static_file_transaction-senders_24850000_24899999
-rw-r--r--  1 _unknown  _unknown  55  static_file_transaction-senders_24850000_24899999.conf
-rw-r--r--  1 _unknown  _unknown   9  static_file_transaction-senders_24850000_24899999.off

Both data files are 0 bytes. The .conf and .off sidecars have the standard initialization sizes but no real content.

Notably, the segment covering all blocks between 50000 and 24799999 — where corresponding Transactions segment files do exist — is entirely absent. There are no tx-senders files for the full synced range.

The segment at the tip advances between crashes. Crash 1 produced 24800000_24849999; after wipe-and-redownload, crash 2 produced 24850000_24899999. This is deterministic with the snapshot tip, not random corruption.

MDBX state

From reth db --datadir /path/to/datadir stats --skip-consistency-checks:

  • Static file catalog (MDBX): 5 segment types — Headers, Transactions, Receipts, AccountChangeSets, StorageChangeSets. TransactionSenders is not in the catalog at all. The MDBX metadata does not believe any tx-senders static files exist.
  • TransactionSenders MDBX table: 0 entries, 0 bytes — fully distance-pruned from MDBX, as configured.
  • SenderRecovery stage checkpoint: StageCheckpoint { block_number: 24850381, stage_checkpoint: None }
  • PruneSenderRecovery checkpoint: mode=Distance(2000000), block=24850381
  • RocksDB TransactionHashNumbers: 509M entries, ~23 GiB — transaction hashes are NOT pruned (only senders)

The inconsistency in summary: MDBX says SenderRecovery completed to block 24,850,381, the catalog says no tx-senders static files exist, and the filesystem says two 0-byte files exist. These three facts are irreconcilable to check_consistency.

The trigger sequence

Based on log analysis:

  1. Reth starts from the snapshot at block 24,850,381. Lighthouse begins pushing forkchoice updates. Reth does not advance any blocks but does run background pipeline work (e.g., TransactionLookup rebuilds ~1.4M entries during the 25-minute window).
  2. ~25 minutes in, Reth attempts to read the transaction-senders static file for the segment covering the current tip:
    WARN Failed to load static file jar ... Os { code: 2, kind: NotFound, message: "No such file or directory" },
         path: ".../static_files/static_file_transaction-senders_24800000_24849999.conf"
    
  3. The engine-tree error chain propagates up:
    ERROR engine::persistence: Persistence service failed...
    ERROR engine::tree: Fatal error in consensus engine...
    INFO  reth::cli: Fatal error in consensus engine...shutting down
    
  4. During shutdown, Reth's "three-way healing" creates the 0-byte placeholder files at the expected segment boundaries (0_49999 and the tip segment).
  5. On next startup, check_consistency walks the static files directory, finds the 0-byte placeholders, sees highest_static_file_block=None because the files have no rows, compares to checkpoint_block_number=24850381, computes unwind_target=0, and panics.
  6. All subsequent startups reproduce step 5 because Reth also recreates the 0_49999 placeholder as a startup artifact.
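The comparison in step 5 can be modeled with a few lines. This is a hypothetical reduction reconstructed from the log output above, not reth's actual code: `checkpoint_block` stands in for the SenderRecovery stage checkpoint, and `highest_static_file_block` is `None` because the 0-byte placeholder files contain no rows.

```rust
// Hypothetical model of the failing invariant check (step 5 above).
// Reconstructed from the logged values; not the actual reth implementation.
fn unwind_target(checkpoint_block: u64, highest_static_file_block: Option<u64>) -> Option<u64> {
    // No rows in the static files => treat the highest available block as 0.
    let highest = highest_static_file_block.unwrap_or(0);
    // Checkpoint ahead of static files => assume data loss, unwind to `highest`.
    (checkpoint_block > highest).then_some(highest)
}

fn main() {
    // Matches the log: checkpoint_block_number=24850381, unwind_target=0
    assert_eq!(unwind_target(24_850_381, None), Some(0));
    // A consistent state would produce no unwind target.
    assert_eq!(unwind_target(24_850_381, Some(24_850_381)), None);
}
```

With the 0-byte placeholders present, this function can never return `None`, which is why the crash loop is permanent.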

Root cause analysis

The chain of causation points to a gap in segments_to_check() / should_check_segment() in the static file manager (approximately crates/storage/provider/src/providers/static_file/manager.rs):

  1. check_consistency iterates segments_to_check().
  2. should_check_segment() skips TransactionSenders only when is_segment_fully_pruned(SenderRecovery) returns true.
  3. is_segment_fully_pruned appears to return true only when the prune mode is Full — i.e., when --prune.sender-recovery.full is set.
  4. With --prune.sender-recovery.distance 2000000, the stored prune mode is Distance(2000000), which does not satisfy the Full check.
  5. So check_consistency runs for TransactionSenders, finds no static files, and panics — even though this config never wrote any tx-senders to static files in the first place.

The deeper issue is that the snapshot import procedure sets SenderRecovery stage checkpoint to the snapshot tip but does not write any TransactionSenders static files. This is the correct behavior for a --full + distance config (distance pruning keeps senders in MDBX for the recent window, not in static files). But check_consistency does not know this: it sees a non-zero stage checkpoint and a missing static file catalog entry, and concludes the data was lost.
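The suspected gap reduces to a two-function sketch. The names below mirror the reth functions mentioned above, but the bodies are my reconstruction of the behavior, not the actual implementation:

```rust
// Illustrative reduction of the suspected skip-path gap.
// Names mirror the reth functions discussed above; bodies are reconstructed.
#[derive(Debug)]
enum PruneMode {
    Full,
    Distance(u64),
}

fn is_segment_fully_pruned(mode: &PruneMode) -> bool {
    // Only `Full` qualifies; `Distance(_)` falls through...
    matches!(mode, PruneMode::Full)
}

fn should_check_segment(mode: &PruneMode) -> bool {
    // ...so the TransactionSenders segment is still consistency-checked,
    // even though a --full + distance config never writes senders to
    // static files in the first place.
    !is_segment_fully_pruned(mode)
}

fn main() {
    assert!(!should_check_segment(&PruneMode::Full)); // #21918's fix covers this
    assert!(should_check_segment(&PruneMode::Distance(2_000_000))); // this bug
}
```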

What I tried that does not work

| Approach | Result |
| --- | --- |
| Quarantine the 0-byte placeholder files | reth node recreates them on startup and hits the same panic |
| reth stage unwind --datadir /path to-block <N> | reth stage runs check_consistency at startup and panics before doing any unwind |
| reth db --datadir /path stage-checkpoints set --stage sender-recovery 0 | Same: the write-mode consistency check runs before accepting any arguments, panics |
| reth node --prune.sender-recovery.full on the existing datadir | Saves the new prune config to reth.toml but does NOT update the PruneSenderRecovery entry in MDBX, so check_consistency still reads Distance(2000000) and panics |
| Full wipe + fresh snapshot (reth download --full --storage.v2 -y) | Produces a clean state that runs for ~25 minutes and then crashes with the same panic at the next 50k-block segment boundary. Reproduced twice. |

The only command that can inspect the broken datadir without panicking is reth db --datadir /path stats --skip-consistency-checks. There is no equivalent escape hatch for reth node, reth stage, or reth db stage-checkpoints set.

Related issues

The difference between what PR #21918 fixed and what this issue describes is the prune mode: Full vs Distance. The fix in #21918 added a skip path for is_segment_fully_pruned, but distance-pruned senders are not "fully pruned" by that definition even though they are also never written to static files.

The already-existing fix: PR #21636

PR #21636 "fix(storage): respect prune checkpoint in static file consistency check" was authored by @gakonst (branch joshie/fix-static-file-prune-unwind) specifically for this bug. The PR body describes it word-for-word:

Root cause: When --full prunes segments like TransactionSenders, the static files are deleted but the stage checkpoint remains high. On next startup, the consistency check in ensure_invariants sees checkpoint_block_number > highest_static_file_block (e.g., 22M > 0) and assumes data corruption, triggering an unwanted unwind.

The PR's approach: add + PruneCheckpointReader to the ensure_invariants trait bound, and change the naive if checkpoint_block_number > highest_static_file_block comparison to use effective_available_block = max(highest_static_file_block, prune_checkpoint_block). For distance-pruned segments, the prune checkpoint's block_number is the tip, so effective_available_block equals the stage checkpoint and no unwind is triggered. The PR includes a green unit test (test_consistency_respects_prune_checkpoint) that seeds the exact failure mode.
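Plugged into this issue's numbers, the PR's comparison change looks like the following. This is a minimal standalone model of the `max` logic, not the PR's actual code:

```rust
// Minimal model of PR #21636's comparison change, using this issue's numbers.
// Not the PR's actual code; it illustrates why no unwind fires.
fn effective_available_block(
    highest_static_file_block: u64,
    prune_checkpoint_block: Option<u64>,
) -> u64 {
    highest_static_file_block.max(prune_checkpoint_block.unwrap_or(0))
}

fn main() {
    let checkpoint_block = 24_850_381u64;
    // Distance pruning: prune checkpoint sits at the tip, static files empty.
    let effective = effective_available_block(0, Some(checkpoint_block));
    // checkpoint_block > effective is now false => no unwind is triggered.
    assert!(checkpoint_block <= effective);
}
```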

PR #21636 was closed 2026-02-25 by @emmajam with the comment "Hey! We're doing some spring cleaning on our PR backlog 🧹 Closing old PRs to keep things tidy. If this is still relevant, please feel free to re-open" — it was never merged. It is still relevant: v2.0.0 shipped 2026-04-08 with the bug still present.

I confirmed PR #21636 fixes this bug

I extracted PR #21636's logic change (the full diff does not apply cleanly to v2.0.0 because v2.0.0 already landed PR #21918's prerequisite imports), hand-adapted it to v2.0.0's current ensure_invariants in crates/storage/provider/src/providers/static_file/manager.rs, rebuilt reth from source with default release features, and bootstrapped my previously-broken node. Results:

  • Startup check_consistency passes cleanly. No unwind_target=Unwind(0) warning, no panic.
  • Full pipeline cycle completes. Stages 1–14 advanced 24850381 → 24851130 (the snapshot tip + new blocks). Specifically the SenderRecovery stage (stage 3/14) ran in 17 seconds and wrote 275,698 sender rows to static_file_transaction-senders static files (498 writer opens, 497 commits, verified via reth_static_files_jar_provider_calls_total{segment="transaction-senders",operation="append"}). This is the exact code path that previously panicked the node at runtime.
  • Second pipeline cycle immediately started. Headers stage advanced to block 24858908 (7778 more blocks). eth_blockNumber advanced from 0x17b2fcd (24850381) to 0x17b32ba (24851130) and continues climbing.
  • Zero panics in reth.daemon.err.log after 30+ minutes of continuous operation across two full pipeline cycles.

The adapted diff I applied (the minimum change for v2.0.0) is:

     fn ensure_invariants_for<Provider>(
         ...
         where
-            Provider: DBProvider + BlockReader + StageCheckpointReader,
+            Provider: DBProvider + BlockReader + StageCheckpointReader + PruneCheckpointReader,
             N: NodePrimitives<Receipt: Value, BlockHeader: Value, SignedTx: Value>,

     fn ensure_invariants<Provider, T: Table<Key = u64>>(
         ...
         where
-            Provider: DBProvider + BlockReader + StageCheckpointReader,
+            Provider: DBProvider + BlockReader + StageCheckpointReader + PruneCheckpointReader,

         // inside ensure_invariants, replacing the naive comparison:
+        let prune_segment = match segment {
+            StaticFileSegment::TransactionSenders => Some(PruneSegment::SenderRecovery),
+            StaticFileSegment::Receipts => Some(PruneSegment::Receipts),
+            _ => None,
+        };
+        let effective_available_block = if let Some(ps) = prune_segment {
+            let prune_checkpoint_block =
+                provider.get_prune_checkpoint(ps)?.and_then(|c| c.block_number).unwrap_or(0);
+            std::cmp::max(highest_static_file_block, prune_checkpoint_block)
+        } else {
+            highest_static_file_block
+        };
-        if checkpoint_block_number > highest_static_file_block {
+        if checkpoint_block_number > effective_available_block {
             info!(
                 target: "reth::providers::static_file",
                 checkpoint_block_number,
-                unwind_target = highest_static_file_block,
+                unwind_target = effective_available_block,
                 ?segment,
                 "Setting unwind target."
             );
-            return Ok(Some(highest_static_file_block));
+            return Ok(Some(effective_available_block));
         }

29 insertions, 6 deletions, one file. Identical in behavior to the ensure_invariants portion of PR #21636 — I did NOT port the ensure_invariants_from_db portion of #21636 since v2.0.0 renamed that function to ensure_changeset_invariants_by_block and my bug doesn't exercise it. For a full upstream fix, that portion should also be ported.

Proposed fixes (in order of preference)

Option 1 (preferred): Reopen PR #21636 or cherry-pick its logic into main. The work is already done and I've verified it works on v2.0.0. PR #21636's approach (use max(highest_static_file_block, prune_checkpoint_block) as the effective "data available from" threshold) is cleaner than a skip-path extension because it also handles the Receipts variant of the same bug and any future prune-aware segment.

Option 2: Extend PR #21918's skip path to cover distance-pruned senders. Add a Distance branch to is_segment_fully_pruned (or introduce is_segment_in_non_static_file_storage). Simpler than #21636 but narrower — only fixes TransactionSenders, not Receipts.

Option 3: Migrate prune config from reth.toml to MDBX PruneCheckpoints before running check_consistency. Lets an operator recover by changing --prune.sender-recovery.distance to --prune.sender-recovery.full in their config and restarting, without hand-patching the binary. Useful as an escape hatch independent of the main fix.

Option 4: Add --skip-consistency-checks to reth db stage-checkpoints set and reth stage unwind. Gives operators a manual escape hatch to repair MDBX state without rebuilding reth. Currently --skip-consistency-checks exists only on reth db stats and is read-only. Even a one-time reth db stage-checkpoints set --skip-consistency-checks --stage sender-recovery --block-number 0 would have let me recover without the ~12 hours of diagnostic + rebuild work this session took.

Option 5 (minimum): Improve diagnostics. The current error — assertion left != right failed with both sides equal to 0 — gives operators nothing to act on. A log line that says "TransactionSenders static files are missing but the SenderRecovery stage checkpoint is at block N; if you are using distance pruning, this is likely a known bug — see issue #XXXXX" would meaningfully reduce diagnostic burden.

Environment

  • OS: macOS 25 (Darwin 25.3.0), Apple Silicon
  • Reth version: v2.0.0, commit eb4c15e5, built 2026-04-07
  • Chain: mainnet
  • Storage format: Storage v2 (--storage.v2)
  • Consensus client: Lighthouse v8.1.3 (healthy, not involved in the crash)
  • Datadir size: ~390 GB (210 GB MDBX + 23 GB RocksDB + remainder headers/transactions/receipts static files)
  • Available disk: 3.2 TB free on a 3.6 TB volume

Full reth node invocation (from startup script):

reth node \
  --datadir /Volumes/ETHDATA/reth \
  --storage.v2 \
  --http \
  --http.addr 127.0.0.1 \
  --http.api eth,net,web3 \
  --ws.addr 127.0.0.1 \
  --authrpc.addr 127.0.0.1 \
  --authrpc.port 8551 \
  --authrpc.jwtsecret /Volumes/ETHDATA/reth/jwt.hex \
  --metrics 127.0.0.1:9001 \
  --full \
  --prune.block-interval 100 \
  --prune.sender-recovery.distance 2000000 \
  --prune.transaction-lookup.distance 2000000 \
  --prune.receipts.distance 2000000 \
  --prune.account-history.distance 2000000 \
  --prune.storage-history.distance 2000000 \
  --prune.bodies.distance 2000000

Confirmed not a config typo or one-off: two independent snapshot downloads on the same hardware reproduced the same panic at the same assertion. The trigger is the first prune cycle or segment rotation after startup, not random corruption.
