
fix: fail fast when postage block is ahead of chain tip #5460

Open

martinconic wants to merge 2 commits into master from fix/4941-chainstate-self-heal

Conversation


@martinconic (Contributor) commented May 13, 2026

Checklist

  • I have read the coding guide.
  • My change requires a documentation update, and I have done it.
  • I have added tests to cover my changes.
  • I have filled out the description and linked the related issues.

Description

  • Refuses to start when the persisted chainstate.Block is ahead of
    the block number reported by blockchain-rpc-endpoint, with a
    clear error pointing at the RPC misconfiguration.
  • Replaces a 10-minute "syncing in progress" stall plus
    lightnode-shutdown / fullnode-init-failure with an immediate, actionable
    failure.

Why

The reporter on #4941 observed /stamps returning 503 "syncing in
progress" indefinitely, with /chainstate showing a stored block ~1.18M
blocks ahead of chainTip. The two symptoms are one bug: once
chainstate.Block > chainTip, the sync window computed by the postage
listener loop in pkg/postage/listener/listener.go always ends below
from, so the loop makes no progress; syncStatusFn never returns
done=true and /stamps stays in 503. After
postageSyncingStallingTimeout (10 min) the loop exits with
ErrPostageSyncingStalled; for light nodes this triggers
b.syncingStopped.Signal() and the node shuts down, while for full
nodes init fails.
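The stall can be reproduced in miniature. The sketch below is a
hypothetical simplification of the listener's window computation;
nextSyncWindow and its parameters are illustrative names, not the real
code in pkg/postage/listener/listener.go:

```go
package main

import "fmt"

// nextSyncWindow mimics the per-iteration window computation (simplified).
// from is the next block to process (stored chainstate.Block + 1), tip is
// the head reported by the RPC, and tail is a confirmation lag.
func nextSyncWindow(from, tip, tail uint64) (to uint64, ok bool) {
	if tip < tail {
		return 0, false
	}
	to = tip - tail
	if to < from {
		// Nothing to sync yet: the loop sleeps and retries. With a
		// persisted block ahead of the tip, it spins here forever.
		return 0, false
	}
	return to, true
}

func main() {
	// Stored block ~1.18M ahead of the tip, as in #4941:
	// no window is ever produced, so sync status never reports done.
	_, ok := nextSyncWindow(2_180_000, 1_000_000, 12)
	fmt.Println(ok) // false

	// Normal operation: from well below tip yields a window.
	to, ok := nextSyncWindow(100, 1_000, 12)
	fmt.Println(to, ok) // 988 true
}
```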

chainstate.Block is only ever advanced by UpdateBlockNumber, from
events the listener receives over the RPC. So a stored block ahead of
the current chain tip means the configured RPC is now serving a
different chain than it did on a previous run: a misrouted public
endpoint, a changed blockchain-rpc-endpoint, or a load balancer
pointing at the wrong backend. The chain-ID check at startup
(pkg/node/chain.go:109) does not catch this if the wrong backend
happens to report the configured chain ID. This is an RPC /
operational problem, not local DB corruption, and not something Bee
should auto-heal; a silent rebuild would mask the misconfiguration and
trigger long resyncs on every restart.

Change

Before batchSvc.Start runs, query chainBackend.BlockNumber(ctx)
once. If the stored chainstate.Block is strictly greater, return an
error naming both block numbers and explaining the likely cause and
the recovery path (verify the RPC, then --resync). If the
BlockNumber call itself fails, log a warning and continue — we don't
want to block startup on a transient RPC hiccup.

No tolerance is applied: the listener only ever writes
cs.Block <= blockNumber - tailSize, so under normal operation cs.Block is
strictly below the live tip. The only way the check can trip is the
misconfiguration scenario above.

Open API Spec Version Changes (if applicable)

Motivation and Context (Optional)

Related Issue (Optional)

Screenshots (if appropriate):

AI Disclosure

  • This PR contains code that has been generated by an LLM.
  • I have reviewed the AI generated code thoroughly.
  • I possess the technical expertise to responsibly review the code generated in this PR.

@martinconic martinconic self-assigned this May 13, 2026
@martinconic martinconic added this to the 2026 milestone May 13, 2026