Skip to content

fix: startup exception being silently swallowed when journal replay fails#4815

Open
nodece wants to merge 2 commits into
apache:masterfrom
nodece:fix/startup-exception-propagation
Open

fix: startup exception being silently swallowed when journal replay fails#4815
nodece wants to merge 2 commits into
apache:masterfrom
nodece:fix/startup-exception-propagation

Conversation

@nodece

@nodece nodece commented Jun 9, 2026

Copy link
Copy Markdown
Member

Summary

When BookieImpl.start() fails during journal replay (e.g. missing recovery log), the exception is silently swallowed and the bookie appears to be running despite being in a broken, partially-initialized state.

Root cause: Two issues conspire to hide the failure:

  1. BookieServer.start() detects !bookie.isRunning() but silently returns instead of throwing an exception. The method declares throws IOException but never uses it on the failure path.

  2. BookieImpl.shutdown() skips setting this.exitCode when isRunning() is false (which is always the case before initState() is called at line 756). So getExitCode() returns ExitCode.OK (0) even on failure.

Effect: The exception is caught by AbstractLifecycleComponent.start() and dispatched to the uncaughtExceptionHandler (set by ComponentStarter), which triggers ComponentShutdownHook for cleanup. However, without the IOException being thrown from BookieServer.start(), the exception path relies entirely on the handler — and the exit code was incorrectly OK(0).

Changes

  • BookieServer.start(): returnthrow new IOException("Bookie failed to start, exit code: " + exitCode) — ensures the exception properly propagates to the lifecycle framework's exception handler.
  • BookieImpl.shutdown(): Move this.exitCode = exitCode outside the if (isRunning()) guard — ensures the exit code is always set, even during early startup before initState().

Complementary to #4740 which fixed the root cause of journal file deletion (lastMark regression). This PR fixes the consequence: if a journal file is missing for any reason, the bookie should fail fast with a proper error.

@void-ptr974

Copy link
Copy Markdown
Contributor

The fix looks reasonable to me. Could you please add tests to cover the startup failure path and verify that the expected non-OK exit code is preserved?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants