Add ControllerEventQueueSizeGauge to detect a wedged ("zombie leader") controller by LZD-PratyushBhatt · Pull Request #201 · linkedin/helix

LZD-PratyushBhatt · 2026-06-18T08:50:40Z

Issues

My PR addresses the following Helix issues and references them in the PR description:

Fixes #200

Description

Here are some details about my PR:

What: Add a new per-cluster controller JMX gauge, ControllerEventQueueSizeGauge, on
ClusterStatusMonitor, exposing the depth of the DEFAULT cluster-event pipeline queue
(GenericHelixController._eventQueue).

Why: The controller pipeline is drained by a single ClusterEventProcessor thread while
the ZK session is kept alive by a separate ZK-client thread. As a result, a wedged controller
can keep leadership and a live ZK session yet stop processing events entirely. None of the
existing controller signals (ZK session expiries, controllership-takeover failures, ZK op
latency, WAGED internal failure) catch this, because the process stays up and remains leader.
This "zombie leader" failure mode is currently invisible to monitoring.

How: Expose the DEFAULT cluster-event pipeline backlog as a per-cluster JMX gauge on
ClusterStatusMonitor, updated from GenericHelixController on both the enqueue side
(ZK-callback / periodic-rebalance threads, in enqueueEvent when the target is _eventQueue)
and the dequeue side (the pipeline thread, in ClusterEventProcessor.run() after take() for
the DEFAULT queue). Updates are guarded by _isMonitoring and a null check on the monitor.
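A minimal sketch of the update pattern described above, not the actual Helix code. MonitorSketch and ControllerSketch are hypothetical stand-ins for ClusterStatusMonitor and GenericHelixController; the sketch uses a plain LinkedBlockingQueue and a non-blocking poll() where the real pipeline thread calls take() on a DedupEventBlockingQueue.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for the monitor: only the gauge setter and getter. */
class MonitorSketch {
  private final AtomicLong controllerEventQueueSizeGauge = new AtomicLong(0);

  void setControllerEventQueueSizeGauge(long size) {
    controllerEventQueueSizeGauge.set(size);
  }

  long getControllerEventQueueSizeGauge() {
    return controllerEventQueueSizeGauge.get();
  }
}

/** Hypothetical stand-in for the controller's handling of the DEFAULT queue. */
class ControllerSketch {
  private final BlockingQueue<String> eventQueue = new LinkedBlockingQueue<>();
  private final boolean isMonitoring;
  private final MonitorSketch monitor; // may be null when monitoring is not set up

  ControllerSketch(boolean isMonitoring, MonitorSketch monitor) {
    this.isMonitoring = isMonitoring;
    this.monitor = monitor;
  }

  /** Enqueue side: called from callback or periodic-rebalance threads. */
  void enqueueEvent(String event) {
    eventQueue.offer(event);
    updateControllerEventQueueSizeGauge();
  }

  /** Dequeue side: the sketch polls instead of blocking; returns null if empty. */
  String pollEvent() {
    String event = eventQueue.poll();
    updateControllerEventQueueSizeGauge();
    return event;
  }

  /** Publishes the current depth, guarded by the monitoring flag and a null check. */
  private void updateControllerEventQueueSizeGauge() {
    if (isMonitoring && monitor != null) {
      monitor.setControllerEventQueueSizeGauge(eventQueue.size());
    }
  }
}

class QueueGaugeDemo {
  public static void main(String[] args) {
    MonitorSketch monitor = new MonitorSketch();
    ControllerSketch controller = new ControllerSketch(true, monitor);
    controller.enqueueEvent("IdealStateChange");
    controller.enqueueEvent("LiveInstanceChange");
    System.out.println("depth after enqueue: " + monitor.getControllerEventQueueSizeGauge());
    controller.pollEvent();
    System.out.println("depth after dequeue: " + monitor.getControllerEventQueueSizeGauge());
  }
}
```

Updating on both sides matters for the wedged case: if the pipeline thread stops dequeuing, the enqueue-side updates alone keep the published depth current.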

A healthy controller keeps this near 0: the queue dedups by event type
(DedupEventBlockingQueue), so depth saturates at the few distinct pending event types and the
pipeline drains them quickly. A wedged controller lets it climb and stay elevated. EKG / alerts
should therefore gate on a sustained average above ~0 (not MAX), with a small threshold.
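One way to express "sustained average, not MAX" is a fixed-size sliding window over gauge samples. The sketch below is illustrative only: SustainedAverageAlert, the window size, and the threshold are assumptions, not part of this PR or of any EKG configuration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical alert helper: fires when the windowed average exceeds a threshold. */
class SustainedAverageAlert {
  private final Deque<Long> window = new ArrayDeque<>();
  private final int windowSize;
  private final double threshold;
  private long sum;

  SustainedAverageAlert(int windowSize, double threshold) {
    if (windowSize <= 0) {
      throw new IllegalArgumentException("windowSize must be positive");
    }
    this.windowSize = windowSize;
    this.threshold = threshold;
  }

  /** Records one gauge sample and reports whether the alert condition holds. */
  boolean record(long sample) {
    window.addLast(sample);
    sum += sample;
    if (window.size() > windowSize) {
      sum -= window.removeFirst();
    }
    // Evaluate only once the window is full, so the first few samples
    // on their own cannot fire the alert.
    return window.size() == windowSize && average() > threshold;
  }

  /** Average of the samples currently in the window (0.0 when empty). */
  double average() {
    return window.isEmpty() ? 0.0 : (double) sum / window.size();
  }
}

class SustainedAverageDemo {
  public static void main(String[] args) {
    SustainedAverageAlert healthy = new SustainedAverageAlert(5, 1.0);
    boolean healthyFired = false;
    for (long sample : new long[] {0, 0, 4, 0, 0}) { // brief burst, then drained
      healthyFired |= healthy.record(sample);
    }
    SustainedAverageAlert wedged = new SustainedAverageAlert(5, 1.0);
    boolean wedgedFired = false;
    for (long sample : new long[] {3, 3, 4, 4, 5}) { // backlog never drains
      wedgedFired |= wedged.record(sample);
    }
    System.out.println("healthy fired: " + healthyFired + ", wedged fired: " + wedgedFired);
  }
}
```

In the healthy series a MAX-based rule with the same threshold would trip on the single sample of 4, while the windowed average (0.8) stays below it. A large enough burst can still lift the average, so the window and threshold need tuning per cluster.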

Sensor / attribute exposed: ObjectName ClusterStatus:cluster={cluster}, attribute
ControllerEventQueueSizeGauge.
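For readers who want to poke at the attribute by hand, here is a hedged sketch of reading it through the standard JMX API. It registers a hypothetical stand-in MBean (GaugeView, backed by an AtomicLong) in the local platform MBeanServer under the ObjectName pattern above; against a real controller you would connect to that JVM's MBeanServer instead, and the cluster name "demoCluster" is made up.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.StandardMBean;

class JmxGaugeSketch {
  /** Hypothetical management interface; JMX derives the attribute name from the getter. */
  public interface GaugeView {
    long getControllerEventQueueSizeGauge();
  }

  /** Builds the ObjectName following the pattern ClusterStatus:cluster={cluster}. */
  static ObjectName nameFor(String cluster) {
    try {
      return new ObjectName("ClusterStatus:cluster=" + cluster);
    } catch (JMException e) {
      throw new IllegalArgumentException(e);
    }
  }

  /** Registers a stand-in MBean whose gauge reads from the given counter. */
  static void register(String cluster, AtomicLong depth) {
    MBeanServer server = ManagementFactory.getPlatformMBeanServer();
    ObjectName name = nameFor(cluster);
    try {
      if (server.isRegistered(name)) {
        server.unregisterMBean(name); // keep the sketch re-runnable
      }
      GaugeView view = depth::get;
      server.registerMBean(new StandardMBean(view, GaugeView.class), name);
    } catch (JMException e) {
      throw new IllegalStateException(e);
    }
  }

  /** Reads the gauge attribute by name, as a JMX client or scraper would. */
  static long readGauge(String cluster) {
    MBeanServer server = ManagementFactory.getPlatformMBeanServer();
    try {
      Object value = server.getAttribute(nameFor(cluster), "ControllerEventQueueSizeGauge");
      return ((Number) value).longValue();
    } catch (JMException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    AtomicLong depth = new AtomicLong(0);
    register("demoCluster", depth);
    depth.set(3);
    System.out.println("gauge = " + readGauge("demoCluster"));
  }
}
```

The sketch wraps the implementation in StandardMBean only to sidestep the ClassNameMBean naming convention; the PR itself declares the getter on ClusterStatusMonitorMBean.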

Files changed:

helix-core/.../monitoring/mbeans/ClusterStatusMonitor.java - new AtomicLong field, setter
setControllerEventQueueSizeGauge(long), getter getControllerEventQueueSizeGauge().
helix-core/.../monitoring/mbeans/ClusterStatusMonitorMBean.java - getter declared on the MBean
interface (auto-exports the JMX attribute).
helix-core/.../controller/GenericHelixController.java - updateControllerEventQueueSizeGauge()
helper, invoked at enqueue and dequeue of the DEFAULT _eventQueue.

Tests

The following tests are written for this issue:
TestClusterStatusMonitor#testControllerEventQueueSizeGaugeStartsAtZero
TestClusterStatusMonitor#testControllerEventQueueSizeGaugeReflectsLatestSetter
The following is the result of the "mvn test" command on the appropriate module
(mvn test -pl helix-core -Dtest=TestClusterStatusMonitor):

[INFO] Running org.apache.helix.monitoring.mbeans.TestClusterStatusMonitor
[INFO] Tests run: 22, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.226 s - in org.apache.helix.monitoring.mbeans.TestClusterStatusMonitor
[INFO] Tests run: 22, Failures: 0, Errors: 0, Skipped: 0
[INFO] BUILD SUCCESS

Changes that Break Backward Compatibility (Optional)

My PR contains changes that break backward compatibility or previous assumptions for certain methods or API.

No backward-incompatible behavior changes. The only API surface change is additive: a new
read-only getter getControllerEventQueueSizeGauge() on the ClusterStatusMonitorMBean JMX
interface. No existing method signatures are changed or removed, and the sole implementor
(ClusterStatusMonitor) is updated in this PR. The new value defaults to 0 until the controller
reports it, so existing consumers and dashboards are unaffected.

Documentation (Optional)

In case of new functionality, my PR adds documentation in the following wiki page:

Not applicable. The change adds a single self-describing JMX gauge (documented via Javadoc on the
MBean interface); no public wiki change is required.

Commits

My commits all reference appropriate Apache Helix GitHub issues in their subject lines.
In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference)
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Note: the current commit subject is
Add ControllerEventQueueSizeGauge to detect a wedged ("zombie leader") controller.
It uses the imperative mood, has a blank line before the body, does not end with a period, and
the body explains what/why. It does not yet reference #200 and exceeds 50 characters. If strict
compliance is required, amend it before merge to, e.g.,
"Fix #200: add ControllerEventQueueSizeGauge for wedged controllers".

Code Quality

My diff has been formatted using helix-style.xml
(helix-style-intellij.xml if IntelliJ IDE is used)

…) controller

The controller pipeline is drained by a single ClusterEventProcessor thread while the ZK session is kept alive by a separate ZK-client thread, so a wedged controller can keep leadership and a live ZK session yet stop processing events entirely. None of the existing controller signals (ZK session expiries, controllership-takeover failures, ZK op latency, WAGED internal failure) catch this, since the process stays up and leader.

Expose the DEFAULT cluster-event pipeline backlog as a per-cluster JMX gauge on ClusterStatusMonitor, updated from GenericHelixController on both the enqueue side (ZK-callback / periodic-rebalance threads) and the dequeue side (pipeline thread).

A healthy controller keeps this near 0 (the queue dedups by event type, so depth saturates at the few distinct pending event types); a wedged controller lets it climb. EKG/alerts should gate on a sustained average above ~0.

Add unit tests for the gauge default and reversible setter.

LZD-PratyushBhatt requested review from PranaviAncha, arkmish, kabragaurav, laxman-ch, ngngwr, proud-parselmouth and thestreak101 as code owners June 18, 2026 08:50

LZD-PratyushBhatt wants to merge 1 commit into dev from lzd/controllerEventQueueMetric
