Add ControllerEventQueueSizeGauge to detect a wedged ("zombie leader") controller #201

Open
LZD-PratyushBhatt wants to merge 1 commit into dev from lzd/controllerEventQueueMetric
LZD-PratyushBhatt (Collaborator) commented Jun 18, 2026

Issues

  • My PR addresses the following Helix issues and references them in the PR description:

Fixes #200

Description

  • Here are some details about my PR:

What: Add a new per-cluster controller JMX gauge, ControllerEventQueueSizeGauge, on
ClusterStatusMonitor, exposing the depth of the DEFAULT cluster-event pipeline queue
(GenericHelixController._eventQueue).

Why: The controller pipeline is drained by a single ClusterEventProcessor thread, while
the ZK session is kept alive by a separate ZK-client thread. As a result, a wedged controller
can keep leadership and a live ZK session yet stop processing events entirely. None of the
existing controller signals (ZK session expiries, controllership-takeover failures, ZK op
latency, WAGED internal failure) catches this, because the process stays up and remains leader.
This "zombie leader" failure mode is currently invisible to monitoring.

How: Expose the DEFAULT cluster-event pipeline backlog as a per-cluster JMX gauge on
ClusterStatusMonitor, updated from GenericHelixController on both the enqueue side
(ZK-callback / periodic-rebalance threads, in enqueueEvent when the target is _eventQueue)
and the dequeue side (the pipeline thread, in ClusterEventProcessor.run() after take() for
the DEFAULT queue). Updates are guarded by _isMonitoring and a null check on the monitor.
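
The enqueue/dequeue update described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: QueueDepthGaugeSketch, MonitorStub, and the event strings are hypothetical stand-ins, and a plain LinkedBlockingQueue replaces the deduplicating queue for brevity.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only: every name here is hypothetical, not Helix code.
class QueueDepthGaugeSketch {
  // Stand-in for the monitor that owns the gauge; the reference may be null.
  static final class MonitorStub {
    private final AtomicLong gauge = new AtomicLong(0);
    void setGauge(long depth) { gauge.set(depth); }
    long getGauge() { return gauge.get(); }
  }

  // Plain queue for brevity; the real DEFAULT queue dedups by event type.
  private final BlockingQueue<String> eventQueue = new LinkedBlockingQueue<>();
  private final MonitorStub monitor;
  private volatile boolean isMonitoring = true;

  QueueDepthGaugeSketch(MonitorStub monitor) { this.monitor = monitor; }

  // Enqueue side: called from callback or periodic-rebalance threads.
  void enqueueEvent(String event) {
    eventQueue.add(event);
    updateGauge();
  }

  // Dequeue side: called from the single pipeline thread; blocks when empty.
  String takeEvent() {
    try {
      String event = eventQueue.take();
      updateGauge();
      return event;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for an event", e);
    }
  }

  // Publish the current depth only when monitoring is on and a monitor exists.
  private void updateGauge() {
    if (isMonitoring && monitor != null) {
      monitor.setGauge(eventQueue.size());
    }
  }

  public static void main(String[] args) {
    MonitorStub monitor = new MonitorStub();
    QueueDepthGaugeSketch controller = new QueueDepthGaugeSketch(monitor);
    controller.enqueueEvent("eventA");
    controller.enqueueEvent("eventB");
    System.out.println("depth after enqueue: " + monitor.getGauge());
    controller.takeEvent();
    System.out.println("depth after one take: " + monitor.getGauge());
  }
}
```

Updating on both sides means the gauge still moves when the pipeline thread stops taking events, which is the case this metric is meant to surface.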

A healthy controller keeps this near 0: the queue dedups by event type
(DedupEventBlockingQueue), so depth saturates at the few distinct pending event types and the
pipeline drains them quickly. A wedged controller lets it climb and stay elevated. EKG / alerts
should therefore gate on a sustained average above a small threshold near 0, not on MAX.
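
To illustrate why a sustained average is a better gate than MAX, here is a small sketch of a fixed-window average check. The class name, window size, and threshold are assumptions chosen for illustration; they are not values from this PR or from any alerting system.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch only: window size and threshold are assumed values.
class SustainedAverageAlertSketch {
  private final int windowSize;
  private final double threshold;
  private final Deque<Long> window = new ArrayDeque<>();
  private long sum = 0;

  SustainedAverageAlertSketch(int windowSize, double threshold) {
    this.windowSize = windowSize;
    this.threshold = threshold;
  }

  // Records one gauge sample. Returns true only when the window is full
  // and the average over the window exceeds the threshold.
  boolean record(long sample) {
    window.addLast(sample);
    sum += sample;
    if (window.size() > windowSize) {
      sum -= window.removeFirst();
    }
    return window.size() == windowSize && (double) sum / windowSize > threshold;
  }

  public static void main(String[] args) {
    SustainedAverageAlertSketch spike = new SustainedAverageAlertSketch(5, 1.0);
    boolean spikeFired = false;
    for (long sample : new long[] {0, 0, 4, 0, 0}) {
      spikeFired |= spike.record(sample);
    }
    System.out.println("brief spike fired: " + spikeFired);

    SustainedAverageAlertSketch wedged = new SustainedAverageAlertSketch(5, 1.0);
    boolean wedgedFired = false;
    for (long sample : new long[] {3, 3, 3, 3, 3}) {
      wedgedFired |= wedged.record(sample);
    }
    System.out.println("sustained backlog fired: " + wedgedFired);
  }
}
```

In this sketch a single sample of 4 inside an otherwise idle window averages 0.8 and stays below the threshold, while a backlog that holds at 3 for the whole window averages 3.0 and trips it.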

Sensor / attribute exposed: ObjectName ClusterStatus:cluster={cluster}, attribute
ControllerEventQueueSizeGauge.

Files changed:

  • helix-core/.../monitoring/mbeans/ClusterStatusMonitor.java - new AtomicLong field, setter
    setControllerEventQueueSizeGauge(long), getter getControllerEventQueueSizeGauge().
  • helix-core/.../monitoring/mbeans/ClusterStatusMonitorMBean.java - getter declared on the MBean
    interface (auto-exports the JMX attribute).
  • helix-core/.../controller/GenericHelixController.java - updateControllerEventQueueSizeGauge()
    helper, invoked at enqueue and dequeue of the DEFAULT _eventQueue.
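
For readers unfamiliar with standard MBeans, the sketch below shows how a getter declared on an MBean interface becomes a readable JMX attribute. The ObjectName pattern and attribute name follow the description above; JmxGaugeSketch, QueueGauge, QueueGaugeMBean, and the cluster name are hypothetical and do not reflect the actual ClusterStatusMonitor code.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Illustrative sketch only: class and interface names are hypothetical.
class JmxGaugeSketch {
  // Standard MBean convention: the interface is public and is named
  // <ImplementationClass>MBean. Each getter becomes a read-only attribute.
  public interface QueueGaugeMBean {
    long getControllerEventQueueSizeGauge();
  }

  public static class QueueGauge implements QueueGaugeMBean {
    private final AtomicLong value = new AtomicLong(0);

    // Not declared on the interface, so it is not exposed through JMX.
    public void setControllerEventQueueSizeGauge(long depth) {
      value.set(depth);
    }

    @Override
    public long getControllerEventQueueSizeGauge() {
      return value.get();
    }
  }

  // Registers a gauge for the cluster, reports a depth, and reads it back
  // through the MBean server the way a JMX client would.
  static long readGauge(String cluster, long reportedDepth) {
    try {
      MBeanServer server = ManagementFactory.getPlatformMBeanServer();
      ObjectName name = new ObjectName("ClusterStatus:cluster=" + cluster);
      QueueGauge gauge = new QueueGauge();
      server.registerMBean(gauge, name);
      try {
        gauge.setControllerEventQueueSizeGauge(reportedDepth);
        return (Long) server.getAttribute(name, "ControllerEventQueueSizeGauge");
      } finally {
        server.unregisterMBean(name);
      }
    } catch (Exception e) {
      throw new IllegalStateException("JMX round trip failed", e);
    }
  }

  public static void main(String[] args) {
    System.out.println("gauge via JMX: " + readGauge("sketchCluster", 7));
  }
}
```

Because the setter stays off the interface, JMX clients can read the value but cannot write it, which matches the read-only getter described in this PR.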

Tests

  • The following tests are written for this issue:

  • TestClusterStatusMonitor#testControllerEventQueueSizeGaugeStartsAtZero

  • TestClusterStatusMonitor#testControllerEventQueueSizeGaugeReflectsLatestSetter

  • The following is the result of the "mvn test" command on the appropriate module
    (mvn test -pl helix-core -Dtest=TestClusterStatusMonitor):

[INFO] Running org.apache.helix.monitoring.mbeans.TestClusterStatusMonitor
[INFO] Tests run: 22, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.226 s - in org.apache.helix.monitoring.mbeans.TestClusterStatusMonitor
[INFO] Tests run: 22, Failures: 0, Errors: 0, Skipped: 0
[INFO] BUILD SUCCESS

Changes that Break Backward Compatibility (Optional)

  • My PR contains changes that break backward compatibility or previous assumptions for certain methods or API.

No backward-incompatible behavior changes. The only API surface change is additive: a new
read-only getter getControllerEventQueueSizeGauge() on the ClusterStatusMonitorMBean JMX
interface. No existing method signatures are changed or removed, and the sole implementor
(ClusterStatusMonitor) is updated in this PR. The new value defaults to 0 until the controller
reports it, so existing consumers and dashboards are unaffected.

Documentation (Optional)

  • In case of new functionality, my PR adds documentation in the following wiki page:

Not applicable. The change adds a single self-describing JMX gauge (documented via Javadoc on the
MBean interface); no public wiki change is required.

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines.
    In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Note: the current commit subject is
"Add ControllerEventQueueSizeGauge to detect a wedged ("zombie leader") controller".
It uses the imperative mood, has a blank line before the body, does not end with a period, and
the body explains what/why. It does not yet reference #200 and exceeds 50 characters. If strict
compliance is required, amend it before merge to, e.g.,
"Fix #200: add ControllerEventQueueSizeGauge for wedged controllers".

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

…) controller

The controller pipeline is drained by a single ClusterEventProcessor thread while the
ZK session is kept alive by a separate ZK-client thread, so a wedged controller can
keep leadership and a live ZK session yet stop processing events entirely. None of the
existing controller signals (ZK session expiries, controllership-takeover failures, ZK
op latency, WAGED internal failure) catch this, since the process stays up and remains leader.

Expose the DEFAULT cluster-event pipeline backlog as a per-cluster JMX gauge on
ClusterStatusMonitor, updated from GenericHelixController on both the enqueue side
(ZK-callback / periodic-rebalance threads) and the dequeue side (pipeline thread). A
healthy controller keeps this near 0 (the queue dedups by event type, so depth
saturates at the few distinct pending event types); a wedged controller lets it climb.
EKG/alerts should gate on a sustained average above ~0.

Add unit tests for the gauge default and reversible setter.