Add ControllerEventQueueSizeGauge to detect a wedged ("zombie leader") controller #201
Open
LZD-PratyushBhatt wants to merge 1 commit into
The controller pipeline is drained by a single ClusterEventProcessor thread while the ZK session is kept alive by a separate ZK-client thread, so a wedged controller can keep leadership and a live ZK session yet stop processing events entirely. None of the existing controller signals (ZK session expiries, controllership-takeover failures, ZK op latency, WAGED internal failure) catch this, since the process stays up and remains leader.
Expose the DEFAULT cluster-event pipeline backlog as a per-cluster JMX gauge on ClusterStatusMonitor, updated from GenericHelixController on both the enqueue side (ZK-callback / periodic-rebalance threads) and the dequeue side (pipeline thread). A healthy controller keeps this near 0 (the queue dedups by event type, so depth saturates at the few distinct pending event types); a wedged controller lets it climb. EKG/alerts should gate on a sustained average above ~0.
Add unit tests for the gauge default and reversible setter.
Issues
Fixes #200
Description
What: Add a new per-cluster controller JMX gauge, ControllerEventQueueSizeGauge, on ClusterStatusMonitor, exposing the depth of the DEFAULT cluster-event pipeline queue (GenericHelixController._eventQueue).
Why: The controller pipeline is drained by a single ClusterEventProcessor thread while the ZK session is kept alive by a separate ZK-client thread. As a result, a wedged controller can keep leadership and a live ZK session yet stop processing events entirely. None of the existing controller signals (ZK session expiries, controllership-takeover failures, ZK op latency, WAGED internal failure) catch this, because the process stays up and remains leader. This "zombie leader" failure mode is currently invisible to monitoring.
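The failure mode described above can be reproduced in miniature. The following is a minimal sketch, not Helix code: the class name ZombieLeaderSketch, the thread names, and the plain LinkedBlockingQueue (no dedup) are all assumptions made for illustration. One thread stands in for the pipeline thread and blocks forever after taking its first event, while a second thread standing in for the session keep-alive keeps working, so a liveness check sees nothing wrong even though the backlog only grows.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative only: a wedged consumer next to a healthy keep-alive thread. */
final class ZombieLeaderSketch {

  /** Returns {keep-alive beats sent, events still queued}. */
  static int[] simulate(int heartbeats, int newEvents) {
    BlockingQueue<String> events = new LinkedBlockingQueue<>();
    CountDownLatch stuck = new CountDownLatch(1);  // signals "processor is wedged"
    CountDownLatch never = new CountDownLatch(1);  // never counted down
    AtomicInteger beats = new AtomicInteger();

    // Stand-in for the single pipeline thread: takes one event, then hangs.
    Thread processor = new Thread(() -> {
      try {
        events.take();
        stuck.countDown();
        never.await(); // simulates a pipeline stage that never returns
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "event-processor");
    processor.setDaemon(true);

    // Stand-in for the separate session keep-alive thread: unaffected by the wedge.
    Thread session = new Thread(() -> {
      for (int i = 0; i < heartbeats; i++) {
        beats.incrementAndGet();
      }
    }, "session-keepalive");

    try {
      processor.start();
      events.put("first");
      stuck.await(); // from here on, nothing drains the queue
      for (int i = 0; i < newEvents; i++) {
        events.put("event-" + i);
      }
      session.start();
      session.join();
      return new int[] {beats.get(), events.size()};
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      processor.interrupt(); // clean up the wedged thread
    }
  }

  public static void main(String[] args) {
    int[] result = simulate(3, 5);
    System.out.println("keep-alive beats = " + result[0]); // 3
    System.out.println("queued events    = " + result[1]); // 5
  }
}
```

The keep-alive count is the only signal a session-based monitor would see; the queue depth is the signal this PR proposes to export.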
How: Expose the DEFAULT cluster-event pipeline backlog as a per-cluster JMX gauge on ClusterStatusMonitor, updated from GenericHelixController on both the enqueue side (ZK-callback / periodic-rebalance threads, in enqueueEvent when the target is _eventQueue) and the dequeue side (the pipeline thread, in ClusterEventProcessor.run() after take() for the DEFAULT queue). Updates are guarded by _isMonitoring and a null check on the monitor.
A healthy controller keeps this near 0: the queue dedups by event type (DedupEventBlockingQueue), so depth saturates at the few distinct pending event types and the pipeline drains them quickly. A wedged controller lets it climb and stay elevated. EKG / alerts should therefore gate on a sustained average above ~0 (not MAX), with a small threshold.
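To make the enqueue/dequeue update and the dedup saturation concrete, here is a minimal sketch under stated assumptions. QueueGaugeSketch is a hypothetical stand-in, not the Helix implementation: it models the dedup queue as a map keyed by event type (assuming a newer event of the same type replaces the pending one), uses an AtomicLong for the gauge, and uses a boolean flag in place of the _isMonitoring / null-monitor guard. The event type strings are illustrative.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative only: a dedup-by-type queue that publishes its depth on both sides. */
final class QueueGaugeSketch {
  // Pending events keyed by type; re-adding a type replaces the pending payload.
  private final Map<String, String> pending = new LinkedHashMap<>();
  // Stand-in for the AtomicLong behind ControllerEventQueueSizeGauge.
  private final AtomicLong gauge = new AtomicLong(0);
  // Stand-in for the monitoring guard described in the PR.
  private final boolean monitoringEnabled;

  QueueGaugeSketch(boolean monitoringEnabled) {
    this.monitoringEnabled = monitoringEnabled;
  }

  /** Enqueue side: would be called from callback / periodic threads. */
  synchronized void enqueue(String eventType, String payload) {
    pending.put(eventType, payload);
    publishDepth();
  }

  /** Dequeue side: would be called by the single pipeline thread; null if empty. */
  synchronized String dequeue() {
    Iterator<Map.Entry<String, String>> it = pending.entrySet().iterator();
    if (!it.hasNext()) {
      return null;
    }
    String payload = it.next().getValue();
    it.remove();
    publishDepth();
    return payload;
  }

  // Publishes the current depth only when monitoring is enabled.
  private void publishDepth() {
    if (monitoringEnabled) {
      gauge.set(pending.size());
    }
  }

  long gaugeValue() {
    return gauge.get();
  }

  public static void main(String[] args) {
    QueueGaugeSketch q = new QueueGaugeSketch(true);

    // Healthy: every event is drained, so the gauge returns to 0.
    q.enqueue("IdealStateChange", "v0");
    q.dequeue();
    System.out.println("healthy depth = " + q.gaugeValue()); // 0

    // Wedged: nothing dequeues; 300 enqueues collapse to 3 distinct types.
    for (int i = 0; i < 100; i++) {
      q.enqueue("IdealStateChange", "v" + i);
      q.enqueue("LiveInstanceChange", "v" + i);
      q.enqueue("PeriodicalRebalance", "v" + i);
    }
    System.out.println("wedged depth  = " + q.gaugeValue()); // 3
  }
}
```

Because the wedged depth is small but persistent, a threshold on the sustained average separates it from the brief nonzero readings of a healthy controller, whereas a MAX-based rule could fire on those brief readings too.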
Sensor / attribute exposed: ObjectName ClusterStatus:cluster={cluster}, attribute ControllerEventQueueSizeGauge.
Files changed:
helix-core/.../monitoring/mbeans/ClusterStatusMonitor.java - new AtomicLong field, setter setControllerEventQueueSizeGauge(long), getter getControllerEventQueueSizeGauge().
helix-core/.../monitoring/mbeans/ClusterStatusMonitorMBean.java - getter declared on the MBean interface (auto-exports the JMX attribute).
helix-core/.../controller/GenericHelixController.java - updateControllerEventQueueSizeGauge() helper, invoked at enqueue and dequeue of the DEFAULT _eventQueue.
Tests
The following tests are written for this issue:
TestClusterStatusMonitor#testControllerEventQueueSizeGaugeStartsAtZero
TestClusterStatusMonitor#testControllerEventQueueSizeGaugeReflectsLatestSetter
The following is the result of the "mvn test" command on the appropriate module (mvn test -pl helix-core -Dtest=TestClusterStatusMonitor):
Changes that Break Backward Compatibility (Optional)
No backward-incompatible behavior changes. The only API surface change is additive: a new read-only getter getControllerEventQueueSizeGauge() on the ClusterStatusMonitorMBean JMX interface. No existing method signatures are changed or removed, and the sole implementor (ClusterStatusMonitor) is updated in this PR. The new value defaults to 0 until the controller reports it, so existing consumers and dashboards are unaffected.
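As a rough illustration of why a getter on the MBean interface is enough to export a read-only attribute that defaults to 0, consider the sketch below. It uses only the standard javax.management API; DemoMonitor, DemoMonitorMBean, and the cluster name demoCluster are hypothetical stand-ins, not the Helix classes. The setter lives on the implementation only, so JMX clients can read the attribute but not write it.

```java
import java.util.concurrent.atomic.AtomicLong;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.MBeanServerFactory;
import javax.management.ObjectName;

/** Illustrative only: a standard MBean exposing one read-only gauge attribute. */
final class GaugeMBeanSketch {

  /** Hypothetical MBean interface; declaring the getter exports the attribute. */
  public interface DemoMonitorMBean {
    long getControllerEventQueueSizeGauge();
  }

  /** Hypothetical implementor; the setter is not part of the JMX surface. */
  public static class DemoMonitor implements DemoMonitorMBean {
    private final AtomicLong queueSize = new AtomicLong(0);

    public void setControllerEventQueueSizeGauge(long size) {
      queueSize.set(size);
    }

    @Override
    public long getControllerEventQueueSizeGauge() {
      return queueSize.get();
    }
  }

  /** Reads the attribute the way a JMX client or metrics scraper would. */
  static long readGauge(MBeanServer server, ObjectName name) throws JMException {
    return (Long) server.getAttribute(name, "ControllerEventQueueSizeGauge");
  }

  /** Returns {initial value, value after set(5), value after set(2)}. */
  static long[] demo() {
    try {
      MBeanServer server = MBeanServerFactory.newMBeanServer();
      ObjectName name = new ObjectName("ClusterStatus:cluster=demoCluster");
      DemoMonitor monitor = new DemoMonitor();
      server.registerMBean(monitor, name);

      long initial = readGauge(server, name);
      monitor.setControllerEventQueueSizeGauge(5);
      long raised = readGauge(server, name);
      monitor.setControllerEventQueueSizeGauge(2); // a gauge can go down as well as up
      long lowered = readGauge(server, name);
      return new long[] {initial, raised, lowered};
    } catch (JMException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    long[] values = demo();
    System.out.println("initial = " + values[0]); // 0
    System.out.println("raised  = " + values[1]); // 5
    System.out.println("lowered = " + values[2]); // 2
  }
}
```

The three reads mirror what the two unit tests named above check: the gauge starts at zero, and it reflects the most recent setter call in either direction.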
Documentation (Optional)
Not applicable. The change adds a single self-describing JMX gauge (documented via Javadoc on the
MBean interface); no public wiki change is required.
Commits
In addition, my commits follow the guidelines from "How to write a good git commit message":
Code Quality
(helix-style-intellij.xml if IntelliJ IDE is used)