feat(metrics): Add CLF Ready condition alert#3291
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (3)
✅ Files skipped from review due to trivial changes (1)
📝 WalkthroughWalkthroughAdds a new Prometheus alert rule ClusterLogForwarderNotReady (fires when log_forwarder_ready{status="False"} == 1 for 1 minute) to both the bundled PrometheusRule manifest and the source Prometheus config, and documents the alert in the collector metrics documentation. ChangesCluster Log Forwarder Alert Rule
Estimated Code Review Effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
/assign @jcantrill |
|
/hold |
jcantrill
left a comment
There was a problem hiding this comment.
Consider doing a quick search of our documentation here to see if there is an update to be made.
| runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-logging-operator/ClusterLogForwarderNotReady.md | ||
| summary: |- | ||
| The ClusterLogForwarder {{ $labels.resource_namespace }}/{{ $labels.resource_name }} has been in a not ready state | ||
| for more than 5 minutes. This could indicate a validation error in the ClusterLogForwarder spec or a deployment |
There was a problem hiding this comment.
Can we state there is a deployment error of the collector? We don't check the actual deployment of the pods to be able to surface this information.
@r2d2rnd I was actually wondering if we may have some "false" security here with regards to "Ready". This feature was created to mainly identify CLF validation errors that occurs after admission of the CLF where an admin does not get immediate feedback. Do we need to reframe this alert or rename it to clarify what it will actually expose?
There was a problem hiding this comment.
Dropped "deployment error"
Add ClusterLogForwarderNotReady alert that fires when a CLF has been in a not ready state for more than 1 minute. This typically indicates a validation error in the ClusterLogForwarder spec. Severity is error and includes a runbook URL for mitigation guidance. Depends on LOG-7718 which adds the log_forwarder_ready metric. Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
/test e2e-target |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jcantrill, vparfonov The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/hold cancel |
|
/label verified by @jcantrill |
|
/lgtm |
|
/verified by @jcantrill |
|
@jcantrill: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@vparfonov: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Description
Add
ClusterLogForwarderNotReadyalert that fires when a CLF has been in a not ready state for more than 1 minutes.Depends on LOG-7718 which adds the
log_forwarder_readymetric./cc
/assign
Links
Summary by CodeRabbit
New Features
Documentation