CNV-87529: management: add alert rule classification system#976
Conversation
|
@sradco: This pull request references CNV-80608 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: sradco The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
@sradco: This pull request references CNV-87529 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/retest |
|
@sradco: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
| - Core control plane components (renamed to layer `cluster`) | ||
| - Workload/namespace-level alerts (renamed to layer `namespace`) |
There was a problem hiding this comment.
Avoid using rename since alerts/rules have names:
| - Core control plane components (renamed to layer `cluster`) | |
| - Workload/namespace-level alerts (renamed to layer `namespace`) | |
| - Core control plane components (layer set to `cluster`) | |
| - Workload/namespace-level alerts (layer set to `namespace`) |
There was a problem hiding this comment.
Good catch, I applied the suggestion. Updated to "layer set to" to avoid confusion with alert/rule names.
|
|
||
| // ValidateLayer returns true if the provided layer is one of the allowed values. | ||
| func ValidateLayer(layer string) bool { | ||
| _, ok := allowedLayers[strings.ToLower(strings.TrimSpace(layer))] |
There was a problem hiding this comment.
the allowed layers calls ToLower in a number of places, is there a reason that we should accept "Cluster" as a layer value instead of requiring "cluster"
There was a problem hiding this comment.
+1. I removed the ToLower/TrimSpace from ValidateLayer , it now requires exact lowercase (cluster or namespace).
The normalization in buildClassificationLabels was also removed since validation already ensures correctness.
This makes the API strict at the boundary.
| if req.Layer != nil && !classification.ValidateLayer(*req.Layer) { | ||
| return &ValidationError{Message: fmt.Sprintf("invalid layer %q (allowed: cluster, namespace)", *req.Layer)} | ||
| } | ||
| if req.ComponentFrom != nil { |
There was a problem hiding this comment.
If req.Component is a whitespace only string then the validation error's. If req.ComponentFrom is a whitespace only string then the validation does NOT error. Is there a reason for the different handling of the 2 fields?
There was a problem hiding this comment.
No good reason for the difference, fixed.
ComponentFrom and LayerFrom now reject whitespace-only values with a clear error message ("must not be empty or whitespace-only; set to null to remove"), consistent with how Component rejects empty strings via ValidateComponent.
|
|
||
| if len(nextChanges) == 0 { | ||
| if found { | ||
| if err := c.k8sClient.AlertRelabelConfigs().Delete(ctx, arcNamespace, arcName); err != nil { |
There was a problem hiding this comment.
Since the len(nextChanges) is called without regard to existingRuleDrops, this code will always attempt to delete an ARC if there are no changes after applying the ARC, which may not be desired. An ARC that was manually created with a drop command on a specific label which does not currently exist on the PrometheusRule will be deleted even if the user wanted the ARC to exist to gate against that label being added in the future
There was a problem hiding this comment.
Good catch. Fixed. The ARC deletion check now also considers existingRuleDrops.
The condition is len(nextChanges) == 0 && len(existingRuleDrops) == 0 before deleting.
If there are existing Drop configs on the ARC (e.g., a disabled rule), the ARC is preserved even when classification produces no label changes.
| // layer_from label, the corresponding alert label value is used instead of the | ||
| // static default. Unresolvable or empty lookups fall back to the supplied | ||
| // defaults. | ||
| func ApplyDynamicClassification(ruleLabels, alertLabels map[string]string, defaultComponent, defaultLayer string) (string, string) { |
There was a problem hiding this comment.
I assume this will be used in a future PR for get AlertRules, just want to double check
There was a problem hiding this comment.
Correct. ApplyDynamicClassification is used in the get-alerts read path (get_alerts.go) to resolve _from labels at query time.
It's called from the alerts enrichment flow when building the API response.
The get-alerts PR is the next one in the stack.
| Annotations are intentionally ignored to reduce id churn on documentation-only changes. | ||
|
|
||
| ## Classification Logic (How component/layer are determined) | ||
| Location: `pkg/alertcomponent/matcher.go` |
There was a problem hiding this comment.
This code isn't included in this PR, which is fine, but I'm wondering how this will be implemented once it is. Will there be some kind of reconcile loop which watches for PrometheusRules, or will it only trigger an update when one of these API endpoints are called?
There was a problem hiding this comment.
No reconcile loop. Classification is triggered only through the API endpoints (PATCH single/bulk).
The matcher (pkg/alertcomponent/matcher.go) runs at read time during alert enrichment.
When GET /api/v1/alerting/alerts is called, the backend computes component/layer from rule labels and the matcher fallback logic, then applies any ARC-based overrides.
The matcher code itself is included in the base branch (main-alerts-management-api) and was introduced in an earlier PR in the stack.
| arc.Annotations = map[string]string{} | ||
| } | ||
| arc.Annotations[managementlabels.ARCAnnotationAlertRuleIDKey] = alertRuleId | ||
| if err := k8sClient.AlertRelabelConfigs().Update(ctx, *arc); err != nil { |
There was a problem hiding this comment.
Not needed for this PR, but is there any desire to have retry functionality occur when trying to update, or should that be something that is strictly driven by the frontend UI (ie. user needs to resubmit a form)
There was a problem hiding this comment.
Good question. For now, retry is driven by the frontend/client.
If an ARC update fails (e.g., conflict on resourceVersion), the client should re-fetch and resubmit. Backend retry with conflict resolution is worth considering for bulk operations at scale, but for MVP the client-driven approach is simpler and matches how other management endpoints (severity, drop/restore) work today.
| the owning `AlertingRule` CR is operator-managed (has operator owner references), the | ||
| backend cannot modify it directly (the operator would reconcile the change back). In | ||
| this case, label updates are applied through an ARC instead. When the `AlertingRule` CR | ||
| is not externally managed, label updates are written directly into the CR. Classification |
There was a problem hiding this comment.
I'm unclear on why we would want to swap between directly updating the AlertingRule CR and creating/updating an ARC through a single backend endpoint instead of always just always updating through the ARC when this endpoint is called.
I can image a scenario where the frontend has created a user-created and managed AR which has these component and layer labels, and then the user wants to update these labels. It is nice to have these validation layers for these labels specifically, but if a user manually updates the labels for the alert rather than using this endpoint then the validation will be lost.
It seems like it would make more sense to have this "update the ARC vs update the AR" decision happen on the frontend rather than the backend, and then have the update AR logic always perform the validation on these labels if they exist. That way the PATCH requests with the classification will only ever update ARC's and an update to an AR directly will still receive the validation
There was a problem hiding this comment.
Good feedback. I've clarified the doc.
The classification endpoint in this PR always uses the ARC path, regardless of whether the AlertingRule CR is operator-managed or not. It never writes directly to AlertingRule CRs.
The paragraph you flagged was describing the behavior of other management endpoints (severity, drop/restore) which may write to unmanaged AlertingRule CRs directly. I've rewritten it to make this distinction clear:
Classification overrides always use the ARC path regardless of AlertingRule management status — this endpoint never writes directly to AlertingRule CRs. (Other management endpoints, such as severity updates, may write to unmanaged AlertingRule CRs directly, but classification is ARC-only.)
This is intentional because classification labels are additive metadata that should not interfere with the rule definition itself, and the ARC path provides a consistent storage model for both platform and user-defined rules.
Add classification labels (component, layer) for alert rules with validation, ARC-based persistence, and bulk update support. Signed-off-by: Shirly Radco <[email protected]> Signed-off-by: João Vilaça <[email protected]> Signed-off-by: Aviv Litman <[email protected]> Co-authored-by: AI Assistant <[email protected]> Co-authored-by: Cursor <[email protected]>
8f0202e to
6268b33
Compare
Add classification labels (component,
layer) for alert rules with validation,
ARC-based persistence, and bulk update
support.
Signed-off-by: Shirly Radco [email protected]
Signed-off-by: João Vilaça [email protected]
Signed-off-by: Aviv Litman [email protected]
Co-authored-by: AI Assistant [email protected]
Made with Cursor