Skip to content

CNV-87529: management: add alert rule classification system#976

Open
sradco wants to merge 1 commit into
openshift:main-alerts-management-apifrom
sradco:alert-mgmt-restructured-04-classification
Open

CNV-87529: management: add alert rule classification system#976
sradco wants to merge 1 commit into
openshift:main-alerts-management-apifrom
sradco:alert-mgmt-restructured-04-classification

Conversation

@sradco

@sradco sradco commented Jun 8, 2026

Copy link
Copy Markdown

Add classification labels (component,
layer) for alert rules with validation,
ARC-based persistence, and bulk update
support.

Signed-off-by: Shirly Radco [email protected]
Signed-off-by: João Vilaça [email protected]
Signed-off-by: Aviv Litman [email protected]
Co-authored-by: AI Assistant [email protected]

Made with Cursor

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 8, 2026
@openshift-ci-robot

openshift-ci-robot commented Jun 8, 2026

Copy link
Copy Markdown

@sradco: This pull request references CNV-80608 which is a valid jira issue.

Details

In response to this:

Add classification labels (component,
layer) for alert rules with validation,
ARC-based persistence, and bulk update
support.

Signed-off-by: Shirly Radco [email protected]
Signed-off-by: João Vilaça [email protected]
Signed-off-by: Aviv Litman [email protected]
Co-authored-by: AI Assistant [email protected]

Made with Cursor

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot requested review from PeterYurkovich and zhuje June 8, 2026 14:23
@openshift-ci

openshift-ci Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sradco
Once this PR has been reviewed and has the lgtm label, please assign peteryurkovich for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sradco sradco changed the title CNV-80608: management: add alert rule classification system CNV-87529: management: add alert rule classification system Jun 8, 2026
@openshift-ci-robot

openshift-ci-robot commented Jun 8, 2026

Copy link
Copy Markdown

@sradco: This pull request references CNV-87529 which is a valid jira issue.

Details

In response to this:

Add classification labels (component,
layer) for alert rules with validation,
ARC-based persistence, and bulk update
support.

Signed-off-by: Shirly Radco [email protected]
Signed-off-by: João Vilaça [email protected]
Signed-off-by: Aviv Litman [email protected]
Co-authored-by: AI Assistant [email protected]

Made with Cursor

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@sradco

sradco commented Jun 9, 2026

Copy link
Copy Markdown
Author

/retest

@openshift-ci

openshift-ci Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

@sradco: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Comment thread docs/alert-rule-classification.md
Comment thread docs/alert-rule-classification.md Outdated
Comment on lines +49 to +50
- Core control plane components (renamed to layer `cluster`)
- Workload/namespace-level alerts (renamed to layer `namespace`)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid using rename since alerts/rules have names:

Suggested change
- Core control plane components (renamed to layer `cluster`)
- Workload/namespace-level alerts (renamed to layer `namespace`)
- Core control plane components (layer set to `cluster`)
- Workload/namespace-level alerts (layer set to `namespace`)

@sradco sradco Jun 16, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I applied the suggestion. Updated to "layer set to" to avoid confusion with alert/rule names.

Comment thread pkg/classification/validation.go Outdated

// ValidateLayer returns true if the provided layer is one of the allowed values.
func ValidateLayer(layer string) bool {
_, ok := allowedLayers[strings.ToLower(strings.TrimSpace(layer))]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the allowed layers calls ToLower in a number of places, is there a reason that we should accept "Cluster" as a layer value instead of requiring "cluster"

@sradco sradco Jun 16, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. I removed the ToLower/TrimSpace from ValidateLayer , it now requires exact lowercase (cluster or namespace).
The normalization in buildClassificationLabels was also removed since validation already ensures correctness.
This makes the API strict at the boundary.

if req.Layer != nil && !classification.ValidateLayer(*req.Layer) {
return &ValidationError{Message: fmt.Sprintf("invalid layer %q (allowed: cluster, namespace)", *req.Layer)}
}
if req.ComponentFrom != nil {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If req.Component is a whitespace only string then the validation error's. If req.ComponentFrom is a whitespace only string then the validation does NOT error. Is there a reason for the different handling of the 2 fields?

@sradco sradco Jun 16, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No good reason for the difference, fixed.
ComponentFrom and LayerFrom now reject whitespace-only values with a clear error message ("must not be empty or whitespace-only; set to null to remove"), consistent with how Component rejects empty strings via ValidateComponent.


if len(nextChanges) == 0 {
if found {
if err := c.k8sClient.AlertRelabelConfigs().Delete(ctx, arcNamespace, arcName); err != nil {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the len(nextChanges) is called without regard to existingRuleDrops, this code will always attempt to delete an ARC if there are no changes after applying the ARC, which may not be desired. An ARC that was manually created with a drop command on a specific label which does not currently exist on the PrometheusRule will be deleted even if the user wanted the ARC to exist to gate against that label being added in the future

@sradco sradco Jun 16, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed. The ARC deletion check now also considers existingRuleDrops.
The condition is len(nextChanges) == 0 && len(existingRuleDrops) == 0 before deleting.
If there are existing Drop configs on the ARC (e.g., a disabled rule), the ARC is preserved even when classification produces no label changes.

// layer_from label, the corresponding alert label value is used instead of the
// static default. Unresolvable or empty lookups fall back to the supplied
// defaults.
func ApplyDynamicClassification(ruleLabels, alertLabels map[string]string, defaultComponent, defaultLayer string) (string, string) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this will be used in a future PR for get AlertRules, just want to double check

@sradco sradco Jun 16, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. ApplyDynamicClassification is used in the get-alerts read path (get_alerts.go) to resolve _from labels at query time.
It's called from the alerts enrichment flow when building the API response.
The get-alerts PR is the next one in the stack.

Annotations are intentionally ignored to reduce id churn on documentation-only changes.

## Classification Logic (How component/layer are determined)
Location: `pkg/alertcomponent/matcher.go`

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code isn't included in this PR, which is fine, but I'm wondering how this will be implemented once it is. Will there be some kind of reconcile loop which watches for PrometheusRules, or will it only trigger an update when one of these API endpoints are called?

@sradco sradco Jun 16, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reconcile loop. Classification is triggered only through the API endpoints (PATCH single/bulk).
The matcher (pkg/alertcomponent/matcher.go) runs at read time during alert enrichment.
When GET /api/v1/alerting/alerts is called, the backend computes component/layer from rule labels and the matcher fallback logic, then applies any ARC-based overrides.
The matcher code itself is included in the base branch (main-alerts-management-api) and was introduced in an earlier PR in the stack.

arc.Annotations = map[string]string{}
}
arc.Annotations[managementlabels.ARCAnnotationAlertRuleIDKey] = alertRuleId
if err := k8sClient.AlertRelabelConfigs().Update(ctx, *arc); err != nil {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed for this PR, but is there any desire to have retry functionality occur when trying to update, or should that be something that is strictly driven by the frontend UI (ie. user needs to resubmit a form)

@sradco sradco Jun 16, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. For now, retry is driven by the frontend/client.
If an ARC update fails (e.g., conflict on resourceVersion), the client should re-fetch and resubmit. Backend retry with conflict resolution is worth considering for bulk operations at scale, but for MVP the client-driven approach is simpler and matches how other management endpoints (severity, drop/restore) work today.

Comment thread docs/alert-rule-classification.md Outdated
the owning `AlertingRule` CR is operator-managed (has operator owner references), the
backend cannot modify it directly (the operator would reconcile the change back). In
this case, label updates are applied through an ARC instead. When the `AlertingRule` CR
is not externally managed, label updates are written directly into the CR. Classification

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unclear on why we would want to swap between directly updating the AlertingRule CR and creating/updating an ARC through a single backend endpoint instead of always just always updating through the ARC when this endpoint is called.

I can image a scenario where the frontend has created a user-created and managed AR which has these component and layer labels, and then the user wants to update these labels. It is nice to have these validation layers for these labels specifically, but if a user manually updates the labels for the alert rather than using this endpoint then the validation will be lost.

It seems like it would make more sense to have this "update the ARC vs update the AR" decision happen on the frontend rather than the backend, and then have the update AR logic always perform the validation on these labels if they exist. That way the PATCH requests with the classification will only ever update ARC's and an update to an AR directly will still receive the validation

@sradco sradco Jun 16, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good feedback. I've clarified the doc.
The classification endpoint in this PR always uses the ARC path, regardless of whether the AlertingRule CR is operator-managed or not. It never writes directly to AlertingRule CRs.

The paragraph you flagged was describing the behavior of other management endpoints (severity, drop/restore) which may write to unmanaged AlertingRule CRs directly. I've rewritten it to make this distinction clear:

Classification overrides always use the ARC path regardless of AlertingRule management status — this endpoint never writes directly to AlertingRule CRs. (Other management endpoints, such as severity updates, may write to unmanaged AlertingRule CRs directly, but classification is ARC-only.)

This is intentional because classification labels are additive metadata that should not interfere with the rule definition itself, and the ARC path provides a consistent storage model for both platform and user-defined rules.

Add classification labels (component,
layer) for alert rules with validation,
ARC-based persistence, and bulk update
support.

Signed-off-by: Shirly Radco <[email protected]>
Signed-off-by: João Vilaça <[email protected]>
Signed-off-by: Aviv Litman <[email protected]>
Co-authored-by: AI Assistant <[email protected]>
Co-authored-by: Cursor <[email protected]>
@sradco sradco force-pushed the alert-mgmt-restructured-04-classification branch from 8f0202e to 6268b33 Compare June 16, 2026 12:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants