
docs: per-object state metrics #3104

Open: yoks wants to merge 1 commit into NVIDIA:main from yoks:metric-per-object-design


yoks (Contributor) commented Jul 2, 2026

Design document describing the implementation of #2186.

https://github.com/yoks/bare-metal-manager-core/blob/d7497715d555d0bdc120766a8ebda0c77d56ffb0/docs/design/per-object-state-metrics.md

Related issues

#2186

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes


coderabbitai Bot (Contributor) commented Jul 2, 2026


Summary by CodeRabbit

  • Documentation
    • Added a design proposal for per-object state progress metrics.
    • Describes new per-object gauges, bounded-cardinality collection, and an opt-in metrics endpoint.
    • Includes example queries and guidance for tracking state timestamps, SLA values, manual intervention needs, and object/association info.

Walkthrough

This PR adds a design document (docs/design/per-object-state-metrics.md) proposing per-object state progress metrics for a state-controller system. It defines a new gauge-only Prometheus metric catalog, an opt-in dedicated endpoint, cardinality controls, query examples, non-goals, and an implementation approach.

Changes

Per-object state metrics design proposal

| Layer / File(s) | Summary |
| --- | --- |
| Problem statement and building blocks (`docs/design/per-object-state-metrics.md`) | Documents the limitations of aggregate-only metrics at fleet scale, enumerates per-object questions to answer, and lists existing infrastructure (PerObjectMetricsRegistry, manual-intervention signals) the design reuses. |
| Metric catalog and endpoint design (`docs/design/per-object-state-metrics.md`) | Specifies a dedicated opt-in Prometheus endpoint, a gauges-only metric catalog with fixed label sets, a cardinality budget table, and new gauge series (state entry timestamp, per-object SLA, manual intervention required, object/association info). |
| Query examples and non-goals (`docs/design/per-object-state-metrics.md`) | Provides PromQL examples for SLA breach detection, manual-intervention triage, and join/suppression patterns, and lists explicitly excluded metric categories. |
| Implementation approach (`docs/design/per-object-state-metrics.md`) | Outlines reusing/extending PerObjectMetricsRegistry, routing gauges to the new endpoint, generating metrics from the generic processor, and recording info/associations in the machine-controller handler. |
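The "state entry timestamp" and "per-object SLA" gauges summarized above lend themselves to a simple derivation: the exported value only changes on a state transition, and time-in-state is computed by the consumer as the current time minus the entry timestamp. The following is a minimal Rust sketch of that arithmetic, not the design's implementation. The struct, its field names, and both functions are hypothetical and do not come from the design document.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Hypothetical per-object sample. The two numeric fields mirror the
/// proposed gauges (state entry timestamp and per-object SLA).
#[derive(Debug, Clone, PartialEq)]
pub struct ObjectStateSample {
    pub object_id: String,
    pub state: String,
    /// Unix seconds at which the object entered `state`.
    pub state_entered_unix_secs: u64,
    /// Allowed time in `state`; `None` means the state has no SLA.
    pub sla: Option<Duration>,
}

/// Time spent in the current state as of `now`.
/// Saturates at zero if `now` is earlier than the entry timestamp.
pub fn time_in_state(sample: &ObjectStateSample, now: SystemTime) -> Duration {
    let now_secs = now
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0);
    Duration::from_secs(now_secs.saturating_sub(sample.state_entered_unix_secs))
}

/// True when the object has an SLA and has been in its state strictly longer.
pub fn breaches_sla(sample: &ObjectStateSample, now: SystemTime) -> bool {
    match sample.sla {
        Some(sla) => time_in_state(sample, now) > sla,
        None => false,
    }
}
```

Because the stored value is a timestamp rather than an elapsed duration, the series stays constant between transitions, which fits the gauges-only catalog described in the walkthrough.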

Estimated code review effort: 1 (Trivial) | ~5 minutes

Related Issues: Not specified in the provided diff.

Related PRs: Not specified in the provided diff.

Suggested labels: documentation, design-proposal

Suggested reviewers: Not specified in the provided diff.

Poem:
A rabbit sketched a metrics plan,
One gauge per object, best we can,
No counters sprawling wide and vast,
Just SLA breaches tracked at last,
On paper first — then code, my friend! 🐇📊

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the change: documentation for per-object state metrics. |
| Description check | ✅ Passed | The description is directly related to the documentation change and the referenced issue. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

github-actions Bot commented Jul 2, 2026

🔍 Container Scan Summary

No Grype artifacts were found to aggregate.

coderabbitai Bot (Contributor) left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
docs/design/per-object-state-metrics.md (1)

222-248: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Call out the registry API break explicitly.

PerObjectMetricsRegistry is currently classification-specific and registers one hard-coded gauge on the supplied meter. The implementation plan should say this is a breaking expansion, not a drop-in reuse, so the existing health metric and the new state/_info gauges stay clearly separated.

Suggested text change
- We extend it rather than adding a sibling:
+ We extend it rather than adding a sibling, but this is a breaking API expansion: the current classification-only gauge stays intact while new handles are added for state and `_info` metrics:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/design/per-object-state-metrics.md` around lines 222 - 248, The plan
should explicitly state that reusing PerObjectMetricsRegistry is a breaking API
expansion, not a drop-in reuse, because it is currently classification-specific
and only registers one hard-coded gauge. Update the description around
PerObjectMetricsRegistry, gauge(name, description), and the processor.rs hookup
to call out that the registry shape and handle API must change, while keeping
the existing health metric separate from the new state and _info gauges on the
per-object meter.
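One way to picture the separation this comment asks for is a registry that keeps its classification-specific gauge on its own path while handing out named handles for the new series. The sketch below uses only the standard library and is an assumption-laden illustration: `PerObjectRegistrySketch`, `GaugeHandle`, and every method on them are hypothetical stand-ins, not the real `PerObjectMetricsRegistry` API, and the `gauge(name, description)` signature only echoes the shape named in the review.

```rust
use std::collections::HashMap;

/// Key for one named series: (metric name, object id). Hypothetical shape.
type SeriesKey = (String, String);

/// Stand-in for the registry; not the real `PerObjectMetricsRegistry`.
#[derive(Default)]
pub struct PerObjectRegistrySketch {
    /// Models the existing classification-specific gauge, left untouched.
    classification: HashMap<String, f64>,
    /// Models the new, separately named gauges (state and `_info` series).
    named: HashMap<SeriesKey, f64>,
}

/// Handle bound to one metric name, returned by `gauge`.
pub struct GaugeHandle<'a> {
    name: String,
    registry: &'a mut PerObjectRegistrySketch,
}

impl PerObjectRegistrySketch {
    /// Existing behavior: record the classification value for an object.
    pub fn record_classification(&mut self, object_id: &str, value: f64) {
        self.classification.insert(object_id.to_string(), value);
    }

    pub fn classification(&self, object_id: &str) -> Option<f64> {
        self.classification.get(object_id).copied()
    }

    /// New surface: obtain a handle for a named gauge.
    /// The description is accepted but unused in this sketch.
    pub fn gauge(&mut self, name: &str, _description: &str) -> GaugeHandle<'_> {
        GaugeHandle { name: name.to_string(), registry: self }
    }

    pub fn named_value(&self, name: &str, object_id: &str) -> Option<f64> {
        self.named
            .get(&(name.to_string(), object_id.to_string()))
            .copied()
    }
}

impl GaugeHandle<'_> {
    /// Set the gauge value for one object under this handle's metric name.
    pub fn set(&mut self, object_id: &str, value: f64) {
        self.registry
            .named
            .insert((self.name.clone(), object_id.to_string()), value);
    }
}
```

In this shape, callers of the classification path are unaffected, while the handle API is new surface that existing call sites would have to opt into, which is the distinction the suggested wording draws.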
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/design/per-object-state-metrics.md`:
- Around line 119-135: The `carbide_machine_instance_info` example currently
includes free-form `tenant_org`, which violates the bounded-label rule for
`_info` metrics. Update the documentation around the `carbide_object_info` and
`carbide_machine_instance_info` examples to remove the raw tenant string and, if
tenant attribution is needed, describe using a stable internal tenant ID/code
instead; keep the label set limited to closed-set traits plus object IDs.
- Around line 55-59: Qualify the scrape-interval guidance in the design doc so
it only applies to the state-transition series, since the `_info` and
association metrics can still change on inventory or topology updates. Update
the wording around the scrape advice near the `run.rs::create_metrics()`
discussion to explicitly distinguish the slow-changing state series from the
freshness-sensitive metrics, and avoid implying that all of the new metrics can
safely be scraped every 60–120s.
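The bounded-label rule from the first inline comment can be expressed as a small check: every label on an `_info` series must either be an object ID or take its value from a closed set, and anything else (such as a free-form `tenant_org`) is rejected. The Rust sketch below illustrates that rule under stated assumptions. The `LabelRule` type, the `validate_info_labels` function, and the example label names and values other than `tenant_org` are hypothetical and not taken from the design document.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// How a label's values are bounded. Hypothetical type for illustration.
pub enum LabelRule {
    /// Value must be one of a fixed, closed set.
    ClosedSet(BTreeSet<String>),
    /// Value is an object identifier, bounded by the number of objects.
    ObjectId,
}

/// Example rule table; label names and allowed values are assumptions.
pub fn example_rules() -> BTreeMap<String, LabelRule> {
    let mut rules = BTreeMap::new();
    rules.insert("machine_id".to_string(), LabelRule::ObjectId);
    rules.insert("instance_id".to_string(), LabelRule::ObjectId);
    let instance_types: BTreeSet<String> =
        ["small", "large"].iter().map(|s| s.to_string()).collect();
    rules.insert("instance_type".to_string(), LabelRule::ClosedSet(instance_types));
    rules
}

/// Reject labels outside the fixed label set and closed-set values
/// that are not in the allowed list.
pub fn validate_info_labels(
    rules: &BTreeMap<String, LabelRule>,
    labels: &[(&str, &str)],
) -> Result<(), String> {
    for (key, value) in labels {
        match rules.get(*key) {
            None => return Err(format!("label `{key}` is not in the fixed label set")),
            Some(LabelRule::ClosedSet(allowed)) if !allowed.contains(*value) => {
                return Err(format!("value `{value}` is not allowed for label `{key}`"));
            }
            Some(_) => {}
        }
    }
    Ok(())
}
```

With these example rules, a label set of `machine_id` plus `instance_type="small"` passes, while a raw `tenant_org` label fails because it is not part of the fixed label set. A stable internal tenant ID, as the comment suggests, would be added as another `ObjectId`-style entry.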


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 10f87234-fdf5-450f-b024-f03862a708d3

📥 Commits

Reviewing files that changed from the base of the PR and between 7a199a7 and d749771.

📒 Files selected for processing (1)
  • docs/design/per-object-state-metrics.md

