console-1968-improve-most-valuable-alert-types#7935
Code Review
This pull request outlines a proposal to enhance the alerts and notifications system by adding email support and introducing metric-based alerts for latency, error rates, and traffic. The review identifies several critical technical considerations for the implementation: the inability to run certain PostgreSQL type alterations within transactions, potential inaccuracies in ClickHouse metric calculations due to interval snapping, the need for zero-division handling in percentage change formulas, and a consistency error regarding the proposed evaluation frequency.
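The zero-division concern called out above can be illustrated with a small hedged sketch (hypothetical helper, not the PR's actual code): a naive `(current - previous) / previous` percentage-change formula returns `Infinity` or `NaN` when the baseline window had no traffic, so the evaluator needs an explicit guard.

```typescript
// Hypothetical helper illustrating the zero-division issue in the
// percentage-change formula: when the baseline window has no data,
// (current - previous) / previous yields Infinity or NaN.
function percentChange(current: number, previous: number): number | null {
  if (previous === 0) {
    // No baseline: report "not comparable" instead of Infinity/NaN,
    // so the rule evaluator can skip or special-case the comparison.
    return null;
  }
  return ((current - previous) / previous) * 100;
}
```

Returning `null` (rather than clamping to 0 or 100) keeps the "no baseline" case distinguishable from a genuine 0% change, which matters for % change alert rules.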
This PR adds a metric-based alerting system to Console. Users can define rules that fire when traffic, reliability, or latency on a target breaches a threshold (fixed or % change) and get notified via the existing Slack/webhook/MS Teams channels. Ships behind a two-tier feature flag (cluster kill-switch + per-org enable, both default off) so we can selectively enroll our own org for validation without exposing the feature to other customers, then flip a single Pulumi config value to GA the feature for everyone.
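The firing lifecycle described in this PR (`NORMAL → PENDING → FIRING → RECOVERING` with a "Hold minutes" debounce) can be sketched as a small transition function. This is an assumption-laden illustration, not the PR's implementation: in particular, applying the same hold on the recovery side is my guess, and `breaching`/`heldLongEnough` are hypothetical inputs the real evaluator would derive from ClickHouse metrics and the state log.

```typescript
// Sketch of a hold-minutes debounced alert state machine.
// `breaching`: did the current evaluation breach the threshold?
// `heldLongEnough`: has the condition persisted for the rule's hold window?
type AlertState = 'NORMAL' | 'PENDING' | 'FIRING' | 'RECOVERING';

function nextState(
  state: AlertState,
  breaching: boolean,
  heldLongEnough: boolean,
): AlertState {
  switch (state) {
    case 'NORMAL':
      return breaching ? 'PENDING' : 'NORMAL';
    case 'PENDING':
      if (!breaching) return 'NORMAL'; // brief blip: never fired
      return heldLongEnough ? 'FIRING' : 'PENDING';
    case 'FIRING':
      return breaching ? 'FIRING' : 'RECOVERING';
    case 'RECOVERING':
      if (breaching) return 'FIRING'; // flapped back before recovery held
      return heldLongEnough ? 'NORMAL' : 'RECOVERING';
  }
}
```

The `PENDING` and `RECOVERING` intermediate states are what make the debounce work: a single bad (or good) minute never flips the user-visible state on its own.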
What it ships
Backend
- New tables: `metric_alert_rules`, `_rule_channels`, `_incidents`, `_state_log`, `_notifications_sent`. State log has plan-gated `expires_at` (7d HOBBY/PRO, 30d ENTERPRISE); daily 4am purge.
- `MetricAlertRule` with cursor-paginated incidents/state-log; add/update/delete mutations; Target-side queries; cross-scope validation (channels + saved filters must match the rule's project).
- `evaluateMetricAlertRules` cron runs every minute and executes the `NORMAL → PENDING → FIRING → RECOVERING` state machine with a "Hold minutes" debounce. Per-channel notification fan-out via independent retryable `sendMetricAlertChannelNotification` jobs with three idempotency layers (pre-send dedup, `ON CONFLICT` post-send, `Idempotency-Key` header).
- `purgeExpiredAlertStateLog` daily cron.
- Monitoring (`ClickHouseSlow` warning, `ClickHouseErrors` critical) protecting against silent staleness of the alerter.

Frontend
Two-tier feature flag, mirroring the existing schemaProposals pattern
- Cluster kill-switch (`FEATURE_FLAGS_METRIC_ALERT_RULES_ENABLED`): defaults off; flipping it on enables the feature cluster-wide.
- Per-org enable (`organizations.feature_flags.metricAlertRules`): defaults `false`; can be set via direct PG UPDATE to enable specific orgs (no admin mutation in this PR; same operational pattern as how schemaProposals is enabled today).
- A single `featureFlags:metricAlertRulesEnabled` stack value flips both API and workflows.
- `purgeExpiredAlertStateLog` runs unconditionally so opted-in orgs' state-log tables stay bounded.

Seed script
`scripts/seed-insights-and-alerts/` replaces the old `seed-insights.mts` with metric-first alert history (per-rule incident windows + matching ClickHouse ops + PG state-log rows).

Notable decisions
- Evaluation anchored to `helpers.job.run_at`, not `Date.now()`. Wall-clock would shift the window forward when the worker is backed up and miss the spike that should have fired.
- Rule grouping by (target, window, savedFilter): one ClickHouse query per group; query count scales with groups, not rule count.
- OR-semantics feature flag, resolver-layer gate (mirrors schemaProposals).
- Atomic firing transition: state row + log row + incident row + per-channel job enqueues all in one PG transaction.
- Plan-gated `expires_at` snapshotted at insert time, so plan changes only affect new rows.

Worth a closer look in review
A few sub-features that touch shared infrastructure or have cost / blast-radius implications and deserve focused attention:
- `alerts/resolvers/Target.ts`, the three mutation files, and `workflows/src/lib/metric-alert-evaluator.ts` all have to agree on the OR-gate semantics. If a future contributor copies the resolver pattern but forgets to mirror the SQL predicate (or vice versa), you'd get a state where mutations are gated but the cron evaluates rules anyway, or the reverse. Worth confirming the four call sites are consistent.
- `evaluateMetricAlertRules` runs every minute against `operations_minutely`/`operations_hourly`, batched by (targetId, timeWindowMinutes, savedFilterId). Worst case: every enabled rule with a unique grouping key is its own ClickHouse round-trip. The query is light (covered by the (target, timestamp) index, returns 2 rows), but it's worth keeping an eye on aggregate ClickHouse load as we widen rollout. Consider checking, once we've enabled a handful of orgs, whether `system.query_log` shows acceptable patterns.
- `packages/services/workflows/src/lib/clickhouse-client.ts` is a new dependency edge; the workflows service previously only talked to Postgres. New env vars (`CLICKHOUSE_HOST`, etc.) are wired through the workflows deployment in `deployment/services/workflows.ts`. Confirm the Pulumi stack actually has these set in non-dev environments before flipping the flag.
- `metric_alert_state_log` is the highest-volume new table. Each rule transition writes a row, and the table has a plan-gated TTL; the `purgeExpiredAlertStateLog` cron is the only thing keeping it bounded.
- `metric-alert-notifier.ts` sends to each channel in sequence. If Slack succeeds and the webhook fails, the failure is logged today but doesn't surface to the user; they just get a partial notification. Future observability work will add metrics here, but in the meantime any review feedback on whether we should retry or surface failures more prominently is welcome.
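The partial-notification behavior in the last bullet can be sketched as follows. This is an illustrative shape, not the notifier's actual code: the `Channel` type and `send` function are hypothetical stand-ins for the real per-channel senders.

```typescript
// Sketch of sequential per-channel fan-out where one channel's failure
// is logged but does not abort the remaining sends, which is how a
// user can silently end up with a partial notification.
type Channel = { id: string; send: (payload: string) => Promise<void> };

async function notifyChannels(
  channels: Channel[],
  payload: string,
): Promise<string[]> {
  const failed: string[] = [];
  for (const channel of channels) {
    try {
      await channel.send(payload);
    } catch (error) {
      // Logged but not surfaced to the user; the loop continues.
      console.error(`notification to ${channel.id} failed`, error);
      failed.push(channel.id);
    }
  }
  return failed; // surfacing this list is one way to make partial failures visible
}
```

Returning (and eventually displaying or retrying) the `failed` list is one low-effort option for the "surface failures more prominently" question raised above.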