console-1968-improve-most-valuable-alert-types#7935
Code Review
This pull request outlines a proposal to enhance the alerts and notifications system by adding email support and introducing metric-based alerts for latency, error rates, and traffic. The review identifies several critical technical considerations for the implementation: the inability to run certain PostgreSQL type alterations within transactions, potential inaccuracies in ClickHouse metric calculations due to interval snapping, the need for zero-division handling in percentage change formulas, and a consistency error regarding the proposed evaluation frequency.
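The zero-division concern called out above can be illustrated with a small hedged sketch (hypothetical helper, not the PR's actual code): a naive `(current - previous) / previous` percentage-change formula returns `Infinity` or `NaN` when the baseline window had no traffic, so the evaluator needs an explicit guard.

```typescript
// Hypothetical helper illustrating the zero-division issue in the
// percentage-change formula: when the baseline window has no data,
// (current - previous) / previous yields Infinity or NaN.
function percentChange(current: number, previous: number): number | null {
  if (previous === 0) {
    // No baseline: report "not comparable" instead of Infinity/NaN,
    // so the rule evaluator can skip or special-case the comparison.
    return null;
  }
  return ((current - previous) / previous) * 100;
}
```

Returning `null` (rather than clamping to 0 or 100) keeps the "no baseline" case distinguishable from a genuine 0% change, which matters for % change alert rules.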
This PR adds a metric-based alerting system to Console. Users can define rules that fire when traffic, reliability, or latency on a target breaches a threshold (fixed or % change) and get notified via the existing Slack/webhook/MS Teams channels. Ships behind a two-tier feature flag (cluster kill-switch + per-org enable, both default off) so we can selectively enroll our own org for validation without exposing the feature to other customers, then flip a single Pulumi config value to GA the feature for everyone.
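The firing lifecycle described in this PR (`NORMAL → PENDING → FIRING → RECOVERING` with a "Hold minutes" debounce) can be sketched as a small transition function. This is an assumption-laden illustration, not the PR's implementation: in particular, applying the same hold on the recovery side is my guess, and `breaching`/`heldLongEnough` are hypothetical inputs the real evaluator would derive from ClickHouse metrics and the state log.

```typescript
// Sketch of a hold-minutes debounced alert state machine.
// `breaching`: did the current evaluation breach the threshold?
// `heldLongEnough`: has the condition persisted for the rule's hold window?
type AlertState = 'NORMAL' | 'PENDING' | 'FIRING' | 'RECOVERING';

function nextState(
  state: AlertState,
  breaching: boolean,
  heldLongEnough: boolean,
): AlertState {
  switch (state) {
    case 'NORMAL':
      return breaching ? 'PENDING' : 'NORMAL';
    case 'PENDING':
      if (!breaching) return 'NORMAL'; // brief blip: never fired
      return heldLongEnough ? 'FIRING' : 'PENDING';
    case 'FIRING':
      return breaching ? 'FIRING' : 'RECOVERING';
    case 'RECOVERING':
      if (breaching) return 'FIRING'; // flapped back before recovery held
      return heldLongEnough ? 'NORMAL' : 'RECOVERING';
  }
}
```

The `PENDING` and `RECOVERING` intermediate states are what make the debounce work: a single bad (or good) minute never flips the user-visible state on its own.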
What it ships
Backend
- New tables: `metric_alert_rules`, `_rule_channels`, `_incidents`, `_state_log`, `_notifications_sent`. State log has plan-gated `expires_at` (7d HOBBY/PRO, 30d ENTERPRISE); daily 4am purge.
- `MetricAlertRule` with cursor-paginated incidents/state-log; add/update/delete mutations; Target-side queries; cross-scope validation (channels + saved filters must match the rule's project).
- `evaluateMetricAlertRules` cron runs every minute and executes the `NORMAL → PENDING → FIRING → RECOVERING` state machine with a "Hold minutes" debounce. Per-channel notification fan-out via independent retryable `sendMetricAlertChannelNotification` jobs with three idempotency layers (pre-send dedup, `ON CONFLICT` post-send, `Idempotency-Key` header).
- `purgeExpiredAlertStateLog` daily cron.
- Monitoring (`ClickHouseSlow` warning, `ClickHouseErrors` critical) protecting against silent staleness of the alerter.

Frontend
Two-tier feature flag, mirroring the existing schemaProposals pattern
- Cluster kill-switch (`FEATURE_FLAGS_METRIC_ALERT_RULES_ENABLED`): defaults off; flipping it on enables the feature cluster-wide.
- Per-org enable (`organizations.feature_flags.metricAlertRules`): defaults `false`; can be set via direct PG UPDATE to enable specific orgs (no admin mutation in this PR; same operational pattern as how schemaProposals is enabled today).
- A single `featureFlags:metricAlertRulesEnabled` stack value flips both API and workflows.
- `purgeExpiredAlertStateLog` runs unconditionally so opted-in orgs' state-log tables stay bounded.

Seed script
`scripts/seed-insights-and-alerts/` replaces the old `seed-insights.mts` with metric-first alert history (per-rule incident windows + matching ClickHouse ops + PG state-log rows).

Notable decisions
- Evaluation anchored to `helpers.job.run_at`, not `Date.now()`. Wall-clock would shift the window forward when the worker is backed up and miss the spike that should have fired.
- Rule grouping by (target, window, savedFilter): one ClickHouse query per group; query count scales with groups, not rule count.
- OR-semantics feature flag, resolver-layer gate (mirrors schemaProposals).
- Atomic firing transition: state row + log row + incident row + per-channel job enqueues all in one PG transaction.
- Plan-gated `expires_at` snapshotted at insert time, so plan changes only affect new rows.

Worth a closer look in review
A few sub-features that touch shared infrastructure or have cost / blast-radius implications and deserve focused attention:
- `alerts/resolvers/Target.ts`, the three mutation files, and `workflows/src/lib/metric-alert-evaluator.ts` all have to agree on the OR-gate semantics. If a future contributor copies the resolver pattern but forgets to mirror the SQL predicate (or vice versa), you'd get a state where mutations are gated but the cron evaluates rules anyway, or the reverse. Worth confirming the four call sites are consistent.
- `evaluateMetricAlertRules` runs every minute against `operations_minutely`/`operations_hourly`, batched by (targetId, timeWindowMinutes, savedFilterId). Worst case: every enabled rule with a unique grouping key is its own ClickHouse round-trip. The query is light (covered by the (target, timestamp) index, returns 2 rows), but it's worth keeping an eye on aggregate ClickHouse load as we widen rollout. Consider checking, once we've enabled a handful of orgs, whether `system.query_log` shows acceptable patterns.
- `packages/services/workflows/src/lib/clickhouse-client.ts` is a new dependency edge; the workflows service previously only talked to Postgres. New env vars (`CLICKHOUSE_HOST`, etc.) are wired through the workflows deployment in `deployment/services/workflows.ts`. Confirm the Pulumi stack actually has these set in non-dev environments before flipping the flag.
- `metric_alert_state_log` is the highest-volume new table. Each rule transition writes a row, and the table has a plan-gated TTL; the `purgeExpiredAlertStateLog` cron is the only thing keeping it bounded.
- `metric-alert-notifier.ts` sends to each channel in sequence. If Slack succeeds and the webhook fails, the failure is logged today but doesn't surface to the user; they just get a partial notification. Future observability work will add metrics here, but in the meantime any review feedback on whether we should retry or surface failures more prominently is welcome.
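The partial-notification behavior in the last bullet can be sketched as follows. This is an illustrative shape, not the notifier's actual code: the `Channel` type and `send` function are hypothetical stand-ins for the real per-channel senders.

```typescript
// Sketch of sequential per-channel fan-out where one channel's failure
// is logged but does not abort the remaining sends, which is how a
// user can silently end up with a partial notification.
type Channel = { id: string; send: (payload: string) => Promise<void> };

async function notifyChannels(
  channels: Channel[],
  payload: string,
): Promise<string[]> {
  const failed: string[] = [];
  for (const channel of channels) {
    try {
      await channel.send(payload);
    } catch (error) {
      // Logged but not surfaced to the user; the loop continues.
      console.error(`notification to ${channel.id} failed`, error);
      failed.push(channel.id);
    }
  }
  return failed; // surfacing this list is one way to make partial failures visible
}
```

Returning (and eventually displaying or retrying) the `failed` list is one low-effort option for the "surface failures more prominently" question raised above.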