fix(metrics): cap pg_stat/statio_all_* cardinality via top-N + 'other' — without hiding pg_catalog/pg_toast/timescale #64

Draft
NikolayS wants to merge 4 commits into main from claude/limit-pg-stat-views-GeNuq

NikolayS (Contributor) commented May 15, 2026

Summary

Bound cardinality on the four high-frequency per-relation metrics in config/pgwatch-prometheus/metrics.yml (pg_stat_all_indexes, pg_stat_all_tables, pg_statio_all_tables, pg_statio_all_indexes) by ranking, not by identity: top 100 + 'other' aggregate row, reading the pg_stat_all_* / pg_statio_all_* views directly so pg_catalog, pg_toast and _timescaledb_internal stay observable. Pattern adapted from pgwatch2 postgres.ai edition (gitlab.com/postgres-ai/pgwatch2 — our fork of Cybertec's pgwatch2). Alternative to gitlab MR !261.

Problem

These four metrics had no filter and a flat LIMIT 5000. On extension- or schema-heavy databases:

  • The full preset generated ~12k–15k samples and overran prometheus.yml's sample_limit: 10000, so the entire scrape was silently rejected — the worst possible failure mode.
  • The LIMIT 5000 truncation dropped the tail with no aggregate row, so dashboard sums drifted.
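
The all-or-nothing behaviour of `sample_limit` can be modelled in a few lines. This is a minimal sketch, not code from this repo: `scrape_outcome` is a hypothetical helper, and the per-metric sample counts are made-up numbers chosen only to land inside the ~12k–15k range quoted above.

```python
def scrape_outcome(samples_by_metric, sample_limit=10_000):
    """Model Prometheus's all-or-nothing sample_limit check for one scrape."""
    total = sum(samples_by_metric.values())
    # In Prometheus, sample_limit = 0 means "no limit".
    accepted = sample_limit == 0 or total <= sample_limit
    # A scrape over the limit is treated as failed: nothing is ingested,
    # not even the samples that would have fit under the limit.
    return {"total": total, "accepted": accepted, "ingested": total if accepted else 0}


# Hypothetical sample counts per metric (assumption, for illustration only).
before = {
    "pg_stat_all_tables": 4000,
    "pg_stat_all_indexes": 3500,
    "pg_statio_all_tables": 3000,
    "pg_statio_all_indexes": 3000,
}

print(scrape_outcome(before))
```

With these assumed numbers the total is 13,500, the scrape is rejected, and zero samples are ingested, which is why exceeding the limit is worse than a partial truncation.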

The competing approach (gitlab MR !261) adds a hand-curated nspname NOT LIKE ANY (E'pg\_%', 'information_schema', E'\_timescaledb%') filter. That:

  • doesn't bound cardinality on a single large user DB,
  • still drops the tail at LIMIT,
  • needs maintenance every time a new extension installs its own schema, and
  • silently hides issues in pg_catalog, pg_toast and _timescaledb_internal — exactly the relations where bloat, hot scans, or runaway TOAST growth tend to show up.

What this PR does

In config/pgwatch-prometheus/metrics.yml:

  • Read directly from pg_stat_all_* / pg_statio_all_*. No pg_stat_user_* (which silently filters pg_catalog/information_schema/pg_toast), no NOT LIKE 'pg_temp%', no NOT LIKE '_timescaledb%', no NOT IN ('pg_catalog', …). A relation enters the top-N by activity or by size; if it's not in the top-N, it's in 'other'.
  • row_number() OVER (ORDER BY <relevance> DESC NULLS LAST) <= 100 per database, ranked by:
    • pg_stat_all_tables: pg_total_relation_size(relid) (vs the previous n_live_tup + n_dead_tup, which starved big-but-static tables, including pg_toast, out of the top-N).
    • pg_stat_all_indexes: idx_scan.
    • pg_statio_all_tables: heap_blks_read.
    • pg_statio_all_indexes: idx_blks_read.
  • UNION ALL an 'other' row summing the tail; HAVING count(*) > 0 suppresses it when nothing was truncated.
  • Zero-counter rows are dropped only on the statio metrics. This is the one preserved WHERE clause and it is not identity-based — those rows literally have every gauge at 0 and cannot mask any issue.
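
The ranking-plus-aggregate logic described in the list above can be modelled outside SQL. The sketch below is an illustration in Python, not the SQL shipped in metrics.yml: `cap_with_other` and its parameters are hypothetical names, and the real queries do this with `row_number()` and `UNION ALL` inside Postgres.

```python
def cap_with_other(rows, rank_key, counters, top_n=100):
    """Keep the top_n rows by rank_key and fold the rest into one 'other' row.

    rows     -- list of dicts, one per relation (hypothetical shape)
    rank_key -- column used for ranking, e.g. "heap_blks_read"
    counters -- columns summed into the 'other' row
    """
    # ORDER BY <relevance> DESC NULLS LAST: non-NULL values first, largest first.
    ranked = sorted(rows, key=lambda r: (r[rank_key] is None, -(r[rank_key] or 0)))
    head, tail = ranked[:top_n], ranked[top_n:]

    result = [dict(r) for r in head]
    # Equivalent of HAVING count(*) > 0: emit 'other' only if something was cut.
    if tail:
        other = {"schemaname": "other", "relname": "other"}
        for c in counters:
            other[c] = sum(r[c] or 0 for r in tail)
        result.append(other)
    return result
```

Called on 150 rows with `top_n=100`, this returns 101 rows whose counter sums equal the sums over all 150 inputs; called on 50 rows it returns the 50 rows unchanged with no 'other' row. Nothing in the function looks at the schema name, so a pg_catalog or pg_toast relation competes for a top-N slot on the same terms as a user table.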

Prometheus metric names (pgwatch_pg_stat_all_*_*) and the tag_datname / tag_schemaname / tag_relname / tag_indexrelname labels are unchanged so Dashboards 8, 9, 10, 11 keep working. The new 'other' row uses tag_schemaname='other' / tag_relname='other' and can be filtered out per panel as needed.

Also drops .github/workflows/codeql-analysis.yml, which was failing on every PR: GitHub's CodeQL default setup is enabled on this repo, and GitHub rejects SARIF uploads from an advanced workflow while default setup is on. Default setup already scans python and javascript-typescript, plus ruby and actions.

Why this is better than the nspname filter approach (MR !261)

| Property | This PR | MR !261's nspname NOT LIKE … |
|---|---|---|
| Bounds cardinality on a single big user DB | yes (top-N) | no (still LIMIT 5000) |
| Preserves visibility of pg_catalog issues | yes (rank into top-N) | no (filtered out) |
| Preserves visibility of pg_toast bloat | yes | no |
| Preserves visibility of _timescaledb_internal chunks | yes (rank by size/activity) | no |
| Preserves totals across the cap | yes ('other' row) | no (silent truncation) |
| Maintenance for new extensions (Citus, pg_repack, Supabase…) | none | manual list update |
Still to address (not in this PR)

MR !261 targets four more metrics with the same identity-based filter. A better fix would apply the same top-N + 'other' pattern to them, ranked by size/activity, so catalog/toast/timescale relations stay visible:

  • pg_total_relation_size (config/pgwatch-prometheus/metrics.yml:1767) — currently no schema filter, LIMIT 5000, no 'other'. Could also be dropped entirely — table_stats.total_relation_size_b already covers it.
  • table_stats (config/pgwatch-prometheus/metrics.yml:435) — partition-root CTE filters pg_\_%/_timescaledb%, leaf CTE on pg_stat_all_tables has neither filter nor LIMIT. Highest single contributor to sample volume; needs care around the partition-root union.
  • pg_class (config/pgwatch-prometheus/metrics.yml:1526) — filters ('information_schema','pg_catalog'), LIMIT 10000. Includes indexes + matviews + views — the cardinality multiplier.
  • table_size_detailed (config/pgwatch-prometheus/metrics.yml:2006) — filters ('information_schema','pg_toast') (missing pg_catalog), LIMIT 1000, no 'other'.

Happy to land those as follow-ups.

Test plan

  • python3 -m pytest tests/compliance_vectors/test_mr219_monitoring_guards.py tests/xmin_horizon/test_metrics_sql_static.py -q — 45 passed.
  • Regression tests pin the new principle and explicitly forbid pg_stat_user_* / nspname LIKE / 'pg_toast' / 'pg_catalog' / _timescaledb literals from sneaking back in.
  • Ran the SQL against a local PostgreSQL 16:
    • pg_stat_all_tables: 101 rows total, 75 of which are from pg_catalog/etc. in the top-100. Catalog visibility confirmed.
    • pg_stat_all_indexes: 101 rows, 98 from system schemas.
    • pg_statio_all_*: catalog/toast rows appear in the top-N once they have any I/O.
    • With 150+ user tables, pg_stat_all_tables returns exactly 101 rows (top 100 + one 'other').
  • Verify on an extension-heavy preview cluster that the pgwatch-prometheus job's total samples stay under sample_limit: 10000 and Dashboards 8–11 render correctly with the new 'other' row.
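
A static guard of the kind described in the regression-test item above might look like the following. This is a hedged sketch, not the contents of tests/compliance_vectors/test_mr219_monitoring_guards.py: `FORBIDDEN_PATTERNS` and `find_forbidden` are hypothetical names, and the exact regexes (including coverage of `pg_statio_user_*`) are assumptions.

```python
import re

# Patterns that would reintroduce identity-based filtering (assumed list).
FORBIDDEN_PATTERNS = [
    r"pg_stat(io)?_user_",                   # views that hide catalog/toast rows
    r"(nspname|schemaname)\s+(not\s+)?like",  # hand-curated schema patterns
    r"'pg_toast'",
    r"'pg_catalog'",
    r"_timescaledb",
]


def find_forbidden(sql):
    """Return every forbidden pattern that matches the given SQL text."""
    return [p for p in FORBIDDEN_PATTERNS if re.search(p, sql, flags=re.IGNORECASE)]


def test_metric_sql_has_no_identity_filters():
    # Stand-in SQL; a real test would load each metric's SQL from metrics.yml.
    sql = "select schemaname, relname, idx_scan from pg_stat_all_indexes"
    assert find_forbidden(sql) == []
```

Checking the SQL text rather than query results keeps the guard fast and independent of a running database, at the cost of only catching the literal spellings listed.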

claude added 3 commits May 15, 2026 14:50
…all_*

The four per-relation metrics (pg_stat_all_indexes, pg_stat_all_tables,
pg_statio_all_tables, pg_statio_all_indexes) had no schema filter and a
flat LIMIT 5000 truncation. On extension- or schema-heavy databases this
overran prometheus.yml's sample_limit (10000) so the entire scrape was
silently rejected, and the LIMIT tail was dropped without any aggregate
row left behind — dashboard sums drifted.

Port the gen2 (gitlab.com/postgres-ai/pgwatch2) approach faithfully
instead of reinventing it:

- Read pg_stat_user_*/pg_statio_user_* — pg_catalog, information_schema
  and pg_toast are excluded by the Postgres view itself, so we don't
  maintain a hand-curated nspname LIKE pattern that has to grow every
  time a new extension ships its own schema.
- row_number() OVER (ORDER BY <relevance>) <= 100 per database. Rank by
  pg_total_relation_size for tables (big tables are the interesting ones;
  n_live_tup+n_dead_tup starved big-but-static tables) and by activity
  for indexes/IO views.
- UNION ALL an 'other' row that sums the tail so totals stay correct
  under the cap. HAVING count(*) > 0 suppresses the row when nothing
  was truncated.
- Skip rows with no I/O activity in the statio views — most of the tail
  on schema-heavy DBs is dead-cold relations.
- Filter pg_temp% from index metrics so leftover temp objects from dead
  sessions stop leaking samples.

Metric names and exposed tag_* labels are unchanged so Dashboards 8–11
keep working. Adds two compliance-vector tests that pin the pattern.
…+javascript

The repo has GitHub's default CodeQL setup enabled (it scans python, javascript-typescript, ruby and actions on every PR). The custom .github/workflows/codeql-analysis.yml runs a second 'Analyze (python|javascript)' matrix on the same commit; GitHub rejects its SARIF upload with 'CodeQL analyses from advanced configurations cannot be processed when the default setup is enabled', so the custom workflow has been failing on every PR since the default setup was turned on.

Default setup already covers the same language set (and more), so deleting the custom workflow leaves us with one working CodeQL run instead of one working + one always-red.
…ertec's)

The 'pgwatch2 (gen2)' shorthand was ambiguous. The pattern we ported lives
in gitlab.com/postgres-ai/pgwatch2 — postgres.ai's fork of Cybertec's
pgwatch2 — which was the previous generation of our monitoring stack
before postgresai. No code or SQL changes; comment/docstring wording only.
NikolayS changed the title from "fix(metrics): port pgwatch2 top-N + 'other' bucket to pg_stat/statio_all_*" to "fix(metrics): port pgwatch2 postgres.ai edition's top-N + 'other' bucket to pg_stat/statio_all_*" on May 15, 2026
…visible

The first revision read from pg_stat_user_*/pg_statio_user_*, which the
Postgres views define as 'pg_stat_all_* WHERE schemaname NOT IN
(pg_catalog, information_schema) AND schemaname !~ ^pg_toast'. That's
identity-based filtering wearing a different hat: it silently hides bloat
in pg_toast, hot scans in pg_catalog, and any issue inside
_timescaledb_internal. If a TOAST table is bloated or a catalog index is
being hammered, the operator wouldn't see it.

Rework the four metrics to read pg_stat_all_*/pg_statio_all_* directly
and rely PURELY on cardinality control:

- Top 100 by relevance per database (idx_scan / pg_total_relation_size /
  heap_blks_read / idx_blks_read).
- Tail aggregated into a single 'other' row so totals stay correct.
- No pg_temp%, no pg_toast%, no _timescaledb% schema filtering anywhere.
  A relation enters the top-N by activity or by size; if it's not in the
  top-N, it's in 'other'.

The only WHERE filter kept is the zero-counter row skip on the two statio
metrics — those rows literally carry no information (every gauge is 0)
and cannot mask any issue, so dropping them is information-preserving,
not identity-based.

Smoke-tested against PG16:
- pg_stat_all_tables: 101 rows, 75 from pg_catalog/etc. in top-100.
- pg_stat_all_indexes: 101 rows, 98 from system schemas.
- pg_statio_all_tables / pg_statio_all_indexes: catalog/toast rows
  appear in top-N once they have any I/O.

Regression tests updated to assert: reads pg_stat_all_*/pg_statio_all_*,
no schemaname/nspname LIKE patterns, no 'pg_toast'/'pg_catalog'/
'_timescaledb' literals — top-N + 'other' is the only mechanism.
NikolayS changed the title from "fix(metrics): port pgwatch2 postgres.ai edition's top-N + 'other' bucket to pg_stat/statio_all_*" to "fix(metrics): cap pg_stat/statio_all_* cardinality via top-N + 'other' — without hiding pg_catalog/pg_toast/timescale" on May 15, 2026