fix(metrics): cap pg_stat/statio_all_* cardinality via top-N + 'other', without hiding pg_catalog/pg_toast/timescale #64
Draft
NikolayS wants to merge 4 commits into
…all_*
The four per-relation metrics (pg_stat_all_indexes, pg_stat_all_tables, pg_statio_all_tables, pg_statio_all_indexes) had no schema filter and a flat LIMIT 5000 truncation. On extension- or schema-heavy databases this overran prometheus.yml's sample_limit (10000), so the entire scrape was silently rejected, and the LIMIT tail was dropped without any aggregate row left behind, so dashboard sums drifted.
Port the gen2 (gitlab.com/postgres-ai/pgwatch2) approach faithfully instead of reinventing it:
- Read pg_stat_user_*/pg_statio_user_*: pg_catalog, information_schema and pg_toast are excluded by the Postgres view itself, so we don't maintain a hand-curated nspname LIKE pattern that has to grow every time a new extension ships its own schema.
- row_number() OVER (ORDER BY <relevance>) <= 100 per database. Rank by pg_total_relation_size for tables (big tables are the interesting ones; n_live_tup+n_dead_tup starved big-but-static tables) and by activity for indexes/IO views.
- UNION ALL an 'other' row that sums the tail so totals stay correct under the cap. HAVING count(*) > 0 suppresses the row when nothing was truncated.
- Skip rows with no I/O activity in the statio views: most of the tail on schema-heavy DBs is dead-cold relations.
- Filter pg_temp% from index metrics so leftover temp objects from dead sessions stop leaking samples.
Metric names and exposed tag_* labels are unchanged so Dashboards 8–11 keep working. Adds two compliance-vector tests that pin the pattern.
…+javascript
The repo has GitHub's default CodeQL setup enabled (it scans python, javascript-typescript, ruby and actions on every PR). The custom .github/workflows/codeql-analysis.yml runs a second 'Analyze (python|javascript)' matrix on the same commit; GitHub rejects its SARIF upload with 'CodeQL analyses from advanced configurations cannot be processed when the default setup is enabled', so the custom workflow has been failing on every PR since the default setup was turned on.
Default setup already covers the same language set (and more), so deleting the custom workflow leaves us with one working CodeQL run instead of one working + one always-red.
…ertec's)
The 'pgwatch2 (gen2)' shorthand was ambiguous. The pattern we ported lives in gitlab.com/postgres-ai/pgwatch2 (postgres.ai's fork of Cybertec's pgwatch2), which was the previous generation of our monitoring stack before postgresai.
No code or SQL changes; comment/docstring wording only.
…visible
The first revision read from pg_stat_user_*/pg_statio_user_*, which the Postgres views define as 'pg_stat_all_* WHERE schemaname NOT IN (pg_catalog, information_schema) AND schemaname !~ ^pg_toast'. That's identity-based filtering wearing a different hat: it silently hides bloat in pg_toast, hot scans in pg_catalog, and any issue inside _timescaledb_internal. If a TOAST table is bloated or a catalog index is being hammered, the operator wouldn't see it.
Rework the four metrics to read pg_stat_all_*/pg_statio_all_* directly and rely PURELY on cardinality control:
- Top 100 by relevance per database (idx_scan / pg_total_relation_size / heap_blks_read / idx_blks_read).
- Tail aggregated into a single 'other' row so totals stay correct.
- No pg_temp%, no pg_toast%, no _timescaledb% schema filtering anywhere. A relation enters the top-N by activity or by size; if it's not in the top-N, it's in 'other'.
The only WHERE filter kept is the zero-counter row skip on the two statio metrics: those rows literally carry no information (every gauge is 0) and cannot mask any issue, so dropping them is information-preserving, not identity-based.
Smoke-tested against PG16:
- pg_stat_all_tables: 101 rows, 75 from pg_catalog/etc. in top-100.
- pg_stat_all_indexes: 101 rows, 98 from system schemas.
- pg_statio_all_tables / pg_statio_all_indexes: catalog/toast rows appear in top-N once they have any I/O.
Regression tests updated to assert: reads pg_stat_all_*/pg_statio_all_*, no schemaname/nspname LIKE patterns, no 'pg_toast'/'pg_catalog'/'_timescaledb' literals. Top-N + 'other' is the only mechanism.
Summary
Bound cardinality on the four high-frequency per-relation metrics in config/pgwatch-prometheus/metrics.yml (pg_stat_all_indexes, pg_stat_all_tables, pg_statio_all_tables, pg_statio_all_indexes) by ranking, not by identity: top 100 + 'other' aggregate row, reading the pg_stat_all_* / pg_statio_all_* views directly so pg_catalog, pg_toast and _timescaledb_internal stay observable. Pattern adapted from pgwatch2 postgres.ai edition (gitlab.com/postgres-ai/pgwatch2, our fork of Cybertec's pgwatch2). Alternative to gitlab MR !261.
Problem
These four metrics had no filter and a flat LIMIT 5000. On extension- or schema-heavy databases:
- The scrape overran prometheus.yml's sample_limit: 10000, so the entire scrape was silently rejected, the worst possible failure mode.
- The LIMIT 5000 truncation dropped the tail with no aggregate row, so dashboard sums drifted.
The competing approach (gitlab MR !261) adds a hand-curated nspname NOT LIKE ANY (E'pg\_%', 'information_schema', E'\_timescaledb%') filter. That:
- […] LIMIT,
- hides pg_catalog, pg_toast and _timescaledb_internal, exactly the relations where bloat, hot scans, or runaway TOAST growth tend to show up.
What this PR does
In config/pgwatch-prometheus/metrics.yml:
- Read pg_stat_all_* / pg_statio_all_* directly. No pg_stat_user_* (which silently filters pg_catalog / information_schema / pg_toast), no NOT LIKE 'pg_temp%', no NOT LIKE '_timescaledb%', no NOT IN ('pg_catalog', …). A relation enters the top-N by activity or by size; if it's not in the top-N, it's in 'other'.
- row_number() OVER (ORDER BY <relevance> DESC NULLS LAST) <= 100 per database, ranked by:
  - pg_stat_all_tables → pg_total_relation_size(relid) (vs the previous n_live_tup + n_dead_tup, which starved big-but-static tables, including pg_toast, out of the top-N).
  - pg_stat_all_indexes → idx_scan.
  - pg_statio_all_tables → heap_blks_read.
  - pg_statio_all_indexes → idx_blks_read.
- UNION ALL an 'other' row summing the tail; HAVING count(*) > 0 suppresses it when nothing was truncated.
- The zero-counter row skip on the two statio metrics is the only remaining WHERE clause and it is not identity-based: those rows literally have every gauge at 0 and cannot mask any issue.
Prometheus metric names (pgwatch_pg_stat_all_*_*) and the tag_datname / tag_schemaname / tag_relname / tag_indexrelname labels are unchanged so Dashboards 8, 9, 10, 11 keep working. The new 'other' row uses tag_schemaname='other' / tag_relname='other' and can be filtered out per panel as needed.
Also drops .github/workflows/codeql-analysis.yml, which was failing on every PR because GitHub's CodeQL default setup is enabled on this repo and the two configurations conflict (the advanced workflow's SARIF upload is rejected). Default setup already scans python+javascript plus more.
Why this is better than the nspname filter approach (MR !261)
[Comparison table flattened in extraction; surviving cells: nspname NOT LIKE …, LIMIT 5000, _timescaledb_internal chunks, 'other' row]
Still to address (not in this PR)
MR !261 targets four more metrics with the same identity-based filter. Better fix using the same top-N + 'other' pattern, ranked by size/activity (so catalog/toast/timescale stay visible):
- pg_total_relation_size (config/pgwatch-prometheus/metrics.yml:1767): currently no schema filter, LIMIT 5000, no 'other'. Could also be dropped entirely, since table_stats.total_relation_size_b already covers it.
- table_stats (config/pgwatch-prometheus/metrics.yml:435): partition-root CTE filters pg_\_% / _timescaledb%, leaf CTE on pg_stat_all_tables has neither filter nor LIMIT. Highest single contributor to sample volume; needs care around the partition-root union.
- pg_class (config/pgwatch-prometheus/metrics.yml:1526): filters ('information_schema','pg_catalog'), LIMIT 10000. Includes indexes + matviews + views, the cardinality multiplier.
- table_size_detailed (config/pgwatch-prometheus/metrics.yml:2006): filters ('information_schema','pg_toast') (missing pg_catalog), LIMIT 1000, no 'other'.
Happy to land those as follow-ups.
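The top-N + 'other' pattern described in this PR can be tried out in isolation. The sketch below is not the SQL from metrics.yml: it runs a simplified version against an in-memory SQLite table (assuming SQLite 3.25+ for window functions) that stands in for pg_stat_all_indexes. The table name stat_all_indexes, the sample rows, the helper top_n_with_other and the tiny TOP_N are all hypothetical; NULLS LAST is omitted because the sample data has no NULLs.

```python
import sqlite3

# Hypothetical stand-in rows for pg_stat_all_indexes: (schemaname, relname, idx_scan).
ROWS = [
    ("pg_catalog", "pg_class_oid_index", 900),
    ("public", "orders_pkey", 500),
    ("pg_toast", "pg_toast_2619_index", 40),
    ("public", "cold_idx_a", 7),
    ("public", "cold_idx_b", 3),
]

TOP_N = 2  # the PR uses 100; kept tiny here so the tail is visible

# Rank by activity, keep the top N, and fold the rest into one 'other' row.
# GROUP BY 1 is only there because older SQLite builds reject HAVING without
# GROUP BY; it does not change the result.
QUERY = """
WITH ranked AS (
    SELECT schemaname, relname, idx_scan,
           row_number() OVER (ORDER BY idx_scan DESC, relname) AS rn
    FROM stat_all_indexes
)
SELECT schemaname, relname, idx_scan
FROM ranked
WHERE rn <= :top_n
UNION ALL
SELECT 'other', 'other', sum(idx_scan)
FROM ranked
WHERE rn > :top_n
GROUP BY 1
HAVING count(*) > 0
"""


def top_n_with_other(rows, top_n):
    """Return the top_n rows by idx_scan plus an 'other' row for the tail."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE stat_all_indexes "
            "(schemaname TEXT, relname TEXT, idx_scan INTEGER)"
        )
        conn.executemany("INSERT INTO stat_all_indexes VALUES (?, ?, ?)", rows)
        return list(conn.execute(QUERY, {"top_n": top_n}))
    finally:
        conn.close()


if __name__ == "__main__":
    for row in top_n_with_other(ROWS, TOP_N):
        print(row)
```

Note that no schema is filtered: the catalog index ranks first purely on activity, the TOAST index lands in 'other' purely on rank, and the sum over the returned rows equals the sum over the input.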
Test plan
- python3 -m pytest tests/compliance_vectors/test_mr219_monitoring_guards.py tests/xmin_horizon/test_metrics_sql_static.py -q: 45 passed.
- Regression tests keep pg_stat_user_* / nspname LIKE / 'pg_toast' / 'pg_catalog' / _timescaledb literals from sneaking back in.
- Smoke-tested against PG16. pg_stat_all_tables: 101 rows total, 75 of which are from pg_catalog/etc. in the top-100. Catalog visibility confirmed.
  - pg_stat_all_indexes: 101 rows, 98 from system schemas.
  - pg_statio_all_*: catalog/toast rows appear in the top-N once they have any I/O.
- […] pg_stat_all_tables returns exactly 101 rows (top 100 + one 'other').
- […] pgwatch-prometheus job's total samples stay under sample_limit: 10000 and Dashboards 8–11 render correctly with the new 'other' row.
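A static guard of the kind the regression tests are described as providing could look roughly like the sketch below. It is not the code in tests/compliance_vectors or tests/xmin_horizon: the FORBIDDEN regexes and the helper find_identity_filters are hypothetical, derived only from the list of literals named in this PR.

```python
import re

# Patterns that, per the PR text, must not reappear in the four metrics' SQL.
# The exact regexes are assumptions made for this sketch.
FORBIDDEN = [
    r"pg_stat_user_",
    r"pg_statio_user_",
    r"(schemaname|nspname)\s+(NOT\s+)?(LIKE|IN)\b",
    r"pg_toast",
    r"pg_catalog",
    r"_timescaledb",
]


def find_identity_filters(sql):
    """Return the forbidden patterns that occur in one metric's SQL text."""
    return [p for p in FORBIDDEN if re.search(p, sql, flags=re.IGNORECASE)]


if __name__ == "__main__":
    bad = "SELECT * FROM pg_stat_user_tables WHERE schemaname NOT LIKE 'pg_toast%'"
    print(find_identity_filters(bad))
```

A pytest case would then assert that find_identity_filters returns an empty list for each of the four metric definitions, so an identity-based filter fails the build as soon as it is reintroduced.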