health: stability events log + live system metrics page #732

Merged
cyberb merged 12 commits into master from health-page on May 21, 2026

Conversation

cyberb (Member) commented on May 21, 2026

Summary

  • Stability service now persists structured events to `$SNAP_COMMON/stability-events.jsonl`: every zram setup, file-swap disable, pressure detection, and SIGTERM/SIGKILL is recorded with a timestamp and structured fields.
  • New `backend/health/` package collects live system metrics from `/proc/{stat,meminfo,diskstats,net/dev}` + `statfs(2)` of each non-snap mountpoint.
  • Two admin-secured endpoints: `GET /rest/settings/health/events` and `GET /rest/settings/health/metrics`.
  • New `Health.vue` page under Settings. Polls metrics every 2 s, events every 10 s. Shows CPU %, memory + swap usage, per-mount disk capacity, per-disk I/O rate, per-interface network rate, and the event history table.

No UI framework migration in this PR — uses existing Element Plus (`el-progress`, `el-table`).
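
For reference, the commit notes say the page derives CPU % and KB/s rates from the difference between two consecutive snapshots. The PR does this in `Health.vue`; the sketch below shows the same arithmetic in Go for illustration only. The `CPUTicks` struct and both function names are hypothetical and not taken from the PR.

```go
package main

import "fmt"

// CPUTicks is a hypothetical reduction of the aggregate "cpu" line in
// /proc/stat into busy and idle tick counters.
type CPUTicks struct {
	Busy uint64 // e.g. user + nice + system + irq + softirq + steal
	Idle uint64 // e.g. idle + iowait
}

// CPUPercent returns the busy share between two samples, in 0..100.
// It returns 0 when the counters did not advance or went backwards.
func CPUPercent(prev, cur CPUTicks) float64 {
	if cur.Busy < prev.Busy || cur.Idle < prev.Idle {
		return 0
	}
	busy := cur.Busy - prev.Busy
	idle := cur.Idle - prev.Idle
	total := busy + idle
	if total == 0 {
		return 0
	}
	return 100 * float64(busy) / float64(total)
}

// RateKBps converts the growth of a byte counter over an interval
// into KB/s (1 KB = 1024 bytes here, which is an assumption).
func RateKBps(prevBytes, curBytes uint64, seconds float64) float64 {
	if seconds <= 0 || curBytes < prevBytes {
		return 0
	}
	return float64(curBytes-prevBytes) / 1024 / seconds
}

func main() {
	prev := CPUTicks{Busy: 1000, Idle: 3000}
	cur := CPUTicks{Busy: 1150, Idle: 3050}
	fmt.Printf("cpu: %.1f%%\n", CPUPercent(prev, cur))
	fmt.Printf("net: %.1f KB/s\n", RateKBps(10240, 30720, 2))
}
```

With a 2 s polling interval, `seconds` would be the measured gap between the two snapshots rather than a constant, so a delayed poll does not inflate the rate.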

Test plan

  • `go test ./...` (stability, health, ioc, rest all green)
  • `npx vite build` (Health.vue compiles)
  • CI green
  • Manual on borisarm64: install, open /health, verify live numbers move; trigger the python memhog to confirm the SIGTERM event appears in the table

cyberb added 12 commits May 21, 2026 11:41
backend/stability:
  EventLog persists structured events to $SNAP_COMMON/stability-events.jsonl
  (one JSON line per zram setup / file-swap disable / pressure detection /
  SIGTERM / SIGKILL). Watcher and Zram both accept an optional *EventLog
  and append events alongside their existing zap logging.
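
A minimal sketch of what "one JSON line per event" could look like, assuming an append-only file opened with `O_APPEND`. The `Event` field names and the `appendDemo` helper are hypothetical; the PR does not show the real struct, and this is not the PR's implementation.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// Event is a hypothetical record shape; the real field names are not
// shown in the PR.
type Event struct {
	Time   time.Time         `json:"time"`
	Kind   string            `json:"kind"`
	Fields map[string]string `json:"fields,omitempty"`
}

// EventLog appends events to a JSON Lines file.
type EventLog struct{ path string }

// Append encodes one event as a single JSON line and appends it.
func (l *EventLog) Append(e Event) error {
	line, err := json.Marshal(e)
	if err != nil {
		return err
	}
	f, err := os.OpenFile(l.path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = f.Write(append(line, '\n'))
	return err
}

// appendDemo writes two events to a temp file and returns the number
// of lines found in it afterwards.
func appendDemo() (int, error) {
	dir, err := os.MkdirTemp("", "events")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	el := &EventLog{path: filepath.Join(dir, "stability-events.jsonl")}
	events := []Event{
		{Time: time.Now(), Kind: "zram_enabled"},
		{Time: time.Now(), Kind: "swapoff_file", Fields: map[string]string{"path": "/swapfile"}},
	}
	for _, e := range events {
		if err := el.Append(e); err != nil {
			return 0, err
		}
	}

	f, err := os.Open(el.path)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	n := 0
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		n++
	}
	return n, sc.Err()
}

func main() {
	n, err := appendDemo()
	fmt.Println(n, err)
}
```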

backend/health:
  New package. Collector reads /proc/stat, /proc/meminfo, /proc/diskstats,
  /proc/net/dev for live system metrics; statfs of each non-snap mountpoint
  for capacity. Health{} bundles the EventLog reader + Collector for use
  by REST.
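
As an illustration of the `/proc` parsing described above, here is a sketch of a memory reader hanging off a `Collector` that owns its proc directory. The `Memory` struct, its field names, and the `readFixture` helper are assumptions made for this example; only the `Collector` and `readMemory` names come from the PR.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// Collector owns the proc directory so tests can point it at fixtures.
type Collector struct{ procDir string }

// Memory is a hypothetical subset of /proc/meminfo, in kB.
type Memory struct {
	TotalKB, AvailableKB, SwapTotalKB, SwapFreeKB uint64
}

// readMemory parses "Key:  value kB" lines and keeps the keys it knows.
func (c *Collector) readMemory() (Memory, error) {
	f, err := os.Open(filepath.Join(c.procDir, "meminfo"))
	if err != nil {
		return Memory{}, err
	}
	defer f.Close()

	var m Memory
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text()) // e.g. ["MemTotal:", "2048000", "kB"]
		if len(fields) < 2 {
			continue
		}
		v, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			continue // skip lines with a non-numeric value
		}
		switch fields[0] {
		case "MemTotal:":
			m.TotalKB = v
		case "MemAvailable:":
			m.AvailableKB = v
		case "SwapTotal:":
			m.SwapTotalKB = v
		case "SwapFree:":
			m.SwapFreeKB = v
		}
	}
	return m, sc.Err()
}

// readFixture writes a small fake meminfo file and parses it.
func readFixture() (Memory, error) {
	dir, err := os.MkdirTemp("", "proc")
	if err != nil {
		return Memory{}, err
	}
	defer os.RemoveAll(dir)
	data := "MemTotal:  2048000 kB\nMemFree:  100000 kB\n" +
		"MemAvailable:  512000 kB\nSwapTotal:  1024000 kB\nSwapFree:  1000000 kB\n"
	if err := os.WriteFile(filepath.Join(dir, "meminfo"), []byte(data), 0o644); err != nil {
		return Memory{}, err
	}
	return (&Collector{procDir: dir}).readMemory()
}

func main() {
	m, err := readFixture()
	fmt.Printf("%+v %v\n", m, err)
}
```

In production the collector would be constructed with `procDir` set to `/proc`; the fixture directory is what makes the parser testable without touching the host.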

backend/rest:
  Two new admin-secured endpoints:
    GET /rest/settings/health/events?limit=N  -> recent stability events
    GET /rest/settings/health/metrics         -> single Snapshot
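
A sketch of how the `limit=N` query parameter might be handled, using only `net/http`. The default of 100, the cap of 1000, and the `parseLimit` / `eventsHandler` names are assumptions for illustration; the admin check that the real endpoints sit behind is omitted here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
)

const (
	defaultLimit = 100  // assumed default, not taken from the PR
	maxLimit     = 1000 // assumed upper bound, not taken from the PR
)

// parseLimit turns the raw query value into a bounded positive integer.
func parseLimit(raw string) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		return defaultLimit
	}
	if n > maxLimit {
		return maxLimit
	}
	return n
}

// eventsHandler serves the result of recent(limit) as JSON.
// recent stands in for whatever reads the event log.
func eventsHandler(recent func(limit int) []string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		limit := parseLimit(r.URL.Query().Get("limit"))
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(recent(limit))
	}
}

func main() {
	h := eventsHandler(func(limit int) []string { return make([]string, limit) })
	req := httptest.NewRequest(http.MethodGet, "/rest/settings/health/events?limit=3", nil)
	rec := httptest.NewRecorder()
	h(rec, req)
	fmt.Println(rec.Code, rec.Body.String())
}
```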

web/platform:
  New Health.vue. Polls metrics every 2 s (computes CPU% / disk-IO KB/s /
  net rate from snapshot deltas), events every 10 s. Shows per-mount usage
  bars, swap usage, and the stability event history. Listed under Settings
  with a 'favorite' material icon.

CPU ticks advance with sine-shaped busy/idle, mem/swap usage oscillates
so the live bars actually move, net/disk byte counters grow with random
deltas to produce realistic KB/s rates. Events list includes a recent
SIGTERM + earlier SIGKILL chain plus a zram_enabled + swapoff_file pair
matching what the borisarm64 OOM stress test produced.

el-table forces a fixed wide layout that overflowed on mobile. Switch
to a vertical list of cards (max-width 720px) with a colored left
border per event kind (red for kills, orange for pressure, green for
zram/swap actions). Time wraps below kind on narrow screens via
flex-wrap.

Each event now has a material icon (warning/cancel for kills, priority
for pressure, memory/swap for zram) tinted to match the left-border
accent color, a white rounded card with soft shadow, and relative time
(e.g. '2m ago') with absolute timestamp on hover. Tighter padding +
smaller fonts under 600px.

Wrap events H2 + list in .settingsblock so they get the same
1024px-capped centered container as the CPU/memory/disk sections.
Drop the now-redundant 720px event-list cap.

Other views import these in their <style> block. Routes are lazy-loaded,
so refreshing on /health was loading Health.vue cold — without those
imports the page lost the site-wide typography/layout rules and the
event-card material icons fell back to text. Matches the pattern in
Settings.vue / Logs.vue / etc.

- Switch <style scoped> to plain <style> matching Settings/Logs. Under
  scoped, the @import 'site.css' was rewritten to data-v-… selectors
  so it didn't reach the global menu/header on cold refreshes of /health.
- Refactor disks/network rows into a flex 'metric-row' with name on the
  left, tabular-num value on the right, optional bar below — keeps
  layout stable when rate digits change.
- Add 16 px horizontal padding to the events block under 1024 px so the
  event cards inset from the screen edge like the rest of the content.

Wrap the two col2 columns in a flex .health-row (40 px gap, wrap on
narrow screens) — gives the CPU/memory and disks/network blocks real
breathing room on desktop. Cap the events block at 720 px and center
it so it doesn't stretch full settingsblock width.

EventLog:
  Recent() previously decoded the entire jsonl file into a slice before
  trimming, so a 100k-event log would allocate the full set even for a
  Recent(10) call. Reworked to use a fixed-size ring buffer (capped at
  the requested limit) so memory use is O(limit), not O(file size).

  Append() now rotates the file when it exceeds 256 KiB — keeps the
  newest 1000 events and rewrites atomically via tmpfile+rename. Disk
  usage is also bounded.
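
  The ring-buffer idea can be sketched as follows. This is an illustration of the technique, not the PR's code: `recentLines` and `demo` are hypothetical names, and the sketch works on raw lines rather than decoded events.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// recentLines returns the last `limit` non-empty lines of r, oldest
// first, while holding at most `limit` lines in memory at any time.
func recentLines(r io.Reader, limit int) ([]string, error) {
	if limit <= 0 {
		return nil, nil
	}
	ring := make([]string, limit)
	count := 0
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		line := sc.Text()
		if line == "" {
			continue
		}
		ring[count%limit] = line // overwrite the oldest slot once full
		count++
	}
	if err := sc.Err(); err != nil {
		return nil, err
	}
	if count <= limit {
		return ring[:count], nil
	}
	// The oldest surviving line sits at count%limit; unroll from there.
	start := count % limit
	out := make([]string, 0, limit)
	out = append(out, ring[start:]...)
	out = append(out, ring[:start]...)
	return out, nil
}

// demo feeds n generated lines ("e1".."en") through recentLines and
// joins the result with commas.
func demo(n, limit int) string {
	var sb strings.Builder
	for i := 1; i <= n; i++ {
		sb.WriteString(fmt.Sprintf("e%d\n", i))
	}
	lines, err := recentLines(strings.NewReader(sb.String()), limit)
	if err != nil {
		return "error: " + err.Error()
	}
	return strings.Join(lines, ",")
}

func main() {
	fmt.Println(demo(5, 2))
}
```

  Note that `bufio.Scanner` rejects lines longer than its buffer (64 KiB by default), so a real reader would either raise that limit with `Scanner.Buffer` or rely on events staying small.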

i18n:
  Added the new settings.health label and the full health.* dictionary
  to ar/de/es/fr/hi/ja/pt/ru/zh-CN. Previously only en.json had them
  and other locales fell back to English mid-page.

readCPU/readMemory/readDisks/readNet now hang off *Collector so they
own the procDir path instead of taking it as an argument. Snapshot()
no longer threads filepath.Join calls. Tests cover each method
individually plus the end-to-end Snapshot path through a shared
newTestCollector helper. Matches the same OO style we use for the
EventLog / Watcher / Zram.

Every commit on a branch with an open PR was creating two builds (push
+ pull_request) and Drone serializes them, so the queue piled up faster
than it drained. Drop pull_request from the event trigger: the push
build for the same commit already validates everything the PR check
would, and merging is unblocked once that build is green.
cyberb merged commit d5cfe70 into master on May 21, 2026
1 check passed
cyberb deleted the health-page branch on May 21, 2026 17:40