feat(metrics): per-container CPU/memory/network with Grafana rig#87

Merged
luthermonson merged 2 commits into main from feat/container-metrics on Jun 9, 2026

@luthermonson

Summary

Adds per-container CPU, memory, and network metrics to /metrics, plus a self-provisioned Prometheus + Grafana rig under examples/observability/. End-to-end stack works against the embedded Linux VM and native Windows Hyper-V containers from a single host scrape target.

Five new metric families, labelled {id, repo, runtime}:

| Name | Type | Notes |
|---|---|---|
| ephemerd_container_cpu_usage_seconds_total | counter | cumulative CPU; use rate() for utilization |
| ephemerd_container_memory_bytes / _memory_anon_bytes | gauge | current usage; anon is 0 on Windows |
| ephemerd_container_memory_limit_bytes / _cpu_limit | gauge | configured caps; 0 = unlimited |
| ephemerd_container_network_rx_bytes_total / _tx_bytes_total | counter | runner container's netns only; sibling dind containers are not included |
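As a worked illustration of the "use rate() for utilization" note: the CPU counter only ever grows, so utilization is the change in CPU-seconds divided by the wall-clock time between two scrapes. The sketch below is not code from this PR; `cpuSample` and `cpuCores` are hypothetical names, and unlike Prometheus's `rate()` it simply refuses to produce a value when the counter goes backwards.

```go
package main

import "fmt"

// cpuSample is one scrape of the cumulative CPU counter (hypothetical type).
type cpuSample struct {
	UnixSeconds float64 // scrape timestamp
	CPUSeconds  float64 // value of ephemerd_container_cpu_usage_seconds_total
}

// cpuCores returns the average number of cores in use between two samples.
// ok is false when the interval is empty or the counter went backwards,
// in which case no rate can be derived from this pair.
func cpuCores(prev, cur cpuSample) (cores float64, ok bool) {
	dt := cur.UnixSeconds - prev.UnixSeconds
	dv := cur.CPUSeconds - prev.CPUSeconds
	if dt <= 0 || dv < 0 {
		return 0, false
	}
	return dv / dt, true
}

func main() {
	prev := cpuSample{UnixSeconds: 100, CPUSeconds: 40}
	cur := cpuSample{UnixSeconds: 110, CPUSeconds: 55}
	if cores, ok := cpuCores(prev, cur); ok {
		fmt.Printf("%.2f cores\n", cores) // 15 CPU-seconds over 10 s
	}
}
```

Dividing the result by a non-zero `ephemerd_container_cpu_limit` would give utilization relative to the configured cap, assuming that gauge is expressed in cores.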

Series get DeleteLabelValues'd on container destroy so live cardinality is bounded by max_concurrent.
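The cardinality bound can be pictured with a stdlib-only stand-in for a labelled gauge vector. This is a sketch, not the PR's code: `gaugeSet` and `seriesKey` are hypothetical, and `Delete` only mimics the role that `DeleteLabelValues` plays on a Prometheus vector (drop the series, report whether it existed).

```go
package main

import (
	"fmt"
	"sync"
)

// seriesKey mirrors the {id, repo, runtime} label set from the PR.
type seriesKey struct{ ID, Repo, Runtime string }

// gaugeSet is a hypothetical stand-in for a labelled gauge vector.
type gaugeSet struct {
	mu     sync.Mutex
	values map[seriesKey]float64
}

func newGaugeSet() *gaugeSet {
	return &gaugeSet{values: make(map[seriesKey]float64)}
}

// Set creates or updates the series for one container.
func (g *gaugeSet) Set(k seriesKey, v float64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.values[k] = v
}

// Delete drops the series and reports whether it existed.
func (g *gaugeSet) Delete(k seriesKey) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	_, ok := g.values[k]
	delete(g.values, k)
	return ok
}

// Len is the number of live series.
func (g *gaugeSet) Len() int {
	g.mu.Lock()
	defer g.mu.Unlock()
	return len(g.values)
}

func main() {
	mem := newGaugeSet()
	job := seriesKey{ID: "job-1", Repo: "example/repo", Runtime: "linux-vm"}
	mem.Set(job, 128<<20)
	fmt.Println("live series:", mem.Len())
	mem.Delete(job) // called from the destroy path
	fmt.Println("live series:", mem.Len())
}
```

Because every series is keyed by a live container and removed on destroy, the map can never hold more entries than there are concurrently running containers.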

Architecture

  • Linux sampler reads cgroupv2 via containerd.Task.Metrics + /proc/<pid>/net/dev. Switched off netlink after observing "protocol not supported" on the embedded virt kernel; /proc is universal.
  • Windows sampler reads via hcsshim.OpenContainer + Statistics, sums BytesReceived/BytesSent across all NetworkStats endpoints.
  • Transport for VM-side stats: new StreamContainerStats RPC on the existing Dispatch service. The host is the client (matches existing CreateJob/WaitJob/DestroyJob polarity); the in-VM ephemerd server-streams batches every container_stats_interval (default 10 s). The host updates its own gauges with runtime=linux-vm. The same flow is designed to work unchanged on the Darwin Vz Linux VM.
  • Runtime hooks: OnTaskStarted / OnTaskDestroy callbacks on runtime.Config; SetTaskHooks for post-construction wiring (needed because the dispatch server depends on the runtime).

Full design doc: docs/arch/container-metrics.md.
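The post-construction hook wiring mentioned under Architecture can be sketched as follows. Only the names OnTaskStarted, OnTaskDestroy, and SetTaskHooks come from this PR; the callback signatures and the `Runtime`, `DispatchServer`, `Start`, and `Destroy` shapes are assumptions made for illustration, and the sketch omits locking.

```go
package main

import "fmt"

// TaskHooks holds the lifecycle callbacks. Signatures are assumed.
type TaskHooks struct {
	OnTaskStarted func(id string)
	OnTaskDestroy func(id string)
}

// Runtime is a hypothetical stand-in for the container runtime.
type Runtime struct{ hooks TaskHooks }

// SetTaskHooks installs hooks after the runtime has been constructed.
func (r *Runtime) SetTaskHooks(h TaskHooks) { r.hooks = h }

// Start fires OnTaskStarted if a hook is installed.
func (r *Runtime) Start(id string) {
	if r.hooks.OnTaskStarted != nil {
		r.hooks.OnTaskStarted(id)
	}
}

// Destroy fires OnTaskDestroy if a hook is installed.
func (r *Runtime) Destroy(id string) {
	if r.hooks.OnTaskDestroy != nil {
		r.hooks.OnTaskDestroy(id)
	}
}

// DispatchServer needs the runtime at construction time, so its callbacks
// cannot be handed to the runtime's own constructor.
type DispatchServer struct {
	rt      *Runtime
	tracked map[string]bool // containers currently being sampled
}

// NewDispatchServer builds the server, then wires its callbacks into the
// already-constructed runtime.
func NewDispatchServer(rt *Runtime) *DispatchServer {
	s := &DispatchServer{rt: rt, tracked: make(map[string]bool)}
	rt.SetTaskHooks(TaskHooks{
		OnTaskStarted: func(id string) { s.tracked[id] = true },
		OnTaskDestroy: func(id string) { delete(s.tracked, id) },
	})
	return s
}

func main() {
	rt := &Runtime{}
	srv := NewDispatchServer(rt)
	rt.Start("job-1")
	fmt.Println("tracked:", len(srv.tracked))
	rt.Destroy("job-1")
	fmt.Println("tracked:", len(srv.tracked))
}
```

The setter breaks the construction cycle: the runtime exists first with no hooks, and the component that depends on it installs them afterwards.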

Observability rig

examples/observability/ is a two-container Compose rig (Prometheus + Grafana) that pre-provisions the datasource and a 12-panel dashboard. Boot is one command:

    cd examples/observability
    docker compose up -d

host.docker.internal:9090 is the default scrape target (works on Docker Desktop). For Podman / rootless / remote, copy .env.example to .env and set EPHEMERD_TARGET — see the README.

Verified

End-to-end against a live PHP-SDK CI run on the Hyper-V Linux VM:

  • 12 distinct job containers captured live
  • Peak rx: 502 MB (heavy composer install on calm_wright)
  • Peak download rate: 8.2 MB/s sustained during artifact pulls
  • All gauges + counters render correctly in Grafana
  • Series correctly disappear from /metrics after DestroyJob

mage ci lint is clean (golangci-lint run ./... reports 0 issues with CGO_ENABLED=0 GOOS=linux); all touched packages pass go test ./... natively. The pre-existing pkcs11 cgo failure on Windows-without-MinGW also occurs on main and is not introduced by this PR.

Test plan

  • docker compose up -d in examples/observability/ and confirm Grafana renders the dashboard
  • Trigger a Linux CI job and confirm ephemerd_container_* series appear with runtime="linux-vm"
  • Trigger a Windows CI job and confirm runtime="windows-hyperv" series appear
  • Verify series disappear from /metrics after job destroy
  • On macOS host: confirm the same flow works via Vz Linux VM (untested locally — design is identical)

Out of scope

  • macOS Vz macOS VMs (different stats API; deferred).
  • Disk IO (cgroup io.stat and HCS StorageStats both available; will add when a user actually asks).
  • Per-interface network breakdown (CI workloads only ever use eth0; not worth the cardinality).
  • Sibling dind container metrics (each gets its own netns; not currently surfaced).

…tream

Adds five new per-container metric families to /metrics, labelled
{id, repo, runtime}. Series are deleted on container destroy so live
cardinality is bounded by max_concurrent:

- ephemerd_container_cpu_usage_seconds_total
- ephemerd_container_memory_bytes / _memory_anon_bytes
- ephemerd_container_memory_limit_bytes / _cpu_limit
- ephemerd_container_network_rx_bytes_total / _tx_bytes_total

Linux samples come from containerd Task.Metrics + /proc/<pid>/net/dev.
Switched the network reader off netlink after observing
"protocol not supported" on the embedded virt kernel — /proc is
universal. Windows host samples come from hcsshim.OpenContainer +
Statistics, summing across NetworkStats endpoints.

The in-VM Linux daemon pushes batches back to the host over a new
StreamContainerStats RPC on the existing Dispatch service, so the
host's /metrics is the single scrape target. Series carry
runtime=linux-vm vs runtime=windows-hyperv vs runtime=linux-native so
Grafana can split them.

Includes examples/observability/ — a Docker Compose rig that boots
Prometheus + Grafana with the ephemerd dashboard pre-provisioned.
Tested on Podman + WSL via the EPHEMERD_TARGET env var override
documented in .env.example.

Verified end-to-end against a live PHP-SDK CI run: 12 distinct Linux
job containers captured with rx peaking at 502 MB and 8.2 MB/s bursts
during composer install.

Design doc: docs/arch/container-metrics.md.
…ed targets

EPHEMERD_TARGET now accepts a comma-separated list of host:port pairs so
a single rig can scrape a fleet of ephemerd nodes on the same local
network. Whitespace around commas is stripped. Prometheus auto-tags each
series with instance="<host:port>" so the dashboard splits per node
without any extra config.
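The accepted EPHEMERD_TARGET format can be illustrated with a small parser. This is a sketch of the described behaviour, not the rig's actual code: `parseTargets` is a hypothetical helper, and skipping empty entries and validating each entry as host:port are assumptions beyond what the commit message states.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// parseTargets splits a comma-separated list of host:port pairs and strips
// whitespace around each entry.
func parseTargets(raw string) ([]string, error) {
	var out []string
	for _, part := range strings.Split(raw, ",") {
		t := strings.TrimSpace(part)
		if t == "" {
			continue // assumption: tolerate trailing or doubled commas
		}
		if _, _, err := net.SplitHostPort(t); err != nil {
			return nil, fmt.Errorf("target %q: %w", t, err)
		}
		out = append(out, t)
	}
	if len(out) == 0 {
		return nil, fmt.Errorf("no targets in %q", raw)
	}
	return out, nil
}

func main() {
	// The second address is a made-up example of another node on the LAN.
	targets, err := parseTargets("host.docker.internal:9090 , 10.0.0.12:9090")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(targets)
}
```

Each resulting entry would become one static scrape target, which is what lets Prometheus attach a distinct instance label per node.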

Verified end-to-end with a real + fake target — both discovered, real
one up, fake one down with the expected DNS error.

@luthermonson merged commit 6a51bc8 into main on Jun 9, 2026
3 of 4 checks passed
@luthermonson deleted the feat/container-metrics branch June 9, 2026 03:57