feat(metrics): per-container CPU/memory/network with Grafana rig by luthermonson · Pull Request #87 · ephpm/ephemerd

luthermonson · 2026-06-08T22:46:48Z

Summary

Adds per-container CPU, memory, and network metrics to /metrics, plus a self-provisioned Prometheus + Grafana rig under examples/observability/. End-to-end stack works against the embedded Linux VM and native Windows Hyper-V containers from a single host scrape target.

Five new metric families, labelled {id, repo, runtime}:

| Name | Type | Notes |
| --- | --- | --- |
| `ephemerd_container_cpu_usage_seconds_total` | counter | cumulative CPU; use `rate()` for utilization |
| `ephemerd_container_memory_bytes` / `_memory_anon_bytes` | gauge | current usage; anon is 0 on Windows |
| `ephemerd_container_memory_limit_bytes` / `_cpu_limit` | gauge | configured caps; 0 = unlimited |
| `ephemerd_container_network_rx_bytes_total` / `_tx_bytes_total` | counter | runner container's netns only; sibling dind containers are not included |

Series are removed with DeleteLabelValues when a container is destroyed, so live cardinality is bounded by max_concurrent.
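To make the `rate()` note concrete, here is a minimal sketch of the arithmetic behind turning two samples of the cumulative CPU counter into a utilization figure. The `sample` type and `cpuCores` function are hypothetical names for illustration only; PromQL's `rate()` additionally handles counter resets and window extrapolation, which this sketch does not attempt.

```go
package main

import "fmt"

// sample is one scrape of ephemerd_container_cpu_usage_seconds_total.
// The type is an illustration, not part of ephemerd.
type sample struct {
	unixSeconds float64 // scrape time
	cpuSeconds  float64 // cumulative counter value
}

// cpuCores returns the average number of cores used between two samples:
// the increase in CPU-seconds divided by the elapsed wall-clock seconds.
// It reports ok=false when the interval is empty or the counter went
// backwards (a reset), because this simple sketch cannot compute a rate then.
func cpuCores(prev, cur sample) (cores float64, ok bool) {
	dt := cur.unixSeconds - prev.unixSeconds
	dv := cur.cpuSeconds - prev.cpuSeconds
	if dt <= 0 || dv < 0 {
		return 0, false
	}
	return dv / dt, true
}

func main() {
	prev := sample{unixSeconds: 100, cpuSeconds: 12}
	cur := sample{unixSeconds: 110, cpuSeconds: 27}
	if cores, ok := cpuCores(prev, cur); ok {
		// 15 CPU-seconds over 10 s of wall clock.
		fmt.Printf("%.2f cores\n", cores) // prints "1.50 cores"
	}
}
```

In Grafana the equivalent would typically be a `rate()` over the counter with a window a few times larger than the scrape interval.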

Architecture

Linux sampler reads cgroupv2 via containerd.Task.Metrics + /proc/<pid>/net/dev. The network reader was switched off netlink after observing "protocol not supported" on the embedded virt kernel; /proc is universal.
Windows sampler reads via hcsshim.OpenContainer + Statistics and sums BytesReceived/BytesSent across all NetworkStats endpoints.
Transport for VM-side stats: new StreamContainerStats RPC on the existing Dispatch service. The host is the client (matching the existing CreateJob/WaitJob/DestroyJob polarity), and the in-VM ephemerd server-streams batches every container_stats_interval (default 10 s). The host updates its own gauges with runtime=linux-vm. The same flow works unchanged on the Darwin Vz Linux VM.
Runtime hooks: OnTaskStarted / OnTaskDestroy callbacks on runtime.Config, plus SetTaskHooks for post-construction wiring (needed because the dispatch server depends on the runtime).

Full design doc: docs/arch/container-metrics.md.
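The runtime hooks item above mentions post-construction wiring. A rough sketch of why a setter helps when the dispatch server needs the runtime to exist first is shown below. Every type and signature here (`TaskHooks`, `Runtime`, `DispatchServer`, `wire`) is a hypothetical stand-in; only the names OnTaskStarted, OnTaskDestroy, and SetTaskHooks come from the description, and their real signatures may differ.

```go
package main

import "fmt"

// TaskHooks mirrors the callback names from the description.
// The func(id string) signatures are assumptions.
type TaskHooks struct {
	OnTaskStarted func(id string)
	OnTaskDestroy func(id string)
}

// Runtime is a hypothetical stand-in for the container runtime.
type Runtime struct{ hooks TaskHooks }

// SetTaskHooks installs hooks after the runtime has been constructed.
func (r *Runtime) SetTaskHooks(h TaskHooks) { r.hooks = h }

// Start simulates a task start and fires the hook if one is set.
func (r *Runtime) Start(id string) {
	if r.hooks.OnTaskStarted != nil {
		r.hooks.OnTaskStarted(id)
	}
}

// Destroy simulates a task teardown and fires the hook if one is set.
func (r *Runtime) Destroy(id string) {
	if r.hooks.OnTaskDestroy != nil {
		r.hooks.OnTaskDestroy(id)
	}
}

// DispatchServer is a hypothetical stand-in that needs a runtime at
// construction time and tracks which containers should be sampled.
type DispatchServer struct {
	rt      *Runtime
	tracked map[string]bool
}

func NewDispatchServer(rt *Runtime) *DispatchServer {
	return &DispatchServer{rt: rt, tracked: map[string]bool{}}
}

// wire builds the runtime first, then the server, then connects the hooks.
// Passing hooks into the runtime constructor would require the server to
// exist already, which is the cycle the setter avoids.
func wire() (*Runtime, *DispatchServer) {
	rt := &Runtime{}
	srv := NewDispatchServer(rt)
	rt.SetTaskHooks(TaskHooks{
		OnTaskStarted: func(id string) { srv.tracked[id] = true },
		OnTaskDestroy: func(id string) { delete(srv.tracked, id) },
	})
	return rt, srv
}

func main() {
	rt, srv := wire()
	rt.Start("calm_wright")
	fmt.Println(len(srv.tracked)) // prints "1"
	rt.Destroy("calm_wright")
	fmt.Println(len(srv.tracked)) // prints "0"
}
```

The sketch is single-goroutine; a real implementation would need to guard the tracked set against concurrent access.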

Observability rig

examples/observability/ is a two-container Compose rig (Prometheus + Grafana) that pre-provisions the datasource and a 12-panel dashboard. Boot is one command:

cd examples/observability
docker compose up -d

host.docker.internal:9090 is the default scrape target (works on Docker Desktop). For Podman / rootless / remote, copy .env.example to .env and set EPHEMERD_TARGET — see the README.

Verified

End-to-end against a live PHP-SDK CI run on the Hyper-V Linux VM:

12 distinct job containers captured live
Peak rx: 502 MB (heavy composer install on calm_wright)
Peak download rate: 8.2 MB/s sustained during artifact pulls
All gauges + counters render correctly in Grafana
Series correctly disappear from /metrics after DestroyJob

mage ci lint is clean (golangci-lint run ./... reports 0 issues with CGO_ENABLED=0 GOOS=linux); all touched packages pass go test ./... natively. The pre-existing pkcs11 cgo failure on Windows without MinGW also occurs on main and is not introduced by this PR.

Test plan

docker compose up -d in examples/observability/ and confirm Grafana renders the dashboard
Trigger a Linux CI job and confirm ephemerd_container_* series appear with runtime="linux-vm"
Trigger a Windows CI job and confirm runtime="windows-hyperv" series appear
Verify series disappear from /metrics after job destroy
On macOS host: confirm the same flow works via Vz Linux VM (untested locally — design is identical)

Out of scope

macOS Vz macOS VMs (different stats API; deferred).
Disk IO (cgroup io.stat and HCS StorageStats both available; will add when a user actually asks).
Per-interface network breakdown (CI workloads only ever use eth0; not worth the cardinality).
Sibling dind container metrics (each gets its own netns; not currently surfaced).

…tream

Adds five new per-container metric families to /metrics, labelled {id, repo, runtime}. Series are deleted on container destroy so live cardinality is bounded by max_concurrent:

- ephemerd_container_cpu_usage_seconds_total
- ephemerd_container_memory_bytes / _memory_anon_bytes
- ephemerd_container_memory_limit_bytes / _cpu_limit
- ephemerd_container_network_rx_bytes_total / _tx_bytes_total

Linux samples come from containerd Task.Metrics + /proc/<pid>/net/dev. Switched the network reader off netlink after observing "protocol not supported" on the embedded virt kernel; /proc is universal.

Windows host samples come from hcsshim.OpenContainer + Statistics, summing across NetworkStats endpoints.

The in-VM Linux daemon pushes batches back to the host over a new StreamContainerStats RPC on the existing Dispatch service, so the host's /metrics is the single scrape target. Series carry runtime=linux-vm vs runtime=windows-hyperv vs runtime=linux-native so Grafana can split them.

Includes examples/observability/, a Docker Compose rig that boots Prometheus + Grafana with the ephemerd dashboard pre-provisioned. Tested on Podman + WSL via the EPHEMERD_TARGET env var override documented in .env.example.

Verified end-to-end against a live PHP-SDK CI run: 12 distinct Linux job containers captured with rx peaking at 502 MB and 8.2 MB/s bursts during composer install.

Design doc: docs/arch/container-metrics.md.

…ed targets

EPHEMERD_TARGET now accepts a comma-separated list of host:port pairs so a single rig can scrape a fleet of ephemerd nodes on the same local network. Whitespace around commas is stripped. Prometheus auto-tags each series with instance="<host:port>" so the dashboard splits per node without any extra config.

Verified end-to-end with a real + fake target: both discovered, real one up, fake one down with the expected DNS error.
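The comma-separated handling described in this commit can be illustrated with a short sketch. How the rig actually implements it is not shown on this page, so the function name `splitTargets` is hypothetical, and dropping empty entries and falling back to the documented default of host.docker.internal:9090 are assumptions of the example.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultTarget is the default scrape target named in the PR description.
const defaultTarget = "host.docker.internal:9090"

// splitTargets turns an EPHEMERD_TARGET value into a list of host:port
// targets, trimming whitespace around each entry. Dropping empty entries
// and falling back to the default are assumptions of this sketch.
func splitTargets(raw string) []string {
	var out []string
	for _, part := range strings.Split(raw, ",") {
		if t := strings.TrimSpace(part); t != "" {
			out = append(out, t)
		}
	}
	if len(out) == 0 {
		return []string{defaultTarget}
	}
	return out
}

func main() {
	fmt.Println(splitTargets("node-a:9090 , node-b:9090")) // prints "[node-a:9090 node-b:9090]"
	fmt.Println(splitTargets(""))                          // prints "[host.docker.internal:9090]"
}
```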

luthermonson added 2 commits June 7, 2026 21:30

luthermonson merged commit 6a51bc8 into main Jun 9, 2026
3 of 4 checks passed

luthermonson deleted the feat/container-metrics branch June 9, 2026 03:57

luthermonson mentioned this pull request Jun 10, 2026

feat(vm): deliver host config.toml into the Linux VM via boot-initrd tail #89

Merged

5 tasks
