feat(metrics): per-container CPU/memory/network with Grafana rig #87
Merged
…tream
Adds five new per-container metric families to /metrics, labelled
{id, repo, runtime}. Series are deleted on container destroy so live
cardinality is bounded by max_concurrent:
- ephemerd_container_cpu_usage_seconds_total
- ephemerd_container_memory_bytes / _memory_anon_bytes
- ephemerd_container_memory_limit_bytes / _cpu_limit
- ephemerd_container_network_rx_bytes_total / _tx_bytes_total
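The delete-on-destroy rule above is what keeps cardinality bounded. A minimal sketch of the idea, using a plain map as a stand-in for a labelled gauge vector: `gaugeStore`, `seriesKey`, and the sample label values are hypothetical names for illustration, not code from this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// seriesKey mirrors the {id, repo, runtime} label set.
type seriesKey struct {
	ID, Repo, Runtime string
}

// gaugeStore is a hypothetical stand-in for a labelled gauge vector.
type gaugeStore struct {
	mu     sync.Mutex
	values map[seriesKey]float64
}

func newGaugeStore() *gaugeStore {
	return &gaugeStore{values: make(map[seriesKey]float64)}
}

// Set records the latest sample, creating the series if needed.
func (s *gaugeStore) Set(k seriesKey, v float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.values[k] = v
}

// Delete drops a container's series and reports whether it existed.
func (s *gaugeStore) Delete(k seriesKey) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	_, ok := s.values[k]
	delete(s.values, k)
	return ok
}

// Len returns the number of live series.
func (s *gaugeStore) Len() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.values)
}

func main() {
	store := newGaugeStore()
	a := seriesKey{ID: "job-a", Repo: "example/repo", Runtime: "linux-vm"}
	b := seriesKey{ID: "job-b", Repo: "example/repo", Runtime: "windows-hyperv"}
	store.Set(a, 64<<20)
	store.Set(b, 128<<20)
	store.Delete(a) // container destroyed: its series goes away
	fmt.Println("live series:", store.Len())
}
```

Because every destroy removes exactly the series that container created, the number of live series per family can never exceed the number of live containers.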
Linux samples come from containerd Task.Metrics + /proc/<pid>/net/dev.
Switched the network reader off netlink after observing
"protocol not supported" on the embedded virt kernel — /proc is
universal. Windows host samples come from hcsshim.OpenContainer +
Statistics, summing across NetworkStats endpoints.
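For reference, reading rx/tx totals from the `/proc/<pid>/net/dev` format could look roughly like the sketch below. `sumNetDev`, `NetTotals`, and the sample data are hypothetical; skipping the loopback interface is an assumption of this sketch, not something the PR states.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// NetTotals holds byte counters summed across a container's interfaces.
type NetTotals struct {
	RxBytes uint64
	TxBytes uint64
}

// sumNetDev parses text in the /proc/<pid>/net/dev format and sums the
// receive and transmit byte counters of every interface except loopback.
// Skipping "lo" is an assumption of this sketch.
func sumNetDev(text string) (NetTotals, error) {
	var t NetTotals
	for _, line := range strings.Split(text, "\n") {
		// Interface rows look like "eth0: <16 counters>"; the two header
		// rows and blank lines contain no colon and are skipped.
		name, rest, ok := strings.Cut(line, ":")
		if !ok {
			continue
		}
		name = strings.TrimSpace(name)
		if name == "lo" {
			continue
		}
		f := strings.Fields(rest)
		if len(f) < 16 {
			return NetTotals{}, fmt.Errorf("net/dev: short row for %q", name)
		}
		rx, err := strconv.ParseUint(f[0], 10, 64) // first receive column: bytes
		if err != nil {
			return NetTotals{}, fmt.Errorf("net/dev: rx bytes for %q: %w", name, err)
		}
		tx, err := strconv.ParseUint(f[8], 10, 64) // first transmit column: bytes
		if err != nil {
			return NetTotals{}, fmt.Errorf("net/dev: tx bytes for %q: %w", name, err)
		}
		t.RxBytes += rx
		t.TxBytes += tx
	}
	return t, nil
}

// sampleNetDev is made-up data in the kernel's layout, for illustration only.
const sampleNetDev = `Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:    1000      10    0    0    0     0          0         0     1000      10    0    0    0     0       0          0
  eth0:  502000     400    0    0    0     0          0         0    12000     150    0    0    0     0       0          0
  eth1:    3000      20    0    0    0     0          0         0      500       5    0    0    0     0       0          0
`

func main() {
	t, err := sumNetDev(sampleNetDev)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("rx=%d tx=%d\n", t.RxBytes, t.TxBytes)
}
```

A real reader would load the text with `os.ReadFile` on the task's `/proc/<pid>/net/dev` path; the parsing itself needs nothing beyond the standard library, which is the portability argument for preferring `/proc` over netlink here.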
The in-VM Linux daemon pushes batches back to the host over a new
StreamContainerStats RPC on the existing Dispatch service, so the
host's /metrics is the single scrape target. Series carry
runtime=linux-vm vs runtime=windows-hyperv vs runtime=linux-native so
Grafana can split them.
Includes examples/observability/ — a Docker Compose rig that boots
Prometheus + Grafana with the ephemerd dashboard pre-provisioned.
Tested on Podman + WSL via the EPHEMERD_TARGET env var override
documented in .env.example.
Verified end-to-end against a live PHP-SDK CI run: 12 distinct Linux
job containers captured with rx peaking at 502 MB and 8.2 MB/s bursts
during composer install.
Design doc: docs/arch/container-metrics.md.
…ed targets
EPHEMERD_TARGET now accepts a comma-separated list of host:port pairs
so a single rig can scrape a fleet of ephemerd nodes on the same local
network. Whitespace around commas is stripped. Prometheus auto-tags
each series with instance="<host:port>" so the dashboard splits per
node without any extra config. Verified end-to-end with a real + fake
target: both discovered, real one up, fake one down with the expected
DNS error.
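The list semantics described above (split on commas, strip surrounding whitespace) can be illustrated with a short Go sketch. `splitTargets` is a hypothetical helper; the rig may well do this in its entrypoint or Compose templating rather than in Go, and dropping empty entries is an assumption added here.

```go
package main

import (
	"fmt"
	"strings"
)

// splitTargets turns "a:9090, b:9090" into ["a:9090", "b:9090"].
// Whitespace around each entry is trimmed; empty entries (for example
// from a trailing comma) are dropped, which is an assumption of this sketch.
func splitTargets(raw string) []string {
	var out []string
	for _, part := range strings.Split(raw, ",") {
		if t := strings.TrimSpace(part); t != "" {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	// The second address is a made-up LAN host for illustration.
	for _, t := range splitTargets("host.docker.internal:9090 , 192.168.1.20:9090") {
		fmt.Println(t)
	}
}
```

Each resulting entry becomes one scrape target, and Prometheus attaches it to the series as the `instance` label.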
Summary
Adds per-container CPU, memory, and network metrics to `/metrics`, plus a self-provisioned Prometheus + Grafana rig under `examples/observability/`. End-to-end stack works against the embedded Linux VM and native Windows Hyper-V containers from a single host scrape target.

Five new metric families, labelled `{id, repo, runtime}`:
- `ephemerd_container_cpu_usage_seconds_total` (`rate()` for utilization)
- `ephemerd_container_memory_bytes` / `_memory_anon_bytes`
- `ephemerd_container_memory_limit_bytes` / `_cpu_limit`
- `ephemerd_container_network_rx_bytes_total` / `_tx_bytes_total`

Series get `DeleteLabelValues`'d on container destroy so live cardinality is bounded by `max_concurrent`.

Architecture
- Linux: `containerd.Task.Metrics` + `/proc/<pid>/net/dev`. Switched off netlink after observing `protocol not supported` on the embedded virt kernel; `/proc` is universal.
- Windows host: `hcsshim.OpenContainer + Statistics`, sums `BytesReceived` / `BytesSent` across all `NetworkStats` endpoints.
- In-VM Linux daemon: `StreamContainerStats` RPC on the existing Dispatch service. Host is the client (matches existing CreateJob/WaitJob/DestroyJob polarity), in-VM ephemerd server-streams batches every `container_stats_interval` (default 10 s). Host updates its own gauges with `runtime=linux-vm`. Same flow works on Darwin Vz Linux VM unchanged.
- `OnTaskStarted` / `OnTaskDestroy` callbacks on `runtime.Config`; `SetTaskHooks` for post-construction wiring (needed because the dispatch server depends on the runtime).

Full design doc: `docs/arch/container-metrics.md`.

Observability rig
`examples/observability/` is a two-container Compose rig (Prometheus + Grafana) that pre-provisions the datasource and a 12-panel dashboard. Boot is one command:

    cd examples/observability
    docker compose up -d

`host.docker.internal:9090` is the default scrape target (works on Docker Desktop). For Podman / rootless / remote, copy `.env.example` to `.env` and set `EPHEMERD_TARGET`; see the README.

Verified
End-to-end against a live PHP-SDK CI run on the Hyper-V Linux VM:
- … `calm_wright`)
- … `/metrics` after `DestroyJob`
- `mage ci` lint clean (`golangci-lint run ./...` 0 issues for `CGO_ENABLED=0 GOOS=linux`); all touched packages pass `go test ./...` natively. The persistent pre-existing pkcs11 cgo failure on Windows-without-MinGW is the same on `main` and is not introduced by this PR.

Test plan
- `docker compose up -d` in `examples/observability/` and confirm Grafana renders the dashboard
- … `ephemerd_container_*` series appear with `runtime="linux-vm"`
- … `runtime="windows-hyperv"` series appear
- … `/metrics` after job destroy

Out of scope
- … `io.stat` and HCS `StorageStats` both available; will add when a user actually asks).
- … `eth0`; not worth the cardinality).