DGH-703: Unified k8s test framework — phase 1 + 1.5 #9066
Draft
Conversation
Introduce shared k8s test infrastructure to support a unified test
framework that can replace fault-tolerance, deployment-validation, and
load-based functional tests behind one event-based API. This is
phase 1 of DEP DGH-703 / GH#7767.
New modules under tests/utils/:
- k8s_helpers.py: shared K8s client init with KUBECONFIG -> in-cluster
-> default kubeconfig fallback. Used by both ManagedDeployment and
ManagedLoad so each test type talks to the cluster the same way.
- pvc_extractor.py: unified PVC file extraction via a busybox download
job (tar matching files, stream to local, extract, cleanup). Used
for both service logs and aiperf result files.
- managed_load.py: reusable ManagedLoad with a LoadConfig dataclass.
Creates a K8s Job from a YAML template, builds the aiperf command,
and manages start/wait/terminate/get_results.
- templates/log_download_job.yaml + log_wrapper.sh: backing assets.
ServiceSpec additions (additive, no behavior change to existing APIs):
- _ensure_path() helper to eliminate repetitive nested-dict guards.
- component_type property reading spec["componentType"].
- set_arg(arg_name, arg_value): set or update a launch arg.
Quote-aware via shlex.split (matches DEP intent).
- _add_volume_mount() / _add_env_var() idempotent helpers.
- enable_log_collection(log_dir, pvc_name): wrap mainContainer command
to tee output into a PVC-mounted directory.
DeploymentSpec additions:
- from_backend(backend, deployment_type) classmethod loading
examples/backends/<backend>/deploy/<type>.yaml.
- backend property: detect from YAML path, fall back to service names.
__init__ now stores _base_path so path-based detection works.
- worker_services(): non-frontend service names.
- set_worker_replicas(n): set replicas across every worker service.
- enable_log_collection(): declares a deployment-level log PVC and
enables wrapping for all (or named) services. ManagedDeployment
materializes the PVC at deploy time.
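A minimal usage sketch of how these additive helpers are meant to compose. The module path, service name, and argument values are illustrative assumptions; only the method names come from the list above.

```python
# Hypothetical usage sketch; module path and values are assumptions.
from tests.utils.managed_deployment import DeploymentSpec  # assumed module path

spec = DeploymentSpec.from_backend("vllm", "agg")   # loads examples/backends/vllm/deploy/agg.yaml
spec.set_worker_replicas(2)                         # applies to every non-frontend service
spec["decode"].set_arg("--max-model-len", "4096")   # quote-aware set/update via shlex.split
spec.enable_log_collection()                        # declares the log PVC; ManagedDeployment creates it at deploy time
```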
Consistency fix in ServiceSpec._get_args(): scalar string args are now
normalized via shlex.split() instead of naive str.split(), matching
set_arg's normalization so quoted launch args don't fragment.
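For context, the difference the change targets, using only the standard library (the flag and value below are made up):

```python
import shlex

raw = '--served-model-name "my model"'  # illustrative quoted launch arg
shlex.split(raw)   # ['--served-model-name', 'my model']       -> one token, quotes honored
raw.split()        # ['--served-model-name', '"my', 'model"']  -> naive split fragments it
```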
Surgical edit to the legacy scenarios.py (deprecated header added):
- Token-overflow scenario now uses
deployment_spec[name].set_arg(arg, val) in place of the older
deployment_spec.add_arg_to_service(name, arg, val). The legacy
helper stays for back-compat callers.
ManagedDGDR (added by #7343 after this branch was last touched) is
preserved untouched. DGDR support in the unified framework is a
follow-up.
Refs: DGH-703, GH#7767
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: nnshah1 <[email protected]>
Introduce an event-based declarative API for k8s tests. A test author
writes only four things — deployment spec, events, loads, and checks —
and run_scenario() handles deploy / event execution / log extraction /
cleanup / validation. This is the unified surface that fault-tolerance,
deployment-validation, and load-based functional tests should converge
on; legacy fixture-driven scenarios remain in place for back-compat.
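A hedged sketch of that four-part shape. The module paths come from the list below; the event and check parameters, the service name, and whether run_scenario is awaited are assumptions.

```python
# Sketch only: parameters and service name are illustrative, not verified API.
from tests.fault_tolerance.deploy.events import StartLoad, Wait, DeletePod, StopLoad
from tests.fault_tolerance.deploy.checks import MinRequests
from tests.fault_tolerance.deploy.scenario import run_scenario

async def test_worker_kill(deployment_spec):
    await run_scenario(
        deployment_spec=deployment_spec,
        events=[
            StartLoad(),                  # begin load against the frontend
            Wait(30),                     # steady-state window
            DeletePod(service="decode"),  # the fault under test
            Wait(60),                     # recovery window
            StopLoad(),
        ],
        checks=[MinRequests(100)],        # validation is explicit, not in fixture teardown
    )
```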
New modules under tests/fault_tolerance/deploy/:
- events.py: Event ABC plus built-ins (StartLoad, StopLoad, Wait,
DeletePod, RollingUpgrade, TerminateProcess, WaitForRecovery,
WaitForLogPattern, RunCommand, WaitForLoadCompletion). Each event
is a dataclass with execute() / stop() / description and is
discoverable via __all__.
- checks.py: Check ABC plus built-ins (ZeroErrors, MaxErrors,
MinRequests, WasCancelled, ServiceLogContains,
ServiceLogNotContains). Validation is explicit, not buried in
fixture teardown.
- reports.py: Report ABC for post-check artifact generation
(HTML reports, metrics summaries, etc.).
- scenario.py: run_scenario(deployment_spec, events, checks, ...)
orchestrator and ScenarioContext. Self-contained narrative; no
jumping between conftest, scenarios, and test files.
- test_deployment_scenario.py: tests rewritten on the new API to
exercise the framework end-to-end.
- TESTING.md: agent-friendly reference doc cataloging events, checks,
common patterns, and copy-paste examples.
Also adds tests/utils/resource_monitor.py — pulled in transitively by
the framework for periodic GPU/process snapshots.
conftest.py is updated additively, not replaced:
- Legacy --include-custom-build flag, pytest_generate_tests, and
pytest_collection_modifyitems are preserved so the existing
test_fault_scenario parametrization keeps working.
- --storage-class added for PVC log collection (RWX-capable class).
- --restart-services added, with skip_service_restart now defaulting
to True (the iteration-friendly path most users already hit via
--skip-service-restart). The legacy flag is still honored when
explicitly passed.
- storage_class fixture exposes the new option.
Refs: DGH-703, GH#7767
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: nnshah1 <[email protected]>
…GH-703 phase 1)
Mark 10 legacy files in tests/fault_tolerance/deploy/ with a uniform
deprecation header pointing readers to the event-based framework introduced
in the previous commit. No behavior change — these files remain importable
so existing test_fault_scenario parametrization keeps running.
Files frozen:
- base_checker.py
- checker_factory.py
- checkers.py
- client.py
- client_factory.py
- legacy_client.py
- legacy_parse_results.py
- parse_factory.py
- parse_results.py
- test_deployment.py
Migration of these tests onto run_scenario() (followed by deletion of the
legacy code) is DEP DGH-703 phase 2 and tracked separately.
Refs: DGH-703, GH#7767
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: nnshah1 <[email protected]>
…erance report (DGH-703 phase 1.5)
Closes the gaps the manual port from the _2 branch left behind so the
unified framework runs end-to-end on a real cluster.
ManagedDeployment:
- Ports the missing surface from _2: _create_log_collection_pvc,
_verify_pvc_binding, _extract_logs_from_pvc, _cleanup_log_collection_pvc,
_cleanup_orphaned_jobs, get_log_pvc_name, apply_service_changes,
_exec_in_pod, _get_pod_metrics, _get_pod_manifest, plus ServiceSpec
setters (set_env_var, set_readiness_probe, set_termination_grace_period)
and DeploymentSpec.frontend_service / get_in_cluster_frontend_url.
- Adds _wait_for_cr_deleted + _wait_for_pods_terminated, called from
_delete_deployment so a new run never races leftover pods from a
previous failed run.
- Adds _scrub_namespace at __aenter__: deletes all DGDs, log-collection
PVCs, and load/extract/verify jobs in the test namespace before
starting. Test authors don't have to scrub between runs.
- Wires lifecycle helpers into the right points: PVC verify-binding
after create, log extraction + orphan-jobs cleanup + PVC delete on
__aexit__.
- apply_service_changes now publishes the FULL ServiceSpec instead of
just envs, so callers can drive any rolling-upgrade-style mutation
(image, replicas, args, envs, ...) through ServiceSpec setters and
one publish call.
- Drops 0-caller delegators on DeploymentSpec (get_model,
set_service_env_var, add_arg_to_service, get_service,
get_service_env_vars, set_service_readiness_probe,
set_service_termination_grace_period). Per-service mutations go
through spec[name] directly.
- Drops redundant pre-delete _get_pod_logs call: the PVC log tee
already captures stdout/stderr, kubectl logs at delete time is just
a duplicate read.
Event base + scenario runner:
- Splits public template (Event.timed_execute) from subclass hook
(Event.execute). Subclasses keep writing the natural execute(ctx);
the framework wraps with started_at/ended_at timestamping that
reports use to slice timestamped data by event boundary.
- Caches deployment.startup_seconds on ScenarioContext before
clearing the deployment ref so reports can still see it.
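Roughly, the template/hook split looks like the sketch below. Only execute, timed_execute, started_at, and ended_at come from the text above; the async signature and the ctx parameter shape are assumptions.

```python
import time
from abc import ABC, abstractmethod

class Event(ABC):
    started_at: float | None = None
    ended_at: float | None = None

    async def timed_execute(self, ctx) -> None:
        # public template: stamp event boundaries so reports can bucket by event
        self.started_at = time.time()
        try:
            await self.execute(ctx)  # subclass hook stays the natural execute(ctx)
        finally:
            self.ended_at = time.time()

    @abstractmethod
    async def execute(self, ctx) -> None:
        ...
```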
FaultToleranceReport:
- New concrete Report subclass. Reads aiperf's per-request JSONL
(profile_export.jsonl), buckets records around each fault event's
started_at, and renders the legacy markdown table:
| Failure | Startup (s) | Success Before | Failed Before | Success After | Failed After | Latency Before (ms) | Latency After (ms) |
- Detects request failures correctly: was_cancelled OR missing
request_latency (the latter is what aiperf TimeoutError records
look like in the JSONL).
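The bucketing and failure test reduce to something like this sketch. was_cancelled and request_latency come from the commit text; the per-record timestamp field name is an assumption.

```python
import json

def bucket_requests(jsonl_path: str, fault_started_at: float):
    """Split aiperf per-request records into before/after the fault event."""
    before, after = [], []
    with open(jsonl_path) as f:
        for line in f:
            rec = json.loads(line)
            # failure := cancelled OR no latency recorded (the aiperf TimeoutError shape)
            rec["failed"] = bool(rec.get("was_cancelled")) or rec.get("request_latency") is None
            # assumption: records carry a start timestamp comparable to event time
            bucket = before if rec.get("start_time", 0.0) < fault_started_at else after
            bucket.append(rec)
    return before, after
```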
Templates:
- Adds tests/utils/templates/load_job.yaml needed by managed_load.py
but missing from the original port.
Verified by 3 sanity-test runs and 2 fault-test runs (DeletePod + report)
on a local k3s cluster with the bundled mocker backend.
Signed-off-by: nnshah1 <[email protected]>
Contributor
…DGH-703 phase 1.5)
Before this change the framework spread per-test artifacts across two
roots (cwd-relative for ManagedLoad, /tmp/dynamo_tests for everything
else) AND two layouts within each root (Frontend/ for kubectl-side
manifests/logs, services/<service-lower>/ for PVC-tee'd stdout). Inside
a dev container with the worktree mounted at /workspace this made
artifacts effectively invisible from the host.
Consolidates:
- run_scenario resolves log_dir once via resolve_test_output_path so
every component (ManagedDeployment, ManagedLoad, the
FaultToleranceReport, the conftest test.log handler) writes to one
absolute path.
- DYN_TEST_OUTPUT_PATH default for these k8s tests is set in the local
conftest (not the shared resolver) to <cwd>/test_outputs — preserves
the /tmp/dynamo_tests default for non-k8s tests on the host while
giving us host-visible artifacts in a dev container.
- _get_pod_manifest, _get_pod_metrics, get_pod_manifest_logs_metrics
all write to the LOWERCASED service dir so manifest/metrics paths
overlay the PVC-extracted stdout paths (PVC tee uses lowercase).
- _extract_logs_from_pvc lands files directly into <log_dir> instead
of <log_dir>/services — the PVC structure (<service-lower>/<pod>.log)
becomes the per-service subdir, no extra middle layer.
- get_pod_manifest_logs_metrics no longer writes <pod>.log or
<pod>.previous.log via kubectl. The container's stdout is already
captured continuously to the PVC by the tee wrapper, so kubectl logs
is just a duplicate. Dropping it also kills the empty .previous.log
files that confused readers (kubectl previous-log only has content
when the SAME pod's container restarted in place; for our DeletePod
scenarios the replacement pod has no previous incarnation).
- .gitignore picks up test_outputs/ so artifacts don't pollute the git
tree.
Result: each test produces one directory, with one subdir per service,
holding every per-pod artifact together:
test_outputs/<test>/
├── decode/
│ ├── <pod>_<ts>.log (PVC stdout, full lifetime)
│ ├── <pod>.yaml (manifest at end of life)
│ ├── <pod>.before_delete.yaml (manifest pre-fault, if killed)
│ └── <pod>_<ts>.metrics.before_delete.log (metrics pre-fault, if killed)
├── frontend/
│ ├── <pod>_<ts>.log
│ └── <pod>.yaml
├── load/ aiperf artifacts
├── test.log.txt framework log
└── fault_tolerance_report.md report table
Verified by re-running test_worker_kill_mocker without any env var
set: artifacts land in the consolidated tree, report renders with
Startup populated and the killed pod's pre/post window bucketed.
Signed-off-by: nnshah1 <[email protected]>
…tion w/ conntrack flush, richer reports (DGH-703)
What changed:
- WaitForModelReady event: port-forwards to the frontend pod and polls
  /v1/models so vllm/sglang model-load (which can run tens of seconds past
  pod-Ready) doesn't bleed into the load window. Auto-detects the model name
  from the spec, parsing the bash-wrapped command for --model when
  ServiceSpec.model returns None (the log-collection wrapper hides args
  inside command[]).
- NetworkPartition: applies a NetworkPolicy targeting the per-service
  nvidia.com/dynamo-component label, then schedules a privileged hostNetwork
  pod on the source pod's node that runs `conntrack -D` for source<->target
  flows in both directions (sketched after this commit message).
  NetworkPolicy alone is connection-tracked, so the pooled TCP request-plane
  socket trivially survives policy creation; the flush is what makes the
  partition observable as a fault. Mocker run goes from 0 errors → ~5k errors
  during the 20s window; vllm sees ~7k errors with the expected
  "Model not found" / "Connection refused" / "TCP timeout" mix in the error
  breakdown.
- LoadConfig.connection_reuse_strategy: maps to aiperf's
  --connection-reuse-strategy (pooled / never / sticky-user-sessions). Used
  by partition tests so each request opens a fresh aiperf <-> frontend
  socket; not strictly required for the partition (the conntrack flush is the
  load-bearing step) but documents the intent and keeps aiperf-side
  connection state from masking other faults.
- LoadStopped / LoadCompleted checks: replace the ambiguous WasCancelled
  name. LoadStopped asserts the load was cut short by StopLoad (matches
  kill/partition scenarios); LoadCompleted asserts the load reached its
  configured request_count or duration without intervention (the sanity
  baseline).
- FaultToleranceReport rewrite: three narrow Markdown sections
  (Timing | Counts | Latency) with a Recovery (s) column derived from
  WaitForRecovery's elapsed bracket (or partition duration for transient
  partitions). JSON sidecar (fault_tolerance_report.json) with the same data
  so the vault aggregator can produce a multi-scenario combined comparison.
- ErrorBreakdownReport: per-load aiperf error-type breakdown so the dominant
  failure mode (404 Model not found vs. 500 Connection refused vs. TCP
  timeout) is visible at a glance.
- ManagedDeployment: spliced 19 helpers from the _2 branch
  (_create_log_collection_pvc, _wait_for_cr_deleted, _wait_for_pods_terminated,
  _scrub_namespace, _verify_namespace_scrubbed, get_log_pvc_name,
  apply_service_changes, _get_pod_manifest, _get_pod_metrics async, ...).
  PVC log collection is now always-on with per-pod tee wrappers; logs land in
  a single per-test directory tree alongside reports, manifests, and aiperf
  output. Cleanup is two-phase (drain log PVC before deleting CR, then
  scrub-verify post-condition) so a killed test leaves the namespace exactly
  as clean as it found it.
- README rewrite: leads with the event-based design (component architecture,
  data-flow, lifecycle, per-fault state diagrams for DeletePod /
  TerminateProcess / NetworkPartition+conntrack), the supported fault-event
  table, an example test, and an example combined comparison report including
  the aiperf server-metrics table (Inflight / Queued / Disconnected).
  Adds a "Current gaps" section enumerating what's not yet covered
  (partial-network faults, cross-namespace partitions, GPU faults,
  multi-replica, disagg topologies, SLA bounds) and a "Future direction"
  section pointing at Oviya's fault-injection-service work
  (oviya/fault-injection/dev and friends — API service, network injector
  DaemonSet, Chaos Mesh integration, GPU-fault tooling, Python client
  library) as the convergence path off the privileged conntrack-flush hack.
  Keeps the legacy parametrised scenario docs below as "Legacy scenario
  harness".
Signed-off-by: nnshah1 <[email protected]>
…t modes (DGH-703)
Eight per-mode tests (sanity / worker_kill / engine_kill / network_partition
× mocker / vllm) drive the event-based harness end to end and emit the
per-test FaultToleranceReport + ErrorBreakdownReport that the vault
aggregator combines into the cross-scenario comparison. Useful both as a
working smoke set for the framework and as the fixtures the next round of
work (Chaos Mesh, fault-injection-service) will plug into.
Signed-off-by: nnshah1 <[email protected]>
…GH-703)
StallProcess pauses a named process via SIGSTOP, holds for an optional
duration, then resumes via SIGCONT. Models a "hung worker" — pod IP,
container, TCP sockets, and conntrack entries all stay alive, but the worker
stops servicing requests because it isn't scheduled. Exercises frontend
timeout / health-check behaviour rather than crash recovery, which is what
TerminateProcess (SIGKILL) covers.
- StallProcess event with optional duration: transient stalls heal inside
  execute(); duration=None holds until scenario stop().
- test_engine_stall_mocker / test_engine_stall_vllm: 20 s mid-load stall on
  the decode worker; FaultToleranceReport + ErrorBreakdownReport show the
  queued/timed-out behaviour vs. clean recovery on SIGCONT.
- README: fault-events table now lists StallProcess; new per-fault state
  diagram for the stall path so the distinction vs. TerminateProcess and
  NetworkPartition is explicit.
Signed-off-by: nnshah1 <[email protected]>
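The signal sequence behind the stall, shown against a local PID for illustration; in the actual event the signals would be delivered inside the target container (e.g. via an exec), which this sketch does not attempt.

```python
import os
import signal
import time
from typing import Optional

def stall(pid: int, duration: Optional[float]) -> None:
    os.kill(pid, signal.SIGSTOP)      # pause: sockets, pod IP, conntrack state stay alive
    if duration is not None:
        time.sleep(duration)          # transient stall heals inside the event
        os.kill(pid, signal.SIGCONT)  # resume scheduling; no crash or restart involved
    # duration=None: the caller resumes later (scenario stop()), per the event description
```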
…ty test (DGH-703)
Two changes that close the loop on capturing worker-side prometheus
metrics (vllm:* counters that the frontend's /metrics never sees,
including the NIXL counters tracked on the P instance in disagg):
- LoadConfig.extra_server_metrics_urls (Optional[list[str]]): URLs
passed to aiperf's --server-metrics in addition to the inference
endpoint base URL. StartLoad auto-populates this from each non-
frontend worker's pod IP + system_port/metrics when the user does
not explicitly set it. Pod IPs are valid for the duration of a
pod's lifetime — tests that destroy and recreate a worker
(DeletePod) will see the worker scrape stop on recovery; the
framework's per-pod _get_pod_metrics snapshot at fault-injection
time still captures that point.
- test_sanity_vllm_disagg.py: from_backend("vllm", "disagg") with
Qwen3-0.6B, 1 prefill + 1 decode. The prefill loads the
NixlConnector so vllm:nixl_num_kv_expired_reqs and
vllm:nixl_num_failed_notifications register, and the StartLoad
auto-scrape collects them. Documents the >= 2 GPU node
requirement (single-GPU clusters can't schedule both workers).
Signed-off-by: nnshah1 <[email protected]>
…sagg sanity (DGH-703)
k8s mirror of examples/backends/vllm/launch/disagg_same_gpu.sh:
prefill + decode co-resident on one GPU via tiny
--gpu-memory-utilization (0.01), capped --kv-cache-memory-bytes
(~976 MiB), --enforce-eager, and --max-model-len 4096. NixlConnector
with kv_role=kv_both handles the prefill->decode KV transfer so
vllm:nixl_* counters register on the prefill (P) instance, where
they're scraped by the system_port pass-through that StartLoad's
auto-discovery hands to aiperf.
DYN_SYSTEM_PORT distinct per worker (8081 / 8082) so the two
prometheus servers don't collide. VLLM_NIXL_SIDE_CHANNEL_PORT bumped
on prefill (20097) to avoid the default-20096 conflict with decode.
test_sanity_vllm_disagg now loads templates/vllm/disagg_same_gpu.yaml
directly instead of from_backend("vllm", "disagg"), so it can run on
single-GPU clusters with nvidia-device-plugin time-slicing (still
needs both nvidia.com/gpu:1 slots, but they're satisfied by a
sliced GPU).
Signed-off-by: nnshah1 <[email protected]>
… port overrides on disagg yaml (DGH-703)
Two fixes that together let the disagg sanity test come up on a
single-GPU cluster with nvidia-device-plugin time-slicing
(replicas=2). Validated end-to-end: aiperf scrapes Frontend +
decode + prefill /metrics, the captured server_metrics_export.json
contains the full vllm:nixl_* counter set including
vllm:nixl_num_kv_expired_reqs.
- ServiceSpec.enable_log_collection: shlex.join the original
command + args before pasting into the bash -c heredoc. The naive
" ".join broke any arg with shell metacharacters — notably the
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
JSON literal, where bash interpreted the braces as brace-expansion
("kv_connector:NixlConnector kv_role:kv_both") and vllm rejected
the resulting non-JSON (see the sketch after this list).
- disagg_same_gpu.yaml: removed the per-worker DYN_SYSTEM_PORT
(8081/8082) and VLLM_NIXL_SIDE_CHANNEL_PORT (20097) overrides.
Those were carried over from the standalone disagg_same_gpu.sh
where both workers share a host network, but each k8s worker has
its own pod IP — so the default ports (9090, 20096) don't collide.
The overrides were causing the operator's startup probe to keep
hitting 9090 while the worker exposed 8081/8082, looping into
CrashLoopBackOff after the failureThreshold.
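The sketch referenced above: standard-library behaviour only, with an illustrative command list rather than the real generated wrapper.

```python
import shlex

cmd = [
    "python3", "-m", "dynamo.vllm",  # illustrative entrypoint, not the real wrapper
    "--kv-transfer-config", '{"kv_connector":"NixlConnector","kv_role":"kv_both"}',
]

" ".join(cmd)    # leaves {...,...} unquoted -> bash brace expansion mangles the JSON
shlex.join(cmd)  # single-quotes the JSON literal so bash passes it through intact
```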
Signed-off-by: nnshah1 <[email protected]>
Summary
Implements DGH-703 (DEP: Unified K8s Test Framework for Dynamo) as 4 commits — phase 1 (the framework + freeze of legacy) and phase 1.5 (full surface port + reliable lifecycle + fault report).
Tracks: GH#7767, Linear DGH-703.
What's in this PR
Phase 1 (3 existing commits)
- refactor(tests): shared k8s infrastructure (`k8s_helpers`, `pvc_extractor`,
  `managed_load`, additive `ManagedDeployment` APIs, shlex consistency)
- feat(tests): event-based declarative framework (`events`, `checks`,
  `scenario`, `test_deployment_scenario`, TESTING.md, resource monitor,
  conftest UNION)
- chore(tests): deprecation headers freezing 10 legacy FT files
Phase 1.5 (new commit c968d1c0b7)
- Ports the missing surface from `_2` so the framework runs end-to-end
- `_wait_for_cr_deleted` + `_wait_for_pods_terminated` + `_scrub_namespace` +
  wired-in lifecycle helpers
- `Event.timed_execute` template stamps `started_at`/`ended_at` so reports can
  bucket time-stamped data
- `FaultToleranceReport` renders the legacy-style markdown table:
  `| Failure | Startup (s) | Success Before | Failed Before | Success After | Failed After | Latency Before (ms) | Latency After (ms) |`
- `apply_service_changes` to publish the full `ServiceSpec` (image / replicas /
  args / envs / …) instead of just envs
- Drops 0-caller delegators on `DeploymentSpec`; mutations go through
  `spec[name]` directly
Verified locally
Out of scope (filed as follow-ups)
- `mgr.collect_timeline()` reading pod-status conditions
Test plan
- `pytest --collect-only tests/fault_tolerance/deploy/test_deployment_scenario.py`
  — confirms imports + parametrization (20 tests collected)
- `test_sanity_mocker` against k3s with mocker backend — passes
- `test_worker_kill_mocker` (DeletePod scenario) — passes, report rendered
Notes