
DGH-703: Unified k8s test framework — phase 1 + 1.5 #9066

Draft
nnshah1 wants to merge 11 commits into main from
neelays/dgh-703-dep-unified-k8s-test-framework-for-dynamo

Conversation


nnshah1 (Contributor) commented May 3, 2026

Summary

Implements DGH-703 (DEP: Unified K8s Test Framework for Dynamo) as four commits: phase 1 (the framework plus a freeze of the legacy files) and phase 1.5 (full surface port, reliable lifecycle, and the fault-tolerance report).

Tracks: GH#7767, Linear DGH-703.

What's in this PR

Phase 1 (3 existing commits)

  • refactor(tests): shared k8s infrastructure (k8s_helpers, pvc_extractor, managed_load, additive ManagedDeployment APIs, shlex consistency)
  • feat(tests): event-based declarative framework (events, checks, scenario, test_deployment_scenario, TESTING.md, resource monitor, conftest UNION)
  • chore(tests): deprecation headers freezing 10 legacy FT files

Phase 1.5 (new commit c968d1c0b7)

  • Ports the missing surface from _2 so the framework runs end-to-end
  • Reliable cleanup: _wait_for_cr_deleted + _wait_for_pods_terminated + _scrub_namespace + wired-in lifecycle helpers
  • Event.timed_execute template stamps started_at/ended_at so reports can bucket time-stamped data
  • New FaultToleranceReport renders the legacy-style markdown table:
    | Failure | Startup (s) | Success Before | Failed Before | Success After | Failed After | Latency Before (ms) | Latency After (ms) |
  • Generalizes apply_service_changes to publish the full ServiceSpec (image / replicas / args / envs / …) instead of just envs
  • Drops zero-caller per-service delegators on DeploymentSpec; mutations go through spec[name] directly (see the sketch below)
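
Illustrative sketch of the spec[name] + apply_service_changes flow (the setter names come from this PR; exact signatures, argument values, and async-ness are assumptions):

```python
# Hypothetical usage sketch -- spec[name], the ServiceSpec setters, and
# apply_service_changes come from this PR; argument shapes are assumed.
spec = deployment_spec["decode"]            # per-service ServiceSpec
spec.set_env_var("EXAMPLE_ENV", "debug")
spec.set_arg("--max-model-len", "4096")     # quote-aware via shlex
spec.set_termination_grace_period(30)

# Publishes the full ServiceSpec (image / replicas / args / envs / ...),
# so one call drives any rolling-upgrade-style mutation.
await deployment.apply_service_changes("decode")
```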

Verified locally

  • 3× sanity-test passes (mocker backend, k3s)
  • 2× fault-test passes (DeletePod + FaultToleranceReport renders correctly with non-zero pre/post buckets)

Out of scope (filed as follow-ups)

  • Recovery time column (needs mgr.collect_timeline() reading pod-status conditions)
  • Frontend behavior: the frontend buffers requests indefinitely when no worker is ready, so aiperf only sees its own task-cancellation timeouts. Filing a separate frontend issue — not a framework problem.
  • Switch RWX shared PVC → per-pod RWO PVCs OR fluent-bit sidecar (eliminates the RWX storage requirement)

Test plan

  • pytest --collect-only tests/fault_tolerance/deploy/test_deployment_scenario.py — confirms imports + parametrization (20 tests collected)
  • Run test_sanity_mocker against k3s with mocker backend — passes
  • Run test_worker_kill_mocker (DeletePod scenario) — passes, report rendered
  • Run on a Nebius cluster against real trtllm/sglang/vllm runtimes (tracked separately; this PR is the phase-1 framework infra)

Notes

  • Stays draft until reviewers confirm Phase 1 scope

nnshah1 and others added 4 commits April 28, 2026 18:50
Introduce shared k8s test infrastructure to support a unified test
framework that can replace fault-tolerance, deployment-validation, and
load-based functional tests behind one event-based API. This is
phase 1 of DEP DGH-703 / GH#7767.

New modules under tests/utils/:
  - k8s_helpers.py: shared K8s client init with KUBECONFIG -> in-cluster
    -> default kubeconfig fallback (see the sketch after this list).
    Used by both ManagedDeployment and ManagedLoad so each test type
    talks to the cluster the same way.
  - pvc_extractor.py: unified PVC file extraction via a busybox download
    job (tar matching files, stream to local, extract, cleanup). Used
    for both service logs and aiperf result files.
  - managed_load.py: reusable ManagedLoad with a LoadConfig dataclass.
    Creates a K8s Job from a YAML template, builds the aiperf command,
    and manages start/wait/terminate/get_results.
  - templates/log_download_job.yaml + log_wrapper.sh: backing assets.
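
The kubeconfig fallback in k8s_helpers.py is roughly the following
(illustrative sketch; the real helper name and return type may differ):

```python
import os

from kubernetes import client, config


def init_k8s_client():
    # KUBECONFIG -> in-cluster -> default kubeconfig fallback (sketch).
    kubeconfig = os.environ.get("KUBECONFIG")
    if kubeconfig:
        config.load_kube_config(config_file=kubeconfig)
    else:
        try:
            config.load_incluster_config()
        except config.ConfigException:
            config.load_kube_config()  # falls back to ~/.kube/config
    return client.CoreV1Api()
```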

ServiceSpec additions (additive, no behavior change to existing APIs):
  - _ensure_path() helper to eliminate repetitive nested-dict guards.
  - component_type property reading spec["componentType"].
  - set_arg(arg_name, arg_value): set or update a launch arg.
    Quote-aware via shlex.split (matches DEP intent).
  - _add_volume_mount() / _add_env_var() idempotent helpers.
  - enable_log_collection(log_dir, pvc_name): wrap mainContainer command
    to tee output into a PVC-mounted directory.

DeploymentSpec additions:
  - from_backend(backend, deployment_type) classmethod loading
    examples/backends/<backend>/deploy/<type>.yaml.
  - backend property: detect from YAML path, fall back to service names.
    __init__ now stores _base_path so path-based detection works.
  - worker_services(): non-frontend service names.
  - set_worker_replicas(n): set replicas across every worker service.
  - enable_log_collection(): declares a deployment-level log PVC and
    enables wrapping for all (or named) services. ManagedDeployment
    materializes the PVC at deploy time.

Consistency fix in ServiceSpec._get_args(): scalar string args are now
normalized via shlex.split() instead of naive str.split(), matching
set_arg's normalization so quoted launch args don't fragment.
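
The difference matters for quoted launch args, for example:

```python
import shlex

arg = '--extra-engine-args "--foo bar"'   # hypothetical quoted launch arg
arg.split()       # ['--extra-engine-args', '"--foo', 'bar"']  (fragments the value)
shlex.split(arg)  # ['--extra-engine-args', '--foo bar']       (keeps it intact)
```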

Surgical edit to the legacy scenarios.py (deprecated header added):
  - Token-overflow scenario now uses
    deployment_spec[name].set_arg(arg, val) in place of the older
    deployment_spec.add_arg_to_service(name, arg, val). The legacy
    helper stays for back-compat callers.

ManagedDGDR (added by #7343 after this branch was last touched) is
preserved untouched. DGDR support in the unified framework is a
follow-up.

Refs: DGH-703, GH#7767

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: nnshah1 <[email protected]>
Introduce an event-based declarative API for k8s tests. A test author
writes only four things — deployment spec, events, loads, and checks —
and run_scenario() handles deploy / event execution / log extraction /
cleanup / validation. This is the unified surface that fault-tolerance,
deployment-validation, and load-based functional tests should converge
on; legacy fixture-driven scenarios remain in place for back-compat.
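
A minimal sketch of a test on this surface (event/check names come from
this commit; import paths, constructor arguments, and async-ness are
assumptions):

```python
from tests.fault_tolerance.deploy.checks import MaxErrors, MinRequests
from tests.fault_tolerance.deploy.events import DeletePod, StartLoad, StopLoad, Wait
from tests.fault_tolerance.deploy.scenario import run_scenario
from tests.utils.managed_deployment import DeploymentSpec  # path assumed


async def test_worker_kill():
    spec = DeploymentSpec.from_backend("vllm", "agg")  # deployment type assumed
    await run_scenario(
        deployment_spec=spec,
        events=[
            StartLoad(name="load"),
            Wait(seconds=30),
            DeletePod(service="decode"),   # inject the fault mid-load
            Wait(seconds=60),
            StopLoad(name="load"),
        ],
        checks=[
            MinRequests(100),              # the load actually ran
            MaxErrors(50),                 # bounded error budget
        ],
    )
```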

New modules under tests/fault_tolerance/deploy/:
  - events.py: Event ABC plus built-ins (StartLoad, StopLoad, Wait,
    DeletePod, RollingUpgrade, TerminateProcess, WaitForRecovery,
    WaitForLogPattern, RunCommand, WaitForLoadCompletion). Each event
    is a dataclass with execute() / stop() / description and is
    discoverable via __all__.
  - checks.py: Check ABC plus built-ins (ZeroErrors, MaxErrors,
    MinRequests, WasCancelled, ServiceLogContains,
    ServiceLogNotContains). Validation is explicit, not buried in
    fixture teardown.
  - reports.py: Report ABC for post-check artifact generation
    (HTML reports, metrics summaries, etc.).
  - scenario.py: run_scenario(deployment_spec, events, checks, ...)
    orchestrator and ScenarioContext. Self-contained narrative; no
    jumping between conftest, scenarios, and test files.
  - test_deployment_scenario.py: tests rewritten on the new API to
    exercise the framework end-to-end.
  - TESTING.md: agent-friendly reference doc cataloging events, checks,
    common patterns, and copy-paste examples.

Also adds tests/utils/resource_monitor.py — pulled in transitively by
the framework for periodic GPU/process snapshots.

conftest.py is updated additively, not replaced:
  - Legacy --include-custom-build flag, pytest_generate_tests, and
    pytest_collection_modifyitems are preserved so the existing
    test_fault_scenario parametrization keeps working.
  - --storage-class added for PVC log collection (RWX-capable class).
  - --restart-services added, with skip_service_restart now defaulting
    to True (the iteration-friendly path most users already hit via
    --skip-service-restart); the legacy flag is still honored when
    explicitly passed.
  - storage_class fixture exposes the new option (sketched below).
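
In sketch form (defaults and help text are assumptions):

```python
import pytest


def pytest_addoption(parser):
    parser.addoption("--storage-class", default=None,
                     help="RWX-capable StorageClass for the log-collection PVC")
    parser.addoption("--restart-services", action="store_true", default=False,
                     help="Opt back into service restarts between tests")


@pytest.fixture
def storage_class(request):
    return request.config.getoption("--storage-class")
```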

Refs: DGH-703, GH#7767

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: nnshah1 <[email protected]>
…GH-703 phase 1)

Mark 10 legacy files in tests/fault_tolerance/deploy/ with a uniform
deprecation header pointing readers to the event-based framework
introduced in the previous commit. No behavior change — these files
remain importable so existing test_fault_scenario parametrization keeps
running.

Files frozen:
  - base_checker.py
  - checker_factory.py
  - checkers.py
  - client.py
  - client_factory.py
  - legacy_client.py
  - legacy_parse_results.py
  - parse_factory.py
  - parse_results.py
  - test_deployment.py

Migration of these tests onto run_scenario() (followed by deletion of
the legacy code) is DEP DGH-703 phase 2 and tracked separately.

Refs: DGH-703, GH#7767

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: nnshah1 <[email protected]>
…erance report (DGH-703 phase 1.5)

Closes the gaps the manual port from the _2 branch left behind so the
unified framework runs end-to-end on a real cluster.

ManagedDeployment:
- Ports the missing surface from _2: _create_log_collection_pvc,
  _verify_pvc_binding, _extract_logs_from_pvc, _cleanup_log_collection_pvc,
  _cleanup_orphaned_jobs, get_log_pvc_name, apply_service_changes,
  _exec_in_pod, _get_pod_metrics, _get_pod_manifest, plus ServiceSpec
  setters (set_env_var, set_readiness_probe, set_termination_grace_period)
  and DeploymentSpec.frontend_service / get_in_cluster_frontend_url.
- Adds _wait_for_cr_deleted + _wait_for_pods_terminated, called from
  _delete_deployment so a new run never races leftover pods from a
  previous failed run (see the sketch after this list).
- Adds _scrub_namespace at __aenter__: deletes all DGDs, log-collection
  PVCs, and load/extract/verify jobs in the test namespace before
  starting. Test authors don't have to scrub between runs.
- Wires lifecycle helpers into the right points: PVC verify-binding
  after create, log extraction + orphan-jobs cleanup + PVC delete on
  __aexit__.
- apply_service_changes now publishes the FULL ServiceSpec instead of
  just envs, so callers can drive any rolling-upgrade-style mutation
  (image, replicas, args, envs, ...) through ServiceSpec setters and
  one publish call.
- Drops 0-caller delegators on DeploymentSpec (get_model,
  set_service_env_var, add_arg_to_service, get_service,
  get_service_env_vars, set_service_readiness_probe,
  set_service_termination_grace_period). Per-service mutations go
  through spec[name] directly.
- Drops redundant pre-delete _get_pod_logs call: the PVC log tee
  already captures stdout/stderr, kubectl logs at delete time is just
  a duplicate read.
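
The CR-deletion wait is roughly the following (sketch; the real method
lives on ManagedDeployment, and the CRD group/version/plural here are
assumptions):

```python
import asyncio
import time

from kubernetes import client
from kubernetes.client.rest import ApiException


async def wait_for_cr_deleted(name, namespace, timeout=300):
    # Poll the DynamoGraphDeployment CR until the API returns 404.
    api = client.CustomObjectsApi()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            api.get_namespaced_custom_object(
                "nvidia.com", "v1alpha1", namespace,
                "dynamographdeployments", name)
        except ApiException as e:
            if e.status == 404:
                return
            raise
        await asyncio.sleep(5)
    raise TimeoutError(f"CR {name} still present after {timeout}s")
```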

Event base + scenario runner:
- Splits public template (Event.timed_execute) from subclass hook
  (Event.execute), as sketched below. Subclasses keep writing the
  natural execute(ctx); the framework wraps it with started_at/ended_at
  timestamping that reports use to slice data by event boundary.
- Caches deployment.startup_seconds on ScenarioContext before
  clearing the deployment ref so reports can still see it.
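
The template/hook split is essentially (sketch; the started_at/ended_at
field names are from this commit, the rest is assumed):

```python
import time
from abc import ABC, abstractmethod


class Event(ABC):
    started_at: float | None = None
    ended_at: float | None = None

    async def timed_execute(self, ctx):
        # Public template: stamps the window that reports slice on.
        self.started_at = time.time()
        try:
            await self.execute(ctx)
        finally:
            self.ended_at = time.time()

    @abstractmethod
    async def execute(self, ctx):
        """Subclass hook -- events keep writing the natural execute(ctx)."""
```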

FaultToleranceReport:
- New concrete Report subclass. Reads aiperf's per-request JSONL
  (profile_export.jsonl), buckets records around each fault event's
  started_at, and renders the legacy markdown table:
    | Failure | Startup (s) | Success Before | Failed Before
    | Success After | Failed After | Latency Before (ms)
    | Latency After (ms) |
- Detects request failures correctly: was_cancelled OR missing
  request_latency (the latter is what aiperf TimeoutError records
  look like in the JSONL).
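
The bucketing and failure detection amount to roughly (sketch;
was_cancelled / request_latency are the JSONL fields named above, the
timestamp field name is assumed):

```python
import json


def bucket_requests(jsonl_path, fault_started_at):
    # Split aiperf per-request records into before/after the fault; a record
    # counts as failed if it was cancelled or never recorded a latency.
    before, after = {"ok": 0, "fail": 0}, {"ok": 0, "fail": 0}
    with open(jsonl_path) as f:
        for line in f:
            rec = json.loads(line)
            failed = rec.get("was_cancelled") or rec.get("request_latency") is None
            bucket = before if rec["timestamp"] < fault_started_at else after
            bucket["fail" if failed else "ok"] += 1
    return before, after
```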

Templates:
- Adds tests/utils/templates/load_job.yaml needed by managed_load.py
  but missing from the original port.

Verified by 3 sanity-test runs and 2 fault-test runs (DeletePod + report)
on a local k3s cluster with the bundled mocker backend.

Signed-off-by: nnshah1 <[email protected]>

…DGH-703 phase 1.5)

Before this change the framework spread per-test artifacts across two
roots (cwd-relative for ManagedLoad, /tmp/dynamo_tests for everything
else) AND two layouts within each root (Frontend/ for kubectl-side
manifests/logs, services/<service-lower>/ for PVC-tee'd stdout). Inside
a dev container with the worktree mounted at /workspace this made
artifacts effectively invisible from the host.

Consolidates:

- run_scenario resolves log_dir once via resolve_test_output_path so
  every component (ManagedDeployment, ManagedLoad, the
  FaultToleranceReport, the conftest test.log handler) writes to one
  absolute path.

- DYN_TEST_OUTPUT_PATH default for these k8s tests is set in the local
  conftest (not the shared resolver) to <cwd>/test_outputs — preserves
  the /tmp/dynamo_tests default for non-k8s tests on the host while
  giving us host-visible artifacts in a dev container (sketched after
  this list).

- _get_pod_manifest, _get_pod_metrics, get_pod_manifest_logs_metrics
  all write to the LOWERCASED service dir so manifest/metrics paths
  overlay the PVC-extracted stdout paths (PVC tee uses lowercase).

- _extract_logs_from_pvc lands files directly into <log_dir> instead
  of <log_dir>/services — the PVC structure (<service-lower>/<pod>.log)
  becomes the per-service subdir, no extra middle layer.

- get_pod_manifest_logs_metrics no longer writes <pod>.log or
  <pod>.previous.log via kubectl. The container's stdout is already
  captured continuously to the PVC by the tee wrapper, so kubectl logs
  is just a duplicate. Dropping it also kills the empty .previous.log
  files that confused readers (kubectl previous-log only has content
  when the SAME pod's container restarted in place; for our DeletePod
  scenarios the replacement pod has no previous incarnation).

- .gitignore picks up test_outputs/ so artifacts don't pollute the git
  tree.
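
The conftest-side default is essentially (sketch):

```python
import os
from pathlib import Path

# Local conftest default: host-visible artifacts under the worktree, without
# touching the shared resolver's /tmp/dynamo_tests default for non-k8s tests.
os.environ.setdefault("DYN_TEST_OUTPUT_PATH", str(Path.cwd() / "test_outputs"))
```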

Result: each test produces one directory, with one subdir per service,
holding every per-pod artifact together:

    test_outputs/<test>/
    ├── decode/
    │   ├── <pod>_<ts>.log                       (PVC stdout, full lifetime)
    │   ├── <pod>.yaml                           (manifest at end of life)
    │   ├── <pod>.before_delete.yaml             (manifest pre-fault, if killed)
    │   └── <pod>_<ts>.metrics.before_delete.log (metrics pre-fault, if killed)
    ├── frontend/
    │   ├── <pod>_<ts>.log
    │   └── <pod>.yaml
    ├── load/                                     aiperf artifacts
    ├── test.log.txt                              framework log
    └── fault_tolerance_report.md                 report table

Verified by re-running test_worker_kill_mocker without any env var
set: artifacts land in the consolidated tree, report renders with
Startup populated and the killed pod's pre/post window bucketed.

Signed-off-by: nnshah1 <[email protected]>
…tion w/ conntrack flush, richer reports (DGH-703)

What changed:

- WaitForModelReady event: port-forwards to the frontend pod and polls
  /v1/models so vllm/sglang model-load (which can run tens of seconds
  past pod-Ready) doesn't bleed into the load window. Auto-detects
  the model name from the spec, parsing the bash-wrapped command for
  --model when ServiceSpec.model returns None (the log-collection
  wrapper hides args inside command[]); the polling loop is sketched
  after this list.

- NetworkPartition: applies a NetworkPolicy targeting the per-service
  nvidia.com/dynamo-component label, then schedules a privileged
  hostNetwork pod on the source pod's node that runs `conntrack -D`
  for source<->target flows in both directions. NetworkPolicy alone
  is connection-tracked, so the pooled TCP request-plane socket
  trivially survives policy creation; the flush is what makes the
  partition observable as a fault. Mocker run goes from 0 errors →
  ~5k errors during the 20s window; vllm sees ~7k errors with the
  expected "Model not found" / "Connection refused" / "TCP timeout"
  mix in the error breakdown.

- LoadConfig.connection_reuse_strategy: maps to aiperf's
  --connection-reuse-strategy (pooled / never / sticky-user-sessions).
  Used by partition tests so each request opens a fresh aiperf<>
  frontend socket; not strictly required for the partition (the
  conntrack flush is the load-bearing step) but documents the intent
  and keeps aiperf-side connection state from masking other faults.

- LoadStopped / LoadCompleted checks: replace the ambiguous
  WasCancelled name. LoadStopped asserts the load was cut short by
  StopLoad (matches kill/partition scenarios); LoadCompleted asserts
  the load reached its configured request_count or duration without
  intervention (the sanity baseline).

- FaultToleranceReport rewrite: three narrow Markdown sections
  (Timing | Counts | Latency) with a Recovery (s) column derived
  from WaitForRecovery's elapsed bracket (or partition duration for
  transient partitions). JSON sidecar (fault_tolerance_report.json)
  with the same data so the vault aggregator can produce a multi-
  scenario combined comparison.

- ErrorBreakdownReport: per-load aiperf error-type breakdown so the
  dominant failure mode (404 Model not found vs. 500 Connection
  refused vs. TCP timeout) is visible at a glance.

- ManagedDeployment: spliced 19 helpers from the _2 branch
  (_create_log_collection_pvc, _wait_for_cr_deleted,
  _wait_for_pods_terminated, _scrub_namespace,
  _verify_namespace_scrubbed, get_log_pvc_name, apply_service_changes,
  _get_pod_manifest, _get_pod_metrics async, ...). PVC log collection
  is now always-on with per-pod tee wrappers; logs land in a single
  per-test directory tree alongside reports, manifests, and aiperf
  output. Cleanup is two-phase (drain log PVC before deleting CR,
  then scrub-verify post-condition) so a killed test leaves the
  namespace exactly as clean as it found it.

- README rewrite: leads with the event-based design (component
  architecture, data-flow, lifecycle, per-fault state diagrams for
  DeletePod / TerminateProcess / NetworkPartition+conntrack), the
  supported fault-event table, an example test, and an example
  combined comparison report including the aiperf server-metrics
  table (Inflight / Queued / Disconnected). Adds a "Current gaps"
  section enumerating what's not yet covered (partial-network
  faults, cross-namespace partitions, GPU faults, multi-replica,
  disagg topologies, SLA bounds) and a "Future direction" section
  pointing at Oviya's fault-injection-service work
  (oviya/fault-injection/dev and friends — API service, network
  injector DaemonSet, Chaos Mesh integration, GPU-fault tooling,
  Python client library) as the convergence path off the privileged
  conntrack-flush hack. Keeps the legacy parametrised scenario docs
  below as "Legacy scenario harness".

Signed-off-by: nnshah1 <[email protected]>
…t modes (DGH-703)

Eight per-mode tests (sanity / worker_kill / engine_kill /
network_partition × mocker / vllm) drive the event-based harness end
to end and emit the per-test FaultToleranceReport + ErrorBreakdownReport
that the vault aggregator combines into the cross-scenario comparison.
Useful both as a working smoke set for the framework and as the
fixtures the next round of work (Chaos Mesh, fault-injection-service)
will plug into.

Signed-off-by: nnshah1 <[email protected]>
…GH-703)

StallProcess pauses a named process via SIGSTOP, holds for an optional
duration, then resumes via SIGCONT. Models a "hung worker" — pod IP,
container, TCP sockets, and conntrack entries all stay alive, but the
worker stops servicing requests because it isn't scheduled. Exercises
frontend timeout / health-check behaviour rather than crash recovery,
which is what TerminateProcess (SIGKILL) covers.
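
In sketch form (illustrative; assumes kubectl exec into the worker pod
and pkill available in the container image):

```python
import subprocess
import time


def stall_process(pod, namespace, process_name, duration=None):
    # SIGSTOP the target in place: pod IP, container, sockets, and conntrack
    # entries all survive; the process simply stops being scheduled.
    kexec = ["kubectl", "exec", "-n", namespace, pod, "--"]
    subprocess.run(kexec + ["pkill", "-STOP", process_name], check=True)
    if duration is not None:
        time.sleep(duration)
        subprocess.run(kexec + ["pkill", "-CONT", process_name], check=True)
```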

- StallProcess event with optional duration: transient stalls heal
  inside execute(); duration=None holds until scenario stop().
- test_engine_stall_mocker / test_engine_stall_vllm: 20 s mid-load
  stall on the decode worker; FaultToleranceReport + ErrorBreakdownReport
  show the queued/timed-out behaviour vs. clean recovery on SIGCONT.
- README: fault-events table now lists StallProcess; new per-fault
  state diagram for the stall path so the distinction vs.
  TerminateProcess and NetworkPartition is explicit.

Signed-off-by: nnshah1 <[email protected]>
…ty test (DGH-703)

Two changes that close the loop on capturing worker-side prometheus
metrics (vllm:* counters that the frontend's /metrics never sees,
including the NIXL counters tracked on the P instance in disagg):

- LoadConfig.extra_server_metrics_urls (Optional[list[str]]): URLs
  passed to aiperf's --server-metrics in addition to the inference
  endpoint base URL. StartLoad auto-populates this from each non-
  frontend worker's pod IP + system_port/metrics when the user does
  not explicitly set it. Pod IPs are valid for the duration of a
  pod's lifetime — tests that destroy and recreate a worker
  (DeletePod) will see the worker scrape stop on recovery; the
  framework's per-pod _get_pod_metrics snapshot at fault-injection
  time still captures that point (URL construction sketched after
  this list).

- test_sanity_vllm_disagg.py: from_backend("vllm", "disagg") with
  Qwen3-0.6B, 1 prefill + 1 decode. The prefill loads the
  NixlConnector so vllm:nixl_num_kv_expired_reqs and
  vllm:nixl_num_failed_notifications register, and the StartLoad
  auto-scrape collects them. Documents the >= 2 GPU node
  requirement (single-GPU clusters can't schedule both workers).
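
The auto-populated URLs reduce to roughly (sketch; pod/port attribute
access and the variable names are illustrative):

```python
# One /metrics URL per non-frontend worker pod, passed to aiperf via
# --server-metrics alongside the inference endpoint base URL.
extra_server_metrics_urls = [
    f"http://{pod.status.pod_ip}:{system_port}/metrics"
    for pod in worker_pods  # worker_pods / system_port are illustrative names
]
```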

Signed-off-by: nnshah1 <[email protected]>
…sagg sanity (DGH-703)

k8s mirror of examples/backends/vllm/launch/disagg_same_gpu.sh:
prefill + decode co-resident on one GPU via tiny
--gpu-memory-utilization (0.01), capped --kv-cache-memory-bytes
(~976 MiB), --enforce-eager, and --max-model-len 4096. NixlConnector
with kv_role=kv_both handles the prefill->decode KV transfer so
vllm:nixl_* counters register on the prefill (P) instance, where
they're scraped by the system_port pass-through that StartLoad's
auto-discovery hands to aiperf.

DYN_SYSTEM_PORT distinct per worker (8081 / 8082) so the two
prometheus servers don't collide. VLLM_NIXL_SIDE_CHANNEL_PORT bumped
on prefill (20097) to avoid the default-20096 conflict with decode.

test_sanity_vllm_disagg now loads templates/vllm/disagg_same_gpu.yaml
directly instead of from_backend("vllm", "disagg"), so it can run on
single-GPU clusters with nvidia-device-plugin time-slicing (still
needs both nvidia.com/gpu:1 slots, but they're satisfied by a
sliced GPU).

Signed-off-by: nnshah1 <[email protected]>
… port overrides on disagg yaml (DGH-703)

Two fixes that together let the disagg sanity test come up on a
single-GPU cluster with nvidia-device-plugin time-slicing
(replicas=2). Validated end-to-end: aiperf scrapes Frontend +
decode + prefill /metrics, the captured server_metrics_export.json
contains the full vllm:nixl_* counter set including
vllm:nixl_num_kv_expired_reqs.

- ServiceSpec.enable_log_collection: shlex.join the original
  command + args before pasting into the bash -c heredoc. The naive
  " ".join broke any arg with shell metacharacters — notably the
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
  JSON literal, where bash interpreted the braces as brace-expansion
  ("kv_connector:NixlConnector kv_role:kv_both") and vllm rejected
  the resulting non-JSON (see the example after this list).

- disagg_same_gpu.yaml: removed the per-worker DYN_SYSTEM_PORT
  (8081/8082) and VLLM_NIXL_SIDE_CHANNEL_PORT (20097) overrides.
  Those were carried over from the standalone disagg_same_gpu.sh
  where both workers share a host network, but each k8s worker has
  its own pod IP — so the default ports (9090, 20096) don't collide.
  The overrides were causing the operator's startup probe to keep
  hitting 9090 while the worker exposed 8081/8082, looping into
  CrashLoopBackOff after the failureThreshold.
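
Concretely, with a hypothetical command list:

```python
import shlex

cmd = ["vllm", "serve", "--kv-transfer-config",
       '{"kv_connector":"NixlConnector","kv_role":"kv_both"}']

" ".join(cmd)    # braces hit bash brace-expansion inside the bash -c wrapper
shlex.join(cmd)  # quotes the JSON literal so it survives the heredoc intact
```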

Signed-off-by: nnshah1 <[email protected]>