
DGH-703: Unified k8s test framework — phase 1 + 1.5 #9066

Draft
nnshah1 wants to merge 11 commits into main from
neelays/dgh-703-dep-unified-k8s-test-framework-for-dynamo

Conversation


nnshah1 (Contributor) commented May 3, 2026

Summary

Implements DGH-703 (DEP: Unified K8s Test Framework for Dynamo) as four commits: phase 1 (the framework plus a freeze of the legacy files) and phase 1.5 (full surface port, reliable lifecycle, and the fault-tolerance report).

Tracks: GH#7767, Linear DGH-703.

What's in this PR

Phase 1 (3 existing commits)

  • refactor(tests): shared k8s infrastructure (k8s_helpers, pvc_extractor, managed_load, additive ManagedDeployment APIs, shlex consistency)
  • feat(tests): event-based declarative framework (events, checks, scenario, test_deployment_scenario, TESTING.md, resource monitor, conftest UNION)
  • chore(tests): deprecation headers freezing 10 legacy FT files

Phase 1.5 (new commit c968d1c0b7)

  • Ports the missing surface from _2 so the framework runs end-to-end
  • Reliable cleanup: _wait_for_cr_deleted + _wait_for_pods_terminated + _scrub_namespace + wired-in lifecycle helpers
  • Event.timed_execute template stamps started_at/ended_at so reports can bucket time-stamped data
  • New FaultToleranceReport renders the legacy-style markdown table:
    | Failure | Startup (s) | Success Before | Failed Before | Success After | Failed After | Latency Before (ms) | Latency After (ms) |
  • Generalizes apply_service_changes to publish the full ServiceSpec (image / replicas / args / envs / …) instead of just envs
  • Drops zero-caller per-service delegators on DeploymentSpec; mutations go through spec[name] directly (see the sketch below)
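
Illustrative sketch of the spec[name] + apply_service_changes flow (the setter names come from this PR; exact signatures, argument values, and async-ness are assumptions):

```python
# Hypothetical usage sketch -- spec[name], the ServiceSpec setters, and
# apply_service_changes come from this PR; argument shapes are assumed.
spec = deployment_spec["decode"]            # per-service ServiceSpec
spec.set_env_var("EXAMPLE_ENV", "debug")
spec.set_arg("--max-model-len", "4096")     # quote-aware via shlex
spec.set_termination_grace_period(30)

# Publishes the full ServiceSpec (image / replicas / args / envs / ...),
# so one call drives any rolling-upgrade-style mutation.
await deployment.apply_service_changes("decode")
```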

Verified locally

  • 3× sanity-test passes (mocker backend, k3s)
  • 2× fault-test passes (DeletePod + FaultToleranceReport renders correctly with non-zero pre/post buckets)

Out of scope (filed as follow-ups)

  • Recovery time column (needs mgr.collect_timeline() reading pod-status conditions)
  • Frontend behavior: the frontend buffers requests indefinitely when no worker is ready, so aiperf only sees its own task-cancellation timeouts. Filing a separate frontend issue — not a framework problem.
  • Switch RWX shared PVC → per-pod RWO PVCs OR fluent-bit sidecar (eliminates the RWX storage requirement)

Test plan

  • pytest --collect-only tests/fault_tolerance/deploy/test_deployment_scenario.py — confirms imports + parametrization (20 tests collected)
  • Run test_sanity_mocker against k3s with mocker backend — passes
  • Run test_worker_kill_mocker (DeletePod scenario) — passes, report rendered
  • Run on a Nebius cluster against real trtllm/sglang/vllm runtimes (tracked separately; this PR is the phase-1 framework infra)

Notes

  • Stays draft until reviewers confirm Phase 1 scope

nnshah1 and others added 4 commits April 28, 2026 18:50
Introduce shared k8s test infrastructure to support a unified test
framework that can replace fault-tolerance, deployment-validation, and
load-based functional tests behind one event-based API. This is
phase 1 of DEP DGH-703 / GH#7767.

New modules under tests/utils/:
  - k8s_helpers.py: shared K8s client init with KUBECONFIG -> in-cluster
    -> default kubeconfig fallback (see the sketch after this list).
    Used by both ManagedDeployment and ManagedLoad so each test type
    talks to the cluster the same way.
  - pvc_extractor.py: unified PVC file extraction via a busybox download
    job (tar matching files, stream to local, extract, cleanup). Used
    for both service logs and aiperf result files.
  - managed_load.py: reusable ManagedLoad with a LoadConfig dataclass.
    Creates a K8s Job from a YAML template, builds the aiperf command,
    and manages start/wait/terminate/get_results.
  - templates/log_download_job.yaml + log_wrapper.sh: backing assets.
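
The kubeconfig fallback in k8s_helpers.py is roughly the following
(illustrative sketch; the real helper name and return type may differ):

```python
import os

from kubernetes import client, config


def init_k8s_client():
    # KUBECONFIG -> in-cluster -> default kubeconfig fallback (sketch).
    kubeconfig = os.environ.get("KUBECONFIG")
    if kubeconfig:
        config.load_kube_config(config_file=kubeconfig)
    else:
        try:
            config.load_incluster_config()
        except config.ConfigException:
            config.load_kube_config()  # falls back to ~/.kube/config
    return client.CoreV1Api()
```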

ServiceSpec additions (additive, no behavior change to existing APIs):
  - _ensure_path() helper to eliminate repetitive nested-dict guards.
  - component_type property reading spec["componentType"].
  - set_arg(arg_name, arg_value): set or update a launch arg.
    Quote-aware via shlex.split (matches DEP intent).
  - _add_volume_mount() / _add_env_var() idempotent helpers.
  - enable_log_collection(log_dir, pvc_name): wrap mainContainer command
    to tee output into a PVC-mounted directory.

DeploymentSpec additions:
  - from_backend(backend, deployment_type) classmethod loading
    examples/backends/<backend>/deploy/<type>.yaml.
  - backend property: detect from YAML path, fall back to service names.
    __init__ now stores _base_path so path-based detection works.
  - worker_services(): non-frontend service names.
  - set_worker_replicas(n): set replicas across every worker service.
  - enable_log_collection(): declares a deployment-level log PVC and
    enables wrapping for all (or named) services. ManagedDeployment
    materializes the PVC at deploy time.

Consistency fix in ServiceSpec._get_args(): scalar string args are now
normalized via shlex.split() instead of naive str.split(), matching
set_arg's normalization so quoted launch args don't fragment.
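
The difference matters for quoted launch args, for example:

```python
import shlex

arg = '--extra-engine-args "--foo bar"'   # hypothetical quoted launch arg
arg.split()       # ['--extra-engine-args', '"--foo', 'bar"']  (fragments the value)
shlex.split(arg)  # ['--extra-engine-args', '--foo bar']       (keeps it intact)
```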

Surgical edit to the legacy scenarios.py (deprecated header added):
  - Token-overflow scenario now uses
    deployment_spec[name].set_arg(arg, val) in place of the older
    deployment_spec.add_arg_to_service(name, arg, val). The legacy
    helper stays for back-compat callers.

ManagedDGDR (added by #7343 after this branch was last touched) is
preserved untouched. DGDR support in the unified framework is a
follow-up.

Refs: DGH-703, GH#7767

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: nnshah1 <[email protected]>
Introduce an event-based declarative API for k8s tests. A test author
writes only four things — deployment spec, events, loads, and checks —
and run_scenario() handles deploy / event execution / log extraction /
cleanup / validation. This is the unified surface that fault-tolerance,
deployment-validation, and load-based functional tests should converge
on; legacy fixture-driven scenarios remain in place for back-compat.
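
A minimal sketch of a test on this surface (event/check names come from
this commit; import paths, constructor arguments, and async-ness are
assumptions):

```python
from tests.fault_tolerance.deploy.checks import MaxErrors, MinRequests
from tests.fault_tolerance.deploy.events import DeletePod, StartLoad, StopLoad, Wait
from tests.fault_tolerance.deploy.scenario import run_scenario
from tests.utils.managed_deployment import DeploymentSpec  # path assumed


async def test_worker_kill():
    spec = DeploymentSpec.from_backend("vllm", "agg")  # deployment type assumed
    await run_scenario(
        deployment_spec=spec,
        events=[
            StartLoad(name="load"),
            Wait(seconds=30),
            DeletePod(service="decode"),   # inject the fault mid-load
            Wait(seconds=60),
            StopLoad(name="load"),
        ],
        checks=[
            MinRequests(100),              # the load actually ran
            MaxErrors(50),                 # bounded error budget
        ],
    )
```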

New modules under tests/fault_tolerance/deploy/:
  - events.py: Event ABC plus built-ins (StartLoad, StopLoad, Wait,
    DeletePod, RollingUpgrade, TerminateProcess, WaitForRecovery,
    WaitForLogPattern, RunCommand, WaitForLoadCompletion). Each event
    is a dataclass with execute() / stop() / description and is
    discoverable via __all__.
  - checks.py: Check ABC plus built-ins (ZeroErrors, MaxErrors,
    MinRequests, WasCancelled, ServiceLogContains,
    ServiceLogNotContains). Validation is explicit, not buried in
    fixture teardown.
  - reports.py: Report ABC for post-check artifact generation
    (HTML reports, metrics summaries, etc.).
  - scenario.py: run_scenario(deployment_spec, events, checks, ...)
    orchestrator and ScenarioContext. Self-contained narrative; no
    jumping between conftest, scenarios, and test files.
  - test_deployment_scenario.py: tests rewritten on the new API to
    exercise the framework end-to-end.
  - TESTING.md: agent-friendly reference doc cataloging events, checks,
    common patterns, and copy-paste examples.

Also adds tests/utils/resource_monitor.py — pulled in transitively by
the framework for periodic GPU/process snapshots.

conftest.py is updated additively, not replaced:
  - Legacy --include-custom-build flag, pytest_generate_tests, and
    pytest_collection_modifyitems are preserved so the existing
    test_fault_scenario parametrization keeps working.
  - --storage-class added for PVC log collection (RWX-capable class).
  - --restart-services added, with skip_service_restart now defaulting
    to True (the iteration-friendly path most users already hit via
    --skip-service-restart); the legacy flag is still honored when
    explicitly passed.
  - storage_class fixture exposes the new option (sketched below).
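
In sketch form (defaults and help text are assumptions):

```python
import pytest


def pytest_addoption(parser):
    parser.addoption("--storage-class", default=None,
                     help="RWX-capable StorageClass for the log-collection PVC")
    parser.addoption("--restart-services", action="store_true", default=False,
                     help="Opt back into service restarts between tests")


@pytest.fixture
def storage_class(request):
    return request.config.getoption("--storage-class")
```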

Refs: DGH-703, GH#7767

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: nnshah1 <[email protected]>
…GH-703 phase 1)

Mark 10 legacy files in tests/fault_tolerance/deploy/ with a uniform
deprecation header pointing readers to the event-based framework
introduced in the previous commit. No behavior change — these files
remain importable so existing test_fault_scenario parametrization keeps
running.

Files frozen:
  - base_checker.py
  - checker_factory.py
  - checkers.py
  - client.py
  - client_factory.py
  - legacy_client.py
  - legacy_parse_results.py
  - parse_factory.py
  - parse_results.py
  - test_deployment.py

Migration of these tests onto run_scenario() (followed by deletion of
the legacy code) is DEP DGH-703 phase 2 and tracked separately.

Refs: DGH-703, GH#7767

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: nnshah1 <[email protected]>
…erance report (DGH-703 phase 1.5)

Closes the gaps the manual port from the _2 branch left behind so the
unified framework runs end-to-end on a real cluster.

ManagedDeployment:
- Ports the missing surface from _2: _create_log_collection_pvc,
  _verify_pvc_binding, _extract_logs_from_pvc, _cleanup_log_collection_pvc,
  _cleanup_orphaned_jobs, get_log_pvc_name, apply_service_changes,
  _exec_in_pod, _get_pod_metrics, _get_pod_manifest, plus ServiceSpec
  setters (set_env_var, set_readiness_probe, set_termination_grace_period)
  and DeploymentSpec.frontend_service / get_in_cluster_frontend_url.
- Adds _wait_for_cr_deleted + _wait_for_pods_terminated, called from
  _delete_deployment so a new run never races leftover pods from a
  previous failed run (see the sketch after this list).
- Adds _scrub_namespace at __aenter__: deletes all DGDs, log-collection
  PVCs, and load/extract/verify jobs in the test namespace before
  starting. Test authors don't have to scrub between runs.
- Wires lifecycle helpers into the right points: PVC verify-binding
  after create, log extraction + orphan-jobs cleanup + PVC delete on
  __aexit__.
- apply_service_changes now publishes the FULL ServiceSpec instead of
  just envs, so callers can drive any rolling-upgrade-style mutation
  (image, replicas, args, envs, ...) through ServiceSpec setters and
  one publish call.
- Drops 0-caller delegators on DeploymentSpec (get_model,
  set_service_env_var, add_arg_to_service, get_service,
  get_service_env_vars, set_service_readiness_probe,
  set_service_termination_grace_period). Per-service mutations go
  through spec[name] directly.
- Drops redundant pre-delete _get_pod_logs call: the PVC log tee
  already captures stdout/stderr, kubectl logs at delete time is just
  a duplicate read.
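
The CR-deletion wait is roughly the following (sketch; the real method
lives on ManagedDeployment, and the CRD group/version/plural here are
assumptions):

```python
import asyncio
import time

from kubernetes import client
from kubernetes.client.rest import ApiException


async def wait_for_cr_deleted(name, namespace, timeout=300):
    # Poll the DynamoGraphDeployment CR until the API returns 404.
    api = client.CustomObjectsApi()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            api.get_namespaced_custom_object(
                "nvidia.com", "v1alpha1", namespace,
                "dynamographdeployments", name)
        except ApiException as e:
            if e.status == 404:
                return
            raise
        await asyncio.sleep(5)
    raise TimeoutError(f"CR {name} still present after {timeout}s")
```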

Event base + scenario runner:
- Splits public template (Event.timed_execute) from subclass hook
  (Event.execute), as sketched below. Subclasses keep writing the
  natural execute(ctx); the framework wraps it with started_at/ended_at
  timestamping that reports use to slice data by event boundary.
- Caches deployment.startup_seconds on ScenarioContext before
  clearing the deployment ref so reports can still see it.
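
The template/hook split is essentially (sketch; the started_at/ended_at
field names are from this commit, the rest is assumed):

```python
import time
from abc import ABC, abstractmethod


class Event(ABC):
    started_at: float | None = None
    ended_at: float | None = None

    async def timed_execute(self, ctx):
        # Public template: stamps the window that reports slice on.
        self.started_at = time.time()
        try:
            await self.execute(ctx)
        finally:
            self.ended_at = time.time()

    @abstractmethod
    async def execute(self, ctx):
        """Subclass hook -- events keep writing the natural execute(ctx)."""
```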

FaultToleranceReport:
- New concrete Report subclass. Reads aiperf's per-request JSONL
  (profile_export.jsonl), buckets records around each fault event's
  started_at, and renders the legacy markdown table:
    | Failure | Startup (s) | Success Before | Failed Before
    | Success After | Failed After | Latency Before (ms)
    | Latency After (ms) |
- Detects request failures correctly: was_cancelled OR missing
  request_latency (the latter is what aiperf TimeoutError records
  look like in the JSONL).
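
The bucketing and failure detection amount to roughly (sketch;
was_cancelled / request_latency are the JSONL fields named above, the
timestamp field name is assumed):

```python
import json


def bucket_requests(jsonl_path, fault_started_at):
    # Split aiperf per-request records into before/after the fault; a record
    # counts as failed if it was cancelled or never recorded a latency.
    before, after = {"ok": 0, "fail": 0}, {"ok": 0, "fail": 0}
    with open(jsonl_path) as f:
        for line in f:
            rec = json.loads(line)
            failed = rec.get("was_cancelled") or rec.get("request_latency") is None
            bucket = before if rec["timestamp"] < fault_started_at else after
            bucket["fail" if failed else "ok"] += 1
    return before, after
```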

Templates:
- Adds tests/utils/templates/load_job.yaml needed by managed_load.py
  but missing from the original port.

Verified by 3 sanity-test runs and 2 fault-test runs (DeletePod + report)
on a local k3s cluster with the bundled mocker backend.

Signed-off-by: nnshah1 <[email protected]>

…DGH-703 phase 1.5)

Before this change the framework spread per-test artifacts across two
roots (cwd-relative for ManagedLoad, /tmp/dynamo_tests for everything
else) AND two layouts within each root (Frontend/ for kubectl-side
manifests/logs, services/<service-lower>/ for PVC-tee'd stdout). Inside
a dev container with the worktree mounted at /workspace this made
artifacts effectively invisible from the host.

Consolidates:

- run_scenario resolves log_dir once via resolve_test_output_path so
  every component (ManagedDeployment, ManagedLoad, the
  FaultToleranceReport, the conftest test.log handler) writes to one
  absolute path.

- DYN_TEST_OUTPUT_PATH default for these k8s tests is set in the local
  conftest (not the shared resolver) to <cwd>/test_outputs — preserves
  the /tmp/dynamo_tests default for non-k8s tests on the host while
  giving us host-visible artifacts in a dev container (sketched after
  this list).

- _get_pod_manifest, _get_pod_metrics, get_pod_manifest_logs_metrics
  all write to the LOWERCASED service dir so manifest/metrics paths
  overlay the PVC-extracted stdout paths (PVC tee uses lowercase).

- _extract_logs_from_pvc lands files directly into <log_dir> instead
  of <log_dir>/services — the PVC structure (<service-lower>/<pod>.log)
  becomes the per-service subdir, no extra middle layer.

- get_pod_manifest_logs_metrics no longer writes <pod>.log or
  <pod>.previous.log via kubectl. The container's stdout is already
  captured continuously to the PVC by the tee wrapper, so kubectl logs
  is just a duplicate. Dropping it also kills the empty .previous.log
  files that confused readers (kubectl previous-log only has content
  when the SAME pod's container restarted in place; for our DeletePod
  scenarios the replacement pod has no previous incarnation).

- .gitignore picks up test_outputs/ so artifacts don't pollute the git
  tree.
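
The conftest-side default is essentially (sketch):

```python
import os
from pathlib import Path

# Local conftest default: host-visible artifacts under the worktree, without
# touching the shared resolver's /tmp/dynamo_tests default for non-k8s tests.
os.environ.setdefault("DYN_TEST_OUTPUT_PATH", str(Path.cwd() / "test_outputs"))
```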

Result: each test produces one directory, with one subdir per service,
holding every per-pod artifact together:

    test_outputs/<test>/
    ├── decode/
    │   ├── <pod>_<ts>.log                       (PVC stdout, full lifetime)
    │   ├── <pod>.yaml                           (manifest at end of life)
    │   ├── <pod>.before_delete.yaml             (manifest pre-fault, if killed)
    │   └── <pod>_<ts>.metrics.before_delete.log (metrics pre-fault, if killed)
    ├── frontend/
    │   ├── <pod>_<ts>.log
    │   └── <pod>.yaml
    ├── load/                                     aiperf artifacts
    ├── test.log.txt                              framework log
    └── fault_tolerance_report.md                 report table

Verified by re-running test_worker_kill_mocker without any env var
set: artifacts land in the consolidated tree, report renders with
Startup populated and the killed pod's pre/post window bucketed.

Signed-off-by: nnshah1 <[email protected]>
…tion w/ conntrack flush, richer reports (DGH-703)

What changed:

- WaitForModelReady event: port-forwards to the frontend pod and polls
  /v1/models so vllm/sglang model-load (which can run tens of seconds
  past pod-Ready) doesn't bleed into the load window. Auto-detects
  the model name from the spec, parsing the bash-wrapped command for
  --model when ServiceSpec.model returns None (the log-collection
  wrapper hides args inside command[]); the polling loop is sketched
  after this list.

- NetworkPartition: applies a NetworkPolicy targeting the per-service
  nvidia.com/dynamo-component label, then schedules a privileged
  hostNetwork pod on the source pod's node that runs `conntrack -D`
  for source<->target flows in both directions. NetworkPolicy alone
  is connection-tracked, so the pooled TCP request-plane socket
  trivially survives policy creation; the flush is what makes the
  partition observable as a fault. Mocker run goes from 0 errors →
  ~5k errors during the 20s window; vllm sees ~7k errors with the
  expected "Model not found" / "Connection refused" / "TCP timeout"
  mix in the error breakdown.

- LoadConfig.connection_reuse_strategy: maps to aiperf's
  --connection-reuse-strategy (pooled / never / sticky-user-sessions).
  Used by partition tests so each request opens a fresh aiperf<>
  frontend socket; not strictly required for the partition (the
  conntrack flush is the load-bearing step) but documents the intent
  and keeps aiperf-side connection state from masking other faults.

- LoadStopped / LoadCompleted checks: replace the ambiguous
  WasCancelled name. LoadStopped asserts the load was cut short by
  StopLoad (matches kill/partition scenarios); LoadCompleted asserts
  the load reached its configured request_count or duration without
  intervention (the sanity baseline).

- FaultToleranceReport rewrite: three narrow Markdown sections
  (Timing | Counts | Latency) with a Recovery (s) column derived
  from WaitForRecovery's elapsed bracket (or partition duration for
  transient partitions). JSON sidecar (fault_tolerance_report.json)
  with the same data so the vault aggregator can produce a multi-
  scenario combined comparison.

- ErrorBreakdownReport: per-load aiperf error-type breakdown so the
  dominant failure mode (404 Model not found vs. 500 Connection
  refused vs. TCP timeout) is visible at a glance.

- ManagedDeployment: spliced 19 helpers from the _2 branch
  (_create_log_collection_pvc, _wait_for_cr_deleted,
  _wait_for_pods_terminated, _scrub_namespace,
  _verify_namespace_scrubbed, get_log_pvc_name, apply_service_changes,
  _get_pod_manifest, _get_pod_metrics async, ...). PVC log collection
  is now always-on with per-pod tee wrappers; logs land in a single
  per-test directory tree alongside reports, manifests, and aiperf
  output. Cleanup is two-phase (drain log PVC before deleting CR,
  then scrub-verify post-condition) so a killed test leaves the
  namespace exactly as clean as it found it.

- README rewrite: leads with the event-based design (component
  architecture, data-flow, lifecycle, per-fault state diagrams for
  DeletePod / TerminateProcess / NetworkPartition+conntrack), the
  supported fault-event table, an example test, and an example
  combined comparison report including the aiperf server-metrics
  table (Inflight / Queued / Disconnected). Adds a "Current gaps"
  section enumerating what's not yet covered (partial-network
  faults, cross-namespace partitions, GPU faults, multi-replica,
  disagg topologies, SLA bounds) and a "Future direction" section
  pointing at Oviya's fault-injection-service work
  (oviya/fault-injection/dev and friends — API service, network
  injector DaemonSet, Chaos Mesh integration, GPU-fault tooling,
  Python client library) as the convergence path off the privileged
  conntrack-flush hack. Keeps the legacy parametrised scenario docs
  below as "Legacy scenario harness".

Signed-off-by: nnshah1 <[email protected]>
…t modes (DGH-703)

Eight per-mode tests (sanity / worker_kill / engine_kill /
network_partition × mocker / vllm) drive the event-based harness end
to end and emit the per-test FaultToleranceReport + ErrorBreakdownReport
that the vault aggregator combines into the cross-scenario comparison.
Useful both as a working smoke set for the framework and as the
fixtures the next round of work (Chaos Mesh, fault-injection-service)
will plug into.

Signed-off-by: nnshah1 <[email protected]>
…GH-703)

StallProcess pauses a named process via SIGSTOP, holds for an optional
duration, then resumes via SIGCONT. Models a "hung worker" — pod IP,
container, TCP sockets, and conntrack entries all stay alive, but the
worker stops servicing requests because it isn't scheduled. Exercises
frontend timeout / health-check behaviour rather than crash recovery,
which is what TerminateProcess (SIGKILL) covers.
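
In sketch form (illustrative; assumes kubectl exec into the worker pod
and pkill available in the container image):

```python
import subprocess
import time


def stall_process(pod, namespace, process_name, duration=None):
    # SIGSTOP the target in place: pod IP, container, sockets, and conntrack
    # entries all survive; the process simply stops being scheduled.
    kexec = ["kubectl", "exec", "-n", namespace, pod, "--"]
    subprocess.run(kexec + ["pkill", "-STOP", process_name], check=True)
    if duration is not None:
        time.sleep(duration)
        subprocess.run(kexec + ["pkill", "-CONT", process_name], check=True)
```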

- StallProcess event with optional duration: transient stalls heal
  inside execute(); duration=None holds until scenario stop().
- test_engine_stall_mocker / test_engine_stall_vllm: 20 s mid-load
  stall on the decode worker; FaultToleranceReport + ErrorBreakdownReport
  show the queued/timed-out behaviour vs. clean recovery on SIGCONT.
- README: fault-events table now lists StallProcess; new per-fault
  state diagram for the stall path so the distinction vs.
  TerminateProcess and NetworkPartition is explicit.

Signed-off-by: nnshah1 <[email protected]>
…ty test (DGH-703)

Two changes that close the loop on capturing worker-side prometheus
metrics (vllm:* counters that the frontend's /metrics never sees,
including the NIXL counters tracked on the P instance in disagg):

- LoadConfig.extra_server_metrics_urls (Optional[list[str]]): URLs
  passed to aiperf's --server-metrics in addition to the inference
  endpoint base URL. StartLoad auto-populates this from each non-
  frontend worker's pod IP + system_port/metrics when the user does
  not explicitly set it. Pod IPs are valid for the duration of a
  pod's lifetime — tests that destroy and recreate a worker
  (DeletePod) will see the worker scrape stop on recovery; the
  framework's per-pod _get_pod_metrics snapshot at fault-injection
  time still captures that point (URL construction sketched after
  this list).

- test_sanity_vllm_disagg.py: from_backend("vllm", "disagg") with
  Qwen3-0.6B, 1 prefill + 1 decode. The prefill loads the
  NixlConnector so vllm:nixl_num_kv_expired_reqs and
  vllm:nixl_num_failed_notifications register, and the StartLoad
  auto-scrape collects them. Documents the >= 2 GPU node
  requirement (single-GPU clusters can't schedule both workers).
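
The auto-populated URLs reduce to roughly (sketch; pod/port attribute
access and the variable names are illustrative):

```python
# One /metrics URL per non-frontend worker pod, passed to aiperf via
# --server-metrics alongside the inference endpoint base URL.
extra_server_metrics_urls = [
    f"http://{pod.status.pod_ip}:{system_port}/metrics"
    for pod in worker_pods  # worker_pods / system_port are illustrative names
]
```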

Signed-off-by: nnshah1 <[email protected]>
…sagg sanity (DGH-703)

k8s mirror of examples/backends/vllm/launch/disagg_same_gpu.sh:
prefill + decode co-resident on one GPU via tiny
--gpu-memory-utilization (0.01), capped --kv-cache-memory-bytes
(~976 MiB), --enforce-eager, and --max-model-len 4096. NixlConnector
with kv_role=kv_both handles the prefill->decode KV transfer so
vllm:nixl_* counters register on the prefill (P) instance, where
they're scraped by the system_port pass-through that StartLoad's
auto-discovery hands to aiperf.

DYN_SYSTEM_PORT distinct per worker (8081 / 8082) so the two
prometheus servers don't collide. VLLM_NIXL_SIDE_CHANNEL_PORT bumped
on prefill (20097) to avoid the default-20096 conflict with decode.

test_sanity_vllm_disagg now loads templates/vllm/disagg_same_gpu.yaml
directly instead of from_backend("vllm", "disagg"), so it can run on
single-GPU clusters with nvidia-device-plugin time-slicing (still
needs both nvidia.com/gpu:1 slots, but they're satisfied by a
sliced GPU).

Signed-off-by: nnshah1 <[email protected]>
… port overrides on disagg yaml (DGH-703)

Two fixes that together let the disagg sanity test come up on a
single-GPU cluster with nvidia-device-plugin time-slicing
(replicas=2). Validated end-to-end: aiperf scrapes Frontend +
decode + prefill /metrics, the captured server_metrics_export.json
contains the full vllm:nixl_* counter set including
vllm:nixl_num_kv_expired_reqs.

- ServiceSpec.enable_log_collection: shlex.join the original
  command + args before pasting into the bash -c heredoc. The naive
  " ".join broke any arg with shell metacharacters — notably the
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
  JSON literal, where bash interpreted the braces as brace-expansion
  ("kv_connector:NixlConnector kv_role:kv_both") and vllm rejected
  the resulting non-JSON (see the example after this list).

- disagg_same_gpu.yaml: removed the per-worker DYN_SYSTEM_PORT
  (8081/8082) and VLLM_NIXL_SIDE_CHANNEL_PORT (20097) overrides.
  Those were carried over from the standalone disagg_same_gpu.sh
  where both workers share a host network, but each k8s worker has
  its own pod IP — so the default ports (9090, 20096) don't collide.
  The overrides were causing the operator's startup probe to keep
  hitting 9090 while the worker exposed 8081/8082, looping into
  CrashLoopBackOff after the failureThreshold.
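
Concretely, with a hypothetical command list:

```python
import shlex

cmd = ["vllm", "serve", "--kv-transfer-config",
       '{"kv_connector":"NixlConnector","kv_role":"kv_both"}']

" ".join(cmd)    # braces hit bash brace-expansion inside the bash -c wrapper
shlex.join(cmd)  # quotes the JSON literal so it survives the heredoc intact
```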

Signed-off-by: nnshah1 <[email protected]>