feat(saturation): bisect + drain search, plateau detection, transition flags #145

Merged: brayniac merged 1 commit into iopsystems:main from brayniac:fix/saturation-probe-retreat on Jun 12, 2026
@brayniac

Summary

Rewrites the concurrency saturation search (#17, the last of the review groups). The old search climbed multiplicatively only, judged each rung on a single window, stopped after N consecutive failures, and gated throughput against a linear extrapolation from a moving baseline. As a result it could undershoot the true ceiling by up to (step_multiplier − 1) times the reported concurrency, was noise-sensitive, and mismodeled plateaus.

New design — a pure, unit-tested SearchPlanner state machine separated from a thin async driver:

  • Climb → drain → re-probe → bisect → confirm. Climb multiplicatively until an SLO breaks; drop to the last-good concurrency and drain to a clean slate (no measurement) before re-probing the failed rung, distinguishing a transient/metastable failure from a genuine one. On a genuine failure, binary-search the exact ceiling, measuring each rung from a drained state, then confirm the boundary over several windows (M-of-N) so a single noisy window can't decide it.
  • Marginal-gain throughput gate (replaces linear extrapolation): a rung trips it when tokens/s falls below min_throughput_ratio of the throughput projected from the previous rung (a fixed pre-plateau baseline during bisection) — detecting the plateau directly.
  • Transition flags: the knee (saturation onset) and any transient recoveries, in the console summary and JSON. Each step is labeled with its phase (climb/bisect/confirm) and whether it was drained.
  • Driver: grows via add_permits, shrinks via forget_permits, and on a drain waits for the in-flight gauge to fall to target (bounded by the sample window) plus a short settle before measuring.
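
A minimal sketch of the climb and bisect phases, to make the convergence argument concrete. This is not the PR's `SearchPlanner`: the function name `find_ceiling`, the integer `step_multiplier`, and the `passes` closure standing in for a measured window are all assumptions for illustration, and the drain, re-probe, and confirm steps are left out.

```rust
/// Climb multiplicatively until a rung fails, then binary-search between the
/// last passing rung and the first failing one.
///
/// `passes(c)` is a hypothetical stand-in for "a window measured at
/// concurrency `c` met every SLO". Returns `None` when even `start` fails.
pub fn find_ceiling(
    start: u32,
    step_multiplier: u32,
    max_concurrency: u32,
    passes: impl Fn(u32) -> bool,
) -> Option<u32> {
    if !passes(start) {
        return None; // no compliant rung at the starting concurrency
    }
    let mut last_good = start;
    let mut first_bad: Option<u32> = None;

    // Climb phase: multiply until a rung fails or the cap is reached.
    let mut next = start.saturating_mul(step_multiplier).min(max_concurrency);
    while next > last_good {
        if passes(next) {
            last_good = next;
            next = last_good.saturating_mul(step_multiplier).min(max_concurrency);
        } else {
            first_bad = Some(next);
            break;
        }
    }

    // Never saw a failure: the cap (or the start) is the answer.
    let mut bad = match first_bad {
        Some(b) => b,
        None => return Some(last_good),
    };

    // Bisect phase. Invariant: `last_good` passes and `bad` fails.
    while bad - last_good > 1 {
        let mid = last_good + (bad - last_good) / 2;
        if passes(mid) {
            last_good = mid;
        } else {
            bad = mid;
        }
    }
    Some(last_good)
}
```

With an assumed start of 10 and multiplier of 2, a modeled server that tolerates up to 50 concurrent requests first fails at 80. A climb-only search would settle on 40, while the bisection above narrows the bracket (40, 80) down to 50.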

Config

  • min_throughput_ratio keeps its name with the new marginal semantics (documented).
  • stop_after_failures is now unused (accepted for backward compatibility).
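
One possible reading of the marginal semantics of min_throughput_ratio, sketched as a standalone function. The `Rung` struct, the function name, and in particular the projection rule (scaling the baseline's tokens/s linearly with concurrency) are assumptions; the PR text does not spell out how the projection is computed.

```rust
/// One measured rung: the concurrency level and the throughput observed there.
#[derive(Clone, Copy, Debug)]
pub struct Rung {
    pub concurrency: u32,
    pub tokens_per_sec: f64,
}

/// Returns true when `candidate` trips the marginal-gain gate.
///
/// Assumption: the projection scales the baseline rung's throughput by the
/// ratio of concurrencies. During the climb the baseline would be the previous
/// rung; during bisection it would be a fixed pre-plateau rung.
pub fn trips_throughput_gate(
    baseline: Rung,
    candidate: Rung,
    min_throughput_ratio: f64,
) -> bool {
    if baseline.concurrency == 0 {
        return false; // nothing to project from
    }
    let scale = candidate.concurrency as f64 / baseline.concurrency as f64;
    let projected = baseline.tokens_per_sec * scale;
    candidate.tokens_per_sec < min_throughput_ratio * projected
}
```

Under this reading, doubling concurrency from 40 to 80 with a baseline of 4000 tokens/s projects 8000 tokens/s. With a ratio of 0.8, a rung measuring 7600 tokens/s passes, and a rung stuck near 4100 tokens/s trips the gate.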

Test plan

  • cargo test — 94 lib + 16 integration tests pass, 0 failures. New planner_tests drive the planner against modeled servers: exact-knee bisection (converges to 50 where a multiplicative climb would report 40), no-compliant-at-start, transient recovery after drain, throughput-plateau detection (111), and confirm-window step-down (49).
  • cargo clippy --all-targets — clean
  • cargo fmt --check — clean
  • Independent review of the planner termination/convergence + drain interaction. It caught a real issue — the back-off drop was being measured as a full window (spurious results row + wasted ~sample_window per transient); fixed by making the drop a drain-only Action::Drain (no measurement). Declined its suggestion to drop the throughput gate during re-probe, since that would break plateau detection (a pure-latency transient has scaling throughput and passes; only a real plateau re-fails).
  • The async driver's drain and forget_permits shrink path has been verified only by building and by reasoning about the code. The recommended smoke test is to exercise the full climb→bisect→confirm sequence against a live server (use a low max_concurrency and watch the printed table).

Generated with Claude Code

…ition flags

Rewrite the concurrency saturation search (iopsystems#17). The old search climbed
multiplicatively only, judged each rung on a single window, terminated after N
consecutive failures, and gated throughput against a linear extrapolation from a
moving baseline — so it undershot the true ceiling by up to (step_multiplier-1),
was noise-sensitive, and mismodeled how servers plateau.

The new search separates a pure, unit-tested `SearchPlanner` state machine from
a thin async driver:

- Climb multiplicatively until an SLO breaks, then drop to the last-good
  concurrency and DRAIN to a clean slate (no measurement) before re-probing the
  failed rung — distinguishing a transient/metastable failure from a genuine one.
- On a genuine failure, binary-search the exact ceiling, measuring each rung from
  a drained state, then confirm the boundary over several windows (M-of-N) so a
  single noisy window can't decide it.
- Throughput is a marginal-gain gate: a rung trips it when tokens/s falls below
  min_throughput_ratio of the throughput projected from the previous rung (a
  fixed pre-plateau baseline during bisection) — detecting the plateau directly
  instead of extrapolating linearly.
- Emits transition flags: the knee (saturation onset) and any transient
  recoveries, surfaced in the console summary and JSON results. Each step is
  labeled with its phase (climb/bisect/confirm) and whether it was drained.
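
The M-of-N confirmation can be sketched as follows. The function name, the `window_passes` closure standing in for one measured window, and the choice to step down by exactly one on a failed confirmation are assumptions for illustration, not details taken from the planner.

```rust
/// Confirm a candidate ceiling over `n` windows, requiring at least `m` passes.
/// If the candidate fails confirmation, step down by one and try again.
///
/// `window_passes(c)` is a hypothetical stand-in for measuring one sample
/// window at concurrency `c`. Returns `None` if no level is confirmed.
pub fn confirm_with_step_down(
    mut candidate: u32,
    n: usize,
    m: usize,
    mut window_passes: impl FnMut(u32) -> bool,
) -> Option<u32> {
    while candidate > 0 {
        // Measure all n windows and count how many met the SLOs.
        let passes = (0..n).filter(|_| window_passes(candidate)).count();
        if passes >= m {
            return Some(candidate);
        }
        candidate -= 1; // assumed step-down size
    }
    None
}
```

With 2-of-3 confirmation, a level that passes only one of its three windows is rejected and the next level down is measured, so a single noisy window alone cannot decide the boundary.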

The driver grows via add_permits and shrinks via forget_permits, and on a drain
waits for the in-flight gauge to fall to the target (bounded) plus a short
settle before measuring.
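
A blocking, std-only sketch of the bounded drain wait described above. The real driver is async, so this only illustrates the control flow; the function name, the 1 ms polling interval, and the parameter names are assumptions.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;
use std::time::{Duration, Instant};

/// Wait until the in-flight gauge falls to `target`, giving up after
/// `max_wait`, then pause for `settle` before the caller starts measuring.
///
/// Returns true if the gauge reached the target before the deadline.
pub fn wait_for_drain(
    in_flight: &AtomicUsize,
    target: usize,
    max_wait: Duration,
    settle: Duration,
) -> bool {
    let deadline = Instant::now() + max_wait;
    let poll = Duration::from_millis(1); // assumed polling interval

    let drained = loop {
        if in_flight.load(Ordering::Acquire) <= target {
            break true;
        }
        if Instant::now() >= deadline {
            break false; // bounded: never wait longer than max_wait
        }
        thread::sleep(poll);
    };

    // Short settle so the next window starts from a quiet state.
    thread::sleep(settle);
    drained
}
```

A caller would pass the sample window as `max_wait`, matching the bound mentioned in the summary, and can use the boolean result to note a rung that was measured without a complete drain.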

`min_throughput_ratio` keeps its name with the new marginal semantics;
`stop_after_failures` is now unused (accepted for backward compatibility).

New unit tests drive the planner against modeled servers: exact-knee bisection,
no-compliant-at-start, transient recovery after drain, throughput-plateau
detection, and confirm-window step-down. An independent review flagged that the
back-off drop was being measured as a full window (spurious step + wasted time);
fixed by making it a drain-only action.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
@brayniac brayniac merged commit 4afab3e into iopsystems:main Jun 12, 2026
7 checks passed