feat(saturation): bisect + drain search, plateau detection, transition flags #145
Merged
…ition flags

Rewrite the concurrency saturation search (iopsystems#17). The old search climbed multiplicatively only, judged each rung on a single window, terminated after N consecutive failures, and gated throughput against a linear extrapolation from a moving baseline. As a result it undershot the true ceiling by up to (step_multiplier-1), was noise-sensitive, and mismodeled how servers plateau.

The new search separates a pure, unit-tested `SearchPlanner` state machine from a thin async driver:

- Climb multiplicatively until an SLO breaks, then drop to the last-good concurrency and DRAIN to a clean slate (no measurement) before re-probing the failed rung. This distinguishes a transient/metastable failure from a genuine one.
- On a genuine failure, binary-search the exact ceiling, measuring each rung from a drained state, then confirm the boundary over several windows (M-of-N) so a single noisy window can't decide it.
- Throughput is a marginal-gain gate: a rung trips it when tokens/s falls below min_throughput_ratio of the throughput projected from the previous rung (a fixed pre-plateau baseline during bisection). This detects the plateau directly instead of extrapolating linearly.
- Emits transition flags: the knee (saturation onset) and any transient recoveries, surfaced in the console summary and JSON results. Each step is labeled with its phase (climb/bisect/confirm) and whether it was drained.

The driver grows via add_permits and shrinks via forget_permits, and on a drain waits for the in-flight gauge to fall to the target (bounded) plus a short settle before measuring.

`min_throughput_ratio` keeps its name with the new marginal semantics; `stop_after_failures` is now unused (accepted for backward compatibility).

New unit tests drive the planner against modeled servers: exact-knee bisection, no-compliant-at-start, transient recovery after drain, throughput-plateau detection, and confirm-window step-down.
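To make the marginal-gain gate concrete, here is a minimal sketch, not the code from this PR. The `Rung` struct and `trips_throughput_gate` function are hypothetical names, and the projection is assumed to scale the baseline rung's throughput linearly with concurrency; the actual projection in `SearchPlanner` may differ.

```rust
/// One measured rung: concurrency level and observed throughput in tokens/s.
/// (Hypothetical type, used only for this illustration.)
#[derive(Clone, Copy, Debug)]
struct Rung {
    concurrency: u32,
    tokens_per_sec: f64,
}

/// Returns true when `current` trips the marginal-gain gate relative to `baseline`.
///
/// Assumption: the "projected" throughput is the baseline throughput scaled
/// linearly by the concurrency ratio. The rung trips the gate when its observed
/// throughput falls below `min_throughput_ratio` of that projection.
/// A baseline concurrency of 0 is not meaningful and is not handled here.
fn trips_throughput_gate(baseline: Rung, current: Rung, min_throughput_ratio: f64) -> bool {
    let projected = baseline.tokens_per_sec * f64::from(current.concurrency)
        / f64::from(baseline.concurrency);
    current.tokens_per_sec < min_throughput_ratio * projected
}
```

With a baseline of 4000 tokens/s at concurrency 40 and a ratio of 0.8, a rung at concurrency 80 must reach 6400 tokens/s to pass. A rung that still scales (7600 tokens/s) passes, while a plateaued one (4200 tokens/s) trips the gate.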
An independent review flagged that the back-off drop was being measured as a full window (spurious step + wasted time); fixed by making it a drain-only action.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
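The M-of-N boundary confirmation mentioned in the commit message can be illustrated with a short sketch. The function name `confirmed_m_of_n` and the representation of windows as a slice of pass/fail booleans are assumptions for illustration, not the planner's actual interface.

```rust
/// Decide a boundary from several measurement windows instead of one.
///
/// `window_passed[i]` is true when window `i` met the SLO at the candidate
/// concurrency. The candidate is confirmed when at least `required` (M) of the
/// observed windows (N = window_passed.len()) passed.
/// (Hypothetical helper, for illustration only.)
fn confirmed_m_of_n(window_passed: &[bool], required: usize) -> bool {
    let passes = window_passed.iter().filter(|&&passed| passed).count();
    passes >= required
}
```

With 2-of-3, a single noisy failing window (`[true, false, true]`) still confirms the rung, whereas two failures (`[true, false, false]`) reject it and the search would step down.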
Summary
Rewrites the concurrency saturation search (#17, the last of the review groups). The old search climbed multiplicatively only, judged each rung on a single window, stopped after N consecutive failures, and gated throughput against a linear extrapolation from a moving baseline, so it undershot the true ceiling by up to (step_multiplier − 1), was noise-sensitive, and mismodeled plateaus.

New design: a pure, unit-tested `SearchPlanner` state machine separated from a thin async driver:

- Throughput is a marginal-gain gate: a rung trips it when tokens/s falls below `min_throughput_ratio` of the throughput projected from the previous rung (a fixed pre-plateau baseline during bisection), detecting the plateau directly.
- The driver grows via `add_permits`, shrinks via `forget_permits`, and on a drain waits for the in-flight gauge to fall to target (bounded by the sample window) plus a short settle before measuring.

Config

`min_throughput_ratio` keeps its name with the new marginal semantics (documented). `stop_after_failures` is now unused (accepted for backward compatibility).

Test plan

- `cargo test`: 94 lib + 16 integration tests pass, 0 failures. New `planner_tests` drive the planner against modeled servers: exact-knee bisection (converges to 50 where a multiplicative climb would report 40), no-compliant-at-start, transient recovery after drain, throughput-plateau detection (111), and confirm-window step-down (49).
- `cargo clippy --all-targets`: clean
- `cargo fmt --check`: clean

An independent review flagged that the back-off drop was being measured as a full window; fixed by making it `Action::Drain` (no measurement). Declined its suggestion to drop the throughput gate during re-probe, since that would break plateau detection (a pure-latency transient has scaling throughput and passes; only a real plateau re-fails).

The `forget_permits` shrink is build + reasoning-verified; exercising the full climb→bisect→confirm against a live server is the recommended smoke test (low `max_concurrency`, watch the printed table).

Generated with Claude Code
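The exact-knee case in the test plan (50 versus 40) can be reproduced with a small standalone sketch. It is not the `SearchPlanner` from this PR: `climb` and `bisect` are hypothetical helpers, the modeled server is a plain closure, and the start concurrency of 5 and multiplier of 2 in the usage note are assumed values chosen so the climb lands on 40 and 80. Draining, confirmation windows, and the throughput gate are omitted.

```rust
/// Climb multiplicatively from `start` until `passes` fails or `max` is exceeded.
/// Returns (last passing concurrency, first failing concurrency).
fn climb(
    start: u32,
    multiplier: u32,
    max: u32,
    passes: impl Fn(u32) -> bool,
) -> (Option<u32>, Option<u32>) {
    let mut last_good = None;
    let mut c = start;
    while c <= max {
        if passes(c) {
            last_good = Some(c);
        } else {
            return (last_good, Some(c));
        }
        // Stop on overflow or when the multiplier makes no progress.
        match c.checked_mul(multiplier) {
            Some(next) if next > c => c = next,
            _ => break,
        }
    }
    (last_good, None)
}

/// Binary-search the highest passing concurrency in the open interval
/// between a known-good and a known-bad rung. Requires `good < bad`.
fn bisect(mut good: u32, mut bad: u32, passes: impl Fn(u32) -> bool) -> u32 {
    while bad - good > 1 {
        let mid = good + (bad - good) / 2;
        if passes(mid) {
            good = mid;
        } else {
            bad = mid;
        }
    }
    good
}
```

Against a modeled server that is compliant up to concurrency 50 (`|c| c <= 50`), `climb(5, 2, 1024, ..)` stops with 40 as the last good rung and 80 as the first bad one, which is what a climb-only search would report. `bisect(40, 80, ..)` then probes 60, 50, 55, 52, and 51 and settles on 50.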