feat: single-pass native HS capture with prefix-cache fill and multi-step accumulator #453
Replace sequential per-sequence forward pass with hook-based capture during main generation. Add prefix-cache gap filling (within-group + cross-group LCP). Pad/trim to exact expected length with zero vectors.
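The two building blocks named above (a longest-common-prefix length and a pad/trim to an exact length) can be sketched roughly as follows. This is a minimal illustration only: the function names, the list-of-lists representation of hidden states, and the choice to pad at the tail are assumptions, since the commit message does not say which side is padded or how the tensors are stored.

```python
from typing import List, Sequence

Vector = List[float]


def longest_common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading token ids shared by two sequences (hypothetical helper)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pad_or_trim(hidden: List[Vector], expected_len: int, dim: int) -> List[Vector]:
    """Return exactly `expected_len` rows.

    Assumption: extra rows are dropped from the end and missing rows are
    appended as zero vectors at the end; the real code may pad elsewhere.
    """
    if len(hidden) >= expected_len:
        return hidden[:expected_len]
    missing = expected_len - len(hidden)
    return hidden + [[0.0] * dim for _ in range(missing)]


# Toy usage with made-up values.
shared = longest_common_prefix_len([1, 2, 3, 4], [1, 2, 9])
padded = pad_or_trim([[1.0, 2.0]], expected_len=3, dim=2)
trimmed = pad_or_trim([[1.0], [2.0], [3.0]], expected_len=2, dim=1)
```

Zero rows make a gap visible downstream, which may be why the later commit treats a long run of zero-padding as a symptom worth fixing.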
…efix matching broke because: (1) vLLM includes stop tokens in token_ids but strips them from text; (2) beam search processes/trims the generated text before adding it to the trajectory; (3) BPE re-tokenisation at boundaries produces different tokens. String prefix matching is immune to all three issues. Also resets vLLM APC (automatic prefix caching) in reset_hs_step_cache() to fix ~100 tokens of zero-padding caused by the cached system prompt prefix.
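A toy illustration of the stop-token case, under stated assumptions: the helper names and all data below are made up, and token id 99 merely stands in for a stop token that appears in token_ids but not in the decoded text. It is a sketch of the idea, not the code from this commit.

```python
from typing import List, Optional, Sequence


def is_token_prefix(prefix: Sequence[int], full: Sequence[int]) -> bool:
    """True if `prefix` matches the leading token ids of `full`."""
    return len(prefix) <= len(full) and list(full[: len(prefix)]) == list(prefix)


def match_by_text_prefix(trajectory_text: str, captured_texts: List[str]) -> Optional[int]:
    """Index of the longest captured text that is a prefix of the trajectory text."""
    best_idx, best_len = None, -1
    for idx, text in enumerate(captured_texts):
        if trajectory_text.startswith(text) and len(text) > best_len:
            best_idx, best_len = idx, len(text)
    return best_idx


# Hypothetical data: the captured step ends with stop token 99, while the
# trajectory continues with ordinary tokens at that position.
captured_ids = [11, 12, 13, 99]
trajectory_ids = [11, 12, 13, 14, 15]
captured_text = "Step 1: add 2."  # stop token already stripped from text
trajectory_text = "Step 1: add 2. Step 2: halve."

token_match = is_token_prefix(captured_ids, trajectory_ids)
text_match = match_by_text_prefix(trajectory_text, ["Step 1: sub", captured_text])
```

Here the token comparison fails on the trailing stop token, while the text comparison still identifies the captured step.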
…th this in adaptive-scalling
vLLM's best-of-n / beam-search request_ids look like "0_0-947ca420":
{seq}_{prompt}-{hash}. The parent RequestOutput's request_id is just
the prompt-level id ("0" in this example), so output_short_ids has
{"0": 0}.
The existing _resolve_prompt_idx tries:
1. exact match "0_0-947ca420" -> miss
2. suffix after last "_" "0-947ca420" -> miss (has the hash)
3. prefix before first "-" "0_0" -> miss (has the seq)
All three miss, so _resolve_prompt_idx returns None for every captured
sub-request, hs_by_req_out stays [None, ..., None], and downstream
VLLMHiddenStates raises:
ValueError: VLLMHiddenStates requires
dependencies['vllm_hidden_states_output'] from
VllmHiddenStatesGenerator (enable output_hidden_states in your
wrapper).
Add a 4th attempt: when the post-_ suffix contains a "-", peel the
trailing -hash off and look up the bare prefix. Reproduced end-to-end
with thinkbooster's StrategyBeamSearch + StepScorerUncertainty +
luh.vllm.VLLMUncertaintyHeadFeatures + worker_extension_cls hook on
vLLM 0.15.1, Qwen3-8B, beam_size=2, candidates_per_beam=2. After the
fix the smoke test produces validity_scores per step from UHead and the
strategy selects beams correctly.
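The four lookup attempts described above can be sketched as a standalone function. This is an approximation written from the commit message alone: the name resolve_prompt_idx and its signature are assumptions, and the real _resolve_prompt_idx may differ in details.

```python
from typing import Dict, Optional


def resolve_prompt_idx(request_id: str, output_short_ids: Dict[str, int]) -> Optional[int]:
    """Map a captured sub-request id to its prompt index, or None if no attempt hits."""
    # 1. Exact match.
    if request_id in output_short_ids:
        return output_short_ids[request_id]

    # 2. Suffix after the last "_".
    suffix = request_id.rsplit("_", 1)[-1]
    if suffix in output_short_ids:
        return output_short_ids[suffix]

    # 3. Prefix before the first "-".
    prefix = request_id.split("-", 1)[0]
    if prefix in output_short_ids:
        return output_short_ids[prefix]

    # 4. New attempt: peel the trailing "-hash" off the post-"_" suffix.
    if "-" in suffix:
        bare = suffix.rsplit("-", 1)[0]
        if bare in output_short_ids:
            return output_short_ids[bare]

    return None


# The id from the commit message: {seq}_{prompt}-{hash} with parent id "0".
idx = resolve_prompt_idx("0_0-947ca420", {"0": 0})
```

With the example id, attempts 1 to 3 look up "0_0-947ca420", "0-947ca420" and "0_0" and all miss; attempt 4 looks up "0" and returns index 0.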
Verification: safe to merge
I audited this PR end-to-end while reviewing the matching thinkbooster-side change (thinkbooster#249) and want to confirm this is good to land.
What I tested
One important note for downstream callers
The …
Sequence with the other three open vLLM-adapter PRs (#454/#457/#458)
These all touch a different file (…
Then thinkbooster #249 bumps its …
LGTM. Thanks for the work.