feat: single-pass native HS capture with prefix-cache fill and multi-step accumulator #453

Merged
ArtemVazh merged 10 commits into IINemo:main from rvz16:feat/native-hs-multistep
May 4, 2026
@rvz16 rvz16 commented Apr 14, 2026

Contributor

No description provided.

rvz16 and others added 10 commits April 10, 2026 12:44
  Replace sequential per-sequence forward pass with hook-based capture
  during main generation. Add prefix-cache gap filling (within-group +
  cross-group LCP). Pad/trim to exact expected length with zero vectors.
…efix matching broke because:
  1. vLLM includes stop tokens in token_ids but strips them from the text;
  2. beam search processes/trims the generated text before adding it to the trajectory;
  3. BPE re-tokenisation at boundaries produces different tokens.
String prefix matching is immune to all three issues. Also resets vLLM APC in reset_hs_step_cache() to fix ~100 tokens of zero-padding from the cached system prompt prefix.
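A minimal sketch of the string-prefix lookup this commit message describes, assuming the step cache is keyed by trajectory text. The helper name `find_cached_by_text` and the cache layout are hypothetical and are not taken from the PR's `_find_cached_hs`.

```python
from typing import Dict, List, Optional, Tuple

# One hidden-state vector per token (toy 1-dimensional vectors here).
HiddenStates = List[List[float]]


def find_cached_by_text(
    cache: Dict[str, HiddenStates], trajectory_text: str
) -> Optional[Tuple[str, HiddenStates]]:
    """Return the longest cached key that is a string prefix of the trajectory."""
    best: Optional[Tuple[str, HiddenStates]] = None
    for key, hs in cache.items():
        if trajectory_text.startswith(key):
            if best is None or len(key) > len(best[0]):
                best = (key, hs)
    return best


# Hypothetical cache holding the hidden states accumulated after steps 1 and 2.
cache = {
    "Step 1: add 2 and 3.": [[0.1], [0.2]],
    "Step 1: add 2 and 3. Step 2: double it.": [[0.1], [0.2], [0.3]],
}
hit = find_cached_by_text(
    cache, "Step 1: add 2 and 3. Step 2: double it. Step 3: done."
)
miss = find_cached_by_text(cache, "An unrelated trajectory")
```

Matching on text means that stop tokens, trimmed generations, and re-tokenised boundaries do not change the cache key, which is the property the commit message relies on.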
vLLM's best-of-n / beam-search request_ids look like "0_0-947ca420":
{seq}_{prompt}-{hash}. The parent RequestOutput's request_id is just
the prompt-level id ("0" in this example), so output_short_ids has
{"0": 0}.

The existing _resolve_prompt_idx tries:
  1. exact match              "0_0-947ca420" -> miss
  2. suffix after last "_"    "0-947ca420"   -> miss (has the hash)
  3. prefix before first "-"  "0_0"          -> miss (has the seq)

All three miss, so _resolve_prompt_idx returns None for every captured
sub-request, hs_by_req_out stays [None, ..., None], and downstream
VLLMHiddenStates raises:

    ValueError: VLLMHiddenStates requires
    dependencies['vllm_hidden_states_output'] from
    VllmHiddenStatesGenerator (enable output_hidden_states in your
    wrapper).

Add a 4th attempt: when the post-_ suffix contains a "-", peel the
trailing -hash off and look up the bare prefix. Reproduced end-to-end
with thinkbooster's StrategyBeamSearch + StepScorerUncertainty +
luh.vllm.VLLMUncertaintyHeadFeatures + worker_extension_cls hook on
vLLM 0.15.1, Qwen3-8B, beam_size=2 candidates_per_beam=2. After the
fix the smoke produces validity_scores per step from UHead and the
strategy selects beams correctly.
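
The four lookup attempts described in this commit message can be sketched as follows. This is an illustration written from the description only, not the PR's `_resolve_prompt_idx`; the standalone function name `resolve_prompt_idx` and its signature are assumptions.

```python
from typing import Dict, Optional


def resolve_prompt_idx(
    request_id: str, output_short_ids: Dict[str, int]
) -> Optional[int]:
    """Map a sub-request id such as "0_0-947ca420" to its prompt index."""
    # 1. Exact match.
    if request_id in output_short_ids:
        return output_short_ids[request_id]
    # 2. Suffix after the last "_".
    suffix = request_id.rsplit("_", 1)[-1]
    if suffix in output_short_ids:
        return output_short_ids[suffix]
    # 3. Prefix before the first "-".
    prefix = request_id.split("-", 1)[0]
    if prefix in output_short_ids:
        return output_short_ids[prefix]
    # 4. New attempt: peel the trailing "-hash" off the post-"_" suffix.
    if "-" in suffix:
        bare = suffix.rsplit("-", 1)[0]
        if bare in output_short_ids:
            return output_short_ids[bare]
    return None


idx = resolve_prompt_idx("0_0-947ca420", {"0": 0})
unknown = resolve_prompt_idx("x", {"0": 0})
```

For `"0_0-947ca420"` the first three attempts try `"0_0-947ca420"`, `"0-947ca420"`, and `"0_0"`, all of which miss; the fourth strips `-947ca420` and finds `"0"`.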
@smirnovlad

Collaborator

Verification — safe to merge

I audited this PR end-to-end while reviewing the matching thinkbooster-side change (thinkbooster#249) and want to confirm this is good to land.

What I tested

  1. Contract integration with thinkbooster's HookHiddenStatesExtension. Drove a 3-request batch through the hook (req=0 full prefill, req=1 identical prompt with APC reuse, req=2 shared chat-template prefix), then ran _get_captured_states + _get_capture_metadata through this PR's _fill_prefix_gaps. Phase 1 (within-group) and Phase 2 (cross-group LCP) both fire correctly and produce the expected sizes.

  2. Real end-to-end smoke test on Qwen2.5-1.5B-Instruct via vLLM 0.12.0 with enable_prefix_caching=True and 3 hooked layers [5, 14, 27]. After fixing a separate multiplier bug on the thinkbooster side (see below), _fill_prefix_gaps produces correct per-request hidden states: 43/43/41 tokens vs. the expected 44/44/42. The 1-token difference is the known case where the last generated token gets no forward pass, which lm-polygraph already zero-pads.

  3. Method surface check: reset_hs_step_cache, _find_cached_hs, _update_hs_step_cache, _cleanup_hs_step_cache, and _fill_prefix_gaps are all present and importable.
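
To make the gap-filling and zero-padding behaviour tested above concrete, here is a simplified sketch. It assumes the recipient captured hidden states only for the tokens that were not served from the prefix cache, and that a donor request holds states for the shared prefix. The helpers `common_prefix_len`, `fill_from_donor`, and `pad_or_trim` are hypothetical and do not reproduce the PR's `_fill_prefix_gaps`.

```python
from typing import List, Sequence

Vector = List[float]


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Length of the longest common prefix (LCP) of two token-id sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def fill_from_donor(
    recipient_tokens: Sequence[int],
    recipient_hs: List[Vector],
    donor_tokens: Sequence[int],
    donor_hs: List[Vector],
) -> List[Vector]:
    """Prepend donor rows for the prefix the recipient skipped, if fully covered."""
    missing = len(recipient_tokens) - len(recipient_hs)
    shared = common_prefix_len(recipient_tokens, donor_tokens)
    # Simplification: only fill when the donor covers the whole gap.
    if missing <= 0 or shared < missing or len(donor_hs) < missing:
        return list(recipient_hs)
    return list(donor_hs[:missing]) + list(recipient_hs)


def pad_or_trim(hs: List[Vector], expected_len: int, dim: int) -> List[Vector]:
    """Trim extra rows or append zero vectors to reach expected_len rows."""
    out = list(hs[:expected_len])
    out.extend([0.0] * dim for _ in range(expected_len - len(out)))
    return out


donor_tokens = [1, 2, 3, 4, 5]
donor_hs = [[1.0], [2.0], [3.0], [4.0], [5.0]]
recipient_tokens = [1, 2, 3, 9]
recipient_hs = [[9.0]]  # only the uncached token went through the hooked layers

filled = fill_from_donor(recipient_tokens, recipient_hs, donor_tokens, donor_hs)
padded = pad_or_trim(filled, expected_len=5, dim=1)
```

Here the recipient is missing three rows, the LCP with the donor is three tokens, and the final zero row stands in for a last token that had no forward pass.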

One important note for downstream callers

The _fill_prefix_gaps helper reads prefill_tokens from the metadata produced by the worker's _get_capture_metadata. If the worker's bookkeeping is run inside a per-layer hook closure (so it counts once per (step, hooked_layer)), the prefill_tokens counter is multiplied by len(hs_layer_ids) and Phase 1 over-fills (donor's full sequence prepended) while Phase 2 silently skips because the recipient looks fully prefilled. This was a bug on the thinkbooster side, not here. Fixed in IINemo/thinkbooster#249 (commit 1d465df) by gating the bookkeeping on min(layer_ids). The thinkbooster PR also added 25 unit/integration tests, 5 of which run real _fill_prefix_gaps from this PR's branch.
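
The multiplier effect described above can be reproduced with a toy model of per-layer hook closures. Everything in this sketch (`make_hooks`, `run_step`, the `meta` dict) is hypothetical and only mimics the counting pattern; it is not the thinkbooster worker code.

```python
from typing import Callable, Dict, List, Tuple


def make_hooks(
    layer_ids: List[int], gate: bool
) -> Tuple[Dict[int, Callable[[int], None]], Dict[str, int]]:
    """Build one hook per layer; each hook may update shared bookkeeping."""
    meta = {"prefill_tokens": 0}
    first_layer = min(layer_ids)

    def make_hook(layer_id: int) -> Callable[[int], None]:
        def hook(num_tokens: int) -> None:
            # Without gating, every hooked layer counts the same tokens again.
            if not gate or layer_id == first_layer:
                meta["prefill_tokens"] += num_tokens

        return hook

    return {lid: make_hook(lid) for lid in layer_ids}, meta


def run_step(hooks: Dict[int, Callable[[int], None]], num_tokens: int) -> None:
    """Simulate one forward step: every hooked layer fires once."""
    for hook in hooks.values():
        hook(num_tokens)


layers = [5, 14, 27]

hooks, meta = make_hooks(layers, gate=False)
run_step(hooks, 44)
ungated_count = meta["prefill_tokens"]  # counted once per hooked layer

hooks, meta = make_hooks(layers, gate=True)
run_step(hooks, 44)
gated_count = meta["prefill_tokens"]  # counted once per step
```

With three hooked layers the ungated counter is three times the real prefill length, which matches the over-fill symptom; gating on `min(layer_ids)` restores the true count.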

Sequence with the other three open vLLM-adapter PRs (#454/#457/#458)

These all touch a different file (whitebox_model_vllm.py) so there's no merge conflict with #453. Suggested order:

  1. #453 (this PR, "feat: single-pass native HS capture with prefix-cache fill and multi-step accumulator"): merge first; it's the one downstream depends on.
  2. #458 ("Fix off-by-one error in vLLM adapter vocab_size calculation"): pure correctness fix, lands transparently.
  3. #454 ("Fix vLLM compatibility in GreedyProbsCalculator and WhiteboxModelvLLM") and #457 ("Disable thinking mode in vLLM instruct chat"): independent of the HS capture path; review/merge on their own merits.
  4. Cut a release that contains the lot.

Then thinkbooster #249 bumps its lm-polygraph pin to that release and merges. Without bumping the pin, anyone on the old lm-polygraph 0.6.0 wheel hits a TypeError: a bytes-like object is required on first uhead/native-HS call, because 0.6.0's _raw_generate does pickle.loads(pickled_bytes) directly on what is now a nested {req_id: bytes} dict.
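
The failure mode in the paragraph above is easy to reproduce in isolation: `pickle.loads` expects a bytes-like object, so handing it a dict raises `TypeError`. The payload below is made up for illustration; only the nested `{req_id: bytes}` shape comes from the comment.

```python
import pickle

# Hypothetical payload in the nested {req_id: bytes} shape.
nested = {
    "0": pickle.dumps([1.0, 2.0]),
    "1": pickle.dumps([3.0]),
}

# What an old caller effectively does: unpickle the whole object at once.
try:
    pickle.loads(nested)  # type: ignore[arg-type]
    error = None
except TypeError as exc:
    error = str(exc)

# What the nested format requires: unpickle each request's blob separately.
decoded = {req_id: pickle.loads(blob) for req_id, blob in nested.items()}
```

This is why the pin bump has to land together with the release: the per-request decoding only exists on the new side.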

LGTM. Thanks for the work.

@smirnovlad smirnovlad changed the title from "Fixing native hs multistep for online BoN and beam search" to "feat: single-pass native HS capture with prefix-cache fill and multi-step accumulator" May 2, 2026
@ArtemVazh ArtemVazh merged commit 6d3a574 into IINemo:main May 4, 2026
1 check passed