feat: single-pass native HS capture with prefix-cache fill and multi-step accumulator by rvz16 · Pull Request #453 · IINemo/lm-polygraph

rvz16 · 2026-04-14T06:19:35Z

No description provided.

Replace sequential per-sequence forward pass with hook-based capture during main generation. Add prefix-cache gap filling (within-group + cross-group LCP). Pad/trim to exact expected length with zero vectors.
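The pad/trim step mentioned above can be sketched as follows. This is a minimal illustration, not the PR's code: the function name `pad_or_trim`, the plain-list representation of hidden states, and the choice to pad at the end of the sequence are all assumptions made for the example.

```python
from typing import List

Vector = List[float]


def pad_or_trim(states: List[Vector], expected_len: int, hidden_size: int) -> List[Vector]:
    """Return exactly `expected_len` vectors (hypothetical helper).

    Extra vectors are dropped from the end; missing positions are
    filled with zero vectors appended at the end (an assumption).
    """
    if len(states) >= expected_len:
        return states[:expected_len]
    missing = expected_len - len(states)
    return states + [[0.0] * hidden_size for _ in range(missing)]


# 43 captured vectors, 44 expected: one zero vector is appended.
captured = [[1.0, 2.0] for _ in range(43)]
fixed = pad_or_trim(captured, expected_len=44, hidden_size=2)
```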


…efix matching broke for three reasons: vLLM includes stop tokens in token_ids but strips them from the text; beam search processes and trims the generated text before adding it to the trajectory; and BPE re-tokenisation at boundaries produces different tokens. String prefix matching is immune to all three issues. This commit also resets vLLM's automatic prefix cache (APC) in reset_hs_step_cache() to fix roughly 100 tokens of zero-padding caused by the cached system-prompt prefix.
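A text-keyed lookup of the kind described above might look like the sketch below. The function name `find_longest_text_prefix` and the cache layout (a dict from accumulated text to a cached value) are hypothetical; the PR's own _find_cached_hs may work differently.

```python
from typing import Dict, Optional, Tuple, TypeVar

T = TypeVar("T")


def find_longest_text_prefix(cache: Dict[str, T], text: str) -> Optional[Tuple[str, T]]:
    """Return the (key, value) whose key is the longest string prefix of `text`.

    Matching on text sidesteps differences in token ids between steps.
    Hypothetical helper for illustration only.
    """
    best: Optional[Tuple[str, T]] = None
    for key, value in cache.items():
        if text.startswith(key) and (best is None or len(key) > len(best[0])):
            best = (key, value)
    return best


# Values stand in for cached hidden states of earlier steps.
step_cache = {"Step 1.": "hs_step_1", "Step 1. Step 2.": "hs_step_2"}
hit = find_longest_text_prefix(step_cache, "Step 1. Step 2. Step 3.")
miss = find_longest_text_prefix(step_cache, "Unrelated text")
```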


vLLM's best-of-n / beam-search request_ids look like "0_0-947ca420": {seq}_{prompt}-{hash}. The parent RequestOutput's request_id is just the prompt-level id ("0" in this example), so output_short_ids has {"0": 0}. The existing _resolve_prompt_idx tries:

1. exact match "0_0-947ca420" -> miss
2. suffix after last "_" "0-947ca420" -> miss (has the hash)
3. prefix before first "-" "0_0" -> miss (has the seq)

All three miss, so _resolve_prompt_idx returns None for every captured sub-request, hs_by_req_out stays [None, ..., None], and downstream VLLMHiddenStates raises:

ValueError: VLLMHiddenStates requires dependencies['vllm_hidden_states_output'] from VllmHiddenStatesGenerator (enable output_hidden_states in your wrapper).

Add a 4th attempt: when the post-_ suffix contains a "-", peel the trailing -hash off and look up the bare prefix.

Reproduced end-to-end with thinkbooster's StrategyBeamSearch + StepScorerUncertainty + luh.vllm.VLLMUncertaintyHeadFeatures + worker_extension_cls hook on vLLM 0.15.1, Qwen3-8B, beam_size=2 candidates_per_beam=2. After the fix the smoke produces validity_scores per step from UHead and the strategy selects beams correctly.
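The four lookup attempts described in this commit message can be sketched as a standalone function. This is a reconstruction from the description only; the name `resolve_prompt_idx` and its signature are assumptions, and the real _resolve_prompt_idx in lm-polygraph may differ in detail.

```python
from typing import Dict, Optional


def resolve_prompt_idx(req_id: str, output_short_ids: Dict[str, int]) -> Optional[int]:
    """Map a vLLM sub-request id to a prompt index (illustrative sketch)."""
    # 1. Exact match.
    if req_id in output_short_ids:
        return output_short_ids[req_id]
    # 2. Suffix after the last "_", e.g. "4_0" -> "0".
    suffix = req_id.rsplit("_", 1)[-1]
    if suffix in output_short_ids:
        return output_short_ids[suffix]
    # 3. Prefix before the first "-", e.g. "0-947ca420" -> "0".
    head = req_id.split("-", 1)[0]
    if head in output_short_ids:
        return output_short_ids[head]
    # 4. New attempt: peel the trailing "-hash" off the post-"_" suffix,
    #    e.g. "0_0-947ca420" -> "0-947ca420" -> "0".
    if "-" in suffix:
        bare = suffix.rsplit("-", 1)[0]
        if bare in output_short_ids:
            return output_short_ids[bare]
    return None


idx = resolve_prompt_idx("0_0-947ca420", {"0": 0})
```

With only the first three attempts, the id "0_0-947ca420" falls through to None; the fourth attempt is what recovers the prompt-level id "0".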

smirnovlad · 2026-05-02T11:23:59Z

Verification — safe to merge

I audited this PR end-to-end while reviewing the matching thinkbooster-side change (thinkbooster#249) and want to confirm this is good to land.

What I tested

Contract integration with thinkbooster's HookHiddenStatesExtension. Drove a 3-request batch through the hook (req=0 full prefill, req=1 identical prompt with APC reuse, req=2 shared chat-template prefix), then ran _get_captured_states + _get_capture_metadata through this PR's _fill_prefix_gaps. Phase 1 (within-group) and Phase 2 (cross-group LCP) both fire correctly and produce the expected sizes.
Real end-to-end smoke on Qwen2.5-1.5B-Instruct via vLLM 0.12.0 with enable_prefix_caching=True, 3 hooked layers [5, 14, 27]. After fixing a separate multiplier bug on the thinkbooster side (see below), _fill_prefix_gaps produces correct per-request hidden states (43/43/41 tokens vs. expected 44/44/42, the 1-token difference being the well-known last-token-no-forward-pass which lm-polygraph already zero-pads).
Method surface check — reset_hs_step_cache, _find_cached_hs, _update_hs_step_cache, _cleanup_hs_step_cache, _fill_prefix_gaps all present and importable.
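For readers unfamiliar with the gap-filling idea, here is a rough sketch of a cross-request fill based on the longest common prefix (LCP). It is not the PR's _fill_prefix_gaps: the function names, the donor/recipient argument layout, and the rule "fill only when the donor covers the whole gap" are assumptions chosen to keep the example small.

```python
from typing import List, Sequence

Vector = List[float]


def lcp_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Length of the longest common prefix of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def fill_from_donor(
    recipient_tokens: Sequence[int],
    recipient_states: List[Vector],
    donor_tokens: Sequence[int],
    donor_states: List[Vector],
) -> List[Vector]:
    """Prepend donor states for the prefix the recipient skipped (cache hit).

    `recipient_states` is assumed to cover only the tail of `recipient_tokens`.
    """
    gap = len(recipient_tokens) - len(recipient_states)
    if gap <= 0:
        return list(recipient_states)  # nothing missing
    shared = lcp_len(recipient_tokens, donor_tokens)
    if min(shared, len(donor_states)) < gap:
        return list(recipient_states)  # donor cannot cover the gap
    return donor_states[:gap] + list(recipient_states)


donor_tok = [1, 2, 3, 4, 5]
donor_hs = [[float(t)] for t in donor_tok]
# The recipient shares the prefix [1, 2, 3]; only its last token was computed.
filled = fill_from_donor([1, 2, 3, 9], [[9.0]], donor_tok, donor_hs)
```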

One important note for downstream callers

The _fill_prefix_gaps helper reads prefill_tokens from the metadata produced by the worker's _get_capture_metadata. If the worker's bookkeeping is run inside a per-layer hook closure (so it counts once per (step, hooked_layer)), the prefill_tokens counter is multiplied by len(hs_layer_ids) and Phase 1 over-fills (donor's full sequence prepended) while Phase 2 silently skips because the recipient looks fully prefilled. This was a bug on the thinkbooster side, not here. Fixed in IINemo/thinkbooster#249 (commit 1d465df) by gating the bookkeeping on min(layer_ids). The thinkbooster PR also added 25 unit/integration tests, 5 of which run real _fill_prefix_gaps from this PR's branch.
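The multiplier effect is easy to reproduce in isolation. The sketch below simulates a hook that fires once per (step, hooked layer) and compares ungated and gated bookkeeping; `count_prefill_tokens` and its arguments are hypothetical names, not code from either repository.

```python
from typing import List


def count_prefill_tokens(layer_ids: List[int], tokens_per_step: List[int], gate: bool) -> int:
    """Simulate per-layer hooks updating a shared prefill-token counter."""
    first_layer = min(layer_ids)
    counter = 0
    for n_tokens in tokens_per_step:      # one forward pass per step
        for layer_id in layer_ids:        # every hooked layer fires its hook
            if gate and layer_id != first_layer:
                continue                  # bookkeeping runs on one layer only
            counter += n_tokens
    return counter


layers = [5, 14, 27]
ungated = count_prefill_tokens(layers, [44], gate=False)
gated = count_prefill_tokens(layers, [44], gate=True)
```

In this simulation the ungated counter is len(layers) times the true prefill length, which is the over-count described above; gating on the minimum layer id restores the true value.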

Sequence with the other three open vLLM-adapter PRs (#454/#457/#458)

These all touch a different file (whitebox_model_vllm.py) so there's no merge conflict with #453. Suggested order:

1. feat: single-pass native HS capture with prefix-cache fill and multi-step accumulator #453 (this PR): merge first, since it's the one downstream depends on.
2. Fix off-by-one error in vLLM adapter vocab_size calculation #458: pure correctness fix, lands transparently.
3. Fix vLLM compatibility in GreedyProbsCalculator and WhiteboxModelvLLM #454 and Disable thinking mode in vLLM instruct chat #457: independent of the HS capture path; review/merge on their own merits.
4. Cut a release that contains the lot.

Then thinkbooster #249 bumps its lm-polygraph pin to that release and merges. Without bumping the pin, anyone on the old lm-polygraph 0.6.0 wheel hits a TypeError: a bytes-like object is required on first uhead/native-HS call, because 0.6.0's _raw_generate does pickle.loads(pickled_bytes) directly on what is now a nested {req_id: bytes} dict.
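The failure mode is straightforward to demonstrate with the standard library alone. In the sketch below, `old_style_decode` mimics calling pickle.loads directly on the payload and `decode_per_request` unpickles each entry of a {req_id: bytes} dict; both function names and the sample payload are made up for illustration.

```python
import pickle
from typing import Any, Dict

# Hypothetical payload in the nested shape: one pickled blob per request id.
nested = {"0": pickle.dumps([0.1, 0.2]), "1": pickle.dumps([0.3])}


def old_style_decode(payload: Any) -> Any:
    """Call pickle.loads on the payload as a whole (fails on a dict)."""
    return pickle.loads(payload)


def decode_per_request(payload: Dict[str, bytes]) -> Dict[str, Any]:
    """Unpickle each request's blob separately."""
    return {req_id: pickle.loads(blob) for req_id, blob in payload.items()}


try:
    old_style_decode(nested)
    error_message = ""
except TypeError as exc:
    error_message = str(exc)

decoded = decode_per_request(nested)
```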

LGTM. Thanks for the work.

rvz16 and others added 10 commits April 10, 2026 12:44

feat: single-pass hidden states capture for use_native_hs_capture

dd3ccae

Replace sequential per-sequence forward pass with hook-based capture during main generation. Add prefix-cache gap filling (within-group + cross-group LCP). Pad/trim to exact expected length with zero vectors.

fix: handle vLLM request ID formats with _ separator (4_0, 5_0)

bddf109

fix: map req_id sequence suffix to correct output index for best-of-n (different arr)

0055c4f

fix: correct req_id format for best-of-n - {seq}_{prompt}, not {prompt}_{seq}

a49dfef

feat: multi-step HS accumulator for beam search and online BoN

5e0afe6

fix: text-based step cache key + APC reset for multi-step HS capture

933975a

fix: include outtext in step cache key to avoid cross-pollution in offline BoN

a3cdf0d

fix: deferred step cache writes with prompt-only keys. Got problem with this in adaptive-scalling

33a7a57

smirnovlad mentioned this pull request May 2, 2026

feat: single-pass hidden states capture for use_native_hs_capture #452

Closed

smirnovlad self-requested a review May 2, 2026 11:22

smirnovlad approved these changes May 2, 2026


smirnovlad changed the title ~~Fixing native hs multistep for online BoN and beam search~~ feat: single-pass native HS capture with prefix-cache fill and multi-step accumulator May 2, 2026

ArtemVazh approved these changes May 4, 2026


ArtemVazh merged commit 6d3a574 into IINemo:main May 4, 2026
1 check passed


ArtemVazh merged 10 commits into IINemo:main from rvz16:feat/native-hs-multistep
