feat(report): flag chunk-granular ITL when the server batches tokens by brayniac · Pull Request #150 · iopsystems/llm-perf

brayniac · 2026-06-13T14:26:47Z

Summary

Design-limitation #4, option (b): detect + label, rather than change the ITL math.

ITL is recorded per streamed chunk: one inter-token-latency sample is the gap between consecutive SSE content chunks. When a server emits one token per chunk (the common case for vLLM/TGI/llama-server at normal load), that equals true per-token ITL. When it batches multiple tokens into one chunk (under load, speculative decoding, chunked streaming), each ITL sample is one inter-chunk gap covering several tokens, so the metric is biased high (each sample spans more than one token) and under-sampled (fewer samples than tokens).
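To make the bias concrete, here is a minimal sketch of per-chunk ITL sampling. The function names and the millisecond timestamps are illustrative assumptions, not code from this PR.

```rust
/// Hypothetical helper: gaps between consecutive chunk arrival times, in
/// milliseconds. Under per-chunk recording, each gap is one ITL sample.
fn inter_chunk_gaps_ms(arrivals_ms: &[u64]) -> Vec<u64> {
    arrivals_ms
        .windows(2)
        // saturating_sub avoids a panic if timestamps are ever out of order
        .map(|w| w[1].saturating_sub(w[0]))
        .collect()
}

/// Hypothetical helper: arithmetic mean of the samples, 0.0 when empty.
fn mean_ms(samples: &[u64]) -> f64 {
    if samples.is_empty() {
        return 0.0;
    }
    samples.iter().sum::<u64>() as f64 / samples.len() as f64
}
```

Suppose a server generates one token every 10 ms, nine tokens in total. Streaming one token per chunk gives arrivals `[0, 10, ..., 80]` and eight samples of 10 ms each. Streaming three tokens per chunk gives arrivals `[20, 50, 80]` and only two samples of 30 ms each, even though the true per-token latency is unchanged.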

Rather than tokenize each chunk's delta inline (which would add work to the very timing path being measured, perturbing the numbers), this detects batching cheaply from data already collected: count streamed content chunks and compare to the server-reported content token total.

- New `streamed_content_chunks` counter; mean tokens-per-chunk surfaced as `latency.itl_tokens_per_chunk` in the JSON.
- Console prints a note when it exceeds 1.5: "server streamed ~N tokens per chunk, so ITL is chunk-granular… and overstates true per-token latency."
- Detection only fires when the server reports usage; without it the ratio is ~1 and no note appears.
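The detection described above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `safe_div` here is a stand-in for the project's guard (its real signature is not shown in this PR), the function and constant names are hypothetical, and the note text is abbreviated. How the tool handles a missing usage report is not modeled.

```rust
/// Assumed stand-in for the project's `safe_div` guard: returns 0.0 rather
/// than dividing by zero.
fn safe_div(numerator: f64, denominator: f64) -> f64 {
    if denominator == 0.0 {
        0.0
    } else {
        numerator / denominator
    }
}

/// Threshold from the PR description: the note appears above 1.5.
const BATCHING_NOTE_THRESHOLD: f64 = 1.5;

/// Mean tokens per streamed chunk: server-reported content tokens divided by
/// the number of streamed content chunks.
fn itl_tokens_per_chunk(content_tokens: u64, streamed_content_chunks: u64) -> f64 {
    safe_div(content_tokens as f64, streamed_content_chunks as f64)
}

/// Returns a console note only when the ratio indicates batching.
fn batching_note(tokens_per_chunk: f64) -> Option<String> {
    if tokens_per_chunk > BATCHING_NOTE_THRESHOLD {
        Some(format!(
            "server streamed ~{:.1} tokens per chunk, so ITL is chunk-granular",
            tokens_per_chunk
        ))
    } else {
        None
    }
}
```

For example, 300 reported content tokens over 100 chunks gives a ratio of 3.0 and triggers the note, while 100 tokens over 100 chunks gives 1.0 and stays silent. Both inputs are counters collected outside the timing path, so the check adds no work between chunk arrivals.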

This keeps ITL correct-and-unlabeled for the common 1-token-per-chunk case and makes the one failure mode (batching) visible, without touching the measurement path.

Test plan

- `cargo test`: 99 lib + 16 integration tests pass, 0 failures.
- `cargo clippy --all-targets`: clean
- `cargo fmt --check`: clean
Small, low-risk reporting addition (no change to the latency measurement math); the ratio uses the existing safe_div guard, and the chunk counter accumulates per non-warmup content stream in both the single-turn and conversation paths.

Generated with Claude Code

ITL is recorded per streamed chunk (one inter-token-latency sample = the gap between consecutive SSE content chunks). When a server emits one token per chunk (the common case), that equals true per-token ITL. When it batches multiple tokens into one chunk (under load, speculative decoding, chunked streaming), each ITL sample is one inter-chunk gap covering several tokens, which is biased high and under-sampled.

Rather than tokenize each chunk delta inline (which would add work to the very timing path being measured), detect batching cheaply from data already collected: count streamed content chunks and compare to the server-reported content token total. The mean tokens-per-chunk is reported as `latency.itl_tokens_per_chunk`, and the console prints a note when it exceeds 1.5 so users know ITL is chunk-granular for that run. Detection only fires when the server reports usage; without it the ratio is ~1 and no note is shown.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>

brayniac merged commit 49ccd99 into iopsystems:main Jun 13, 2026
7 checks passed

brayniac merged 1 commit into iopsystems:main from brayniac:feat/itl-batching-note
