feat(tokenizer): real per-model tokenizer for prompt sizing by brayniac · Pull Request #149 · iopsystems/llm-perf

brayniac · 2026-06-13T05:47:19Z

Summary

Design-limitation #3: the "proper" fix beyond #142's honesty pass. #142 labeled tiktoken an estimate and used server usage for reporting, but synthetic prompt generation still sized prompts with tiktoken — so a non-OpenAI server received ~30% off the requested token count. This adds a real tokenizer, resolved once at startup via a cascade:

endpoint.tokenizer = a local tokenizer.json path (if the file exists) or a HuggingFace model id (downloaded/cached via the existing hf-hub dep) → exact in-process tokenization via the tokenizers crate.
else, if the server exposes /tokenize → calibrate tiktoken: tokenize a few fixed samples via /tokenize, compute ratio = real/tiktoken, and size prompts as round(tiktoken × ratio) — a confident estimate with no per-prompt network calls (iterative generation can't afford /tokenize round-trips per prompt).
else → the prior rough tiktoken estimate.

The Tokenizer type is now an abstraction over these backends; the synthetic generator and truncation call count_tokens/truncate_to_tokens polymorphically. build_tokenizer never silently degrades to the worst option: a failed configured tokenizer falls through to calibration, and only a genuine tiktoken construction error propagates (returns Result). The report names the tokenizer actually used.

Notes for reviewers

New dependency: tokenizers (default-features off; onig for the byte-level BPE regex needed by Llama/Qwen tokenizers). Necessary for exact in-process tokenization.
A loaded tokenizer.json is smoke-tested (encode) at load, so a broken file fails fast and the cascade falls back rather than returning 0 tokens and mis-sizing every prompt.
JSON tokenizer field value changed: it now carries a label like exact (…) / calibrated estimate (server /tokenize, ratio 1.32) / tiktoken cl100k_base (estimate) instead of just cl100k_base. Worth knowing for any consumer that pattern-matched the old value.
The logprobs subcommand keeps the plain tiktoken tokenizer (separate path; it uses dataset prompts, not synthetic generation).

Test plan

cargo test — 99 lib + 16 integration tests pass, 0 failures. New: calibration_ratio_is_real_over_tiktoken, calibrated_scales_tiktoken_count.
cargo clippy --all-targets — clean
cargo fmt --check — clean
Independent review. It found four real issues, all fixed: .expect() on the final fallback could panic instead of falling back (→ build_tokenizer returns Result, propagates); a failed configured tokenizer skipped calibration (→ restructured cascade); /tokenize URL was wrong for base URLs ending in /v1/ (→ fixed stripping); and an Exact encode error would silently mis-size prompts (→ smoke-test at load). The async cascade / HF download / /tokenize calibration are build + reasoning-verified (exercised end-to-end only against a live server).

Generated with Claude Code

Beyond the honesty pass in iopsystems#142 (tiktoken labeled an estimate, server usage used for reporting), synthetic prompt *generation* still sized prompts with tiktoken, so a non-OpenAI server received ~30% off the requested token count. This adds a real tokenizer, resolved once at startup via a cascade: 1. endpoint.tokenizer = a local tokenizer.json path (if the file exists) OR a HuggingFace model id (downloaded/cached via the existing hf-hub dep) -> exact in-process tokenization via the `tokenizers` crate. 2. else, if the server exposes /tokenize -> calibrate tiktoken: tokenize a few fixed samples via /tokenize, compute ratio = real/tiktoken, and size prompts as round(tiktoken * ratio) -- a confident estimate with no per-prompt network. 3. else -> the prior rough tiktoken estimate. The Tokenizer type is now an abstraction over these backends; the synthetic generator and truncation call count_tokens/truncate_to_tokens polymorphically. build_tokenizer never silently degrades to the worst option: a failed configured tokenizer falls through to calibration, and only a genuine tiktoken construction error propagates. The report names the tokenizer actually used (exact / calibrated / tiktoken estimate); the JSON `tokenizer` field now carries that label instead of just the tiktoken vocab name. Adds the `tokenizers` crate (default-features off; `onig` for the byte-level BPE regex). A loaded tokenizer.json is smoke-tested (encode) at load so a broken file fails fast and the cascade falls back rather than mis-sizing every prompt. New unit tests: calibration_ratio, calibrated scaling/labeling. The async cascade, HF download, and /tokenize calibration are build + reasoning-verified (exercised end-to-end only against a live server). The logprobs subcommand keeps the plain tiktoken tokenizer (separate path, uses dataset prompts). Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>

brayniac merged commit 40d14f0 into iopsystems:main Jun 13, 2026
7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(tokenizer): real per-model tokenizer for prompt sizing#149

feat(tokenizer): real per-model tokenizer for prompt sizing#149
brayniac merged 1 commit into
iopsystems:mainfrom
brayniac:feat/real-tokenizer

brayniac commented Jun 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

brayniac commented Jun 13, 2026

Summary

Notes for reviewers

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant