feat(tokenizer): real per-model tokenizer for prompt sizing #149

Merged: brayniac merged 1 commit into iopsystems:main from brayniac:feat/real-tokenizer on Jun 13, 2026

@brayniac

Contributor

Summary

Design-limitation #3: the "proper" fix beyond #142's honesty pass. #142 labeled tiktoken an estimate and used server usage for reporting, but synthetic prompt generation still sized prompts with tiktoken, so a non-OpenAI server received prompts that were ~30% off the requested token count. This adds a real tokenizer, resolved once at startup via a cascade:

  1. endpoint.tokenizer = a local tokenizer.json path (if the file exists) or a HuggingFace model id (downloaded/cached via the existing hf-hub dep) → exact in-process tokenization via the tokenizers crate.
  2. else, if the server exposes /tokenize → calibrate tiktoken: tokenize a few fixed samples via /tokenize, compute ratio = real/tiktoken, and size prompts as round(tiktoken × ratio). This gives a confident estimate with no per-prompt network calls (iterative generation can't afford /tokenize round-trips per prompt).
  3. else → the prior rough tiktoken estimate.
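
For reviewers who want the arithmetic of step 2 spelled out, here is a minimal sketch, not the code in this PR. The function names and the choice to sum counts across all samples before dividing are assumptions; the PR only states ratio = real/tiktoken and round(tiktoken × ratio).

```rust
/// Ratio of server-reported token counts to tiktoken counts.
/// Assumption: counts are summed over all calibration samples before
/// dividing; the PR does not say how samples are aggregated.
fn calibration_ratio(real: &[usize], tiktoken: &[usize]) -> Option<f64> {
    let real_total: usize = real.iter().sum();
    let tiktoken_total: usize = tiktoken.iter().sum();
    // Mismatched sample lists or an empty baseline cannot be calibrated.
    if real.len() != tiktoken.len() || tiktoken_total == 0 {
        return None;
    }
    Some(real_total as f64 / tiktoken_total as f64)
}

/// Scale a local tiktoken count by the calibrated ratio: round(tiktoken × ratio).
fn calibrated_count(tiktoken_count: usize, ratio: f64) -> usize {
    (tiktoken_count as f64 * ratio).round() as usize
}
```

With two samples where the server reports 132 and 264 tokens against tiktoken's 100 and 200, the ratio comes out near 1.32, and a prompt that tiktoken counts as 100 tokens is treated as 132.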

The Tokenizer type is now an abstraction over these backends; the synthetic generator and truncation call count_tokens/truncate_to_tokens polymorphically. build_tokenizer never silently degrades to the worst option: a failed configured tokenizer falls through to calibration, and only a genuine tiktoken construction error propagates (returns Result). The report names the tokenizer actually used.
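
The shape described above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: the `CountFn` alias, the variant fields, the `build_tokenizer` parameters, and the `String` error type are all hypothetical, and real backends (the tokenizers crate, tiktoken) are replaced by boxed closures so the sketch stays self-contained.

```rust
/// Hypothetical stand-in for a backend's token counter.
type CountFn = Box<dyn Fn(&str) -> usize>;

/// One variant per backend in the cascade.
enum Tokenizer {
    Exact { source: String, count: CountFn },
    Calibrated { ratio: f64, tiktoken: CountFn },
    Tiktoken { tiktoken: CountFn },
}

impl Tokenizer {
    /// Callers use this without knowing which backend is active.
    fn count_tokens(&self, text: &str) -> usize {
        match self {
            Tokenizer::Exact { count, .. } => count(text),
            Tokenizer::Calibrated { ratio, tiktoken } => {
                (tiktoken(text) as f64 * *ratio).round() as usize
            }
            Tokenizer::Tiktoken { tiktoken } => tiktoken(text),
        }
    }

    /// Label for the report; wording follows the examples in the PR notes.
    fn label(&self) -> String {
        match self {
            Tokenizer::Exact { source, .. } => format!("exact ({})", source),
            Tokenizer::Calibrated { ratio, .. } => {
                format!("calibrated estimate (server /tokenize, ratio {:.2})", ratio)
            }
            Tokenizer::Tiktoken { .. } => "tiktoken cl100k_base (estimate)".to_string(),
        }
    }
}

/// Cascade: exact if it loaded, else calibrated if a ratio is available,
/// else plain tiktoken. Only a tiktoken construction error propagates.
fn build_tokenizer(
    exact: Option<Result<(String, CountFn), String>>,
    server_ratio: Option<f64>,
    tiktoken: Result<CountFn, String>,
) -> Result<Tokenizer, String> {
    // A failed configured tokenizer is not fatal; fall through to calibration.
    if let Some(Ok((source, count))) = exact {
        return Ok(Tokenizer::Exact { source, count });
    }
    let tiktoken = tiktoken?;
    Ok(match server_ratio {
        Some(ratio) => Tokenizer::Calibrated { ratio, tiktoken },
        None => Tokenizer::Tiktoken { tiktoken },
    })
}
```

In this sketch a configured tokenizer that fails to load, combined with an available ratio, yields the `Calibrated` variant rather than the rough estimate, which mirrors the "never silently degrades to the worst option" behavior.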

Notes for reviewers

  • New dependency: tokenizers (default-features off; onig for the byte-level BPE regex needed by Llama/Qwen tokenizers). Necessary for exact in-process tokenization.
  • A loaded tokenizer.json is smoke-tested (encode) at load, so a broken file fails fast and the cascade falls back rather than returning 0 tokens and mis-sizing every prompt.
  • JSON tokenizer field value changed: it now carries a label like exact (…) / calibrated estimate (server /tokenize, ratio 1.32) / tiktoken cl100k_base (estimate) instead of just cl100k_base. Worth knowing for any consumer that pattern-matched the old value.
  • The logprobs subcommand keeps the plain tiktoken tokenizer (separate path; it uses dataset prompts, not synthetic generation).
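
For consumers affected by the JSON field change, a prefix-based classifier along these lines may be enough. It is a sketch: `TokenizerKind` and `classify_tokenizer_label` are hypothetical names, and matching on the prefixes `exact`, `calibrated estimate`, and `tiktoken` assumes the labels keep the shapes listed above.

```rust
#[derive(Debug, PartialEq)]
enum TokenizerKind {
    Exact,
    Calibrated,
    TiktokenEstimate,
    Unknown,
}

/// Classify the report's `tokenizer` field by its leading words.
/// The bare "cl100k_base" case covers reports written before this change.
fn classify_tokenizer_label(label: &str) -> TokenizerKind {
    let label = label.trim();
    if label.starts_with("exact") {
        TokenizerKind::Exact
    } else if label.starts_with("calibrated estimate") {
        TokenizerKind::Calibrated
    } else if label.starts_with("tiktoken") || label == "cl100k_base" {
        TokenizerKind::TiktokenEstimate
    } else {
        TokenizerKind::Unknown
    }
}
```

Matching on a prefix instead of the full string keeps the consumer working when the parenthesized detail (such as the ratio) changes between runs.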

Test plan

  • cargo test — 99 lib + 16 integration tests pass, 0 failures. New: calibration_ratio_is_real_over_tiktoken, calibrated_scales_tiktoken_count.
  • cargo clippy --all-targets — clean
  • cargo fmt --check — clean
  • Independent review. It found four real issues, all fixed: .expect() on the final fallback could panic instead of falling back (→ build_tokenizer returns Result, propagates); a failed configured tokenizer skipped calibration (→ restructured cascade); /tokenize URL was wrong for base URLs ending in /v1/ (→ fixed stripping); and an Exact encode error would silently mis-size prompts (→ smoke-test at load). The async cascade / HF download / /tokenize calibration are build + reasoning-verified (exercised end-to-end only against a live server).
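
The /v1/ stripping issue from the review can be illustrated with a small sketch. The function name is hypothetical, and it assumes /tokenize is served at the server root next to the /v1 API prefix; the actual fix in this PR may differ in detail.

```rust
/// Derive the /tokenize URL from a configured base URL.
/// Assumption: /tokenize lives at the server root, not under /v1.
fn tokenize_url(base_url: &str) -> String {
    // Drop trailing slashes first so "/v1/" and "/v1" are handled alike.
    let trimmed = base_url.trim_end_matches('/');
    // Remove a trailing "/v1" segment if present.
    let root = trimmed.strip_suffix("/v1").unwrap_or(trimmed);
    format!("{}/tokenize", root)
}
```

Trimming slashes before stripping the suffix is the step that matters here: stripping "/v1" from a URL that still ends in "/v1/" would leave the prefix in place.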

Generated with Claude Code

Beyond the honesty pass in iopsystems#142 (tiktoken labeled an estimate, server usage used
for reporting), synthetic prompt *generation* still sized prompts with tiktoken,
so a non-OpenAI server received ~30% off the requested token count. This adds a
real tokenizer, resolved once at startup via a cascade:

1. endpoint.tokenizer = a local tokenizer.json path (if the file exists) OR a
   HuggingFace model id (downloaded/cached via the existing hf-hub dep) -> exact
   in-process tokenization via the `tokenizers` crate.
2. else, if the server exposes /tokenize -> calibrate tiktoken: tokenize a few
   fixed samples via /tokenize, compute ratio = real/tiktoken, and size prompts
   as round(tiktoken * ratio) -- a confident estimate with no per-prompt network
   calls.
3. else -> the prior rough tiktoken estimate.

The Tokenizer type is now an abstraction over these backends; the synthetic
generator and truncation call count_tokens/truncate_to_tokens polymorphically.
build_tokenizer never silently degrades to the worst option: a failed configured
tokenizer falls through to calibration, and only a genuine tiktoken construction
error propagates. The report names the tokenizer actually used (exact /
calibrated / tiktoken estimate); the JSON `tokenizer` field now carries that
label instead of just the tiktoken vocab name.

Adds the `tokenizers` crate (default-features off; `onig` for the byte-level BPE
regex). A loaded tokenizer.json is smoke-tested (encode) at load so a broken file
fails fast and the cascade falls back rather than mis-sizing every prompt.

New unit tests: calibration_ratio, calibrated scaling/labeling. The async
cascade, HF download, and /tokenize calibration are build + reasoning-verified
(exercised end-to-end only against a live server). The logprobs subcommand keeps
the plain tiktoken tokenizer (separate path, uses dataset prompts).

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
@brayniac brayniac merged commit 40d14f0 into iopsystems:main Jun 13, 2026
7 checks passed