feat(tokenizer): real per-model tokenizer for prompt sizing #149
Merged
Beyond the honesty pass in iopsystems#142 (tiktoken labeled an estimate, server usage used for reporting), synthetic prompt *generation* still sized prompts with tiktoken, so a non-OpenAI server received prompts ~30% off the requested token count.

This adds a real tokenizer, resolved once at startup via a cascade:

1. endpoint.tokenizer = a local tokenizer.json path (if the file exists) OR a HuggingFace model id (downloaded/cached via the existing hf-hub dep) -> exact in-process tokenization via the `tokenizers` crate.
2. else, if the server exposes /tokenize -> calibrate tiktoken: tokenize a few fixed samples via /tokenize, compute ratio = real/tiktoken, and size prompts as round(tiktoken * ratio) -- a confident estimate with no per-prompt network calls.
3. else -> the prior rough tiktoken estimate.

The Tokenizer type is now an abstraction over these backends; the synthetic generator and truncation call count_tokens/truncate_to_tokens polymorphically. build_tokenizer never silently degrades to the worst option: a failed configured tokenizer falls through to calibration, and only a genuine tiktoken construction error propagates. The report names the tokenizer actually used (exact / calibrated / tiktoken estimate); the JSON `tokenizer` field now carries that label instead of just the tiktoken vocab name.

Adds the `tokenizers` crate (default-features off; `onig` for the byte-level BPE regex). A loaded tokenizer.json is smoke-tested (encode) at load so a broken file fails fast and the cascade falls back rather than mis-sizing every prompt.

New unit tests: calibration_ratio, calibrated scaling/labeling. The async cascade, HF download, and /tokenize calibration are build + reasoning-verified (exercised end-to-end only against a live server). The logprobs subcommand keeps the plain tiktoken tokenizer (separate path, uses dataset prompts).

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
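The calibration arithmetic in step 2 can be sketched as follows. This is a minimal illustration, not the PR's code: the function names and signatures are hypothetical, and pooling the samples by summing counts before dividing is an assumption (the description only says ratio = real/tiktoken). The label format follows the example quoted in the PR description.

```rust
/// Ratio of real (server `/tokenize`) token counts to tiktoken counts.
/// Each sample is a hypothetical `(real_count, tiktoken_count)` pair.
/// Assumption: samples are pooled by summing before dividing.
/// Returns `None` when tiktoken counted zero tokens overall.
pub fn calibration_ratio(samples: &[(usize, usize)]) -> Option<f64> {
    let (real, tiktoken) = samples
        .iter()
        .fold((0usize, 0usize), |(r, t), &(sr, st)| (r + sr, t + st));
    if tiktoken == 0 {
        None
    } else {
        Some(real as f64 / tiktoken as f64)
    }
}

/// Calibrated estimate of the real token count: round(tiktoken * ratio).
pub fn calibrated_count(tiktoken_count: usize, ratio: f64) -> usize {
    (tiktoken_count as f64 * ratio).round() as usize
}

/// Report label in the shape quoted in the PR description.
pub fn calibrated_label(ratio: f64) -> String {
    format!("calibrated estimate (server /tokenize, ratio {:.2})", ratio)
}
```

Because the ratio is computed once from a few fixed samples, every later prompt is sized with local arithmetic only, which is what keeps `/tokenize` round-trips out of the generation loop.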
Summary
Design-limitation #3: the "proper" fix beyond #142's honesty pass. #142 labeled tiktoken an estimate and used server `usage` for reporting, but synthetic prompt generation still sized prompts with tiktoken, so a non-OpenAI server received prompts ~30% off the requested token count. This adds a real tokenizer, resolved once at startup via a cascade:

1. `endpoint.tokenizer` = a local `tokenizer.json` path (if the file exists) or a HuggingFace model id (downloaded/cached via the existing `hf-hub` dep) → exact in-process tokenization via the `tokenizers` crate.
2. Else, if the server exposes `/tokenize` → calibrate tiktoken: tokenize a few fixed samples via `/tokenize`, compute `ratio = real/tiktoken`, and size prompts as `round(tiktoken × ratio)`. This is a confident estimate with no per-prompt network calls (iterative generation can't afford `/tokenize` round-trips per prompt).
3. Else → the prior rough tiktoken estimate.

The `Tokenizer` type is now an abstraction over these backends; the synthetic generator and truncation call `count_tokens` / `truncate_to_tokens` polymorphically. `build_tokenizer` never silently degrades to the worst option: a failed configured tokenizer falls through to calibration, and only a genuine tiktoken construction error propagates (returns `Result`). The report names the tokenizer actually used.

Notes for reviewers

- Adds the `tokenizers` crate (default-features off; `onig` for the byte-level BPE regex needed by Llama/Qwen tokenizers). Necessary for exact in-process tokenization.
- A loaded `tokenizer.json` is smoke-tested (encode) at load, so a broken file fails fast and the cascade falls back rather than returning 0 tokens and mis-sizing every prompt.
- The JSON `tokenizer` field value changed: it now carries a label like `exact (…)` / `calibrated estimate (server /tokenize, ratio 1.32)` / `tiktoken cl100k_base (estimate)` instead of just `cl100k_base`. Worth knowing for any consumer that pattern-matched the old value.
- The `logprobs` subcommand keeps the plain tiktoken tokenizer (separate path; it uses dataset prompts, not synthetic generation).

Test plan

- `cargo test`: 99 lib + 16 integration tests pass, 0 failures. New: `calibration_ratio_is_real_over_tiktoken`, `calibrated_scales_tiktoken_count`.
- `cargo clippy --all-targets`: clean
- `cargo fmt --check`: clean
- Fixed (problem → fix): `.expect()` on the final fallback could panic instead of falling back (→ `build_tokenizer` returns `Result`, propagates); a failed configured tokenizer skipped calibration (→ restructured cascade); the `/tokenize` URL was wrong for base URLs ending in `/v1/` (→ fixed stripping); and an Exact `encode` error would silently mis-size prompts (→ smoke-test at load).
- The async cascade / HF download / `/tokenize` calibration are build + reasoning-verified (exercised end-to-end only against a live server).

Generated with Claude Code
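The fall-through behavior described above (a failed configured tokenizer still reaches calibration, and only a tiktoken construction error propagates) might look roughly like the sketch below. Everything here is hypothetical: the `Backend` and `TiktokenError` types, the closure parameters standing in for the real loaders, and the choice to construct tiktoken before calibrating are assumptions made for illustration, not the PR's actual `build_tokenizer` signature.

```rust
/// Which backend ended up sizing prompts (illustrative names).
#[derive(Debug, Clone, PartialEq)]
pub enum Backend {
    /// Configured `tokenizer.json` path or HuggingFace model id.
    Exact(String),
    /// tiktoken scaled by a ratio measured against the server.
    Calibrated { ratio: f64 },
    /// Plain tiktoken, reported as an estimate.
    TiktokenEstimate,
}

/// Stand-in for a tiktoken construction failure.
#[derive(Debug, PartialEq)]
pub struct TiktokenError(pub String);

/// Resolve the backend once, trying the tiers in order.
/// `load_exact`, `calibrate`, and `build_tiktoken` are stand-ins for the
/// real loaders so the control flow can be shown without I/O.
pub fn build_tokenizer<E, C, T>(
    configured: Option<&str>,
    load_exact: E,
    calibrate: C,
    build_tiktoken: T,
) -> Result<Backend, TiktokenError>
where
    E: Fn(&str) -> Result<(), String>,
    C: Fn() -> Option<f64>,
    T: Fn() -> Result<(), TiktokenError>,
{
    // Tier 1: a configured tokenizer. A load failure is not fatal;
    // it falls through to calibration instead of skipping it.
    if let Some(source) = configured {
        if load_exact(source).is_ok() {
            return Ok(Backend::Exact(source.to_string()));
        }
    }

    // Assumption: both remaining tiers need tiktoken, so its construction
    // error is the one failure that propagates (no `.expect()` panic).
    build_tiktoken()?;

    // Tier 2: calibrate against the server if it answered `/tokenize`.
    if let Some(ratio) = calibrate() {
        return Ok(Backend::Calibrated { ratio });
    }

    // Tier 3: the rough tiktoken estimate.
    Ok(Backend::TiktokenEstimate)
}
```

Returning `Result` from the resolver keeps the decision about a hard tiktoken failure with the caller, rather than panicking inside the cascade.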
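The `/tokenize` URL fix for base URLs ending in `/v1/` can be illustrated with a small helper. This is a sketch under stated assumptions: the function name `tokenize_url` is hypothetical, and it assumes the server exposes `/tokenize` at its root next to an OpenAI-style `/v1` prefix, which the PR description does not spell out.

```rust
/// Derive the `/tokenize` URL from an OpenAI-style base URL.
/// Assumption: `/tokenize` lives at the server root, so a trailing
/// `/v1` (with or without a trailing slash) is stripped first.
pub fn tokenize_url(base_url: &str) -> String {
    // Drop trailing slashes first so "/v1/" and "/v1" are treated alike.
    let trimmed = base_url.trim_end_matches('/');
    // Remove the "/v1" suffix if present; otherwise keep the URL as-is.
    let root = trimmed.strip_suffix("/v1").unwrap_or(trimmed);
    format!("{}/tokenize", root)
}
```

Trimming slashes before stripping the suffix is the step that a naive `strip_suffix("/v1")` misses when the configured base URL ends in `/v1/`.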