feat(parsers): cross-impl parser-parity harness (Method 2) #9186
keivenchang wants to merge 6 commits into `main`
Conversation
Walkthrough
This PR introduces Python bindings for tool call parsing in the Rust layer, exposes them via type stubs, and establishes a comprehensive parity testing framework across three parser implementations (Dynamo, vLLM, SGLang). It includes a fixture generator script, test fixtures for seven parser families, and a parametrized test harness that validates parsing consistency.

Changes: Tool Call Parsing Bindings & Parity Testing
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Actionable comments posted: 3
🧹 Nitpick comments (8)
tests/parser_parity/fixtures/_generate.py (1)
12-14: ⚡ Quick win — Describe these fixtures as canonicalized output, not verbatim binding output.

`_run_one()` JSON-decodes `function.arguments` and normalizes missing `normal_text` to `""`, so the generator is not preserving the raw `parse_tool_call()` payload. Rewording this avoids people treating the fixtures as a wire-contract snapshot.

Proposed wording
-Re-running overwrites existing fixtures. Edit INPUTS to add or change
-templates; the generator captures Dynamo's output verbatim, so
-fixtures always reflect the current Rust contract.
+Re-running overwrites existing fixtures. Edit INPUTS to add or change
+templates; the generator canonicalizes Dynamo's parsed output into the
+shared parity format used by the test harness.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/parser_parity/fixtures/_generate.py` around lines 12 - 14, Update the top comment that describes generated fixtures to say these are canonicalized/normalized outputs rather than verbatim binding/wire snapshots; specifically mention that _run_one() JSON-decodes function.arguments and normalizes missing normal_text to "" (so the generator captures Dynamo's processed output, not the raw parse_tool_call payload), and instruct contributors to edit INPUTS to change templates rather than expecting byte-for-byte binding output.

tests/parser_parity/impls/parser/vllm.py (1)
76-94: ⚡ Quick win — Parser lookup and construction are outside the try/except.

`ToolParserManager.get_tool_parser(key)` (line 80) and `parser_cls(tokenizer=_StubTokenizer())` (line 85) can both raise — the comment on lines 82–84 even acknowledges that vLLM's `ToolParser.__init__` actively raises on a falsy tokenizer, and a future vLLM release that tightens the vocab/special-token checks could throw here too. Today those failures escape as raw exceptions instead of being captured into `ParseResult.error`, so the harness's "skip on impl error" path (`test_parity`, lines 190–191) is bypassed.
# noqa: BLE001plus a brief justification.♻️ Proposed scoping
- parser_cls = ToolParserManager.get_tool_parser(key) - - # vLLM's ToolParser constructor checks `if not self.model_tokenizer:` and raises - # if falsy. None of the parsers we test actually call tokenizer methods inside - # extract_tool_calls(), so a truthy stub satisfies the check. - parser: ToolParser = parser_cls(tokenizer=_StubTokenizer()) - - # vLLM's extract_tool_calls signature: (model_output, request) → ExtractedToolCallInformation. - # We construct a minimal request shape with only the tools field, since most parsers ignore the rest. - request = _build_chat_request(tools) - try: + parser_cls = ToolParserManager.get_tool_parser(key) + # vLLM's ToolParser constructor checks `if not self.model_tokenizer:` and raises + # if falsy. None of the parsers we test actually call tokenizer methods inside + # extract_tool_calls(), so a truthy stub satisfies the check. + parser: ToolParser = parser_cls(tokenizer=_StubTokenizer()) + # vLLM's extract_tool_calls signature: (model_output, request) → ExtractedToolCallInformation. + # We construct a minimal request shape with only the tools field, since most parsers ignore the rest. + request = _build_chat_request(tools) info = parser.extract_tool_calls(raw_text, request) - except Exception as e: + except Exception as e: # noqa: BLE001 — vLLM parsers raise heterogeneous types; surface as ParseResult.error return ParseResult(error=f"{type(e).__name__}: {e}")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/parser_parity/impls/parser/vllm.py` around lines 76 - 94, Move the ToolParser lookup and construction into the try block so exceptions from ToolParserManager.get_tool_parser(key) and parser_cls(tokenizer=_StubTokenizer()) are captured into ParseResult.error; specifically wrap the calls to ToolParserManager.get_tool_parser, the instantiation of parser (ToolParser(...)), and the subsequent parser.extract_tool_calls(...) inside the same try, and on the existing broad except that returns ParseResult(error=...) add a comment "# noqa: BLE001 — intentionally catching all exceptions to convert implementation failures into ParseResult.error for the test harness" to justify the broad catch.

tests/parser_parity/impls/common.py (1)
54-56: 💤 Low value — Function name `diff` is counterintuitive: it returns True when results are equivalent.

Conventionally a `diff` function returns the difference (truthy when results differ). Here it returns `True` for equivalence, which is the opposite. Consider renaming to `equivalent`/`equals` to avoid future readers misreading the semantics.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/parser_parity/impls/common.py` around lines 54 - 56, The function diff returns True when two ParseResult values are equivalent, which is counterintuitive; rename the function (e.g., equivalent or equals) and update all usages to match the new name. Change def diff(a: ParseResult, b: ParseResult) -> bool: to def equivalent(a: ParseResult, b: ParseResult) -> bool: (or equals), keep the implementation canonical(a.to_dict()) == canonical(b.to_dict()), and search/replace all callers of diff(...) to use the new name so tests and imports still resolve correctly.

tests/parser_parity/test_parity_parser.py (2)
30-38: 💤 Low value — Use logging instead of relying on the `-s` flag.

A `logging.warning(...)` (or `warnings.warn(...)`) keeps the message visible under standard pytest output without requiring `-s`, and integrates with pytest's log capture. Minor, but improves observability of the optional-dep skip path.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/parser_parity/test_parity_parser.py` around lines 30 - 38, Replace the stdout print in _maybe_import with a logging call: import logging (or use an existing module logger via logger = logging.getLogger(__name__)) and change the except block to log a warning including the impl_name and the exception (e.g., logger.warning("skipping impl %r: %s", impl_name, e)). This ensures skipped-impl notices are emitted via Python logging and properly captured by pytest.
188-204: ⚡ Quick win — Skipping on `got.error` can silently mask real regressions.

If a wrapper starts returning errors for an entire family (e.g., a vLLM/SGLang version bump tightens parser `__init__` and every case begins failing in `parser_cls(...)` construction), the harness will skip every test in that family rather than failing — the regression appears as "many skips" in CI, which is easy to miss.

Two reasonable improvements:
- Distinguish "impl unavailable" (env-shaped, e.g., import or registration miss) from "impl raised on input" (wrapper-shaped). Only the former should `pytest.skip`; the latter should `pytest.fail` or be asserted against `expected.error`.
- Or, track per-family error counts at session end and emit a warning/fail when the error rate crosses a threshold, so a wholesale break is visible.
♻️ Sketch: classify env vs runtime errors
The wrappers already emit error strings prefixed with the exception class name. You could reserve a sentinel prefix (e.g., `ParseResult(error="UNAVAILABLE: ...")`) for the cases that are genuinely environmental — `KeyError` from `ToolParserManager.get_tool_parser` for an unregistered family, a missing detector class — and treat any other `error` as a hard failure here:

     if got.error:
-        pytest.skip(f"{impl_name} unavailable for {family}: {got.error}")
+        if got.error.startswith("UNAVAILABLE:"):
+            pytest.skip(f"{impl_name} unavailable for {family}: {got.error}")
+        pytest.fail(f"{impl_name} crashed on {family}/{case_id}: {got.error}")

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/parser_parity/test_parity_parser.py` around lines 188 - 204, Change the current blanket skip on got.error to distinguish environment/unavailable errors from runtime/input errors: when impl_fn(family, ...) returns got.error, first check for a sentinel/unavailable prefix (e.g., "UNAVAILABLE:" or specific env exception names emitted by parser constructors) and only call pytest.skip for those; otherwise if fixture["expected"] contains an "error" field assert equality against got.error or call pytest.fail with a clear message including impl_name, family, case_id and got.error; update the wrapper that produces ParseResult(error=...) if necessary to emit the sentinel ("UNAVAILABLE:") for truly unavailable implementations so parser_cls/impl_fn and this test can classify correctly.

tests/parser_parity/impls/parser/sglang.py (2)
58-61: ⚡ Quick win — Annotate the broad `except Exception` to satisfy guidelines and the linter.

The catch-and-stash-in-`ParseResult.error` pattern here is intentional — third-party detectors raise heterogeneous exception types and the harness chooses how to react (skip vs assert) based on `got.error`. However, ruff `BLE001` is firing, and `.ai/python-guidelines.md` directs reviewers to prefer specific catches. Add a brief justification comment plus a targeted `# noqa: BLE001` so the intent is documented and the linter is silenced for the right reason.

♻️ Proposed annotation
     try:
         info = detector.detect_and_parse(raw_text, sg_tools)
-    except Exception as e:
+    except Exception as e:  # noqa: BLE001 — third-party detectors raise heterogeneous types; surface as ParseResult.error for the harness
         return ParseResult(error=f"{type(e).__name__}: {e}")

As per coding guidelines: "Flag `# noqa` suppressions that hide anti-patterns documented in the guidelines; only accept `# noqa` for genuine false positives with explanation."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/parser_parity/impls/parser/sglang.py` around lines 58 - 61, The broad except in the call to detector.detect_and_parse should be annotated and justified: add a brief inline comment explaining that third‑party detectors raise heterogeneous, intentional exception types and that the harness records them in ParseResult(error) for downstream handling, then append a targeted "# noqa: BLE001" to the except line to silence ruff; reference the detector.detect_and_parse call and the ParseResult(error=...) usage so the reviewer can see this is an intentional catch-and-stash pattern.
89-105: ⚡ Quick win — Tool dict shape assumption could KeyError on minimal definitions.

`_build_tools` assumes either `t["function"]["name"]` or `t["name"]` is present. Fixtures consistently provide flat `{"name": ..., "parameters": ...}`, so this works today, but a malformed tool entry would surface as a `KeyError` rather than the standardized `ParseResult.error`. Note that `_build_tools(tools)` is called on line 56, before the try block in `parse()` (lines 58–61), so a `KeyError` here would not be caught.
_build_toolscall inside the try, or make the field access defensive.♻️ One option: move `_build_tools` inside the try
- sg_tools = _build_tools(tools) - try: + sg_tools = _build_tools(tools) info = detector.detect_and_parse(raw_text, sg_tools) except Exception as e: # noqa: BLE001 return ParseResult(error=f"{type(e).__name__}: {e}")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/parser_parity/impls/parser/sglang.py` around lines 89 - 105, The call to _build_tools(...) is happening outside parse()'s try block and can raise KeyError on malformed tool dicts; move the invocation of _build_tools(tools) into the try block inside parse() (so exceptions become ParseResult.error) or alternatively make _build_tools defensive by using t.get("function"), inner.get("name") / inner.get("parameters") and validating presence of name before wrapping (raise a clear exception if missing) — update references in parse() and keep the wrapper function name _build_tools and the produced _Tool usage intact.

tests/parser_parity/impls/parser/dynamo.py (1)
37-47: ⚡ Quick win — `json.loads(result_json)` on line 47 is outside the try/except.

If `parse_tool_call` ever returns a string that fails to parse as JSON (e.g., a future change to the Rust binding's return contract), the resulting `JSONDecodeError` propagates out of `parse()` and crashes the test, rather than being captured into `ParseResult.error` like every other failure mode. Move the decode (and the call-walk after it) inside the try, or wrap them in a second narrowly-scoped try.

Also, please annotate the broad catch on line 44 with `# noqa: BLE001` and a brief reason comment to keep ruff and the python-guidelines aligned ("Favor specific exception handling over broad catches").

♻️ Proposed adjustment
     try:
         # The PyO3 binding returns a future that registers with a running
         # event loop, so we must call it from inside an async context.
         async def _run() -> str:
             return await parse_tool_call(parser_name, raw_text, tools_json)

         result_json: str = asyncio.run(_run())
-    except Exception as e:
+        raw = json.loads(result_json)
+    except Exception as e:  # noqa: BLE001 — PyO3 binding raises heterogeneous types; surface as ParseResult.error
         return ParseResult(error=f"{type(e).__name__}: {e}")

-    raw = json.loads(result_json)
     calls = []

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/parser_parity/impls/parser/dynamo.py` around lines 37 - 47, Move the JSON decoding and any subsequent call-walk into the try block (or add a narrow second try) so JSONDecodeError from json.loads(result_json) is captured and returned as ParseResult(error=...), i.e., after awaiting parse_tool_call in the async _run and assigning result_json, call json.loads(result_json) inside the same try and handle failures by returning ParseResult(error=...). Also annotate the broad except Exception as e with "# noqa: BLE001 - broad catch to map any parser/runtime failures into ParseResult.error" so ruff is satisfied; reference symbols to update: parse_tool_call, _run, result_json, json.loads and ParseResult.
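A self-contained sketch of the pattern these dynamo.py comments describe (the async binding call plus the JSON decode inside one error boundary). The `parse_tool_call` coroutine below is a hypothetical stub standing in for the real `dynamo._core` binding, which returns a JSON-serialized `{calls, normal_text}` payload:

```python
import asyncio
import json


async def parse_tool_call(parser_name: str, message: str, tools_json: str) -> str:
    # Hypothetical stub: the real PyO3 binding must be awaited inside a
    # running event loop and yields a JSON string {"calls": ..., "normal_text": ...}.
    return json.dumps({"calls": [], "normal_text": message})


def drive(parser_name: str, raw_text: str, tools_json: str = "[]") -> dict:
    # Decode inside the same error boundary, so a malformed payload
    # surfaces as an error value rather than escaping the harness.
    try:
        async def _run() -> str:
            return await parse_tool_call(parser_name, raw_text, tools_json)

        raw = json.loads(asyncio.run(_run()))
    except Exception as e:  # noqa: BLE001: surface every failure mode uniformly
        return {"error": f"{type(e).__name__}: {e}"}
    return raw
```

With the decode inside the try, a bad return contract shows up as an error value the harness can classify, instead of an unhandled `JSONDecodeError`.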
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/parser_parity/fixtures/_generate.py`:
- Around line 452-453: The fixture write currently calls (family_dir /
f"PARSER.{mode}.json").write_text(json.dumps(out, indent=2, ensure_ascii=False)
+ "\n") without an explicit encoding; change the write to pass encoding='utf-8'
so non-ASCII tokens are preserved regardless of process locale — i.e., update
the write_text call in tests/parser_parity/fixtures/_generate.py (the line
constructing (family_dir / f"PARSER.{mode}.json").write_text(...)) to include
the encoding argument.
In `@tests/parser_parity/fixtures/qwen3_coder/PARSER.batch.json`:
- Around line 99-140: Two test entries ("4" and "5") use identical model_text
and expected values, collapsing two scenarios; update the fixture so case "4"
represents a distinct malformed input (for example remove the inner </parameter>
tag or another different tag error) or delete the entire "4" entry; after
changing the "4" object’s model_text (and adjust expected.calls/normal_text if
needed) re-run _generate.py with Dynamo as reference to capture the updated
expected output and update KNOWN_DIVERGENCES entries for qwen3_coder
PARSER.batch.4/.5 accordingly.
In `@tests/parser_parity/test_parity_parser.py`:
- Around line 4-12: Update the module docstring in
tests/parser_parity/test_parity_parser.py: remove or replace the Linear ticket
reference "DIS-1906" with a GitHub issue identifier (e.g., "#NNNN" or omit it)
and correct the example pytest command path from "tests/parser-parity/" to
"tests/parser_parity/" so the runnable example matches the actual directory
name; edit the top-level triple-quoted docstring accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 86e2948c-d05f-469f-a138-4ce6852f0120
📒 Files selected for processing (18)
- lib/bindings/python/rust/parsers.rs
- lib/bindings/python/src/dynamo/_core.pyi
- tests/parser_parity/__init__.py
- tests/parser_parity/fixtures/_generate.py
- tests/parser_parity/fixtures/deepseek_v3_1/PARSER.batch.json
- tests/parser_parity/fixtures/glm47/PARSER.batch.json
- tests/parser_parity/fixtures/harmony/PARSER.batch.json
- tests/parser_parity/fixtures/kimi_k2/PARSER.batch.json
- tests/parser_parity/fixtures/minimax_m2/PARSER.batch.json
- tests/parser_parity/fixtures/nemotron_deci/PARSER.batch.json
- tests/parser_parity/fixtures/qwen3_coder/PARSER.batch.json
- tests/parser_parity/impls/__init__.py
- tests/parser_parity/impls/common.py
- tests/parser_parity/impls/parser/__init__.py
- tests/parser_parity/impls/parser/dynamo.py
- tests/parser_parity/impls/parser/sglang.py
- tests/parser_parity/impls/parser/vllm.py
- tests/parser_parity/test_parity_parser.py
Force-pushed: 822c701 → 7add6d1
Force-pushed: 7add6d1 → 8396ad6
Force-pushed: 8396ad6 → 6bd9471
Method 2 of the cross-impl parser-parity effort: drive Dynamo,
vLLM, and SGLang parsers from the same fixtures and diff their
outputs to surface cross-impl behavioral divergences.
Sibling PR ships Method 3 (e2e variant — drives a real `vllm serve`
HTTP process on the same fixtures). Independent of this PR; whichever
lands second needs a small rebase on the shared scaffolding.
What lands here
---------------
- New PyO3 binding: `dynamo._core.parse_tool_call(parser_name,
message, tools_json)` returns a JSON-serialized
`{calls, normal_text}` payload.
- `tests/parity/parser/` harness with three impl wrappers:
- `dynamo.py` via PyO3 binding
- `vllm.py` via `ToolParserManager`, with an `_OmnivorousVocab`
stub tokenizer so qwen3_coder / glm47 / minimax_m2 / deepseekv31
parsers construct without a real tokenizer
- `sglang.py` via per-module detectors, with duck-typed tool
objects via `types.SimpleNamespace` since detectors access
`tool.function.name` via attribute, not dict subscript
- 70 cases in 7 fixture files (`fixtures/<family>/PARSER.batch.json`,
schema `{family, mode, cases: {1: {...}, ..., 10: {...}}}`) covering
PARSER.batch.{1..10} for kimi_k2, qwen3_coder, glm47, deepseek_v3_1,
harmony, minimax_m2, nemotron_deci. Auto-generated by
`regenerate_fixtures.py` (Dynamo as reference oracle).
- `regenerate_fixtures.py`: non-destructive by default. Pass
`--overwrite-if-exists` to refresh after an intentional Dynamo
parser-behavior change. Cases on disk that aren't in INPUTS are
always preserved (so editing your INPUTS section can't accidentally
delete other contributors' cases). Run as a module
(`python3 -m tests.parity.parser.regenerate_fixtures`) to avoid
the local `dynamo.py` shadowing the real `dynamo` package.
- `tests/parity/README.md`: top-level overview covering the layout,
M2/M3 method tradeoffs, fixture schema, why per-family JSONs look
similar by design, the regenerator flags, and a checklist for
adding new parser families.
- `tests/parity/common.py`: shared `ParseResult`, canonical-JSON
diff with `""`/`None` normalization on `normal_text`, and a
`decode_arguments()` helper (parses JSON-encoded arguments,
falls back to raw bytes on failure).
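Those `tests/parity/common.py` helpers can be sketched roughly as follows (a hypothetical reconstruction from the description above, not the file's exact code):

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ParseResult:
    # Hypothetical shape mirroring the shared ParseResult described above
    calls: list = field(default_factory=list)
    normal_text: Optional[str] = None
    error: Optional[str] = None

    def to_dict(self) -> dict:
        # ""/None normalization on normal_text, so impls that omit the
        # field compare equal to impls that return an empty string
        return {
            "calls": self.calls,
            "normal_text": self.normal_text or "",
            "error": self.error,
        }


def canonical(d: dict) -> str:
    # Key-sorted, compact JSON: a stable string form for equality checks
    return json.dumps(d, sort_keys=True, separators=(",", ":"))


def equivalent(a: ParseResult, b: ParseResult) -> bool:
    # True when the two results agree after canonicalization
    return canonical(a.to_dict()) == canonical(b.to_dict())


def decode_arguments(raw: str):
    # Parse JSON-encoded tool-call arguments; fall back to the raw
    # value when the payload is not valid JSON
    try:
        return json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return raw
```

Canonical JSON means dict key order and whitespace never produce false cross-impl mismatches; the fallback in `decode_arguments` lets malformed-argument cases still diff as plain strings.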
CI gating
---------
Module-level markers: `unit, pre_merge, gpu_0`. Per-parametrize
impl markers (`vllm`/`sglang`) so CI marker filters
(`pre_merge and vllm and gpu_0`, `pre_merge and sglang and gpu_0`)
pick up only the right subset in each container. `dynamo` params
get no impl marker — they only run when the harness is explicitly
selected, since Dynamo doesn't have a per-container CI job here.
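The marker wiring above might be sketched like this (the helper name and shape are hypothetical; only the marker semantics come from the description above):

```python
import pytest

# Per-impl marks: vllm/sglang params get a framework mark so CI filters
# like "pre_merge and vllm and gpu_0" select only the matching subset.
# dynamo params get no impl mark and run only on explicit selection.
IMPL_MARKS = {
    "vllm": [pytest.mark.vllm],
    "sglang": [pytest.mark.sglang],
    "dynamo": [],
}


def impl_params(impls):
    # Build parametrize entries carrying the per-impl marks
    return [pytest.param(name, marks=IMPL_MARKS[name], id=name) for name in impls]
```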
Findings (xfail, not silenced)
------------------------------
KNOWN_DIVERGENCES registry records 22 cross-impl findings:
- Trailing-text drop after wrapper end (vLLM and SGLang on three
XML-style families: kimi_k2, qwen3_coder, glm47).
- SGLang GptOssDetector requires the strict
<|start|>assistant<|channel|>commentary envelope; rejects bare
commentary/analysis variants Dynamo accepts.
- SGLang drops the analysis-channel preamble on harmony PARSER.batch.7.
- Dynamo drops back-to-back commentary blocks on harmony
PARSER.batch.{2,10}; SGLang extracts both — Dynamo bug class.
- Per-impl whitespace handling around wrapper boundaries.
- PARSER.batch.4 (malformed JSON) and PARSER.batch.5 (missing
end-token) recovery contracts are impl-defined per
PARSER_CASES.md; divergences documented rather than
asserted-against.
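The registry shape might look like the following sketch (case numbers and reasons here are illustrative placeholders; the real registry records the 22 findings above):

```python
# Hypothetical shape: (family, case_id, impl) -> short reason string.
# The harness looks a parametrization up here and applies xfail with the
# reason, so divergences stay visible in reports instead of being silenced.
KNOWN_DIVERGENCES = {
    ("kimi_k2", "PARSER.batch.3", "vllm"): "trailing text dropped after wrapper end",
    ("harmony", "PARSER.batch.7", "sglang"): "analysis-channel preamble dropped",
}


def divergence_reason(family: str, case_id: str, impl: str):
    # None means the combination is expected to agree
    return KNOWN_DIVERGENCES.get((family, case_id, impl))
```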
Coverage
--------
111 pass / 90 cross-container-skipped / 9 xfailed in vLLM container;
34 pass / 160 skipped / 16 xfailed in SGLang container.
Signed-off-by: Keiven Chang <[email protected]>
Add a "Eventual goal" section to tests/parity/README.md describing the
intended end state: one shared set of JSON fixtures consumed by both a
Python harness (this PR + M3) and a future thin Rust harness, replacing
the ~70 black-box Rust unit tests that currently duplicate inputs
already encoded as M2 fixtures.

Captures:
- The current overlap (M2 fixtures were extracted from Rust tests by
  hand; ~70 cases now live in both languages).
- What stays Rust-only (white-box helper tests, tokenizer / panic cases).
- A rough PR-stack sketch for the migration (Rust harness, then
  mechanical Rust-test deletion, then expanded coverage for format /
  regression cases). Streaming follows separately.

Subsequent PRs (not in this one's scope) would land the Rust harness
and prune the duplicates. M2's real value-add — the cross-impl half
(vLLM + SGLang) — is unaffected. No code changes.

Signed-off-by: Keiven Chang <[email protected]>
Replace the "Two methods" table with a fuller section covering all three methods (M1 fallback-path, M2 parser-class, M3 e2e HTTP). Each gets a short block describing what runs, what's substituted, and what bug class it surfaces, plus a comparison table mapping each method to which slice of the request lifecycle it actually exercises. M1 is described as future / not in this PR. The harness here implements M2 only; M3 lands in the sibling PR. M1 belongs to a separate harness because it tests Dynamo's wrapping of upstream parsers, not parity between impls. No code changes. Signed-off-by: Keiven Chang <[email protected]>
Update README so the framing relative to this PR is clearer. M2 is this PR's harness; M1 (Dynamo-frontend fallback-path test) and M3 (e2e HTTP test, sibling PR #9189) are presented as upcoming, next steps rather than vague "future" / "(sibling PR ...)" mentions. No code changes. Signed-off-by: Keiven Chang <[email protected]>
Force-pushed: 6bd9471 → 55eab8e
Convert the 7 PARSER.batch fixture files from JSON to YAML so the
multi-line `model_text` field reads as the actual wire format
instead of a `\n`-escaped one-liner. Same data, more readable
during PR review:
- XML-style families (qwen3_coder, kimi_k2, glm47, harmony) get
proper line breaks via YAML literal block scalars (`|-`).
- DeepSeek embedded-JSON args lose the escaped-quote noise
(`{\"location\":\"NYC\"}` -> `{"location":"NYC"}`).
- Special tokens (`|` U+FF5C, `▁` U+2581) still round-trip
literally via `allow_unicode=True`.
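Under this scheme a case might render like the following (illustrative content only; the actual wire format for each family lives in the generated fixtures):

```yaml
family: qwen3_coder
mode: batch
cases:
  1:
    model_text: |-
      <tool_call>
      <function=get_weather>
      <parameter=location>
      NYC
      </parameter>
      </function>
      </tool_call>
    expected:
      normal_text: ""
      calls:
        - name: get_weather
          arguments:
            location: NYC
```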
Mechanical changes:
- `regenerate_fixtures.py`: dump via PyYAML with a custom string
presenter that picks block-scalar style for multi-line strings.
Test pass: 111 passed / 90 skipped / 9 xfailed (unchanged from
the JSON baseline). `--overwrite-if-exists` round-trip is
byte-stable.
- `test_parity_parser.py`: load via `yaml.safe_load`; glob shifts
from `*.json` to `*.yaml`.
- README: schema example reframed in YAML; mentions JSON only
where it refers to wire-format JSON (tool-call args), not the
fixture format.
- Fix a pre-existing bug in `regenerate_fixtures.py` where
`FIXTURES_ROOT = Path(__file__).parent` wrote outputs one level
above the canonical `fixtures/` tree. Now writes back to
`fixtures/`.
PyYAML is already an ambient runtime dep; no new top-level
requirement. Eventual Rust harness can use `serde_yaml` (already
present transitively via `kube-client`).
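The custom string presenter that commit describes can be sketched as follows (assuming PyYAML; the real `regenerate_fixtures.py` implementation may differ):

```python
import yaml


def _str_presenter(dumper: yaml.Dumper, data: str):
    # Literal block scalar (|) for multi-line strings so model_text
    # renders as the actual wire format; default style otherwise.
    style = "|" if "\n" in data else None
    return dumper.represent_scalar("tag:yaml.org,2002:str", data, style=style)


yaml.add_representer(str, _str_presenter)

doc = yaml.dump(
    {"model_text": "<tool_call>\nNYC\n</tool_call>"},
    allow_unicode=True,  # keep special tokens like ｜ and ▁ literal
    sort_keys=False,
)
```

`yaml.safe_load` round-trips the block scalar back to the original string, which is what makes the `--overwrite-if-exists` regeneration byte-stable.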
Signed-off-by: Keiven Chang <[email protected]>
Switch on-disk YAML case keys from bare ints (`1`, `2`, ...) to full
PARSER_CASES.md IDs (`PARSER.batch.1`, `PARSER.batch.2`, ...). One
grep now finds a case across docs, fixtures, KNOWN_DIVERGENCES,
pytest IDs, and Rust source comments.
- `regenerate_fixtures.py`: writes full IDs at write time, strips
prefix at read time (internal bookkeeping stays keyed by case
number).
- `test_parity_parser.py`: drops the `f"PARSER.{mode}.{n}"`
reconstruction since the on-disk key is the case ID directly.
- 7 fixture files: rekeyed in place.
- README: schema example updated; case-keys paragraph rewritten
to point at PARSER_CASES.md as the canonical ID source.
Test pass: 111 passed / 90 skipped / 9 xfailed (unchanged).
Signed-off-by: Keiven Chang <[email protected]>
Overview:
Method 2 of DIS-1906: drive Dynamo, vLLM, and SGLang parsers from the same fixtures and diff their outputs to catch cross-impl divergences.
Sibling PR #9189 ships Method 3 (same fixtures, real `vllm serve` / `sglang.launch_server` over HTTP).

Size: +2579 / -2 lines. Roughly ~1050 YAML fixtures (auto-generated) / ~1200 code / ~340 docs.
Details:
- `dynamo._core.parse_tool_call` so Dynamo's Rust parser can be driven from Python alongside vLLM and SGLang
- `tests/parity/parser/` with thin per-impl wrappers; lazy imports skip cleanly in single-framework containers
- `PARSER.batch.{1..10}` × 7 families (kimi_k2, qwen3_coder, glm47, deepseek_v3_1, harmony, minimax_m2, nemotron_deci); regenerated via `regenerate_fixtures.py` with Dynamo as the oracle (`--overwrite-if-exists` to refresh)
- Cross-impl divergences recorded as `xfail` (not silenced) in a `KNOWN_DIVERGENCES` registry

Future direction:
Eventual goal is YAML fixtures as the single source of truth — same files consumed by Python (this PR + M3) and a future Rust harness, replacing ~70 black-box Rust unit tests that currently duplicate inputs already encoded here. See the "Eventual goal" section of `tests/parity/README.md`.

Where should the reviewer start?
`tests/parity/README.md`, then `tests/parity/parser/test_parity_parser.py`

Related Issues:
DIS-1906
/coderabbit profile chill