
feat(parsers): cross-impl parser-parity harness (Method 2)#9186

Draft
keivenchang wants to merge 6 commits into main from
keivenchang/DIS-1906__method-2-cross-impl-parser-parity-harness

Conversation


@keivenchang keivenchang commented May 5, 2026

Overview:

Method 2 of DIS-1906: drive Dynamo, vLLM, and SGLang parsers from the same fixtures and diff their outputs to catch cross-impl divergences.

Sibling PR #9189 ships Method 3 (same fixtures, real vllm serve / sglang.launch_server over HTTP).

Size: +2579 / -2 lines: roughly 1050 lines of auto-generated YAML fixtures, 1200 of code, and 340 of docs.

Details:

  • New PyO3 binding dynamo._core.parse_tool_call so Dynamo's Rust parser can be driven from Python alongside vLLM and SGLang
  • New harness at tests/parity/parser/ with thin per-impl wrappers; lazy imports skip cleanly in single-framework containers
  • 70 YAML fixtures covering PARSER.batch.{1..10} × 7 families (kimi_k2, qwen3_coder, glm47, deepseek_v3_1, harmony, minimax_m2, nemotron_deci); regenerated via regenerate_fixtures.py with Dynamo as the oracle (--overwrite-if-exists to refresh)
  • 22 cross-impl divergences recorded as xfail (not silenced) in a KNOWN_DIVERGENCES registry
  • Test results: 111 passed / 90 skipped / 9 xfailed in the vLLM container; 34 passed / 160 skipped / 16 xfailed in SGLang
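The core comparison behind Method 2 can be condensed into a few lines. This is a hypothetical sketch, not the harness code itself: `check_parity`, `canonical`, and the `impls` mapping are illustrative names; the real harness lives under tests/parity/parser/ and uses pytest parametrization.

```python
# Hypothetical sketch of the Method 2 parity loop: run every implementation
# on the same fixture case and diff their canonicalized outputs.
import json


def canonical(obj) -> str:
    # Stable JSON form so key ordering cannot cause false diffs.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))


def check_parity(case, impls):
    """case: one fixture entry; impls: name -> parse(family, text, tools)."""
    results = {
        name: parse(case["family"], case["model_text"], case["tools"])
        for name, parse in impls.items()
    }
    baseline = canonical(results["dynamo"])  # Dynamo is the reference oracle
    return {name: canonical(out) == baseline for name, out in results.items()}
```

Any implementation whose canonicalized output differs from Dynamo's shows up as a divergence, which is then either fixed or recorded as a known divergence.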

Future direction:

Eventual goal is YAML fixtures as the single source of truth — same files consumed by Python (this PR + M3) and a future Rust harness, replacing ~70 black-box Rust unit tests that currently duplicate inputs already encoded here. See tests/parity/README.md "Eventual goal" section.

Where should the reviewer start?

tests/parity/README.md, then tests/parity/parser/test_parity_parser.py

Related Issues:

DIS-1906

/coderabbit profile chill


coderabbitai Bot commented May 5, 2026

Walkthrough

This PR introduces Python bindings for tool call parsing in the Rust layer, exposes them via type stubs, and establishes a comprehensive parity testing framework across three parser implementations (Dynamo, vLLM, SGLang). It includes a fixture generator script, test fixtures for seven parser families, and a parametrized test harness that validates parsing consistency.

Changes

Tool Call Parsing Bindings & Parity Testing

  • Rust Bindings & Python API (lib/bindings/python/rust/parsers.rs, lib/bindings/python/src/dynamo/_core.pyi): new parse_tool_call() async function in the Rust bindings that accepts a parser name, message, and optional tools JSON and returns JSON with parsed calls and normal text. The helper parse_tools_json() converts flat or OpenAI-wrapped tool schemas to a ToolDefinition vec. The type stub declares the same signature with a full docstring.

  • Parity Test Infrastructure (tests/parser_parity/impls/common.py): shared contract layer with a ParseResult dataclass (calls, normal_text, error), canonical() for stable JSON comparison, and a diff() utility for assertions.

  • Parser Implementation Wrappers (tests/parser_parity/impls/parser/dynamo.py, tests/parser_parity/impls/parser/sglang.py, tests/parser_parity/impls/parser/vllm.py): three parallel wrapper modules exposing a uniform parse(parser_family, raw_text, tools) -> ParseResult interface. Dynamo wraps the new Rust binding via asyncio; SGLang and vLLM wrap their respective detector/parser APIs with tool-schema adaptation and error handling.

  • Test Fixture Generator (tests/parser_parity/fixtures/_generate.py): async fixture generator that enumerates tool-parsing scenarios across parser families (kimi_k2, qwen3_coder, glm47, deepseek_v3_1, harmony, minimax_m2, nemotron_deci), groups test cases by (family, mode), invokes parse_tool_call via the Dynamo binding, and writes per-family fixture files in PARSER.{mode}.json format.

  • Test Fixtures (tests/parser_parity/fixtures/*/PARSER.batch.json): seven generated batch-mode fixture files (deepseek_v3_1, glm47, harmony, kimi_k2, minimax_m2, nemotron_deci, qwen3_coder), each containing 10 test cases with model_text, tool definitions, and expected parsed calls and normal_text. Cases cover single/multiple calls, empty input, malformed JSON, missing markers, empty arguments, complex nested arguments, interleaved text, and duplicate calls.

  • Parity Test Suite (tests/parser_parity/test_parity_parser.py): parametrized test harness that loads all fixtures, lazily imports available implementations (dynamo, vllm, sglang), and runs test_parity() across all (family, case_id, fixture, impl) combinations. It skips missing implementations, xfails known divergences, invokes the parsers, and validates canonical JSON equivalence. Includes a documented divergence registry and recovery contracts.
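A minimal sketch of the shared contract layer in common.py, under the assumption that ParseResult is a plain dataclass and that diff() compares canonical JSON; the asdict-based serialization is a guess at the implementation detail.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class ParseResult:
    calls: list = field(default_factory=list)
    normal_text: str = ""
    error: Optional[str] = None


def canonical(obj: Any) -> str:
    # Deterministic serialization: sorted keys, fixed separators.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))


def diff(a: ParseResult, b: ParseResult) -> bool:
    # Returns True when the two results are equivalent.
    return canonical(asdict(a)) == canonical(asdict(b))
```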

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Title check: Passed. The title clearly summarizes the main change (a cross-implementation parser-parity harness) and directly matches the primary objective of implementing Method 2 of DIS-1906.
  • Docstring Coverage: Passed. Docstring coverage is 100.00%, which is above the required threshold of 80.00%.
  • Linked Issues check: Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: Passed. Check skipped because no linked issues were found for this pull request.
  • Description check: Passed. The pull request description is complete and well-structured, following the template with all required sections filled in: Overview, Details, Where to start, and Related Issues.



@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (8)
tests/parser_parity/fixtures/_generate.py (1)

12-14: ⚡ Quick win

Describe these fixtures as canonicalized output, not verbatim binding output.

_run_one() JSON-decodes function.arguments and normalizes missing normal_text to "", so the generator is not preserving the raw parse_tool_call() payload. Rewording this avoids people treating the fixtures as a wire-contract snapshot.

Proposed wording
-Re-running overwrites existing fixtures. Edit INPUTS to add or change
-templates; the generator captures Dynamo's output verbatim, so
-fixtures always reflect the current Rust contract.
+Re-running overwrites existing fixtures. Edit INPUTS to add or change
+templates; the generator canonicalizes Dynamo's parsed output into the
+shared parity format used by the test harness.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/parser_parity/fixtures/_generate.py` around lines 12 - 14, Update the
top comment that describes generated fixtures to say these are
canonicalized/normalized outputs rather than verbatim binding/wire snapshots;
specifically mention that _run_one() JSON-decodes function.arguments and
normalizes missing normal_text to "" (so the generator captures Dynamo’s
processed output, not the raw parse_tool_call payload), and instruct
contributors to edit INPUTS to change templates rather than expecting
byte-for-byte binding output.
tests/parser_parity/impls/parser/vllm.py (1)

76-94: ⚡ Quick win

Parser lookup and construction are outside the try/except.

ToolParserManager.get_tool_parser(key) (L80) and parser_cls(tokenizer=_StubTokenizer()) (L85) can both raise — the comment on L82–84 even acknowledges that vLLM's ToolParser.__init__ actively raises on a falsy tokenizer, and a future vLLM release that tightens the vocab/special-token checks could throw here too. Today those failures escape as raw exceptions instead of being captured into ParseResult.error, so the harness's "skip on impl error" path (test_parity L190–191) is bypassed.

Move them into the try, and please annotate the broad catch on L93 with # noqa: BLE001 plus a brief justification.

♻️ Proposed scoping
-    parser_cls = ToolParserManager.get_tool_parser(key)
-
-    # vLLM's ToolParser constructor checks `if not self.model_tokenizer:` and raises
-    # if falsy. None of the parsers we test actually call tokenizer methods inside
-    # extract_tool_calls(), so a truthy stub satisfies the check.
-    parser: ToolParser = parser_cls(tokenizer=_StubTokenizer())
-
-    # vLLM's extract_tool_calls signature: (model_output, request) → ExtractedToolCallInformation.
-    # We construct a minimal request shape with only the tools field, since most parsers ignore the rest.
-    request = _build_chat_request(tools)
-
     try:
+        parser_cls = ToolParserManager.get_tool_parser(key)
+        # vLLM's ToolParser constructor checks `if not self.model_tokenizer:` and raises
+        # if falsy. None of the parsers we test actually call tokenizer methods inside
+        # extract_tool_calls(), so a truthy stub satisfies the check.
+        parser: ToolParser = parser_cls(tokenizer=_StubTokenizer())
+        # vLLM's extract_tool_calls signature: (model_output, request) → ExtractedToolCallInformation.
+        # We construct a minimal request shape with only the tools field, since most parsers ignore the rest.
+        request = _build_chat_request(tools)
         info = parser.extract_tool_calls(raw_text, request)
-    except Exception as e:
+    except Exception as e:  # noqa: BLE001 — vLLM parsers raise heterogeneous types; surface as ParseResult.error
         return ParseResult(error=f"{type(e).__name__}: {e}")
🤖 Prompt for AI Agents

In `@tests/parser_parity/impls/parser/vllm.py` around lines 76 - 94, Move the
ToolParser lookup and construction into the try block so exceptions from
ToolParserManager.get_tool_parser(key) and
parser_cls(tokenizer=_StubTokenizer()) are captured into ParseResult.error;
specifically wrap the calls to ToolParserManager.get_tool_parser, the
instantiation of parser (ToolParser(...)), and the subsequent
parser.extract_tool_calls(...) inside the same try, and on the existing broad
except that returns ParseResult(error=...) add a comment "# noqa: BLE001 —
intentionally catching all exceptions to convert implementation failures into
ParseResult.error for the test harness" to justify the broad catch.
tests/parser_parity/impls/common.py (1)

54-56: 💤 Low value

Function name diff is counterintuitive — returns True when results are equivalent.

Conventionally a diff function returns the difference (truthy when results differ). Here it returns True for equivalence, which is the opposite. Consider renaming to equivalent/equals to avoid future readers misreading the semantics.

🤖 Prompt for AI Agents

In `@tests/parser_parity/impls/common.py` around lines 54 - 56, The function diff
returns True when two ParseResult values are equivalent, which is
counterintuitive; rename the function (e.g., equivalent or equals) and update
all usages to match the new name. Change def diff(a: ParseResult, b:
ParseResult) -> bool: to def equivalent(a: ParseResult, b: ParseResult) -> bool:
(or equals), keep the implementation canonical(a.to_dict()) ==
canonical(b.to_dict()), and search/replace all callers of diff(...) to use the
new name so tests and imports still resolve correctly.
tests/parser_parity/test_parity_parser.py (2)

30-38: 💤 Low value

Use logging instead of print for skipped-impl notices.

print writes to stdout at import time and is captured/shown inconsistently depending on pytest's -s flag. A logging.warning(...) (or warnings.warn(...)) keeps the message visible under standard pytest output without requiring -s, and integrates with pytest's log capture. Minor, but improves observability of the optional-dep skip path.

🤖 Prompt for AI Agents

In `@tests/parser_parity/test_parity_parser.py` around lines 30 - 38, Replace the
stdout print in _maybe_import with a logging call: import logging (or use an
existing module logger via logger = logging.getLogger(__name__)) and change the
except block to log a warning including the impl_name and the exception (e.g.,
logger.warning("skipping impl %r: %s", impl_name, e)). This ensures skipped-impl
notices are emitted via Python logging and properly captured by pytest.

188-204: ⚡ Quick win

Skipping on got.error can silently mask real regressions.

If a wrapper starts returning errors for an entire family (e.g., a vLLM/SGLang version bump tightens parser __init__ and every case begins failing in parser_cls(...) construction), the harness will skip every test in that family rather than failing — the regression appears as "many skips" in CI, which is easy to miss.

Two reasonable improvements:

  • Distinguish "impl unavailable" (env-shaped, e.g., import or registration miss) from "impl raised on input" (wrapper-shaped). Only the former should pytest.skip; the latter should pytest.fail or be asserted against expected.error.
  • Or, track per-family error counts at session-end and emit a warning/fail when error rate crosses a threshold, so a wholesale break is visible.
♻️ Sketch: classify env vs runtime errors

The wrappers already emit error strings prefixed with the exception class name. You could reserve a sentinel prefix (e.g., ParseResult(error="UNAVAILABLE: ...")) for the cases that are genuinely environmental — KeyError from ToolParserManager.get_tool_parser for an unregistered family, missing detector class — and treat any other error as a hard failure here:

     if got.error:
-        pytest.skip(f"{impl_name} unavailable for {family}: {got.error}")
+        if got.error.startswith("UNAVAILABLE:"):
+            pytest.skip(f"{impl_name} unavailable for {family}: {got.error}")
+        pytest.fail(f"{impl_name} crashed on {family}/{case_id}: {got.error}")
🤖 Prompt for AI Agents

In `@tests/parser_parity/test_parity_parser.py` around lines 188 - 204, Change the
current blanket skip on got.error to distinguish environment/unavailable errors
from runtime/input errors: when impl_fn(family, ...) returns got.error, first
check for a sentinel/unavailable prefix (e.g., "UNAVAILABLE:" or specific env
exception names emitted by parser constructors) and only call pytest.skip for
those; otherwise if fixture["expected"] contains an "error" field assert
equality against got.error or call pytest.fail with a clear message including
impl_name, family, case_id and got.error; update the wrapper that produces
ParseResult(error=...) if necessary to emit the sentinel ("UNAVAILABLE:") for
truly unavailable implementations so parser_cls/impl_fn and this test can
classify correctly.
tests/parser_parity/impls/parser/sglang.py (2)

58-61: ⚡ Quick win

Annotate the broad except Exception to satisfy guidelines and the linter.

The catch-and-stash-in-ParseResult.error pattern here is intentional — third-party detectors raise heterogeneous exception types and the harness chooses how to react (skip vs assert) based on got.error. However, ruff BLE001 is firing and .ai/python-guidelines.md directs reviewers to prefer specific catches. Add a brief justification comment plus a targeted # noqa: BLE001 so the intent is documented and the linter is silenced for the right reason.

♻️ Proposed annotation
     try:
         info = detector.detect_and_parse(raw_text, sg_tools)
-    except Exception as e:
+    except Exception as e:  # noqa: BLE001 — third-party detectors raise heterogeneous types; surface as ParseResult.error for the harness
         return ParseResult(error=f"{type(e).__name__}: {e}")

As per coding guidelines: "Flag # noqa suppressions that hide anti-patterns documented in the guidelines; only accept # noqa for genuine false positives with explanation."

🤖 Prompt for AI Agents

In `@tests/parser_parity/impls/parser/sglang.py` around lines 58 - 61, The broad
except in the call to detector.detect_and_parse should be annotated and
justified: add a brief inline comment explaining that third‑party detectors
raise heterogeneous, intentional exception types and that the harness records
them in ParseResult(error) for downstream handling, then append a targeted "#
noqa: BLE001" to the except line to silence ruff; reference the
detector.detect_and_parse call and the ParseResult(error=...) usage so the
reviewer can see this is an intentional catch-and-stash pattern.

89-105: ⚡ Quick win

Tool dict shape assumption could KeyError on minimal definitions.

_build_tools assumes either t["function"]["name"] or t["name"] is present. Fixtures consistently provide flat {"name": ..., "parameters": ...}, so this works today, but a malformed tool entry would surface as a KeyError rather than the standardized ParseResult.error. And because _build_tools(tools) is called on L56, before the try block in parse() (L58–61), that KeyError would not be caught.

Move the _build_tools call inside the try, or make the field access defensive.

♻️ One option: move `_build_tools` inside the try
-    sg_tools = _build_tools(tools)
-
     try:
+        sg_tools = _build_tools(tools)
         info = detector.detect_and_parse(raw_text, sg_tools)
     except Exception as e:  # noqa: BLE001
         return ParseResult(error=f"{type(e).__name__}: {e}")
🤖 Prompt for AI Agents

In `@tests/parser_parity/impls/parser/sglang.py` around lines 89 - 105, The call
to _build_tools(...) is happening outside parse()'s try block and can raise
KeyError on malformed tool dicts; move the invocation of _build_tools(tools)
into the try block inside parse() (so exceptions become ParseResult.error) or
alternatively make _build_tools defensive by using t.get("function"),
inner.get("name") / inner.get("parameters") and validating presence of name
before wrapping (raise a clear exception if missing) — update references in
parse() and keep the wrapper function name _build_tools and the produced _Tool
usage intact.
tests/parser_parity/impls/parser/dynamo.py (1)

37-47: ⚡ Quick win

json.loads(result_json) on L47 is outside the try/except.

If parse_tool_call ever returns a string that fails to parse as JSON (e.g., a future change to the Rust binding's return contract), the resulting JSONDecodeError propagates out of parse() and crashes the test, rather than being captured into ParseResult.error like every other failure mode. Move the decode (and the call-walk after it) inside the try, or wrap them in a second narrowly-scoped try.

Also, please annotate the broad catch on L44 with # noqa: BLE001 and a brief reason comment to keep ruff and the python-guidelines aligned ("Favor specific exception handling over broad catches").

♻️ Proposed adjustment
     try:
         # The PyO3 binding returns a future that registers with a running
         # event loop, so we must call it from inside an async context.
         async def _run() -> str:
             return await parse_tool_call(parser_name, raw_text, tools_json)

         result_json: str = asyncio.run(_run())
-    except Exception as e:
+        raw = json.loads(result_json)
+    except Exception as e:  # noqa: BLE001 — PyO3 binding raises heterogeneous types; surface as ParseResult.error
         return ParseResult(error=f"{type(e).__name__}: {e}")

-    raw = json.loads(result_json)
     calls = []
🤖 Prompt for AI Agents

In `@tests/parser_parity/impls/parser/dynamo.py` around lines 37 - 47, Move the
JSON decoding and any subsequent call-walk into the try block (or add a narrow
second try) so JSONDecodeError from json.loads(result_json) is captured and
returned as ParseResult(error=...), i.e., after awaiting parse_tool_call in the
async _run and assigning result_json, call json.loads(result_json) inside the
same try and handle failures by returning ParseResult(error=...). Also annotate
the broad except Exception as e with "# noqa: BLE001 - broad catch to map any
parser/runtime failures into ParseResult.error" so ruff is satisfied; reference
symbols to update: parse_tool_call, _run, result_json, json.loads and
ParseResult.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/parser_parity/fixtures/_generate.py`:
- Around line 452-453: The fixture write currently calls (family_dir /
f"PARSER.{mode}.json").write_text(json.dumps(out, indent=2, ensure_ascii=False)
+ "\n") without an explicit encoding; change the write to pass encoding='utf-8'
so non-ASCII tokens are preserved regardless of process locale — i.e., update
the write_text call in tests/parser_parity/fixtures/_generate.py (the line
constructing (family_dir / f"PARSER.{mode}.json").write_text(...)) to include
the encoding argument.

In `@tests/parser_parity/fixtures/qwen3_coder/PARSER.batch.json`:
- Around line 99-140: Two test entries ("4" and "5") use identical model_text
and expected values, collapsing two scenarios; update the fixture so case "4"
represents a distinct malformed input (for example remove the inner </parameter>
tag or another different tag error) or delete the entire "4" entry; after
changing the "4" object’s model_text (and adjust expected.calls/normal_text if
needed) re-run _generate.py with Dynamo as reference to capture the updated
expected output and update KNOWN_DIVERGENCES entries for qwen3_coder
PARSER.batch.4/.5 accordingly.

In `@tests/parser_parity/test_parity_parser.py`:
- Around line 4-12: Update the module docstring in
tests/parser_parity/test_parity_parser.py: remove or replace the Linear ticket
reference "DIS-1906" with a GitHub issue identifier (e.g., "#NNNN" or omit it)
and correct the example pytest command path from "tests/parser-parity/" to
"tests/parser_parity/" so the runnable example matches the actual directory
name; edit the top-level triple-quoted docstring accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 86e2948c-d05f-469f-a138-4ce6852f0120

📥 Commits

Reviewing files that changed from the base of the PR and between 78e876f and 904f21b.

📒 Files selected for processing (18)
  • lib/bindings/python/rust/parsers.rs
  • lib/bindings/python/src/dynamo/_core.pyi
  • tests/parser_parity/__init__.py
  • tests/parser_parity/fixtures/_generate.py
  • tests/parser_parity/fixtures/deepseek_v3_1/PARSER.batch.json
  • tests/parser_parity/fixtures/glm47/PARSER.batch.json
  • tests/parser_parity/fixtures/harmony/PARSER.batch.json
  • tests/parser_parity/fixtures/kimi_k2/PARSER.batch.json
  • tests/parser_parity/fixtures/minimax_m2/PARSER.batch.json
  • tests/parser_parity/fixtures/nemotron_deci/PARSER.batch.json
  • tests/parser_parity/fixtures/qwen3_coder/PARSER.batch.json
  • tests/parser_parity/impls/__init__.py
  • tests/parser_parity/impls/common.py
  • tests/parser_parity/impls/parser/__init__.py
  • tests/parser_parity/impls/parser/dynamo.py
  • tests/parser_parity/impls/parser/sglang.py
  • tests/parser_parity/impls/parser/vllm.py
  • tests/parser_parity/test_parity_parser.py

@keivenchang keivenchang marked this pull request as draft May 5, 2026 22:54
@keivenchang keivenchang force-pushed the keivenchang/DIS-1906__method-2-cross-impl-parser-parity-harness branch from 822c701 to 7add6d1 on May 5, 2026 23:32
@keivenchang keivenchang force-pushed the keivenchang/DIS-1906__method-2-cross-impl-parser-parity-harness branch from 7add6d1 to 8396ad6 on May 6, 2026 02:21
@keivenchang keivenchang force-pushed the keivenchang/DIS-1906__method-2-cross-impl-parser-parity-harness branch from 8396ad6 to 6bd9471 on May 6, 2026 03:10
Method 2 of the cross-impl parser-parity effort: drive Dynamo,
vLLM, and SGLang parsers from the same fixtures and diff their
outputs to surface cross-impl behavioral divergences.

Sibling PR ships Method 3 (e2e variant — drives a real `vllm serve`
HTTP process on the same fixtures). Independent of this PR; whichever
lands second needs a small rebase on the shared scaffolding.

What lands here
---------------
- New PyO3 binding: `dynamo._core.parse_tool_call(parser_name,
  message, tools_json)` returns a JSON-serialized
  `{calls, normal_text}` payload.
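Driving the binding from synchronous test code takes a small asyncio shim. The sketch below is illustrative: `parse_via_binding` is a name invented here, and only the binding's documented shape (async call taking parser name, message, and tools JSON, returning a JSON string of `{calls, normal_text}`) is taken from this PR.

```python
import asyncio
import json


def parse_via_binding(parse_tool_call, parser_name, raw_text, tools):
    """Drive an async parse_tool_call-style binding from sync code.

    The binding callable is passed in (e.g. dynamo._core.parse_tool_call)
    so this sketch stays importable without the Rust extension present.
    """
    tools_json = json.dumps(tools) if tools is not None else None

    async def _run() -> str:
        # The PyO3 future must be awaited from inside a running event loop.
        return await parse_tool_call(parser_name, raw_text, tools_json)

    return json.loads(asyncio.run(_run()))
```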

- `tests/parity/parser/` harness with three impl wrappers:
  - `dynamo.py` via PyO3 binding
  - `vllm.py` via `ToolParserManager`, with an `_OmnivorousVocab`
    stub tokenizer so qwen3_coder / glm47 / minimax_m2 / deepseekv31
    parsers construct without a real tokenizer
  - `sglang.py` via per-module detectors, with duck-typed tool
    objects via `types.SimpleNamespace` since detectors access
    `tool.function.name` via attribute, not dict subscript
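The SimpleNamespace trick for sglang.py can be sketched as below. The helper name `to_attr_tools` and the flat-or-wrapped input handling are assumptions; only the attribute-access requirement (`tool.function.name`) comes from the description above.

```python
from types import SimpleNamespace


def to_attr_tools(tools):
    """Wrap plain tool dicts so detectors can read tool.function.name."""
    wrapped = []
    for t in tools:
        fn = t.get("function", t)  # accept flat or OpenAI-wrapped shapes
        wrapped.append(
            SimpleNamespace(
                function=SimpleNamespace(
                    name=fn["name"],
                    parameters=fn.get("parameters", {}),
                )
            )
        )
    return wrapped
```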

- 70 cases in 7 fixture files (`fixtures/<family>/PARSER.batch.json`,
  schema `{family, mode, cases: {1: {...}, ..., 10: {...}}}`) covering
  PARSER.batch.{1..10} for kimi_k2, qwen3_coder, glm47, deepseek_v3_1,
  harmony, minimax_m2, nemotron_deci. Auto-generated by
  `regenerate_fixtures.py` (Dynamo as reference oracle).
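Reading one of these fixture files back can be sketched as follows; `load_cases` is a hypothetical helper, but the path layout and the `{family, mode, cases}` schema follow the description above.

```python
import json
from pathlib import Path


def load_cases(fixtures_dir, family, mode="batch"):
    """Load fixtures/<family>/PARSER.<mode>.json and return its cases."""
    path = Path(fixtures_dir) / family / f"PARSER.{mode}.json"
    doc = json.loads(path.read_text(encoding="utf-8"))
    assert doc["family"] == family and doc["mode"] == mode
    # JSON object keys are strings, so case ids arrive as "1".."10".
    return {int(cid): case for cid, case in doc["cases"].items()}
```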

- `regenerate_fixtures.py`: non-destructive by default. Pass
  `--overwrite-if-exists` to refresh after an intentional Dynamo
  parser-behavior change. Cases on disk that aren't in INPUTS are
  always preserved (so editing your INPUTS section can't accidentally
  delete other contributors' cases). Run as a module
  (`python3 -m tests.parity.parser.regenerate_fixtures`) to avoid
  the local `dynamo.py` shadowing the real `dynamo` package.

- `tests/parity/README.md`: top-level overview covering the layout,
  M2/M3 method tradeoffs, fixture schema, why per-family JSONs look
  similar by design, the regenerator flags, and a checklist for
  adding new parser families.

- `tests/parity/common.py`: shared `ParseResult`, canonical-JSON
  diff with `""`/`None` normalization on `normal_text`, and a
  `decode_arguments()` helper (parses JSON-encoded arguments,
  falls back to raw bytes on failure).
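A minimal sketch of the `tests/parity/common.py` helpers described above, assuming the behavior stated here (the `""`/`None` normalization on `normal_text` and the JSON-with-fallback `decode_arguments()`); the bodies are illustrative, not the shipped code:

```python
import json
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ParseResult:
    """Normalized output of one parser implementation."""
    calls: list = field(default_factory=list)
    normal_text: Optional[str] = None


def decode_arguments(raw: Any) -> Any:
    """Parse JSON-encoded tool-call arguments; fall back to the raw value."""
    if isinstance(raw, (str, bytes)):
        try:
            return json.loads(raw)
        except (json.JSONDecodeError, UnicodeDecodeError):
            return raw
    return raw


def canonical(result: ParseResult) -> str:
    """Canonical JSON for diffing; normal_text "" and None compare equal."""
    text = result.normal_text or None
    return json.dumps(
        {"calls": [decode_arguments(c) for c in result.calls], "normal_text": text},
        sort_keys=True,
    )
```

With this normalization, two impls that differ only in returning `""` versus `None` for an absent `normal_text` canonicalize to the same string and produce no diff.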

CI gating
---------
Module-level markers: `unit, pre_merge, gpu_0`. Per-parametrize
impl markers (`vllm`/`sglang`) so CI marker filters
(`pre_merge and vllm and gpu_0`, `pre_merge and sglang and gpu_0`)
pick up only the right subset in each container. `dynamo` params
get no impl marker — they only run when the harness is explicitly
selected, since Dynamo doesn't have a per-container CI job here.
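The gating above reduces to AND-filters over marker sets; a toy model (plain Python, not the real pytest marker machinery) of why `dynamo` params fall out of both container jobs:

```python
def selected(param_markers: set, filter_terms: list) -> bool:
    # Models a pytest -m "a and b and c" filter: every term must be present.
    return all(term in param_markers for term in filter_terms)


MODULE_MARKERS = {"unit", "pre_merge", "gpu_0"}

vllm_param = MODULE_MARKERS | {"vllm"}
sglang_param = MODULE_MARKERS | {"sglang"}
dynamo_param = MODULE_MARKERS  # no impl marker

vllm_ci = ["pre_merge", "vllm", "gpu_0"]
sglang_ci = ["pre_merge", "sglang", "gpu_0"]

# Each container job's filter matches only its own impl's params;
# dynamo params match neither, so they run only on explicit selection.
```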

Findings (xfail, not silenced)
------------------------------
KNOWN_DIVERGENCES registry records 22 cross-impl findings:

- Trailing-text drop after wrapper end (vLLM and SGLang on three
  XML-style families: kimi_k2, qwen3_coder, glm47).
- SGLang GptOssDetector requires the strict
  <|start|>assistant<|channel|>commentary envelope; rejects bare
  commentary/analysis variants Dynamo accepts.
- SGLang drops the analysis-channel preamble on harmony PARSER.batch.7.
- Dynamo drops back-to-back commentary blocks on harmony
  PARSER.batch.{2,10}; SGLang extracts both — Dynamo bug class.
- Per-impl whitespace handling around wrapper boundaries.
- PARSER.batch.4 (malformed JSON) and PARSER.batch.5 (missing
  end-token) recovery contracts are impl-defined per
  PARSER_CASES.md; divergences documented rather than
  asserted-against.
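A hypothetical shape for the registry (the real `KNOWN_DIVERGENCES` schema may differ; this only shows how a finding stays greppable and xfailed rather than silenced, using two findings listed above):

```python
# Keyed by (impl, family, case_id); the value is the xfail reason, so the
# harness can xfail the exact parametrization instead of skipping it.
KNOWN_DIVERGENCES = {
    ("sglang", "harmony", "PARSER.batch.7"): (
        "SGLang drops the analysis-channel preamble that Dynamo keeps"
    ),
    ("dynamo", "harmony", "PARSER.batch.2"): (
        "Dynamo drops back-to-back commentary blocks; SGLang extracts both"
    ),
}


def divergence_reason(impl: str, family: str, case_id: str):
    """Return the xfail reason for a known divergence, or None."""
    return KNOWN_DIVERGENCES.get((impl, family, case_id))
```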

Coverage
--------
111 pass / 90 cross-container-skipped / 9 xfailed in vLLM container;
34 pass / 160 skipped / 16 xfailed in SGLang container.

Signed-off-by: Keiven Chang <[email protected]>
Add an "Eventual goal" section to tests/parity/README.md describing
the intended end state: one shared set of JSON fixtures consumed by
both a Python harness (this PR + M3) and a future thin Rust
harness, replacing the ~70 black-box Rust unit tests that currently
duplicate inputs already encoded as M2 fixtures.

Captures:
- The current overlap (M2 fixtures were extracted from Rust tests
  by hand; ~70 cases now live in both languages).
- What stays Rust-only (white-box helper tests, tokenizer / panic
  cases).
- A rough PR-stack sketch for the migration (Rust harness, then
  mechanical Rust-test deletion, then expanded coverage for
  format / regression cases). Streaming follows separately.

Subsequent PRs (not in this one's scope) would land the Rust
harness and prune the duplicates. M2's real value-add — the
cross-impl half (vLLM + SGLang) — is unaffected.

No code changes.

Signed-off-by: Keiven Chang <[email protected]>
@keivenchang marked this pull request as ready for review May 6, 2026 04:13
@keivenchang marked this pull request as draft May 6, 2026 04:13
Replace the "Two methods" table with a fuller section covering all
three methods (M1 fallback-path, M2 parser-class, M3 e2e HTTP).
Each gets a short block describing what runs, what's substituted,
and what bug class it surfaces, plus a comparison table mapping
each method to which slice of the request lifecycle it actually
exercises.

M1 is described as future / not in this PR. The harness here
implements M2 only; M3 lands in the sibling PR. M1 belongs to a
separate harness because it tests Dynamo's wrapping of upstream
parsers, not parity between impls.

No code changes.

Signed-off-by: Keiven Chang <[email protected]>
Update README so the framing relative to this PR is clearer. M2 is
this PR's harness; M1 (Dynamo-frontend fallback-path test) and M3
(e2e HTTP test, sibling PR #9189) are presented as upcoming next
steps rather than as vague "future" / "(sibling PR ...)" mentions.

No code changes.

Signed-off-by: Keiven Chang <[email protected]>
@keivenchang force-pushed the keivenchang/DIS-1906__method-2-cross-impl-parser-parity-harness branch from 6bd9471 to 55eab8e on May 6, 2026 at 04:23
@github-actions bot added the documentation label (Improvements or additions to documentation) May 6, 2026
Convert the 7 PARSER.batch fixture files from JSON to YAML so the
multi-line `model_text` field reads as the actual wire format
instead of a `\n`-escaped one-liner. Same data, more readable
during PR review:

- XML-style families (qwen3_coder, kimi_k2, glm47, harmony) get
  proper line breaks via YAML literal block scalars (`|-`).
- DeepSeek embedded-JSON args lose the escaped-quote noise
  (`{\"location\":\"NYC\"}` -> `{"location":"NYC"}`).
- Special tokens (`|` U+FF5C, `▁` U+2581) still round-trip
  literally via `allow_unicode=True`.
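The block-scalar behavior described above can be reproduced with a small PyYAML representer; a sketch along these lines (the actual presenter in `regenerate_fixtures.py` may differ in details):

```python
import yaml


def str_presenter(dumper, data):
    # Multi-line strings get literal block style (|); everything else default.
    style = "|" if "\n" in data else None
    return dumper.represent_scalar("tag:yaml.org,2002:str", data, style=style)


# Registers on the default yaml.Dumper used by yaml.dump().
yaml.add_representer(str, str_presenter)

case = {"model_text": '<tool_call>\n{"location": "NYC"}\n</tool_call>'}
dumped = yaml.dump(case, allow_unicode=True)

# The multi-line field is emitted as a readable block scalar, and the
# round-trip preserves the original string exactly.
assert yaml.safe_load(dumped) == case
```

The `allow_unicode=True` flag is what keeps special tokens emitted literally rather than as `\u`-escapes.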

Mechanical changes:
- `regenerate_fixtures.py`: dump via PyYAML with a custom string
  presenter that picks block-scalar style for multi-line strings.
  Test pass: 111 passed / 90 skipped / 9 xfailed (unchanged from
  the JSON baseline). `--overwrite-if-exists` round-trip is
  byte-stable.
- `test_parity_parser.py`: load via `yaml.safe_load`; glob shifts
  from `*.json` to `*.yaml`.
- README: schema example reframed in YAML; mentions JSON only
  where it refers to wire-format JSON (tool-call args), not the
  fixture format.
- Fix a pre-existing bug in `regenerate_fixtures.py` where
  `FIXTURES_ROOT = Path(__file__).parent` wrote outputs one level
  above the canonical `fixtures/` tree. Now writes back to
  `fixtures/`.

PyYAML is already an ambient runtime dep; no new top-level
requirement. Eventual Rust harness can use `serde_yaml` (already
present transitively via `kube-client`).

Signed-off-by: Keiven Chang <[email protected]>
Switch on-disk YAML case keys from bare ints (`1`, `2`, ...) to full
PARSER_CASES.md IDs (`PARSER.batch.1`, `PARSER.batch.2`, ...). One
grep now finds a case across docs, fixtures, KNOWN_DIVERGENCES,
pytest IDs, and Rust source comments.

- `regenerate_fixtures.py`: writes full IDs at write time, strips
  prefix at read time (internal bookkeeping stays keyed by case
  number).
- `test_parity_parser.py`: drops the `f"PARSER.{mode}.{n}"`
  reconstruction since the on-disk key is the case ID directly.
- 7 fixture files: rekeyed in place.
- README: schema example updated; case-keys paragraph rewritten
  to point at PARSER_CASES.md as the canonical ID source.
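The write-full-ID / strip-at-read convention can be sketched as follows (hypothetical helper names, fixed to `batch` mode; the regenerator's internals may differ):

```python
CASE_ID_PREFIX = "PARSER.batch."


def to_case_key(n: int) -> str:
    """On-disk YAML key: the full PARSER_CASES.md case ID."""
    return f"{CASE_ID_PREFIX}{n}"


def to_case_number(key: str) -> int:
    """Internal bookkeeping: strip the prefix back to a case number."""
    if not key.startswith(CASE_ID_PREFIX):
        raise ValueError(f"unexpected case key: {key!r}")
    return int(key[len(CASE_ID_PREFIX):])
```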

Test pass: 111 passed / 90 skipped / 9 xfailed (unchanged).

Signed-off-by: Keiven Chang <[email protected]>
Labels: documentation (Improvements or additions to documentation), feat, size/XXL
