feat: add diagram-type-aware label rendering across strategies#71
Conversation
CREWAI_TRACING_ENABLED=false governs only the enabled path, not the first-execution 'view traces? [y/N]' prompt, which blocks on stdin for a 20s timeout per crew. The file-based declined preference is wiped on every docker --rm, so it re-fires on every cell: ~1500 crew_ai cells means hours of dead waiting and corrupted duration_ms. CREWAI_TESTING=true short circuits the prompt before any stdin read.
…ep-3 validation Empirical review of strategy outputs found the multi-step strategies scored entity_name_f1 near 0 on C4 inputs: their JSON IR flattened the name/[Type]/tech label structure the ground truth uses, and the shared rules never taught that format. Fixes, verified to lift name_f1 to ~1.0 on both C4 and BPMN across all four strategies: - Label rule is now notation-aware: C4 and network-topology diagrams get the name/[Type]/tech multi-line label; BPMN process and collaboration diagrams keep bare names. diagram_type is read from the input metadata and passed to every strategy's render step as task context (the baseline already saw it; this makes it uniform and explicit), so the convention is applied consistently rather than guessed. - Step-1 IR carries a short tech label (not the long description) and extracts every referenced parent group as a named entity, so container and boundary subgraphs render with their labels instead of empty brackets. The hierarchy rule now covers C4 boundaries and deployment environments and requires a quoted subgraph label. - Step-3 gains a light structural check that rejects empty-label brackets and concatenated node defs, so a malformed diagram consumes the retry budget instead of landing as a scored parse failure. - Strip CrewAI's expected-output scaffolding so its delivered prompt is byte-identical to SOP, removing a confound between the two strategies.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughThe PR threads a Changesdiagram_type propagation, Mermaid validation, input-aware metrics, and prompt updates
Sequence DiagramsequenceDiagram
participant InputFile as Input JSON
participant Strategy as Strategy.run()
participant extract_type as extract_diagram_type()
participant BuildPrompt as _build_prompt()
participant LLM as LLM Provider
participant Metrics as evaluate_run()
participant InputAware as Metrics with input_path
InputFile->>Strategy: raw JSON string
Strategy->>extract_type: raw input
extract_type-->>Strategy: diagram_type or "unspecified"
Strategy->>BuildPrompt: input_data, diagram_type, step_outputs
BuildPrompt-->>Strategy: formatted prompt with diagram_type
Strategy->>LLM: complete prompt
LLM-->>Strategy: Mermaid output
Strategy->>Metrics: run_id, output, ground_truth, input_path
Metrics->>InputAware: extract unnamed_ids from input_path
InputAware-->>Metrics: unnamed node set
Metrics-->>Metrics: score entities with input-aware exemption
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~28 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/maestro/strategies/_extraction.py`:
- Around line 166-168: The return statement in the extraction method does not
normalize the diagram_type value before returning it, allowing None or
non-string values to be returned when the key exists but is malformed. Instead
of relying solely on the get() default parameter, explicitly validate and
normalize the diagram_type value after extraction. Ensure that if the extracted
value is None, empty, or not a string, the method returns the "unspecified"
fallback string. This should be done before the return statement to guarantee
callers always receive a valid string value.
- Around line 253-256: The _CONCATENATED_NODES regex pattern is matching inside
quoted label text, causing false positives for valid Mermaid nodes like
a["Service [v1] Gateway [public]"]. Modify the concatenated node detection logic
in the loop where splitlines() is used to only check for the pattern outside of
quoted strings. You can either strip out quoted portions before searching, or
skip the _CONCATENATED_NODES check entirely for lines that contain quoted labels
(lines with square brackets inside double quotes).
In `@src/maestro/strategies/crew.py`:
- Around line 188-190: The code uses find() to locate
cls._CREW_EXPECTED_OUTPUT_MARKER, which returns the first occurrence. This
causes incorrect truncation if the marker text appears in the task description
itself. Change find() to rfind() to locate the last occurrence of the marker
instead, since the CrewAI scaffold containing the marker is appended after the
task description, not embedded within it.
In `@src/maestro/strategies/single.py`:
- Around line 81-84: Replace the inline diagram_type extraction logic in the
PROMPT_TEMPLATE.format call with a call to the shared extract_diagram_type
function. Instead of using input_data.get("metadata", {}).get("diagram_type",
"unspecified"), pass the raw input_data parameter to extract_diagram_type() to
get the diagram_type value. This will ensure the single-agent strategy handles
edge cases like null or non-dict metadata safely and maintains consistency with
the SOP, Crew, and LangGraph strategies by falling back to "unspecified" in
those cases.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: bb063cbb-bcda-490a-bb60-552a1fc1001d
📒 Files selected for processing (9)
src/maestro/prompts.pysrc/maestro/run.pysrc/maestro/strategies/_extraction.pysrc/maestro/strategies/crew.pysrc/maestro/strategies/langgraph.pysrc/maestro/strategies/single.pysrc/maestro/strategies/sop.pytests/strategies/test_step_validation.pytests/test_prompts.py
- extract_diagram_type normalises present-but-malformed values (null, a number, blank) to "unspecified"; get()'s default only guarded a missing key. - step-3 concatenation check blanks quoted labels first, so brackets inside a label (e.g. a["Service [v1] Gateway"]) are no longer a false positive. - single-agent shares extract_diagram_type instead of duplicating the lookup, so the baseline gets the same edge-case handling as the multi-step strategies. - crew marker strip uses rfind so an appended scaffold is cut even if the marker phrase ever appears in the prompt body.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/strategies/test_step_validation.py (1)
18-29: ⚡ Quick winAdd a whitespace-only
diagram_typecase to lock the normalization contract.
extract_diagram_typetreats blank and whitespace-only strings as"unspecified", but this test currently only checks"". Adding a" "case will prevent regressions on that branch.Proposed test addition
def test_extract_diagram_type_normalizes_malformed_values(): @@ assert extract_diagram_type('{"metadata": {"diagram_type": ""}}') == "unspecified" + assert extract_diagram_type('{"metadata": {"diagram_type": " "}}') == "unspecified" assert extract_diagram_type('{"metadata": null}') == "unspecified" assert extract_diagram_type("not json") == "unspecified"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/strategies/test_step_validation.py` around lines 18 - 29, The test_extract_diagram_type_normalizes_malformed_values function tests various malformed values but does not cover the whitespace-only string case. Add an assertion after the empty string test case that calls extract_diagram_type with a JSON object containing a diagram_type value of whitespace-only characters (such as spaces) and verify it returns "unspecified", ensuring the normalization logic properly handles this edge case.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/maestro/strategies/single.py`:
- Line 26: Add the required future import statement at the very beginning of the
file before all other imports, specifically add `from __future__ import
annotations` as the first line in the file, above the existing import from
maestro.strategies._extraction. This import is a required coding guideline that
must be present at the top of all Python files in this project.
---
Nitpick comments:
In `@tests/strategies/test_step_validation.py`:
- Around line 18-29: The test_extract_diagram_type_normalizes_malformed_values
function tests various malformed values but does not cover the whitespace-only
string case. Add an assertion after the empty string test case that calls
extract_diagram_type with a JSON object containing a diagram_type value of
whitespace-only characters (such as spaces) and verify it returns "unspecified",
ensuring the normalization logic properly handles this edge case.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3c5f9b06-a8db-42e9-b793-a365001c7250
📒 Files selected for processing (4)
src/maestro/strategies/_extraction.pysrc/maestro/strategies/crew.pysrc/maestro/strategies/single.pytests/strategies/test_step_validation.py
🚧 Files skipped from review as they are similar to previous changes (2)
- src/maestro/strategies/_extraction.py
- src/maestro/strategies/crew.py
|
|
||
| from maestro.prompts import render_rules | ||
| from maestro.schemas import InputFile, RunConfig, RunResult, SubResult | ||
| from maestro.strategies._extraction import extract_diagram_type |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Check if the file starts with the required future import
head -n 5 src/maestro/strategies/single.py | grep -n "from __future__ import annotations"Repository: Colinho22/maestro
Length of output: 43
Add from __future__ import annotations at the top of the file.
The coding guidelines require from __future__ import annotations at the top of all Python files. This import is currently missing from this file and must be added before any other imports.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/maestro/strategies/single.py` at line 26, Add the required future import
statement at the very beginning of the file before all other imports,
specifically add `from __future__ import annotations` as the first line in the
file, above the existing import from maestro.strategies._extraction. This import
is a required coding guideline that must be present at the top of all Python
files in this project.
Source: Coding guidelines
A dropped `end` leaves a subgraph open: invalid Mermaid that mmdc rejects but the prior checks passed, so it scored as a parse failure instead of consuming the retry budget. Deeply nested diagrams (pools > lanes > subprocesses, network zones) are where a weaker model drops one. Count `subgraph` openers against standalone `end` closers and reject an imbalance; `end` is matched only as a whole-line closer so node ids like end_event_1 and labels like "End Event" never count. Verified against all 30 ground truths (zero false positives). Also pins extract_diagram_type on a whitespace-only diagram_type value.
Two ground-truth authoring conventions are not derivable from the input, so scoring them penalised models for an unreachable target: - The optional third descriptor line of a C4 / network label is authored inconsistently (network topology includes it for some nodes and omits it for others, though the input always carries the field). The entity-name metric now compares on the name and [Type] lines only. - Nodes the input leaves unnamed (BPMN gateways/events with name "") are labelled in the ground truth from convention the input never supplies (type name, unicode symbols, split/join). These are scored by id match only. This is conditional on the input: a node the input did name is still scored on its label, so blanking a nameable node is still penalised. evaluate_run takes an optional input_path to read the unnamed-id set; it is backward compatible (omitted means strict label scoring) and fails soft on a missing or malformed input. On the pre-flight, network name_f1 rose from ~0.28-0.38 to ~0.88-1.0 with no regression on BPMN or C4.
|
@coderabbitai please review the new changes introduced with fccc5ac and 625bb81. One addresses your earlier comments, the other one tackles the issues found after running more of the corpus (see PR description for more). |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: #!/bin/bash
set -euo pipefail
REPO="Colinho22/maestro"
echo "=== COMMIT fccc5ac ==="
gh api "repos/$REPO/commits/fccc5ac94ab8d9c6669c492ff04407e50c148f65" \
--jq '.commit.message' 2>/dev/null || true
echo "---FILES CHANGED---"
gh api "repos/$REPO/commits/fccc5ac94ab8d9c6669c492ff04407e50c148f65" \
--jq '.files[] | {filename: .filename, additions: .additions, deletions: .deletions, patch: .patch}' 2>/dev/null || true
echo ""
echo "=== COMMIT 625bb81 ==="
gh api "repos/$REPO/commits/625bb81603e1671575d76b65617a030482bad91e" \
--jq '.commit.message' 2>/dev/null || true
echo "---FILES CHANGED---"
gh api "repos/$REPO/commits/625bb81603e1671575d76b65617a030482bad91e" \
--jq '.files[] | {filename: .filename, additions: .additions, deletions: .deletions, patch: .patch}' 2>/dev/null || trueLength of output: 19959 🏁 Script executed: #!/bin/bash
set -euo pipefail
REPO="Colinho22/maestro"
echo "=== Previous CodeRabbit comments on PR `#71` ==="
gh pr view 71 --repo "$REPO" --comments 2>/dev/null | head -200 || true
echo ""
echo "=== PR review comments ==="
gh api "repos/$REPO/pulls/71/comments" \
--jq '.[] | select(.user.login == "coderabbitai") | {id: .id, body: .body, path: .path, line: .line}' 2>/dev/null | head -300 || trueLength of output: 13512
|
A bare `subgraph` (no id) is a valid opener, but the balance check only matched `subgraph ` with a trailing space, so an anonymous subgraph would be counted as 0 opens against its 1 `end` and falsely rejected as unbalanced. Match the bare form too. None appear in the corpus, but a model emitting one during a run should not produce a spurious failed cell.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/maestro/analysis/metrics.py (1)
110-133: 💤 Low valueRedundant
Path()wrapper on line 123.
input_pathis already typed asPath | None, so wrapping it inPath(input_path)is unnecessary. While it works (Path accepts another Path), it adds a pointless conversion.Suggested simplification
try: - data = json.loads(Path(input_path).read_text(encoding="utf-8")) + data = json.loads(input_path.read_text(encoding="utf-8")) except (OSError, json.JSONDecodeError):🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/maestro/analysis/metrics.py` around lines 110 - 133, In the extract_input_unnamed_ids function, the try block wraps input_path in an unnecessary Path() constructor when calling read_text(). Since input_path is already typed as Path | None, the Path() wrapper is redundant and should be removed. Simply call read_text() directly on input_path to eliminate the pointless conversion.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/maestro/analysis/metrics.py`:
- Around line 110-133: In the extract_input_unnamed_ids function, the try block
wraps input_path in an unnecessary Path() constructor when calling read_text().
Since input_path is already typed as Path | None, the Path() wrapper is
redundant and should be removed. Simply call read_text() directly on input_path
to eliminate the pointless conversion.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 565c0cc5-751e-4ed5-9190-021cbdcf565d
📒 Files selected for processing (5)
src/maestro/analysis/metrics.pysrc/maestro/run.pysrc/maestro/strategies/_extraction.pytests/analysis/test_label_scoring.pytests/strategies/test_step_validation.py
🚧 Files skipped from review as they are similar to previous changes (2)
- src/maestro/run.py
- src/maestro/strategies/_extraction.py
input_path is already Path after the None guard, so wrapping it in Path() before read_text() was a no-op.
Summary
Reviewing real strategy outputs across the corpus surfaced several places where
a strategy scored badly not because of model capability, but because the prompt
contract or the metric did not match what the input actually provides. This PR
closes those gaps so the scored run measures orchestration quality rather than
unpredictable authoring conventions.
This is an RC (
v1.0.0-rc.1) change that alters scoring behaviour, so it is aversion bump: the pre-change measurement rows must be cleared before the scored
run (nothing here is in a published dataset yet). The prompt-contract and metric
changes are deliberately bundled: both are the same "make the task well-posed"
pass and were validated together in an 80-cell pre-flight covering every
diagram type and tier.
What the investigation found
The headline early misread, now corrected: the multi-step strategies were not
hallucinating nodes.
false_entitiesmeans "id matches truth but labeldiffers", not "invented node". The real issues were label fidelity and two
ground-truth conventions the input does not supply.
Changes
Notation-aware labels (
prompts.py,_extraction.py, all strategies)name\n[Type]\ntechmulti-line label; BPMN process and collaboration keepbare names. This fixes C4 (was broken) without regressing BPMN.
diagram_typeis read from input metadata and passed to every strategy'srender step as task context (the baseline already saw it implicitly; this
makes it explicit and uniform), so the convention is applied consistently
rather than guessed. The same context reaches single-agent, SOP, CrewAI, and
LangGraph identically, so it is a controlled variable.
IR fidelity and subgraph labels (
_extraction.py)techlabel (not the long description) and extractsevery referenced parent group as a named entity, so container and boundary
subgraphs render with their labels instead of empty brackets. The hierarchy
rule now covers C4 boundaries and deployment environments and requires a
quoted subgraph label.
Step-3 structural validation (
_extraction.py)subgraph/end(a droppedendis invalid Mermaid that mmdc rejects). Theseconsume the retry budget instead of scoring as a parse failure.
endismatched only as a whole-line closer, and the concatenation scan blanks quoted
labels first so brackets inside a label are not a false positive. Verified
against all 30 ground truths with zero false positives.
Metric scores only the input-derivable label (
metrics.py,run.py)inconsistently (network topology includes it for some nodes and omits it for
others), so the entity-name metric compares on the name and
[Type]linesonly.
name "") arelabelled in the ground truth from convention the input never supplies (type
name, unicode symbols, split/join). These are scored by id match only.
Conditional on the input: a node the input did name is still scored on its
label, so blanking a nameable node is still penalised.
evaluate_rungains an optionalinput_path; backward compatible (omitted =strict label scoring) and fails soft on a missing or malformed input.
CrewAI confound removed (
crew.py)_collapse_messagesstrips CrewAI's injected "expected output" scaffolding soits delivered prompt is byte-identical to SOP, removing a prompt-content
difference that confounded the orchestration comparison.
CrewAI trace prompt (
run.py)CREWAI_TESTING=trueshort-circuits CrewAI's first-execution "view traces?"prompt, which blocked on stdin for a 20s timeout per crew. In a headless
container the declined preference is wiped each
--rm, so it re-fired onevery cell: hours of dead waiting on a full run, and it corrupted
duration_ms.Pre-flight results
80 cells (one input per diagram_type x tier, 4 strategies, 2 Mistral models).
After the changes,
entity_name_f1on the previously-failing inputs:BPMN and C4 tier-1 stayed at ~1.0 (no regression). Remaining lower scores (e.g.
a weak model on a tier-3 BPMN) are genuine model failures, correctly recorded.
Testing
tests/analysis/test_label_scoring.py: descriptor stripping, theinput-unnamed-node exemption, and the guardrail that a named node is still
scored on its label.
tests/strategies/test_step_validation.py: empty-bracket, concatenated-node,and unbalanced-subgraph cases, plus
extract_diagram_typenormalisation(null / number / blank / whitespace).
tests/test_prompts.py: snapshot updated for the new label and hierarchyrules;
diagram_typeplaceholder assertions.ruff checkandruff format --checkclean.Note for reviewers
The label convention is genuinely input-dependent (C4 vs BPMN notation) and two
ground-truth conventions are not derivable from the input, so this PR defines the
scored surface: the name + type label core, with the descriptor line and
labels for input-unnamed nodes explicitly out of contract. This is the
SQuAD / ToTTo-style "score what the input determines" principle, and it is
documented here so the thesis metric section can cite it.
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests