feat: add diagram-type-aware label rendering across strategies by Colinho22 · Pull Request #71 · Colinho22/maestro

Colinho22 · 2026-06-20T10:22:47Z

Summary

Reviewing real strategy outputs across the corpus surfaced several places where
a strategy scored badly not because of model capability, but because the prompt
contract or the metric did not match what the input actually provides. This PR
closes those gaps so the scored run measures orchestration quality rather than
unpredictable authoring conventions.

This is an RC (v1.0.0-rc.1) change that alters scoring behaviour, so it is a
version bump: the pre-change measurement rows must be cleared before the scored
run (nothing here is in a published dataset yet). The prompt-contract and metric
changes are deliberately bundled: both are the same "make the task well-posed"
pass and were validated together in an 80-cell pre-flight covering every
diagram type and tier.

What the investigation found

The headline finding corrects an early misread: the multi-step strategies were not
hallucinating nodes. false_entities means "id matches truth but label
differs", not "invented node". The real issues were label fidelity and two
ground-truth conventions that the input does not supply.

Changes

Notation-aware labels (prompts.py, _extraction.py, all strategies)

The label rule branches by notation: C4 container and network topology get the
name\n[Type]\ntech multi-line label; BPMN process and collaboration keep
bare names. This fixes C4 (was broken) without regressing BPMN.
diagram_type is read from input metadata and passed to every strategy's
render step as task context (the baseline already saw it implicitly; this
makes it explicit and uniform), so the convention is applied consistently
rather than guessed. The same context reaches single-agent, SOP, CrewAI, and
LangGraph identically, so it is a controlled variable.
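
As a rough illustration of the convention the label rule describes, the sketch below builds the label text for one node. In the PR this is a prompt rule, not a Python helper: `build_label`, `MULTI_LINE_TYPES`, and the exact diagram_type spellings are assumptions made for the example.

```python
# Hypothetical illustration of the label convention; the PR encodes this as a
# prompt rule for the model, not as a Python function.
MULTI_LINE_TYPES = {"c4 container", "network topology"}  # assumed spellings


def build_label(diagram_type: str, name: str, type_: str = "", tech: str = "") -> str:
    """Return the label text a render step is expected to emit for one node."""
    if diagram_type.strip().lower() not in MULTI_LINE_TYPES:
        # BPMN process / collaboration (and anything unrecognised) keep the bare name.
        return name
    parts = [name]
    if type_:
        parts.append(f"[{type_}]")
    if tech:
        parts.append(tech)
    # Mermaid labels use the literal two-character "\n" escape as a line break.
    return "\\n".join(parts)
```

Under these assumptions, `build_label("c4 container", "API", "Container", "FastAPI")` yields the three-line form, while any BPMN type returns the name unchanged.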

IR fidelity and subgraph labels (_extraction.py)

Step-1 IR carries a short tech label (not the long description) and extracts
every referenced parent group as a named entity, so container and boundary
subgraphs render with their labels instead of empty brackets. The hierarchy
rule now covers C4 boundaries and deployment environments and requires a
quoted subgraph label.
Step 1 no longer turns a metadata or summary field into a phantom entity.

Step-3 structural validation (_extraction.py)

Rejects empty-label brackets, concatenated node defs, and unbalanced
subgraph/end (a dropped end is invalid Mermaid that mmdc rejects). These
consume the retry budget instead of scoring as a parse failure. end is
matched only as a whole-line closer, and the concatenation scan blanks quoted
labels first so brackets inside a label are not a false positive. Verified
against all 30 ground truths with zero false positives.
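
A minimal sketch of the empty-label and concatenation checks, assuming simple regexes. The pattern names mirror those mentioned in the review thread, but the patterns themselves and the `find_shape_error` wrapper are guesses for illustration, not the code in _extraction.py.

```python
from __future__ import annotations

import re

# Assumed patterns; only the names come from the PR discussion.
_QUOTED_SPAN = re.compile(r'"[^"]*"')
_EMPTY_LABEL = re.compile(r'\[\s*(?:"\s*")?\s*\]')
_CONCATENATED_NODES = re.compile(r"[\])}]\s*\w+\s*[\[({]")


def find_shape_error(mermaid: str) -> str | None:
    """Return a short reason if the diagram looks malformed, else None."""
    for line in mermaid.splitlines():
        # An empty bracket pair such as a[] or a[""] carries no label.
        if _EMPTY_LABEL.search(line):
            return "empty label"
        # Blank quoted labels first so brackets inside a label are ignored.
        blanked = _QUOTED_SPAN.sub('""', line)
        # A closing bracket followed directly by another node id and an
        # opening bracket means two node defs were run together.
        if _CONCATENATED_NODES.search(blanked):
            return "concatenated node definitions"
    return None
```

With these patterns, a label such as `a["Service [v1] Gateway [public]"]` passes, because the inner brackets are blanked before the concatenation scan runs.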

Metric scores only the input-derivable label (metrics.py, run.py)

The optional third descriptor line of a C4 / network label is authored
inconsistently (network topology includes it for some nodes and omits it for
others), so the entity-name metric compares on the name and [Type] lines
only.
Nodes the input leaves unnamed (BPMN gateways/events with name "") are
labelled in the ground truth by conventions the input never supplies (type
name, unicode symbols, split/join). These are scored by id match only.
The exemption is conditional on the input: a node the input did name is still
scored on its label, so blanking a nameable node is still penalised.
evaluate_run gains an optional input_path; backward compatible (omitted =
strict label scoring) and fails soft on a missing or malformed input.
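
The two scoring rules can be sketched as follows. This is a simplified stand-in: the real metric uses fuzzy matching, whereas the sketch compares label cores exactly and keys everything by node id. `label_core` and `name_matches` are hypothetical names, and dropping a second line that is not a `[Type]` line is an assumption.

```python
def label_core(label: str) -> str:
    """Keep the name line and, if present, the [Type] line of a label."""
    parts = label.split("\\n")  # Mermaid's literal two-character escape
    core = parts[:1]
    if len(parts) > 1 and parts[1].strip().startswith("["):
        core.append(parts[1])
    return "\\n".join(p.strip() for p in core)


def name_matches(
    node_id: str,
    output_label: str,
    truth: dict[str, str],
    input_unnamed_ids: set[str],
) -> bool:
    """Decide whether one output node counts as a correct name match."""
    if node_id not in truth:
        return False
    # Nodes the input left unnamed are scored on id alone.
    if node_id in input_unnamed_ids:
        return True
    # Everything else is still scored on its label, minus the descriptor line.
    return label_core(output_label) == label_core(truth[node_id])
```

The second branch is the guardrail: a node the input did name gets no exemption, so emitting it with a blank label still counts as a miss.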

CrewAI confound removed (crew.py)

_collapse_messages strips CrewAI's injected "expected output" scaffolding so
its delivered prompt is byte-identical to SOP, removing a prompt-content
difference that confounded the orchestration comparison.
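
A minimal sketch of the truncation step, assuming the scaffolding is appended after a fixed marker phrase. The marker text and the `strip_scaffold` name are placeholders for illustration; the PR keeps the real marker in `_CREW_EXPECTED_OUTPUT_MARKER`.

```python
# Placeholder marker text (assumed); the real value lives in crew.py.
EXPECTED_OUTPUT_MARKER = "\nThis is the expected criteria for your final answer:"


def strip_scaffold(prompt: str, marker: str = EXPECTED_OUTPUT_MARKER) -> str:
    """Cut the prompt at the last occurrence of the marker, if any."""
    # rfind, not find: the scaffold is appended, so the last occurrence is the
    # one to cut even if the phrase also appears in the prompt body.
    idx = prompt.rfind(marker)
    if idx == -1:
        return prompt
    return prompt[:idx].rstrip()
```

Using the last occurrence keeps the task description intact when it happens to quote the marker phrase itself.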

CrewAI trace prompt (run.py)

CREWAI_TESTING=true short-circuits CrewAI's first-execution "view traces?"
prompt, which blocks on stdin until a 20s timeout for each crew. In a headless
container the declined preference is wiped on every docker --rm, so the prompt
re-fired on every cell: hours of dead waiting on a full run, plus corrupted
duration_ms values.
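
For reference, the environment default can be set as in the sketch below. The review walkthrough says run.py uses `os.environ.setdefault`; the `configure_headless_env` wrapper and its `env` parameter are hypothetical, added only so the behaviour can be exercised against a plain dict.

```python
import os
from collections.abc import MutableMapping


def configure_headless_env(env: MutableMapping[str, str] = os.environ) -> None:
    """Suppress the interactive trace prompt unless the operator set a value."""
    # setdefault leaves an explicit CREWAI_TESTING value untouched.
    env.setdefault("CREWAI_TESTING", "true")
```

Because `setdefault` only fills a missing key, an operator who exports a different value before the run keeps it.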

Pre-flight results

80 cells (one input per diagram_type x tier, 4 strategies, 2 Mistral models).
After the changes, entity_name_f1 on the previously-failing inputs:

input	type	before	after
it_3_26	network topology, tier 3	0.284	0.996
it_1_10	network topology, tier 1	0.375	0.875
it_2_17	network topology, tier 2	0.769	1.000
it_3_28	c4 container, tier 3	0.569	0.943

BPMN and C4 tier-1 stayed at ~1.0 (no regression). Remaining lower scores (e.g.
a weak model on a tier-3 BPMN) are genuine model failures, correctly recorded.

Testing

New tests/analysis/test_label_scoring.py: descriptor stripping, the
input-unnamed-node exemption, and the guardrail that a named node is still
scored on its label.
tests/strategies/test_step_validation.py: empty-bracket, concatenated-node,
and unbalanced-subgraph cases, plus extract_diagram_type normalisation
(null / number / blank / whitespace).
tests/test_prompts.py: snapshot updated for the new label and hierarchy
rules; diagram_type placeholder assertions.
Full suite: 257 passed. ruff check and ruff format --check clean.

Note for reviewers

The label convention is genuinely input-dependent (C4 vs BPMN notation) and two
ground-truth conventions are not derivable from the input, so this PR defines the
scored surface: the name + type label core, with the descriptor line and
labels for input-unnamed nodes explicitly out of contract. This is the
SQuAD / ToTTo-style "score what the input determines" principle, and it is
documented here so the thesis metric section can cite it.

Summary by CodeRabbit

New Features
- Added diagram-type context end-to-end to improve diagram-specific prompt and rendering behavior.
- Improved scoring so entities left unnamed in the input can still be evaluated via identifier-based matching.
Bug Fixes
- Strengthened Mermaid validation to reject common malformed outputs (empty labels, concatenated nodes, and unbalanced subgraphs).
- Non-interactive runs now behave more predictably by disabling the “view your traces?” prompt behavior.
Documentation
- Updated Mermaid generation rules with expanded node-label formatting guidance and broader hierarchy preservation instructions.
Tests
- Expanded Step 3 and label-scoring coverage, including malformed/edge-case handling.

CREWAI_TRACING_ENABLED=false governs only the enabled path, not the first-execution 'view traces? [y/N]' prompt, which blocks on stdin until a 20s timeout for each crew. The file-based declined preference is wiped on every docker --rm, so the prompt re-fires on every cell: ~1500 crew_ai cells means hours of dead waiting and corrupted duration_ms. CREWAI_TESTING=true short-circuits the prompt before any stdin read.

…ep-3 validation

Empirical review of strategy outputs found the multi-step strategies scored entity_name_f1 near 0 on C4 inputs: their JSON IR flattened the name/[Type]/tech label structure the ground truth uses, and the shared rules never taught that format. Fixes, verified to lift name_f1 to ~1.0 on both C4 and BPMN across all four strategies:

- Label rule is now notation-aware: C4 and network-topology diagrams get the name/[Type]/tech multi-line label; BPMN process and collaboration diagrams keep bare names. diagram_type is read from the input metadata and passed to every strategy's render step as task context (the baseline already saw it; this makes it uniform and explicit), so the convention is applied consistently rather than guessed.
- Step-1 IR carries a short tech label (not the long description) and extracts every referenced parent group as a named entity, so container and boundary subgraphs render with their labels instead of empty brackets. The hierarchy rule now covers C4 boundaries and deployment environments and requires a quoted subgraph label.
- Step-3 gains a light structural check that rejects empty-label brackets and concatenated node defs, so a malformed diagram consumes the retry budget instead of landing as a scored parse failure.
- Strip CrewAI's expected-output scaffolding so its delivered prompt is byte-identical to SOP, removing a confound between the two strategies.

coderabbitai · 2026-06-20T10:22:58Z

No actionable comments were generated in the recent review. 🎉

📥 Commits

Reviewing files that changed from the base of the PR and between 53780cd and cbdbe7d.

📒 Files selected for processing (1)

src/maestro/analysis/metrics.py

🚧 Files skipped from review as they are similar to previous changes (1)

src/maestro/analysis/metrics.py

Walkthrough

The PR threads a diagram_type value extracted from input metadata through all four strategy implementations (single, sop, crew, langgraph) into prompt construction. It expands the MERMAID_RULES contract with node-label and hierarchy rules, adds regex-based structural validation for step-3 Mermaid outputs, extends entity scoring to recognize unnamed input nodes and exempt them from label-mismatch penalties, strips CrewAI's auto-appended expected-output scaffolding from composed prompts, and sets CREWAI_TESTING=true to suppress interactive trace prompts.

Changes

diagram_type propagation, Mermaid validation, input-aware metrics, and prompt updates

Layer / File(s)	Summary
Prompt contracts, entity schema, and `extract_diagram_type` helper `src/maestro/prompts.py`, `src/maestro/strategies/_extraction.py`	`MERMAID_RULES` gains node-label construction rules (architecture/infrastructure multi-line formatting, process diagram simplification) and broadened hierarchy guidance for system boundaries and deployment environments. Step-1 extraction prompt adds optional `tech` field to entity objects and expands parent-group inclusion. Step-3 template gains `{diagram_type}` placeholder. New `extract_diagram_type(raw_input)` parses `metadata.diagram_type` from raw JSON with `"unspecified"` fallback on parse/type errors. `re` module imported.
Step-3 Mermaid structural smoke-check `src/maestro/strategies/_extraction.py`	`validate_step_output` routes step-3 outputs through `_validate_mermaid_shape` instead of generic non-empty check. Validator uses compiled regexes (`_EMPTY_LABEL`, `_CONCATENATED_NODES`) to reject empty-bracket node labels and concatenated definitions without separators. Blanks quoted spans before concatenation scan. Counts `subgraph`/`end` nesting and rejects unbalanced diagrams.
`diagram_type` threading in SingleAgentStrategy `src/maestro/strategies/single.py`	`PROMPT_TEMPLATE` extended with `{diagram_type}` placeholder. `run()` derives `diagram_type` via `extract_diagram_type(raw)` and formats template with both `diagram_type` and `input_data`.
`diagram_type` threading in SOPStrategy `src/maestro/strategies/sop.py`	Imports `extract_diagram_type`. `run()` computes `diagram_type` from raw input. `_build_prompt` signature extended to accept `diagram_type`. Prompt formatting variables include `diagram_type` alongside `input_data`.
`diagram_type` threading in LangGraphStrategy `src/maestro/strategies/langgraph.py`	Imports `extract_diagram_type`. `GraphState` extended with `diagram_type: str` channel. `run()` computes and passes `diagram_type` to compiled graph initial state. `generate_mermaid_node` formats `STEP_3_PROMPT` with `diagram_type` from state.
`diagram_type` threading in CrewAIStrategy `src/maestro/strategies/crew.py`	Imports `extract_diagram_type`. `run()` computes `diagram_type` from raw input and passes into `_build_prompt`. `_build_prompt` signature extended to accept `diagram_type`; template variables now include `diagram_type`.
CrewAI prompt hygiene and trace suppression `src/maestro/strategies/crew.py`, `src/maestro/run.py`	`_collapse_messages` rebuilds provider prompt from user-role content, locates `_CREW_EXPECTED_OUTPUT_MARKER` via `rfind`, truncates at marker, and returns trimmed prompt—removing CrewAI's expected-output scaffolding for byte-identical SOP input. `run.py` adds `CREWAI_TESTING=true` via `os.environ.setdefault` to suppress interactive trace prompt in batch runs.
Input-aware entity scoring for unnamed input nodes `src/maestro/analysis/metrics.py`	New `extract_input_unnamed_ids(input_path)` parses input JSON for node IDs with empty `name` field. `_label_core` normalizes labels by dropping optional third descriptor line. `_fuzzy_match` accepts `input_unnamed_ids` and treats ID matches as correct name-matches for unnamed nodes (exempting label mismatch). Exported `compute_entity_metrics_fuzzy`, `compute_entity_metrics_lemma`, `compute_entity_taxonomy` accept `input_unnamed_ids`. `evaluate_run` accepts optional `input_path`, extracts `unnamed_ids`, and passes into scoring/taxonomy.
Tests: validation, snapshots, and label-scoring `tests/strategies/test_step_validation.py`, `tests/test_prompts.py`, `tests/analysis/test_label_scoring.py`	`test_step_validation` adds `extract_diagram_type` normalization test, well-formed Mermaid acceptance, and negative cases for empty-bracket labels and concatenated nodes. `test_prompts` extends `MERMAID_RULES` snapshot and asserts `{diagram_type}` in both templates. New `test_label_scoring.py` validates `_label_core` behavior, descriptor-line invariance, unnamed-node id-matching, exemption guardrail, and `extract_input_unnamed_ids` robustness.

Sequence Diagram

sequenceDiagram
  participant InputFile as Input JSON
  participant Strategy as Strategy.run()
  participant extract_type as extract_diagram_type()
  participant BuildPrompt as _build_prompt()
  participant LLM as LLM Provider
  participant Metrics as evaluate_run()
  participant InputAware as Metrics with input_path

  InputFile->>Strategy: raw JSON string
  Strategy->>extract_type: raw input
  extract_type-->>Strategy: diagram_type or "unspecified"
  Strategy->>BuildPrompt: input_data, diagram_type, step_outputs
  BuildPrompt-->>Strategy: formatted prompt with diagram_type
  Strategy->>LLM: complete prompt
  LLM-->>Strategy: Mermaid output
  Strategy->>Metrics: run_id, output, ground_truth, input_path
  Metrics->>InputAware: extract unnamed_ids from input_path
  InputAware-->>Metrics: unnamed node set
  Metrics-->>Metrics: score entities with input-aware exemption

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~28 minutes

Possibly related PRs

Colinho22/maestro#61: Directly modifies the same MERMAID_RULES contract in src/maestro/prompts.py and corresponding tests/test_prompts.py snapshot coverage that this PR expands.
Colinho22/maestro#13: Introduces the shared src/maestro/strategies/_extraction.py extraction/validation flow that this PR extends with extract_diagram_type and Mermaid structural validation for step 3.
Colinho22/maestro#1: Modifies the core evaluation/strategy pipeline including src/maestro/analysis/metrics.py's evaluate_run() signature and src/maestro/run.py orchestration that this PR builds upon.

Suggested labels

enhancement

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: adding diagram-type awareness to label rendering across all strategies (single-agent, SOP, CrewAI, LangGraph).
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Docstring Coverage (Src Only)	✅ Passed	All 36 public entities (top-level functions/classes and public methods) in the 8 changed src/ files have docstrings, achieving 100% coverage well above the 80% threshold.

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/maestro/strategies/_extraction.py`:
- Around line 166-168: The return statement in the extraction method does not
normalize the diagram_type value before returning it, allowing None or
non-string values to be returned when the key exists but is malformed. Instead
of relying solely on the get() default parameter, explicitly validate and
normalize the diagram_type value after extraction. Ensure that if the extracted
value is None, empty, or not a string, the method returns the "unspecified"
fallback string. This should be done before the return statement to guarantee
callers always receive a valid string value.
- Around line 253-256: The _CONCATENATED_NODES regex pattern is matching inside
quoted label text, causing false positives for valid Mermaid nodes like
a["Service [v1] Gateway [public]"]. Modify the concatenated node detection logic
in the loop where splitlines() is used to only check for the pattern outside of
quoted strings. You can either strip out quoted portions before searching, or
skip the _CONCATENATED_NODES check entirely for lines that contain quoted labels
(lines with square brackets inside double quotes).

In `@src/maestro/strategies/crew.py`:
- Around line 188-190: The code uses find() to locate
cls._CREW_EXPECTED_OUTPUT_MARKER, which returns the first occurrence. This
causes incorrect truncation if the marker text appears in the task description
itself. Change find() to rfind() to locate the last occurrence of the marker
instead, since the CrewAI scaffold containing the marker is appended after the
task description, not embedded within it.

In `@src/maestro/strategies/single.py`:
- Around line 81-84: Replace the inline diagram_type extraction logic in the
PROMPT_TEMPLATE.format call with a call to the shared extract_diagram_type
function. Instead of using input_data.get("metadata", {}).get("diagram_type",
"unspecified"), pass the raw input_data parameter to extract_diagram_type() to
get the diagram_type value. This will ensure the single-agent strategy handles
edge cases like null or non-dict metadata safely and maintains consistency with
the SOP, Crew, and LangGraph strategies by falling back to "unspecified" in
those cases.

📥 Commits

Reviewing files that changed from the base of the PR and between e3ffba1 and be99dd9.

📒 Files selected for processing (9)

src/maestro/prompts.py
src/maestro/run.py
src/maestro/strategies/_extraction.py
src/maestro/strategies/crew.py
src/maestro/strategies/langgraph.py
src/maestro/strategies/single.py
src/maestro/strategies/sop.py
tests/strategies/test_step_validation.py
tests/test_prompts.py

- extract_diagram_type normalises present-but-malformed values (null, a number, blank) to "unspecified"; get()'s default only guarded a missing key.
- step-3 concatenation check blanks quoted labels first, so brackets inside a label (e.g. a["Service [v1] Gateway"]) are no longer a false positive.
- single-agent shares extract_diagram_type instead of duplicating the lookup, so the baseline gets the same edge-case handling as the multi-step strategies.
- crew marker strip uses rfind so an appended scaffold is cut even if the marker phrase ever appears in the prompt body.
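
The normalisation described in the first item might look roughly like this. It is a reconstruction from the review thread, not the code in _extraction.py; in particular, returning the stripped value for a well-formed string is an assumption.

```python
import json


def extract_diagram_type(raw_input: str) -> str:
    """Read metadata.diagram_type from raw JSON, falling back to "unspecified"."""
    try:
        data = json.loads(raw_input)
    except (TypeError, ValueError):  # JSONDecodeError is a ValueError
        return "unspecified"
    # Guard each level: metadata may be null or a non-dict value.
    metadata = data.get("metadata") if isinstance(data, dict) else None
    value = metadata.get("diagram_type") if isinstance(metadata, dict) else None
    # A present-but-malformed value (null, number, blank, whitespace) is
    # treated the same as a missing key.
    if isinstance(value, str) and value.strip():
        return value.strip()
    return "unspecified"
```

Checking the value after the lookup, rather than relying on a `get()` default, is what covers the case where the key exists but holds `null` or a number.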

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

tests/strategies/test_step_validation.py (1)

18-29: ⚡ Quick win

Add a whitespace-only diagram_type case to lock the normalization contract.

extract_diagram_type treats blank and whitespace-only strings as "unspecified", but this test currently only checks "". Adding a " " case will prevent regressions on that branch.

Proposed test addition

 def test_extract_diagram_type_normalizes_malformed_values():
@@
     assert extract_diagram_type('{"metadata": {"diagram_type": ""}}') == "unspecified"
+    assert extract_diagram_type('{"metadata": {"diagram_type": "   "}}') == "unspecified"
     assert extract_diagram_type('{"metadata": null}') == "unspecified"
     assert extract_diagram_type("not json") == "unspecified"

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/strategies/test_step_validation.py` around lines 18 - 29, The
test_extract_diagram_type_normalizes_malformed_values function tests various
malformed values but does not cover the whitespace-only string case. Add an
assertion after the empty string test case that calls extract_diagram_type with
a JSON object containing a diagram_type value of whitespace-only characters
(such as spaces) and verify it returns "unspecified", ensuring the normalization
logic properly handles this edge case.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/maestro/strategies/single.py`:
- Line 26: Add the required future import statement at the very beginning of the
file before all other imports, specifically add `from __future__ import
annotations` as the first line in the file, above the existing import from
maestro.strategies._extraction. This import is a required coding guideline that
must be present at the top of all Python files in this project.

📥 Commits

Reviewing files that changed from the base of the PR and between be99dd9 and 24cee1b.

📒 Files selected for processing (4)

src/maestro/strategies/_extraction.py
src/maestro/strategies/crew.py
src/maestro/strategies/single.py
tests/strategies/test_step_validation.py

🚧 Files skipped from review as they are similar to previous changes (2)

src/maestro/strategies/_extraction.py
src/maestro/strategies/crew.py

coderabbitai · 2026-06-20T12:36:59Z


 from maestro.prompts import render_rules
 from maestro.schemas import InputFile, RunConfig, RunResult, SubResult
+from maestro.strategies._extraction import extract_diagram_type


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Check if the file starts with the required future import
head -n 5 src/maestro/strategies/single.py | grep -n "from __future__ import annotations"

Repository: Colinho22/maestro

Length of output: 43

Add from __future__ import annotations at the top of the file.

The coding guidelines require from __future__ import annotations at the top of all Python files. This import is currently missing from this file and must be added before any other imports.

Source: Coding guidelines

A dropped `end` leaves a subgraph open: invalid Mermaid that mmdc rejects but the prior checks passed, so it scored as a parse failure instead of consuming the retry budget. Deeply nested diagrams (pools > lanes > subprocesses, network zones) are where a weaker model drops one. Count `subgraph` openers against standalone `end` closers and reject an imbalance; `end` is matched only as a whole-line closer so node ids like end_event_1 and labels like "End Event" never count. Verified against all 30 ground truths (zero false positives). Also pins extract_diagram_type on a whitespace-only diagram_type value.

Two ground-truth authoring conventions are not derivable from the input, so scoring them penalised models for an unreachable target:

- The optional third descriptor line of a C4 / network label is authored inconsistently (network topology includes it for some nodes and omits it for others, though the input always carries the field). The entity-name metric now compares on the name and [Type] lines only.
- Nodes the input leaves unnamed (BPMN gateways/events with name "") are labelled in the ground truth from convention the input never supplies (type name, unicode symbols, split/join). These are scored by id match only. This is conditional on the input: a node the input did name is still scored on its label, so blanking a nameable node is still penalised.

evaluate_run takes an optional input_path to read the unnamed-id set; it is backward compatible (omitted means strict label scoring) and fails soft on a missing or malformed input. On the pre-flight, network name_f1 rose from ~0.28-0.38 to ~0.88-1.0 with no regression on BPMN or C4.

Colinho22 · 2026-06-20T13:41:13Z

@coderabbitai please review the new changes introduced with fccc5ac and 625bb81. One addresses your earlier comments, the other one tackles the issues found after running more of the corpus (see PR description for more).

coderabbitai · 2026-06-20T13:43:07Z

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

REPO="Colinho22/maestro"

echo "=== COMMIT fccc5ac ==="
gh api "repos/$REPO/commits/fccc5ac94ab8d9c6669c492ff04407e50c148f65" \
  --jq '.commit.message' 2>/dev/null || true
echo "---FILES CHANGED---"
gh api "repos/$REPO/commits/fccc5ac94ab8d9c6669c492ff04407e50c148f65" \
  --jq '.files[] | {filename: .filename, additions: .additions, deletions: .deletions, patch: .patch}' 2>/dev/null || true

echo ""
echo "=== COMMIT 625bb81 ==="
gh api "repos/$REPO/commits/625bb81603e1671575d76b65617a030482bad91e" \
  --jq '.commit.message' 2>/dev/null || true
echo "---FILES CHANGED---"
gh api "repos/$REPO/commits/625bb81603e1671575d76b65617a030482bad91e" \
  --jq '.files[] | {filename: .filename, additions: .additions, deletions: .deletions, patch: .patch}' 2>/dev/null || true

Length of output: 19959

🏁 Script executed:

#!/bin/bash
set -euo pipefail

REPO="Colinho22/maestro"

echo "=== Previous CodeRabbit comments on PR `#71` ==="
gh pr view 71 --repo "$REPO" --comments 2>/dev/null | head -200 || true

echo ""
echo "=== PR review comments ==="
gh api "repos/$REPO/pulls/71/comments" \
  --jq '.[] | select(.user.login == "coderabbitai") | {id: .id, body: .body, path: .path, line: .line}' 2>/dev/null | head -300 || true

Length of output: 13512

@Colinho22 Here's my review of the two commits:

`fccc5ac` — Metrics: score only input-derivable label parts

Overall: Well-designed and correctly implemented. The two-part fix (strip descriptor line, exempt unnamed nodes) is the right approach.

✅ What looks good

_label_core correctly splits on "\\n" (literal two-char Mermaid escape) and keeps only name + [Type], dropping the inconsistently-authored third line. Applied symmetrically to both output and truth before comparison.
extract_input_unnamed_ids fails soft (empty set on any parse/IO error) rather than crashing evaluation — good defensive design.
_fuzzy_match unnamed-node path is logically correct: id-in-unnamed AND id-in-truth → count as match; falls through to label matching if id not in truth. The matched_truth index-set is used consistently across both the new path and the existing label path.
evaluate_run signature is backward compatible (input_path: Path | None = None). Existing callers are unaffected.
Test coverage is thorough — _label_core unit test, descriptor-not-scored integration test, the exemption test, and the guardrail test confirming nameable-but-blank nodes are still penalized.

⚠️ Minor observation: `[Type]`-line detection in `_label_core`

In _label_core, the detection of the [Type] line uses p.strip().startswith("["). A BPMN label like name\n[alias text with bracket would be matched as the type line, but this is contrived and doesn't exist in the corpus. No action needed.

`625bb81` — Strategies: catch unbalanced `subgraph`/`end`

Overall: Focused and correct. The counting approach is the right tool here.

✅ What looks good

stripped == "end" (exact equality after .strip()) correctly ignores end_event_1, "End Event", end_event_1[...] — the test test_step3_balance_ignores_end_in_ids_and_labels pins this precisely.
The whitespace-only diagram_type is now pinned with an assertion.
Error message includes the counts (f"unbalanced subgraph/end ({opens} subgraph, {closes} end)"), which is helpful for debugging retries.

⚠️ Minor gap — anonymous subgraph openers

startswith("subgraph ") (with trailing space) won't match an anonymous subgraph declaration that has no ID or label:

flowchart LR
subgraph
  a["A"]
end

Such a line would be counted as opens=0 while the end is counted as closes=1, producing a false "unbalanced" rejection. Anonymous subgraphs appear to be rare/absent in your corpus (the commit message notes "verified against all 30 ground truths"), but if a future diagram uses one it will be incorrectly rejected. A simple fix would be to also count stripped == "subgraph":

if stripped.startswith("subgraph ") or stripped == "subgraph":
    opens += 1

This is low priority given the corpus verification, but worth a follow-up.

Previous review comments — status

All four earlier comments appear to have been addressed:

✅ extract_diagram_type whitespace normalization — test pinned in 625bb81
✅ _CONCATENATED_NODES false positives in quoted labels — _QUOTED_SPAN.sub blanking in place
✅ crew.py find() → rfind() — per PR description
✅ single.py inline extraction → shared extract_diagram_type — per PR description

The anonymous-subgraph gap in 625bb81 is the only new item. Everything else looks solid.

A bare `subgraph` (no id) is a valid opener, but the balance check only matched `subgraph ` with a trailing space, so an anonymous subgraph would be counted as 0 opens against its 1 `end` and falsely rejected as unbalanced. Match the bare form too. None appear in the corpus, but a model emitting one during a run should not produce a spurious failed cell.
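
Put together, the balance check discussed in this thread can be sketched as below. The opener and closer conditions and the error message follow the snippets quoted in the review; the function name `subgraph_balance_error` and its return convention are assumptions.

```python
from __future__ import annotations


def subgraph_balance_error(mermaid: str) -> str | None:
    """Return an error message if subgraph openers and end closers differ."""
    opens = closes = 0
    for line in mermaid.splitlines():
        stripped = line.strip()
        # Count both named openers and the bare, anonymous form.
        if stripped.startswith("subgraph ") or stripped == "subgraph":
            opens += 1
        # Only a whole-line "end" closes a subgraph, so ids such as
        # end_event_1 and labels such as "End Event" never count.
        elif stripped == "end":
            closes += 1
    if opens != closes:
        return f"unbalanced subgraph/end ({opens} subgraph, {closes} end)"
    return None
```

A count comparison does not verify nesting order, but it is enough to catch the dropped `end` case the commit targets.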

coderabbitai

🧹 Nitpick comments (1)

src/maestro/analysis/metrics.py (1)

110-133: 💤 Low value

Redundant Path() wrapper on line 123.

input_path is already typed as Path | None, so wrapping it in Path(input_path) is unnecessary. While it works (Path accepts another Path), it adds a pointless conversion.

Suggested simplification

     try:
-        data = json.loads(Path(input_path).read_text(encoding="utf-8"))
+        data = json.loads(input_path.read_text(encoding="utf-8"))
     except (OSError, json.JSONDecodeError):

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/maestro/analysis/metrics.py` around lines 110 - 133, In the
extract_input_unnamed_ids function, the try block wraps input_path in an
unnecessary Path() constructor when calling read_text(). Since input_path is
already typed as Path | None, the Path() wrapper is redundant and should be
removed. Simply call read_text() directly on input_path to eliminate the
pointless conversion.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 565c0cc5-751e-4ed5-9190-021cbdcf565d

📥 Commits

Reviewing files that changed from the base of the PR and between 24cee1b and 53780cd.

📒 Files selected for processing (5)

src/maestro/analysis/metrics.py
src/maestro/run.py
src/maestro/strategies/_extraction.py
tests/analysis/test_label_scoring.py
tests/strategies/test_step_validation.py

🚧 Files skipped from review as they are similar to previous changes (2)

src/maestro/run.py
src/maestro/strategies/_extraction.py

input_path is already Path after the None guard, so wrapping it in Path() before read_text() was a no-op.
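The pattern this commit message describes can be sketched as follows. It is a minimal illustration, assuming a hypothetical helper load_input_json; the real extract_input_unnamed_ids does more than this, and only the None guard, the read_text call, and the exception tuple are taken from the review.

```python
import json
from pathlib import Path
from typing import Any, Optional


def load_input_json(input_path: Optional[Path]) -> Optional[Any]:
    """Parse a JSON input file, returning None if it is absent or unreadable.

    Hypothetical helper: name and return convention are illustrative only.
    """
    # None guard: past this point input_path is known to be a Path.
    if input_path is None:
        return None
    try:
        # No Path(...) wrapper needed; Path(p) just builds an equal Path.
        return json.loads(input_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None
```

Calling read_text() directly keeps behaviour identical, since constructing a Path from an existing Path yields an equal object.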

Colinho22 added 2 commits June 19, 2026 23:17

Colinho22 added this to the 🧪 Experimental Artifact milestone Jun 20, 2026

Colinho22 self-assigned this Jun 20, 2026

Colinho22 added the bug (Something isn't working) label Jun 20, 2026

coderabbitai Bot reviewed Jun 20, 2026

Comment thread src/maestro/strategies/_extraction.py Outdated

Comment thread src/maestro/strategies/_extraction.py

Comment thread src/maestro/strategies/crew.py Outdated

Comment thread src/maestro/strategies/single.py Outdated

Colinho22 changed the title from "fix(prompts): give every strategy a fair shot at the label contract" to "feat: add diagram-type-aware label rendering across strategies" Jun 20, 2026

coderabbitai Bot reviewed Jun 20, 2026

Colinho22 added 2 commits June 20, 2026 15:25

coderabbitai Bot reviewed Jun 20, 2026

fix(metrics): drop redundant Path() wrapper in extract_input_unnamed_ids

cbdbe7d

input_path is already Path after the None guard, so wrapping it in Path() before read_text() was a no-op.

Colinho22 merged commit 557ff65 into main Jun 20, 2026
2 checks passed

Colinho22 deleted the fix-prompt-strategies branch June 20, 2026 19:16
