feat: add diagram-type-aware label rendering across strategies #71

Merged
Colinho22 merged 7 commits into main from fix-prompt-strategies
Jun 20, 2026

Conversation

@Colinho22

@Colinho22 Colinho22 commented Jun 20, 2026

Owner

Summary

Reviewing real strategy outputs across the corpus surfaced several places where
a strategy scored badly not because of model capability, but because the prompt
contract or the metric did not match what the input actually provides. This PR
closes those gaps so the scored run measures orchestration quality rather than
unpredictable authoring conventions.

This is an RC (v1.0.0-rc.1) change that alters scoring behaviour, so it is a
version bump: the pre-change measurement rows must be cleared before the scored
run (nothing here is in a published dataset yet). The prompt-contract and metric
changes are deliberately bundled: both are the same "make the task well-posed"
pass and were validated together in an 80-cell pre-flight covering every
diagram type and tier.

What the investigation found

The headline early misread, now corrected: the multi-step strategies were not
hallucinating nodes. false_entities means "id matches truth but label
differs", not "invented node". The real issues were label fidelity and two
ground-truth conventions the input does not supply.

Changes

Notation-aware labels (prompts.py, _extraction.py, all strategies)

  • The label rule branches by notation: C4 container and network topology get the
    name\n[Type]\ntech multi-line label; BPMN process and collaboration keep
    bare names. This fixes C4 (was broken) without regressing BPMN.
  • diagram_type is read from input metadata and passed to every strategy's
    render step as task context (the baseline already saw it implicitly; this
    makes it explicit and uniform), so the convention is applied consistently
    rather than guessed. The same context reaches single-agent, SOP, CrewAI, and
    LangGraph identically, so it is a controlled variable.
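The branching label rule can be illustrated with a small sketch. In the PR the rule is expressed in the prompt text rather than as a function, so `render_label`, `MULTILINE_NOTATIONS`, and the exact diagram-type spellings below are hypothetical and only show the intended output shape.

```python
# Hypothetical helper: the real rule lives in the prompt contract, not in code.
# The diagram-type spellings are assumptions based on the pre-flight table.
MULTILINE_NOTATIONS = {"c4 container", "network topology"}


def render_label(name: str, type_: str, tech: str, diagram_type: str) -> str:
    """Return the node label the notation-aware rule asks for."""
    if diagram_type.strip().lower() in MULTILINE_NOTATIONS:
        # Mermaid labels use the literal two-character escape \n as a line break.
        parts = [name, f"[{type_}]"]
        if tech:
            parts.append(tech)
        return "\\n".join(parts)
    # BPMN process and collaboration diagrams keep the bare name.
    return name


print(render_label("API", "Container", "FastAPI", "c4 container"))
print(render_label("Approve order", "task", "", "bpmn process"))
```

Keeping the branch keyed on diagram_type alone is what lets the same context act as a controlled variable across strategies.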

IR fidelity and subgraph labels (_extraction.py)

  • Step-1 IR carries a short tech label (not the long description) and extracts
    every referenced parent group as a named entity, so container and boundary
    subgraphs render with their labels instead of empty brackets. The hierarchy
    rule now covers C4 boundaries and deployment environments and requires a
    quoted subgraph label.
  • Step 1 no longer turns a metadata or summary field into a phantom entity.

Step-3 structural validation (_extraction.py)

  • Rejects empty-label brackets, concatenated node defs, and unbalanced
    subgraph/end (a dropped end is invalid Mermaid that mmdc rejects). These
    consume the retry budget instead of scoring as a parse failure. end is
    matched only as a whole-line closer, and the concatenation scan blanks quoted
    labels first so brackets inside a label are not a false positive. Verified
    against all 30 ground truths with zero false positives.
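A minimal sketch of the empty-label and concatenation checks, under stated assumptions: the PR names `_EMPTY_LABEL`, `_CONCATENATED_NODES`, and a quoted-span blanking step but does not show the expressions, so the regexes and the `shape_errors` function below are illustrative guesses. As a simplification, this sketch blanks quoted spans before both checks.

```python
import re

# Assumed patterns; the actual expressions in _extraction.py are not shown in the PR.
_QUOTED_SPAN = re.compile(r'"[^"]*"')
_EMPTY_LABEL = re.compile(r'\[\s*(?:""\s*)?\]')
_CONCATENATED_NODES = re.compile(r'\]\s*\w+\[')


def _blank_quoted(line: str) -> str:
    # Keep empty quotes empty, replace any other quoted text with a placeholder,
    # so brackets inside a label cannot trigger either pattern.
    return _QUOTED_SPAN.sub(lambda m: '""' if m.group(0) == '""' else '"_"', line)


def shape_errors(mermaid: str) -> list[str]:
    """Collect structural problems that should consume a retry."""
    errors = []
    for line in mermaid.splitlines():
        blanked = _blank_quoted(line)
        if _EMPTY_LABEL.search(blanked):
            errors.append(f"empty label: {line.strip()}")
        if _CONCATENATED_NODES.search(blanked):
            errors.append(f"concatenated node defs: {line.strip()}")
    return errors


print(shape_errors('a["Service [v1] Gateway [public]"]'))
print(shape_errors('a["A"]b["B"]'))
```

Returning a list of messages rather than a boolean makes it easy to feed the reason back into the retry prompt, though whether the real validator does so is not stated here.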

Metric scores only the input-derivable label (metrics.py, run.py)

  • The optional third descriptor line of a C4 / network label is authored
    inconsistently (network topology includes it for some nodes and omits it for
    others), so the entity-name metric compares on the name and [Type] lines
    only.
  • Nodes the input leaves unnamed (BPMN gateways/events with name "") are
    labelled in the ground truth from convention the input never supplies (type
    name, unicode symbols, split/join). These are scored by id match only.
    Conditional on the input: a node the input did name is still scored on its
    label, so blanking a nameable node is still penalised.
  • evaluate_run gains an optional input_path; backward compatible (omitted =
    strict label scoring) and fails soft on a missing or malformed input.
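The two scoring rules can be sketched as follows. This is a simplified illustration, not the code in metrics.py: the real matcher is fuzzy, while `name_matches` here compares exactly, and the labels in the usage lines are made up for the example.

```python
def label_core(label: str) -> str:
    """Keep the name line and, if present, the [Type] line; drop the descriptor."""
    # Mermaid line breaks inside a label are the literal two characters \ and n.
    parts = [p.strip() for p in label.split("\\n")]
    core = parts[:1]
    if len(parts) > 1 and parts[1].startswith("["):
        core.append(parts[1])
    return "\\n".join(core)


def name_matches(out_id, out_label, truth, input_unnamed_ids=frozenset()):
    """Exact-match stand-in for the fuzzy matcher (an assumption of this sketch)."""
    if out_id in input_unnamed_ids and out_id in truth:
        return True  # the input left this node unnamed: id match only
    wanted = label_core(out_label)
    return any(label_core(t) == wanted for t in truth.values())


# Illustrative ground truth, not taken from the corpus.
truth = {"db": "Orders DB\\n[Container]\\nPostgreSQL 15", "gw1": "Exclusive Gateway"}
print(name_matches("db", "Orders DB\\n[Container]", truth))        # descriptor ignored
print(name_matches("gw1", "", truth, frozenset({"gw1"})))          # exempt: unnamed in input
print(name_matches("gw1", "", truth))                              # strict: blank label fails
```

The last call shows the guardrail: without the unnamed-id set, a blank label is still penalised.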

CrewAI confound removed (crew.py)

  • _collapse_messages strips CrewAI's injected "expected output" scaffolding so
    its delivered prompt is byte-identical to SOP, removing a prompt-content
    difference that confounded the orchestration comparison.
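The truncation step might look like the sketch below. The marker string is a placeholder, since the text CrewAI actually appends is not shown in this PR, and `strip_scaffold` is a hypothetical name for the relevant part of _collapse_messages.

```python
# Placeholder marker; the real _CREW_EXPECTED_OUTPUT_MARKER value is not shown in the PR.
EXPECTED_OUTPUT_MARKER = "<<EXPECTED OUTPUT>>"


def strip_scaffold(prompt: str, marker: str = EXPECTED_OUTPUT_MARKER) -> str:
    """Cut an appended scaffold, starting at the last occurrence of the marker."""
    cut = prompt.rfind(marker)
    if cut == -1:
        return prompt  # nothing appended, leave the prompt untouched
    return prompt[:cut].rstrip()


body = "Render the diagram as Mermaid."
print(strip_scaffold(body + "\n" + EXPECTED_OUTPUT_MARKER + " a Mermaid block"))
```

Searching from the right means a prompt body that happens to contain the marker phrase is not truncated early, because the scaffold is appended after the task description.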

CrewAI trace prompt (run.py)

  • CREWAI_TESTING=true short-circuits CrewAI's first-execution "view traces?"
    prompt, which blocked on stdin for a 20s timeout per crew. In a headless
    container the declined preference is wiped each --rm, so it re-fired on
    every cell: hours of dead waiting on a full run, and it corrupted
    duration_ms.
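A small sketch of the environment setup, assuming it runs before any crew is executed. The walkthrough states that run.py uses os.environ.setdefault; the wrapper function `configure_headless_env` is hypothetical and exists only so the behaviour can be shown on a plain dict.

```python
import os
from collections.abc import MutableMapping


def configure_headless_env(env: MutableMapping = os.environ) -> None:
    """Set CREWAI_TESTING unless the caller has already chosen a value."""
    # setdefault leaves an explicit value from the surrounding environment untouched.
    env.setdefault("CREWAI_TESTING", "true")


fresh = {}
configure_headless_env(fresh)
print(fresh)
```

Using setdefault rather than plain assignment keeps an operator's explicit override intact.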

Pre-flight results

80 cells (one input per diagram_type x tier, 4 strategies, 2 Mistral models).
After the changes, entity_name_f1 on the previously-failing inputs:

| input | type | before | after |
|---|---|---|---|
| it_3_26 | network topology, tier 3 | 0.284 | 0.996 |
| it_1_10 | network topology, tier 1 | 0.375 | 0.875 |
| it_2_17 | network topology, tier 2 | 0.769 | 1.000 |
| it_3_28 | c4 container, tier 3 | 0.569 | 0.943 |

BPMN and C4 tier-1 stayed at ~1.0 (no regression). Remaining lower scores (e.g.
a weak model on a tier-3 BPMN) are genuine model failures, correctly recorded.

Testing

  • New tests/analysis/test_label_scoring.py: descriptor stripping, the
    input-unnamed-node exemption, and the guardrail that a named node is still
    scored on its label.
  • tests/strategies/test_step_validation.py: empty-bracket, concatenated-node,
    and unbalanced-subgraph cases, plus extract_diagram_type normalisation
    (null / number / blank / whitespace).
  • tests/test_prompts.py: snapshot updated for the new label and hierarchy
    rules; diagram_type placeholder assertions.
  • Full suite: 257 passed. ruff check and ruff format --check clean.

Note for reviewers

The label convention is genuinely input-dependent (C4 vs BPMN notation) and two
ground-truth conventions are not derivable from the input, so this PR defines the
scored surface: the name + type label core, with the descriptor line and
labels for input-unnamed nodes explicitly out of contract. This is the
SQuAD / ToTTo-style "score what the input determines" principle, and it is
documented here so the thesis metric section can cite it.

Summary by CodeRabbit

  • New Features

    • Added diagram-type context end-to-end to improve diagram-specific prompt and rendering behavior.
    • Improved scoring so entities left unnamed in the input can still be evaluated via identifier-based matching.
  • Bug Fixes

    • Strengthened Mermaid validation to reject common malformed outputs (empty labels, concatenated nodes, and unbalanced subgraphs).
    • Non-interactive runs now behave more predictably by disabling the “view your traces?” prompt behavior.
  • Documentation

    • Updated Mermaid generation rules with expanded node-label formatting guidance and broader hierarchy preservation instructions.
  • Tests

    • Expanded Step 3 and label-scoring coverage, including malformed/edge-case handling.

CREWAI_TRACING_ENABLED=false governs only the enabled path, not the
first-execution 'view traces? [y/N]' prompt, which blocks on stdin for a
20s timeout per crew. The file-based declined preference is wiped on every
docker --rm, so it re-fires on every cell: ~1500 crew_ai cells means hours
of dead waiting and corrupted duration_ms. CREWAI_TESTING=true
short-circuits the prompt before any stdin read.
…ep-3 validation

Empirical review of strategy outputs found the multi-step strategies
scored entity_name_f1 near 0 on C4 inputs: their JSON IR flattened the
name/[Type]/tech label structure the ground truth uses, and the shared
rules never taught that format. Fixes, verified to lift name_f1 to ~1.0
on both C4 and BPMN across all four strategies:

- Label rule is now notation-aware: C4 and network-topology diagrams get
  the name/[Type]/tech multi-line label; BPMN process and collaboration
  diagrams keep bare names. diagram_type is read from the input metadata
  and passed to every strategy's render step as task context (the baseline
  already saw it; this makes it uniform and explicit), so the convention
  is applied consistently rather than guessed.
- Step-1 IR carries a short tech label (not the long description) and
  extracts every referenced parent group as a named entity, so container
  and boundary subgraphs render with their labels instead of empty
  brackets. The hierarchy rule now covers C4 boundaries and deployment
  environments and requires a quoted subgraph label.
- Step-3 gains a light structural check that rejects empty-label brackets
  and concatenated node defs, so a malformed diagram consumes the retry
  budget instead of landing as a scored parse failure.
- Strip CrewAI's expected-output scaffolding so its delivered prompt is
  byte-identical to SOP, removing a confound between the two strategies.
@Colinho22 Colinho22 added this to the 🧪 Experimental Artifact milestone Jun 20, 2026
@Colinho22 Colinho22 self-assigned this Jun 20, 2026
@Colinho22 Colinho22 added the bug Something isn't working label Jun 20, 2026
@coderabbitai

coderabbitai Bot commented Jun 20, 2026


Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a8c3c9be-125e-4f42-b8a9-b4b7a6537098

📥 Commits

Reviewing files that changed from the base of the PR and between 53780cd and cbdbe7d.

📒 Files selected for processing (1)
  • src/maestro/analysis/metrics.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/maestro/analysis/metrics.py

📝 Walkthrough

Walkthrough

The PR threads a diagram_type value extracted from input metadata through all four strategy implementations (single, sop, crew, langgraph) into prompt construction. It expands the MERMAID_RULES contract with node-label and hierarchy rules, adds regex-based structural validation for step-3 Mermaid outputs, extends entity scoring to recognize unnamed input nodes and exempt them from label-mismatch penalties, strips CrewAI's auto-appended expected-output scaffolding from composed prompts, and sets CREWAI_TESTING=true to suppress interactive trace prompts.

Changes

diagram_type propagation, Mermaid validation, input-aware metrics, and prompt updates

| Layer | File(s) | Summary |
|---|---|---|
| Prompt contracts, entity schema, and extract_diagram_type helper | src/maestro/prompts.py, src/maestro/strategies/_extraction.py | MERMAID_RULES gains node-label construction rules (architecture/infrastructure multi-line formatting, process diagram simplification) and broadened hierarchy guidance for system boundaries and deployment environments. Step-1 extraction prompt adds optional tech field to entity objects and expands parent-group inclusion. Step-3 template gains {diagram_type} placeholder. New extract_diagram_type(raw_input) parses metadata.diagram_type from raw JSON with "unspecified" fallback on parse/type errors. re module imported. |
| Step-3 Mermaid structural smoke-check | src/maestro/strategies/_extraction.py | validate_step_output routes step-3 outputs through _validate_mermaid_shape instead of generic non-empty check. Validator uses compiled regexes (_EMPTY_LABEL, _CONCATENATED_NODES) to reject empty-bracket node labels and concatenated definitions without separators. Blanks quoted spans before concatenation scan. Counts subgraph/end nesting and rejects unbalanced diagrams. |
| diagram_type threading in SingleAgentStrategy | src/maestro/strategies/single.py | PROMPT_TEMPLATE extended with {diagram_type} placeholder. run() derives diagram_type via extract_diagram_type(raw) and formats template with both diagram_type and input_data. |
| diagram_type threading in SOPStrategy | src/maestro/strategies/sop.py | Imports extract_diagram_type. run() computes diagram_type from raw input. _build_prompt signature extended to accept diagram_type. Prompt formatting variables include diagram_type alongside input_data. |
| diagram_type threading in LangGraphStrategy | src/maestro/strategies/langgraph.py | Imports extract_diagram_type. GraphState extended with diagram_type: str channel. run() computes and passes diagram_type to compiled graph initial state. generate_mermaid_node formats STEP_3_PROMPT with diagram_type from state. |
| diagram_type threading in CrewAIStrategy | src/maestro/strategies/crew.py | Imports extract_diagram_type. run() computes diagram_type from raw input and passes into _build_prompt. _build_prompt signature extended to accept diagram_type; template variables now include diagram_type. |
| CrewAI prompt hygiene and trace suppression | src/maestro/strategies/crew.py, src/maestro/run.py | _collapse_messages rebuilds provider prompt from user-role content, locates _CREW_EXPECTED_OUTPUT_MARKER via rfind, truncates at marker, and returns trimmed prompt, removing CrewAI's expected-output scaffolding for byte-identical SOP input. run.py adds CREWAI_TESTING=true via os.environ.setdefault to suppress interactive trace prompt in batch runs. |
| Input-aware entity scoring for unnamed input nodes | src/maestro/analysis/metrics.py | New extract_input_unnamed_ids(input_path) parses input JSON for node IDs with empty name field. _label_core normalizes labels by dropping optional third descriptor line. _fuzzy_match accepts input_unnamed_ids and treats ID matches as correct name-matches for unnamed nodes (exempting label mismatch). Exported compute_entity_metrics_fuzzy, compute_entity_metrics_lemma, compute_entity_taxonomy accept input_unnamed_ids. evaluate_run accepts optional input_path, extracts unnamed_ids, and passes into scoring/taxonomy. |
| Tests: validation, snapshots, and label-scoring | tests/strategies/test_step_validation.py, tests/test_prompts.py, tests/analysis/test_label_scoring.py | test_step_validation adds extract_diagram_type normalization test, well-formed Mermaid acceptance, and negative cases for empty-bracket labels and concatenated nodes. test_prompts extends MERMAID_RULES snapshot and asserts {diagram_type} in both templates. New test_label_scoring.py validates _label_core behavior, descriptor-line invariance, unnamed-node id-matching, exemption guardrail, and extract_input_unnamed_ids robustness. |

Sequence Diagram

sequenceDiagram
  participant InputFile as Input JSON
  participant Strategy as Strategy.run()
  participant extract_type as extract_diagram_type()
  participant BuildPrompt as _build_prompt()
  participant LLM as LLM Provider
  participant Metrics as evaluate_run()
  participant InputAware as Metrics with input_path

  InputFile->>Strategy: raw JSON string
  Strategy->>extract_type: raw input
  extract_type-->>Strategy: diagram_type or "unspecified"
  Strategy->>BuildPrompt: input_data, diagram_type, step_outputs
  BuildPrompt-->>Strategy: formatted prompt with diagram_type
  Strategy->>LLM: complete prompt
  LLM-->>Strategy: Mermaid output
  Strategy->>Metrics: run_id, output, ground_truth, input_path
  Metrics->>InputAware: extract unnamed_ids from input_path
  InputAware-->>Metrics: unnamed node set
  Metrics-->>Metrics: score entities with input-aware exemption

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~28 minutes

Possibly related PRs

  • Colinho22/maestro#61: Directly modifies the same MERMAID_RULES contract in src/maestro/prompts.py and corresponding tests/test_prompts.py snapshot coverage that this PR expands.
  • Colinho22/maestro#13: Introduces the shared src/maestro/strategies/_extraction.py extraction/validation flow that this PR extends with extract_diagram_type and Mermaid structural validation for step 3.
  • Colinho22/maestro#1: Modifies the core evaluation/strategy pipeline including src/maestro/analysis/metrics.py's evaluate_run() signature and src/maestro/run.py orchestration that this PR builds upon.

Suggested labels

enhancement

Poem

🐇 Hoppity-hop through the diagram land,
Where diagram_type flows from each node to hand!
Empty brackets? Rejected with flair,
Concatenated nodes? Caught right there.
Unnamed entities? Scored by their ID—
A rabbit's clean prompts make diagrams pristine! 🎨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding diagram-type awareness to label rendering across all strategies (single-agent, SOP, CrewAI, LangGraph). |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Docstring Coverage (Src Only) | ✅ Passed | All 36 public entities (top-level functions/classes and public methods) in the 8 changed src/ files have docstrings, achieving 100% coverage well above the 80% threshold. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/maestro/strategies/_extraction.py`:
- Around line 166-168: The return statement in the extraction method does not
normalize the diagram_type value before returning it, allowing None or
non-string values to be returned when the key exists but is malformed. Instead
of relying solely on the get() default parameter, explicitly validate and
normalize the diagram_type value after extraction. Ensure that if the extracted
value is None, empty, or not a string, the method returns the "unspecified"
fallback string. This should be done before the return statement to guarantee
callers always receive a valid string value.
- Around line 253-256: The _CONCATENATED_NODES regex pattern is matching inside
quoted label text, causing false positives for valid Mermaid nodes like
a["Service [v1] Gateway [public]"]. Modify the concatenated node detection logic
in the loop where splitlines() is used to only check for the pattern outside of
quoted strings. You can either strip out quoted portions before searching, or
skip the _CONCATENATED_NODES check entirely for lines that contain quoted labels
(lines with square brackets inside double quotes).

In `@src/maestro/strategies/crew.py`:
- Around line 188-190: The code uses find() to locate
cls._CREW_EXPECTED_OUTPUT_MARKER, which returns the first occurrence. This
causes incorrect truncation if the marker text appears in the task description
itself. Change find() to rfind() to locate the last occurrence of the marker
instead, since the CrewAI scaffold containing the marker is appended after the
task description, not embedded within it.

In `@src/maestro/strategies/single.py`:
- Around line 81-84: Replace the inline diagram_type extraction logic in the
PROMPT_TEMPLATE.format call with a call to the shared extract_diagram_type
function. Instead of using input_data.get("metadata", {}).get("diagram_type",
"unspecified"), pass the raw input_data parameter to extract_diagram_type() to
get the diagram_type value. This will ensure the single-agent strategy handles
edge cases like null or non-dict metadata safely and maintains consistency with
the SOP, Crew, and LangGraph strategies by falling back to "unspecified" in
those cases.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bb063cbb-bcda-490a-bb60-552a1fc1001d

📥 Commits

Reviewing files that changed from the base of the PR and between e3ffba1 and be99dd9.

📒 Files selected for processing (9)
  • src/maestro/prompts.py
  • src/maestro/run.py
  • src/maestro/strategies/_extraction.py
  • src/maestro/strategies/crew.py
  • src/maestro/strategies/langgraph.py
  • src/maestro/strategies/single.py
  • src/maestro/strategies/sop.py
  • tests/strategies/test_step_validation.py
  • tests/test_prompts.py

@Colinho22 Colinho22 changed the title from "fix(prompts): give every strategy a fair shot at the label contract" to "feat: add diagram-type-aware label rendering across strategies" Jun 20, 2026
- extract_diagram_type normalises present-but-malformed values (null, a
  number, blank) to "unspecified"; get()'s default only guarded a missing key.
- step-3 concatenation check blanks quoted labels first, so brackets inside a
  label (e.g. a["Service [v1] Gateway"]) are no longer a false positive.
- single-agent shares extract_diagram_type instead of duplicating the lookup,
  so the baseline gets the same edge-case handling as the multi-step strategies.
- crew marker strip uses rfind so an appended scaffold is cut even if the
  marker phrase ever appears in the prompt body.
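The normalisation described in the first bullet can be sketched like this. It is an illustration of the stated behaviour (the test assertions later in the thread), not the code from _extraction.py; in particular, stripping surrounding whitespace from a valid value is an assumption.

```python
import json


def extract_diagram_type(raw_input: str) -> str:
    """Return metadata.diagram_type, or "unspecified" for anything malformed."""
    try:
        data = json.loads(raw_input)
    except (TypeError, ValueError):  # json.JSONDecodeError subclasses ValueError
        return "unspecified"
    metadata = data.get("metadata") if isinstance(data, dict) else None
    value = metadata.get("diagram_type") if isinstance(metadata, dict) else None
    # A present-but-malformed value (null, number, blank) falls through to the default.
    if isinstance(value, str) and value.strip():
        return value.strip()  # assumption: surrounding whitespace is dropped
    return "unspecified"


print(extract_diagram_type('{"metadata": {"diagram_type": "c4 container"}}'))
print(extract_diagram_type('{"metadata": {"diagram_type": null}}'))
```

Checking the type after the lookup, instead of relying on the default of get(), is what covers the case where the key exists but holds an unusable value.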

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/strategies/test_step_validation.py (1)

18-29: ⚡ Quick win

Add a whitespace-only diagram_type case to lock the normalization contract.

extract_diagram_type treats blank and whitespace-only strings as "unspecified", but this test currently only checks "". Adding a " " case will prevent regressions on that branch.

Proposed test addition
 def test_extract_diagram_type_normalizes_malformed_values():
@@
     assert extract_diagram_type('{"metadata": {"diagram_type": ""}}') == "unspecified"
+    assert extract_diagram_type('{"metadata": {"diagram_type": "   "}}') == "unspecified"
     assert extract_diagram_type('{"metadata": null}') == "unspecified"
     assert extract_diagram_type("not json") == "unspecified"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/strategies/test_step_validation.py` around lines 18 - 29, The
test_extract_diagram_type_normalizes_malformed_values function tests various
malformed values but does not cover the whitespace-only string case. Add an
assertion after the empty string test case that calls extract_diagram_type with
a JSON object containing a diagram_type value of whitespace-only characters
(such as spaces) and verify it returns "unspecified", ensuring the normalization
logic properly handles this edge case.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/maestro/strategies/single.py`:
- Line 26: Add the required future import statement at the very beginning of the
file before all other imports, specifically add `from __future__ import
annotations` as the first line in the file, above the existing import from
maestro.strategies._extraction. This import is a required coding guideline that
must be present at the top of all Python files in this project.

---

Nitpick comments:
In `@tests/strategies/test_step_validation.py`:
- Around line 18-29: The test_extract_diagram_type_normalizes_malformed_values
function tests various malformed values but does not cover the whitespace-only
string case. Add an assertion after the empty string test case that calls
extract_diagram_type with a JSON object containing a diagram_type value of
whitespace-only characters (such as spaces) and verify it returns "unspecified",
ensuring the normalization logic properly handles this edge case.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3c5f9b06-a8db-42e9-b793-a365001c7250

📥 Commits

Reviewing files that changed from the base of the PR and between be99dd9 and 24cee1b.

📒 Files selected for processing (4)
  • src/maestro/strategies/_extraction.py
  • src/maestro/strategies/crew.py
  • src/maestro/strategies/single.py
  • tests/strategies/test_step_validation.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/maestro/strategies/_extraction.py
  • src/maestro/strategies/crew.py


from maestro.prompts import render_rules
from maestro.schemas import InputFile, RunConfig, RunResult, SubResult
from maestro.strategies._extraction import extract_diagram_type


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Check if the file starts with the required future import

head -n 5 src/maestro/strategies/single.py | grep -n "from __future__ import annotations"

Repository: Colinho22/maestro

Length of output: 43


Add from __future__ import annotations at the top of the file.

The coding guidelines require from __future__ import annotations at the top of all Python files. This import is currently missing from this file and must be added before any other imports.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/maestro/strategies/single.py` at line 26, Add the required future import
statement at the very beginning of the file before all other imports,
specifically add `from __future__ import annotations` as the first line in the
file, above the existing import from maestro.strategies._extraction. This import
is a required coding guideline that must be present at the top of all Python
files in this project.

Source: Coding guidelines

A dropped `end` leaves a subgraph open: invalid Mermaid that mmdc rejects
but the prior checks passed, so it scored as a parse failure instead of
consuming the retry budget. Deeply nested diagrams (pools > lanes >
subprocesses, network zones) are where a weaker model drops one. Count
`subgraph` openers against standalone `end` closers and reject an
imbalance; `end` is matched only as a whole-line closer so node ids like
end_event_1 and labels like "End Event" never count. Verified against all
30 ground truths (zero false positives).

Also pins extract_diagram_type on a whitespace-only diagram_type value.

Two ground-truth authoring conventions are not derivable from the input, so
scoring them penalised models for an unreachable target:

- The optional third descriptor line of a C4 / network label is authored
  inconsistently (network topology includes it for some nodes and omits it
  for others, though the input always carries the field). The entity-name
  metric now compares on the name and [Type] lines only.
- Nodes the input leaves unnamed (BPMN gateways/events with name "") are
  labelled in the ground truth from convention the input never supplies
  (type name, unicode symbols, split/join). These are scored by id match
  only. This is conditional on the input: a node the input did name is still
  scored on its label, so blanking a nameable node is still penalised.

evaluate_run takes an optional input_path to read the unnamed-id set; it is
backward compatible (omitted means strict label scoring) and fails soft on a
missing or malformed input. On the pre-flight, network name_f1 rose from
~0.28-0.38 to ~0.88-1.0 with no regression on BPMN or C4.
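The fail-soft read of the unnamed-id set could look roughly like the sketch below. The input schema is an assumption: a top-level "nodes" list whose items carry "id" and "name" is a guess for illustration, and the real extract_input_unnamed_ids may walk a different structure.

```python
import json
from pathlib import Path


def extract_input_unnamed_ids(input_path) -> frozenset:
    """Return ids of nodes whose name is blank; empty set on any read/parse problem."""
    if input_path is None:
        return frozenset()  # omitted input means strict label scoring
    try:
        data = json.loads(Path(input_path).read_text(encoding="utf-8"))
        nodes = data["nodes"]  # ASSUMED key; the actual schema is not shown in the PR
        return frozenset(
            n["id"]
            for n in nodes
            if isinstance(n.get("name"), str) and not n["name"].strip()
        )
    except (OSError, ValueError, KeyError, TypeError, AttributeError):
        # Fail soft: a missing or malformed input must not crash evaluation.
        return frozenset()


print(extract_input_unnamed_ids(None))
print(extract_input_unnamed_ids("does_not_exist.json"))
```

Returning an empty set on failure means evaluation degrades to strict label scoring instead of raising.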
@Colinho22

Owner Author

@coderabbitai please review the new changes introduced with fccc5ac and 625bb81. One addresses your earlier comments, the other one tackles the issues found after running more of the corpus (see PR description for more).

@coderabbitai

coderabbitai Bot commented Jun 20, 2026


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

REPO="Colinho22/maestro"

echo "=== COMMIT fccc5ac ==="
gh api "repos/$REPO/commits/fccc5ac94ab8d9c6669c492ff04407e50c148f65" \
  --jq '.commit.message' 2>/dev/null || true
echo "---FILES CHANGED---"
gh api "repos/$REPO/commits/fccc5ac94ab8d9c6669c492ff04407e50c148f65" \
  --jq '.files[] | {filename: .filename, additions: .additions, deletions: .deletions, patch: .patch}' 2>/dev/null || true

echo ""
echo "=== COMMIT 625bb81 ==="
gh api "repos/$REPO/commits/625bb81603e1671575d76b65617a030482bad91e" \
  --jq '.commit.message' 2>/dev/null || true
echo "---FILES CHANGED---"
gh api "repos/$REPO/commits/625bb81603e1671575d76b65617a030482bad91e" \
  --jq '.files[] | {filename: .filename, additions: .additions, deletions: .deletions, patch: .patch}' 2>/dev/null || true

Length of output: 19959


🏁 Script executed:

#!/bin/bash
set -euo pipefail

REPO="Colinho22/maestro"

echo "=== Previous CodeRabbit comments on PR `#71` ==="
gh pr view 71 --repo "$REPO" --comments 2>/dev/null | head -200 || true

echo ""
echo "=== PR review comments ==="
gh api "repos/$REPO/pulls/71/comments" \
  --jq '.[] | select(.user.login == "coderabbitai") | {id: .id, body: .body, path: .path, line: .line}' 2>/dev/null | head -300 || true

Length of output: 13512



@Colinho22 Here's my review of the two commits:


fccc5ac — Metrics: score only input-derivable label parts

Overall: Well-designed and correctly implemented. The two-part fix (strip descriptor line, exempt unnamed nodes) is the right approach.

✅ What looks good

  • _label_core correctly splits on "\\n" (literal two-char Mermaid escape) and keeps only name + [Type], dropping the inconsistently-authored third line. Applied symmetrically to both output and truth before comparison.
  • extract_input_unnamed_ids fails soft (empty set on any parse/IO error) rather than crashing evaluation — good defensive design.
  • _fuzzy_match unnamed-node path is logically correct: id-in-unnamed AND id-in-truth → count as match; falls through to label matching if id not in truth. The matched_truth index-set is used consistently across both the new path and the existing label path.
  • evaluate_run signature is backward compatible (input_path: Path | None = None). Existing callers are unaffected.
  • Test coverage is thorough — _label_core unit test, descriptor-not-scored integration test, the exemption test, and the guardrail test confirming nameable-but-blank nodes are still penalized.

⚠️ Minor observation — anonymous subgraph opener (in 625bb81 context, but worth noting for _label_core too)

In _label_core, the detection of the [Type] line uses p.strip().startswith("["). A BPMN label like name\n[alias text with bracket would be matched as the type line, but this is contrived and doesn't exist in the corpus. No action needed.


625bb81 — Strategies: catch unbalanced subgraph/end

Overall: Focused and correct. The counting approach is the right tool here.

✅ What looks good

  • stripped == "end" (exact equality after .strip()) correctly ignores end_event_1, "End Event", end_event_1[...] — the test test_step3_balance_ignores_end_in_ids_and_labels pins this precisely.
  • The whitespace-only diagram_type is now pinned with an assertion.
  • Error message includes the counts (f"unbalanced subgraph/end ({opens} subgraph, {closes} end)"), which is helpful for debugging retries.

⚠️ Minor gap — anonymous subgraph openers

startswith("subgraph ") (with trailing space) won't match an anonymous subgraph declaration that has no ID or label:

flowchart LR
subgraph
  a["A"]
end

Such a line would be counted as opens=0 while the end is counted as closes=1, producing a false "unbalanced" rejection. Anonymous subgraphs appear to be rare/absent in your corpus (the commit message notes "verified against all 30 ground truths"), but if a future diagram uses one it will be incorrectly rejected. A simple fix would be to also count stripped == "subgraph":

if stripped.startswith("subgraph ") or stripped == "subgraph":
    opens += 1

This is low priority given the corpus verification, but worth a follow-up.
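
As a rough illustration of the whole check with the bare form included, here is a minimal sketch. The function name `check_subgraph_balance` and its return convention are assumptions made for this example; only the two matching rules and the error message format come from the commit under review.

```python
from __future__ import annotations


def check_subgraph_balance(mermaid: str) -> str | None:
    """Return an error message if subgraph/end counts differ, else None.

    Hypothetical helper; the real validation step may be structured differently.
    """
    opens = closes = 0
    for line in mermaid.splitlines():
        stripped = line.strip()
        # Named opener ("subgraph lane_1") or anonymous opener ("subgraph").
        if stripped.startswith("subgraph ") or stripped == "subgraph":
            opens += 1
        # Exact equality, so ids such as end_event_1 are not counted as closers.
        elif stripped == "end":
            closes += 1
    if opens != closes:
        return f"unbalanced subgraph/end ({opens} subgraph, {closes} end)"
    return None


anonymous = 'flowchart LR\nsubgraph\n  a["A"]\nend'
assert check_subgraph_balance(anonymous) is None

unclosed = 'flowchart LR\nsubgraph lane\n  a["A"]'
assert check_subgraph_balance(unclosed) == "unbalanced subgraph/end (1 subgraph, 0 end)"
```

Counting lines rather than parsing keeps the check cheap, at the cost of treating any line that is exactly `end` as a closer.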


Previous review comments — status

All four earlier comments appear to have been addressed:

  • extract_diagram_type whitespace normalization — test pinned in 625bb81
  • _CONCATENATED_NODES false positives in quoted labels — _QUOTED_SPAN.sub blanking in place
  • crew.py find() → rfind() — per PR description
  • single.py inline extraction → shared extract_diagram_type — per PR description

The anonymous-subgraph gap in 625bb81 is the only new item. Everything else looks solid.

A bare `subgraph` (no id) is a valid opener, but the balance check only
matched `subgraph ` with a trailing space, so an anonymous subgraph would
be counted as 0 opens against its 1 `end` and falsely rejected as
unbalanced. Match the bare form too. None appear in the corpus, but a model
emitting one during a run should not produce a spurious failed cell.

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
src/maestro/analysis/metrics.py (1)

110-133: 💤 Low value

Redundant Path() wrapper on line 123.

input_path is already typed as Path | None, so wrapping it in Path(input_path) is unnecessary. While it works (Path accepts another Path), it adds a pointless conversion.

Suggested simplification
     try:
-        data = json.loads(Path(input_path).read_text(encoding="utf-8"))
+        data = json.loads(input_path.read_text(encoding="utf-8"))
     except (OSError, json.JSONDecodeError):
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/maestro/analysis/metrics.py` around lines 110 - 133, In the
extract_input_unnamed_ids function, the try block wraps input_path in an
unnecessary Path() constructor when calling read_text(). Since input_path is
already typed as Path | None, the Path() wrapper is redundant and should be
removed. Simply call read_text() directly on input_path to eliminate the
pointless conversion.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 565c0cc5-751e-4ed5-9190-021cbdcf565d

📥 Commits

Reviewing files that changed from the base of the PR and between 24cee1b and 53780cd.

📒 Files selected for processing (5)
  • src/maestro/analysis/metrics.py
  • src/maestro/run.py
  • src/maestro/strategies/_extraction.py
  • tests/analysis/test_label_scoring.py
  • tests/strategies/test_step_validation.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/maestro/run.py
  • src/maestro/strategies/_extraction.py

input_path is already Path after the None guard, so wrapping it in Path()
before read_text() was a no-op.
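
The fail-soft read pattern this touches can be sketched as below. The input schema (a top-level `nodes` list whose entries have `id` and `name`) and the function name `load_unnamed_ids` are assumptions for illustration only; the thread does not show the real input format, and the real function catches a narrower set of exceptions (OSError and json.JSONDecodeError in the diff above).

```python
from __future__ import annotations

import json
from pathlib import Path


def load_unnamed_ids(input_path: Path | None) -> set[str]:
    """Return ids of nodes with no usable name; empty set on any failure.

    Hypothetical sketch: the "nodes"/"id"/"name" schema is assumed.
    """
    if input_path is None:
        return set()
    try:
        # input_path is already a Path here, so no Path(...) wrapper is needed.
        data = json.loads(input_path.read_text(encoding="utf-8"))
        return {
            node["id"]
            for node in data["nodes"]
            if not str(node.get("name") or "").strip()
        }
    except (OSError, ValueError, KeyError, TypeError, AttributeError):
        # Fail soft: a missing or malformed input must not crash evaluation.
        return set()
```

Returning an empty set means a bad input file simply disables the exemption, so scoring falls back to ordinary label matching.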
@Colinho22 Colinho22 merged commit 557ff65 into main Jun 20, 2026
2 checks passed
@Colinho22 Colinho22 deleted the fix-prompt-strategies branch June 20, 2026 19:16

Labels

bug Something isn't working
