MAESTRO

Multi-Agent Evaluation for Structured Relational Output

Comparing agentic orchestration frameworks for automated relational diagram generation.

What it evaluates

MAESTRO is a benchmark. It gives every configuration the same task: turn a structured input dataset into a relational Mermaid diagram. It then scores the output against a ground-truth diagram. The question it answers is whether multi-agent orchestration produces better relational output than a single agent, and at what cost.

Four orchestration strategies generate the diagram. Prompts and the output contract are held identical across them, so only the orchestration differs:

single_agent: one prompt, one LLM call (the baseline)
sop_based: a hand-coded three-step procedure (extract entities, extract relationships, render Mermaid)
crew_ai: the same three steps orchestrated with CrewAI
lang_graph: the same three steps orchestrated with LangGraph
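
To make the difference between the baseline and the three-step procedure concrete, here is a minimal sketch of both. It is not MAESTRO's implementation: the function names, the prompt wording, and the `LLM` callable type are all assumptions made for illustration.

```python
from typing import Callable

# Hypothetical type: any function that sends a prompt to a model and returns text.
LLM = Callable[[str], str]


def single_agent(dataset: str, llm: LLM) -> str:
    """Baseline: one prompt, one LLM call."""
    return llm(f"Turn this dataset into a relational Mermaid diagram:\n{dataset}")


def sop_based(dataset: str, llm: LLM) -> str:
    """Hand-coded procedure: three calls, each fed the previous results."""
    # Step 1: extract entities.
    entities = llm(f"Extract the entities from this dataset:\n{dataset}")
    # Step 2: extract relationships, given the entities found in step 1.
    relationships = llm(
        "Extract the relationships between these entities.\n"
        f"Entities:\n{entities}\nDataset:\n{dataset}"
    )
    # Step 3: render Mermaid from the two intermediate results.
    return llm(
        "Render a relational Mermaid diagram.\n"
        f"Entities:\n{entities}\nRelationships:\n{relationships}"
    )
```

Under this reading, crew_ai and lang_graph would run the same three steps but hand the sequencing to the respective framework instead of plain function calls.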

Three control conditions (no LLM, deterministic) bracket the score range so a strategy's numbers are interpretable: null_control (empty diagram) and copy_control (raw input) are the floor; ground_truth_control (the answer verbatim) is the ceiling.
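
The controls are simple enough to sketch in a few lines. The shared signature below is an assumption chosen for illustration; only the three names and their behaviour come from the description above.

```python
def null_control(input_text: str, ground_truth: str) -> str:
    """Floor: emit an empty diagram."""
    return ""


def copy_control(input_text: str, ground_truth: str) -> str:
    """Floor: emit the raw input unchanged."""
    return input_text


def ground_truth_control(input_text: str, ground_truth: str) -> str:
    """Ceiling: emit the reference answer verbatim."""
    return ground_truth


# Hypothetical registry so the controls can be run like any other strategy.
CONTROLS = {
    "null_control": null_control,
    "copy_control": copy_control,
    "ground_truth_control": ground_truth_control,
}
```

Because none of them calls an LLM, their scores are deterministic and can serve as fixed reference points for the strategy results.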

Five providers are under test: Anthropic, OpenAI, Mistral, Gemini, and DeepSeek. Runs span a matrix of inputs × strategies × models × repeats, stratified by complexity tier.

Scoring covers structural validity (does it parse, via mmdc), entity F1 (id / name / lemma), relationship F1 (relaxed / strict), and an error taxonomy of what each diagram got wrong. Every cell is repeated and variance is reported.
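
As an illustration of the F1 metrics, the sketch below computes a set-based F1 over entity identifiers and then summarises repeated runs with a mean and sample variance. The entity names, the `f1_score` helper, and the choice to score an empty overlap as 0.0 are assumptions; the exact matching rules MAESTRO uses for id, name, and lemma are not shown here.

```python
from statistics import mean, variance


def f1_score(predicted: set[str], expected: set[str]) -> float:
    """Harmonic mean of precision and recall over two sets."""
    true_positives = len(predicted & expected)
    if true_positives == 0:
        # Convention chosen for this sketch: no overlap scores 0.0.
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


# Hypothetical ground truth and three repeats of the same cell.
expected = {"CUSTOMER", "ORDER", "PRODUCT"}
repeats = [
    {"CUSTOMER", "ORDER", "PRODUCT"},  # all entities correct
    {"CUSTOMER", "ORDER", "INVOICE"},  # one wrong entity
    {"CUSTOMER"},                      # two entities missing
]

scores = [f1_score(run, expected) for run in repeats]
print(f"mean={mean(scores):.3f} variance={variance(scores):.3f}")
```

The same function applies to relationships if each relationship is represented as a hashable tuple.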

Running the experiment

The benchmark runs a matrix of inputs × strategies × models × repeats, scores each generated Mermaid diagram against its ground truth, and records every result (plus the runtime environment) in a SQLite database. The steps below run the experiment from a clean checkout.
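
The matrix is a Cartesian product, which can be sketched with `itertools.product`. The strategy names come from this README; the input names, model names, and repeat count are placeholders.

```python
from itertools import product

inputs = ["input_a", "input_b"]  # placeholder input identifiers
strategies = ["single_agent", "sop_based", "crew_ai", "lang_graph"]
models = ["model_x", "model_y"]  # placeholder model identifiers
repeats = 3                      # placeholder repeat count

# One cell per (input, strategy, model, repeat index) combination.
cells = [
    (input_id, strategy, model, repeat_idx)
    for input_id, strategy, model in product(inputs, strategies, models)
    for repeat_idx in range(repeats)
]

print(len(cells))  # 2 inputs * 4 strategies * 2 models * 3 repeats = 48
```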

This is a high-level walkthrough. A detailed guide (troubleshooting, full CLI reference) will follow as the code stabilises.

Prerequisites

Python 3.11
API keys for the providers you intend to run: Anthropic, OpenAI, Mistral, Gemini, DeepSeek (see each provider's docs for obtaining a key)
mmdc (mermaid-cli) for the structural-validity metric. It is optional for a local install (the metric is skipped if mmdc is absent) and is bundled in the Docker image
Docker (optional), only if you prefer the container path over a local install

The local install path is tested on macOS and works on Windows. The Docker path runs Linux inside the container, so it is platform-independent and is the recommended route on Windows: it bundles a headless Chromium, which the parses_valid structural-validity metric needs. A native Windows install computes that metric only if mermaid-cli and a Puppeteer Chrome build (npx puppeteer browsers install chrome) are present; without them the metric is skipped, and the rest of the pipeline is unaffected.

1. Clone and install

git clone https://github.com/Colinho22/maestro.git
cd maestro
pip install -e .            # or: pip install -e ".[dev]" for the test/lint tools

Or build the container, which bundles Python, mermaid-cli, and Chromium:

docker compose build

2. Configure API keys

Copy the template and fill in the keys for the providers you will use:

cp .env.template .env
# edit .env (keys are read from the environment at run time)

3. Validate the setup with a small run

A single tier-1 cell confirms the install, keys, and scoring pipeline work before committing to the full matrix:

python -m maestro.run --strategy single_agent --tier 1 --repeats 1
# Docker: docker compose run --rm maestro python -m maestro.run --strategy single_agent --tier 1 --repeats 1

4. Run the full matrix

python -m maestro.run
# Docker: docker compose run --rm maestro python -m maestro.run

Runs are resumable by default: already-completed cells are skipped, so an interrupted run can be restarted with the same command. Results are written to maestro.db (or ./out/maestro.db under Docker).
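
One way to get this kind of resumability is to key each result on its cell and check for an existing row before doing any work. The sketch below uses the standard `sqlite3` module; the `results` table, its columns, and the `run_cell` callback are hypothetical and do not reflect MAESTRO's actual schema.

```python
import sqlite3
from collections.abc import Callable, Iterable

Cell = tuple[str, str, str, int]  # (input_id, strategy, model, repeat_idx)


def run_matrix(
    conn: sqlite3.Connection,
    cells: Iterable[Cell],
    run_cell: Callable[..., str],
) -> int:
    """Run every cell that has no stored result yet; return how many ran."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS results (
               input_id TEXT, strategy TEXT, model TEXT, repeat_idx INTEGER,
               output TEXT,
               PRIMARY KEY (input_id, strategy, model, repeat_idx))"""
    )
    executed = 0
    for cell in cells:
        done = conn.execute(
            "SELECT 1 FROM results WHERE input_id = ? AND strategy = ? "
            "AND model = ? AND repeat_idx = ?",
            cell,
        ).fetchone()
        if done:
            continue  # completed in an earlier run, skip it
        output = run_cell(*cell)
        conn.execute("INSERT INTO results VALUES (?, ?, ?, ?, ?)", (*cell, output))
        conn.commit()  # commit per cell so an interruption loses at most one
        executed += 1
    return executed
```

Committing after each cell means that restarting with the same command repeats only the cell that was in flight when the run stopped.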

5. Analyse the results

python -m maestro.analysis

6. Explore the results in the dashboard

docker compose up          # → http://localhost:8501
# Local (without Docker): streamlit run src/maestro/viz/app.py

Reproducibility audit trail

Every invocation snapshots its runtime environment (OS, architecture, Python version, library versions, git commit, and under Docker the image digest) into the run_environments table, linked to each run. This lets a later replication attempt diagnose diverging numbers against the exact stack that produced the original data.

Local development

Setup is tested on macOS. Install the dev extras and run the test suite and linters from the project root:

pip install -e ".[dev]"
pytest
ruff check .
ruff format --check .

pre-commit hooks (ruff lint + format) are configured in .pre-commit-config.yaml; enable them with pre-commit install.

Citing

If you use MAESTRO in your work, please cite it using GitHub's "Cite this repository" button, or take the reference details directly from the CITATION.cff file.
