feat(run): concurrent matrix execution + Docker build cleanup#70

Merged
Colinho22 merged 2 commits into main from feat-concurrent-llm-calls on Jun 19, 2026

Conversation

@Colinho22 Colinho22 commented Jun 19, 2026

Owner

What

Two independent efficiency changes, one per commit:

  1. Concurrent matrix execution with a per-provider concurrency cap. The
    runner ran cells sequentially, so a full matrix was ~30h of mostly idle
    network wait. Cells now run on a thread pool.
  2. .dockerignore so the build context stops shipping ~1.1 GB of .venv,
    .git, and tool caches to the daemon on every build.

Why

The matrix is I/O-bound: each cell is an independent LLM call (10-44s) followed
by a ~1ms scoring + DB write. Running them one at a time wastes almost all the
wall clock waiting on the network. Parallelising the cells is the single
largest speedup available: it cuts a ~30h run to an estimated ~5-6h, with
identical recorded numbers because the cells are independent.

The Docker build had no .dockerignore, so even though the Dockerfile only
copies src/ and two metadata files, Docker tarred and shipped the whole
context, dominated by the 1.1 GB virtualenv, on every build.

How

Concurrency is built around the project's non-negotiables:

  • SQLite stays single-writer. Workers do only the LLM call + scoring; every
    insert_* runs on the main thread, which drains finished cells via
    as_completed(). No worker ever touches the DB, so there is no lock
    contention and no risk of a silently dropped write.
  • duration_ms stays a true latency. Concurrency is capped per provider
    with a semaphore (claude / gpt / mistral / gemini / deepseek), so one
    provider's calls can't burst past its rate limit and inflate measured latency
    with server-side queue time. The cap also keeps us clear of 429s.
  • Per-cell error isolation is preserved. A strategy/provider exception still
    becomes a failed RunResult row rather than crashing the pool.
  • New flag --provider-concurrency N (default 4): a free-tier key drops it
    to 1, a high-limit account raises it. It changes only speed, never results.
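
The invariants above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in src/maestro/run.py: `fake_llm_call`, the `results` table, and the `(provider, cell_id)` cell tuples are hypothetical stand-ins, and starting the timer only after the permit is acquired is a choice made for this sketch. Only the overall pattern follows the description: workers hold a per-provider semaphore around the call, and the main thread alone writes to SQLite while draining as_completed().

```python
from __future__ import annotations

import sqlite3
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

PROVIDERS = ("claude", "gpt", "mistral", "gemini", "deepseek")
PROVIDER_CONCURRENCY = 4  # mirrors the documented default of --provider-concurrency

# One semaphore per provider caps that provider's in-flight calls.
semaphores = {name: threading.Semaphore(PROVIDER_CONCURRENCY) for name in PROVIDERS}


def fake_llm_call(provider: str, cell_id: int) -> str:
    """Hypothetical stand-in for the real network call."""
    time.sleep(0.01)
    return f"{provider}-answer-{cell_id}"


def execute_cell(provider: str, cell_id: int) -> tuple[str, int, str, float, str | None]:
    """Worker: LLM call and timing only. It never touches the database."""
    try:
        with semaphores[provider]:
            # Assumption of this sketch: the timer starts after the permit is
            # acquired, so waiting for a permit is not counted as latency.
            start = time.perf_counter()
            output = fake_llm_call(provider, cell_id)
            duration_ms = (time.perf_counter() - start) * 1000
        return provider, cell_id, output, duration_ms, None
    except Exception as exc:  # per-cell isolation: report the failure, don't raise
        return provider, cell_id, "", 0.0, repr(exc)


def run_matrix(cells: list[tuple[str, int]]) -> sqlite3.Connection:
    """Main thread: schedules the cells and is the only writer to SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE results "
        "(provider TEXT, cell_id INTEGER, output TEXT, duration_ms REAL, error TEXT)"
    )
    max_workers = len(PROVIDERS) * PROVIDER_CONCURRENCY
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(execute_cell, p, c) for p, c in cells]
        for future in as_completed(futures):
            # Every insert happens here, on the main thread.
            conn.execute("INSERT INTO results VALUES (?, ?, ?, ?, ?)", future.result())
            conn.commit()
    return conn
```

Because the workers return plain values and the connection is used from one thread only, the sketch needs no database locking.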

Testing

  • New tests/test_run_concurrency.py pins the semaphore wiring (one per
    provider, permit count matches the flag, keys match the provider dispatch).
  • Full suite: 246 passed, ruff check + format clean.
  • Live smoke run (10 single-agent cells) and a 6-cell LangGraph/Mistral run at
    concurrency 4: interleaved finish order confirms real parallelism; all rows,
    sub-results, and metrics persisted correctly with 0 retries and 0 errors.
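
For readers without the repository at hand, the three wiring checks described in the first bullet might look roughly like this. It is a hedged sketch, not the contents of tests/test_run_concurrency.py: `PROVIDER_DISPATCH` and `build_provider_semaphores` are hypothetical stand-ins whose names and signatures are assumed for illustration.

```python
import threading

# Hypothetical stand-in for the provider dispatch table; the values are placeholders.
PROVIDER_DISPATCH = {
    "claude": "call_claude",
    "gpt": "call_gpt",
    "mistral": "call_mistral",
    "gemini": "call_gemini",
    "deepseek": "call_deepseek",
}


def build_provider_semaphores(concurrency: int) -> dict:
    """Assumed shape: one semaphore per dispatch key, each with `concurrency` permits."""
    return {needle: threading.Semaphore(concurrency) for needle in PROVIDER_DISPATCH}


def test_one_semaphore_per_provider():
    assert len(build_provider_semaphores(4)) == len(PROVIDER_DISPATCH)


def test_permit_count_matches_flag():
    sem = build_provider_semaphores(3)["mistral"]
    # Exactly three non-blocking acquires succeed; the fourth is refused.
    assert all(sem.acquire(blocking=False) for _ in range(3))
    assert sem.acquire(blocking=False) is False


def test_keys_match_dispatch():
    assert set(build_provider_semaphores(1)) == set(PROVIDER_DISPATCH)
```

Non-blocking acquires keep the permit-count check deterministic: the test never waits on a semaphore, so it cannot hang.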

Notes

  • No behaviour change to the science: results are independent of execution
    order, so anyone who doesn't pass the new flag runs at the default cap (4)
    and records the same numbers as a sequential run.

Summary by CodeRabbit

  • New Features

    • Experiments now execute in parallel using thread pools for improved performance
    • Added --provider-concurrency CLI option to control concurrent executions per provider
  • Tests

    • Added concurrency enforcement tests
  • Chores

    • Optimized Docker build context with ignore patterns

The build had no .dockerignore, so Docker shipped the entire context to
the daemon on every build, including the 1.1 GB .venv, .git, and tool
caches, none of which the Dockerfile copies. Exclude them so context
transfer is fast and a future COPY . can never bake .venv into the image.

The matrix ran sequentially, so a full run was ~30h of mostly network
wait. Execute cells on a thread pool instead. Two invariants are kept:
workers do only the LLM call and scoring while the main thread remains
the sole DB writer (SQLite single-writer preserved), and concurrency is
capped per provider by a semaphore so one provider's calls cannot burst
past its rate limit and a cell's duration_ms stays a true latency.

Add --provider-concurrency N (default 4) so a free-tier key can drop to 1
and a high-limit account can raise it; it changes only speed, never the
recorded numbers, since cells are independent.
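
A flag like the one described above could be wired up with argparse roughly as follows. This is a sketch under assumptions: `DEFAULT_PROVIDER_CONCURRENCY` is named in the review walkthrough, but `positive_int`, `build_parser`, and the error message are hypothetical, and the real validation in src/maestro/run.py may be done differently.

```python
import argparse

DEFAULT_PROVIDER_CONCURRENCY = 4


def positive_int(value: str) -> int:
    """Hypothetical argparse type: accept only integers >= 1."""
    number = int(value)  # a non-integer raises ValueError, which argparse reports
    if number < 1:
        raise argparse.ArgumentTypeError("--provider-concurrency must be >= 1")
    return number


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="run")
    parser.add_argument(
        "--provider-concurrency",
        type=positive_int,
        default=DEFAULT_PROVIDER_CONCURRENCY,
        metavar="N",
        help="maximum concurrent LLM calls per provider (affects speed only)",
    )
    return parser
```

With this shape, `build_parser().parse_args(["--provider-concurrency", "1"])` yields the free-tier setting, and a value of 0 makes argparse exit with a usage error.
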
@Colinho22 Colinho22 added this to the 🧪 Experimental Artifact milestone Jun 19, 2026
@Colinho22 Colinho22 self-assigned this Jun 19, 2026
@Colinho22 Colinho22 added the enhancement New feature or request label Jun 19, 2026
coderabbitai Bot commented Jun 19, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 288ec809-0cd0-440c-b096-0d3e271f6e21

📥 Commits

Reviewing files that changed from the base of the PR and between bd6c7ec and 80aa480.

📒 Files selected for processing (3)
  • .dockerignore
  • src/maestro/run.py
  • tests/test_run_concurrency.py

📝 Walkthrough

The runner in src/maestro/run.py is refactored from sequential per-cell execution to a ThreadPoolExecutor model with per-provider threading.Semaphore caps. A new _execute_cell() worker function handles strategy construction, semaphore-gated LLM calls, and metric evaluation; main() serializes all DB writes on the main thread. A --provider-concurrency CLI flag controls the cap. Concurrency tests are added, and a .dockerignore file is included.

Changes

Parallel Experiment Runner with Per-Provider Concurrency

| Layer / File(s) | Summary |
|---|---|
| Semaphore infrastructure, CLI config, and tests (src/maestro/run.py, tests/test_run_concurrency.py) | Adds threading/ThreadPoolExecutor imports, DEFAULT_PROVIDER_CONCURRENCY, _build_provider_semaphores() to create one semaphore per provider dispatch needle, a --provider-concurrency CLI argument with >= 1 validation, and three unit tests asserting semaphore count, permit exhaustion behavior, and dispatch key coverage. |
| _execute_cell worker function (src/maestro/run.py) | Adds _execute_cell(), which builds the control or LLM strategy for a matrix cell, acquires the per-provider semaphore around the LLM network call, returns a failed RunResult on any exception, and evaluates metrics on success with errors caught and logged. |
| main() thread pool orchestration (src/maestro/run.py) | Refactors main() to filter out unimplemented strategies before scheduling, build provider semaphores, create a ThreadPoolExecutor sized by len(_PROVIDER_DISPATCH) × provider_concurrency, submit all runnable cells as futures, and process completions on the main thread for all DB writes (config, result, sub-results, metrics), with final counts reported against total_runnable. |
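
The worker and pool-sizing behaviour summarised above can be illustrated with a small sketch. It is not the actual _execute_cell(): the `RunResult` fields, `execute_cell`, `pool_size`, and `run_cells` here are hypothetical, the `call` and `score` callables are injected for illustration, and the per-provider semaphore is omitted to keep the focus on error isolation.

```python
from __future__ import annotations

import logging
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)


@dataclass
class RunResult:
    """Hypothetical result row; the real RunResult has its own fields."""
    cell: str
    ok: bool
    output: str = ""
    error: str = ""
    metrics: dict[str, float] = field(default_factory=dict)


def execute_cell(cell: str, call, score) -> RunResult:
    """Worker: a failing call becomes a failed result instead of an exception."""
    try:
        output = call(cell)
    except Exception as exc:
        return RunResult(cell=cell, ok=False, error=repr(exc))
    result = RunResult(cell=cell, ok=True, output=output)
    try:
        result.metrics = score(output)
    except Exception:
        # Metric errors are logged; the successful result is still returned.
        logger.exception("metric evaluation failed for %s", cell)
    return result


def pool_size(n_providers: int, provider_concurrency: int) -> int:
    """Enough workers for every provider to use all of its permits at once."""
    return n_providers * provider_concurrency


def run_cells(cells, call, score, n_providers=5, provider_concurrency=4):
    """Submit every cell, then collect results on the calling thread."""
    workers = pool_size(n_providers, provider_concurrency)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(execute_cell, cell, call, score) for cell in cells]
        return [future.result() for future in as_completed(futures)]
```

Since `execute_cell` never raises, `future.result()` cannot propagate a worker exception into the collection loop, which is what keeps one bad cell from stopping the run.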

Docker Build Context

| Layer / File(s) | Summary |
|---|---|
| .dockerignore patterns (.dockerignore) | Adds patterns excluding Python virtual environments, build/cache outputs, tool caches, VCS/IDE metadata, local secrets, Claude Code files, OS cruft, and logs. |
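
As an illustration of the categories listed above, a .dockerignore along these lines would exclude them. The entries are assumptions chosen to match the categories, not the contents of the file added in this PR; in particular the tool-cache names, the `.env` secret file, and the `.claude` directory are guesses.

```dockerignore
# Python virtual environments
.venv
venv

# Build and cache outputs
**/__pycache__
**/*.py[cod]
build
dist
*.egg-info

# Tool caches (assumed tools)
.pytest_cache
.ruff_cache
.mypy_cache

# VCS and IDE metadata
.git
.idea
.vscode

# Local secrets (assumed file name)
.env

# Claude Code files (assumed directory name)
.claude

# OS cruft and logs
**/.DS_Store
**/*.log
```

Patterns in a .dockerignore are matched relative to the build-context root, so the `**/` prefix is what makes an entry apply in every subdirectory.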

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 Hoppity-hop, no more waiting in line,
Each provider gets semaphores — oh, so fine!
The thread pool spins up, cells fly through the air,
The main thread writes DB with the greatest of care.
.dockerignore trims the context down slim,
A parallel meadow, built right on a whim! 🌿

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately summarizes the two main changes: concurrent matrix execution and Docker build cleanup via .dockerignore, matching the substantial refactoring in run.py and new .dockerignore file. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Docstring Coverage (Src Only) | ✅ Passed | All 4 public functions in src/maestro/run.py have docstrings (100% coverage exceeds 80% threshold). .dockerignore is not source code. Tests are excluded. |


@Colinho22 Colinho22 merged commit e3ffba1 into main Jun 19, 2026
2 checks passed
@Colinho22 Colinho22 deleted the feat-concurrent-llm-calls branch June 19, 2026 20:42

Labels

enhancement New feature or request
