docs: HarnessBench design spec by ytallo · Pull Request #280 · iii-hq/workers

ytallo · 2026-06-17T12:36:02Z

Adds the design spec for HarnessBench — a harness benchmark that runs the same prompt across multiple model/config variants and compares how each performs. The doc lives at tech-specs/2026-06-agentic/harnessbench.md, alongside the other agentic specs.

This PR is the design doc only — no implementation yet.

Scope (MVP)

A new harnessbench worker (orchestration + metrics, no UI) plus an internal comparison view in console.

Highlights

Benchmark run = one fixed prompt × N config variants ("legs"). Each leg can override model / provider / thinking-level and any harness SendOptions (system_prompt, max_turns, output, functions).
Read + orchestrate, not instrument — every metric comes from already-persisted data: per-message usage (tokens/cost), transcript timestamps, and OTel spans.
Two-tier metrics — Tier-1 from the session transcript (wall time, turn count, tool calls, tokens, cost, final context size); Tier-2 trace enrichment (span count, avg span time), best-effort so the MVP has no hard trace-retention dependency.
Durable, event-driven orchestration — fan out harness::send per leg, complete reactively via harness::turn-completed, with a sweep fallback for dropped events.
Worker surface — harnessbench::create-run | status | get | list.
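
To make the first three highlights concrete, here is a minimal sketch of how a run could expand into legs and how Tier-1 metrics could be read back from a persisted transcript. It is illustrative only and not taken from the spec: the `Leg` type, the `expand_legs` and `tier1_metrics` helpers, and the message fields (`role`, `ts`, `tool_calls`, `usage.total_tokens`, `usage.cost_usd`) are all hypothetical names. The real harness SendOptions and transcript schema may differ. Final context size is left out because its exact definition is not given here.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any


@dataclass
class Leg:
    """One config variant of a run. Field names are hypothetical."""
    name: str
    # Per-leg overrides, e.g. model, provider, max_turns, system_prompt.
    overrides: dict[str, Any] = field(default_factory=dict)


def expand_legs(prompt: str, base_options: dict[str, Any], legs: list[Leg]) -> list[dict[str, Any]]:
    """Build one send request per leg: same prompt, base options plus overrides."""
    return [
        {"leg": leg.name, "prompt": prompt, "options": {**base_options, **leg.overrides}}
        for leg in legs
    ]


def tier1_metrics(transcript: list[dict[str, Any]]) -> dict[str, Any]:
    """Derive Tier-1 metrics from already-persisted messages (assumed schema)."""
    if not transcript:
        raise ValueError("transcript is empty")
    times = [datetime.fromisoformat(m["ts"]) for m in transcript]
    assistant = [m for m in transcript if m["role"] == "assistant"]
    return {
        # Wall time is the span between the first and last message timestamps.
        "wall_time_s": (max(times) - min(times)).total_seconds(),
        # Assumption: one assistant message counts as one turn.
        "turn_count": len(assistant),
        "tool_calls": sum(len(m.get("tool_calls", [])) for m in assistant),
        "tokens": sum(m.get("usage", {}).get("total_tokens", 0) for m in transcript),
        "cost_usd": sum(m.get("usage", {}).get("cost_usd", 0.0) for m in transcript),
    }
```

Under these assumptions, calling `expand_legs` with a base of `{"model": "model-a", "max_turns": 8}` and a leg that overrides only `max_turns` yields a request that keeps the base model and takes the leg's turn limit. Because `tier1_metrics` reads only stored fields, it matches the "read + orchestrate, not instrument" approach: nothing in the harness needs extra instrumentation.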

Out of scope (later layers on the same functions)

Correctness grading, multi-turn scenarios, the standalone harnessbench.iii.dev site, and CI/regression baselines.

Ref: https://linear.app/motia/issue/MOT-3534/harnessbench

vercel · 2026-06-17T12:36:08Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
workers	Ready	Preview, Comment	Jun 17, 2026 12:36pm

coderabbitai · 2026-06-17T12:36:11Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

github-actions · 2026-06-17T12:36:19Z

skill-check — worker

0 verified, 22 skipped (no docs/).

Layer	Result
structure	✓
vale	✓
ai	✓
render	✓

Four for four. Nicely done.

docs: add HarnessBench design spec

1fe04f1

ytallo wants to merge 1 commit into main from feat/mot-3534-harnessbench
