docs: HarnessBench design spec #280

Draft
ytallo wants to merge 1 commit into main from feat/mot-3534-harnessbench

ytallo commented Jun 17, 2026

Adds the design spec for HarnessBench — a harness benchmark that runs the same prompt across multiple model/config variants and compares how each performs. The doc lives at tech-specs/2026-06-agentic/harnessbench.md, alongside the other agentic specs.

This PR is the design doc only — no implementation yet.

Scope (MVP)

A new harnessbench worker (orchestration + metrics, no UI) plus an internal comparison view in console.

Highlights

  • Benchmark run = one fixed prompt × N config variants ("legs"). Each leg can override model / provider / thinking-level and any harness SendOptions (system_prompt, max_turns, output, functions).
  • Read + orchestrate, not instrument — every metric comes from already-persisted data: per-message usage (tokens/cost), transcript timestamps, and OTel spans.
  • Two-tier metrics — Tier-1 from the session transcript (wall time, turn count, tool calls, tokens, cost, final context size); Tier-2 trace enrichment (span count, avg span time), best-effort so the MVP has no hard trace-retention dependency.
  • Durable, event-driven orchestration — fan out harness::send per leg, complete reactively via harness::turn-completed, with a sweep fallback for dropped events.
  • Worker surface: harnessbench::create-run | status | get | list.
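
To make the leg expansion and the Tier-1 metrics concrete, here is a minimal sketch, not the spec's implementation. The `Leg` and `Message` shapes, the `expand_legs` and `tier1_metrics` helpers, and every field name are hypothetical and chosen only for illustration; the real transcript and SendOptions schemas live in the harness. Final context size is left out because the persisted field it would come from is not described here.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any

# All names and field shapes below are hypothetical, for illustration only.


@dataclass
class Leg:
    """One config variant of a benchmark run."""
    name: str
    overrides: dict[str, Any] = field(default_factory=dict)  # e.g. model, max_turns


@dataclass
class Message:
    """A persisted transcript message with its per-message usage."""
    role: str  # "user", "assistant", or "tool"
    timestamp: datetime
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0
    tool_calls: int = 0


def expand_legs(prompt: str, base: dict[str, Any], legs: list[Leg]) -> dict[str, dict[str, Any]]:
    """Build one send payload per leg: base options, leg overrides on top, same prompt."""
    return {leg.name: {**base, **leg.overrides, "prompt": prompt} for leg in legs}


def tier1_metrics(transcript: list[Message]) -> dict[str, Any]:
    """Derive Tier-1 metrics from already-persisted messages only."""
    if not transcript:
        return {"wall_time_s": 0.0, "turns": 0, "tool_calls": 0, "tokens": 0, "cost_usd": 0.0}
    # Wall time is the gap between the first and last persisted timestamps.
    wall = (transcript[-1].timestamp - transcript[0].timestamp).total_seconds()
    return {
        "wall_time_s": wall,
        # Assumption: one assistant message counts as one turn.
        "turns": sum(1 for m in transcript if m.role == "assistant"),
        "tool_calls": sum(m.tool_calls for m in transcript),
        "tokens": sum(m.input_tokens + m.output_tokens for m in transcript),
        "cost_usd": round(sum(m.cost_usd for m in transcript), 6),
    }
```

Because every leg shares the prompt and differs only in its overrides, the per-leg metric dictionaries can be compared side by side without any additional instrumentation in the harness.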

Out of scope (later layers on the same functions)

Correctness grading, multi-turn scenarios, the standalone harnessbench.iii.dev site, and CI/regression baselines.

Ref: https://linear.app/motia/issue/MOT-3534/harnessbench

vercel Bot commented Jun 17, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Actions | Updated (UTC)
workers | Ready | Preview, Comment | Jun 17, 2026 12:36pm

coderabbitai Bot commented Jun 17, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 68554359-61c4-414e-9343-5e35a63d8f1c

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

@github-actions

skill-check — worker

0 verified, 22 skipped (no docs/).

Layer Result
structure
vale
ai
render

Four for four. Nicely done.
