docs: HarnessBench design spec #280
skill-check — worker0 verified, 22 skipped (no docs/).
Four for four. Nicely done.
Adds the design spec for HarnessBench, a harness benchmark that runs the same prompt across multiple model/config variants and compares how each performs. The doc lives at `tech-specs/2026-06-agentic/harnessbench.md`, alongside the other agentic specs.

This PR is the design doc only; no implementation yet.

Scope (MVP)

A new `harnessbench` worker (orchestration + metrics, no UI) plus an internal comparison view in `console`.

Highlights

- `SendOptions` (`system_prompt`, `max_turns`, `output`, `functions`).
- `usage` (tokens/cost), transcript timestamps, and OTel spans.
- `harness::send` per leg, complete reactively via `harness::turn-completed`, with a `sweep` fallback for dropped events.
- `harnessbench::create-run | status | get | list`.

Out of scope (later layers on the same functions)

Correctness grading, multi-turn scenarios, the standalone `harnessbench.iii.dev` site, and CI/regression baselines.

Ref: https://linear.app/motia/issue/MOT-3534/harnessbench
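
For reviewers who want a concrete picture of the flow named under Highlights (one send per leg, reactive completion, and a sweep for dropped events), here is a minimal in-memory sketch. It is not taken from the spec. `BenchRun`, `Leg`, the `send` callback, the status strings, and the timeout-based sweep are all hypothetical, and the `SendOptions` field types are assumptions; only the four field names and the function names they stand in for come from this PR description.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class SendOptions:
    # Field names come from the PR description; the types are assumptions.
    system_prompt: Optional[str] = None
    max_turns: Optional[int] = None
    output: Optional[str] = None
    functions: list[str] = field(default_factory=list)


@dataclass
class Leg:
    # Hypothetical record for one variant of a run.
    leg_id: str
    options: SendOptions
    started_at: float
    status: str = "running"  # running | completed | timed_out (assumed states)
    usage: Optional[dict] = None


class BenchRun:
    """Hypothetical in-memory model of one benchmark run."""

    def __init__(self, prompt: str, send: Callable[[str, str, SendOptions], None]):
        self.prompt = prompt
        self._send = send  # stand-in for triggering harness::send
        self.legs: dict[str, Leg] = {}

    def dispatch(self, variants: dict[str, SendOptions], now: float) -> None:
        # One send per leg: the same prompt, different options.
        for leg_id, options in variants.items():
            self.legs[leg_id] = Leg(leg_id, options, started_at=now)
            self._send(leg_id, self.prompt, options)

    def on_turn_completed(self, leg_id: str, usage: dict) -> None:
        # Reactive path: stand-in for a harness::turn-completed handler.
        leg = self.legs.get(leg_id)
        if leg is not None and leg.status == "running":
            leg.status = "completed"
            leg.usage = usage

    def sweep(self, now: float, timeout_s: float) -> list[str]:
        # Fallback path: close legs whose completion event never arrived.
        # Assumption: stale legs are simply marked timed out.
        stale = [
            leg.leg_id
            for leg in self.legs.values()
            if leg.status == "running" and now - leg.started_at > timeout_s
        ]
        for leg_id in stale:
            self.legs[leg_id].status = "timed_out"
        return stale

    def status(self) -> str:
        # A run is done once no leg is still running.
        done = all(leg.status != "running" for leg in self.legs.values())
        return "done" if done else "running"
```

In this sketch a late `turn-completed` event for a leg the sweep already closed is ignored, which keeps the two paths from racing. The actual design may instead have the sweep re-query the harness for the leg's real state; the spec is the authority on that.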