RFC: Rebuild vitest-evals Around Harness-First Integration Testing
Summary
vitest-evals should move from a scorer-first eval framework to a harness-first
integration testing framework for AI applications.
This overhaul is primarily for application teams writing agent integration tests
inside their own repositories. The framework still needs to be extensible
because advanced users may work across multiple harnesses, but end users should
not be expected to write glue code for common harnesses themselves.
Today the core model is:
- define data
- run a task(input)
- return string | TaskResult
- apply one or more scorers
- compare the average score to a threshold
That model works for qualitative evals and output scoring, but it is too narrow
for the kind of test suites we want to build next. The next version of
vitest-evals should treat harnesses, execution traces, tool behavior, replay,
and diagnostics as first-class concepts. Traditional integration tests should be
the default authoring model. Scorers should remain available, but they should no
longer be the center of the product.
Problem
The current API and reporter are optimized for "did this output score well?"
rather than "did this agentic system behave correctly end-to-end?"
That creates a few hard constraints:
- Harness integrations are ad hoc. The AI SDK support is an example transform,
not a first-class runtime contract.
- The task contract is too small. string | TaskResult does not model messages,
steps, intermediate state, usage, trace artifacts, retries, or cache/replay
behavior well.
- Tool support is evaluation-oriented. We can score tool usage, but we do not
have built-in primitives for recording, replaying, diagnosing, or asserting on
tool behavior as part of a normal integration test.
- Reporting is score-first. It does not surface rich execution diagnostics while
tests are running.
- DX is still too manual. Users have to adapt provider outputs themselves,
invent conventions for traces and fixtures, and build their own local replay
strategy.
Goals
- Make built-in harnesses a product surface, not an example.
- Bundle first-class scaffolds for ai-sdk and pi-ai.
- Keep harness adapters thin. Core should own the normalized data model,
replay/VCR policy, reporting, helpers, and judge model.
- Shift the primary authoring model toward traditional integration tests built on
top of Vitest assertions and helpers.
- Make execution artifacts first-class:
tool calls, steps, messages, usage, timings, errors, and cached/replayed data.
- Add built-in VCR-style replay support for tools, with opt-in per tool and
repository-local storage.
- Emit rich usage diagnostics with every test run, including token usage, model
usage, tool call details, durations, and cache hit/miss data.
- Make the runtime and reporter feel excellent during authoring:
clear progress, useful failure output, watch mode ergonomics, and easy fixture
setup.
- Require the normalized run/session data to be JSON-serializable so it can be
persisted, reported, and attached as artifacts without custom serializers.
- Preserve a path for scorer-based evals, but reposition them as one capability
inside a broader testing framework.
Non-Goals
- Do not optimize the next major version primarily for qualitative A/B ranking
workflows.
- Do not try to support every model SDK or orchestration library in the first
rollout.
- Do not replace Vitest. We should continue to feel like a focused layer on top
of Vitest rather than a separate test runner.
- Do not require VCR/replay for every tool. Some tools must remain live or
custom-managed.
Proposed Direction
1. Introduce a Harness-First Core Contract
The central abstraction should be a harness, not a scorer.
A harness is responsible for:
- running the system under test
- adapting the underlying runtime into a normalized session/run shape
- reporting usage analytics and session data in that normalized shape
- adapting tools so core replay/VCR behavior can be applied when configured
Illustrative shape:
type HarnessRun = {
  session: NormalizedSession;
  output?: unknown;
  usage: UsageSummary;
  timings?: TimingSummary;
  artifacts?: Record<string, unknown>;
  errors: Array<Record<string, unknown>>;
};

type Harness = {
  name: string;
  run(input: unknown, context: HarnessContext): Promise<HarnessRun>;
};
This contract should replace the current assumption that a test is fundamentally
input -> string result -> scorer. The run result needs to be large enough to
support assertions, reporting, persistence, and replay without every user
reinventing adapters.
The canonical source of truth should be a normalized provider session object.
That session should be plain JSON-serializable data. Higher-level typed helpers
can sit on top of it, but the stored and reported form should be plain data.
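To make the contract concrete, here is a minimal sketch of a harness satisfying the illustrative shape above, using a toy echo agent in place of a real system under test. Every type and field name is an assumption taken from this RFC's sketches, not a shipped API:

```typescript
// Minimal sketch of a custom harness implementing the illustrative
// contract above. NormalizedSession is deliberately bare here; the
// trace shape is discussed later in this RFC.
type NormalizedSession = { messages: Array<Record<string, unknown>> };
type UsageSummary = { inputTokens: number; outputTokens: number };
type HarnessContext = Record<string, unknown>;

type HarnessRun = {
  session: NormalizedSession;
  output?: unknown;
  usage: UsageSummary;
  errors: Array<Record<string, unknown>>;
};

type Harness = {
  name: string;
  run(input: unknown, context: HarnessContext): Promise<HarnessRun>;
};

// A toy "echo agent" standing in for a real system under test.
const echoHarness: Harness = {
  name: "echo",
  async run(input) {
    const text = `echo: ${String(input)}`;
    return {
      // The harness owns normalization: it returns plain,
      // JSON-serializable session data, not provider objects.
      session: {
        messages: [
          { role: "user", content: String(input) },
          { role: "assistant", content: text },
        ],
      },
      output: text,
      usage: { inputTokens: 0, outputTokens: 0 },
      errors: [],
    };
  },
};

echoHarness.run("hi", {}).then((run) => console.log(run.output)); // → "echo: hi"
```

Note that the returned run is already plain data, so it can be persisted or attached as an artifact without a custom serializer.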
2. Make the Contract Break Explicit
The current primary contract is too small:
TaskFn = (input: string) => Promise<string | TaskResult>
TaskResult = { result: string; toolCalls?: ToolCall[] }
The next major version should stop treating that as the main extension point.
Instead:
- harnesses become the primary integration contract
- normalized session/run artifacts become the primary output contract
- scorer inputs should be derived from normalized run data, not from ad hoc task
return shapes
- the session conversation becomes mandatory and complete enough to support
tool calls, messages, output assertions, analytics, and replay/reporting
- tool calls should be normalized as part of that session model, with helpers to
query them ergonomically
That means we should define richer normalized types for at least:
- NormalizedSession
- HarnessRun
- ToolCallRecord
- UsageSummary
- TimingSummary
- reporter-facing test metadata
Compatibility can still exist, but it should be implemented as an adapter layer
from the legacy task/scorer model into the new harness/run model.
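As one illustration of what "normalized tool calls with ergonomic query helpers" could mean, here is a hypothetical ToolCallRecord shape and a toolCalls() helper. All field names are assumptions, not a finalized schema:

```typescript
// Sketch of a normalized tool-call record plus a query helper.
// Field names are illustrative assumptions, not a committed schema.
type ToolCallRecord = {
  name: string;
  input: unknown;
  output?: unknown;
  durationMs?: number;
  replayed?: boolean; // true when served from the VCR cache
  error?: Record<string, unknown>;
};

type NormalizedSession = {
  messages: Array<{ role: string; content?: unknown }>;
  toolCalls: ToolCallRecord[];
};

// Ergonomic query helper: all calls, optionally filtered by tool name.
function toolCalls(session: NormalizedSession, name?: string): ToolCallRecord[] {
  return name
    ? session.toolCalls.filter((call) => call.name === name)
    : session.toolCalls;
}

const session: NormalizedSession = {
  messages: [],
  toolCalls: [
    { name: "search", input: { q: "latest release" } },
    { name: "deploy", input: { target: "production" } },
  ],
};

console.log(toolCalls(session, "search").length); // → 1
```

A helper like this is what lets assertions such as `expect(toolCalls(session)).toContainEqual(...)` stay ordinary Vitest code rather than a scorer.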
3. Bundle First-Class Harnesses
We should ship built-in harness scaffolds for:
- ai-sdk
- pi-ai
These should not just be examples in docs. They should be supported entry points
with:
- starter scaffolding
- standard session normalization
- standard tool adaptation hooks for core VCR behavior
- standard usage accounting
- standard reporter integration
Packaging direction:
- separate first-party packages per harness, for example @vitest-evals/harness-ai-sdk
- those packages should remain thin and should not invent harness-specific
behavior outside of adapting the harness into the core contract
4. Redefine the Test Authoring Model
The default style should look like integration testing, not judge-style evals.
Each suite is bound to exactly one explicit harness adapter.
Illustrative direction:
describeEval("deploy agent", {
  harness: aiSdkHarness({ app: createDeployAgent() }),
  judges: [
    factualityJudge({
      name: "final output factuality",
      threshold: 0.8,
    }),
  ],
  data: async () => [
    {
      name: "uses search before choosing a deployment target",
      input: "Deploy the latest release to production",
    },
  ],
  test: async ({ session, run }) => {
    expect(run.output).toContain("deployed");
    expect(toolCalls(session)).toContainEqual(
      expect.objectContaining({
        name: "search",
      }),
    );
    expect(run.output).toSatisfyJudge(customDomainJudge());
  },
});
The exact API shape is still open, but the design constraints are not:
- assertions should compose naturally with Vitest
- users should not be forced into a scorer mental model
- suites should be able to define automatic judges that run for every case
- suite-level judges are fixed for the suite and apply to every case
- each case should execute the harness exactly once, with the same run/session
shared by automatic judges and any optional test callback
- test callbacks should receive the session/run data and be optional
- test cases should be able to assert on the full run/session, not just the
final text
- test cases remain input-driven by default; case-specific assertions live in
the optional test callback rather than in an ever-growing case schema
4a. Clarify How Existing Agents Plug Into A Harness
A harness is the runtime adapter for the system under test. It is not just a
helper that prepares judge input.
For application authors, the intended contract should be:
- bind exactly one harness to the suite
- pass the existing app/agent or an agent factory into that harness
- let the harness execute the normal runtime entrypoint for each case
- let the harness inject instrumented tools, runtime hooks, and replay/VCR
behavior
- let the harness return normalized run/session artifacts for assertions,
judges, and reporting
That means the harness replaces task as the primary runtime extension point.
Users bringing an existing pi-ai or ai-sdk app should only need to supply:
- the app/agent instance, or a factory that creates it per test
- the normal execution entrypoint for one test case
- optional output mapping when the app returns a domain object rather than a
final assistant string
Built-in harnesses should own the rest:
- runtime instrumentation
- session normalization
- tool call capture
- usage/timing extraction
- replay/VCR integration
- reporter-facing artifacts
The spec should also distinguish between:
run.output: the application-facing result the test author wants to assert on
- the normalized session trace: the canonical record used for reporting, tool
assertions, replay metadata, and generic judges
This distinction matters because many real agents do not naturally return a
single final string. They may return a domain object such as
{ status, invoiceId } or { answer, citations }. The harness should preserve
that value in run.output while also normalizing the assistant/session trace
separately.
Built-in harnesses should support two authoring levels:
- a zero-glue path for conventional apps, such as
piAiHarness({ createAgent: () => createRefundAgent() })
- an escape hatch for custom entrypoints and output mapping, such as providing a
custom run(...) function and output(...) selector without re-implementing
normalization
There is one important integration constraint: the application still needs a
supported seam for injection or observation. That can be dependency injection
for tools/model clients, framework event hooks, or a wrapper around execution
that the harness can observe. If an agent closes over global tools and model
clients with no injection point and no events, the harness cannot reliably
capture traces or apply replay behavior.
Illustrative direction:
describeEval("refund agent", {
  harness: piAiHarness({
    createAgent: () => createRefundAgent(),
    run: ({ agent, input, context }) =>
      agent.run(input, {
        tools: context.tools,
        events: context.events,
      }),
  }),
  data: async () => [{ input: "Refund invoice inv_123" }],
  test: async ({ run, session }) => {
    expect(run.output).toMatchObject({ status: "approved" });
    expect(toolCalls(session)).toContainEqual(
      expect.objectContaining({ name: "lookupInvoice" }),
    );
  },
});
The important behavior is:
- the user passes their existing agent through the harness
- the harness supplies the instrumented runtime pieces
- the agent executes normally
- the harness returns both the domain result and the normalized trace
5. Keep Scorers, but Demote Them
Scorers and judges still matter for some use cases:
- qualitative checks
- rubric-based assertions
- domain-specific LLM-as-a-judge workflows
But they should become optional helpers that operate on normalized session/run
data, instead of defining the whole execution model. The framework should be
able to support:
- automatic suite-level judges that always run
- explicit judge assertions inside tests, such as toSatisfyJudge(...)
- pure assertion-based tests plus helper matchers
The long-term product direction is real-world integration testing, not simple
score-first eval suites. We will likely still ship a stock factuality-style
judge as an example or baseline, but users should be able to swap prompts or
replace that judge entirely because most real suites are domain-specific.
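One way the "judges operate on normalized run data" idea could be sketched is a small judge contract. The shapes below are illustrative assumptions, and the toy judge does a deterministic string check only so the example is self-contained; a real judge would call an LLM:

```typescript
// Sketch of a judge that operates on normalized run data rather than
// a raw string. Names and shapes are assumptions from this RFC.
type JudgeResult = {
  name: string;
  score: number;
  pass: boolean;
  rationale?: string;
};

type Judge = {
  name: string;
  evaluate(run: { output?: unknown }): Promise<JudgeResult>;
};

// Toy deterministic "judge": passes when the output mentions a phrase.
// A real judge would prompt a model and parse a rubric-based verdict.
function containsJudge(name: string, needle: string, threshold = 1): Judge {
  return {
    name,
    async evaluate(run) {
      const score = String(run.output ?? "").includes(needle) ? 1 : 0;
      return { name, score, pass: score >= threshold };
    },
  };
}

containsJudge("mentions deploy", "deployed")
  .evaluate({ output: "Release deployed to production." })
  .then((result) => console.log(result.pass)); // → true
```

Because the judge receives the run rather than a bare string, swapping in a domain-specific judge does not change the suite's execution model.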
Built-In Requirements
VCR for Tools
Built-in harness support should include VCR-style recording and replay for tool
calls.
Requirements:
- opt-in at the tool level
- global replay policy configured at the Vitest config level
- repository-local cache storage with configurable path
- deterministic cache keying strategy based on tool name and normalized input
parameters
- ability to inspect stored recordings
- replay metadata surfaced in test output
- one recording file per invocation
- recordings grouped by tool name
- standardized recorded metadata including at least write time, input params,
output, and replay-relevant metadata
- automatic mode that replays when present and falls back to a live call when a
recording is missing, then writes the fresh result back to cache
- strict replay mode that errors on missing recordings
- a sanitization/redaction hook before persistence
- clean handling for non-deterministic tools and cache busting
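One plausible realization of the deterministic cache-keying requirement is to hash the tool name together with a key-sorted serialization of the input parameters, so property order never changes the key. This is a sketch of the approach, not a committed format:

```typescript
import { createHash } from "node:crypto";

// Stable serialization: objects are emitted with sorted keys so that
// { a, b } and { b, a } produce identical strings.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => a.localeCompare(b))
      .map(([k, v]) => `${JSON.stringify(k)}:${stableStringify(v)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

// Deterministic recording key: tool name plus a truncated digest of
// the normalized input parameters.
function recordingKey(toolName: string, params: unknown): string {
  const digest = createHash("sha256")
    .update(`${toolName}\n${stableStringify(params)}`)
    .digest("hex");
  return `${toolName}-${digest.slice(0, 16)}`;
}

// Property order does not affect the key:
const a = recordingKey("search", { q: "deploy", limit: 5 });
const b = recordingKey("search", { limit: 5, q: "deploy" });
console.log(a === b); // → true
```

Keeping the tool name as a visible prefix also supports the "recordings grouped by tool name" requirement without parsing file contents.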
Example use cases:
- cache web search results in the repo
- avoid repeated calls to expensive or rate-limited APIs
- stabilize integration suites in CI
- make local iteration fast without mocking the whole system
This should be a core primitive, not left to one-off userland wrappers.
Open implementation note:
- prefer config-driven per-tool policy with minimal boilerplate
- if a harness makes some wrapper unavoidable, keep that wrapper extremely thin
- sanitization hook scope and override behavior should be specified in a deeper
follow-up VCR design
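A config-driven policy might look roughly like the following object inside vitest.config.ts. None of these option names exist today; they only illustrate the "minimal boilerplate, per-tool opt-in" direction:

```typescript
// Hypothetical shape of a global replay policy. Every option name
// below is an assumption sketched for this RFC, not a real setting.
const evalsConfig = {
  recordings: {
    dir: ".vitest-evals/recordings",          // repository-local storage
    mode: process.env.CI ? "strict" : "auto", // strict errors on missing recordings
    tools: {
      webSearch: { record: true },            // opt-in per tool
      currentTime: { record: false },         // non-deterministic: stays live
    },
    sanitize: (recording: unknown) => recording, // redaction hook before persistence
  },
};

console.log(evalsConfig.recordings.mode);
```

The point of the sketch is that per-tool policy lives in one config block, so individual tests never need wrapper boilerplate.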
Usage Diagnostics on Every Run
Every test run should produce structured diagnostics for both humans and tools.
Minimum data we should capture when available:
- model name and provider
- input tokens
- output tokens
- reasoning tokens if exposed
- total tokens
- estimated cost if derivable
- tool call count
- per-tool call details
- per-step durations
- cache hits/misses
- retry counts
- total wall-clock duration
Diagnostics should be:
- available programmatically from the run object
- attached to test metadata for reporters
- easy to persist as artifacts in CI
Built-in harnesses should populate this data as completely as possible. The
normalized contract should assume these diagnostics exist, with empty/default
values only when the underlying runtime truly cannot expose them.
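A diagnostics shape covering the list above might look like the following; every field name here is an assumption rather than a finalized contract:

```typescript
// Sketch of a per-run diagnostics summary matching the list above.
// Optional fields cover runtimes that cannot expose the data.
type UsageSummary = {
  model?: string;
  provider?: string;
  inputTokens: number;
  outputTokens: number;
  reasoningTokens?: number;
  totalTokens: number;
  estimatedCostUsd?: number;
  toolCallCount: number;
  cacheHits: number;
  cacheMisses: number;
  retries: number;
  durationMs: number; // total wall-clock duration
};

// Example of what a harness might populate for one run:
const usage: UsageSummary = {
  model: "gpt-4o-mini",
  provider: "openai",
  inputTokens: 812,
  outputTokens: 164,
  totalTokens: 976,
  toolCallCount: 3,
  cacheHits: 2,
  cacheMisses: 1,
  retries: 0,
  durationMs: 4230,
};

// Diagnostics are plain data, so "available programmatically" and
// "easy to persist as artifacts" are the same property:
console.log(usage.totalTokens === usage.inputTokens + usage.outputTokens); // → true
```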
Execution Trace as a First-Class Artifact
We should define a normalized execution trace that built-in harnesses emit as a
flattened provider session/conversation.
At minimum the trace should support:
- multiple user messages when they exist
- assistant messages
- tool calls and tool results
- final output access
- timing data
- replay metadata
- provider/model metadata
- usage analytics
Without this contract, every feature above turns into a format-conversion
problem.
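One way to model a flattened, JSON-serializable trace is a discriminated union of entry kinds. The shapes below are a sketch under the requirements above, not a final schema:

```typescript
// Sketch of a flattened provider session as a discriminated union.
// Entry kinds and field names are illustrative assumptions.
type TraceEntry =
  | { kind: "user_message"; content: string }
  | { kind: "assistant_message"; content: string }
  | { kind: "tool_call"; name: string; input: unknown; startedAt: number }
  | { kind: "tool_result"; name: string; output: unknown; replayed: boolean };

type NormalizedTrace = {
  provider?: string;
  model?: string;
  entries: TraceEntry[];
};

const trace: NormalizedTrace = {
  provider: "openai",
  model: "gpt-4o-mini",
  entries: [
    { kind: "user_message", content: "Deploy the latest release" },
    { kind: "tool_call", name: "search", input: { q: "release" }, startedAt: 0 },
    { kind: "tool_result", name: "search", output: ["v1.2.3"], replayed: true },
    { kind: "assistant_message", content: "Deployed v1.2.3 to production." },
  ],
};

// A JSON round-trip must be lossless for plain-data traces:
const roundTripped: NormalizedTrace = JSON.parse(JSON.stringify(trace));
console.log(roundTripped.entries.length); // → 4
```

A flat entry list keeps ordering, timing, and replay metadata in one place, which is what makes reporter output and persistence straightforward.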
DX Priorities
DX is the highest priority for this overhaul.
The design should optimize for the day-to-day authoring loop:
- create a suite quickly
- use a built-in scaffold for the target harness
- run tests and understand what is happening while they execute
- inspect failures without adding temporary logging
- rerun quickly using cached tool outputs when appropriate
- understand cost and tool behavior immediately
Required DX outcomes:
- one obvious way to start a new suite
- starter examples for ai-sdk and pi-ai
- live, readable progress reporting during execution
- failure output that shows the relevant trace and diagnostics, not just a score
- watch-mode friendly output that does not flood the terminal
- clear repository conventions for fixtures, recordings, and artifacts
Potential UX surfaces:
- a create vitest-evals or init scaffold
- built-in directory conventions such as .vitest-evals/recordings
- first-party helpers for judge assertions and tool-call assertions that still
feel like normal Vitest expect(...)
- summary tables for usage and tool activity after a suite finishes
- optional machine-readable output for CI ingestion
Reporter Changes
The reporter needs to evolve from score-first output to run-first output.
It should support:
- Vitest-like streaming progress as tests run
- per-test summaries that include pass/fail, duration, core usage analytics, and
tool count
- named judge results reported as per-case sub-results rather than one collapsed
aggregate score
- expanded failure sections with the most relevant trace/session data
- replay/cache indicators
- compact output by default
- a higher-verbosity mode that can show each LLM call, each message sent to it,
each tool call, and each assistant response with clean formatting
Judge execution internals should not dominate normal correctness reporting, but
we should preserve a separate mechanism to understand judge cost/diagnostics when
needed.
Migration Strategy
This is a breaking change and should be treated like one.
Recommended approach:
Phase 1: Add the New Harness Model
- add the new harness contract
- ship ai-sdk and pi-ai scaffolds
- add reporter support for diagnostics and traces
- add VCR support for tools
- ship the new describeEval-style API around harnesses and judges
Phase 2: Reposition Existing APIs
- update docs and examples to make harness-first suites the default
- remove scorer-first framing from the main product story
Phase 3: Simplify and Remove Redundancy
- remove or de-emphasize APIs that only make sense for the older model
Acceptance Criteria
- A user can scaffold a working ai-sdk suite without manually adapting tool
  traces.
- A user can scaffold a working pi-ai suite without inventing custom
  instrumentation.
- A built-in harness returns normalized run artifacts that support assertions,
reporting, and persistence.
- A user can wire an existing agent into a built-in harness by supplying an
agent/app factory plus optional custom run/output adapters without manually
normalizing traces.
- The API distinguishes run.output from the normalized session trace so
  application assertions and framework reporting can evolve independently.
- The canonical normalized session is JSON-serializable plain data and supports
helper APIs for common access patterns.
- Tool VCR recording/replay works for opt-in tools and stores data in the repo.
- Automatic and strict replay modes are supported globally in Vitest config.
- Usage diagnostics are shown for every test run when available.
- The reporter makes it easy to understand what happened during a run without
adding debug logging, and can show a full trace in verbose mode.
- Suite-level judges run automatically for every case and are reported
individually.
- Each case executes the harness exactly once and can optionally add additional
explicit assertions in a test callback.
Open Questions
- What is the smallest clean suite API that balances automatic judges, optional
test callbacks, and straightforward Vitest ergonomics?
- What exact helper surface should ship in v1 for tool-call assertions and
explicit judge assertions?
- What is the exact persistent layout for recordings and artifacts by default?
- How should sanitization hooks be configured globally versus per tool?
- Do we eventually rename the package, or keep vitest-evals even as the
  product becomes integration-test-first?
Initial Implementation Plan
- Design the normalized HarnessRun and diagnostics contracts.
- Implement the harness lifecycle and reporter plumbing in the core package.
- Add built-in ai-sdk and pi-ai harness scaffolds.
- Add tool VCR recording/replay with repository-local storage.
- Add matcher/helpers for common assertions on runs and tool calls.
- Rewrite docs and examples around the new integration-test-first model.
Why This Matters
If we get this right, vitest-evals stops feeling like a thin scoring utility
for LLM outputs and starts feeling like the obvious way to test agentic systems
inside a normal Vitest workflow.