RFC: Rebuild vitest-evals Around Harness-First Integration Testing
Summary
vitest-evals should move from a scorer-first eval framework to a harness-first
integration testing framework for AI applications.
This overhaul is primarily for application teams writing agent integration tests
inside their own repositories. The framework still needs to be extensible
because advanced users may work across multiple harnesses, but end users should
not be expected to write glue code for common harnesses themselves.
Today the core model is:
- define data
- run a task(input)
- return string | TaskResult
- apply one or more scorers
- compare the average score to a threshold
That model works for qualitative evals and output scoring, but it is too narrow
for the kind of test suites we want to build next. The next version of
vitest-evals should treat harnesses, execution traces, tool behavior, replay,
and diagnostics as first-class concepts. Traditional integration tests should be
the default authoring model. Scorers should remain available, but they should no
longer be the center of the product.
Problem
The current API and reporter are optimized for "did this output score well?"
rather than "did this agentic system behave correctly end-to-end?"
That creates a few hard constraints:
- Harness integrations are ad hoc. The AI SDK support is an example transform,
not a first-class runtime contract.
- The task contract is too small. string | TaskResult does not model messages,
steps, intermediate state, usage, trace artifacts, retries, or cache/replay
behavior well.
- Tool support is evaluation-oriented. We can score tool usage, but we do not
have built-in primitives for recording, replaying, diagnosing, or asserting on
tool behavior as part of a normal integration test.
- Reporting is score-first. It does not surface rich execution diagnostics while
tests are running.
- DX is still too manual. Users have to adapt provider outputs themselves,
invent conventions for traces and fixtures, and build their own local replay
strategy.
Goals
- Make built-in harnesses a product surface, not an example.
- Bundle first-class scaffolds for ai-sdk and pi-ai.
- Keep harness adapters thin. Core should own the normalized data model,
replay/VCR policy, reporting, helpers, and judge model.
- Shift the primary authoring model toward traditional integration tests built on
top of Vitest assertions and helpers.
- Make execution artifacts first-class:
tool calls, steps, messages, usage, timings, errors, and cached/replayed data.
- Add built-in VCR-style replay support for tools, with opt-in per tool and
repository-local storage.
- Emit rich usage diagnostics with every test run, including token usage, model
usage, tool call details, durations, and cache hit/miss data.
- Make the runtime and reporter feel excellent during authoring:
clear progress, useful failure output, watch mode ergonomics, and easy fixture
setup.
- Require the normalized run/session data to be JSON-serializable so it can be
persisted, reported, and attached as artifacts without custom serializers.
- Preserve a path for scorer-based evals, but reposition them as one capability
inside a broader testing framework.
Non-Goals
- Do not optimize the next major version primarily for qualitative A/B ranking
workflows.
- Do not try to support every model SDK or orchestration library in the first
rollout.
- Do not replace Vitest. We should continue to feel like a focused layer on top
of Vitest rather than a separate test runner.
- Do not require VCR/replay for every tool. Some tools must remain live or
custom-managed.
Proposed Direction
1. Introduce a Harness-First Core Contract
The central abstraction should be a harness, not a scorer.
A harness is responsible for:
- running the system under test
- adapting the underlying runtime into a normalized session/run shape
- reporting usage analytics and session data in that normalized shape
- adapting tools so core replay/VCR behavior can be applied when configured
Illustrative shape:
type HarnessRun = {
  session: NormalizedSession;
  output?: unknown;
  usage: UsageSummary;
  timings?: TimingSummary;
  artifacts?: Record<string, unknown>;
  errors: Array<Record<string, unknown>>;
};

type Harness = {
  name: string;
  run(input: unknown, context: HarnessContext): Promise<HarnessRun>;
};
This contract should replace the current assumption that a test is fundamentally
input -> string result -> scorer. The run result needs to be large enough to
support assertions, reporting, persistence, and replay without every user
reinventing adapters.
The canonical source of truth should be a normalized provider session object.
That session should be plain JSON-serializable data. Higher-level typed helpers
can sit on top of it, but the stored and reported form should be plain data.
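To make the contract concrete, here is a minimal sketch of a harness satisfying the illustrative shape above, using a toy echo agent in place of a real system under test. Every type and field name is an assumption taken from this RFC's sketches, not a shipped API:

```typescript
// Minimal sketch of a custom harness implementing the illustrative
// contract above. NormalizedSession is deliberately bare here; the
// trace shape is discussed later in this RFC.
type NormalizedSession = { messages: Array<Record<string, unknown>> };
type UsageSummary = { inputTokens: number; outputTokens: number };
type HarnessContext = Record<string, unknown>;

type HarnessRun = {
  session: NormalizedSession;
  output?: unknown;
  usage: UsageSummary;
  errors: Array<Record<string, unknown>>;
};

type Harness = {
  name: string;
  run(input: unknown, context: HarnessContext): Promise<HarnessRun>;
};

// A toy "echo agent" standing in for a real system under test.
const echoHarness: Harness = {
  name: "echo",
  async run(input) {
    const text = `echo: ${String(input)}`;
    return {
      // The harness owns normalization: it returns plain,
      // JSON-serializable session data, not provider objects.
      session: {
        messages: [
          { role: "user", content: String(input) },
          { role: "assistant", content: text },
        ],
      },
      output: text,
      usage: { inputTokens: 0, outputTokens: 0 },
      errors: [],
    };
  },
};

echoHarness.run("hi", {}).then((run) => console.log(run.output)); // → "echo: hi"
```

Note that the returned run is already plain data, so it can be persisted or attached as an artifact without a custom serializer.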
2. Make the Contract Break Explicit
The current primary contract is too small:
TaskFn = (input: string) => Promise<string | TaskResult>
TaskResult = { result: string; toolCalls?: ToolCall[] }
The next major version should stop treating that as the main extension point.
Instead:
- harnesses become the primary integration contract
- normalized session/run artifacts become the primary output contract
- scorer inputs should be derived from normalized run data, not from ad hoc task
return shapes
- the session conversation becomes mandatory and complete enough to support
tool calls, messages, output assertions, analytics, and replay/reporting
- tool calls should be normalized as part of that session model, with helpers to
query them ergonomically
That means we should define richer normalized types for at least:
- NormalizedSession
- HarnessRun
- ToolCallRecord
- UsageSummary
- TimingSummary
- reporter-facing test metadata
Compatibility can still exist, but it should be implemented as an adapter layer
from the legacy task/scorer model into the new harness/run model.
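As one illustration of what "normalized tool calls with ergonomic query helpers" could mean, here is a hypothetical ToolCallRecord shape and a toolCalls() helper. All field names are assumptions, not a finalized schema:

```typescript
// Sketch of a normalized tool-call record plus a query helper.
// Field names are illustrative assumptions, not a committed schema.
type ToolCallRecord = {
  name: string;
  input: unknown;
  output?: unknown;
  durationMs?: number;
  replayed?: boolean; // true when served from the VCR cache
  error?: Record<string, unknown>;
};

type NormalizedSession = {
  messages: Array<{ role: string; content?: unknown }>;
  toolCalls: ToolCallRecord[];
};

// Ergonomic query helper: all calls, optionally filtered by tool name.
function toolCalls(session: NormalizedSession, name?: string): ToolCallRecord[] {
  return name
    ? session.toolCalls.filter((call) => call.name === name)
    : session.toolCalls;
}

const session: NormalizedSession = {
  messages: [],
  toolCalls: [
    { name: "search", input: { q: "latest release" } },
    { name: "deploy", input: { target: "production" } },
  ],
};

console.log(toolCalls(session, "search").length); // → 1
```

A helper like this is what lets assertions such as `expect(toolCalls(session)).toContainEqual(...)` stay ordinary Vitest code rather than a scorer.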
3. Bundle First-Class Harnesses
We should ship built-in harness scaffolds for:
- ai-sdk
- pi-ai
These should not just be examples in docs. They should be supported entry points
with:
- starter scaffolding
- standard session normalization
- standard tool adaptation hooks for core VCR behavior
- standard usage accounting
- standard reporter integration
Packaging direction:
- separate first-party packages per harness, for example @vitest-evals/harness-ai-sdk
- those packages should remain thin and should not invent harness-specific
behavior outside of adapting the harness into the core contract
4. Redefine the Test Authoring Model
The default style should look like integration testing, not judge-style evals.
Each suite is bound to exactly one explicit harness adapter.
Illustrative direction:
describeEval("deploy agent", {
  harness: aiSdkHarness({ app: createDeployAgent() }),
  judges: [
    factualityJudge({
      name: "final output factuality",
      threshold: 0.8,
    }),
  ],
  data: async () => [
    {
      name: "uses search before choosing a deployment target",
      input: "Deploy the latest release to production",
    },
  ],
  test: async ({ session, run }) => {
    expect(run.output).toContain("deployed");
    expect(toolCalls(session)).toContainEqual(
      expect.objectContaining({
        name: "search",
      }),
    );
    expect(run.output).toSatisfyJudge(customDomainJudge());
  },
});
The exact API shape is still open, but the design constraints are not:
- assertions should compose naturally with Vitest
- users should not be forced into a scorer mental model
- suites should be able to define automatic judges that run for every case
- suite-level judges are fixed for the suite and apply to every case
- each case should execute the harness exactly once, with the same run/session
shared by automatic judges and any optional test callback
- test callbacks should receive the session/run data and be optional
- test cases should be able to assert on the full run/session, not just the
final text
- test cases remain input-driven by default; case-specific assertions live in
the optional test callback rather than in an ever-growing case schema
4a. Clarify How Existing Agents Plug Into A Harness
A harness is the runtime adapter for the system under test. It is not just a
helper that prepares judge input.
For application authors, the intended contract should be:
- bind exactly one harness to the suite
- pass the existing app/agent or an agent factory into that harness
- let the harness execute the normal runtime entrypoint for each case
- let the harness inject instrumented tools, runtime hooks, and replay/VCR
behavior
- let the harness return normalized run/session artifacts for assertions,
judges, and reporting
That means the harness replaces task as the primary runtime extension point.
Users bringing an existing pi-ai or ai-sdk app should only need to supply:
- the app/agent instance, or a factory that creates it per test
- the normal execution entrypoint for one test case
- optional output mapping when the app returns a domain object rather than a
final assistant string
Built-in harnesses should own the rest:
- runtime instrumentation
- session normalization
- tool call capture
- usage/timing extraction
- replay/VCR integration
- reporter-facing artifacts
The spec should also distinguish between:
run.output: the application-facing result the test author wants to assert on
- the normalized session trace: the canonical record used for reporting, tool
assertions, replay metadata, and generic judges
This distinction matters because many real agents do not naturally return a
single final string. They may return a domain object such as
{ status, invoiceId } or { answer, citations }. The harness should preserve
that value in run.output while also normalizing the assistant/session trace
separately.
Built-in harnesses should support two authoring levels:
- a zero-glue path for conventional apps, such as
piAiHarness({ createAgent: () => createRefundAgent() })
- an escape hatch for custom entrypoints and output mapping, such as providing a
custom run(...) function and output(...) selector without re-implementing
normalization
There is one important integration constraint: the application still needs a
supported seam for injection or observation. That can be dependency injection
for tools/model clients, framework event hooks, or a wrapper around execution
that the harness can observe. If an agent closes over global tools and model
clients with no injection point and no events, the harness cannot reliably
capture traces or apply replay behavior.
Illustrative direction:
describeEval("refund agent", {
  harness: piAiHarness({
    createAgent: () => createRefundAgent(),
    run: ({ agent, input, context }) =>
      agent.run(input, {
        tools: context.tools,
        events: context.events,
      }),
  }),
  data: async () => [{ input: "Refund invoice inv_123" }],
  test: async ({ run, session }) => {
    expect(run.output).toMatchObject({ status: "approved" });
    expect(toolCalls(session)).toContainEqual(
      expect.objectContaining({ name: "lookupInvoice" }),
    );
  },
});
The important behavior is:
- the user passes their existing agent through the harness
- the harness supplies the instrumented runtime pieces
- the agent executes normally
- the harness returns both the domain result and the normalized trace
5. Keep Scorers, but Demote Them
Scorers and judges still matter for some use cases:
- qualitative checks
- rubric-based assertions
- domain-specific LLM-as-a-judge workflows
But they should become optional helpers that operate on normalized session/run
data, instead of defining the whole execution model. The framework should be
able to support:
- automatic suite-level judges that always run
- explicit judge assertions inside tests, such as toSatisfyJudge(...)
- pure assertion-based tests plus helper matchers
The long-term product direction is real-world integration testing, not simple
score-first eval suites. We will likely still ship a stock factuality-style
judge as an example or baseline, but users should be able to swap prompts or
replace that judge entirely because most real suites are domain-specific.
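One way the "judges operate on normalized run data" idea could be sketched is a small judge contract. The shapes below are illustrative assumptions, and the toy judge does a deterministic string check only so the example is self-contained; a real judge would call an LLM:

```typescript
// Sketch of a judge that operates on normalized run data rather than
// a raw string. Names and shapes are assumptions from this RFC.
type JudgeResult = {
  name: string;
  score: number;
  pass: boolean;
  rationale?: string;
};

type Judge = {
  name: string;
  evaluate(run: { output?: unknown }): Promise<JudgeResult>;
};

// Toy deterministic "judge": passes when the output mentions a phrase.
// A real judge would prompt a model and parse a rubric-based verdict.
function containsJudge(name: string, needle: string, threshold = 1): Judge {
  return {
    name,
    async evaluate(run) {
      const score = String(run.output ?? "").includes(needle) ? 1 : 0;
      return { name, score, pass: score >= threshold };
    },
  };
}

containsJudge("mentions deploy", "deployed")
  .evaluate({ output: "Release deployed to production." })
  .then((result) => console.log(result.pass)); // → true
```

Because the judge receives the run rather than a bare string, swapping in a domain-specific judge does not change the suite's execution model.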
Built-In Requirements
VCR for Tools
Built-in harness support should include VCR-style recording and replay for tool
calls.
Requirements:
- opt-in at the tool level
- global replay policy configured at the Vitest config level
- repository-local cache storage with configurable path
- deterministic cache keying strategy based on tool name and normalized input
parameters
- ability to inspect stored recordings
- replay metadata surfaced in test output
- one recording file per invocation
- recordings grouped by tool name
- standardized recorded metadata including at least write time, input params,
output, and replay-relevant metadata
- automatic mode that replays when present and falls back to a live call when a
recording is missing, then writes the fresh result back to cache
- strict replay mode that errors on missing recordings
- a sanitization/redaction hook before persistence
- clean handling for non-deterministic tools and cache busting
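One plausible realization of the deterministic cache-keying requirement is to hash the tool name together with a key-sorted serialization of the input parameters, so property order never changes the key. This is a sketch of the approach, not a committed format:

```typescript
import { createHash } from "node:crypto";

// Stable serialization: objects are emitted with sorted keys so that
// { a, b } and { b, a } produce identical strings.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => a.localeCompare(b))
      .map(([k, v]) => `${JSON.stringify(k)}:${stableStringify(v)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

// Deterministic recording key: tool name plus a truncated digest of
// the normalized input parameters.
function recordingKey(toolName: string, params: unknown): string {
  const digest = createHash("sha256")
    .update(`${toolName}\n${stableStringify(params)}`)
    .digest("hex");
  return `${toolName}-${digest.slice(0, 16)}`;
}

// Property order does not affect the key:
const a = recordingKey("search", { q: "deploy", limit: 5 });
const b = recordingKey("search", { limit: 5, q: "deploy" });
console.log(a === b); // → true
```

Keeping the tool name as a visible prefix also supports the "recordings grouped by tool name" requirement without parsing file contents.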
Example use cases:
- cache web search results in the repo
- avoid repeated calls to expensive or rate-limited APIs
- stabilize integration suites in CI
- make local iteration fast without mocking the whole system
This should be a core primitive, not left to one-off userland wrappers.
Open implementation note:
- prefer config-driven per-tool policy with minimal boilerplate
- if a harness makes some wrapper unavoidable, keep that wrapper extremely thin
- sanitization hook scope and override behavior should be specified in a deeper
follow-up VCR design
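A config-driven policy might look roughly like the following object inside vitest.config.ts. None of these option names exist today; they only illustrate the "minimal boilerplate, per-tool opt-in" direction:

```typescript
// Hypothetical shape of a global replay policy. Every option name
// below is an assumption sketched for this RFC, not a real setting.
const evalsConfig = {
  recordings: {
    dir: ".vitest-evals/recordings",          // repository-local storage
    mode: process.env.CI ? "strict" : "auto", // strict errors on missing recordings
    tools: {
      webSearch: { record: true },            // opt-in per tool
      currentTime: { record: false },         // non-deterministic: stays live
    },
    sanitize: (recording: unknown) => recording, // redaction hook before persistence
  },
};

console.log(evalsConfig.recordings.mode);
```

The point of the sketch is that per-tool policy lives in one config block, so individual tests never need wrapper boilerplate.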
Usage Diagnostics on Every Run
Every test run should produce structured diagnostics for both humans and tools.
Minimum data we should capture when available:
- model name and provider
- input tokens
- output tokens
- reasoning tokens if exposed
- total tokens
- estimated cost if derivable
- tool call count
- per-tool call details
- per-step durations
- cache hits/misses
- retry counts
- total wall-clock duration
Diagnostics should be:
- available programmatically from the run object
- attached to test metadata for reporters
- easy to persist as artifacts in CI
Built-in harnesses should populate this data as completely as possible. The
normalized contract should assume these diagnostics exist, with empty/default
values only when the underlying runtime truly cannot expose them.
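A diagnostics shape covering the list above might look like the following; every field name here is an assumption rather than a finalized contract:

```typescript
// Sketch of a per-run diagnostics summary matching the list above.
// Optional fields cover runtimes that cannot expose the data.
type UsageSummary = {
  model?: string;
  provider?: string;
  inputTokens: number;
  outputTokens: number;
  reasoningTokens?: number;
  totalTokens: number;
  estimatedCostUsd?: number;
  toolCallCount: number;
  cacheHits: number;
  cacheMisses: number;
  retries: number;
  durationMs: number; // total wall-clock duration
};

// Example of what a harness might populate for one run:
const usage: UsageSummary = {
  model: "gpt-4o-mini",
  provider: "openai",
  inputTokens: 812,
  outputTokens: 164,
  totalTokens: 976,
  toolCallCount: 3,
  cacheHits: 2,
  cacheMisses: 1,
  retries: 0,
  durationMs: 4230,
};

// Diagnostics are plain data, so "available programmatically" and
// "easy to persist as artifacts" are the same property:
console.log(usage.totalTokens === usage.inputTokens + usage.outputTokens); // → true
```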
Execution Trace as a First-Class Artifact
We should define a normalized execution trace that built-in harnesses emit as a
flattened provider session/conversation.
At minimum the trace should support:
- multiple user messages when they exist
- assistant messages
- tool calls and tool results
- final output access
- timing data
- replay metadata
- provider/model metadata
- usage analytics
Without this contract, every feature above turns into a format-conversion
problem.
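One way to model a flattened, JSON-serializable trace is a discriminated union of entry kinds. The shapes below are a sketch under the requirements above, not a final schema:

```typescript
// Sketch of a flattened provider session as a discriminated union.
// Entry kinds and field names are illustrative assumptions.
type TraceEntry =
  | { kind: "user_message"; content: string }
  | { kind: "assistant_message"; content: string }
  | { kind: "tool_call"; name: string; input: unknown; startedAt: number }
  | { kind: "tool_result"; name: string; output: unknown; replayed: boolean };

type NormalizedTrace = {
  provider?: string;
  model?: string;
  entries: TraceEntry[];
};

const trace: NormalizedTrace = {
  provider: "openai",
  model: "gpt-4o-mini",
  entries: [
    { kind: "user_message", content: "Deploy the latest release" },
    { kind: "tool_call", name: "search", input: { q: "release" }, startedAt: 0 },
    { kind: "tool_result", name: "search", output: ["v1.2.3"], replayed: true },
    { kind: "assistant_message", content: "Deployed v1.2.3 to production." },
  ],
};

// A JSON round-trip must be lossless for plain-data traces:
const roundTripped: NormalizedTrace = JSON.parse(JSON.stringify(trace));
console.log(roundTripped.entries.length); // → 4
```

A flat entry list keeps ordering, timing, and replay metadata in one place, which is what makes reporter output and persistence straightforward.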
DX Priorities
DX is the highest priority for this overhaul.
The design should optimize for the day-to-day authoring loop:
- create a suite quickly
- use a built-in scaffold for the target harness
- run tests and understand what is happening while they execute
- inspect failures without adding temporary logging
- rerun quickly using cached tool outputs when appropriate
- understand cost and tool behavior immediately
Required DX outcomes:
- one obvious way to start a new suite
- starter examples for ai-sdk and pi-ai
- live, readable progress reporting during execution
- failure output that shows the relevant trace and diagnostics, not just a score
- watch-mode friendly output that does not flood the terminal
- clear repository conventions for fixtures, recordings, and artifacts
Potential UX surfaces:
- a create vitest-evals or init scaffold
- built-in directory conventions such as .vitest-evals/recordings
- first-party helpers for judge assertions and tool-call assertions that still
feel like normal Vitest expect(...)
- summary tables for usage and tool activity after a suite finishes
- optional machine-readable output for CI ingestion
Reporter Changes
The reporter needs to evolve from score-first output to run-first output.
It should support:
- Vitest-like streaming progress as tests run
- per-test summaries that include pass/fail, duration, core usage analytics, and
tool count
- named judge results reported as per-case sub-results rather than one collapsed
aggregate score
- expanded failure sections with the most relevant trace/session data
- replay/cache indicators
- compact output by default
- a higher-verbosity mode that can show each LLM call, each message sent to it,
each tool call, and each assistant response with clean formatting
Judge execution internals should not dominate normal correctness reporting, but
we should preserve a separate mechanism to understand judge cost/diagnostics when
needed.
Migration Strategy
This is a breaking change and should be treated like one.
Recommended approach:
Phase 1: Add the New Harness Model
- add the new harness contract
- ship ai-sdk and pi-ai scaffolds
- add reporter support for diagnostics and traces
- add VCR support for tools
- ship the new describeEval-style API around harnesses and judges
Phase 2: Reposition Existing APIs
- update docs and examples to make harness-first suites the default
- remove scorer-first framing from the main product story
Phase 3: Simplify and Remove Redundancy
- remove or de-emphasize APIs that only make sense for the older model
Acceptance Criteria
- A user can scaffold a working ai-sdk suite without manually adapting tool
  traces.
- A user can scaffold a working pi-ai suite without inventing custom
  instrumentation.
- A built-in harness returns normalized run artifacts that support assertions,
reporting, and persistence.
- A user can wire an existing agent into a built-in harness by supplying an
agent/app factory plus optional custom run/output adapters without manually
normalizing traces.
- The API distinguishes run.output from the normalized session trace so
  application assertions and framework reporting can evolve independently.
- The canonical normalized session is JSON-serializable plain data and supports
helper APIs for common access patterns.
- Tool VCR recording/replay works for opt-in tools and stores data in the repo.
- Automatic and strict replay modes are supported globally in Vitest config.
- Usage diagnostics are shown for every test run when available.
- The reporter makes it easy to understand what happened during a run without
adding debug logging, and can show a full trace in verbose mode.
- Suite-level judges run automatically for every case and are reported
individually.
- Each case executes the harness exactly once and can optionally add additional
explicit assertions in a test callback.
Open Questions
- What is the smallest clean suite API that balances automatic judges, optional
test callbacks, and straightforward Vitest ergonomics?
- What exact helper surface should ship in v1 for tool-call assertions and
explicit judge assertions?
- What is the exact persistent layout for recordings and artifacts by default?
- How should sanitization hooks be configured globally versus per tool?
- Do we eventually rename the package, or keep vitest-evals even as the
  product becomes integration-test-first?
Initial Implementation Plan
- Design the normalized HarnessRun and diagnostics contracts.
- Implement the harness lifecycle and reporter plumbing in the core package.
- Add built-in ai-sdk and pi-ai harness scaffolds.
- Add tool VCR recording/replay with repository-local storage.
- Add matcher/helpers for common assertions on runs and tool calls.
- Rewrite docs and examples around the new integration-test-first model.
Why This Matters
If we get this right, vitest-evals stops feeling like a thin scoring utility
for LLM outputs and starts feeling like the obvious way to test agentic systems
inside a normal Vitest workflow.