MANDATORY READING before ANY code changes:
- Read the existing code structure in `src/` to understand patterns
- Check `src/index.ts` for the core framework architecture
- Review test files for testing patterns
- All scorer implementations MUST follow established patterns
- Architecture: MUST read `docs/architecture.md` for system design
- Testing: MUST read `docs/testing.md` for test requirements
- Development: MUST read `docs/development-guide.md` for workflow
- Examples: check `docs/scorer-examples.md` for implementation patterns
vitest-evals is a Vitest-based evaluation framework for testing language model outputs with flexible scoring functions. It provides a structured way to evaluate AI model outputs against expected results.
```
vitest-evals/
├── src/
│   ├── index.ts                       # Main entry point, types, and core framework
│   ├── reporter.ts                    # Custom Vitest reporter
│   ├── scorers/                       # Scorer implementations
│   │   ├── index.ts                   # Scorers export file
│   │   ├── toolCallScorer.ts          # Tool call evaluation scorer
│   │   └── toolCallScorer.test.ts     # Tool call scorer tests
│   ├── ai-sdk-integration.test.ts     # AI SDK integration example
│   ├── autoevals-compatibility.test.ts # Autoevals compatibility tests
│   ├── formatScores.test.ts           # Format scores tests
│   └── wrapText.test.ts               # Wrap text tests
├── docs/                              # Project documentation
│   ├── architecture.md                # System architecture overview
│   ├── testing.md                     # Testing standards and requirements
│   ├── development-guide.md           # Development workflow and tips
│   ├── scorer-examples.md             # Example scorer implementations
│   ├── custom-scorers.md              # Custom scorer examples
│   └── provider-transformations.md    # Provider tool call transformations
├── scripts/
│   └── craft-pre-release.sh           # Release preparation script
├── tsup.config.ts                     # Build configuration
├── tsconfig.json                      # TypeScript configuration
├── biome.json                         # Code formatter/linter config
└── package.json                       # Project dependencies and scripts
```
When making changes, consider these areas:
- `describeEval()` function: Main evaluation entry point for test suites
- `toEval` matcher: Vitest matcher for individual evaluations
- `TaskResult` handling: Supports `string` or `{ result, toolCalls }`
- `ScoreFn` interface: All scorers must implement this
- Async/sync support: Both scorer types supported
- Type definitions: All TypeScript interfaces defined here
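The flow `describeEval()` drives can be pictured as a simplified loop: run the task for each case, score the output, and compare the mean score to a threshold. This is purely illustrative — `runEval`, `Case`, and `Scorer` below are local stand-ins, not the framework's actual internals:

```typescript
type Case = { input: string; expected: string };
type Task = (input: string) => Promise<string>;
type Scorer = (args: { input: string; expected: string; output: string }) => Promise<{ score: number }>;

// Conceptual sketch: evaluate each case, average the scores,
// and pass if the mean meets the threshold.
async function runEval(cases: Case[], task: Task, scorer: Scorer, threshold: number) {
  const scores: number[] = [];
  for (const c of cases) {
    const output = await task(c.input);
    const { score } = await scorer({ input: c.input, expected: c.expected, output });
    scores.push(score);
  }
  const mean = scores.reduce((a, b) => a + b, 0) / Math.max(scores.length, 1);
  return { mean, passed: mean >= threshold };
}
```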
- Score display: Shows evaluation results
- Error reporting: Handles test failures
- Progress tracking: Visual feedback during tests
- `ToolCallScorer`: Evaluates tool/function call accuracy (the only built-in scorer)
- Flexible parameters: Supports various parameter names
- Type safety: Full TypeScript support
- Autoevals compatibility: Works with existing scorers
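As an illustration of what tool-call accuracy scoring can compute, here is a simplified stand-in — not `ToolCallScorer`'s actual algorithm, just the fraction of expected tool names that appear among the calls made:

```typescript
type ToolCall = { name: string; arguments?: Record<string, unknown> };

// Simplified accuracy metric: how many of the expected tools were called.
// If no tools are expected, only an empty call list scores 1.
function toolCallAccuracy(expected: string[], calls: ToolCall[]): number {
  if (expected.length === 0) return calls.length === 0 ? 1 : 0;
  const called = new Set(calls.map((c) => c.name));
  const hits = expected.filter((name) => called.has(name)).length;
  return hits / expected.length;
}
```

The real scorer also has to consider call arguments and ordering; this sketch only checks presence.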
- Strict mode: All code must pass strict TypeScript checks
- Explicit types: No implicit any types
- Interface-driven: Define interfaces before implementation
- All scorers MUST have tests: No exceptions
- Test edge cases: Error conditions, async behavior
- Integration tests: Test with actual AI outputs
- Run tests: `pnpm test` must pass before completion
- Lint check: `pnpm run lint` must pass
- Type check: `pnpm run typecheck` must pass
- Format: `pnpm run format` for consistent style
```shell
# Development
pnpm test              # Run all tests
pnpm run build         # Build the package
pnpm run lint          # Check code style
pnpm run format        # Auto-format code
pnpm run typecheck     # Verify TypeScript types

# Before completing ANY task
pnpm run lint && pnpm run typecheck && pnpm test
```
- Implement the `Scorer` interface
- Support both sync and async evaluation
- Handle errors gracefully
- Return normalized scores (0-1 range typical)
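A minimal scorer following those rules might look like this sketch; the option and score shapes are restated locally as assumptions, since the real interfaces live in `src/index.ts`:

```typescript
// Hypothetical shapes mirroring the scorer contract described above.
interface ScorerOptions {
  input: string;
  expected?: string;
  output: string;
}

interface Score {
  score: number; // normalized to the 0-1 range
  metadata?: { rationale?: string };
}

// An exact-match scorer: async to satisfy both call styles,
// with errors reported as a 0 score rather than thrown.
async function exactMatchScorer(opts: ScorerOptions): Promise<Score> {
  try {
    const pass = opts.expected !== undefined && opts.output.trim() === opts.expected.trim();
    return { score: pass ? 1 : 0, metadata: { rationale: pass ? "exact match" : "mismatch" } };
  } catch (err) {
    return { score: 0, metadata: { rationale: `scorer error: ${String(err)}` } };
  }
}
```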
```typescript
type TaskResult = string | { result: string; toolCalls?: any[] }
```
- Always handle both formats
- Extract the result string appropriately
- Pass tool calls to specialized scorers
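One way to satisfy "handle both formats" is a small normalizer that collapses the union into a single shape scorers can rely on. This is a sketch; the exact `ToolCall` shape here is an assumption:

```typescript
type ToolCall = { name: string; arguments?: Record<string, unknown> };
type TaskResult = string | { result: string; toolCalls?: ToolCall[] };

// Collapse both TaskResult variants into one structure:
// a bare string becomes a result with no tool calls.
function normalizeTaskResult(r: TaskResult): { result: string; toolCalls: ToolCall[] } {
  if (typeof r === "string") return { result: r, toolCalls: [] };
  return { result: r.result, toolCalls: r.toolCalls ?? [] };
}
```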
- Support multiple parameter names (e.g., `expected` or `expectedTools`)
- Use TypeScript generics for type safety
- Document parameter requirements
- Document all public APIs with JSDoc
- Include usage examples in comments
- Explain complex logic inline
- Keep examples current with API changes
- Document new scorers when added
- Update compatibility notes
- Core evaluation framework
- Custom Vitest reporter
- ToolCallScorer for function evaluation
- Autoevals library compatibility
- Flexible parameter naming
- Additional scorer implementations
- Enhanced error handling
- Performance optimizations
CRITICAL: Documentation must be kept up-to-date with code changes
- Update relevant docs when modifying code
- Add examples when creating new scorers
- Document breaking changes prominently
- Keep CLAUDE.md synchronized with project state
- Review existing code to understand patterns
- Write tests first for new features
- Implement following established patterns
- Verify with lint, typecheck, and tests
- Document changes in code and README
- Define the scorer interface extending `BaseScorerOptions` in `src/index.ts`
- Implement in `src/scorers/[name].ts` following camelCase naming
- Write comprehensive tests in `src/scorers/[name].test.ts`
- Export from `src/scorers/index.ts` and the main index
- Document usage in README
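A hypothetical `lengthScorer` walking through those steps might look like the sketch below. The `maxLength` parameter is illustrative, and `BaseScorerOptions` is restated locally — the real interface lives in `src/index.ts`:

```typescript
// Local restatement of the base options shape (assumption).
interface BaseScorerOptions {
  input: string;
  expected?: string;
  output: string;
}

// Hypothetical scorer-specific options extending the base.
interface LengthScorerOptions extends BaseScorerOptions {
  maxLength?: number; // illustrative parameter, defaults to 500
}

// Score 1 when the output fits the limit, decaying linearly to 0
// as the output overruns the limit by up to 100%.
async function lengthScorer(opts: LengthScorerOptions): Promise<{ score: number }> {
  const max = opts.maxLength ?? 500;
  const overrun = opts.output.length - max;
  const score = overrun <= 0 ? 1 : Math.max(0, 1 - overrun / max);
  return { score };
}
```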
```typescript
import { test, expect } from 'vitest'
import { YourScorer } from '../src/scorers/yourScorer'

test('scorer evaluates correctly', async () => {
  await expect('test input').toEval(
    'expected output',
    async (input) => 'test output',
    YourScorer,
    1.0
  )
})
```
This project uses pnpm (not npm). Always use pnpm commands.
Before marking ANY task complete:
- Code passes `pnpm run lint`
- Code passes `pnpm run typecheck`
- All tests pass with `pnpm test`
- New features have tests
- Documentation is updated
- Examples work correctly