Commit 8e30078 (authored by dcramer, with Claude and Copilot)

docs: add comprehensive project documentation and development guidelines (#21)

## Summary

- Added comprehensive CLAUDE.md development guidelines with mandatory pre-development reading requirements
- Created detailed documentation covering architecture, testing, development workflow, and scorer examples
- Established clear code standards, validation checklists, and common implementation patterns

## Test plan

- [x] Documentation files are properly formatted
- [x] All links and references are accurate
- [x] Code examples follow established patterns
- [x] Guidelines align with current codebase structure

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: Claude <[email protected]>
Co-authored-by: Copilot <[email protected]>

1 parent a9b1873, commit 8e30078. 8 files changed: 1731 additions & 3 deletions.

CLAUDE.md (205 additions, 0 deletions)
# vitest-evals Development Guidelines

## 🔴 CRITICAL: Pre-Development Requirements

**MANDATORY READING before ANY code changes:**
- Read existing code structure in `src/` to understand patterns
- Check `src/index.ts` for core framework architecture
- Review test files for testing patterns
- All scorer implementations MUST follow established patterns

### Required Documentation Review
- **Architecture**: MUST read `docs/architecture.md` for system design
- **Testing**: MUST read `docs/testing.md` for test requirements
- **Development**: MUST read `docs/development-guide.md` for workflow
- **Examples**: Check `docs/scorer-examples.md` for implementation patterns

## Repository Overview

vitest-evals is a Vitest-based evaluation framework for testing language model outputs with flexible scoring functions. It provides a structured way to evaluate AI model outputs against expected results.

## Repository Structure

```
vitest-evals/
├── src/
│   ├── index.ts                         # Main entry point, types, and core framework
│   ├── reporter.ts                      # Custom Vitest reporter
│   ├── scorers/                         # Scorer implementations
│   │   ├── index.ts                     # Scorers export file
│   │   ├── toolCallScorer.ts            # Tool call evaluation scorer
│   │   └── toolCallScorer.test.ts       # Tool call scorer tests
│   ├── ai-sdk-integration.test.ts       # AI SDK integration example
│   ├── autoevals-compatibility.test.ts  # Autoevals compatibility tests
│   ├── formatScores.test.ts             # Format scores tests
│   └── wrapText.test.ts                 # Wrap text tests
├── docs/                                # Project documentation
│   ├── architecture.md                  # System architecture overview
│   ├── testing.md                       # Testing standards and requirements
│   ├── development-guide.md             # Development workflow and tips
│   ├── scorer-examples.md               # Example scorer implementations
│   ├── custom-scorers.md                # Custom scorer examples
│   └── provider-transformations.md      # Provider tool call transformations
├── scripts/
│   └── craft-pre-release.sh             # Release preparation script
├── tsup.config.ts                       # Build configuration
├── tsconfig.json                        # TypeScript configuration
├── biome.json                           # Code formatter/linter config
└── package.json                         # Project dependencies and scripts
```
## Core Components Impact Analysis

When making changes, consider these areas:

### Framework Core (`src/index.ts`)
- **describeEval()** function: Main evaluation entry point for test suites
- **toEval** matcher: Vitest matcher for individual evaluations
- **TaskResult** handling: Supports a plain string or `{ result, toolCalls }`
- **ScoreFn** interface: All scorers must implement this
- **Async/sync support**: Both scorer types are supported
- **Type definitions**: All TypeScript interfaces are defined here

### Custom Reporter (`src/reporter.ts`)
- **Score display**: Shows evaluation results
- **Error reporting**: Handles test failures
- **Progress tracking**: Visual feedback during tests

### Scorer System (`src/scorers/`)
- **ToolCallScorer**: Evaluates tool/function call accuracy (currently the only built-in scorer)
- **Flexible parameters**: Supports various parameter names
- **Type safety**: Full TypeScript support
- **Autoevals compatibility**: Works with existing autoevals scorers

## 🔴 CRITICAL: Code Standards

### TypeScript Requirements
- **Strict mode**: All code must pass strict TypeScript checks
- **Explicit types**: No implicit `any` types
- **Interface-driven**: Define interfaces before implementation

### Testing Requirements
- **All scorers MUST have tests**: No exceptions
- **Test edge cases**: Error conditions, async behavior
- **Integration tests**: Test with actual AI outputs
- **Run tests**: `pnpm test` must pass before completion

### Code Quality
- **Lint check**: `pnpm run lint` must pass
- **Type check**: `pnpm run typecheck` must pass
- **Format**: `pnpm run format` for consistent style

## Key Commands

```bash
# Development
pnpm test            # Run all tests
pnpm run build       # Build the package
pnpm run lint        # Check code style
pnpm run format      # Auto-format code
pnpm run typecheck   # Verify TypeScript types

# Before completing ANY task
pnpm run lint && pnpm run typecheck && pnpm test
```

## Architecture Patterns

### Scorer Implementation
- Implement the `ScoreFn` interface
- Support both sync and async evaluation
- Handle errors gracefully
- Return normalized scores (typically in the 0-1 range)
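The bullets above can be sketched as a small scorer. This is an illustrative example, not a built-in: `exactMatchScorer` and the `ExactMatchOptions` interface are hypothetical names, and the `Score`/`BaseScorerOptions` shapes are repeated locally (simplified from the framework's types) so the snippet stands alone.

```typescript
// Simplified local copies of the framework's shapes, so this sketch is self-contained.
type Score = {
  score: number | null;
  metadata?: { rationale?: string };
};

interface BaseScorerOptions {
  input: string;
  output: string;
}

// Hypothetical options interface extending BaseScorerOptions.
interface ExactMatchOptions extends BaseScorerOptions {
  expected: string;
}

// Returns a normalized 0-1 score and handles errors gracefully:
// failures surface as a null score with a rationale instead of throwing.
const exactMatchScorer = (opts: ExactMatchOptions): Score => {
  try {
    const matches = opts.output.trim() === opts.expected.trim();
    return {
      score: matches ? 1 : 0,
      metadata: { rationale: matches ? "exact match" : "output differs from expected" },
    };
  } catch (err) {
    return { score: null, metadata: { rationale: `scorer error: ${String(err)}` } };
  }
};
```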
### TaskResult Handling

```typescript
type TaskResult = string | { result: string; toolCalls?: any[] }
```

- Always handle both formats
- Extract the result string appropriately
- Pass tool calls to specialized scorers
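One way to handle both formats is to normalize early. The helper name `normalizeTaskResult` is ours, not part of the framework API, and the `ToolCall` shape here is a simplified stand-in for the framework's richer type.

```typescript
// Simplified stand-in for the framework's ToolCall type.
type ToolCall = { name: string; arguments?: Record<string, any> };

type TaskResult = string | { result: string; toolCalls?: ToolCall[] };

// Collapse both TaskResult formats into one shape so downstream
// code (including tool-call scorers) never has to branch again.
function normalizeTaskResult(raw: TaskResult): { result: string; toolCalls: ToolCall[] } {
  if (typeof raw === "string") {
    return { result: raw, toolCalls: [] };
  }
  return { result: raw.result, toolCalls: raw.toolCalls ?? [] };
}
```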
### Parameter Flexibility
- Support multiple parameter names (e.g., `expected` or `expectedTools`)
- Use TypeScript generics for type safety
- Document parameter requirements
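A minimal sketch of the aliasing idea, assuming a scorer that wants a list of tool names: both parameter names are accepted, with the more specific one taking precedence. The option names here mirror the example in the bullet above; `resolveExpectedTools` is a hypothetical helper.

```typescript
// Options type accepting either parameter name for the same data.
interface FlexibleToolOptions {
  input: string;
  output: string;
  expected?: string[];       // generic name, autoevals-style
  expectedTools?: string[];  // specific name
}

// Prefer the specific parameter, fall back to the generic alias.
function resolveExpectedTools(opts: FlexibleToolOptions): string[] {
  return opts.expectedTools ?? opts.expected ?? [];
}
```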
## Documentation Requirements

### Code Documentation
- Document all public APIs with JSDoc
- Include usage examples in comments
- Explain complex logic inline

### README Updates
- Keep examples current with API changes
- Document new scorers when added
- Update compatibility notes

## Current Features

### Implemented
- Core evaluation framework
- Custom Vitest reporter
- ToolCallScorer for function-call evaluation
- Autoevals library compatibility
- Flexible parameter naming

### In Progress
- Additional scorer implementations
- Enhanced error handling
- Performance optimizations

## Documentation Maintenance

**CRITICAL**: Documentation must be kept up to date with code changes.
- Update relevant docs when modifying code
- Add examples when creating new scorers
- Document breaking changes prominently
- Keep CLAUDE.md synchronized with the project state

## Development Process

1. **Review existing code** to understand patterns
2. **Write tests first** for new features
3. **Implement** following established patterns
4. **Verify** with lint, typecheck, and tests
5. **Document** changes in code and README

## Common Patterns

### Creating a New Scorer
1. Define the scorer options interface extending `BaseScorerOptions` in `src/index.ts`
2. Implement in `src/scorers/[name].ts` following camelCase naming
3. Write comprehensive tests in `src/scorers/[name].test.ts`
4. Export from `src/scorers/index.ts` and the main index
5. Document usage in the README
### Testing Scorers

```typescript
import { expect, test } from 'vitest'
import { YourScorer } from '../src/scorers/yourScorer'

test('scorer evaluates correctly', async () => {
  await expect('test input').toEval(
    'expected output',
    async (input) => 'test output',
    YourScorer,
    1.0
  )
})
```

## Package Manager

This project uses **pnpm** (not npm). Always use pnpm commands.

## Validation Checklist

Before marking ANY task complete:
- [ ] Code passes `pnpm run lint`
- [ ] Code passes `pnpm run typecheck`
- [ ] All tests pass with `pnpm test`
- [ ] New features have tests
- [ ] Documentation is updated
- [ ] Examples work correctly

docs/architecture.md (126 additions, 0 deletions)

# vitest-evals Architecture

## Overview

vitest-evals is built on top of Vitest to provide a specialized testing framework for evaluating AI/LLM outputs. It extends Vitest's capabilities with custom scoring functions and reporting.
## Core Components

### 1. Evaluation Framework (`src/index.ts`)

The heart of the system provides two main APIs:

**describeEval()** - Creates test suites for batch evaluation:
- Accepts a data function, a task function, and scorers
- Runs multiple test cases automatically
- Integrates with Vitest's test runner

**toEval matcher** - Individual evaluation within tests:
- Extends Vitest's expect API
- Evaluates single input/output pairs
- Returns pass/fail based on a score threshold

```typescript
export function describeEval(
  name: string,
  options: {
    data: () => Promise<Array<{ input: string } & Record<string, any>>>;
    task: TaskFn;
    scorers: ScoreFn<any>[];
    threshold?: number;
  }
)
```
### 2. Scorer System

Scorers are the pluggable evaluation functions that determine output quality.

#### Scorer Interface

```typescript
type ScoreFn<TOptions extends BaseScorerOptions = BaseScorerOptions> = (
  opts: TOptions,
) => Promise<Score> | Score;

interface BaseScorerOptions {
  input: string;
  output: string;
  toolCalls?: ToolCall[];
}

type Score = {
  score: number | null;
  metadata?: {
    rationale?: string;
    output?: string;
  };
};
```
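An async scorer conforming to this shape might look like the sketch below. The types are repeated locally (with `toolCalls` and `metadata.output` omitted for brevity) so the snippet stands alone; `lengthRatioScorer` is an illustrative example, not a built-in.

```typescript
// Local, simplified copies of the interface above for a self-contained sketch.
interface BaseScorerOptions {
  input: string;
  output: string;
}

type Score = {
  score: number | null;
  metadata?: { rationale?: string };
};

interface LengthRatioOptions extends BaseScorerOptions {
  expected: string;
}

// Async ScoreFn returning a 0-1 score: how close the output length
// is to the expected length. A null score signals "not scorable".
const lengthRatioScorer = async (opts: LengthRatioOptions): Promise<Score> => {
  if (opts.expected.length === 0) {
    return { score: null, metadata: { rationale: "no expected text to compare against" } };
  }
  const ratio =
    Math.min(opts.output.length, opts.expected.length) /
    Math.max(opts.output.length, opts.expected.length);
  return { score: ratio, metadata: { rationale: `length ratio ${ratio.toFixed(2)}` } };
};
```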
#### Built-in Scorers
- **ToolCallScorer** (`src/scorers/toolCallScorer.ts`): Evaluates function/tool call accuracy
- More scorers can be added by implementing the `ScoreFn` interface

### 3. Type System (`src/index.ts`)

Defines the core types:
- `TaskResult`: Flexible output format supporting plain strings or structured results
- `ScoreFn`: Function signature for evaluation logic
- `Score`: Standardized scoring output
- `ToolCall`: Comprehensive tool call structure supporting multiple providers

### 4. Custom Reporter (`src/reporter.ts`)

A Vitest reporter that:
- Displays evaluation scores alongside test results
- Provides visual feedback for score ranges
- Integrates seamlessly with Vitest's output

## Data Flow

1. **Test Execution**: Vitest runs evaluation tests via `describeEval()` or the `toEval` matcher
2. **Task Execution**: The task function processes the input and returns output (a string or a `TaskResult`)
3. **Scorer Application**: Each scorer evaluates the output against the test data
4. **Score Aggregation**: Multiple scorer results are averaged
5. **Threshold Check**: The average score is compared against the threshold
6. **Reporting**: The custom reporter displays results with scores and metadata
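Steps 2 through 5 can be condensed into a small sketch. This is an illustration of the flow, not the framework's actual implementation; `runEvalCase` and the simplified types are hypothetical, and null scores are counted as 0 here as a stated assumption.

```typescript
type Score = { score: number | null };
type ScoreFn = (opts: { input: string; output: string; expected?: string }) => Score | Promise<Score>;

// Hypothetical single-case runner mirroring the data flow above.
async function runEvalCase(
  input: string,
  expected: string,
  task: (input: string) => Promise<string>,
  scorers: ScoreFn[],
  threshold: number,
): Promise<{ avgScore: number; passed: boolean }> {
  const output = await task(input); // 2. task execution
  const scores = await Promise.all(
    scorers.map((s) => s({ input, output, expected })), // 3. scorer application
  );
  const numeric = scores.map((s) => s.score ?? 0); // assumption: null counts as 0
  const avgScore = numeric.reduce((a, b) => a + b, 0) / numeric.length; // 4. aggregation
  return { avgScore, passed: avgScore >= threshold }; // 5. threshold check
}
```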
## Extension Points

### Adding New Scorers

1. Create a scorer file in `src/scorers/` using camelCase naming
2. Implement the `ScoreFn` interface
3. Export from `src/scorers/index.ts` and the main index
4. Write comprehensive tests in `src/scorers/[name].test.ts`

### Integration with AI SDKs

The framework is designed to work with various AI SDKs:
- Vercel AI SDK (see `ai-sdk-integration.test.ts`)
- OpenAI SDK
- Anthropic SDK
- Any system producing text/structured output

## Design Principles

1. **Flexibility**: Support multiple output formats and scoring approaches
2. **Type Safety**: Full TypeScript support with strict typing
3. **Testability**: Scorers themselves are easily testable
4. **Compatibility**: Works with the existing Vitest ecosystem
5. **Extensibility**: Easy to add new scorers and integrations

## Performance Considerations

- Scorers can be sync or async to handle API calls
- Batch evaluation support for efficiency
- Minimal overhead on top of Vitest
- Lazy loading of scorers when needed

## Error Handling

- Graceful degradation when scorers fail
- Clear error messages for debugging
- Type-safe error boundaries
- Preserves Vitest error reporting
