# vitest-evals Development Guidelines

## 🔴 CRITICAL: Pre-Development Requirements

**MANDATORY READING before ANY code changes:**
- Read existing code structure in `src/` to understand patterns
- Check `src/index.ts` for core framework architecture
- Review test files for testing patterns
- All scorer implementations MUST follow established patterns

### Required Documentation Review
- **Architecture**: MUST read `docs/architecture.md` for system design
- **Testing**: MUST read `docs/testing.md` for test requirements
- **Development**: MUST read `docs/development-guide.md` for workflow
- **Examples**: Check `docs/scorer-examples.md` for implementation patterns

## Repository Overview

vitest-evals is a Vitest-based evaluation framework for testing language model outputs with flexible scoring functions. It provides a structured way to evaluate AI model outputs against expected results.

## Repository Structure

```
vitest-evals/
├── src/
│   ├── index.ts                        # Main entry point, types, and core framework
│   ├── reporter.ts                     # Custom Vitest reporter
│   ├── scorers/                        # Scorer implementations
│   │   ├── index.ts                    # Scorers export file
│   │   ├── toolCallScorer.ts           # Tool call evaluation scorer
│   │   └── toolCallScorer.test.ts      # Tool call scorer tests
│   ├── ai-sdk-integration.test.ts      # AI SDK integration example
│   ├── autoevals-compatibility.test.ts # Autoevals compatibility tests
│   ├── formatScores.test.ts            # Format scores tests
│   └── wrapText.test.ts                # Wrap text tests
├── docs/                               # Project documentation
│   ├── architecture.md                 # System architecture overview
│   ├── testing.md                      # Testing standards and requirements
│   ├── development-guide.md            # Development workflow and tips
│   ├── scorer-examples.md              # Example scorer implementations
│   ├── custom-scorers.md               # Custom scorer examples
│   └── provider-transformations.md     # Provider tool call transformations
├── scripts/
│   └── craft-pre-release.sh            # Release preparation script
├── tsup.config.ts                      # Build configuration
├── tsconfig.json                       # TypeScript configuration
├── biome.json                          # Code formatter/linter config
└── package.json                        # Project dependencies and scripts
```

## Core Components Impact Analysis

When making changes, consider these areas:

### Framework Core (`src/index.ts`)
- **describeEval()** function: Main evaluation entry point for test suites
- **toEval** matcher: Vitest matcher for individual evaluations
- **TaskResult** handling: Supports a plain string or `{ result, toolCalls }`
- **ScoreFn** interface: All scorers must implement this
- **Async/sync support**: Both scorer types are supported
- **Type definitions**: All TypeScript interfaces are defined here
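
Conceptually, these pieces wire a data source through a task and scorers, then compare the aggregate score against a threshold. The sketch below is self-contained and illustrative only — the types, names, and aggregation are assumptions for exposition, not the library's actual implementation (see `src/index.ts` for the real types):

```typescript
// Illustrative sketch of the evaluation flow, NOT the library's actual code.
type EvalCase = { input: string; expected: string };
type Scorer = (args: { input: string; expected: string; output: string }) => Promise<{ score: number }>;

async function runEval(
  data: () => Promise<EvalCase[]>,
  task: (input: string) => Promise<string>,
  scorers: Scorer[],
  threshold: number,
): Promise<boolean> {
  const cases = await data();
  const scores: number[] = [];
  for (const { input, expected } of cases) {
    const output = await task(input);
    for (const scorer of scorers) {
      scores.push((await scorer({ input, expected, output })).score);
    }
  }
  // The suite passes when the average score meets the threshold.
  const avg = scores.length > 0 ? scores.reduce((a, b) => a + b, 0) / scores.length : 0;
  return avg >= threshold;
}
```

This mirrors the data → task → scorer → threshold flow described above; the real `describeEval()` additionally registers Vitest test cases and reports per-case scores.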

### Custom Reporter (`src/reporter.ts`)
- **Score display**: Shows evaluation results
- **Error reporting**: Handles test failures
- **Progress tracking**: Visual feedback during tests

### Scorer System (`src/scorers/`)
- **ToolCallScorer**: Evaluates tool/function call accuracy (the only built-in scorer)
- **Flexible parameters**: Supports various parameter names
- **Type safety**: Full TypeScript support
- **Autoevals compatibility**: Works with existing autoevals scorers
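
To make the idea concrete, here is a minimal, self-contained sketch of tool-call matching — the fraction of expected tool names that actually appear among the calls. This is an assumption-laden illustration, not the actual `ToolCallScorer` logic (see `src/scorers/toolCallScorer.ts` for the real implementation):

```typescript
// Illustrative tool-call matching, NOT the real ToolCallScorer.
interface ToolCall {
  name: string;
  arguments?: Record<string, unknown>;
}

// Score = fraction of expected tool names found among the actual calls.
function scoreToolCalls(expected: string[], actual: ToolCall[]): number {
  if (expected.length === 0) return 1; // nothing expected → trivially satisfied
  const called = new Set(actual.map((call) => call.name));
  const matched = expected.filter((name) => called.has(name)).length;
  return matched / expected.length;
}
```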

## 🔴 CRITICAL: Code Standards

### TypeScript Requirements
- **Strict mode**: All code must pass strict TypeScript checks
- **Explicit types**: No implicit `any` types
- **Interface-driven**: Define interfaces before implementation

### Testing Requirements
- **All scorers MUST have tests**: No exceptions
- **Test edge cases**: Error conditions, async behavior
- **Integration tests**: Test with actual AI outputs
- **Run tests**: `pnpm test` must pass before completion

### Code Quality
- **Lint check**: `pnpm run lint` must pass
- **Type check**: `pnpm run typecheck` must pass
- **Format**: `pnpm run format` for consistent style

## Key Commands

```bash
# Development
pnpm test               # Run all tests
pnpm run build          # Build the package
pnpm run lint           # Check code style
pnpm run format         # Auto-format code
pnpm run typecheck      # Verify TypeScript types

# Before completing ANY task
pnpm run lint && pnpm run typecheck && pnpm test
```

## Architecture Patterns

### Scorer Implementation
- Implement the `ScoreFn` interface
- Support both sync and async evaluation
- Handle errors gracefully
- Return normalized scores (the 0-1 range is typical)
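
A hypothetical scorer following these guidelines might look like the sketch below. The argument and return shapes here are local stand-ins for illustration; the real `ScoreFn` interface is defined in `src/index.ts` and should be used instead:

```typescript
// Hypothetical scorer sketch; real projects should implement ScoreFn from src/index.ts.
interface ScorerArgs {
  output: string;
  expected: string;
}

interface Score {
  score: number; // normalized to the 0-1 range
  metadata?: { rationale?: string };
}

async function CaseInsensitiveMatch({ output, expected }: ScorerArgs): Promise<Score> {
  try {
    const match = output.trim().toLowerCase() === expected.trim().toLowerCase();
    return { score: match ? 1 : 0 };
  } catch (err) {
    // Handle errors gracefully: report a zero score rather than throwing.
    return { score: 0, metadata: { rationale: String(err) } };
  }
}
```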

### TaskResult Handling
```typescript
type TaskResult = string | { result: string; toolCalls?: any[] }
```
- Always handle both formats
- Extract the result string appropriately
- Pass tool calls to specialized scorers

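The bullets above can be captured in a small normalizing helper. The helper name is illustrative (not part of the library API), but the union type it handles matches the `TaskResult` definition:

```typescript
// Hypothetical helper that normalizes both TaskResult shapes.
type TaskResult = string | { result: string; toolCalls?: unknown[] };

function normalizeTaskResult(task: TaskResult): { result: string; toolCalls: unknown[] } {
  if (typeof task === 'string') {
    // Plain-string form: no tool calls to pass along.
    return { result: task, toolCalls: [] };
  }
  // Object form: extract the result string and forward any tool calls.
  return { result: task.result, toolCalls: task.toolCalls ?? [] };
}
```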
### Parameter Flexibility
- Support multiple parameter names (e.g., `expected` or `expectedTools`)
- Use TypeScript generics for type safety
- Document parameter requirements

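The name-aliasing part of this can be sketched as follows. The option names come from the example above; the resolver function and its precedence rule are illustrative assumptions, not the library's API:

```typescript
// Hypothetical options type accepting either parameter name.
interface FlexibleOptions {
  expected?: string[];
  expectedTools?: string[];
}

// Document the precedence: prefer `expectedTools` when both are given.
function resolveExpectedTools(options: FlexibleOptions): string[] {
  return options.expectedTools ?? options.expected ?? [];
}
```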
## Documentation Requirements

### Code Documentation
- Document all public APIs with JSDoc
- Include usage examples in comments
- Explain complex logic inline

### README Updates
- Keep examples current with API changes
- Document new scorers when added
- Update compatibility notes

## Current Features

### Implemented
- Core evaluation framework
- Custom Vitest reporter
- ToolCallScorer for tool/function call evaluation
- Autoevals library compatibility
- Flexible parameter naming

### In Progress
- Additional scorer implementations
- Enhanced error handling
- Performance optimizations

## Documentation Maintenance

**CRITICAL**: Documentation must be kept up-to-date with code changes
- Update relevant docs when modifying code
- Add examples when creating new scorers
- Document breaking changes prominently
- Keep CLAUDE.md synchronized with project state

## Development Process

1. **Review existing code** to understand patterns
2. **Write tests first** for new features
3. **Implement** following established patterns
4. **Verify** with lint, typecheck, and tests
5. **Document** changes in code and README

## Common Patterns

### Creating a New Scorer
1. Define the scorer interface extending `BaseScorerOptions` in `src/index.ts`
2. Implement in `src/scorers/[name].ts`, following camelCase naming
3. Write comprehensive tests in `src/scorers/[name].test.ts`
4. Export from `src/scorers/index.ts` and the main index
5. Document usage in the README

### Testing Scorers
```typescript
import { test, expect } from 'vitest'
import { YourScorer } from '../src/scorers/yourScorer'

test('scorer evaluates correctly', async () => {
  await expect('test input').toEval(
    'expected output',
    async (input) => 'test output',
    YourScorer,
    1.0
  )
})
```

## Package Manager

This project uses **pnpm** (not npm). Always use pnpm commands.

## Validation Checklist

Before marking ANY task complete:
- [ ] Code passes `pnpm run lint`
- [ ] Code passes `pnpm run typecheck`
- [ ] All tests pass with `pnpm test`
- [ ] New features have tests
- [ ] Documentation is updated
- [ ] Examples work correctly