Commit 8e30078 (authored by dcramer, with Claude and Copilot)

docs: add comprehensive project documentation and development guidelines (#21)

## Summary

- Added comprehensive CLAUDE.md development guidelines with mandatory pre-development reading requirements
- Created detailed documentation covering architecture, testing, development workflow, and scorer examples
- Established clear code standards, validation checklists, and common implementation patterns

## Test plan

- [x] Documentation files are properly formatted
- [x] All links and references are accurate
- [x] Code examples follow established patterns
- [x] Guidelines align with current codebase structure

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: Claude <[email protected]>
Co-authored-by: Copilot <[email protected]>

1 parent a9b1873, commit 8e30078. 8 files changed: 1731 additions & 3 deletions.

CLAUDE.md (205 additions, 0 deletions)
# vitest-evals Development Guidelines

## 🔴 CRITICAL: Pre-Development Requirements

**MANDATORY READING before ANY code changes:**
- Read existing code structure in `src/` to understand patterns
- Check `src/index.ts` for core framework architecture
- Review test files for testing patterns
- All scorer implementations MUST follow established patterns

### Required Documentation Review
- **Architecture**: MUST read `docs/architecture.md` for system design
- **Testing**: MUST read `docs/testing.md` for test requirements
- **Development**: MUST read `docs/development-guide.md` for workflow
- **Examples**: Check `docs/scorer-examples.md` for implementation patterns

## Repository Overview

vitest-evals is a Vitest-based evaluation framework for testing language model outputs with flexible scoring functions. It provides a structured way to evaluate AI model outputs against expected results.

## Repository Structure

```
vitest-evals/
├── src/
│   ├── index.ts                         # Main entry point, types, and core framework
│   ├── reporter.ts                      # Custom Vitest reporter
│   ├── scorers/                         # Scorer implementations
│   │   ├── index.ts                     # Scorers export file
│   │   ├── toolCallScorer.ts            # Tool call evaluation scorer
│   │   └── toolCallScorer.test.ts       # Tool call scorer tests
│   ├── ai-sdk-integration.test.ts       # AI SDK integration example
│   ├── autoevals-compatibility.test.ts  # Autoevals compatibility tests
│   ├── formatScores.test.ts             # Format scores tests
│   └── wrapText.test.ts                 # Wrap text tests
├── docs/                                # Project documentation
│   ├── architecture.md                  # System architecture overview
│   ├── testing.md                       # Testing standards and requirements
│   ├── development-guide.md             # Development workflow and tips
│   ├── scorer-examples.md               # Example scorer implementations
│   ├── custom-scorers.md                # Custom scorer examples
│   └── provider-transformations.md      # Provider tool call transformations
├── scripts/
│   └── craft-pre-release.sh             # Release preparation script
├── tsup.config.ts                       # Build configuration
├── tsconfig.json                        # TypeScript configuration
├── biome.json                           # Code formatter/linter config
└── package.json                         # Project dependencies and scripts
```
## Core Components Impact Analysis

When making changes, consider these areas:

### Framework Core (`src/index.ts`)
- **describeEval()** function: Main evaluation entry point for test suites
- **toEval** matcher: Vitest matcher for individual evaluations
- **TaskResult** handling: Supports a plain string or `{ result, toolCalls }`
- **ScoreFn** interface: All scorers must implement this
- **Async/sync support**: Both scorer types are supported
- **Type definitions**: All TypeScript interfaces are defined here

### Custom Reporter (`src/reporter.ts`)
- **Score display**: Shows evaluation results
- **Error reporting**: Handles test failures
- **Progress tracking**: Visual feedback during tests

### Scorer System (`src/scorers/`)
- **ToolCallScorer**: Evaluates tool/function call accuracy (currently the only built-in scorer)
- **Flexible parameters**: Supports various parameter names
- **Type safety**: Full TypeScript support
- **Autoevals compatibility**: Works with existing autoevals scorers

## 🔴 CRITICAL: Code Standards

### TypeScript Requirements
- **Strict mode**: All code must pass strict TypeScript checks
- **Explicit types**: No implicit `any` types
- **Interface-driven**: Define interfaces before implementation

### Testing Requirements
- **All scorers MUST have tests**: No exceptions
- **Test edge cases**: Error conditions, async behavior
- **Integration tests**: Test with actual AI outputs
- **Run tests**: `pnpm test` must pass before completion

### Code Quality
- **Lint check**: `pnpm run lint` must pass
- **Type check**: `pnpm run typecheck` must pass
- **Format**: `pnpm run format` for consistent style

## Key Commands

```bash
# Development
pnpm test            # Run all tests
pnpm run build       # Build the package
pnpm run lint        # Check code style
pnpm run format      # Auto-format code
pnpm run typecheck   # Verify TypeScript types

# Before completing ANY task
pnpm run lint && pnpm run typecheck && pnpm test
```

## Architecture Patterns

### Scorer Implementation
- Implement the `ScoreFn` interface
- Support both sync and async evaluation
- Handle errors gracefully
- Return normalized scores (typically in the 0-1 range)
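The bullets above can be sketched as a small scorer. This is an illustrative example, not a built-in: `exactMatchScorer` and the `ExactMatchOptions` interface are hypothetical names, and the `Score`/`BaseScorerOptions` shapes are repeated locally (simplified from the framework's types) so the snippet stands alone.

```typescript
// Simplified local copies of the framework's shapes, so this sketch is self-contained.
type Score = {
  score: number | null;
  metadata?: { rationale?: string };
};

interface BaseScorerOptions {
  input: string;
  output: string;
}

// Hypothetical options interface extending BaseScorerOptions.
interface ExactMatchOptions extends BaseScorerOptions {
  expected: string;
}

// Returns a normalized 0-1 score and handles errors gracefully:
// failures surface as a null score with a rationale instead of throwing.
const exactMatchScorer = (opts: ExactMatchOptions): Score => {
  try {
    const matches = opts.output.trim() === opts.expected.trim();
    return {
      score: matches ? 1 : 0,
      metadata: { rationale: matches ? "exact match" : "output differs from expected" },
    };
  } catch (err) {
    return { score: null, metadata: { rationale: `scorer error: ${String(err)}` } };
  }
};
```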
### TaskResult Handling

```typescript
type TaskResult = string | { result: string; toolCalls?: any[] }
```

- Always handle both formats
- Extract the result string appropriately
- Pass tool calls to specialized scorers
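One way to handle both formats is to normalize early. The helper name `normalizeTaskResult` is ours, not part of the framework API, and the `ToolCall` shape here is a simplified stand-in for the framework's richer type.

```typescript
// Simplified stand-in for the framework's ToolCall type.
type ToolCall = { name: string; arguments?: Record<string, any> };

type TaskResult = string | { result: string; toolCalls?: ToolCall[] };

// Collapse both TaskResult formats into one shape so downstream
// code (including tool-call scorers) never has to branch again.
function normalizeTaskResult(raw: TaskResult): { result: string; toolCalls: ToolCall[] } {
  if (typeof raw === "string") {
    return { result: raw, toolCalls: [] };
  }
  return { result: raw.result, toolCalls: raw.toolCalls ?? [] };
}
```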
### Parameter Flexibility
- Support multiple parameter names (e.g., `expected` or `expectedTools`)
- Use TypeScript generics for type safety
- Document parameter requirements
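A minimal sketch of the aliasing idea, assuming a scorer that wants a list of tool names: both parameter names are accepted, with the more specific one taking precedence. The option names here mirror the example in the bullet above; `resolveExpectedTools` is a hypothetical helper.

```typescript
// Options type accepting either parameter name for the same data.
interface FlexibleToolOptions {
  input: string;
  output: string;
  expected?: string[];       // generic name, autoevals-style
  expectedTools?: string[];  // specific name
}

// Prefer the specific parameter, fall back to the generic alias.
function resolveExpectedTools(opts: FlexibleToolOptions): string[] {
  return opts.expectedTools ?? opts.expected ?? [];
}
```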
## Documentation Requirements

### Code Documentation
- Document all public APIs with JSDoc
- Include usage examples in comments
- Explain complex logic inline

### README Updates
- Keep examples current with API changes
- Document new scorers when added
- Update compatibility notes

## Current Features

### Implemented
- Core evaluation framework
- Custom Vitest reporter
- ToolCallScorer for function-call evaluation
- Autoevals library compatibility
- Flexible parameter naming

### In Progress
- Additional scorer implementations
- Enhanced error handling
- Performance optimizations

## Documentation Maintenance

**CRITICAL**: Documentation must be kept up to date with code changes.
- Update relevant docs when modifying code
- Add examples when creating new scorers
- Document breaking changes prominently
- Keep CLAUDE.md synchronized with the project state

## Development Process

1. **Review existing code** to understand patterns
2. **Write tests first** for new features
3. **Implement** following established patterns
4. **Verify** with lint, typecheck, and tests
5. **Document** changes in code and README

## Common Patterns

### Creating a New Scorer
1. Define the scorer options interface extending `BaseScorerOptions` in `src/index.ts`
2. Implement in `src/scorers/[name].ts` following camelCase naming
3. Write comprehensive tests in `src/scorers/[name].test.ts`
4. Export from `src/scorers/index.ts` and the main index
5. Document usage in the README
### Testing Scorers

```typescript
import { expect, test } from 'vitest'
import { YourScorer } from '../src/scorers/yourScorer'

test('scorer evaluates correctly', async () => {
  await expect('test input').toEval(
    'expected output',
    async (input) => 'test output',
    YourScorer,
    1.0
  )
})
```

## Package Manager

This project uses **pnpm** (not npm). Always use pnpm commands.

## Validation Checklist

Before marking ANY task complete:
- [ ] Code passes `pnpm run lint`
- [ ] Code passes `pnpm run typecheck`
- [ ] All tests pass with `pnpm test`
- [ ] New features have tests
- [ ] Documentation is updated
- [ ] Examples work correctly

docs/architecture.md (126 additions, 0 deletions)

# vitest-evals Architecture

## Overview

vitest-evals is built on top of Vitest to provide a specialized testing framework for evaluating AI/LLM outputs. It extends Vitest's capabilities with custom scoring functions and reporting.
## Core Components

### 1. Evaluation Framework (`src/index.ts`)

The heart of the system provides two main APIs:

**describeEval()** - Creates test suites for batch evaluation:
- Accepts a data function, a task function, and scorers
- Runs multiple test cases automatically
- Integrates with Vitest's test runner

**toEval matcher** - Individual evaluation within tests:
- Extends Vitest's expect API
- Evaluates single input/output pairs
- Returns pass/fail based on a score threshold

```typescript
export function describeEval(
  name: string,
  options: {
    data: () => Promise<Array<{ input: string } & Record<string, any>>>;
    task: TaskFn;
    scorers: ScoreFn<any>[];
    threshold?: number;
  }
)
```
### 2. Scorer System

Scorers are the pluggable evaluation functions that determine output quality.

#### Scorer Interface

```typescript
type ScoreFn<TOptions extends BaseScorerOptions = BaseScorerOptions> = (
  opts: TOptions,
) => Promise<Score> | Score;

interface BaseScorerOptions {
  input: string;
  output: string;
  toolCalls?: ToolCall[];
}

type Score = {
  score: number | null;
  metadata?: {
    rationale?: string;
    output?: string;
  };
};
```
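An async scorer conforming to this shape might look like the sketch below. The types are repeated locally (with `toolCalls` and `metadata.output` omitted for brevity) so the snippet stands alone; `lengthRatioScorer` is an illustrative example, not a built-in.

```typescript
// Local, simplified copies of the interface above for a self-contained sketch.
interface BaseScorerOptions {
  input: string;
  output: string;
}

type Score = {
  score: number | null;
  metadata?: { rationale?: string };
};

interface LengthRatioOptions extends BaseScorerOptions {
  expected: string;
}

// Async ScoreFn returning a 0-1 score: how close the output length
// is to the expected length. A null score signals "not scorable".
const lengthRatioScorer = async (opts: LengthRatioOptions): Promise<Score> => {
  if (opts.expected.length === 0) {
    return { score: null, metadata: { rationale: "no expected text to compare against" } };
  }
  const ratio =
    Math.min(opts.output.length, opts.expected.length) /
    Math.max(opts.output.length, opts.expected.length);
  return { score: ratio, metadata: { rationale: `length ratio ${ratio.toFixed(2)}` } };
};
```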
#### Built-in Scorers
- **ToolCallScorer** (`src/scorers/toolCallScorer.ts`): Evaluates function/tool call accuracy
- More scorers can be added by implementing the `ScoreFn` interface

### 3. Type System (`src/index.ts`)

Defines the core types:
- `TaskResult`: Flexible output format supporting plain strings or structured results
- `ScoreFn`: Function signature for evaluation logic
- `Score`: Standardized scoring output
- `ToolCall`: Comprehensive tool call structure supporting multiple providers

### 4. Custom Reporter (`src/reporter.ts`)

A Vitest reporter that:
- Displays evaluation scores alongside test results
- Provides visual feedback for score ranges
- Integrates seamlessly with Vitest's output

## Data Flow

1. **Test Execution**: Vitest runs evaluation tests via `describeEval()` or the `toEval` matcher
2. **Task Execution**: The task function processes the input and returns output (a string or a `TaskResult`)
3. **Scorer Application**: Each scorer evaluates the output against the test data
4. **Score Aggregation**: Multiple scorer results are averaged
5. **Threshold Check**: The average score is compared against the threshold
6. **Reporting**: The custom reporter displays results with scores and metadata
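Steps 2 through 5 can be condensed into a small sketch. This is an illustration of the flow, not the framework's actual implementation; `runEvalCase` and the simplified types are hypothetical, and null scores are counted as 0 here as a stated assumption.

```typescript
type Score = { score: number | null };
type ScoreFn = (opts: { input: string; output: string; expected?: string }) => Score | Promise<Score>;

// Hypothetical single-case runner mirroring the data flow above.
async function runEvalCase(
  input: string,
  expected: string,
  task: (input: string) => Promise<string>,
  scorers: ScoreFn[],
  threshold: number,
): Promise<{ avgScore: number; passed: boolean }> {
  const output = await task(input); // 2. task execution
  const scores = await Promise.all(
    scorers.map((s) => s({ input, output, expected })), // 3. scorer application
  );
  const numeric = scores.map((s) => s.score ?? 0); // assumption: null counts as 0
  const avgScore = numeric.reduce((a, b) => a + b, 0) / numeric.length; // 4. aggregation
  return { avgScore, passed: avgScore >= threshold }; // 5. threshold check
}
```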
## Extension Points

### Adding New Scorers

1. Create a scorer file in `src/scorers/` using camelCase naming
2. Implement the `ScoreFn` interface
3. Export from `src/scorers/index.ts` and the main index
4. Write comprehensive tests in `src/scorers/[name].test.ts`

### Integration with AI SDKs

The framework is designed to work with various AI SDKs:
- Vercel AI SDK (see `ai-sdk-integration.test.ts`)
- OpenAI SDK
- Anthropic SDK
- Any system producing text/structured output

## Design Principles

1. **Flexibility**: Support multiple output formats and scoring approaches
2. **Type Safety**: Full TypeScript support with strict typing
3. **Testability**: Scorers themselves are easily testable
4. **Compatibility**: Works with the existing Vitest ecosystem
5. **Extensibility**: Easy to add new scorers and integrations

## Performance Considerations

- Scorers can be sync or async to handle API calls
- Batch evaluation support for efficiency
- Minimal overhead on top of Vitest
- Lazy loading of scorers when needed

## Error Handling

- Graceful degradation when scorers fail
- Clear error messages for debugging
- Type-safe error boundaries
- Preserves Vitest error reporting
