A production-ready backend for prototyping and evaluating agentic AI workflows. It is built with Next.js Route Handlers and the OpenAI SDK, and it includes deterministic tests so evaluations run reproducibly without production authentication.
This backend provides battle-tested infrastructure for:
- Agentic Workflows - Multi-tool orchestration with streaming responses
- JSON Extraction - Structured output with schema enforcement
- Embeddings - Vector generation with built-in retry logic
- RAG Pipelines - Context gathering and synthesis patterns
- Evaluation Framework - Deterministic testing and performance metrics
- ✅ Production-Ready Routes - Drop into any Next.js/Remix/Express app
- ✅ Streaming SSE Support - Real-time agent execution traces
- ✅ Zero-Cost Testing - Mocked tests run without OpenAI credits
- ✅ Deterministic Evaluation - Reproducible results for CI/CD
- ✅ Multi-User Simulation - Test different user contexts easily
- Node.js 18.17+ (for native fetch/streams support)
- npm or yarn
- OpenAI API key (optional for live testing)
- Clone and install
git clone https://github.com/HomenShum/openai-agent-eval-framework openai-eval-backend
cd openai-eval-backend
npm install
- Configure environment
# Copy template
cp .env.example .env
# Add your OpenAI key (optional - only for live tests)
# Mocked tests work without this
echo "OPENAI_API_KEY=sk-your-key" >> .env
- Verify installation
# Run mocked tests (no API key needed)
npm test
# Run with live API (if key is set)
export OPENAI_API_KEY=sk-your-key   # PowerShell: $env:OPENAI_API_KEY="sk-your-key"
npm test

openai-eval-backend/
├── openai/                 # Core route handlers
│   ├── agent/              # Multi-tool agent orchestration
│   ├── ask/                # Simple chat completions
│   ├── ask-json/           # Structured JSON outputs
│   ├── ask-mode/           # Preset response modes
│   ├── embeddings/         # Vector embeddings
│   ├── gather-context/     # RAG context collection
│   ├── organization-mode/  # Hierarchical note generation
│   └── __tests__/          # Test suites and evaluations
├── Interview/              # Demo materials
│   ├── sse_samples/        # Captured SSE traces
│   ├── dataset/            # Sample evaluation data
│   └── *.md                # Documentation
├── lib/                    # Shared utilities
│   └── eval/               # Evaluation framework
├── scripts/                # CLI utilities
└── test/__mocks__/         # Test mocks
Drop any route directly into your Next.js app:
// app/api/llm/openai/ask/route.ts
export { POST } from "openai/ask/route";
// app/api/llm/openai/agent/route.ts
export { POST } from "openai/agent/route";

| Endpoint | Purpose | Best For |
|---|---|---|
| `/agent` | Multi-step workflows with tools | Complex tasks, automation |
| `/ask` | Simple text generation | Chat, Q&A |
| `/ask-json` | Structured JSON output | Forms, data extraction |
| `/ask-mode` | Preset response styles | Drafting, critiques |
| `/embeddings` | Vector generation | Semantic search, RAG |
| `/gather-context` | Context ranking | Knowledge retrieval |
| `/organization-mode` | Hierarchical outlines | Note-taking, documentation |

// Request
const response = await fetch('/api/llm/openai/agent', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'x-test-user-id': 'user-123' // Optional user context
},
body: JSON.stringify({
messages: [
{ role: 'user', content: 'Create a project plan for AGI Safety' }
]
})
});
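// Stream SSE responses
if (!response.ok || !response.body) {
  throw new Error(`Agent request failed with status ${response.status}`);
}
const reader = response.body.getReader();
const decoder = new TextDecoder();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // Chunks are raw SSE text; an event may span chunks, so buffer before parsing
  console.log(decoder.decode(value, { stream: true }));
}

The test suite selects its mode from the environment:
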
| Mode | Condition | Use Case |
|---|---|---|
| Mocked | No API key | Fast, deterministic, CI/CD |
| Live | API key present | Real model behavior testing |
| Hybrid | Mixed config | Development iteration |
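
The mocked/live split in the table can be pictured as a small environment check. This is a sketch under assumptions, not the repository's actual detection code: `resolveTestMode` is a hypothetical name, and the only rule taken from the table is that a non-empty `OPENAI_API_KEY` selects live mode. Hybrid mode is left out because the table does not say what a mixed configuration consists of.

```typescript
type TestMode = "mocked" | "live";

// Hypothetical helper (not from the repository): picks a mode from an
// env-like record so it can be exercised without touching process.env.
function resolveTestMode(env: Record<string, string | undefined>): TestMode {
  const key = env.OPENAI_API_KEY?.trim();
  // An unset or blank key falls back to the mocked mode.
  return key ? "live" : "mocked";
}

console.log(resolveTestMode({}));                                // "mocked"
console.log(resolveTestMode({ OPENAI_API_KEY: "sk-your-key" })); // "live"
```

In a Jest suite, the same check is commonly used to gate live cases, for example `const describeLive = process.env.OPENAI_API_KEY ? describe : describe.skip;`.
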
# All tests (auto-detects mode)
npm test
# Specific test file
npx jest openai/agent/e2e.agent.test.ts
# Pattern matching
npx jest --testNamePattern="E2E: Agent"
# Live tests only (requires API key)
OPENAI_API_KEY=sk-your-key npx jest openai/__tests__/live.*.test.ts

# Capture deterministic SSE trace
npm run -s demo:capture-sse
# Live SSE capture (requires API key)
npm run -s demo:capture-sse:live
# Run mini evaluation suite
npm run -s demo:eval:compact

- Suites: 18 passed, 1 skipped (19 total)
- Tests: 55 passed, 16 skipped (71 total)
- Skips: env-gated live tests (require OPENAI_API_KEY)

Key signals

- Mini-eval macro P/R/F1
  - Banking: P 0.80, R 1.00, F1 0.867
  - AI Research: P 0.767, R 0.933, F1 0.814
- Routes passed: ask, ask-json, embeddings, agent, gather-context, organization-mode
- Live smokes ran successfully where applicable
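
For readers unfamiliar with macro-averaged scores, here is a minimal sketch of how such numbers are typically computed. It assumes set-based matching of expected versus predicted tags and an unweighted mean across test cases; the function names and sample data are hypothetical, and the repository's own scorer may differ. The reported F1 values are not the harmonic mean of the reported macro P and R (that would give about 0.889 and 0.842), which is consistent with averaging per-case F1, although the source does not state this.

```typescript
interface Scores {
  precision: number;
  recall: number;
  f1: number;
}

// Remove duplicates without relying on Set, so the sketch runs on any target.
function unique(items: string[]): string[] {
  return items.filter((item, index) => items.indexOf(item) === index);
}

// Precision, recall, and F1 for a single test case.
function scoreCase(expected: string[], predicted: string[]): Scores {
  const exp = unique(expected);
  const pred = unique(predicted);
  const hits = pred.filter((tag) => exp.indexOf(tag) !== -1).length;
  const precision = pred.length === 0 ? 0 : hits / pred.length;
  const recall = exp.length === 0 ? 0 : hits / exp.length;
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

// Macro average: the unweighted mean of each metric across cases.
function macroAverage(cases: Scores[]): Scores {
  const n = cases.length || 1;
  const mean = (pick: (s: Scores) => number) =>
    cases.reduce((acc, s) => acc + pick(s), 0) / n;
  return {
    precision: mean((s) => s.precision),
    recall: mean((s) => s.recall),
    f1: mean((s) => s.f1),
  };
}

// Hypothetical data: one exact match, one over-predicting case.
const macro = macroAverage([
  scoreCase(["analysis", "competitor", "Stripe"], ["analysis", "competitor", "Stripe"]),
  scoreCase(["analysis"], ["analysis", "report", "search", "plan", "risk"]),
]);

// Prints: P 0.600 R 1.000 F1 0.667
console.log(
  `P ${macro.precision.toFixed(3)} R ${macro.recall.toFixed(3)} F1 ${macro.f1.toFixed(3)}`
);
```

In this example the harmonic mean of macro P and macro R would be 0.750, while the macro F1 is 0.667. The two only coincide when every case has the same precision and recall.
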
See also
- Full write-up: TEST_MASTER_SUMMARY.md
- Latest SSE sample (preflight): Interview/sse_samples/ai_research.preflight.sse.log
For simplicity, this public version uses a deterministic test-user system:
// Default behavior
userId = "test-user"
// Override via header
headers: { 'x-test-user-id': 'custom-user-123' }
// Override via environment
PUBLIC_TEST_USER_ID=demo-user-456

Production Note: Replace testUser.ts with your actual auth middleware before deploying.
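
The three sources of a user ID described above could be combined roughly as follows. This is a sketch, not the contents of `testUser.ts`: the function name is hypothetical, and the precedence (header first, then environment, then the default) is an assumption, since the README lists the overrides without ranking them.

```typescript
const DEFAULT_TEST_USER_ID = "test-user";

// Minimal shape of a header lookup; a Fetch API Headers object satisfies it.
interface HeaderReader {
  get(name: string): string | null | undefined;
}

// Hypothetical resolver. Assumed precedence: header > env > default.
function resolveTestUserId(
  headers: HeaderReader,
  env: Record<string, string | undefined>
): string {
  const fromHeader = headers.get("x-test-user-id")?.trim();
  if (fromHeader) return fromHeader;
  const fromEnv = env.PUBLIC_TEST_USER_ID?.trim();
  if (fromEnv) return fromEnv;
  return DEFAULT_TEST_USER_ID;
}

// Stand-ins for an incoming request with and without the override header.
const withHeader: HeaderReader = {
  get: (name) => (name === "x-test-user-id" ? "custom-user-123" : null),
};
const withoutHeader: HeaderReader = { get: () => null };

console.log(resolveTestUserId(withHeader, { PUBLIC_TEST_USER_ID: "demo-user-456" }));    // "custom-user-123"
console.log(resolveTestUserId(withoutHeader, { PUBLIC_TEST_USER_ID: "demo-user-456" })); // "demo-user-456"
console.log(resolveTestUserId(withoutHeader, {}));                                       // "test-user"
```

Keeping the lookup in one function like this leaves a single place to swap in real authentication later.
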
// Define test cases
const testCases = [
{
input: "Analyze competitor Stripe",
expectedTags: ["analysis", "competitor", "Stripe"],
expectedActions: ["search", "analyze", "report"]
}
];
// Run evaluation
npm run -s demo:eval:compact
// Output
// Precision: 0.95
// Recall: 0.88
// F1 Score: 0.91

Captured traces show complete agent execution:
{"type":"tool_call","name":"search","args":{"query":"AGI safety"}}
{"type":"tool_result","data":{"hits":15,"top_result":"..."}}
{"type":"reasoning","text":"Found relevant papers, organizing..."}
{"type":"final","summary":"Created 3 notes with 15 citations"}

# Required for live tests (optional for mocked)
OPENAI_API_KEY=sk-your-key
# Optional configurations
OPENAI_BASE_URL=https://api.openai.com/v1 # Custom endpoint
PUBLIC_TEST_USER_ID=demo-user-123 # Default user ID
OPENAI_MODEL=gpt-5-mini # Model selection

- Streaming: All endpoints support SSE for real-time feedback
- Retry Logic: Built-in exponential backoff for rate limits
- Batching: Embeddings endpoint handles up to 100 texts per request
- Caching: Deterministic test harness enables result caching
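
The retry and batching notes above can be illustrated with a short sketch. Everything here is an assumption for illustration: the helper names, the 500 ms base delay, the 8 s cap, the retry count, and the choice to retry only on HTTP 429 are not taken from the repository. Only the batch size of 100 comes from the list above. Jitter is omitted for brevity.

```typescript
// Split items into batches of at most `size` elements.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error("size must be >= 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 8000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Run `fn`, retrying retryable failures up to `maxRetries` additional times.
async function withBackoff<T>(
  fn: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  maxRetries = 3
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries || !isRetryable(err)) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}

// Assumed rate-limit check: an error object carrying status 429.
const isRateLimit = (err: unknown) => (err as { status?: number } | null)?.status === 429;

// `embedBatch` is a hypothetical caller-supplied function; batches run sequentially.
async function embedAll(
  texts: string[],
  embedBatch: (batch: string[]) => Promise<number[][]>
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of chunk(texts, 100)) {
    const result = await withBackoff(() => embedBatch(batch), isRateLimit);
    for (const vector of result) vectors.push(vector);
  }
  return vectors;
}

// Delays for the first four retries: 500, 1000, 2000, 4000 ms.
console.log([0, 1, 2, 3].map((attempt) => backoffDelayMs(attempt)));
```

Sending batches one at a time keeps the sketch simple. A real client might run a few batches concurrently, at the cost of hitting rate limits sooner.
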
- Start Here: `Interview/TEST_MASTER_SUMMARY.md` - Overview of test patterns
- Deep Dive: `Interview/INTERVIEW_GUIDE.md` - Complete walkthrough with examples
- SSE Samples: `Interview/sse_samples/` - Real execution traces
- Datasets: `Interview/dataset/` - Sample evaluation data
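
When reading the captured traces, it can help to parse them into typed events. The sketch below assumes each event is one JSON object per line with a `type` of `tool_call`, `tool_result`, `reasoning`, or `final`, as in the trace excerpt in this README. The optional `data: ` prefix is an assumption about SSE framing in the log files, and `parseTrace` is a hypothetical helper, not part of the repository.

```typescript
type TraceEvent =
  | { type: "tool_call"; name: string; args: Record<string, unknown> }
  | { type: "tool_result"; data: unknown }
  | { type: "reasoning"; text: string }
  | { type: "final"; summary: string };

const KNOWN_TYPES = ["tool_call", "tool_result", "reasoning", "final"];

// Parse a trace log into events, skipping blank, malformed, or unknown lines.
function parseTrace(log: string): TraceEvent[] {
  const events: TraceEvent[] = [];
  for (const rawLine of log.split(/\r?\n/)) {
    // Accept both bare JSON lines and SSE-style "data: {...}" lines.
    const line = rawLine.replace(/^data:\s*/, "").trim();
    if (line.charAt(0) !== "{") continue;
    try {
      const parsed = JSON.parse(line);
      if (KNOWN_TYPES.indexOf(parsed?.type) !== -1) events.push(parsed as TraceEvent);
    } catch {
      // Ignore lines that are not valid JSON.
    }
  }
  return events;
}

// Sample built from the trace excerpt in this README.
const sampleLog = [
  'data: {"type":"tool_call","name":"search","args":{"query":"AGI safety"}}',
  'data: {"type":"tool_result","data":{"hits":15,"top_result":"..."}}',
  "",
  '{"type":"reasoning","text":"Found relevant papers, organizing..."}',
  '{"type":"final","summary":"Created 3 notes with 15 citations"}',
].join("\n");

const events = parseTrace(sampleLog);

// Tally events by type: one of each for the sample above.
const counts: Record<string, number> = {};
for (const event of events) counts[event.type] = (counts[event.type] ?? 0) + 1;
console.log(counts);

// Print the summary of the final event, if present.
for (const event of events) {
  if (event.type === "final") console.log(event.summary);
}
```

The discriminated union lets the compiler narrow each event by its `type` field, so `event.summary` is only reachable for `final` events.
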
- Local Development: use mocked tests for rapid iteration (`npm test -- --watch`).
- Integration Testing: test with the real API (`OPENAI_API_KEY=sk-test npm test`).
- Production Deployment: add auth middleware, configure production env vars, then deploy to your platform.

We welcome contributions! Areas of interest:
- Additional tool integrations
- Evaluation metrics and datasets
- Provider adapters (Anthropic, Cohere, etc.)
- Performance optimizations
Built with ❤️ for the AI engineering community. Start with mocked tests, iterate quickly, and deploy with confidence.