An AI-powered agent that generates Python solutions to coding problems and automatically verifies their correctness — without any human inspection.
Built as a hands-on exploration of ProjCode-003 themes: How do we build software with AI that we can actually trust?
- Takes a coding problem as a natural language description
- Generates a Python solution using an LLM (Groq / LLaMA3-70B)
- Executes the code in a sandboxed Python environment
- Runs test cases and checks outputs against expected results
- Retries on failure — feeds error context back to the LLM for a smarter second attempt
This retry-with-feedback loop is what makes it agentic: the system detects its own failures and attempts self-correction.
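The loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in `main.py`: the names `run_tests`, `solve_with_retries`, and `fake_generate` are hypothetical, and the LLM call is replaced by a stub that returns a wrong answer first and a corrected one once it receives feedback.

```python
from typing import Callable, Optional

# Assumed shape of a test case: (positional args, expected return value).
TestCase = tuple[tuple, object]


def run_tests(code: str, func_name: str, tests: list[TestCase]) -> list[str]:
    """Load the generated code and return one message per failing test."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # note: exec by itself is not a security boundary
        func = namespace[func_name]
    except Exception as exc:
        return [f"Code failed to load: {exc!r}"]

    failures = []
    for args, expected in tests:
        try:
            got = func(*args)
        except Exception as exc:
            failures.append(f"Input {args} raised {exc!r}")
            continue
        if got != expected:
            failures.append(f"Input {args}: expected {expected!r}, got {got!r}")
    return failures


def solve_with_retries(
    generate: Callable[[str, Optional[str]], str],
    description: str,
    func_name: str,
    tests: list[TestCase],
    max_attempts: int = 3,
) -> tuple[Optional[str], int]:
    """Generate, verify, and retry with the failure messages as feedback."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        code = generate(description, feedback)
        failures = run_tests(code, func_name, tests)
        if not failures:
            return code, attempt
        feedback = "\n".join(failures)  # passed to the next generation call
    return None, max_attempts


def fake_generate(description: str, feedback: Optional[str]) -> str:
    """Stand-in for the LLM: wrong on the first call, correct after feedback."""
    if feedback is None:
        return "def fibonacci(n):\n    return n\n"
    return (
        "def fibonacci(n):\n"
        "    a, b = 0, 1\n"
        "    for _ in range(n):\n"
        "        a, b = b, a + b\n"
        "    return a\n"
    )


tests = [((0,), 0), ((1,), 1), ((6,), 8), ((10,), 55)]
code, attempts = solve_with_retries(
    fake_generate, "Return the nth Fibonacci number (0-indexed).", "fibonacci", tests
)
print(f"Verified after {attempts} attempt(s)" if code else "Not verified")
```

Passing the generator in as a callable keeps the loop testable without network access; in the real project that argument would be the Groq-backed function.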
============================================================
PROBLEM: Fibonacci
Description: Return the nth Fibonacci number (0-indexed).
------------------------------------------------------------
🔄 Attempt 1/3 → Generating code...
Generated function:
def fibonacci(n: int) -> int:
    if n <= 1:
        return n
    a, b = 0, 1
    for _ in range(2, n + 1):
        a, b = b, a + b
    return b
📊 Result: 4/4 tests passed
✅ PASS | Input: (0,) → Expected: 0 | Got: 0
✅ PASS | Input: (1,) → Expected: 1 | Got: 1
✅ PASS | Input: (6,) → Expected: 8 | Got: 8
✅ PASS | Input: (10,) → Expected: 55 | Got: 55
🎉 SUCCESS! VERIFIED after 1 attempt(s)
📈 OVERALL PERFORMANCE
Accuracy: 12/12 = 1.00
Result: 12/12 tests passed (100% accuracy) across 4 DSA problems, all on the first attempt.
Input Problem
│
▼
LLM Agent (Groq / LLaMA3-70B)
│ generates Python function
▼
Sandbox Runner (exec)
│ runs test cases
▼
Verifier
│ pass → done ✅
│ fail → feed error back to LLM → retry 🔄
▼
Report (pass/fail per test case)
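The Verifier and Report stages at the bottom of the pipeline could look something like the sketch below. It assumes the generated function has already been loaded into a Python callable; `CaseResult`, `verify`, and `format_report` are illustrative names, not the ones used in `runner.py`.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class CaseResult:
    """Outcome of a single test case."""
    args: tuple
    expected: Any
    got: Any
    passed: bool


def verify(func: Callable, tests: list[tuple[tuple, Any]]) -> list[CaseResult]:
    """Call func on each input and compare the result with the expected value."""
    results = []
    for args, expected in tests:
        try:
            got = func(*args)
            passed = got == expected
        except Exception as exc:
            # A crash counts as a failure; keep the exception text for the report.
            got, passed = repr(exc), False
        results.append(CaseResult(args, expected, got, passed))
    return results


def format_report(results: list[CaseResult]) -> str:
    """Render a summary line followed by one line per test case."""
    passed = sum(r.passed for r in results)
    lines = [f"Result: {passed}/{len(results)} tests passed"]
    for r in results:
        status = "PASS" if r.passed else "FAIL"
        lines.append(
            f"{status} | Input: {r.args} -> Expected: {r.expected} | Got: {r.got}"
        )
    return "\n".join(lines)


def fibonacci(n: int) -> int:
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


results = verify(fibonacci, [((0,), 0), ((1,), 1), ((6,), 8), ((10,), 55)])
print(format_report(results))
```

Keeping the per-case results as structured records, and formatting them separately, makes it straightforward to reuse the same data for retry feedback or logging.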
code-verifier/
├── main.py # Agentic orchestrator — retry loop with error feedback
├── llm.py # Groq API wrapper — prompt engineering for clean code output
├── runner.py # Sandbox executor — runs generated code against test cases
├── problems.py # DSA problem bank with test cases
└── .env # GROQ_API_KEY=your_key_here
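As an illustration of what "prompt engineering for clean code output" might involve, the sketch below builds a prompt that asks for code only and then strips any markdown fence the model adds anyway. Both `build_prompt` and `extract_code` are hypothetical helpers, and the prompt wording is an assumption; the actual `llm.py` may work differently.

```python
import re
from typing import Optional

# Matches a fenced block, optionally tagged "python" or "py".
# `{3} is used so the pattern contains no literal triple backtick.
_FENCE_RE = re.compile(r"`{3}(?:python|py)?[ \t]*\n(.*?)`{3}", re.DOTALL)


def build_prompt(description: str, func_name: str, feedback: Optional[str] = None) -> str:
    """Assemble a prompt; the previous failure is appended when retrying."""
    prompt = (
        f"Write a Python function named {func_name}.\n"
        f"Problem: {description}\n"
        "Return only the function definition, with no explanation."
    )
    if feedback:
        prompt += f"\nYour previous attempt failed:\n{feedback}\nFix the function."
    return prompt


def extract_code(response: str) -> str:
    """Return the first fenced block if present, else the whole response."""
    match = _FENCE_RE.search(response)
    code = match.group(1) if match else response
    return code.strip() + "\n"


fence = "`" * 3
reply = f"Here is the solution:\n{fence}python\ndef f():\n    return 1\n{fence}\nDone."
print(extract_code(reply))
```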
git clone https://github.com/Vedant0527/code-verifier
cd code-verifier
pip install groq python-dotenv
Sign up at console.groq.com for an API key (no credit card required).
echo "GROQ_API_KEY=your_key_here" > .env
python main.py

| Problem | Tests | Result |
|---|---|---|
| Two Sum | 3 | ✅ 3/3 |
| Reverse String | 2 | ✅ 2/2 |
| Fibonacci | 4 | ✅ 4/4 |
| Longest Substring Without Repeating Characters | 3 | ✅ 3/3 |
AI code generation is powerful but not trustworthy on its own. A model can write an elegant-looking solution that still silently fails on edge cases. This project is a minimal prototype of a verification layer: a system that wraps AI generation with automated correctness checking.
This directly mirrors the central question of IITB Trust Lab's ProjCode-003:
"How do we build software with AI that we can actually trust?"
The answer this project proposes: don't trust the output, verify it.
- Error-aware retry — feed exact failure reason back to LLM for smarter correction
- Time-limit enforcement per test case (prevent infinite loops)
- JSON logging of all attempts and results
- CLI interface via `argparse`
- Expand problem bank to LeetCode Easy/Medium set
- Support C++ solution generation + compilation via subprocess
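For the time-limit item, one possible approach is to run each test in a fresh interpreter through `subprocess.run` with a `timeout`, so that an infinite loop is killed rather than hanging the agent. This is only a sketch of that idea, not the project's implementation; `run_with_timeout` and its return format are assumptions.

```python
import subprocess
import sys


def run_with_timeout(code: str, call_expr: str, timeout_s: float = 2.0) -> dict:
    """Run `code`, evaluate `call_expr` in a child interpreter, and enforce a time limit.

    Returns a dict with "status" ("ok", "error", or "timeout") and "output"
    (the repr of the result, the stderr text, or None).
    """
    script = f"{code}\nprint(repr({call_expr}))\n"
    try:
        proc = subprocess.run(
            [sys.executable, "-c", script],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # child is killed if it runs longer than this
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "output": None}
    if proc.returncode != 0:
        return {"status": "error", "output": proc.stderr.strip()}
    return {"status": "ok", "output": proc.stdout.strip()}


ok = run_with_timeout("def double(n):\n    return n * 2", "double(21)", timeout_s=10)
hung = run_with_timeout("def spin():\n    while True:\n        pass", "spin()", timeout_s=0.5)
print(ok["status"], hung["status"])
```

A separate process also isolates crashes from the orchestrator, at the cost of interpreter start-up time per test. The result comes back as text, so comparing it with the expected value requires parsing it or comparing reprs.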
Vedant Shri Agarwal
B.Tech — Electrical and Computer Engineering, Thapar University
GitHub · LinkedIn · [email protected]