A small CLI agent that takes a real-world request ("book me a dentist next week after 5pm", "find 3 coworking spaces in Warsaw under $20/day") and tries to finish it: planning subtasks, asking clarifying questions, calling tools, and producing a clear final summary.
Take-home assignment for the Junior AI Agentic Engineer role (uv edition).
agent-assignment/
├── pyproject.toml # uv-managed dependencies
├── uv.lock # locked versions (generated by `uv sync`)
├── .python-version # 3.11
├── .env.example # copy to .env and fill in your key
├── main.py # CLI entry point (rich UI)
└── src/
├── __init__.py
├── llm.py # thin OpenAI wrapper
├── prompts.py # system prompt
├── tools.py # mock tools + JSON schemas
└── agent.py # tool-calling loop
The agent follows a standard tool-calling loop:
- Send conversation + tool schemas to the LLM.
- If the model returns a tool call, run it (or, for
ask_user, prompt the human) and append the result to the conversation. - Repeat until the model returns a plain message — that's the final answer.
- Bail after
AGENT_MAX_STEPSiterations (default 10) so a confused agent can't loop forever.
| Tool | Purpose |
|---|---|
calendar_check(start_date, end_date) |
List free slots in a date range |
search_service(query, category?, city?, max_price?) |
Search a small mock catalogue (dentists, coworking, hotels, transport) |
booking_service(option_id, when?, notes?) |
Book a previously-found option (10% simulated transient failure) |
reminder_create(title, when, notes?) |
Create a reminder |
ask_user(question) |
Pseudo-tool — pauses and asks the user for clarification |
All tools return {"ok": True, ...} or {"ok": False, "error": "..."} so the
agent can read errors and adapt instead of crashing.
Requires Python 3.10+ and uv.
# 1. Install deps and create a virtualenv
uv sync
# 2. Configure your provider (see .env.example for options)
cp .env.example .env
# edit .env — fill in OPENAI_API_KEY (and OPENAI_BASE_URL if not OpenAI)
# 3. Run the agent
uv run python main.py # interactive
uv run python main.py "Find 3 coworking spaces in Warsaw under $20/day"The agent talks to any OpenAI-compatible chat-completions endpoint.
Defaults in .env.example point at Groq (free, fast) with
meta-llama/llama-4-scout-17b-16e-instruct — chosen because it produces
well-formed tool calls more reliably than older Llamas. llama-3.3-70b
also works but occasionally emits malformed function-name strings; the
agent loop catches that and retries with a corrective hint.
To use OpenAI, OpenRouter, Together, etc., just swap OPENAI_BASE_URL,
OPENAI_API_KEY, and OPENAI_MODEL.
| Var | Default | Notes |
|---|---|---|
OPENAI_API_KEY |
— | Required. Provider key (sk-, gsk_, etc.) |
OPENAI_BASE_URL |
OpenAI's | Override for Groq / OpenRouter / etc. |
OPENAI_MODEL |
gpt-4o-mini |
Any model the provider serves with tool calling |
AGENT_MAX_STEPS |
10 |
Hard cap on tool-calling iterations |
Each screenshot is a real run against the Groq backend
(llama-3.3-70b-versatile).
"Find 3 coworking spaces in Warsaw under $20/day"
The agent calls search_service once with the right category, city, and
max_price filters and produces the final summary. No clarifications needed.
"Book me a dentist appointment next week after 5pm"
This run shows everything the assignment is grading on:
- Three clarifying questions via
ask_user(city, date, time choice between the 17:00 and 18:00 free slots). - A real recovery from a model mistake. The agent's first
booking_servicecall passed a hallucinatedoption_id: "12345". The tool rejected it (Unknown option_id '12345'. Call search_service first to get a valid id.). The agent read the error, calledsearch_serviceagain to get the real id (dent-003), and then booked successfully — confirmationBK-63254422.
That's exactly the failure-handling path the system prompt is designed to push the model into.
"Schedule a meeting with John next Tuesday afternoon and remind me 30 minutes before" "Plan a 2-day trip to Prague under 300 euro"
calendar_check → ask_user → reminder_create for the meeting; then a
chain of search_service (hotel, transport) → ask_user (dates) →
booking_service × 2 → calendar_check for the trip — all under the
€300 cap.
- Why a pseudo-tool for clarifications? It keeps the agent loop uniform — every model output is either a final message or one or more tool calls. No special-case parsing.
- Why mock data over real APIs? Reproducibility and zero credentials.
Swapping
search_serviceto call a real provider is a one-function change; the agent code doesn't move. - Why a 10% booking failure? To force the failure-handling path during evaluation — you'll occasionally see the agent retry once and recover.
- Why limit max steps? A misconfigured prompt or a model that loves re-checking the calendar can spiral. A hard cap is the cheapest safety net.
- Add an integration test with a recorded LLM response (vcrpy / pytest).
- Stream tokens for nicer UX on long final answers.
- Persist
_REMINDERS/_BOOKINGSto disk so a follow-up command (list my bookings) can see them across runs. - Replace the keyword-match search with a tiny vector store once the catalogue grows past ~50 items.



