A self-extending AI agent with persistent memory, sandboxed tool creation, budget-aware model routing, and cloud access via MCP.
Built with LangGraph, Supabase/pgvector, Docker, and FastMCP. Developed in disciplined sprints, each adding a distinct capability layer.
Most LangGraph agent repos are single-feature demos — a memory example here, a tool-calling example there. cairn is an integrated system where every piece works together, designed for individual developers who want to build and run a personal AI agent without enterprise infrastructure.
- Self-Extending Metatool System with Human Approval: The agent can write its own tools, test them in a Docker sandbox, and register them in a database as "pending." Each tool then requires explicit human review and approval via the CLI before it goes live. Self-extending agents exist (see Related Projects), but most auto-promote new capabilities. cairn's full pipeline (create → sandbox test → DB registration → human review → approval → dynamic import) prioritizes safety over convenience.
- 4-Tier Model Routing with Budget Caps: YAML-driven rules route tasks to the cheapest capable model (Qwen 3 8B → Qwen 3 32B → Claude Sonnet → Claude Sonnet extended). Daily spend tracking auto-downgrades to local models when you hit your budget. No LiteLLM dependency, just a simple, readable routing config.
- Daily Research Digest Pipeline: A recurring daemon task scrapes configurable news sources, pre-filters items by embedding similarity against your SCMS project memories, reranks them with a cross-encoder (cairn-rank), then summarizes them with a local 32B model. Three scoring layers (embedding similarity, cross-encoder relevance, and LLM scoring) are tracked against human judgment in an evaluation pipeline. Approved items become permanent memories, which improves future filtering in a feedback loop. They also compile into readable documents and can optionally generate a narrated audio digest via local TTS (Kokoro) for listening on the go. A Jupyter notebook measures each layer's precision/recall against human judgments, surfacing failure modes that feed back into prompt tuning. Runs for ~$0.03/month.
- MCP Server with OAuth 2.1: Your agent's memory and task queue are accessible from claude.ai, Claude Desktop, Claude Code, and mobile via a Railway-deployed MCP server with full OAuth 2.1 (DCR + PKCE). One of the few Python FastMCP + OAuth reference implementations available.
- Persistent Memory (SCMS): Supabase + pgvector with 1536-dimensional embeddings and HNSW cosine search. The agent remembers projects, decisions, learnings, and context across sessions. This is not a demo; it is the actual persistence layer the whole system runs on.
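The metatool approval flow described in the feature list can be pictured as a small state machine. The sketch below is illustrative only: `ToolStatus`, `ToolRecord`, and `ToolRegistry` are hypothetical names, the registry is an in-memory stand-in for the database, and the rule that a failed sandbox test blocks approval is an assumption rather than documented behavior.

```python
from dataclasses import dataclass
from enum import Enum


class ToolStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"


@dataclass
class ToolRecord:
    name: str
    code: str
    sandbox_passed: bool
    status: ToolStatus = ToolStatus.PENDING


class ToolRegistry:
    """In-memory stand-in for the table of agent-created tools (hypothetical)."""

    def __init__(self) -> None:
        self._tools: dict[str, ToolRecord] = {}

    def register(self, name: str, code: str, sandbox_passed: bool) -> ToolRecord:
        # Every new tool starts as pending, whatever the sandbox result was.
        record = ToolRecord(name=name, code=code, sandbox_passed=sandbox_passed)
        self._tools[name] = record
        return record

    def approve(self, name: str) -> ToolRecord:
        # Assumption: a tool that failed its sandbox test cannot be approved.
        record = self._tools[name]
        if not record.sandbox_passed:
            raise ValueError(f"{name} failed its sandbox test and cannot be approved")
        record.status = ToolStatus.APPROVED
        return record

    def loadable(self) -> list[str]:
        # Only approved tools are eligible for dynamic import.
        return [t.name for t in self._tools.values() if t.status is ToolStatus.APPROVED]
```

The point of the design is that nothing becomes loadable as a side effect of creation or testing; the only path to `APPROVED` is an explicit call made by a human reviewer.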
┌──────────────────────────────────────────────────────────┐
│  CLI (main.py)                                           │
│  Modes: single task, interactive REPL, daemon, digest    │
└──────────────┬──────────────────────────────┬────────────┘
               │                              │
               ▼                              ▼
┌──────────────────────────┐    ┌─────────────────────────┐
│  LangGraph StateGraph    │    │  Daemon (daemon.py)     │
│                          │    │  Polling loop + croniter│
│  START                   │    │  Recurring cron tasks   │
│    → CLASSIFY (no LLM)   │    │  Daily digest pipeline  │
│    → PLAN (LLM)          │    └─────────────────────────┘
│    → ACT (LLM+tools)     │
│    → REFLECT (LLM)       │
│    ...or END             │
└──────────┬───────────────┘
           │
           ▼
┌──────────────────────────────────────────────────────────┐
│  Tool Layer (16+ tools)                                  │
│  Data:    web_search, url_reader, arxiv_search,          │
│           github_search                                  │
│  Files:   file_reader, file_writer, note_taker           │
│  Code:    code_executor (Docker sandbox)                 │
│  SCMS:    scms_search, scms_store                        │
│  Project: create_project, update_project, archive_project│
│  Meta:    create_tool, test_tool, list_pending_tools     │
└──────────┬───────────────────────────────────────────────┘
           │
           ▼
┌────────────────────┐  ┌──────────────────┐  ┌──────────────┐
│ Supabase/pgvector  │  │ Docker Sandbox   │  │ Model Router │
│ 5 tables + RPC     │  │ 256MB, no net    │  │ YAML rules   │
│ 1536-dim embeddings│  │ 60s timeout      │  │ 4 tiers      │
└────────────────────┘  └──────────────────┘  └──────────────┘

┌──────────────────────────────────────────┐
│  MCP Server (Railway)                    │
│  FastMCP + OAuth 2.1 (DCR + PKCE)        │
│  18 tools · claude.ai / Desktop / mobile │
└──────────────────────────────────────────┘
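To make the Classify → Plan → Act → Reflect flow in the diagram concrete, here is a minimal plain-Python sketch. It deliberately avoids the LangGraph API, and everything in it is an assumption for illustration: the state fields, the hard-coded plan, and the choice to loop from REFLECT back to ACT until the plan is exhausted.

```python
from typing import TypedDict


class AgentState(TypedDict):
    task: str
    task_type: str
    plan: list[str]
    results: list[str]
    step: int


def classify(state: AgentState) -> AgentState:
    # No LLM call: a trivial keyword check stands in for the real classifier.
    state["task_type"] = "recall" if "remember" in state["task"].lower() else "research"
    return state


def plan(state: AgentState) -> AgentState:
    # The real node would ask an LLM; here the plan is hard-coded.
    state["plan"] = ["gather", "summarize"]
    return state


def act(state: AgentState) -> AgentState:
    # Execute the current plan step and advance the step counter.
    current = state["plan"][state["step"]]
    state["results"].append(f"done: {current}")
    state["step"] += 1
    return state


def reflect(state: AgentState) -> str:
    # Route back to ACT while plan steps remain, otherwise finish.
    return "act" if state["step"] < len(state["plan"]) else "end"


def run(task: str, max_iterations: int = 10) -> AgentState:
    state: AgentState = {"task": task, "task_type": "", "plan": [], "results": [], "step": 0}
    state = plan(classify(state))
    for _ in range(max_iterations):
        state = act(state)
        if reflect(state) == "end":
            break
    return state
```

The `max_iterations` guard is a hypothetical safety bound so that a reflect step that never returns "end" cannot loop forever.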
- Python 3.12+
- uv package manager
- Ollama for local LLM inference
- Docker (recommended for code execution; restricted subprocess fallback available with --allow-subprocess)
- ffmpeg (optional; required for audio digest MP3 export; brew install ffmpeg on macOS)
- A Supabase project (free tier works)
git clone https://github.com/reallyreallyryan/cairn.git
cd cairn
# Install dependencies
uv sync
# Pull local models
ollama pull qwen3:32b
ollama pull qwen3:8b
# Set up Supabase:
# 1. Create a project at supabase.com
# 2. Enable the pgvector extension (Database > Extensions)
# 3. Run the migrations in order: scms/migrations/001_initial.sql through 005
# 4. Copy your Project URL + anon key
# Build the Docker sandbox image (required for code execution and metatool testing)
docker build -t cairn-sandbox -f sandbox/Dockerfile .
# Configure
cp .env.example .env
# Edit .env with your Supabase URL, API keys, etc.
# Start Ollama (auto-started by cairn if needed, or start manually)
ollama serve
# Run your first task
uv run python main.py "What projects am I working on?"

# Single task
python main.py "Search the web for LangGraph tutorials"
# Interactive mode
python main.py -i
# Use cloud model explicitly
python main.py --model cloud "Architect a REST API for task management"
# Daily research digest
python main.py --digest # Run manually
python main.py --review-digest # Review & approve/reject items into memory
python main.py --digest-status # Check last run stats
python main.py --digest-eval # Run evaluation report on approval/rejection history
python main.py --compile-digest # Compile approved items into deep-dive + briefing docs
python main.py --compile-digest --compile-since 2026-03-21 # Filter by date
python main.py --compile-digest --with-audio # Compile + generate audio narration
python main.py --audio-digest # Generate audio from today's briefing
python main.py --audio-from ~/Documents/cairn/digests/2026-04-04_digest_briefing.md # Specific file
# Task queue & daemon
python main.py --queue "Research MCP best practices" --priority 2
python main.py --daemon # Background task processing
# Metatool management
python main.py --pending-tools # List tools awaiting approval
python main.py --review-tool <id> # Review tool code + sandbox test results
python main.py --approve-tool <id> # Approve for production use

cairn was built incrementally across a series of sprints. Each sprint added a distinct capability layer, and each sprint brief was handed to Claude Code for implementation.
| Sprint | Focus | What Was Added |
|---|---|---|
| 1 | Foundation | SCMS + pgvector memory, 10 tools, CLI with single task and interactive modes |
| 2 | Intelligence | Plan→Act→Reflect loop, keyword classifier, multi-step planning, decision logging |
| 3 | Security | Docker sandbox, metatool system, human approval workflow, dynamic tool loading |
| 4 | Autonomy | Model routing, budget tracking, task queue, daemon mode, notifications |
| 5a | Cloud Access | MCP server, OAuth 2.1, Railway deployment, OpenAI embedding migration |
| 5b | Digest Pipeline | Daily research digest, 4-tier routing (Qwen 3 upgrade), local 32B summarization |
| 5c | Hardening | Classifier default fix (multi→research), archived DB status, MCP ToolAnnotations, httpx session fix, ddgs migration |
| 6 | Digest Relevance | Embedding-based pre-filter for digest pipeline, per-source similarity thresholds, cold-start bypass |
| 7 | Digest Hardening | Few-shot calibration from approval history, digest dedup on ingest, evaluation pipeline with threshold analysis, digest sources expanded to 16, digest_eval MCP tool |
| 8 | Security + Reranking | gitleaks pre-commit hook, cairn-rank cross-encoder integration into digest pipeline, three-layer scoring eval |
| 8b | Digest Compiler | Compile approved digest items into deep-dive and briefing documents with full article fetching and LLM summarization |
| 8c | Audio Digest | Text-to-speech narration of briefing digests via Kokoro TTS (local, free) with OpenAI TTS fallback for listening on the go |
| 9 | Evals Deep Dive | Jupyter notebook for baseline scoring analysis: per-threshold precision/recall/F1, layer correlation, failure mode extraction, sharpened LLM relevance prompt |
| 10 | Q&A Audio Digest | Host/Expert conversation style with SCMS project context, dual Kokoro voices (male host, female expert), markdown artifact cleanup |
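The per-threshold precision/recall/F1 analysis mentioned for Sprint 9 can be illustrated with a few lines of Python. This is a sketch under assumptions, not the notebook's code: the function name, the `scores`/`approved` inputs, and the "score at or above threshold means keep" rule are all hypothetical.

```python
def precision_recall_f1(
    scores: list[float], approved: list[bool], threshold: float
) -> tuple[float, float, float]:
    """Score one layer against human approvals at a single threshold."""
    # An item is predicted relevant when its score reaches the threshold.
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, a in zip(predicted, approved) if p and a)
    fp = sum(1 for p, a in zip(predicted, approved) if p and not a)
    fn = sum(1 for p, a in zip(predicted, approved) if not p and a)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def sweep(scores: list[float], approved: list[bool], thresholds: list[float]):
    # One (threshold, precision, recall, f1) row per candidate threshold.
    return [(t, *precision_recall_f1(scores, approved, t)) for t in thresholds]
```

Running `sweep` over a range of thresholds for each scoring layer gives the kind of table from which a cutoff can be chosen: lowering the threshold usually raises recall at the cost of precision.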
├── agent/ # LangGraph agent
│ ├── graph.py # StateGraph: classify → plan → act → reflect
│ ├── classifier.py # Keyword-based task classification (no LLM call)
│ ├── classify.py # CLASSIFY node: task type, project detection, SCMS context
│ ├── plan.py # PLAN node: LLM plan generation, step parsing
│ ├── act.py # ACT node: tool execution, fallback dispatch
│ ├── reflect.py # REFLECT node: result evaluation, continuation logic
│ ├── utils.py # Shared utilities: get_llm(), clean_output()
│ ├── nodes.py # Re-exports from split modules (backward compat)
│ ├── state.py # AgentState TypedDict
│ ├── model_router.py # Complexity → tier → budget check → LLM instance
│ ├── daemon.py # Background task queue processor
│ ├── digest.py # Daily research digest orchestrator
│ ├── evaluation.py # Digest evaluation: metrics, thresholds, reports
│ ├── compile_digest.py # Digest compiler: article fetching + LLM summaries
│ ├── audio_digest.py # Audio digest: TTS narration (Kokoro local / OpenAI fallback)
│ ├── notifications.py # macOS + file log notifications
│ └── tools/ # 16+ tools (web, files, code, SCMS, project, metatool)
│ ├── web_search.py
│ ├── url_reader.py
│ ├── arxiv_search.py
│ ├── github_search.py
│ ├── file_reader.py
│ ├── file_writer.py
│ ├── note_taker.py
│ ├── code_executor.py
│ ├── scms_tools.py
│ ├── project_tools.py # create_project, update_project, archive_project
│ ├── metatool.py
│ └── custom/ # Agent-created tools (after human approval)
├── mcp_server/ # FastMCP server for cloud access
│ ├── server.py # 18 MCP tools, OAuth 2.1
│ └── config.py
├── config/ # YAML configs
│ ├── model_routing.yaml
│ ├── sandbox_policy.yaml
│ └── digest_sources.yaml
├── scms/ # Shared Context Memory Store
│ ├── client.py # SCMSClient — CRUD + semantic search
│ ├── embeddings.py # OpenAI text-embedding-3-small
│ └── migrations/ # Supabase SQL migrations (001–005)
├── sandbox/ # Docker sandbox
│ ├── Dockerfile
│ └── manager.py # Container lifecycle, code injection, cleanup
├── notebooks/ # Analysis notebooks (dev deps: pandas, matplotlib, jupyterlab)
│ └── sprint9_eval_baseline.ipynb # Digest scoring baseline analysis
├── tests/ # Integration tests (pytest)
│ ├── test_project_crud.py
│ ├── test_metatool_loading.py
│ ├── test_digest_dedup.py
│ ├── test_digest_fewshot.py
│ ├── test_evaluation.py
│ ├── test_rerank.py
│ └── test_audio_digest.py
└── main.py # CLI entry point
uv run pytest

Tests use a mocked SCMS client, so no Supabase or Docker is required to run them.
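As a rough idea of what a test against a mocked client looks like, the sketch below uses the standard library's `unittest.mock`. The function under test (`recall_projects`) and the client's `search` method are hypothetical stand-ins; the project's real SCMSClient interface may differ.

```python
from unittest.mock import MagicMock


def recall_projects(client) -> list[str]:
    """Hypothetical function under test: returns project names from search hits."""
    hits = client.search("active projects")
    return [hit["name"] for hit in hits]


def test_recall_projects_uses_mocked_client():
    # The mock replaces the network-backed client, so no Supabase is needed.
    client = MagicMock()
    client.search.return_value = [{"name": "cairn"}, {"name": "cairn-rank"}]

    assert recall_projects(client) == ["cairn", "cairn-rank"]
    client.search.assert_called_once_with("active projects")
```

Because the mock records its calls, the test can check both the returned value and that the client was queried exactly once with the expected argument.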
Key choices and their tradeoffs:
- Keyword classifier over LLM classifier — Task classification uses deterministic keyword matching, not an LLM call. Faster, cheaper, predictable. Falls back to "research" with research-focused tools for ambiguous tasks.
- Supabase over SQLite — pgvector for semantic search, cloud-accessible from MCP server, single source of truth. Requires network connectivity but enables the entire cloud access story.
- Flat cost estimates over token tracking — Simple $0/$0.01/$0.03 per-call tiers rather than token-level metering. Sufficient for cost tracking. Token-level tracking deferred to future work.
- Human approval for agent-created tools — The metatool pipeline requires explicit CLI approval. No auto-promotion, ever. This is a deliberate safety decision.
- Two-stage tool promotion — Sandbox-built tools go live in the daemon/CLI after human approval (stage 1). Promoting a tool to the MCP server for cloud clients requires Claude Code review and a Railway redeploy (stage 2). No tool reaches claude.ai or Claude Desktop without two gates.
- Local-first model routing — Default tier is local (free). Cloud models only used when routing rules determine the task needs them. Budget exhaustion auto-downgrades to local.
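The keyword-classifier choice from the first bullet above can be sketched in a few lines. The keyword table here is invented for illustration (the project's actual keywords and task types are not listed in this README); only the "research" fallback comes from the text.

```python
import re

# Hypothetical keyword table; the real classifier's keywords are not shown here.
KEYWORDS: dict[str, tuple[str, ...]] = {
    "recall": ("remember", "recall", "what projects"),
    "code": ("run", "execute", "script"),
    "notes": ("note", "write down"),
}
DEFAULT_TYPE = "research"  # fallback for ambiguous tasks, as described above


def classify_task(task: str) -> str:
    """Return the first task type whose keyword appears as a whole word or phrase."""
    text = task.lower()
    for task_type, words in KEYWORDS.items():
        if any(re.search(rf"\b{re.escape(w)}\b", text) for w in words):
            return task_type
    return DEFAULT_TYPE
```

Matching is deterministic and costs no model call, which is the tradeoff the bullet describes: predictable and free, at the price of missing tasks phrased in unexpected ways.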
cairn is designed to be cheap to run daily:
| Operation | Model | Cost |
|---|---|---|
| Simple recall / notes | Qwen 3 8B (local) | $0.00 |
| Summarization / digest | Qwen 3 32B (local) | $0.00 |
| Research / multi-step | Claude Sonnet (cloud) | ~$0.01/task |
| Complex technical | Claude Sonnet extended | ~$0.03/task |
| Daily digest (full run) | Local + embedding | ~$0.001/day |
| Audio digest (Kokoro) | Local TTS (82M params) | $0.00 |
| Audio digest (OpenAI fallback) | OpenAI TTS | ~$0.18/digest |
| Daily budget cap | Configurable | Default $5.00 |
The daily digest pipeline runs almost entirely on local models. The only cloud cost is embedding approved items via OpenAI text-embedding-3-small (~$0.03/month).
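The interaction between flat per-call costs and the daily budget cap can be sketched as follows. The per-call figures and the $5.00 default come from the table above; the tier names, the `BudgetTracker` class, and the choice of which local tier to fall back to are assumptions made for illustration.

```python
from dataclasses import dataclass

# Flat per-call estimates from the cost table; tier names are illustrative.
TIER_COST = {
    "local_small": 0.00,
    "local_large": 0.00,
    "cloud": 0.01,
    "cloud_extended": 0.03,
}
LOCAL_FALLBACK = "local_large"  # assumption: downgrade target when over budget


@dataclass
class BudgetTracker:
    daily_cap: float = 5.00
    spent: float = 0.0

    def can_afford(self, cost: float) -> bool:
        return self.spent + cost <= self.daily_cap

    def record(self, cost: float) -> None:
        self.spent += cost


def route(requested_tier: str, budget: BudgetTracker) -> str:
    """Return the tier to use, downgrading to local when the cap would be exceeded."""
    cost = TIER_COST[requested_tier]
    tier = requested_tier if budget.can_afford(cost) else LOCAL_FALLBACK
    budget.record(TIER_COST[tier])
    return tier
```

With flat estimates the tracker only needs one addition per call, which is why token-level metering can be deferred without losing the budget cap.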
- Improve digest relevance scoring (embedding pre-filter + few-shot calibration from approval/rejection history)
- Evaluation pipeline using digest approval/rejection data
- Cross-encoder reranking via cairn-rank (three-layer scoring comparison)
- Security hardening (gitleaks pre-commit hook)
- Audio digest — TTS narration of briefing digests for listening on the go
- Evals baseline — per-layer precision/recall analysis, failure modes, prompt sharpening
- Memory deduplication and aging
- 24/7 daemon deployment on Railway
- Multi-agent collaboration patterns
cairn exists in a growing ecosystem of autonomous agent tools. These projects explore overlapping ideas at different scales:
- OpenClaw — Personal AI assistant with 214k+ stars. Connects to messaging platforms with self-extending skills. Different architecture (gateway vs. research agent) and auto-promotes new capabilities without human approval.
- NVIDIA OpenShell — Enterprise sandbox for self-evolving agents with policy controls. Requires DGX/RTX hardware. cairn targets the same safety-first philosophy at a scale that runs on a laptop with Ollama.
- LangGraph — The state machine framework cairn is built on. cairn's Classify→Plan→Act→Reflect loop is one opinionated implementation of LangGraph's primitives.
- LiteLLM — Production LLM proxy with routing and budget management at enterprise scale. cairn's model router is a lightweight alternative for solo developers who want the same idea in a YAML file.
- e2b — Cloud sandboxing for AI code execution. cairn uses a simpler local Docker sandbox with resource limits.
Issues and PRs welcome. See CONTRIBUTING.md for details.
If you build something interesting with cairn, I'd love to hear about it.
MIT — see LICENSE for details.