cairn

A self-extending AI agent with persistent memory, sandboxed tool creation, budget-aware model routing, and cloud access via MCP.

Built with LangGraph, Supabase/pgvector, Docker, and FastMCP. Developed incrementally across 10 sprints (see Sprint History), each adding a distinct capability layer.

What Makes This Different

Most LangGraph agent repos are single-feature demos — a memory example here, a tool-calling example there. cairn is an integrated system where every piece works together, designed for individual developers who want to build and run a personal AI agent without enterprise infrastructure.

Self-Extending Metatool System with Human Approval — The agent can write its own tools, test them in a Docker sandbox, register them in a database as "pending," and require explicit human review and approval via CLI before they go live. Self-extending agents exist (see Related Projects), but most auto-promote new capabilities. cairn's full pipeline — create → sandbox test → DB registration → human review → approval → dynamic import — prioritizes safety over convenience.
4-Tier Model Routing with Budget Caps — YAML-driven rules route tasks to the cheapest capable model (Qwen 3 8B → Qwen 3 32B → Claude Sonnet → Claude Extended). Daily spend tracking auto-downgrades to local models when you hit your budget. No LiteLLM dependency — just a simple, readable routing config.
Daily Research Digest Pipeline — A recurring daemon task scrapes configurable news sources, pre-filters items by embedding similarity against your SCMS project memories, reranks them with a cross-encoder (cairn-rank), then summarizes with a local 32B model. An evaluation pipeline tracks the three scoring layers (embedding similarity, cross-encoder relevance, and LLM scoring) against human judgment, and a Jupyter notebook measures each layer's precision/recall, surfacing failure modes that feed back into prompt tuning. Approved items become permanent memories, which improves future filtering in a feedback loop; they also compile into readable documents and can optionally generate a narrated audio digest via local TTS (Kokoro) for listening on the go. Runs for ~$0.03/month.
MCP Server with OAuth 2.1 — Your agent's memory and task queue are accessible from claude.ai, Claude Desktop, Claude Code, and mobile via a Railway-deployed MCP server with full OAuth 2.1 (DCR + PKCE). One of the few Python FastMCP + OAuth reference implementations available.
Persistent Memory (SCMS) — Supabase + pgvector with 1536-dimensional embeddings and HNSW cosine search. The agent remembers across sessions — projects, decisions, learnings, and context. Not a demo — this is the actual persistence layer the whole system runs on.
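
The metatool approval flow from the first item above can be sketched as a small state machine. This is a minimal illustration, not cairn's actual code: `ToolRecord`, `ToolStatus`, `register_tool`, `approve_tool`, and `load_tool` are hypothetical names, an in-memory dict stands in for the database, and the sandbox test result is passed in as a boolean instead of being produced by a Docker run.

```python
from dataclasses import dataclass
from enum import Enum
from types import ModuleType


class ToolStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"


@dataclass
class ToolRecord:
    name: str
    source: str
    sandbox_passed: bool
    status: ToolStatus = ToolStatus.PENDING


# Hypothetical in-memory stand-in for the tool registry table.
REGISTRY: dict[str, ToolRecord] = {}


def register_tool(name: str, source: str, sandbox_passed: bool) -> ToolRecord:
    """Store a newly written tool as pending; it is never live at this point."""
    record = ToolRecord(name=name, source=source, sandbox_passed=sandbox_passed)
    REGISTRY[name] = record
    return record


def approve_tool(name: str) -> ToolRecord:
    """Human gate: only a tool that passed its sandbox test can be approved."""
    record = REGISTRY[name]
    if not record.sandbox_passed:
        raise ValueError(f"{name} failed its sandbox test and cannot be approved")
    record.status = ToolStatus.APPROVED
    return record


def load_tool(name: str) -> ModuleType:
    """Dynamically import an approved tool; pending tools are refused."""
    record = REGISTRY[name]
    if record.status is not ToolStatus.APPROVED:
        raise PermissionError(f"{name} is still {record.status.value}")
    module = ModuleType(name)
    # Executing stored source is tolerable here only because a human approved it.
    exec(record.source, module.__dict__)
    return module
```

The point of the sketch is the ordering: loading checks the approval status, and approval checks the sandbox result, so there is no path from "created" to "importable" that skips the human step.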

Architecture

┌─────────────────────────────────────────────────────────┐
│                        CLI (main.py)                     │
│  Modes: single task, interactive REPL, daemon, digest    │
└──────────────┬──────────────────────────────┬────────────┘
               │                              │
               ▼                              ▼
┌──────────────────────────┐    ┌─────────────────────────┐
│   LangGraph StateGraph    │    │    Daemon (daemon.py)    │
│                           │    │  Polling loop + croniter │
│  START                    │    │  Recurring cron tasks    │
│    → CLASSIFY (no LLM)    │    │  Daily digest pipeline   │
│    → PLAN    (LLM)        │    └─────────────────────────┘
│    → ACT     (LLM+tools)  │
│    → REFLECT (LLM)        │
│    ...or END              │
└──────────┬────────────────┘
           │
           ▼
┌──────────────────────────────────────────────────────────┐
│                    Tool Layer (16+ tools)                  │
│  Data:    web_search, url_reader, arxiv_search,           │
│           github_search                                   │
│  Files:   file_reader, file_writer, note_taker            │
│  Code:    code_executor (Docker sandbox)                  │
│  SCMS:    scms_search, scms_store                         │
│  Project: create_project, update_project, archive_project │
│  Meta:    create_tool, test_tool, list_pending_tools      │
└──────────┬────────────────────────────────────────────────┘
           │
           ▼
┌────────────────────┐  ┌──────────────────┐  ┌──────────────┐
│  Supabase/pgvector  │  │  Docker Sandbox   │  │ Model Router  │
│  5 tables + RPC     │  │  256MB, no net    │  │ YAML rules    │
│  1536-dim embeddings│  │  60s timeout      │  │ 4 tiers       │
└────────────────────┘  └──────────────────┘  └──────────────┘

               ┌──────────────────────────────────────────┐
               │  MCP Server (Railway)                     │
               │  FastMCP + OAuth 2.1 (DCR + PKCE)        │
               │  18 tools · claude.ai / Desktop / mobile  │
               └──────────────────────────────────────────┘
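
The control flow in the StateGraph box can be approximated in plain Python. The sketch below deliberately avoids the LangGraph API and makes several assumptions: the keyword table is invented, the LLM and tool runner are injected as plain callables, and REFLECT is assumed to loop back to ACT while plan steps remain. `run_agent` and the helper names are hypothetical.

```python
from typing import Callable, TypedDict


class AgentState(TypedDict):
    task: str
    task_type: str
    plan: list[str]
    results: list[str]
    step: int


# Hypothetical keyword table; the first matching task type wins.
KEYWORDS = {
    "recall": ("remember", "what projects", "recall"),
    "code": ("run", "execute", "script"),
}


def classify(state: AgentState) -> AgentState:
    """Deterministic keyword match, no LLM call; ambiguous tasks become 'research'."""
    text = state["task"].lower()
    for task_type, words in KEYWORDS.items():
        if any(word in text for word in words):
            return {**state, "task_type": task_type}
    return {**state, "task_type": "research"}


def plan(state: AgentState, llm: Callable[[str], str]) -> AgentState:
    """Ask the model for a plan and split it into one step per line."""
    steps = [line.strip() for line in llm(state["task"]).splitlines() if line.strip()]
    return {**state, "plan": steps, "step": 0}


def act(state: AgentState, run_step: Callable[[str], str]) -> AgentState:
    """Execute the current step and record its result."""
    result = run_step(state["plan"][state["step"]])
    return {**state, "results": state["results"] + [result], "step": state["step"] + 1}


def reflect(state: AgentState) -> str:
    """Route back to 'act' while steps remain, otherwise finish."""
    return "act" if state["step"] < len(state["plan"]) else "end"


def run_agent(task: str, llm: Callable[[str], str],
              run_step: Callable[[str], str]) -> AgentState:
    state: AgentState = {"task": task, "task_type": "", "plan": [], "results": [], "step": 0}
    state = plan(classify(state), llm)
    while state["plan"]:
        state = act(state, run_step)
        if reflect(state) == "end":
            break
    return state
```

In a real LangGraph build each function would be registered as a node and `reflect` would serve as the conditional edge; the plain loop is used here only to keep the example dependency-free.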

Quick Start

Prerequisites

Python 3.12+
uv package manager
Ollama for local LLM inference
Docker (recommended for code execution; restricted subprocess fallback available with --allow-subprocess)
ffmpeg (optional — required for audio digest MP3 export; brew install ffmpeg on macOS)
A Supabase project (free tier works)

Setup

git clone https://github.com/reallyreallyryan/cairn.git
cd cairn

# Install dependencies
uv sync

# Pull local models
ollama pull qwen3:32b
ollama pull qwen3:8b

# Set up Supabase:
# 1. Create a project at supabase.com
# 2. Enable the pgvector extension (Database > Extensions)
# 3. Run the migrations in order: scms/migrations/001_initial.sql through 005
# 4. Copy your Project URL + anon key

# Build the Docker sandbox image (required for code execution and metatool testing)
docker build -t cairn-sandbox -f sandbox/Dockerfile .

# Configure
cp .env.example .env
# Edit .env with your Supabase URL, API keys, etc.

# Start Ollama (auto-started by cairn if needed, or start manually)
ollama serve

# Run your first task
uv run python main.py "What projects am I working on?"

Usage

# Single task
python main.py "Search the web for LangGraph tutorials"

# Interactive mode
python main.py -i

# Use cloud model explicitly
python main.py --model cloud "Architect a REST API for task management"

# Daily research digest
python main.py --digest              # Run manually
python main.py --review-digest       # Review & approve/reject items into memory
python main.py --digest-status       # Check last run stats
python main.py --digest-eval         # Run evaluation report on approval/rejection history
python main.py --compile-digest      # Compile approved items into deep-dive + briefing docs
python main.py --compile-digest --compile-since 2026-03-21  # Filter by date
python main.py --compile-digest --with-audio  # Compile + generate audio narration
python main.py --audio-digest        # Generate audio from today's briefing
python main.py --audio-from ~/Documents/cairn/digests/2026-04-04_digest_briefing.md  # Specific file

# Task queue & daemon
python main.py --queue "Research MCP best practices" --priority 2
python main.py --daemon              # Background task processing

# Metatool management
python main.py --pending-tools       # List tools awaiting approval
python main.py --review-tool <id>    # Review tool code + sandbox test results
python main.py --approve-tool <id>   # Approve for production use

Sprint History

cairn was built incrementally across 10 numbered sprints, some split into sub-sprints (5a–5c, 8b, 8c). Each sprint added a distinct capability layer, and each sprint brief was handed to Claude Code for implementation.

Sprint	Focus	What Was Added
1	Foundation	SCMS + pgvector memory, 10 tools, CLI with single task and interactive modes
2	Intelligence	Plan→Act→Reflect loop, keyword classifier, multi-step planning, decision logging
3	Security	Docker sandbox, metatool system, human approval workflow, dynamic tool loading
4	Autonomy	Model routing, budget tracking, task queue, daemon mode, notifications
5a	Cloud Access	MCP server, OAuth 2.1, Railway deployment, OpenAI embedding migration
5b	Digest Pipeline	Daily research digest, 4-tier routing (Qwen 3 upgrade), local 32B summarization
5c	Hardening	Classifier default fix (multi→research), `archived` DB status, MCP ToolAnnotations, httpx session fix, ddgs migration
6	Digest Relevance	Embedding-based pre-filter for digest pipeline, per-source similarity thresholds, cold-start bypass
7	Digest Hardening	Few-shot calibration from approval history, digest dedup on ingest, evaluation pipeline with threshold analysis, digest sources expanded to 16, digest_eval MCP tool
8	Security + Reranking	gitleaks pre-commit hook, cairn-rank cross-encoder integration into digest pipeline, three-layer scoring eval
8b	Digest Compiler	Compile approved digest items into deep-dive and briefing documents with full article fetching and LLM summarization
8c	Audio Digest	Text-to-speech narration of briefing digests for listening on the go, via Kokoro TTS (local, free) with an OpenAI TTS fallback
9	Evals Deep Dive	Jupyter notebook for baseline scoring analysis: per-threshold precision/recall/F1, layer correlation, failure mode extraction, sharpened LLM relevance prompt
10	Q&A Audio Digest	Host/Expert conversation style with SCMS project context, dual Kokoro voices (male host, female expert), markdown artifact cleanup
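
The per-threshold precision/recall/F1 analysis mentioned for sprints 7 and 9 can be illustrated with a short, dependency-free sketch. The function name and the sample scores are assumptions made for this example; the numbers are toy data, not measurements from cairn's digest history.

```python
def precision_recall_f1(scores, approved, threshold):
    """Treat score >= threshold as a predicted 'relevant' and compare to human labels."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and a for p, a in zip(predicted, approved))
    fp = sum(p and not a for p, a in zip(predicted, approved))
    fn = sum(a and not p for p, a in zip(predicted, approved))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data for illustration only: one similarity score and one human verdict per item.
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
approved = [True, True, False, True, False]

for threshold in (0.3, 0.5, 0.7):
    p, r, f1 = precision_recall_f1(scores, approved, threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Sweeping the threshold this way shows the usual tradeoff: a stricter cutoff raises precision and lowers recall, which is the kind of evidence a per-source threshold can be tuned against.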

Project Structure

├── agent/                # LangGraph agent
│   ├── graph.py          # StateGraph: classify → plan → act → reflect
│   ├── classifier.py     # Keyword-based task classification (no LLM call)
│   ├── classify.py       # CLASSIFY node: task type, project detection, SCMS context
│   ├── plan.py           # PLAN node: LLM plan generation, step parsing
│   ├── act.py            # ACT node: tool execution, fallback dispatch
│   ├── reflect.py        # REFLECT node: result evaluation, continuation logic
│   ├── utils.py          # Shared utilities: get_llm(), clean_output()
│   ├── nodes.py          # Re-exports from split modules (backward compat)
│   ├── state.py          # AgentState TypedDict
│   ├── model_router.py   # Complexity → tier → budget check → LLM instance
│   ├── daemon.py         # Background task queue processor
│   ├── digest.py         # Daily research digest orchestrator
│   ├── evaluation.py     # Digest evaluation: metrics, thresholds, reports
│   ├── compile_digest.py  # Digest compiler: article fetching + LLM summaries
│   ├── audio_digest.py   # Audio digest: TTS narration (Kokoro local / OpenAI fallback)
│   ├── notifications.py  # macOS + file log notifications
│   └── tools/            # 16+ tools (web, files, code, SCMS, project, metatool)
│       ├── web_search.py
│       ├── url_reader.py
│       ├── arxiv_search.py
│       ├── github_search.py
│       ├── file_reader.py
│       ├── file_writer.py
│       ├── note_taker.py
│       ├── code_executor.py
│       ├── scms_tools.py
│       ├── project_tools.py # create_project, update_project, archive_project
│       ├── metatool.py
│       └── custom/       # Agent-created tools (after human approval)
├── mcp_server/           # FastMCP server for cloud access
│   ├── server.py         # 18 MCP tools, OAuth 2.1
│   └── config.py
├── config/               # YAML configs
│   ├── model_routing.yaml
│   ├── sandbox_policy.yaml
│   └── digest_sources.yaml
├── scms/                 # Shared Context Memory Store
│   ├── client.py         # SCMSClient — CRUD + semantic search
│   ├── embeddings.py     # OpenAI text-embedding-3-small
│   └── migrations/       # Supabase SQL migrations (001–005)
├── sandbox/              # Docker sandbox
│   ├── Dockerfile
│   └── manager.py        # Container lifecycle, code injection, cleanup
├── notebooks/            # Analysis notebooks (dev deps: pandas, matplotlib, jupyterlab)
│   └── sprint9_eval_baseline.ipynb  # Digest scoring baseline analysis
├── tests/                # Integration tests (pytest)
│   ├── test_project_crud.py
│   ├── test_metatool_loading.py
│   ├── test_digest_dedup.py
│   ├── test_digest_fewshot.py
│   ├── test_evaluation.py
│   ├── test_rerank.py
│   └── test_audio_digest.py
└── main.py               # CLI entry point

Testing

uv run pytest

Tests use a mocked SCMS client, so neither Supabase nor Docker is required to run them.
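
A test written against a mocked client might look like the sketch below. It uses `unittest.mock.MagicMock` from the standard library; `recall_projects` and the `search(query, limit=...)` method are hypothetical stand-ins, since the real SCMSClient interface is not shown here.

```python
from unittest.mock import MagicMock


def recall_projects(client, query: str, limit: int = 3) -> list[str]:
    """Hypothetical helper: return the titles of the top memories for a query."""
    rows = client.search(query, limit=limit)
    return [row["title"] for row in rows]


def test_recall_projects_uses_mocked_client():
    # MagicMock stands in for the SCMS client, so no Supabase connection is made.
    client = MagicMock()
    client.search.return_value = [{"title": "cairn"}, {"title": "cairn-rank"}]

    assert recall_projects(client, "active projects") == ["cairn", "cairn-rank"]
    client.search.assert_called_once_with("active projects", limit=3)
```

Because the client is injected as an argument, the same helper runs unchanged against a real client outside the test suite.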

Design Decisions

Key choices and their tradeoffs:

Keyword classifier over LLM classifier — Task classification uses deterministic keyword matching, not an LLM call. Faster, cheaper, predictable. Falls back to "research" with research-focused tools for ambiguous tasks.
Supabase over SQLite — pgvector for semantic search, cloud-accessible from MCP server, single source of truth. Requires network connectivity but enables the entire cloud access story.
Flat cost estimates over token tracking — Simple $0/$0.01/$0.03 per-call tiers rather than token-level metering. Sufficient for cost tracking. Token-level tracking deferred to future work.
Human approval for agent-created tools — The metatool pipeline requires explicit CLI approval. No auto-promotion, ever. This is a deliberate safety decision.
Two-stage tool promotion — Sandbox-built tools go live in the daemon/CLI after human approval (stage 1). Promoting a tool to the MCP server for cloud clients requires Claude Code review and a Railway redeploy (stage 2). No tool reaches claude.ai or Claude Desktop without two gates.
Local-first model routing — Default tier is local (free). Cloud models only used when routing rules determine the task needs them. Budget exhaustion auto-downgrades to local.
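
The flat-cost and local-first decisions above combine into a very small router. The following is a sketch under stated assumptions, not the contents of model_router.py: the tier names, the task-type mapping, the `BudgetTracker` class, and the choice of `local_large` as the downgrade target are all hypothetical. Only the per-call costs and the $5.00 default cap come from this README.

```python
from dataclasses import dataclass

# Flat per-call cost estimates in USD, matching the $0/$0.01/$0.03 tiers.
TIER_COST = {
    "local_small": 0.00,
    "local_large": 0.00,
    "cloud": 0.01,
    "cloud_extended": 0.03,
}

# Hypothetical mapping from task type to preferred tier.
ROUTING_RULES = {
    "recall": "local_small",
    "summarize": "local_large",
    "research": "cloud",
    "complex": "cloud_extended",
}


@dataclass
class BudgetTracker:
    daily_cap: float = 5.00
    spent_today: float = 0.0

    def can_afford(self, cost: float) -> bool:
        return self.spent_today + cost <= self.daily_cap

    def record(self, cost: float) -> None:
        self.spent_today += cost


def route(task_type: str, budget: BudgetTracker) -> str:
    """Pick the rule's tier, downgrading to a local tier if the cap would be exceeded."""
    tier = ROUTING_RULES.get(task_type, "local_small")  # unknown tasks stay local
    cost = TIER_COST[tier]
    if cost > 0 and not budget.can_afford(cost):
        return "local_large"
    budget.record(cost)
    return tier
```

Charging a fixed amount per call keeps the budget check to one comparison, at the price of ignoring how long each prompt or response actually was.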

Cost

cairn is designed to be cheap to run daily:

Operation	Model	Cost
Simple recall / notes	Qwen 3 8B (local)	$0.00
Summarization / digest	Qwen 3 32B (local)	$0.00
Research / multi-step	Claude Sonnet (cloud)	~$0.01/task
Complex technical	Claude Sonnet extended	~$0.03/task
Daily digest (full run)	Local + embedding	~$0.001/day
Audio digest (Kokoro)	Local TTS (82M params)	$0.00
Audio digest (OpenAI fallback)	OpenAI TTS	~$0.18/digest
Daily budget cap	Configurable	Default $5.00

The daily digest pipeline runs almost entirely on local models. The only cloud cost is embedding approved items via OpenAI text-embedding-3-small (~$0.03/month).

Roadmap

Improve digest relevance scoring (embedding pre-filter + few-shot calibration from approval/rejection history)
Evaluation pipeline using digest approval/rejection data
Cross-encoder reranking via cairn-rank (three-layer scoring comparison)
Security hardening (gitleaks pre-commit hook)
Audio digest — TTS narration of briefing digests for listening on the go
Evals baseline — per-layer precision/recall analysis, failure modes, prompt sharpening
Memory deduplication and aging
24/7 daemon deployment on Railway
Multi-agent collaboration patterns

Related Projects

cairn exists in a growing ecosystem of autonomous agent tools. These projects explore overlapping ideas at different scales:

OpenClaw — Personal AI assistant with 214k+ stars. Connects to messaging platforms with self-extending skills. Different architecture (gateway vs. research agent) and auto-promotes new capabilities without human approval.
NVIDIA OpenShell — Enterprise sandbox for self-evolving agents with policy controls. Requires DGX/RTX hardware. cairn targets the same safety-first philosophy at a scale that runs on a laptop with Ollama.
LangGraph — The state machine framework cairn is built on. cairn's Classify→Plan→Act→Reflect loop is one opinionated implementation of LangGraph's primitives.
LiteLLM — Production LLM proxy with routing and budget management at enterprise scale. cairn's model router is a lightweight alternative for solo developers who want the same idea in a YAML file.
e2b — Cloud sandboxing for AI code execution. cairn uses a simpler local Docker sandbox with resource limits.

Contributing

Issues and PRs welcome. See CONTRIBUTING.md for details.

If you build something interesting with cairn, I'd love to hear about it.

License

MIT — see LICENSE for details.
