code2lean

░█▀▀░█▀█░█▀▄░█▀▀░▀▀▄░█░░░█▀▀░█▀█░█▀█
░█░░░█░█░█░█░█▀▀░▄▀░░█░░░█▀▀░█▀█░█░█
░▀▀▀░▀▀▀░▀▀░░▀▀▀░▀▀▀░▀▀▀░▀▀▀░▀░▀░▀░▀

Source code → LLM proposer → Lean 4 verifier → multi-gate validation.

Lean is the verifier. The LLM is just a proposer.

English · 简体中文

A pipeline that takes a function in source code, asks an LLM to translate it into Lean 4 and propose a correctness theorem, then runs five independent gates — sanitizer, Lean compile, axiom allowlist, differential test against the original source, and a second-LLM critic — to decide whether to trust the result.

The point is the gates, not the translation. An LLM can write theorem foo : P := sorry and call it "verified." The whole design exists to make that lie detectable.

Why this exists

A theorem can be formally correct while the implementation it describes still has a security flaw. The repo's seed example is HMAC tag comparison:

source/token_verify_vulnerable.py — early-exit byte loop; functionally correct, timing-leaky.
source/token_verify_fixed.py — hmac.compare_digest; constant-time.
RepoVerify/TokenVerify.lean — both the functional theorem (true for both implementations) and the cost theorem (only true for the fixed one).
source/attack_demo.py — deterministic comparison-cost oracle that recovers the secret HMAC tag from the vulnerable implementation.

Run the attack:

python source/attack_demo.py

The recovered tag matches the secret. The lesson: a verifier that only proves functional theorems would green-light the vulnerable implementation. Real assurance needs stronger theorems, and you have to ask whether the theorem you proved is the right one.

The generic pipeline (verify/) generalizes this question: for any extracted function, does the proposed Lean theorem actually pin down the function's behavior, or is it a vacuous restatement that any buggy implementation would also satisfy?

Quick start

Install

Lean 4 / Lake (via elan), plus Python ≥ 3.10:

git clone https://github.com/phunterlau/code2lean.git
cd code2lean
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
lake build

Verify the handwritten Lean baseline

lake build

Type-checks RepoVerify/TokenVerify.lean (the original HMAC demo's formal model: vulnerable + fixed equality, both functional theorems, the leak counterexample, and the constant-time cost theorem).

Run the pipeline on an example

Set one of the LLM API keys, then run:

export OPENAI_API_KEY=...        # for --provider openai
export GEMINI_API_KEY=...        # for --provider gemini
export ANTHROPIC_API_KEY=...     # for --provider claude

python -m verify.pipeline \
  --example examples/01_insecure_compare \
  --provider gemini \
  --critic openai \
  --max-repair 3

The pipeline:

AST-extracts the named function from source.py.
Asks the proposer to write a Lean 4 file (function + theorem + proof).
Compiles with lake env lean. Repairs on failure.
Validates with #print axioms (allowlist: propext, Classical.choice, Quot.sound).
Differential-tests Lean #eval outputs against the Python function on every fixture case.
Optionally has a second LLM critic verdict the theorem (PASS / WEAK / FAIL).

Exit code:

0 — all gates passed
1 — sanitizer / Lean / axiom / diff-test failure
2 — critic returned WEAK / FAIL / PARSE_ERROR

Run the original attack demo

python -m pytest                              # functional + recovery test
python source/attack_demo.py                  # deterministic timing-leak attack

How it works (one diagram)

[source.py] ─► AST extract ─► proposer (LLM) ─► Lean file
                                                     │
                                                     ▼
                                           Gate A: sanitizer
                                           Gate B: lake env lean ──repair loop──► proposer
                                           Gate C: axiom allowlist
                                           Gate D: differential test (Python ↔ Lean)
                                           Gate E: critic (second LLM)
                                                     │
                                                     ▼
                                              PASS / fail

A–D are mechanical gates, no LLM judgment in the verification path. E is the only LLM-judgment step, and it exists to catch vacuous theorems (result ≤ 8 for bit_count8 is true even of a constant-zero implementation; Lean accepts it; the critic catches it). See docs/architecture.md for the why and docs/pipeline.md for the how.

Benchmarks at a glance

Same 10 hard examples, three proposer/critic pairings:

Proposer	Critic	Lean acceptance	Theorem PASS	Wall-clock
Gemini 3.1 Pro	GPT-5.5	10/10	0/10	~38.5 min
Claude Opus 4.7	GPT-5.5	8/10	2/10	~8.5 min
GPT-5.5	Claude Opus 4.7	9/10	6/10	~30.3 min

Lean acceptance (did the proof close?) and theorem strength (did the critic approve?) are independent signals. Gemini closes the most proofs; GPT writes the strongest theorems; Claude is the fastest sweep. Four examples were WEAK across all three proposers — the bottleneck there is the prompt, not the model. Full analysis in docs/benchmarks.md.

What gets verified — scope

code2lean works on any source language with an AST library and a type system that maps to Lean. The current adapter is Python; the architecture is language-agnostic.

Within Python today, the pipeline accepts:

Single function at module scope (no methods)
Pure / total / terminating
Types: bool, int (signed or unsigned), bytes, str, list[T], tuple[A, B]

It does not handle classes, decorators, generators, async, floats, dicts, sets, numpy / external libraries, file or network I/O, exceptions, randomness, or functions that depend on module-level helpers.

Full details and what extending to Rust / TypeScript / Go would look like: docs/scope-and-limits.md.

Examples

examples/ has 80 small Python functions (byte/list manipulation, arithmetic, Boolean predicates, crypto-flavored primitives) each paired with a fixture for the differential test. Past run artifacts (last_lean_<provider>.lean, last_proposer_<provider>.txt, last_critic_<critic>.txt, pipeline_*.log) are checked in as samples. See examples/README.md for the full list and --example invocation.

Files

verify/
  pipeline.py                 # orchestrator — gates A–D + repair loop
  extract.py                  # AST extraction → FunctionSpec
  llm.py                      # OpenAI / Gemini / Claude provider abstraction
  critic.py                   # gate E — Phase-1 critic (PASS/WEAK/FAIL)
  gpt_to_lean_security.py     # legacy single-file GPT→Lean demo (the seed)
RepoVerify/
  TokenVerify.lean            # handwritten Lean baseline (the security demo)
  Autogen.lean                # pipeline target — overwritten on each run
  AutogenFromGPT.lean         # legacy target for gpt_to_lean_security.py
source/                       # original HMAC demo Python
tests/                        # pytest for the original demo
examples/                     # 80 functions × fixtures + run artifacts
docs/                         # design notes, walkthrough, benchmarks, roadmap

Docs

File	Read it for
docs/architecture.md	Design thoughts: trust model, proposer/verifier, the five gates, anti-spoofing
docs/pipeline.md	End-to-end pipeline reference
docs/example-walkthrough.md	Single example (`34_parity_xor`) traced through every gate
docs/benchmarks.md	Three-way Gemini/Claude/GPT proposer comparison + caveats
docs/scope-and-limits.md	What can be verified today; extending to other languages
docs/roadmap.md	Mutation-kill, pass@k, distinguishability, vulnerability finding

Blog

Can an LLM Formally Verify Your Code? A Feasibility Study — motivation, a full insecure_compare walkthrough, benchmark analysis, and what comes next.

License

MIT. See LICENSE.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

code2lean

Why this exists

Quick start

Install

Verify the handwritten Lean baseline

Run the pipeline on an example

Run the original attack demo

How it works (one diagram)

Benchmarks at a glance

What gets verified — scope

Examples

Files

Docs

Blog

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
RepoVerify		RepoVerify
docs		docs
examples		examples
source		source
tests		tests
verify		verify
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
README.zh-Hans.md		README.zh-Hans.md
RepoVerify.lean		RepoVerify.lean
lake-manifest.json		lake-manifest.json
lakefile.lean		lakefile.lean
lean-toolchain		lean-toolchain
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

code2lean

Why this exists

Quick start

Install

Verify the handwritten Lean baseline

Run the pipeline on an example

Run the original attack demo

How it works (one diagram)

Benchmarks at a glance

What gets verified — scope

Examples

Files

Docs

Blog

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages