An LLM-driven property checker for code, backed by Lean 4 as a proof engine.
Point it at a file, and it will: identify pure functions, generate properties those functions should satisfy, translate them into Lean 4 theorems, and attempt to prove them with Mathlib. Properties that Lean accepts are mechanically verified — but only within the limits described below.
Works with any LLM backend — Claude, GPT-4, Gemini, Llama, Mistral, or any OpenAI-compatible endpoint.
## What this actually is
The pipeline has three LLM steps before Lean ever runs:
- Decomposition — the LLM reads your code and identifies which parts are pure functions
- Property extraction — the LLM generates properties it believes those functions satisfy, along with explicit preconditions and modeling assumptions
- Formalization — the LLM translates each property into a Lean 4 theorem
Only then does Lean check the proof. Lean is mechanically sound — it cannot be fooled — but it only checks what it is given. If the LLM misunderstood your function, or generated a property that is technically true but misses the point, Lean will happily prove the wrong thing.
What "verified" means here: Lean accepted a proof of a theorem that the LLM derived from your code. This is a meaningful signal — LLMs make logical errors and Lean catches them — but it is not equivalent to a certified compiler or a formal proof that your source code is correct.
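For illustration (hypothetical function and property; actual generated theorems vary), a "verified" result means Lean accepted something like:

```lean
import Mathlib

-- Hypothetical model of a discount normalizer, with floats modeled as rationals.
def clampDiscount (d : ℚ) : ℚ := max 0 (min 1 d)

-- Generated property: "discount is always between 0 and 1".
-- Lean certifies this theorem about the model, not your source code.
theorem clampDiscount_bound (d : ℚ) :
    0 ≤ clampDiscount d ∧ clampDiscount d ≤ 1 := by
  unfold clampDiscount
  exact ⟨le_max_left 0 _, max_le (by norm_num) (min_le_left 1 d)⟩
```

If the LLM had modeled your function incorrectly, Lean would still happily prove a theorem like this about the wrong model, which is why the stated assumptions matter.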
What this is useful for:
- Catching logical errors in AI-generated code that tests might miss
- Surfacing the assumptions an LLM makes about your code (now shown explicitly)
- Increasing confidence in pure domain logic: calculations, transformations, validations
- Getting a structured, machine-checked view of what properties hold under stated assumptions
What this does not give you:
- A guarantee that your source code is correct — only that a Lean model of it satisfies generated properties
- Complete coverage — the LLM picks which properties to check and may miss important ones
- Traditional formal verification — that requires a certified translation from source to proof, which this does not have
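For instance (a hypothetical function, not part of the tool), this is the kind of pure calculation worth pointing the checker at:

```python
# Hypothetical example of pure domain logic: deterministic, no I/O,
# with properties worth stating (bounds, identities, monotonicity).
def compound_interest(principal: float, rate: float, periods: int) -> float:
    """Balance after `periods` compounding periods at `rate` per period."""
    if principal < 0 or rate < -1 or periods < 0:
        raise ValueError("invalid inputs")
    return principal * (1 + rate) ** periods
```

An LLM might extract properties such as: with a non-negative rate the result is at least the principal, and with zero periods the result equals the principal.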
## How it works

```
Your code (any language)
  → LLM: extract pure functions, identify side effects
  → LLM: generate properties with explicit preconditions and assumptions
  → LLM: translate each property into a Lean 4 theorem + proof
  → Lean 4 + Mathlib: accept or reject each proof (with retries)
  → Results: verified / failed / unverifiable
```
Side effects (DB calls, HTTP, I/O) are excluded — only pure, deterministic logic is
checked. Properties that depend on reference equality, reflection, or runtime behaviour
are classified as unverifiable (not a bug, not a failure — a modeling limit).
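As a sketch (hypothetical code, Python for illustration), the decomposition step separates checkable logic from excluded side effects:

```python
# Hypothetical service: the pipeline would classify process_order as impure
# (it writes to a database) and extract compute_total as a pure, checkable function.
def compute_total(prices: list[float], discount: float) -> float:
    """Pure: depends only on its arguments, so properties can be checked."""
    clamped = max(0.0, min(1.0, discount))
    return sum(prices) * (1.0 - clamped)

def process_order(db, order) -> None:
    """Impure: touches the database, so it is excluded from checking."""
    total = compute_total(order.prices, order.discount)
    db.save(order.id, total)
```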
Every property now carries explicit preconditions (what must hold on inputs) and assumptions (modeling choices made during translation). These are shown in the output so you can judge whether the LLM's interpretation matches your intent.
## Setup

### 1. Configure your LLM provider
Two backends:
1. **Claude Code CLI** — uses your local `claude` binary and your Pro/Max plan; no API key needed.
2. **OpenAI-compatible API** — any provider with an OpenAI-compatible endpoint:
| Provider | Base URL |
|---|---|
| OpenAI | https://api.openai.com/v1 |
| Anthropic | https://api.anthropic.com/v1 |
| Groq | https://api.groq.com/openai/v1 |
| Ollama (local) | http://localhost:11434/v1 |
| LM Studio | http://localhost:1234/v1 |
| Any other | Any OpenAI-compatible endpoint |
`setup.sh` writes the right `COMPOSE_FILE` to `.env` automatically. Available models are fetched from `/v1/models`; if the provider does not support that endpoint, enter the model name manually.
### 2. Start

```sh
docker compose up
```

This pulls the prebuilt image from GHCR — no local build needed. To build from source instead:

```sh
docker compose up --build
```

Check that the service is running:

```sh
./formal status
# {"status":"ok"}
```

### 3. Add this to your project's AI agent instructions
For Claude Code, add to your CLAUDE.md. For Cursor, Copilot, or other agents, add
to the equivalent instructions file.
## Formal Verification

A property checker runs at http://localhost:1337. After implementing any feature that contains non-trivial pure logic (calculations, transformations, validations, business rules), verify it:

```sh
~/dev/formal/formal verify /absolute/path/to/file.java
```

**When to verify:** after writing or modifying pure domain logic — pricing calculations, volume computations, data transformations, validation functions.

**When to skip:** pure I/O code, controller wiring, configuration, tests.

Results:

- `full` — all properties proved under stated assumptions
- `partial` / `failed` — investigate unverified properties; may indicate a logic bug
- `unverifiable` — modeling limitation (reference equality, reflection, etc.), not a bug

Review the preconditions and assumptions in the output. If they do not match your intent, the proof result may not reflect real behaviour.
## CLI

The `formal` script is the primary interface. All commands talk to `localhost:1337`.
```sh
# Watch live logs — see generated Lean code, proof attempts, retries in real time
./formal watch

# Verify a file (language auto-detected from extension)
./formal verify path/to/Feature.java

# Verify inline code
./formal verify --code 'def f(x): return max(0, x)' --lang Python

# Full JSON response
./formal verify path/to/Feature.java --full

# Health check
./formal status
```
### Watch output

`formal watch` streams structured logs from the container:
```
[PIPELINE] Decomposing feature [Java]: SomeService.java
[PIPELINE] Pure functions: ['computePrice', 'applyDiscount']
[PIPELINE] Extracted 5 properties
[SCREEN  ] ✓ prop_1: VERIFIABLE — discount is always between 0 and 1
[SCREEN  ] ~ prop_3: UNVERIFIABLE — depends on reference equality
[CACHE   ] prop_1 [bound] cache hit — skipping proof
[VERIFY  ] prop_2 [monotonicity] Formalizing: higher discount yields lower price
[LEAN    ] prop_2 theorem: import Mathlib ...
[VERIFY  ] prop_2 generating proof (attempt 1/3)...
[FAIL    ] prop_2 ✗ attempt 1 failed: type mismatch
[VERIFY  ] prop_2 generating proof (attempt 2/3)...
[OK      ] prop_2 ✓ verified
[PIPELINE] Done — verified: 4, failed: 0, unverifiable: 1
```
### Verify output
```
─────────────────────────────────────────
File:    path/to/Feature.java
Summary: Applies discount and computes final price
Score:   full (4/5 verified, 1 unverifiable)
─────────────────────────────────────────
✓ [bound] discount is always between 0 and 1
    Preconditions: discount is a float
    Assumptions: floats modeled as rationals, no NaN or Inf
✓ [monotonicity] higher discount yields lower price
    Preconditions: price > 0, 0 <= discount <= 1
    Assumptions: floats modeled as rationals
~ [invariant] bundleId reference matches stored entity
    → depends on JVM reference equality, not structural equality
─────────────────────────────────────────
```
Preconditions and assumptions are shown for every property so you can verify that the LLM's interpretation of your code matches what you intended.
## API reference

### POST /verify-feature
```sh
# From a file (language auto-detected)
curl -X POST http://localhost:1337/verify-feature \
  -H 'Content-Type: application/json' \
  -d '{"file": "/absolute/path/to/Feature.java"}'

# Inline code
curl -X POST http://localhost:1337/verify-feature \
  -H 'Content-Type: application/json' \
  -d '{"code": "...", "language": "TypeScript"}'
```
Supported languages: Python, Java, Kotlin, TypeScript, JavaScript, Go, Rust, C#, C++, Ruby.
Response:

```json
{
  "overall_score": "full | partial | failed | no_pure_logic",
  "properties_found": 5,
  "properties_verified": 4,
  "properties_unverifiable": 1,
  "pure_functions": ["computePrice", "applyDiscount"],
  "impure_parts": ["saves to DB", "sends email"],
  "results": [
    {
      "property_id": "prop_1",
      "description": "discount is always between 0 and 1",
      "kind": "bound",
      "status": "verified",
      "verified": true,
      "preconditions": ["discount is a float"],
      "assumptions": ["floats modeled as rationals", "no NaN or Inf"],
      "lean_code": "...",
      "lean_output": "...",
      "retries": 0,
      "reason": ""
    }
  ]
}
```

Scores (computed over verifiable properties only, unverifiable excluded):
| Score | Meaning |
|---|---|
| `full` | All verifiable properties proved |
| `partial` | ≥50% of verifiable properties proved |
| `failed` | <50% of verifiable properties proved |
| `no_pure_logic` | No pure functions found |
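Under the rules above, the score computation can be sketched as follows (a reimplementation for illustration, not the tool's actual code; treating an empty verifiable set as `no_pure_logic` is an assumption here):

```python
def overall_score(results: list[dict]) -> str:
    """Compute the overall score from per-property results.

    Unverifiable properties are excluded before the ratio is taken.
    """
    verifiable = [r for r in results if r["status"] != "unverifiable"]
    if not verifiable:  # assumption: no verifiable properties => no pure logic
        return "no_pure_logic"
    proved = sum(1 for r in verifiable if r["status"] == "verified")
    if proved == len(verifiable):
        return "full"
    if proved >= len(verifiable) / 2:
        return "partial"
    return "failed"
```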
Property status:
| Status | Meaning |
|---|---|
| `verified` | Lean 4 accepted the proof under stated preconditions and assumptions |
| `failed` | Proof could not be found — may indicate a logic bug or a bad translation |
| `unverifiable` | Property cannot be modelled in Lean 4 (not a bug) |
### POST /verify
Generates Python code for a natural language task and verifies it end-to-end.
```sh
curl -X POST http://localhost:1337/verify \
  -H 'Content-Type: application/json' \
  -d '{"task": "a function that computes compound interest"}'
```
### GET /health

```sh
curl http://localhost:1337/health
```
## Configuration

Set in `.env` (created by `setup.sh`), overridable via environment variables:
| Variable | Description |
|---|---|
| `LLM_BACKEND` | `claude-cli` or `openai` (set by `setup.sh`) |
| `HOST_CLAUDE_CONFIG_DIR` | Host path to the Claude config directory to mount into the container (default: `~/.claude`). Use this to select a different account, e.g. `~/.claude-work`. |
| `LLM_BASE_URL` | Base URL of any OpenAI-compatible endpoint |
| `LLM_API_KEY` | API key (leave empty for local models) |
| `LLM_MODEL` | Model name as accepted by the provider |
| `MAX_PROOF_RETRIES` | Retry attempts per property on Lean errors (default: 3) |
| `MAX_PARALLEL_PROPERTIES` | Concurrent property verifications (default: 4) |
| `LEAN_TIMEOUT` | Seconds before a Lean check times out (default: 120) |
| `PROOF_CACHE_DIR` | Directory for cached proof results (default: `results/cache/`) |
| `PROOF_CACHE_TTL_DAYS` | Cache entries older than this are deleted on the next save (default: 7) |
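For example, a minimal `.env` for a local Ollama backend might look like this (all values are illustrative, including the model name):

```sh
LLM_BACKEND=openai
LLM_BASE_URL=http://localhost:11434/v1
# Leave the key empty for local models
LLM_API_KEY=
# Hypothetical model name; use one your provider actually lists
LLM_MODEL=llama3.1
MAX_PROOF_RETRIES=3
LEAN_TIMEOUT=120
```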
## Development

### Linting and formatting

Ruff handles both linting and formatting. Config is in `pyproject.toml`.
```sh
ruff check .        # lint
ruff check --fix .  # lint + auto-fix
ruff format .       # format
```
Enabled rule sets: E (pycodestyle), F (pyflakes), I (isort), UP (pyupgrade). Line length: 120.
### Proof cache
Successfully verified proofs are cached to disk. The same property on the same
function is never re-proved — on subsequent runs a [CACHE] hit is logged and
the stored result is returned immediately.
The cache key is a SHA-256 hash of the function source code, property description, kind, formal spec, preconditions, and assumptions. Any change to these inputs produces a new key; the old entry becomes an orphan and will be cleaned up by the TTL eviction.
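A sketch of how such a key can be derived (the field ordering and JSON serialization here are assumptions, not the tool's exact scheme):

```python
import hashlib
import json

def cache_key(source: str, description: str, kind: str, spec: str,
              preconditions: list[str], assumptions: list[str]) -> str:
    """SHA-256 over every input that affects the proof; any change yields a new key."""
    # Serialize all inputs deterministically before hashing.
    payload = json.dumps(
        [source, description, kind, spec, preconditions, assumptions],
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```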
Only successful proofs are cached; failed attempts always go through the full LLM + retry loop.
Cache files are written to `results/cache/` (one JSON file per entry). Entries older than `PROOF_CACHE_TTL_DAYS` (default: 7) are deleted automatically on the next save. Set it to 0 to disable the TTL. Override the directory with `PROOF_CACHE_DIR`.
## Limitations
- **LLM-driven semantics.** Property extraction, formalization, and proof generation all go through an LLM. The LLM can misread code, miss properties, or generate theorems that are true but irrelevant. Lean only checks what it is given.
- **Preconditions and assumptions may be wrong.** Review them in the output. A proof built on a wrong assumption is not evidence that your code is correct.
- **Pure logic only.** Side effects (DB, HTTP, I/O) are excluded by design.
- **Modeling limits.** Floats are modelled as rationals; strings use structural equality. IEEE 754 precision and reference semantics cannot be modelled.
- **Not a test replacement.** This checks properties for all inputs under stated assumptions; it does not replace integration or end-to-end tests.
- **Lean timeout.** Complex proofs may time out — increase `LEAN_TIMEOUT` if needed.
- **Building from source is slow.** Installing Lean 4 + Mathlib oleans takes several minutes locally. The prebuilt GHCR image avoids this.
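The float-to-rational modeling gap is easy to see in Python: a property can hold exactly over the rationals (as in the Lean model) yet fail under IEEE 754 doubles (as your code actually runs).

```python
from fractions import Fraction

# Over the rationals (how the Lean model sees it), addition is exact:
assert Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10)

# Under IEEE 754 doubles (how your code actually runs), it is not:
assert 0.1 + 0.2 != 0.3  # 0.1 + 0.2 == 0.30000000000000004
```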