formal: LLM-driven property checker for code, backed by Lean 4 and Mathlib


An LLM-driven property checker for code, backed by Lean 4 as a proof engine.

Point it at a file, and it will: identify pure functions, generate properties those functions should satisfy, translate them into Lean 4 theorems, and attempt to prove them with Mathlib. Properties that Lean accepts are mechanically verified — but only within the limits described below.

Works with any LLM backend — Claude, GPT-4, Gemini, Llama, Mistral, or any OpenAI-compatible endpoint.

What this actually is

The pipeline has three LLM steps before Lean ever runs:

  1. Decomposition — the LLM reads your code and identifies which parts are pure functions
  2. Property extraction — the LLM generates properties it believes those functions satisfy, along with explicit preconditions and modeling assumptions
  3. Formalization — the LLM translates each property into a Lean 4 theorem

Only then does Lean check the proof. Lean is mechanically sound — it cannot be fooled — but it only checks what it is given. If the LLM misunderstood your function, or generated a property that is technically true but misses the point, Lean will happily prove the wrong thing.

What "verified" means here: Lean accepted a proof of a theorem that the LLM derived from your code. This is a meaningful signal — LLMs make logical errors and Lean catches them — but it is not equivalent to a certified compiler or a formal proof that your source code is correct.

What this is useful for:

  • Catching logical errors in AI-generated code that tests might miss
  • Surfacing the assumptions an LLM makes about your code (now shown explicitly)
  • Increasing confidence in pure domain logic: calculations, transformations, validations
  • Getting a structured, machine-checked view of what properties hold under stated assumptions

What this does not give you:

  • A guarantee that your source code is correct — only that a Lean model of it satisfies generated properties
  • Complete coverage — the LLM picks which properties to check and may miss important ones
  • Traditional formal verification — that requires a certified translation from source to proof, which this does not have

How it works

Your code (any language)
  → LLM: extract pure functions, identify side effects
  → LLM: generate properties with explicit preconditions and assumptions
  → LLM: translate each property into a Lean 4 theorem + proof
  → Lean 4 + Mathlib: accept or reject each proof (with retries)
  → Results: verified / failed / unverifiable

Side effects (DB calls, HTTP, I/O) are excluded — only pure, deterministic logic is checked. Properties that depend on reference equality, reflection, or runtime behaviour are classified as unverifiable (not a bug, not a failure — a modeling limit).
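
As an illustration of that split (hypothetical code, function names made up, not tool output):

# Pure: deterministic, no side effects; eligible for property extraction.
def apply_discount(price: float, discount: float) -> float:
    return price * (1 - discount)

# Impure: DB read/write and email send; excluded from verification,
# though the pure core it calls (apply_discount) can still be checked.
def checkout(order_id: str, db, mailer) -> None:
    order = db.load(order_id)
    total = apply_discount(order.price, order.discount)
    db.save_total(order_id, total)
    mailer.send_receipt(order_id, total)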

Every property now carries explicit preconditions (what must hold on inputs) and assumptions (modeling choices made during translation). These are shown in the output so you can judge whether the LLM's interpretation matches your intent.

Setup

1. Configure your LLM provider

Two backends:

  1. Claude Code CLI (uses your local claude binary and Pro/Max plan, no API key needed)
  2. OpenAI-compatible API (any provider with an OpenAI-compatible endpoint):

| Provider | Base URL |
| --- | --- |
| OpenAI | https://api.openai.com/v1 |
| Anthropic | https://api.anthropic.com/v1 |
| Groq | https://api.groq.com/openai/v1 |
| Ollama (local) | http://localhost:11434/v1 |
| LM Studio | http://localhost:1234/v1 |
| Any other | Any OpenAI-compatible endpoint |

setup.sh writes the right COMPOSE_FILE to .env automatically. Available models are fetched from /v1/models; if the provider does not support that endpoint, enter the model name manually.
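
After setup, a .env for the OpenAI-compatible backend pointed at a local Ollama instance might look like this (hand-written example; the model name is illustrative):

# written by setup.sh; values shown are illustrative
LLM_BACKEND=openai
LLM_BASE_URL=http://localhost:11434/v1
# leave empty for local models
LLM_API_KEY=
# any model name the endpoint accepts
LLM_MODEL=llama3.1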

2. Start

docker compose up

Pulls the prebuilt image from GHCR — no local build needed. To build from source:

docker compose up --build

Then confirm the service is healthy:

./formal status
# {"status":"ok"}

3. Add this to your project's AI agent instructions

For Claude Code, add to your CLAUDE.md. For Cursor, Copilot, or other agents, add to the equivalent instructions file.

## Formal Verification

A property checker runs at http://localhost:1337. After implementing any feature
that contains non-trivial pure logic (calculations, transformations, validations,
business rules), verify it:

```sh
~/dev/formal/formal verify /absolute/path/to/file.java
```

**When to verify:** after writing or modifying pure domain logic — pricing
calculations, volume computations, data transformations, validation functions.

**When to skip:** pure I/O code, controller wiring, configuration, tests.

Results:
- `full` — all properties proved under stated assumptions
- `partial` / `failed` — investigate unverified properties; may indicate a logic bug
- `unverifiable` — modeling limitation (reference equality, reflection, etc.), not a bug

Review the preconditions and assumptions in the output. If they do not match your
intent, the proof result may not reflect real behaviour.

CLI

The formal script is the primary interface. All commands talk to localhost:1337.

# Watch live logs — see generated Lean code, proof attempts, retries in real time
./formal watch

# Verify a file (language auto-detected from extension)
./formal verify path/to/Feature.java

# Verify inline code
./formal verify --code 'def f(x): return max(0, x)' --lang Python

# Full JSON response
./formal verify path/to/Feature.java --full

# Health check
./formal status

Watch output

formal watch streams structured logs from the container:

[PIPELINE] Decomposing feature [Java]: SomeService.java
[PIPELINE] Pure functions: ['computePrice', 'applyDiscount']
[PIPELINE] Extracted 5 properties
[SCREEN  ] ✓ prop_1: VERIFIABLE — discount is always between 0 and 1
[SCREEN  ] ~ prop_3: UNVERIFIABLE — depends on reference equality
[CACHE   ] prop_1 [bound] cache hit — skipping proof
[VERIFY  ] prop_2 [monotonicity] Formalizing: higher discount yields lower price
[LEAN    ] prop_2 theorem: import Mathlib ...
[VERIFY  ] prop_2 generating proof (attempt 1/3)...
[FAIL    ] prop_2 ✗ attempt 1 failed: type mismatch
[VERIFY  ] prop_2 generating proof (attempt 2/3)...
[OK      ] prop_2 ✓ verified
[PIPELINE] Done — verified: 4, failed: 0, unverifiable: 1

Verify output

─────────────────────────────────────────
File:    path/to/Feature.java
Summary: Applies discount and computes final price
Score:   full  (4/5 verified, 1 unverifiable)
─────────────────────────────────────────
  ✓ [bound] discount is always between 0 and 1
      Preconditions: discount is a float
      Assumptions:   floats modeled as rationals, no NaN or Inf
  ✓ [monotonicity] higher discount yields lower price
      Preconditions: price > 0, 0 <= discount <= 1
      Assumptions:   floats modeled as rationals
  ~ [invariant] bundleId reference matches stored entity
      → depends on JVM reference equality, not structural equality
─────────────────────────────────────────

Preconditions and assumptions are shown for every property so you can verify that the LLM's interpretation of your code matches what you intended.

API reference

POST /verify-feature

# From a file (language auto-detected)
curl -X POST http://localhost:1337/verify-feature \
  -H 'Content-Type: application/json' \
  -d '{"file": "/absolute/path/to/Feature.java"}'

# Inline code
curl -X POST http://localhost:1337/verify-feature \
  -H 'Content-Type: application/json' \
  -d '{"code": "...", "language": "TypeScript"}'

Supported languages: Python, Java, Kotlin, TypeScript, JavaScript, Go, Rust, C#, C++, Ruby.

Response:

{
  "overall_score": "full | partial | failed | no_pure_logic",
  "properties_found": 5,
  "properties_verified": 4,
  "properties_unverifiable": 1,
  "pure_functions": ["computePrice", "applyDiscount"],
  "impure_parts": ["saves to DB", "sends email"],
  "results": [
    {
      "property_id": "prop_1",
      "description": "discount is always between 0 and 1",
      "kind": "bound",
      "status": "verified",
      "verified": true,
      "preconditions": ["discount is a float"],
      "assumptions": ["floats modeled as rationals", "no NaN or Inf"],
      "lean_code": "...",
      "lean_output": "...",
      "retries": 0,
      "reason": ""
    }
  ]
}
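
A minimal client sketch built only on the documented endpoint and the JSON shape above (Python standard library; the file path is a placeholder):

import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:1337/verify-feature",
    data=json.dumps({"file": "/absolute/path/to/Feature.java"}).encode(),
    headers={"Content-Type": "application/json"},
)
# proofs can take a while; use a generous client timeout
with urllib.request.urlopen(req, timeout=600) as resp:
    report = json.load(resp)

print(report["overall_score"])  # full | partial | failed | no_pure_logic
for r in report["results"]:
    if not r["verified"]:
        # surface anything that failed or could not be modelled
        print(r["property_id"], r["status"], r["reason"])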

Scores (computed over verifiable properties only, unverifiable excluded):

| Score | Meaning |
| --- | --- |
| full | All verifiable properties proved |
| partial | ≥50% of verifiable properties proved |
| failed | <50% of verifiable properties proved |
| no_pure_logic | No pure functions found |
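
The same rule as a sketch (illustrative, not the actual implementation; the empty edge case is an assumption):

def overall_score(results: list[dict]) -> str:
    # unverifiable properties are excluded from scoring by design
    verifiable = [r for r in results if r["status"] != "unverifiable"]
    if not verifiable:
        return "no_pure_logic"  # assumption about this edge case
    proved = sum(1 for r in verifiable if r["verified"])
    if proved == len(verifiable):
        return "full"
    return "partial" if proved >= len(verifiable) / 2 else "failed"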

Property status:

| Status | Meaning |
| --- | --- |
| verified | Lean 4 accepted the proof under stated preconditions and assumptions |
| failed | Proof could not be found — may indicate a logic bug or a bad translation |
| unverifiable | Property cannot be modelled in Lean 4 (not a bug) |

POST /verify

Generates Python code for a natural language task and verifies it end-to-end.

curl -X POST http://localhost:1337/verify \
  -H 'Content-Type: application/json' \
  -d '{"task": "a function that computes compound interest"}'

GET /health

curl http://localhost:1337/health

Configuration

Set in .env (created by setup.sh), overridable via environment variables:

| Variable | Description |
| --- | --- |
| LLM_BACKEND | claude-cli or openai (set by setup.sh) |
| HOST_CLAUDE_CONFIG_DIR | Host path to the Claude config directory to mount into the container (default: ~/.claude). Use this to select a different account, e.g. ~/.claude-work. |
| LLM_BASE_URL | Base URL of any OpenAI-compatible endpoint |
| LLM_API_KEY | API key (leave empty for local models) |
| LLM_MODEL | Model name as accepted by the provider |
| MAX_PROOF_RETRIES | Retry attempts per property on Lean errors (default: 3) |
| MAX_PARALLEL_PROPERTIES | Concurrent property verifications (default: 4) |
| LEAN_TIMEOUT | Seconds before a Lean check times out (default: 120) |
| PROOF_CACHE_DIR | Directory for cached proof results (default: results/cache/) |
| PROOF_CACHE_TTL_DAYS | Cache entries older than this are deleted on the next save (default: 7) |
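
A hypothetical one-off override for a single run (this assumes the compose file forwards these variables into the container; check docker-compose.yml if in doubt):

# raise the timeout and retry budget for this run only
LEAN_TIMEOUT=300 MAX_PROOF_RETRIES=5 docker compose up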

Development

Linting and formatting

Ruff handles both linting and formatting. Config is in pyproject.toml.

ruff check .          # lint
ruff check --fix .    # lint + auto-fix
ruff format .         # format

Enabled rule sets: E (pycodestyle), F (pyflakes), I (isort), UP (pyupgrade). Line length: 120.

Proof cache

Successfully verified proofs are cached to disk. The same property on the same function is never re-proved — on subsequent runs a [CACHE] hit is logged and the stored result is returned immediately.

The cache key is a SHA-256 hash of the function source code, property description, kind, formal spec, preconditions, and assumptions. Any change to these inputs produces a new key; the old entry becomes an orphan and will be cleaned up by the TTL eviction.
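
A sketch of that derivation (field names are illustrative; the real schema may differ):

import hashlib
import json

def cache_key(fn_source: str, prop: dict) -> str:
    # Hash the documented inputs: function source, property description,
    # kind, formal spec, preconditions, and assumptions.
    payload = json.dumps(
        {
            "source": fn_source,
            "description": prop["description"],
            "kind": prop["kind"],
            "spec": prop["spec"],
            "preconditions": prop["preconditions"],
            "assumptions": prop["assumptions"],
        },
        sort_keys=True,  # stable serialization, stable key
    )
    return hashlib.sha256(payload.encode()).hexdigest()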

Only successful proofs are cached; failed attempts always go through the full LLM + retry loop.

Cache files are written to results/cache/ (one JSON file per entry). Entries older than PROOF_CACHE_TTL_DAYS (default: 7) are deleted automatically on the next save. Set to 0 to disable the TTL. Override the directory with PROOF_CACHE_DIR.

Limitations

  • LLM-driven semantics. Property extraction, formalization, and proof generation all go through an LLM. The LLM can misread code, miss properties, or generate theorems that are true but irrelevant. Lean only checks what it is given.
  • Preconditions and assumptions may be wrong. Review them in the output. A proof built on a wrong assumption is not evidence that your code is correct.
  • Pure logic only. Side effects (DB, HTTP, I/O) are excluded by design.
  • Modeling limits. Floats are modelled as rationals; strings use structural equality. IEEE 754 precision and reference semantics cannot be modelled (see the Lean sketch after this list).
  • Not a test replacement. This checks properties for all inputs under stated assumptions; it does not replace integration or end-to-end tests.
  • Lean timeout. Complex proofs may time out — increase LEAN_TIMEOUT if needed.
  • Building from source is slow. Installing Lean 4 + Mathlib oleans takes several minutes locally. The prebuilt GHCR image avoids this.
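
To make the float limitation concrete: a proof over ℚ says nothing about IEEE 754 rounding. The identity below holds exactly in ℚ, while the corresponding double computation 0.1 + 0.2 == 0.3 is false:

import Mathlib

-- Exact in ℚ; false for IEEE 754 doubles due to rounding.
example : (1/10 : ℚ) + 2/10 = 3/10 := by norm_num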