AlphaEvolve Performance for a Fraction of the Cost
LEVI is an LLM-guided evolutionary framework for code and prompts. Point it at a scoring function and a budget — LEVI evolves the artifact for you, using API models, a local server, or your Claude Code / Codex CLI subscription.
$4.50 improves on what other frameworks need $15-30 and frontier models to achieve across a variety of problems, at a fraction of the cost.
Why LEVI
Existing frameworks couple performance tightly to model capability. Drop to a smaller model and results degrade sharply. LEVI decouples the two by making diversity an architectural concern rather than a model concern, and by matching model capacity to task demand.
Cheap models handle the bulk of mutation work. A behavioral archive keeps structurally different strategies alive, preventing premature convergence. Periodic paradigm shifts from a stronger model inject genuinely new ideas. The result: you spend less and get more.
LEVI on code optimization (left, transaction scheduling — exceeds every baseline's final score within ~50 evaluations) and prompt optimization (right, HotpotQA with Qwen3-8B — outperforms GEPA at less than half the rollouts).
Quickstart
# Install uv first: https://docs.astral.sh/uv/getting-started/installation/ git clone https://github.com/ttanv/levi.git cd levi uv sync
Pick whichever path matches what you have access to — each is a single self-contained file under examples/quickstart/, runs in a couple of minutes, costs a few cents (or nothing on the CLI path):
| You have… | Run | Evolves |
|---|---|---|
| an API key (OpenAI / Anthropic / …) | uv run python examples/quickstart/quickstart_api.py |
code |
| a Claude Code / Codex CLI subscription | uv run python examples/quickstart/quickstart_claude.py |
code |
| an API key, and you want to tune prompts | uv run python examples/quickstart/quickstart_prompts.py |
prompts |
Set OPENAI_API_KEY (or change MODEL at the top of the file to another litellm provider and set the matching key) before running the API quickstarts.
A minimal LEVI program looks like this:
from collections import Counter import levi def score_fn(pack, inputs): scores = [] for items in inputs: bins = pack(list(items), 10) if any(sum(b) > 10 for b in bins) or Counter(x for b in bins for x in b) != Counter(items): return {"score": 0.0} scores.append(100.0 * sum(sum(b) for b in bins) / (len(bins) * 10)) return {"score": sum(scores) / len(scores)} if __name__ == "__main__": result = levi.evolve_code( "Pack items into bins of capacity 10, minimizing wasted space.", function_signature="def pack(items: list[int], capacity: int) -> list[list[int]]:", score_fn=score_fn, inputs=[[4, 8, 1, 4, 2, 1], [9, 2, 3, 7, 8, 1, 4]], model="openai/gpt-4o-mini", budget_dollars=0.10, ) print(result.best_score, result.best_program)
The if __name__ == "__main__": guard matters — LEVI runs evaluations in subprocesses (spawn on macOS / Windows).
Going further
examples/quickstart/— the three single-file starters above.examples/circle_packing/— n=26 circle packing benchmark; the simplest non-toy problem.examples/ADRS/— seven ADRS Leaderboard problems used in the paper. Most use a cheap proposer model via OpenRouter (or a local Qwen server) plus stronger paradigm-shift calls. Seeexamples/ADRS/README.mdfor setup.examples/hotpotqa/,examples/hover/,examples/pupa/,examples/ifbench/— prompt-evolution benchmarks comparing against GEPA.
Results
LEVI holds the highest average score (76.5) across all seven ADRS Leaderboard problems, ahead of GEPA (71.9), OpenEvolve (70.6), and ShinkaEvolve (67.4). Six of the seven problems were solved on a $4.50 budget.
| Problem | LEVI | Best Other Framework | Saving |
|---|---|---|---|
| Spot Single-Reg | 51.7 | GEPA 51.4 | 6.7x cheaper |
| Spot Multi-Reg | 72.4 | OpenEvolve 66.7 | 5.6x cheaper |
| LLM-SQL | 78.3 | OpenEvolve 72.5 | 4.4x cheaper |
| Cloudcast | 100.0 | GEPA 96.6 | 3.3x cheaper |
| Prism | 87.4 | GEPA / OpenEvolve / ShinkaEvolve 87.4 | 3.3x cheaper |
| EPLB | 74.6 | GEPA 70.2 | 3.3x cheaper |
| Txn Scheduling | 71.1 | OpenEvolve 70.0 | 1.5x cheaper |
LEVI scored 2.6359+ packing density on the n=26 circle packing benchmark, with a local model handling the majority of mutations. See examples/circle_packing for the full setup.
For advanced routing, pass a levi.LM(...) directly:
local_qwen = levi.LM( "Qwen/Qwen3-30B-A3B-Instruct-2507", api_base="http://localhost:8000/v1", api_key="unused", input_cost_per_token=0.0000001, output_cost_per_token=0.0000004, )
How It Works
- Seed & score. You provide a starting program and a scoring function. LEVI generates diverse variants to populate a behavioral archive.
- Evolve. Cheap models mutate and refine solutions in parallel. The behavioral archive keeps structurally different strategies alive, preventing convergence.
- Paradigm shifts. Periodically, a stronger model proposes entirely new algorithmic approaches based on the archive's best ideas.
- Budget stops. LEVI tracks spend in real time and stops when your dollar, evaluation, or time cap is hit.
Read more in the full writeup.
Further Reading
- Documentation — API reference, configuration, architecture details, and ablations.
- LEVI: LLM-Guided Evolutionary Search Needs Better Harnesses, Not Bigger Models — The full blog post on the ADRS site.
Citation
If you use LEVI in your research, please cite:
@software{tanveer2026levi, title = {LEVI: LLM-Guided Evolutionary Search Needs Better Harnesses, Not Bigger Models}, author = {Tanveer, Temoor}, url = {https://github.com/ttanv/levi}, year = {2026} }
Contact: ttanveer@alumni.cmu.edu


