GitHub - ttanv/levi: Better LLM Optimization for the Price of a Cup of Coffee

AlphaEvolve Performance for a Fraction of the Cost

LEVI is an LLM-guided evolutionary framework for code and prompts. Point it at a scoring function and a budget — LEVI evolves the artifact for you, using API models, a local server, or your Claude Code / Codex CLI subscription.

$4.50 improves on what other frameworks need $15-30 and frontier models to achieve across a variety of problems, at a fraction of the cost.

Why LEVI

Existing frameworks couple performance tightly to model capability. Drop to a smaller model and results degrade sharply. LEVI decouples the two by making diversity an architectural concern rather than a model concern, and by matching model capacity to task demand.

Cheap models handle the bulk of mutation work. A behavioral archive keeps structurally different strategies alive, preventing premature convergence. Periodic paradigm shifts from a stronger model inject genuinely new ideas. The result: you spend less and get more.

LEVI on code optimization (left, transaction scheduling — exceeds every baseline's final score within ~50 evaluations) and prompt optimization (right, HotpotQA with Qwen3-8B — outperforms GEPA at less than half the rollouts).

Quickstart

# Install uv first: https://docs.astral.sh/uv/getting-started/installation/
git clone https://github.com/ttanv/levi.git
cd levi
uv sync

Pick whichever path matches what you have access to — each is a single self-contained file under examples/quickstart/, runs in a couple of minutes, costs a few cents (or nothing on the CLI path):

You have…	Run	Evolves
an API key (OpenAI / Anthropic / …)	`uv run python examples/quickstart/quickstart_api.py`	code
a Claude Code / Codex CLI subscription	`uv run python examples/quickstart/quickstart_claude.py`	code
an API key, and you want to tune prompts	`uv run python examples/quickstart/quickstart_prompts.py`	prompts

Set OPENAI_API_KEY (or change MODEL at the top of the file to another litellm provider and set the matching key) before running the API quickstarts.

A minimal LEVI program looks like this:

from collections import Counter
import levi

def score_fn(pack, inputs):
    scores = []
    for items in inputs:
        bins = pack(list(items), 10)
        if any(sum(b) > 10 for b in bins) or Counter(x for b in bins for x in b) != Counter(items):
            return {"score": 0.0}
        scores.append(100.0 * sum(sum(b) for b in bins) / (len(bins) * 10))
    return {"score": sum(scores) / len(scores)}

if __name__ == "__main__":
    result = levi.evolve_code(
        "Pack items into bins of capacity 10, minimizing wasted space.",
        function_signature="def pack(items: list[int], capacity: int) -> list[list[int]]:",
        score_fn=score_fn,
        inputs=[[4, 8, 1, 4, 2, 1], [9, 2, 3, 7, 8, 1, 4]],
        model="openai/gpt-4o-mini",
        budget_dollars=0.10,
    )
    print(result.best_score, result.best_program)

The if __name__ == "__main__": guard matters — LEVI runs evaluations in subprocesses (spawn on macOS / Windows).

Going further

examples/quickstart/ — the three single-file starters above.
examples/circle_packing/ — n=26 circle packing benchmark; the simplest non-toy problem.
examples/ADRS/ — seven ADRS Leaderboard problems used in the paper. Most use a cheap proposer model via OpenRouter (or a local Qwen server) plus stronger paradigm-shift calls. See examples/ADRS/README.md for setup.
examples/hotpotqa/, examples/hover/, examples/pupa/, examples/ifbench/ — prompt-evolution benchmarks comparing against GEPA.

Results

LEVI holds the highest average score (76.5) across all seven ADRS Leaderboard problems, ahead of GEPA (71.9), OpenEvolve (70.6), and ShinkaEvolve (67.4). Six of the seven problems were solved on a $4.50 budget.

Problem	LEVI	Best Other Framework	Saving
Spot Single-Reg	51.7	GEPA 51.4	6.7x cheaper
Spot Multi-Reg	72.4	OpenEvolve 66.7	5.6x cheaper
LLM-SQL	78.3	OpenEvolve 72.5	4.4x cheaper
Cloudcast	100.0	GEPA 96.6	3.3x cheaper
Prism	87.4	GEPA / OpenEvolve / ShinkaEvolve 87.4	3.3x cheaper
EPLB	74.6	GEPA 70.2	3.3x cheaper
Txn Scheduling	71.1	OpenEvolve 70.0	1.5x cheaper

LEVI scored 2.6359+ packing density on the n=26 circle packing benchmark, with a local model handling the majority of mutations. See examples/circle_packing for the full setup.

For advanced routing, pass a levi.LM(...) directly:

local_qwen = levi.LM(
    "Qwen/Qwen3-30B-A3B-Instruct-2507",
    api_base="http://localhost:8000/v1",
    api_key="unused",
    input_cost_per_token=0.0000001,
    output_cost_per_token=0.0000004,
)

How It Works

Seed & score. You provide a starting program and a scoring function. LEVI generates diverse variants to populate a behavioral archive.
Evolve. Cheap models mutate and refine solutions in parallel. The behavioral archive keeps structurally different strategies alive, preventing convergence.
Paradigm shifts. Periodically, a stronger model proposes entirely new algorithmic approaches based on the archive's best ideas.
Budget stops. LEVI tracks spend in real time and stops when your dollar, evaluation, or time cap is hit.

Citation

If you use LEVI in your research, please cite:

@software{tanveer2026levi,
  title  = {LEVI: LLM-Guided Evolutionary Search Needs Better Harnesses, Not Bigger Models},
  author = {Tanveer, Temoor},
  url    = {https://github.com/ttanv/levi},
  year   = {2026}
}