# Holey

A Python library for program synthesis and symbolic execution that combines SMT (Z3, CVC5, ...) constraint solving with LLM-guided synthesis. Put holes in your Python code and let holey fill them using formal constraints, natural language specifications, or both.
The symbolic execution is inspired by Philip Zucker's blog post "Symbolic Execution by Overloading `__bool__`", but it explores all branches exhaustively instead of randomly, and it fleshes out the concepts towards solving Python Programming Puzzles. In addition to symbolic execution, the solver incorporates heuristics from LLMs.
- Slides from talk at PEPM'26
- Diagram: multi-stage programming with SMT (overloaded execution generates constraints)
- Diagram: overview
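To make the core idea concrete, here is a minimal sketch of the overloaded-`__bool__` technique, with hypothetical names; holey's actual tracer is more elaborate and explores all feasible branches rather than committing to one:

```python
from z3 import Int, Not, Solver, sat as z3_sat

class SymBool:
    """Wraps a Z3 boolean; Python's `if` triggers __bool__, which logs constraints."""
    def __init__(self, cond, path):
        self.cond, self.path = cond, path

    def __bool__(self):
        # Ask the solver whether the condition is feasible under the current
        # path, and record the branch taken as a new path constraint.
        s = Solver()
        for c in self.path + [self.cond]:
            s.add(c)
        taken = s.check() == z3_sat
        self.path.append(self.cond if taken else Not(self.cond))
        return taken

path = []            # accumulated path condition
x = Int('x')
if SymBool(x > 3, path):
    print("explored the x > 3 branch; path:", path)
```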
## Examples
```python
# find a string s such that sat(s) holds
def sat(s: str):
    return s.count('o') == 1000 and s.count('oo') == 0

# SMT solver times out, but try 3 instead of 1000
"oCoDo"

# ask LLM to extrapolate from 3 to 1000
"o" + "Co" * 999
```
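However a candidate is produced, it can be validated cheaply by just running the predicate in plain Python:

```python
def sat(s: str):
    return s.count('o') == 1000 and s.count('oo') == 0

# "o" + "Co" * 999 spells "oCoCo...Co": 1 + 999 = 1000 'o's, never two in a row.
assert sat("o" + "Co" * 999)
```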
Another example:
- satisfiability predicate in Python
- translated to SMTLIB
- full log
See more in `examples/`.
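For intuition about the translation step, here is a minimal standalone sketch of posing a small predicate to an SMT solver via Z3's Python API; this is an illustration, not holey's pipeline, and the predicate `x * x == 49 and x > 0` is made up:

```python
from z3 import And, Int, Solver, sat as z3_sat

x = Int('x')                     # symbolic counterpart of the puzzle variable
s = Solver()
s.add(And(x * x == 49, x > 0))   # sat(x) := x * x == 49 and x > 0
print(s.sexpr())                 # the constraints rendered as SMTLIB
if s.check() == z3_sat:
    print(s.model()[x])          # prints 7
```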
## Setup

### Install dependencies
- `python` with support for `pip` (e.g. `conda`), tested with Python 3.12
- `z3` or `cvc5` or both -- on mac with Homebrew, can install with `brew install z3 cvc5`
### Clone recursively

```
git clone --recursive https://github.com/namin/holey.git
```
### Setup Python environment

```
conda create -n holey python=3.12
conda activate holey
pip install -e ".[test,ollama,anthropic]"
```
### Setup LLM environments

For each LLM provider you want to use, set its API key; some providers (e.g. Ollama) only need a dummy value, but the variable must still be set. Leave a key unset to skip that provider. All provided keys are used in parallel to generate a matrix of successes per LLM provider.

```
export OLLAMA_API_KEY=ollama
export ANTHROPIC_API_KEY=...
export GEMINI_API_KEY=...
export OPENAI_API_KEY=...
```
## Run

### Help reference

```
python puzzle_solver.py --help
```

### Sanity check

```
python puzzle_solver.py --name-prefix HelloWorld:0
```

### Run all puzzles, saving stdout/stderr to `results.txt`

```
python puzzle_solver.py >results.txt 2>&1
```

### Fallback to LLMs

```
python puzzle_solver.py --name-prefix ListIn:1 --llm
```
## Current status

The symbolic execution alone currently solves:

- 69% (247 out of 360) of `int` puzzles,
- 45% (165 out of 363) of `str` puzzles,
- 24% (142 out of 591) of `List[int]` puzzles,
- 90% (46 out of 51) of `float` puzzles,
- 27% (38 out of 141) of `List[str]` puzzles,
- 42% (638 out of 1506) overall,
with the following errors:

- 51 timeouts after 3 seconds at staging time (while generating the SMTLIB program),
- 292 errors at staging time,
- 41 SMTLIB programs returning `sat` but with the original `sat` function failing on the synthesized model input,
- 493 SMTLIB programs returning non-`sat`:
  - 75 `unsat`,
  - 165 `unknown`,
  - 253 timeouts (after 2 seconds),
- 0 `sat` but failing to extract the solution variable.
In addition, 209 (out of 1715) puzzles are not yet even attempted because their type is not `int` or `str`, such as `float`, `list` (of various specializations), etc.
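Roughly, the error categories above correspond to stages of the solving pipeline. Here is a hedged sketch with hypothetical helper names, not holey's actual API:

```python
def solve_puzzle(sat_fn, stage, run_solver, extract,
                 staging_timeout=3, solver_timeout=2):
    # Stage 1: symbolically execute sat_fn to produce an SMTLIB program
    # (may time out or error at staging time).
    smtlib = stage(sat_fn, timeout=staging_timeout)
    # Stage 2: run the SMT solver (may report unsat, unknown, or time out).
    verdict, model = run_solver(smtlib, timeout=solver_timeout)
    if verdict != "sat":
        return None
    # Stage 3: extract the solution variable from the model (may fail).
    candidate = extract(model)
    # Stage 4: guard against unsound translation by re-running the original sat.
    return candidate if sat_fn(candidate) else None
```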
## Extrapolation

- 194 smaller problems tried
- 59 successes on smaller problems
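The strategy, sketched below with hypothetical helpers (not holey's API): shrink a hard constant, solve the smaller instance symbolically, ask an LLM to generalize the small answer, and verify the result concretely.

```python
def solve_by_extrapolation(sat_fn, shrink, smt_solve, llm_generalize):
    # e.g. for the intro example, shrink 1000 down to 3
    small_sat, small_n, big_n = shrink(sat_fn)
    small_answer = smt_solve(small_sat)               # e.g. "oCoDo"
    if small_answer is None:
        return None
    # ask the LLM to extrapolate the pattern from small_n to big_n
    candidate = llm_generalize(small_answer, small_n, big_n)  # e.g. "o" + "Co" * 999
    return candidate if sat_fn(candidate) else None   # always verify concretely
```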
### Earlier extrapolation results

- 123 smaller problems tried
- 11 successes on smaller problems
- 9 successful extrapolations
### Extrapolated puzzles

Study_1:0 PandigitalSquare:0 CircularShiftNum:2 WeirdDecodeVowels:0 TripleDouble:0 MaxDelta:0 MinConsecutiveSum:2 MaxConsecutiveSum:0 BirthdayParadox:0 BirthdayParadox:1 Tutorial5:0

### Successfully extrapolated puzzles

Study_1:0 PandigitalSquare:0 WeirdDecodeVowels:0 TripleDouble:0 MaxDelta:0 MinConsecutiveSum:2 MaxConsecutiveSum:0 BirthdayParadox:0 Tutorial5:0
### Matrix

Each row gives a provider and strategy, the total number of successes, and a per-puzzle success flag (1 for success, 0 for failure) for each of the extrapolated puzzles listed above, in the same order.

| provider (strategy) | total | Study_1:0 | PandigitalSquare:0 | CircularShiftNum:2 | WeirdDecodeVowels:0 | TripleDouble:0 | MaxDelta:0 | MinConsecutiveSum:2 | MaxConsecutiveSum:0 | BirthdayParadox:0 | BirthdayParadox:1 | Tutorial5:0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| claude (extrapolate) | 8 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 |
| claude (end-to-end) | 5 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| claude (SMTLIB) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| gemini (extrapolate) | 5 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| gemini (end-to-end) | 5 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 |
| gemini (SMTLIB) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ollama (extrapolate) | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| ollama (end-to-end) | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| ollama (SMTLIB) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
## Source map

```
.
├── README.md
├── benchmarks
│   └── PythonProgrammingPuzzles   benchmark added as git submodule
├── holey
│   ├── __init__.py
│   ├── backend.py                 backend to SMTLIB batch processes
│   ├── core.py                    includes tracer, symbolic classes, ...
│   ├── llm.py                     support for LLM generation and code extraction
│   └── preprocessor.py            includes node transformer and sat driver
├── log
│   └── results.txt                example run
├── puzzle_solver.py               main routine for benchmark solver
├── pyproject.toml
└── tests
    └── test_core.py               run with python -m pytest; basic and LLM-generated tests
```
## Contribute!

I need help fleshing out the symbolic executor completely, as well as designing and implementing LLM-based heuristics to complement it. See the contributing guidelines, in particular the discussion of a benchmark-driven workflow for finding and fixing issues.