ferreirafabio/autoresearch-automl: Can LLMs beat classical HPO? A benchmark comparing classical, LLM-based, and hybrid methods on Karpathy's autoresearch.

Can LLMs Beat Classical Hyperparameter Optimization Algorithms?
A Study on autoresearch

Paper: Can LLMs Beat Classical Hyperparameter Optimization Algorithms? A Study on autoresearch

Interactive demo: ferreirafabio.github.io/autoresearch-automl

Classical vs LLM-based HPO

Introduction

autoresearch enables an LLM agent to optimize hyperparameters by editing training code directly. We use it as a testbed to compare classical HPO algorithms against LLM-based methods on tuning a small language model under a fixed compute budget. Within a fixed search space, classical methods such as CMA-ES and TPE consistently outperform LLM-based agents; in this regime, avoiding out-of-memory failures matters more than search diversity. Allowing the LLM to edit the source code directly narrows the gap to the classical methods but does not close it, even with frontier models available at the time of writing such as Claude Opus 4.6 and Gemini 3.1 Pro Preview. We observe that LLMs struggle to track optimization state across trials, whereas classical methods lack the LLMs' domain knowledge. To combine the strengths of both, we introduce Centaur, a hybrid that shares CMA-ES's interpretable internal state with the LLM. Centaur achieves the best result in our experiments, and a 0.8B LLM already suffices to outperform all classical and pure LLM methods, whereas unconstrained code editing requires larger models to be competitive with classical methods. All in all, our results suggest that LLMs are most effective as a complement to classical optimizers, not as a replacement. We benchmark 9 methods (4 classical, 4 LLM-based, 1 hybrid), all under the same 24-hour GPU training budget with 3 seeds.

Methods

Classical (fixed 14-HP search space):

  • TPE: Tree-structured Parzen Estimator (Optuna).
  • CMA-ES: Covariance Matrix Adaptation Evolution Strategy (Optuna CMA sampler).
  • SMAC: Sequential Model-based Algorithm Configuration with Random Forest surrogate (SMAC3).
  • Random: Uniform random sampling (Optuna RandomSampler).
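
For orientation, the three Optuna-backed baselines can be instantiated roughly as follows (a minimal sketch using Optuna's standard sampler API; the `objective` function is a hypothetical stand-in for the repo's trial runner):

import optuna

# Sketch: the Optuna-backed classical baselines share one study setup and differ only in sampler.
samplers = {
    "tpe": optuna.samplers.TPESampler(seed=0),
    "cma_es": optuna.samplers.CmaEsSampler(seed=0),
    "random": optuna.samplers.RandomSampler(seed=0),
}

study = optuna.create_study(direction="minimize", sampler=samplers["tpe"])
study.optimize(objective, n_trials=100)  # `objective` runs one 5-minute trial and returns val_bpb
print(study.best_value, study.best_params)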

LLM-based (fixed 14-HP search space):

  • LLAMBO (Optuna): LLM as surrogate + candidate generator inside Bayesian optimization (OptunaHub port). Uses binary surrogate labels, delegates categorical HPs to random sampling, and hides failed trials from the surrogate (see Details).
  • LLAMBO (Paper): Our reimplementation faithful to the original paper: continuous surrogate labels, all HPs visible to the LLM, failed trials included (Liu et al., 2024).
  • Karpathy Agent (14 HPs): LLM sees trial history and suggests the next config within the fixed search space.
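
For concreteness, the fixed-space agent loop looks roughly like this (a hedged sketch: it assumes the OpenAI-compatible vLLM endpoint described under Usage, and `format_history` / `run_trial` are hypothetical helpers; the real prompts are reproduced in the paper appendix):

import json
from openai import OpenAI

# Sketch of the fixed-search-space agent: the LLM sees trial history and proposes the next config.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="dummy")
history = []  # list of (config_dict, val_bpb) pairs

for trial in range(100):
    prompt = ("Minimize val_bpb. Propose the next hyperparameter config within the given "
              "bounds as a JSON object.\n" + format_history(history))
    resp = client.chat.completions.create(model="Qwen3.5-0.8B",
                                          messages=[{"role": "user", "content": prompt}])
    config = json.loads(resp.choices[0].message.content)
    history.append((config, run_trial(config)))  # run_trial returns val_bpb (100.0 on OOM)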

LLM-based (unconstrained search space):

  • Karpathy Agent (Code): LLM sees trial history and edits train.py directly; not restricted to the fixed 14-HP search space.

Hybrid (fixed 14-HP search space):

  • Centaur (CMA-ES+LLM): CMA-ES runs every trial; on 30% of trials, the LLM receives CMA-ES's internal state and suggests a config. CMA-ES updates from all results, including LLM-suggested ones. See centaur.md.
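
To make the hybrid concrete, here is a minimal sketch of a Centaur-style loop (assumptions: pycma's ask/tell interface stands in for the Optuna CMA sampler; `llm_suggest`, `run_trial`, and the mapping of the numeric HPs to [0, 1] are hypothetical; centaur.md documents the actual algorithm):

import random
import cma

# Sketch: CMA-ES proposes candidates; on ~30% of trials the LLM proposes one instead,
# after seeing CMA-ES's interpretable state and recent trial history.
es = cma.CMAEvolutionStrategy(x0=[0.5] * 13, sigma0=0.3)  # numeric HPs normalized to [0, 1]
history = []

while not es.stop():
    candidates = es.ask()
    for i in range(len(candidates)):
        if random.random() < 0.3:  # the LLM ratio ablated below
            candidates[i] = llm_suggest(
                mean=es.mean, sigma=es.sigma, cov=es.C,        # CMA-ES internal state
                top5=sorted(history, key=lambda t: t[1])[:5],  # best configs so far
                recent=history[-20:])                          # last 20 trials
    values = [run_trial(x) for x in candidates]  # val_bpb per config, 100.0 on OOM
    history.extend(zip(candidates, values))
    es.tell(candidates, values)  # CMA-ES updates from all results, LLM-suggested ones included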

Our main experiments use self-hosted Qwen3.5 (0.8B and 27B) as the LLM optimizer via vLLM on the same GPU that trains the optimizee (~50M parameter language model). We additionally run frontier model experiments with Gemini 3.1 Pro Preview via the Gemini API and Claude Opus 4.6 via the Claude API.

Experimental Setup

Single H200 GPU, 5 min/trial, minimize val_bpb. Search space: 14 HPs auto-extracted from train.py via AST parsing (every ALL_CAPS = literal assignment becomes a tunable HP). All methods get 24 hours of GPU training time (excluding LLM inference overhead), capped to ~76 GB VRAM (to match the H100 used in Karpathy's and Shwartz Ziv's experiments). Failed trials reported as val_bpb=100.0 so optimizers learn to avoid OOM regions. 3 seeds per condition.
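
For illustration, the ALL_CAPS extraction can be done with Python's ast module along these lines (a sketch; the repo's actual parser may differ in details):

import ast

def extract_hps(path="train.py"):
    """Collect every `ALL_CAPS = <literal>` assignment as a tunable HP (name -> default)."""
    tree = ast.parse(open(path).read())
    hps = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign) and len(node.targets) == 1:
            target = node.targets[0]
            if (isinstance(target, ast.Name) and target.id.isupper()
                    and isinstance(node.value, ast.Constant)):
                hps[target.id] = node.value.value  # e.g. {"DEPTH": 8, "HEAD_DIM": 128, ...}
    return hps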

Results

Classical methods outperform LLMs in fixed search spaces

Within the fixed search space, classical HPO methods consistently outperform LLM-based agents. The gap to the best fixed-space LLM method (LLAMBO Paper at 0.9862) is substantial, and several pure LLM methods perform worse than random search, indicating that restricting LLMs to a fixed HP search space does not leverage their strengths. OOM avoidance matters more than search diversity: the top methods all keep OOM rates below 16%, while the bottom four exceed 36%. Notably, LLAMBO observes full trial history yet produces OOM rates (48-61%) comparable to random search (56%), suggesting that small/mid-sized LLMs fail to learn which regions cause memory failures.

Best in each category (hybrid, classical, pure LLM) is highlighted in the table below (Okabe-Ito colorblind-safe palette).

| Method | Seeds | Best val_bpb | OOM% |
|---|---|---|---|
| Centaur [Opus 4.6] | 3 | 0.9739 ± 0.0012 | 17% |
| Centaur [Qwen 27B] | 3 | 0.9763 ± 0.0005 | 15% |
| Centaur [Qwen 0.8B] | 3 | 0.9766 ± 0.0008 | 13% |
| Centaur [Gemini 3.1 Pro] | 3 | 0.9767 ± 0.0013 | 20% |
| TPE | 3 | 0.9768 ± 0.0019 | 11% |
| Karpathy Agent (Code) [Opus 4.6] | 3 | 0.9770 ± 0.0027 | 5% |
| SMAC | 3 | 0.9778 ± 0.0020 | 36% |
| CMA-ES | 3 | 0.9785 ± 0.0036 | 16% |
| Karpathy Agent (Code) [Qwen 27B] | 3 | 0.9814 ± 0.0046 | 12% |
| Karpathy Agent (Code) [Gemini 3.1 Pro] | 3 | 0.9826 ± 0.0004 | 3% |
| LLAMBO (Paper) [Qwen 27B] | 3 | 0.9862 ± 0.0041 | 48% |
| Random | 3 | 0.9873 ± 0.0021 | 56% |
| LLAMBO (Optuna) [Qwen 27B] | 3 | 0.9882 ± 0.0012 | 61% |
| Karpathy Agent (14 HPs) [Qwen 27B] | 3 | 0.9904 ± 0.0002 | 1% |
| Karpathy Agent (Code) [Qwen 0.8B] | 3 | 0.9910 ± 0.0001 | 19% |

Unconstrained code editing is viable but requires model scale

Karpathy Agent (Code), which directly edits training source code, is the only pure LLM method competitive with classical approaches. Experiments with frontier models Gemini 3.1 Pro Preview and Claude Opus 4.6 narrow the gap further (Opus 4.6: 0.9770 ± 0.0027, competitive with CMA-ES at 0.9785) but do not close it to the best classical methods like TPE (0.9768). Frontier model capability also shows up in the OOM rate: Karpathy Agent (Code)'s failure rate drops sharply with LLM capability (19% for Qwen 0.8B, 12% for Qwen 27B, 3% for Gemini 3.1 Pro, 5% for Opus 4.6), whereas Centaur's stays in a narrow range (13-20%) across model choices, indicating that CMA-ES dominates OOM avoidance in the hybrid.

Frontier model comparison

Frontier models (Claude Opus 4.6, Gemini 3.1 Pro Preview) vs Qwen3.5-27B for Karpathy Agent (Code) and Centaur. Centaur with Opus 4.6 achieves the best result (0.9739). TPE shown as classical reference.

Gemini 3.1 Pro Preview vs Qwen3.5-27B

0.8B vs 27B LLM optimizer

Scaling the LLM from 0.8B to 27B is essential for unconstrained code editing (0.9910 vs 0.9814) but provides no advantage for fixed-HP optimization. Solid lines = 27B, dashed = 0.8B.

0.8B vs 27B comparison

Hybrid optimization: best of both worlds

Centaur outperforms all other methods, including CMA-ES alone, while using the LLM on only 30% of trials. The LLM receives CMA-ES's full internal state (mean vector, step-size, covariance matrix), the top-5 configurations, and the last 20 trials. A 0.8B LLM already suffices to outperform all classical and pure LLM methods. Scaling from 0.8B (0.9766) to 27B (0.9763) to Gemini 3.1 Pro (0.9767) yields no meaningful improvement, suggesting a capability plateau. Centaur with Claude Opus 4.6, however, breaks through this plateau to 0.9739, with the improvement coming from higher-quality suggestions rather than better OOM avoidance (17% OOM vs 15% for Qwen 27B).

We ablate the LLM ratio: higher ratios degrade performance, confirming that CMA-ES should retain majority control. See centaur.md for the full algorithm.

Centaur LLM Ratio Ablation

Incumbent Traces

Grey dots are all trials, colored dots are new bests, staircase is the incumbent (best-so-far). Each panel shows the best seed for that method.

Classical + Hybrid:

Incumbent Traces — Classical + Hybrid

LLM-based:

Incumbent Traces — LLM-based

Search Space

14 hyperparameters auto-extracted via AST parsing (every ALL_CAPS = literal assignment in train.py):

| HP | Type | Range | Log | Default |
|---|---|---|---|---|
| DEPTH | int | 4 – 24 | | 8 |
| ASPECT_RATIO | int | 32 – 128 | | 64 |
| HEAD_DIM | int | 64 – 256 | yes | 128 |
| DEVICE_BATCH_SIZE | int | 32 – 256 | yes | 128 |
| TOTAL_BATCH_SIZE | int | 65 536 – 2 097 152 | yes | 524 288 |
| EMBEDDING_LR | float | 0.01 – 2.0 | yes | 0.6 |
| UNEMBEDDING_LR | float | 0.0005 – 0.05 | yes | 0.004 |
| MATRIX_LR | float | 0.005 – 0.2 | yes | 0.04 |
| SCALAR_LR | float | 0.05 – 2.0 | yes | 0.5 |
| WEIGHT_DECAY | float | 0.0 – 0.5 | | 0.2 |
| WARMUP_RATIO | float | 0.0 – 0.3 | | 0.0 |
| WARMDOWN_RATIO | float | 0.1 – 0.8 | | 0.5 |
| FINAL_LR_FRAC | float | 0.0 – 0.2 | | 0.0 |
| WINDOW_PATTERN | categorical | SSSL, SSLL, SLSL, LLLL, SSSS, LSSL | | SSSL |

Defaults are Karpathy's starting config (commit b11d6f28), not his final optimized values.
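
Expressed with Optuna's suggest API, the same space would look roughly as follows (a sketch built from the table above; in the repo the space is generated automatically from the AST extraction):

def sample_config(trial):
    # Sketch of the 14-HP space; bounds and log flags mirror the table above.
    return {
        "DEPTH": trial.suggest_int("DEPTH", 4, 24),
        "ASPECT_RATIO": trial.suggest_int("ASPECT_RATIO", 32, 128),
        "HEAD_DIM": trial.suggest_int("HEAD_DIM", 64, 256, log=True),
        "DEVICE_BATCH_SIZE": trial.suggest_int("DEVICE_BATCH_SIZE", 32, 256, log=True),
        "TOTAL_BATCH_SIZE": trial.suggest_int("TOTAL_BATCH_SIZE", 65_536, 2_097_152, log=True),
        "EMBEDDING_LR": trial.suggest_float("EMBEDDING_LR", 0.01, 2.0, log=True),
        "UNEMBEDDING_LR": trial.suggest_float("UNEMBEDDING_LR", 0.0005, 0.05, log=True),
        "MATRIX_LR": trial.suggest_float("MATRIX_LR", 0.005, 0.2, log=True),
        "SCALAR_LR": trial.suggest_float("SCALAR_LR", 0.05, 2.0, log=True),
        "WEIGHT_DECAY": trial.suggest_float("WEIGHT_DECAY", 0.0, 0.5),
        "WARMUP_RATIO": trial.suggest_float("WARMUP_RATIO", 0.0, 0.3),
        "WARMDOWN_RATIO": trial.suggest_float("WARMDOWN_RATIO", 0.1, 0.8),
        "FINAL_LR_FRAC": trial.suggest_float("FINAL_LR_FRAC", 0.0, 0.2),
        "WINDOW_PATTERN": trial.suggest_categorical(
            "WINDOW_PATTERN", ["SSSL", "SSLL", "SLSL", "LLLL", "SSSS", "LSSL"]),
    }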

Usage

uv venv --python 3.12
source .venv/bin/activate
uv pip install -e ".[dev,all]"

# TPE
python -m autoresearch_automl.cli run --backend optuna --trials 100 --seed 0

# Random Search
python -m autoresearch_automl.cli run --backend random --trials 100 --seed 0

# SMAC3
python -m autoresearch_automl.cli run --backend smac --trials 100 --seed 0

# CMA-ES
python -m autoresearch_automl.cli run --backend cma_es --trials 100 --seed 0
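
# Optional: for the LLM-based backends below, start a local vLLM OpenAI-compatible server first
# (one possible invocation, not necessarily the repo's exact setup; adjust the model name/path)
vllm serve Qwen3.5-0.8B --port 8000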

# LLAMBO (Optuna) (requires vLLM running)
export OPENAI_BASE_URL=http://127.0.0.1:8000/v1
export OPENAI_API_KEY=dummy
python -m autoresearch_automl.cli run --backend llambo --trials 100 --llm-model Qwen3.5-0.8B

# LLAMBO (Paper)
python -m autoresearch_automl.cli run --backend llambo_original --trials 100 --llm-model Qwen3.5-0.8B

# Karpathy Agent (14 HPs) — LLM suggests configs within fixed search space
python -m autoresearch_automl.cli run --backend karpathy_agent_hps --trials 100 --llm-model Qwen3.5-0.8B

# Karpathy Agent (Code) — edits train.py directly
python -m autoresearch_automl.cli run --backend karpathy_agent --trials 100 --llm-model Qwen3.5-0.8B

# Centaur (CMA-ES+LLM)
python -m autoresearch_automl.cli run --backend centaur --trials 100 --llm-model Qwen3.5-0.8B

Related work

Details

H200 vs H100 baseline

Our baseline (Karpathy's default config) achieves val_bpb ≈ 0.991 on H200 at full clock speed (~1750K tokens/s), comparable to Karpathy's ~0.998 on H100. Early runs showed a higher baseline of ~1.008 due to GPU power throttling. All results reported here use non-throttled H200s.

LLM system prompts

The system prompts used by each LLM-based method include the optimization goal (minimize val_bpb), the model class and training stack (GPT-2 scale transformer, Muon+AdamW optimizer), the dataset (FineWeb), the hardware constraints and OOM warning, the search space with bounds, and the trial history. Full prompt templates are reproduced verbatim in the paper appendix (§LLM Prompts and Problem Context) and in the repository source.

LLAMBO (Optuna) vs LLAMBO (Paper)

The OptunaHub LLAMBO sampler (Ozaki et al., 2025) differs from the original paper code in several ways that materially affect optimization quality:

| Aspect | Original paper | OptunaHub port |
|---|---|---|
| Surrogate labels | Actual metric values; LLM sees performance gradients | Binary 0/1 (top 20% threshold) |
| Categorical HPs | All HPs included in LLM prompts | Categoricals delegated to random sampling |
| Failed trials | Visible to surrogate (can learn infeasible regions) | Hidden from surrogate |

We implemented a faithful adaptation of the paper's code alongside the OptunaHub version to quantify these differences.
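
A small numeric illustration of the surrogate-label difference (values are made up; this is not code from either implementation):

import numpy as np

vals = np.array([0.982, 0.991, 100.0, 0.978, 0.988])  # val_bpb of past trials (100.0 = OOM)

# Original paper: the LLM surrogate sees the raw metric values, so relative quality is preserved.
continuous_labels = vals

# OptunaHub port: trials are binarized into top-20% "good" vs "bad"; magnitudes are lost.
threshold = np.quantile(vals, 0.2)               # val_bpb is minimized, so lower is better
binary_labels = (vals <= threshold).astype(int)  # -> [0 0 0 1 0]: only the best trial is "good"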

Citation

@article{ferreira2026autoresearchautoml,
    title={Can LLMs Beat Classical Hyperparameter Optimization Algorithms? A Study on autoresearch},
    author={Fabio Ferreira and Lucca Wobbe and Arjun Krishnakumar and Frank Hutter and Arber Zela},
    year={2026},
    eprint={2603.24647},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2603.24647},
}