GitHub - harshbhatt7585/autoRL


[Figure: Agent performance by experiment]

This repo is for autonomous research over RL environments that run atop Simverse.

It can build RL environments and train them from a single program.md file.

Simverse is designed as an abstraction layer that the agent can use to build and train the environment.

Design principles

  • keep the evaluator fixed
  • keep the score fixed
  • keep the mutable surface area tiny
  • let the outer agent choose episode budget within hard caps
  • keep the rest of the runtime stable enough that results stay interpretable

Repo layout

  • candidate/env.py: the editable SimEnv candidate, currently an intentionally empty starter canvas
  • candidate/train.py: env-specific policy and PPO hyperparameters
  • framework.py: fixed Simverse PPO evaluator and score
  • train.py: fixed CLI entrypoint that prints comparable metrics
  • program.md: instructions for the autonomous agent
  • vendor/simverse: vendored upstream Simverse source

Scoring philosophy

The fixed score is simple:

  • score = mean_eval_return

That means the candidate is judged only by the average post-training episode return across all greedy evaluation episodes and all seeds.

Other reported metrics such as solve rate, learning gain, stability, and complexity penalty are still printed for debugging and analysis, but they no longer affect the score.
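As a minimal sketch of what this reduces to (assuming eval returns are collected per seed; the function name and signature here are illustrative, not the evaluator's actual API):

# Illustrative sketch of the fixed score: a flat average over all greedy
# evaluation episodes and all seeds. `eval_returns` maps each seed to the
# list of post-training episode returns for that seed.
def score(eval_returns: dict[int, list[float]]) -> float:
    all_returns = [r for returns in eval_returns.values() for r in returns]
    # score = mean_eval_return
    return sum(all_returns) / len(all_returns)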

Quick start

Create a local virtual environment and install the vendored Simverse package with uv:

uv venv
uv pip install -e vendor/simverse

Run the fixed evaluator through that virtualenv:
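For example, a plausible invocation (assuming train.py is run directly as the entrypoint; the flags are the ones documented in this README):

uv run python train.py --train-episodes 12 --eval-episodes 8 --description "baseline starter env" --status pending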

Each run appends a summary row to results.tsv. You can also pass --description "..." and optionally --status pending|keep|discard to label that row.

The evaluator starts with small defaults but allows budget growth up to hard caps:

  • default 12 PPO training episodes
  • default 8 greedy eval episodes
  • hard cap 1000 PPO training episodes
  • hard cap 100 greedy eval episodes
  • 2 random seeds
  • 32 parallel environments per seed

The intended workflow is to explore with small --train-episodes and --eval-episodes, then ratchet them upward as the candidate gets stronger. The rest of the evaluator should stay the same.
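For example (same assumed invocation as above, budgets staying under the hard caps):

# explore cheaply first
uv run python train.py --train-episodes 12 --eval-episodes 8
# then ratchet the budget upward as the candidate improves
uv run python train.py --train-episodes 200 --eval-episodes 50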

Task horizon is defined through the candidate hyperparameter surface in candidate/train.py via training_overrides()["max_steps"]. The outer CLI still controls only the number of PPO training episodes and greedy evaluation episodes.
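A hypothetical candidate/train.py fragment, assuming training_overrides() returns a dict of hyperparameters that the fixed evaluator merges over its defaults (the "max_steps" key is the one documented here; the other keys are purely illustrative):

# candidate/train.py (illustrative fragment, not the checked-in file)
def training_overrides() -> dict:
    return {
        "max_steps": 256,       # task horizon: steps per episode (documented key)
        # other env-specific PPO hyperparameters would live here (illustrative)
        "learning_rate": 3e-4,
        "gamma": 0.99,
    }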

The checked-in candidate environment is intentionally minimal. The actual task the agent should build belongs in program.md, not hardcoded in the baseline starter env.

For an autonomous loop, initialize git first so the agent can keep or discard environment mutations cleanly:

git init
git add .
git commit -m "Initial Simverse autoRL scaffold"

Then point your coding agent at program.md.

How to run?

First, grant Codex full permissions so it can run without interruption.

As mentioned in autoresearch, you can spin up Codex with this prompt and it will run continuously:

Hi have a look at program.md and let's kick off a new experiment! let's do the setup first.

But I noticed that it sometimes stops after setting up. If it does, the following loop will continuously re-trigger it even if it exits:

  mkdir -p .log  # the redirect below fails if the log directory is missing
  nohup bash -lc '
  while true; do
    codex -a never -s workspace-write exec -C /Users/harshbhatt/Projects/autoRL "Continue the experiment loop in program.md from the current repo state. Read results.tsv, keep working on candidate/env.py and candidate/train.py, run another experiment, update results.tsv, and do not stop after a single accepted run. Only stop if the repo is broken or the process is interrupted."
    sleep 2
  done
  ' > .log/codex.log 2>&1 &

Citations and Thank You