I ran an autonomous loop for 55 hours and it wrote more test code than source code. Nobody asked it to. 125,000 lines of Go, ported from a VB6 football simulation I built in 2005, and the test-to-source ratio came out 1.25:1. Codex just kept writing tests on its own.
The simulation still runs today on a Windows XP box. Players still pay to use it. I’d basically given up on ever moving it forward.
The Program
LLMs are terrible at VB6. I don’t mean “not great.” I mean they can’t make a single change, no matter how mild, without completely destroying the build. One-line edits. Trivial stuff. They break it every time. The error messages are useless and the compiler just gives up.
But they can read it. And they can rewrite it in Go.
So I had a program written in a language nobody supports anymore, with 20 years of accumulated football logic, a main runner file of 34,000 lines, an orchestrator of 10,000 lines, and a module file of 1,000 lines plus a ton of text files it pulls in at runtime. I would’ve paid 10K or even 20K to have someone port this manually. Instead I’m paying $200 a month and letting Codex take a shot.
The VB6 code is the spec. The Go code either matches it or it doesn’t. And the loop can read 34,000 lines of VB6 and systematically port bounded slices of it, function by function. So I ran the loop.
The Numbers
55 hours before it hit rate limits and stopped.
55,483 source lines across 96 files. 69,557 test lines across 136 files. 2,150 test functions containing 4,662 assertions. I didn’t ask for any of the tests. Codex just kept writing them.
The package breakdown tells you where the complexity lives:
- sim/pass (passing logic): 17K lines
- sim/run (rushing logic): 8.8K lines
- playcall (play selection): 7K lines
- legacysim (format bridge): 3.4K lines
- sim/special (special teams): 2.5K lines
- gamerun (orchestration): 2.2K lines
454 of 1,851 functions have explicit VB6 line number references in comments. 5,820 comment lines referencing specific VB6 source positions. The loop was leaving breadcrumbs back to the original code, and that turned out to be one of the most useful things it did.
The Loop
I had a VB6 league (lg9) running on the live XP box producing real game results, and a Go harness league (lg50) producing results from the ported code. Same rosters, same schedules, same playbooks. Run a week in both, compare every stat. If passing yards diverge, something is wrong in the port.
Every item followed the same 5-step process:
- Trace the VB6 line in the source file
- Port a bounded slice into Go
- Write focused tests around the ported formulas
- Run a one-week lg50 smoke against the VB6 baseline
- Record the delta
257 unique drill-down IDs. 372 numbered completed items. 195 recorded smoke runs against the lg9 baseline. 44 parity report files. 19 smoke test reports.
Rationalization
The LLM consistently wants to make it “just work,” not by porting the code but by writing different code that does the same thing. It sees a messy VB6 routine and thinks “I can write something cleaner that does the same thing.” And it’s wrong. It writes something cleaner that does approximately the same thing but misses the weird edge case from 2008 that exists because a user reported a bug with play-action passes against cover-2 and I fixed it by adding a conditional that doesn’t make any sense unless you know the history.
I had to inject what I call an anti-rationalization measure. The loop’s CLAUDE.md says: stat gaps between the Go league and the VB6 league signal missing port logic, not tuning opportunities. Close the gap by porting the missing VB6 branch, never by adjusting constants. Never invent Go-only logic that doesn’t trace back to VB6.
169 times the loop explicitly declared “do not cap” or “do not tune toward target stats.” It understood the constraint and followed it.
24 instances where the loop removed Go-only invented logic. Code it had written earlier that didn’t trace back to VB6, that it later recognized as wrong and deleted.
Port Tasks
I think success with these loops is entirely dependent on how well you can structure tests and success criteria. My problem had a verifiable answer. The VB6 code IS the spec. Run both systems, compare outputs. When you hand an autonomous agent a port task, you’re giving it something most creative coding tasks don’t have: a mechanical way to know if the output is correct.
“Build me a dashboard” has too many valid answers. “Port figurepdef() from VB6 to Go and verify it produces the same defensive package assignments” has exactly one correct answer.
I keep thinking about what other tasks have this shape. Format migrations. Protocol reimplementations. Anything where there’s an existing system that defines correct behavior and you’re rebuilding it in a different language or framework. Those are the tasks where autonomous loops can run for days and actually produce real output.
I’m not convinced it works for open-ended creative work. If there’s no reference implementation to compare against, I genuinely don’t know how you’d structure the success criteria well enough for a 55-hour unsupervised run where the agent has to make thousands of decisions about behavior without a ground truth to validate against.
Dead Code
VB6 code is messy. Functions exist that nobody calls. Lines are commented out with a leading '. Blocks wrapped in If 1 = 0 Then ... End If are dead code that looks alive.
The worst was RBMassage(). The function definition existed in getfbNEWNewer.frm. Looked legit. The loop ported it. But the only call site, Call RBMassage, was commented out with '. The ported function applied accumulator erosion that the live runner never executes. It produced negative rushing yards. That kind of thing is hard to catch in a 34,000-line file where commented-out code and live code look nearly identical.
The loop caught 7 dead code instances total.
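A check for this kind of dead code can be sketched as a pre-port scan: before porting a routine, look for at least one live call site. The code below is a rough illustration, not what the loop actually ran. `LiveCallSites` and `stripComment` are hypothetical names, and the scan is deliberately naive: it ignores `Else` branches, `Rem` comments, line continuations, and names that appear inside string literals.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// stripComment removes a trailing VB6 comment: everything from the first
// apostrophe that is not inside a double-quoted string literal.
func stripComment(line string) string {
	inString := false
	for i, r := range line {
		switch r {
		case '"':
			inString = !inString
		case '\'':
			if !inString {
				return line[:i]
			}
		}
	}
	return line
}

var (
	deadIfRe  = regexp.MustCompile(`(?i)^if\s+1\s*=\s*0\s+then$`) // If 1 = 0 Then
	blockIfRe = regexp.MustCompile(`(?i)^if\b.*\bthen$`)          // any multi-line If
	endIfRe   = regexp.MustCompile(`(?i)^end\s+if$`)
)

// LiveCallSites returns the 1-based line numbers where name is referenced in
// code that is neither commented out nor inside an `If 1 = 0 Then` block.
// The declaration line of the routine itself is not counted.
func LiveCallSites(src, name string) []int {
	q := regexp.QuoteMeta(name)
	refRe := regexp.MustCompile(`(?i)\b` + q + `\b`)
	declRe := regexp.MustCompile(`(?i)^((public|private|friend)\s+)?(static\s+)?(sub|function)\s+` + q + `\b`)

	var live []int
	deadDepth := 0 // > 0 while inside an `If 1 = 0 Then` block
	for i, raw := range strings.Split(src, "\n") {
		code := strings.TrimSpace(stripComment(raw))
		if deadDepth > 0 {
			switch {
			case blockIfRe.MatchString(code):
				deadDepth++ // nested block If inside the dead region
			case endIfRe.MatchString(code):
				deadDepth--
			}
			continue
		}
		if deadIfRe.MatchString(code) {
			deadDepth = 1
			continue
		}
		if code == "" || declRe.MatchString(code) {
			continue
		}
		if refRe.MatchString(code) {
			live = append(live, i+1)
		}
	}
	return live
}

func main() {
	// Made-up VB6 fragment for illustration only.
	src := `Sub RBMassage()
    x = 1
End Sub
Sub Main()
    'Call RBMassage
    If 1 = 0 Then
        Call RBMassage
    End If
    Call Other
End Sub`
	fmt.Println("RBMassage live call sites:", LiveCallSites(src, "RBMassage"))
	fmt.Println("Other live call sites:", LiveCallSites(src, "Other"))
}
```

A routine with zero live call sites is a candidate for skipping rather than porting, though an empty result still deserves a manual look, since the scan knows nothing about event handlers or calls from other files.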
Cleanup
The only thing I haven’t liked about cleaning this up is duplicate functions doing the same thing.
5 copies of wtspd(). 4 copies of wtstr(). 3 copies of ScaleNumber(). Multiple attrVal() raw-read helpers doing the same thing across different packages.
The pattern is predictable. An autonomous loop running for 55 hours doesn’t have a great mental model of what it already wrote 40 hours ago. It needs function X, it writes function X, even though function X already exists three packages over. The code works. It passes tests. It’s just duplicated.
passLupAttrVal reimplemented lup() independently and missed the GameplayXP modifier. That’s Phase 4 of VB6’s lup function, a 0-10% experience modifier on attribute reads. Every pass-side attribute lookup was slightly wrong because of this. Not enough to obviously fail a smoke test. Just enough to be wrong.
Import cycles prevented consolidation until I created a vb6math leaf package to hold shared math functions. About 1,600 lines of duplicate logic removed in total.
Regressions
43 documented regressions during the run. All handled by opening new drill-downs rather than reverting correct ports. Yards per attempt oscillated wildly as different source slices landed: 13.04, then 4.28, then 15.73, then 5.88.
Each new slice of VB6 logic changed the simulation’s behavior in ways that temporarily made aggregate stats worse before the next slice corrected them. Port the pass protection logic, yards drop. Port the coverage escape logic, yards spike. Port the sack threshold adjustments, yards drop again.
This is fine if you understand what’s happening. It’s alarming if you’re watching a dashboard expecting monotonic improvement. I’m treating it like marble. Chipping away at every single issue. Over time, it gets closer to a functionally complete port.
The Handoff
When the loop stopped, I asked Claude to review what Codex had written. Claude’s takeaway was: stop having Codex do it. Let me do it instead.
Claude looked at the 125K lines of output, found the real problems (the lup() duplication, the import cycles, the missing GameplayXP modifier), and fixed them in a few hours. Not because Claude is smarter. Because reviewing existing code with known problems is a fundamentally different task than generating new code over 55 hours. The review task has the same structured-problem property that made the port loop work: here’s the code, here are the test results, find what’s wrong.
But even just reading this old VB6 game from 2005, rewritten in Go, is something. It’s all so pretty and nice now. The VB6 code was a disaster. One 34,000-line form with no decomposition, no tests, variables named a1 and temp3 everywhere. Now there are packages, clear boundaries, typed structs, 2,150 test functions.
The Economics
OpenAI will change this. This is a brief window in time where they’re allowing completely ungodly levels of token usage just to take people from Anthropic. I’m not naive enough to think it’ll last forever. So I’m taking my shot at completing this for $200 and calling it a day.
It’s not elegant. But the alternative was $10-20K for a manual port, or more realistically, the program just stays dead. 34,000 lines of spaghetti VB6 with variables named a1 through a9, no function decomposition, decades of commented-out experiments mixed into live code. No human being was going to port that in any reasonable timeframe.
Risk
It’s about risk tolerance and management. For personal projects, I set the risk tolerance myself, and automation makes sense. In work projects, I don’t have the same risk tolerance. I have extraordinarily little, so I have a completely different process for that where every change gets reviewed and validated before it touches anything.
One is not better than the other. They’re just connected to the risk tolerance I have.
Running an autonomous loop for 55 hours on a personal project with mechanical success criteria is a reasonable risk. The worst case is wasted tokens and some code I delete. Running the same loop on production work where a regression means paging someone at 3am is a different calculation entirely.
Next Time
I’d enforce a deduplication pass every N hours. Let the loop write whatever it wants, but periodically sweep for functions with similar signatures doing the same thing. The duplicate problem is predictable and catching it during the run instead of after would save cleanup time.
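A first cut at such a sweep could use Go's standard `go/parser` and `go/ast` packages to flag top-level functions declared under the same name in more than one package. This is only a sketch: `FindDuplicateFuncs` is a hypothetical helper, the function bodies in the sample are made up, and matching by name alone is a coarse filter that will neither catch renamed copies nor prove two same-named functions are equivalent.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"sort"
)

// FindDuplicateFuncs parses each file (filename -> source text) and returns,
// for every top-level function name declared in more than one package,
// the sorted list of package names that declare it. Methods are skipped.
func FindDuplicateFuncs(files map[string]string) (map[string][]string, error) {
	fset := token.NewFileSet()
	seen := map[string]map[string]bool{} // function name -> set of package names
	for filename, src := range files {
		f, err := parser.ParseFile(fset, filename, src, 0)
		if err != nil {
			return nil, err
		}
		for _, decl := range f.Decls {
			fn, ok := decl.(*ast.FuncDecl)
			if !ok || fn.Recv != nil {
				continue // not a function, or a method with a receiver
			}
			if seen[fn.Name.Name] == nil {
				seen[fn.Name.Name] = map[string]bool{}
			}
			seen[fn.Name.Name][f.Name.Name] = true
		}
	}
	dups := map[string][]string{}
	for name, pkgs := range seen {
		if len(pkgs) < 2 {
			continue
		}
		for p := range pkgs {
			dups[name] = append(dups[name], p)
		}
		sort.Strings(dups[name])
	}
	return dups, nil
}

func main() {
	// Made-up sources; only the function and package names echo the port.
	files := map[string]string{
		"pass/attr.go": "package pass\nfunc wtspd(x int) int { return x }\n",
		"run/attr.go":  "package run\nfunc wtspd(x int) int { return x }\nfunc onlyHere() {}\n",
	}
	dups, err := FindDuplicateFuncs(files)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for name, pkgs := range dups {
		fmt.Printf("%s is declared in %v\n", name, pkgs)
	}
}
```

In a real sweep the file map would come from walking the repository, and names like `main` or `init` would need to be excluded. The output is a review list for the loop or a human, not an automatic merge.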
I’d also build a shared library of VB6-equivalent math functions before the loop starts. The lup() duplication happened because the loop treated attribute reads as a simple utility rather than recognizing it as a multi-phase function with modifiers. That shared library should exist from hour one, not get discovered during cleanup.
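To give a sense of what could live in such a library, here is a sketch of three numeric helpers that mirror VB6 built-ins whose behavior differs from the obvious Go equivalent. Which helpers the real vb6math package contains is not stated here, so these three are assumptions chosen for illustration. In a real layout they would sit in a leaf package rather than `package main`. The VB6 semantics themselves are standard: `Int` rounds down, `Fix` truncates toward zero, and `CInt` rounds halves to the nearest even integer.

```go
package main

import (
	"fmt"
	"math"
)

// Int mirrors VB6 Int(): the largest integer less than or equal to x.
// Note that a plain Go int(x) conversion truncates toward zero instead,
// so the two disagree for negative inputs.
func Int(x float64) float64 { return math.Floor(x) }

// Fix mirrors VB6 Fix(): x with its fractional part dropped (toward zero).
func Fix(x float64) float64 { return math.Trunc(x) }

// CInt mirrors the rounding rule of VB6 CInt(): halves go to the nearest
// even integer (banker's rounding). This sketch does not reproduce VB6's
// 16-bit Integer overflow error.
func CInt(x float64) int { return int(math.RoundToEven(x)) }

func main() {
	for _, x := range []float64{-2.5, -0.5, 0.5, 1.5, 2.5} {
		fmt.Printf("x=%5.1f  Int=%3.0f  Fix=%3.0f  CInt=%3d  go int()=%3d\n",
			x, Int(x), Fix(x), CInt(x), int(x))
	}
}
```

Small differences like these are exactly the kind that shift a stat slightly without failing a smoke test, which is the argument for defining them once in a package that imports nothing else from the port.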
The Assessment
At this point, the loop has probably done enough that, given enough hours, I could finish the port by hand. It went from “completely impossible, no path forward” to “I could finish this by hand if I had to.” That happened for $200.
The code quality is uneven. Some packages are clean, well-structured Go. Others have that “written at 3am” feeling where the logic is correct but the organization is rough. But the lg50 Go harness runs games that produce stats converging toward the lg9 VB6 league. Not identical yet. There’s still port work to do.
Autonomous loops work when the success criteria are mechanical and the input fully defines the output. Port tasks are that. Most software development is not. I’m not sure how much of this generalizes. My problem had a very specific shape, and I got lucky that the shape was ideal for this kind of work.
Maybe loops get better at open-ended work, maybe the economics change and this window closes. I spent $200 instead of $20,000.