TLA+ in support of AI code generation

Gregory Terzian

There’s been some discussion lately about the idea of “AI writing 90% of the code”, and some controversy over whether that is an accurate claim about the present or the future. After reading Ronacher’s piece, I would rather ask: what exactly is the 10% that is written by hand? Are there some gnarly lines of code that still require typing by hand?

In my most recent experience — when I’m not contributing to Servo, which has an AI ban in place — the 10% I’m writing by hand is not code but specification. Some of that is in English in the form of basic prompts — which I don’t consider part of the codebase — but my favorite parts are in TLA+, which I check into the codebase.

The languages used in https://github.com/gterzian/automata

This post documents my most recent experience: building a non-trivial program (as in: concurrency, graphics) where 100% of the Rust code is AI-generated, but where that process is supported by a TLA+ specification of the core algorithm, as well as by my own conceptual work expressed through a series of prompts in English: a conversation with the agent.

To be clear: that the code was 100% generated by an LLM doesn’t mean that the LLM came up with the design and ideas behind the code, or even that it helped me fix serious issues. As you can see from the conversation summary, all ideas and major fixes are mine. I used the LLM only to translate higher-level software concepts into actual code.

Rule 110 Cellular Automaton

Wolfram’s A New Kind of Science is a fascinating read: you’ll never look at a cat’s fur pattern quite the same way again. But, reading the description of the basic rules of a cellular automaton, I wondered how exactly to implement one in code. For example, Wolfram’s description that the cells of an automaton are “updated in parallel at every step in its evolution” left me wondering whether that means “only update cells for the current step”, or whether it allows cells from different steps (read: rows) to be updated in parallel as well.

Since the mathematical description in Matthew Cook’s paper did not clear that up for me, I decided to write a spec in order to get a precise description of how a particular automaton, the one named Rule 110, can update itself at each computational step (which is not necessarily the same as a step in its history).
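The rule itself is tiny: a cell’s next state is a function of the (left, center, right) triple above it, with the eight possible triples mapped to the bits of the number 110. Here is a minimal Rust sketch of my understanding of the rule (the function name is mine, not necessarily the repo’s):

```rust
/// Rule 110: the next state of a cell is determined by the
/// (left, center, right) triple above it. Reading the triple as a
/// 3-bit number and using it to index into the bits of 110
/// (0b01101110) yields the new cell value.
fn rule_110(left: bool, center: bool, right: bool) -> bool {
    let pattern = ((left as u8) << 2) | ((center as u8) << 1) | (right as u8);
    (110u8 >> pattern) & 1 == 1
}
```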

The spec seems to answer my earlier question in the affirmative: one can compute cells in conceptually parallel vertical columns, up to the point where one must wait for the neighboring cell from another such column to have been computed. In other words, to compute a cell at step i, you don’t always have to wait for step i-1 to have computed all of its cells.
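Concretely, the dependency structure is local: a cell at step i depends on only three cells from step i-1. Below is a hedged sketch of that readiness condition; the names and the wrap-around boundary are my assumptions for illustration, not taken from the spec or the repo:

```rust
use std::collections::HashSet;

/// A cell is identified by its step (row) and position (column).
type Cell = (usize, usize);

/// A cell at step i can be computed as soon as its three parents at
/// step i-1 are done; the rest of step i-1 can still be in flight.
fn ready(cell: Cell, computed: &HashSet<Cell>, width: usize) -> bool {
    let (step, pos) = cell;
    if step == 0 {
        return true; // the initial row is given, not computed
    }
    // Wrap around at the edges (assuming a circular board).
    let left = (pos + width - 1) % width;
    let right = (pos + 1) % width;
    [left, pos, right]
        .iter()
        .all(|&p| computed.contains(&(step - 1, p)))
}
```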

This took a bit of thinking, not only to write down the update rule in TLA+, but also to back it up with what I think is an inductive invariant implying that the automaton can only update itself according to its rule: an hour or two of work, tops.

Later on I also added a spec for the original description, and a refinement mapping showing that my description implies the original one.

Next, I also wanted to implement the rule and visualize it.

The Rust Implementation

Over the course of several days (the last two of which I spent writing these lines and making some final changes), I went through different iterations of an implementation in Rust, using vello and wgpu for graphics and winit for the window and event-loop integration, including throwing away all the code and starting from scratch twice.

The first restart was an accident: I somehow corrupted the state of the project, and instead of figuring it out I thought, why not start from scratch, but this time try GPT5 Codex, since I wanted to compare it with Sonnet 4.5 anyway. Later, I threw away the GPT code and went back to square one with Sonnet.

I played with parallelizing the compute in vertical columns of cells, only to find out that it actually slowed things down, and decided it was not worth pursuing further anyway (a single thread could compute the entire board faster than the screen could update the UI). I played with various rendering and event-loop configurations. I tried out all kinds of UIs to show the progress of the automaton (including splitting the screen and showing steps counterclockwise). And finally, using this library, I also added an option to write the automaton run to a GIF (because I couldn’t find a way to do that from a screen capture on my Mac without downloading an app or signing up for a website).
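For scale, the single-threaded version that won out is just a pass over the previous row. A sketch of one step, reusing the rule_110 function from the earlier sketch and again assuming wrap-around edges:

```rust
/// Compute the next row of the automaton from the current one,
/// sequentially. Even this naive loop outpaces the screen refresh.
fn next_row(current: &[bool]) -> Vec<bool> {
    let width = current.len();
    (0..width)
        .map(|pos| {
            let left = current[(pos + width - 1) % width];
            let right = current[(pos + 1) % width];
            rule_110(left, current[pos], right)
        })
        .collect()
}
```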

Update: I ended up spending some more time on this: as usual, it’s not the amount of time that counts, but rather the cycles (of sleep!), which give you an opportunity to reflect and refine. For example, the rendering loop was further refined to ensure that the GIF encoder’s use of the GPU would not interfere with the main thread’s use of the GPU.
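I won’t claim this is the repo’s exact mechanism, but the general shape of such a separation is a handoff: the render loop sends completed frames over a channel to a dedicated encoder thread, so encoding never blocks presentation. A minimal sketch using only std primitives (the frame type and the encoding step are placeholders):

```rust
use std::sync::mpsc;
use std::thread;

/// A completed frame, already copied out of GPU memory (placeholder type).
struct Frame {
    width: u32,
    height: u32,
    rgba: Vec<u8>,
}

fn main() {
    let (tx, rx) = mpsc::channel::<Frame>();

    // The encoder runs on its own thread, so a slow encode can at
    // worst delay the GIF, never stall presentation.
    let encoder = thread::spawn(move || {
        for frame in rx {
            // Placeholder: hand the pixels to the actual GIF encoder here.
            let _ = (frame.width, frame.height, frame.rgba.len());
        }
    });

    // Render loop (stand-in): after presenting each frame, send a copy.
    for _ in 0..3 {
        let frame = Frame { width: 2, height: 2, rgba: vec![0; 16] };
        if tx.send(frame).is_err() {
            break; // encoder thread is gone
        }
    }

    drop(tx); // closing the channel lets the encoder finish
    encoder.join().unwrap();
}
```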

Below is a janky GIF (because the GIF generation is made subordinate to the main thread’s presenting of frames) of the automaton progressing from a single top-right black cell, with the user scrolling the board to follow interesting patterns. The code can be found here, and the repository also contains the TLA+ spec and a summary of the conversation history.


Takeaways

  • TLA+ works as ground truth for code generation by an AI. Despite the amount of refactoring, I think the core algorithm of the automaton always followed my TLA+ spec (and any error is mine and part of the spec).
  • One could say that the automaton rule is well-known enough for the AI to code it without a TLA+ spec. But it did get confused about the details of the rule (or at least their encoding in the spec), which seems to imply that encoding the rule in TLA+ helped. It certainly helped me better understand the rule.
  • Simpler concurrency problems and their solutions can be specified in English. For example, the app’s event and rendering loop is concurrent, but simple enough not to require TLA+. Still, I wish I had specified it, because I kept repeating myself in prompts about maintaining a certain structure, and things went wrong on occasion. Then again, I didn’t know exactly what I wanted there when I started.
  • Claude is awesome as a concept-to-code translator, where the concept can be specified at various levels of detail. It is not good at actual reasoning about concepts, which covers architecture and concurrent logic. The summary of the evolution of the project shows that all the major ideas, and their fixes, were mine.
  • In terms of code alone, the question is not what percentage can be AI-generated; the question is rather what percentage requires careful review: for now, that is still 100% for production code.

Thank you for reading.