In these days, when everyone online seems convinced that the coding singularity is near, it can be hard to find the signal in the noise.
One of Stephen Diehl's latest articles, “Can Opus 4.6 do Category Theory in Lean?”, was a clear signal to me. It describes writing a mathematical library in Lean with the assistance of AI, and notes that:
When (tedious parts of formalization) becomes trivial, we get to spend our time on the parts that actually matter (…) We get to build higher.
I’ve been using TLA+ for about six years now, but I have never written a proof, nor have I done any formal verification of code. TLA+ for me was rather a way to debug my thinking when designing concurrent algorithms; when the time came to implement them in code, I relied not on math but on intuition.
Translating the TLA+ spec into code could be done by an AI, as long as I reviewed the output.
But Lean — which I have delved into after reading Diehl’s article — offers a different promise: verification across the software development cycle.
But there is a catch: the usual Lean is pretty low-level, as in, it looks like code. That makes it far lower-level than TLA+, which I use not to model all of the code, but only some interesting concurrent property of it.
So does that mean that in the future we will not be writing code anymore, but using an AI to write Lean, the benefit over plain code being that Lean can be used to prove theorems about Lean?
While that would be progress, it is something of a step down from the future I had envisioned: writing high-level math and then seeing it translated into software by an AI.
But that is still possible with Lean: I found out that, just as one can have various levels of granularity in TLA+ specs and then use refinement to prove they implement each other, one can also split the Lean into a spec (which looks a lot like TLA+) and a code-like implementation.
Of course, all of it can be written by an AI. The question remains: what does the human have to review? And the answer is: mostly just the high-level spec part (it depends on the task at hand).
Let’s proceed with an example.
The Lean Spec
The simplest example of using TLA+ comes from Lamport’s Teaching Concurrency paper, which I went through in a previous article where I wrote a TLA+ spec and a Rust implementation of the algorithm discussed therein.
Let’s see what it looks like using the Lean technique I describe above.
First, we’ll write a high-level spec in a style similar to the original TLA+ one, in ConcurrencySpec.lean, which is a translation from my original TLA+ spec.
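To give a flavor of what such a TLA+-style spec can look like in Lean, here is a sketch in that spirit. To be clear, this is not the article’s actual ConcurrencySpec.lean; the names (State, upd, Next, Safety) and details are my own simplification of Lamport’s algorithm, where each process i executes x[i] := 1 followed by y[i] := x[(i-1) mod N]:

```lean
structure State where
  x  : Nat → Nat        -- x i: process i's flag
  y  : Nat → Option Nat -- y i: the value process i read; none before reading
  pc : Nat → Nat        -- 0 = about to write, 1 = about to read, 2 = done

-- Pointwise function update: f with index i remapped to v.
def upd {α : Type} (f : Nat → α) (i : Nat) (v : α) : Nat → α :=
  fun j => if j = i then v else f j

def Init (s : State) : Prop :=
  (∀ i, s.x i = 0) ∧ (∀ i, s.y i = none) ∧ (∀ i, s.pc i = 0)

-- Some process i < N takes its next step: first x[i] := 1,
-- then y[i] := x[(i-1) mod N].
def Next (N : Nat) (s s' : State) : Prop :=
  ∃ i, i < N ∧
    ((s.pc i = 0 ∧ s' = { s with x := upd s.x i 1, pc := upd s.pc i 1 }) ∨
     (s.pc i = 1 ∧
      s' = { s with y := upd s.y i (some (s.x ((i + N - 1) % N))),
                    pc := upd s.pc i 2 }))

-- The property from Teaching Concurrency: once every process has finished,
-- at least one of them read a 1.
def Safety (N : Nat) (s : State) : Prop :=
  (∀ i, i < N → s.pc i = 2) → ∃ i, i < N ∧ s.y i = some 1
```

The shape mirrors a TLA+ spec directly: an initial-state predicate, a next-state relation as a disjunction of actions, and an invariant stated over states.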
Does the Lean map 100% to the TLA+? I don’t know. While I did enjoy multiple readings of Lamport’s book “A Science of Concurrent Programs”, which goes into the mathematical foundations of TLA+, that does not qualify me to answer the question. I would tend towards a definite “no”, however, because, for example, apart from the stop action, the spec is not stuttering-insensitive. In this example it does not matter, but it would matter if the refinement proof of a finer-grained implementation required the coarser-grained spec to be stuttering-insensitive (meaning the finer-grained version can take any number of additional steps without changing the state of the coarser-grained spec).
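In TLA+ this insensitivity comes for free: a spec is written with [Next]_vars, which means every step is either a Next step or one that leaves vars unchanged. A hand-written Lean transition relation has to add that stuttering disjunct explicitly; a one-line sketch, over an abstract state type σ:

```lean
-- Allow a step to either take a Next transition or stutter (change nothing),
-- mimicking TLA+'s [Next]_vars.
def NextOrStutter {σ : Type} (Next : σ → σ → Prop) (s s' : σ) : Prop :=
  Next s s' ∨ s' = s
```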
I look forward to an expert writing a kind of TLA+ library for Lean to verify these kinds of things.
What was interesting about this simple translation is that it highlighted a problem in my original spec, which made assertions on NoVal (a kind of none variant of an option). The AI told me it could not write a proof for that part, because the invariant was not strong enough, so it had to be strengthened (here is the new TLA+, and here is the Lean translation).
The story was actually even more interesting: my initial “fix” was too strong an invariant, but I finally found the right solution (or so I hope) this time.
And that’s a perfect example of the challenges ahead, identified by Martin Kleppmann: “As the verification process itself becomes automated, the challenge will move to correctly defining the specification: that is, how do you know that the properties that were proved are actually the properties that you cared about?”
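What “not strong enough” means can be made precise. An invariant provable by induction over steps must be inductive, i.e. preserved by Next on its own; a weak invariant often has to be strengthened with extra facts before the induction goes through. A generic sketch (all names mine, over an abstract spec):

```lean
-- States reachable from Init by repeatedly taking Next steps.
inductive Reachable {σ : Type} (Init : σ → Prop) (Next : σ → σ → Prop) : σ → Prop where
  | init : ∀ s, Init s → Reachable Init Next s
  | step : ∀ s s', Reachable Init Next s → Next s s' → Reachable Init Next s'

-- An inductive invariant holds initially and is closed under Next.
def Inductive {σ : Type} (Init : σ → Prop) (Next : σ → σ → Prop)
    (Inv : σ → Prop) : Prop :=
  (∀ s, Init s → Inv s) ∧ ∀ s s', Inv s → Next s s' → Inv s'

-- Any inductive invariant holds on all reachable states. "Strengthening"
-- means finding an inductive StrongInv that implies the invariant you want.
theorem inv_reachable {σ : Type} {Init : σ → Prop} {Next : σ → σ → Prop}
    {Inv : σ → Prop} (hind : Inductive Init Next Inv) :
    ∀ s, Reachable Init Next s → Inv s := by
  intro s hr
  induction hr with
  | init s hs => exact hind.1 s hs
  | step s s' _ hn ih => exact hind.2 s s' ih hn
```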
The Lean Implementation
The implementation of the spec is the runIO function, which should roughly match the previous Rust implementation of my original TLA+ spec.
In this example, the implementation is actually simpler than the spec. But I think that is just because of the kind of example used. If you imagine a fuller implementation, one that would come with all kinds of sequential code, then the high-level spec, focusing on concurrent interactions only, would be much smaller than the implementation.
This is the Lean code that looks a lot like, well, code. It re-uses the same data structure, State, as the spec, but shares it among concurrent tasks protected by a mutex.
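A hypothetical sketch of that shape (this is not the article’s actual runIO; the names and the State simplification are mine): N tasks mutate one shared spec-level state through an IO.Ref, whose take/set discipline makes modify atomic, standing in for the mutex:

```lean
-- Simplified stand-in for the spec's State: per-process flags and reads.
structure State where
  x : Nat → Nat
  y : Nat → Nat

def runIO (N : Nat) : IO State := do
  let ref ← IO.mkRef ({ x := fun _ => 0, y := fun _ => 0 } : State)
  let tasks ← (List.range N).mapM fun i => IO.asTask do
    -- Two separate atomic sections, so the write and the read of process i
    -- may interleave with other processes, as in the spec.
    ref.modify fun s => { s with x := fun j => if j = i then 1 else s.x j }
    ref.modify fun s =>
      { s with y := fun j => if j = i then s.x ((i + N - 1) % N) else s.y j }
  for t in tasks do
    let _ ← IO.ofExcept t.get  -- propagate any task error
  ref.get
```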
Also, the idea that this implementation is a refinement of the spec is (supposed to be) proven here. I have to acknowledge that I am still learning Lean, and I am not 100% sure the proof is what I think it is, but for now it will do.
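Abstractly, what such a refinement proof should establish is a simulation: an abstraction function from implementation states to spec states under which initial states map to initial states and every implementation step maps to a spec step or a stutter. A sketch of that shape (names mine, not the article’s proof):

```lean
-- Impl refines Spec via abstraction function abs: initial states are
-- preserved, and each concrete step is a spec step or a stutter.
structure Refines {Impl Spec : Type}
    (InitI : Impl → Impl → Prop → Prop) : Prop where
```

Apologies, stated properly:

```lean
structure Refines {Impl Spec : Type}
    (InitI : Impl → Prop) (NextI : Impl → Impl → Prop)
    (InitS : Spec → Prop) (NextS : Spec → Spec → Prop)
    (abs : Impl → Spec) : Prop where
  init : ∀ c, InitI c → InitS (abs c)
  step : ∀ c c', NextI c c' → NextS (abs c) (abs c') ∨ abs c' = abs c
```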
An interesting side story here is that the initial implementation did not re-use the data structure from the spec; instead it just used a single x variable, similar to the Rust code, and trying to prove that this initial implementation refined the spec took the AI on a wild-goose chase. When I switched to a more expensive model, it just went on a more sophisticated wild-goose chase (it was noticeably better at using a tool at its disposal: lean-lsp-mcp).
The “open sesame” prompt then was: “so can you not simplify the implementation and the refinement proof simply by having runIO use ConcurrencySpec.State, protected by a mutex?”
To which the AI replied:
That is a brilliant simplification! Since `State` (from `ConcurrencySpec`) is an abstract mathematical state that directly represents functions from processes to variables, we can just use `IO.Ref (State N)` instead of two concrete arrays in `runIO`!
This eliminates the need for the `ConcreteState` wrapper, `updateConcrete`, `concrete_reachable`, `ghost_relation`, and all the tedious array bounds checking (`setIfInBounds`) proofs that Lean was struggling with.
Again, given my limited experience with Lean so far, I am taking the above with a grain of salt, and just hoping the latest proof really does prove what it says.
Takeaways
The strength of TLA+ is that it is a high-level way to describe concurrent algorithms. Its weakness is that it lacks a formal relationship to code implementing the high-level spec.
The strength of Lean is that the math is the code. Its weakness is that it can be quite low-level, since it requires describing the entire program logic, not just some concurrent part of it. But that can be addressed by splitting the Lean into a high-level TLA-like spec and a lower-level code-like implementation, the two linked by a refinement proof.
And this opens the door to the future: formal software written by AI and specified at a high level by humans, the antithesis of autonomous slop.