AI thinks your code is correct, but it can not prove it.

9 min read Original article ↗

When we talk about software development, there is one concept that sometimes does not receive the importance it deserves. That concept is trust. Every decision made by an engineering team, from approving a Pull Request to deploying to production, is ultimately a bet on the correctness of the code. Naturally, the more evidence we have that the code behaves correctly, the more willing we are to deliver it to real users.

This concept of trust is built incrementally. In this sense, one of the first signs that a piece of code might be correct is the fact that it compiles.

Compiling ensures that the code is syntactically and structurally valid. The compiler performs checks such as making sure that you are not calling methods that do not exist, that variables and functions type check, and that the structure of the program is consistent.

A program compiling is a good first sign. However, compilation does not tell us anything about whether a program fulfills its purpose. This is where testing comes into play.

There are many types of tests: unit, integration, end-to-end... But, in general, the idea is the same: passing the tests means that the code behaves as you expect in the cases you have chosen to test.

At this point, the level of confidence rises again. The code is not only structurally valid, but for certain inputs it produces certain outputs that are correct. The problem at this point is that the coverage of the cases that can be tested is finite. With formal verification, we can go further.

Formal verification ensures that the fundamental properties of the system are fulfilled for all possible inputs. This can prevent the bugs that bite harder: the ones that show up in production at a much later stage, when some specific (and often rare) conditions are met.

Vanilla Java Banking App

To illustrate the added value of formal verification, we will use a simple banking app written in Java. The domain of this application is easy to understand: accounts with balances, deposit, withdrawal, and transfer operations.

In our context, the banking system must comply with certain basic properties. However, we will later see why these properties may differ depending on the context or user needs:

  • The account balance can never be negative
  • Transfers retain the total money in the bank, i.e. what leaves from one account goes into another

Based on these premises, let's see what kind of bugs would be easy to miss.

Fee incurred in the red

Let's look at the withdrawal method. At first glance, it does everything right: it validates the amount, checks for sufficient funds, and subtracts the withdrawal. But there is a subtle detail, a fee is applied after the balance check, without being included in the validation:

public void withdraw(int amount) {
      if (amount <= 0) {
          throw new IllegalArgumentException("Withdrawal must be positive");
      }

      // Withdraw fee
      int fee = 2;

      if (balance < amount) {
          throw new IllegalStateException("Insufficient funds");
      }

      balance -= amount;
      balance -= fee;
  }

There is a validation that throws an error when balance < amount, to make sure that our invariant of "account balance can never be negative" is preserved, but it does not take the fee into account. Let's suppose that a given user has $10 in their account. A priori, they can withdraw that $10, but if we consider the fee, the account would be left with a negative balance of -$2. This violates our stated properties.

Money conservation in transfers

Along the same line of reasoning, we may even find cumulative errors. For example, just as there is a fee for withdrawing money, there may be cases where there is a fee for making a transfer. In addition, it is common for the transfer method to internally call the withdrawal method to take money out of the source account. In that case, we could even encounter two fees. Let's look at it in the code:

public void transfer(String fromId, String toId, int amount) {
      Account from = getAccount(fromId);
      Account to = getAccount(toId);
      // Transfer's fee
      int fee = 1;
      from.withdraw(amount + fee);
      to.deposit(amount);
      transactions.add(new Transaction(
          TransactionType.TRANSFER, fromId, toId, amount
      ));
  }

Let's imagine that the source account has $11, the destination account has $50, and $10 is transferred. In this case, the source account is left with $11 -$10 (money transferred) -$2 (withdrawal fee) -$1 (transfer fee) = -$2. Meanwhile, the destination account is left with $60. The total balance of the system goes from $61 ($11 + $50) to $58 (-$2 + $60). Once again, one of our properties is violated.

Arithmetic overflows

There is an even more subtle category of bugs: those where the code is apparently correct line by line, but fails due to the limitations of the language itself. Unlike the previous cases, here the problem is not one of local or global context; it is a semantic subtlety of the language that even an LLM with full system context could overlook. Consider a simple method that calculates the total balance of an account by adding a bonus:

  public int calculateTotalBalance(int balance, int bonus) {
      return balance + bonus;
  }

The code appears to be fine, but if balance = Integer.MAX_VALUE (2,147,483,647) and bonus = 1, the result is not 2,147,483,648 but -2,147,483,648. Java does not throw any exceptions; integer arithmetic simply overflows and wraps around. An astronomical balance quietly becomes a negative one.

These types of errors are especially dangerous because they also escape the code review of an LLM.

How would Predictable Code act in these cases?

The first thing that might come to mind after seeing these examples is the idea of conducting more exhaustive tests and trying to cover a broader spectrum of possibilities. That is their key limitation: tests can verify specific examples. Ensuring the correctness of a system requires something different: that certain properties are satisfied for all possible inputs, which in most cases is inherently infinite. This is only achievable through formal verification. Formal verification carries out logical reasoning through the program that proves the implementation of a program does not do anything that would go against the specification (i.e. the set of properties, requirements and invariants). If the program violates some property, formal verification can also often help find concrete inputs that show this violation: these are counterexamples.

Next, we will see how Predictable Code would behave in the above situations. We will also illustrate how it would act in a normal situation where there are no errors.

Withdraw fee

[ERROR] InvariantViolation: balance_non_negative
	Intended invariant ∀ withdraw(amount): balance_before ≥ amount + fee
    Method:   Account.withdraw
    Severity: HIGH

    Counterexample:
      balance = 10, withdraw(10), fee = 2 → balance_after = -2

    Violation: balance_after < 0
    Confidence: 100%

    Suggested fix: require(balance >= amount + fee)

Transfer and money conservation

[CRITICAL] InvariantViolation: total_balance_preserved
	Expected global invariant: Σ balances_before == Σ balances_after
    Method:   Bank.transfer
    Severity: CRITICAL

    Counterexample:
      Alice = 11, Bob = 50, transfer(10)
      Alice_after = -2, Bob_after = 60
      Σ before = 61, Σ after = 58 → 3 units lost

    Violation: Σ balances_before ≠ Σ balances_after
    Confidence: 100%

    Suggested fix: ensure fee is applied once per transaction 

Overflows

[ERROR] InvariantViolation: balance_non_negative_after_operation
      Expected invariant: ∀ calculateTotalBalance(balance, bonus): result ≥ 0
      Method:   Account.calculateTotalBalance
      Severity: HIGH

      Counterexample:
        balance = 2147483647, bonus = 1
        result = -2147483648 (integer overflow)

      Violation: result < 0
      Confidence: 100%

      Suggested fix: use long or validate input bounds before operation

No bugs case

There may be a situation in which Predictable Code analyzes code without errors. In such a case, we could expect a response like the following.

INFERRED INVARIANTS:
	INV-1: Account.balance >= 0

	INV-2: withdraw(amount > 0 && amount <= balance)
       strictly decreases Account.balance

	INV-3: transfer(from, to, amount > 0 && amount <= from.balance)
       preserves total balance:
       from.balance + to.balance is constant

All invariants are preserved across all execution paths.

Counterexamples found: NONE
Search space coverage: COMPLETE

Predictable Code explored all reachable states derived
from the code semantics and found no counterexamples
violating inferred invariants.

System consistency: HIGH
Behavior determinism: GUARANTEED
Hidden side effects: NONE DETECTED

System classification: PREDICTABLE
Verification confidence: HIGH

The analyzed system is internally consistent.
All inferred semantic properties are preserved.
No unexpected or contradictory behaviors were detected.

Beyond specific cases

The examples we have seen in this article are intentionally simple. Their purpose is to illustrate the idea in a clear and accessible way. However, in real systems, the errors that formal verification detects are often considerably more subtle and difficult to spot. And this is where a limitation of technologies such as LLMs becomes apparent: they can detect problems within their context window, a function, a file, perhaps a module. But the most critical properties of a system are global: they cross boundaries between components, between modules, between services. The preservation of money in a transfer is not a property of withdraw or deposit separately, it is a property of the entire system. Formal verification operates at exactly that level.

Throughout this article, we have seen a specific use case. However, Predictable Code addresses the problem in general. On the one hand, the behavior is language-independent. Although we have seen an example in Java, the same invariant detection logic applies to any codebase. A service in Kotlin, an API in Python, a legacy system in C++, etc. The same applies to the context of the code, which in this case was banking but can be extrapolated to any field.

Furthermore, Predictable Code is adaptable to the user's needs. For example, the user of the application we have presented could say something like “at my bank, I need to be allowed to be in the red up to X$ or for Y days.” In that case, those properties would become axiomatic for Predictable Code and, therefore, not flagged as violations in subsequent iterations.

Ultimately, at Predictable Machines, our approach is that any team, regardless of the language, can mathematically prove that their code does what it says it does.


Is your AI-generated code actually correct?

If your engineering teams are shipping AI-generated code, the question isn't whether it runs. It's whether it's aligned with your specifications and correct at the implementation level. Predictable Code verifies that automatically, across any language, giving your organization mathematical proof that the code does exactly what it's supposed to. Learn more about Predictable Code →