My Agents Crashed the Economy, So I Taught Them About Salads

Like most projects, it started as an overly ambitious idea: what if I could create a simulation where corporations and people interact, and capture emergent patterns like pricing strategies, inflation, and creative destruction?

Since these things are complex, my attempts to create a functional micro economy failed miserably. The hardest part was getting agents to act rationally: either everyone became poor or every corporation went bankrupt.

I used hard-coded rules to determine what to do:

if corp.price > avg(all_prices):
    lower_price_by(0.3)  # cut the price by a hard-coded step

These simple instructions couldn’t capture the nuanced reality of business decisions. I needed an alternative approach, and that’s when I stumbled upon reinforcement learning.

The key concept in RL is the Markov Decision Process (MDP), a framework for modeling decision-making where outcomes are partly random and partly under the decision maker’s control. The goal is to find a good “policy” for the decision maker (in my case, corporate agents).

A policy is a rule that says “when in state X, do action Y,” which sounded perfect for my problem.
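In code, a policy can be as small as a lookup table from state to action. Here’s a hypothetical sketch (the states and actions are just placeholders, borrowed from the example coming up):

policy = {
    "HUNGRY": "eat",
    "TIRED": "sleep",
    "RESTED": "eat",
}

action = policy["HUNGRY"]  # -> "eat"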

Here’s how it works, explained through examples (no formulas, I promise).

Think of a state as a snapshot of your situation. Are you hungry? Tired? Rested? That’s your current state. In each state, you can take different actions.

Let’s use a person as an example:

HUNGRY: ["eat", "sleep", "work"]
TIRED: ["eat", "sleep", "work"]
RESTED: ["eat", "sleep", "work"]

Each action has a reward, a number representing how good that choice is:

HUNGRY: [("eat", 10), ("sleep", 5), ("work", -5)]
TIRED: [("eat", 5), ("sleep", 10), ("work", -10)]
RESTED: [("eat", 5), ("sleep", 0), ("work", -10)]

When you’re hungry, eating gives +10 reward (great choice!), while working gives -5 (bad idea). When tired, sleeping gives +10, and working gives -10 (terrible idea).

Let’s say you’re HUNGRY and you eat. What state do you end up in?

  • The first time: HAPPY (a new state you’ve discovered!)

  • The second time: HAPPY

  • The third time: OVERFULL (ugh, you overate—another new state)

  • The fourth time: HAPPY

After eating 10 times when HUNGRY, you’ve learned:

  • Ended up HAPPY: 7 times (70%)

  • Ended up OVERFULL: 3 times (30%)
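If you wanted to tally those outcomes in code, a minimal sketch using the counts from the story above could look like this:

from collections import Counter

# The ten outcomes observed after eating while HUNGRY
observed = ["HAPPY"] * 7 + ["OVERFULL"] * 3

counts = Counter(observed)
probabilities = {state: n / len(observed) for state, n in counts.items()}
print(probabilities)  # {'HAPPY': 0.7, 'OVERFULL': 0.3}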

Now you can represent eating with both outcomes and their probabilities:

HUNGRY: {
    "eat": [(10, "HAPPY", 0.7), (-2, "OVERFULL", 0.3)],
    "sleep": [(5, "HUNGRY", 1.0)],
    "work": [(-5, "HUNGRY", 1.0)]
}

When you eat while HUNGRY, there’s a 70% chance you’ll feel HAPPY (reward: +10) and a 30% chance you’ll feel OVERFULL (reward: -2).

The expected reward isn’t just the best-case +10; it’s the weighted average based on what actually happens:

10 × 0.7 + (-2) × 0.3 = 7 - 0.6 = 6.4

So eating when HUNGRY gives you an average reward of 6.4, not 10. This is how you learn the true value of each action: through repeated experience!
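The same calculation as a tiny sketch in code, using the (reward, next_state, probability) tuples from above:

outcomes = [(10, "HAPPY", 0.7), (-2, "OVERFULL", 0.3)]

expected_reward = sum(prob * reward for reward, _next_state, prob in outcomes)
print(round(expected_reward, 2))  # 6.4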

If you just pick the action with the highest immediate reward, you’ll make terrible long-term decisions. Let me show you why.

Imagine you’re HUNGRY. You can eat pizza (reward: +10) or salad (reward: +6).

Pizza wins, right?

Not so fast fatty!

HUNGRY: {
    "eat_pizza": [(10, "HAPPY", 0.7), (5, "UNHEALTHY", 0.3)],
    "eat_salad": [(6, "HAPPY", 1.0)]
},
UNHEALTHY: {
    "eat_pizza": [(-10, "SICK", 0.9), (5, "UNHEALTHY", 0.1)],
    "eat_salad": [(4, "UNHEALTHY", 0.8), (8, "HUNGRY", 0.2)]
}

Pizza tastes better now (+10 vs +6), but there’s a 30% chance it makes you UNHEALTHY. And once you’re UNHEALTHY, eating more pizza becomes disastrous: a 90% chance of getting SICK, with a -10 reward.

Salad is boring but safe. It always makes you HAPPY, with no risk of the UNHEALTHY spiral.

So which is actually the better choice? To answer that, we need to look beyond immediate rewards and consider where each action leads.

Value iteration calculates the long-term value of each state by answering: “What’s the expected total reward I can get starting from this state, assuming I act optimally from here on?”

Here’s the key insight: the value of being in a state isn’t just the immediate reward; it’s that reward PLUS the value of wherever you end up next.

Think about it: being HUNGRY is pretty good because you can eat and feel HAPPY. But being SICK is terrible because you’re stuck with bad options.

There’s one more wrinkle: future rewards count less than immediate ones. A salad tomorrow is better than a salad next year (you might not even be hungry by then). We handle this with a discount factor that makes distant rewards worth less.
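To make that concrete, here’s a minimal value-iteration sketch in Python, written against the same table format used above (state -> action -> list of (reward, next_state, probability) outcomes). The 0.9 discount factor is my assumption, and since only a fragment of the full transition table appears in this post, the actual call is left as a comment:

def value_iteration(mdp, gamma=0.9, tolerance=1e-6):
    # mdp: {state: {action: [(reward, next_state, probability), ...]}}
    # Start every state at value 0, then repeatedly apply the backup:
    #   V(state) = best action's expected (reward + gamma * V(next_state))
    values = {state: 0.0 for state in mdp}
    while True:
        biggest_change = 0.0
        for state, actions in mdp.items():
            best = max(
                sum(p * (r + gamma * values[next_state]) for r, next_state, p in outcomes)
                for outcomes in actions.values()
            )
            biggest_change = max(biggest_change, abs(best - values[state]))
            values[state] = best
        if biggest_change < tolerance:
            return values

# values = value_iteration(full_mdp)  # full_mdp: the complete transition table for every state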

After running value iteration, we get a table showing the long-term value of each state:

State      | Value
-----------|----------
HUNGRY     | 25.4
HAPPY      | 30.2
UNHEALTHY  | 8.5
SICK       | -5.0

These numbers mean: “If I’m in this state and make optimal choices going forward, this is my expected total reward.”

Now we can finally answer the pizza vs salad question.

For each state, we evaluate each action by considering:

  • The immediate reward

  • Where it might lead (and the probabilities)

  • The long-term value of those next states

When HUNGRY:
- Eat pizza: Immediate +10, but 30% risk of UNHEALTHY (value: 8.5)
  → Long-term value: 29.8
- Eat salad: Immediate +6, guaranteed HAPPY (value: 30.2)
  → Long-term value: 33.2

Salad wins! Not because it tastes better, but because it leads to better long-term outcomes.
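As a sanity check, both long-term values above can be reproduced from the state-value table if we assume a discount factor of 0.9 (I never stated the exact value, so treat this as a guess that happens to fit):

GAMMA = 0.9                                 # assumed discount factor
values = {"HAPPY": 30.2, "UNHEALTHY": 8.5}  # long-term state values from the table above

# Q(HUNGRY, eat_pizza): expected immediate reward + discounted expected next-state value
q_pizza = (0.7 * 10 + 0.3 * 5) + GAMMA * (0.7 * values["HAPPY"] + 0.3 * values["UNHEALTHY"])
# Q(HUNGRY, eat_salad): guaranteed HAPPY
q_salad = 6 + GAMMA * values["HAPPY"]

print(round(q_pizza, 1), round(q_salad, 1))  # 29.8 33.2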

We do this calculation for every state and end up with a policy, a simple rule for what to do in each situation:

State      | Best Action
-----------|------------
HUNGRY     | eat_salad
UNHEALTHY  | eat_salad
HAPPY      | eat_salad
SICK       | rest

This is the optimal strategy. No complex rules, no nested if-statements!
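For completeness, here’s a sketch of how that policy table can be read off the state values: for each state, pick the action with the best expected (reward + discounted future value). Same assumptions and table format as the value-iteration sketch above:

def extract_policy(mdp, values, gamma=0.9):
    policy = {}
    for state, actions in mdp.items():
        policy[state] = max(
            actions,
            key=lambda action: sum(
                p * (r + gamma * values[next_state])
                for r, next_state, p in actions[action]
            ),
        )
    return policy

# policy = extract_policy(full_mdp, value_iteration(full_mdp))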

Me writing a blog about how salad is the best strategy according to value iteration while ordering a pizza.
