Lessons from AI Safety for Businesses – Miloš Švaňa

Businesses and organizations often seem misaligned with our actual needs and wants. In the second installment of the Reading Club series, I want to explore this issue from the perspective of AI safety research.

We want our businesses and organizations to be aligned with our values. The same can be said about powerful AI systems. AI safety researchers call it the alignment problem. One of the most important scientific papers addressing the alignment problem is Risks from Learned Optimization. It discusses how optimization processes used to train AI models can create models that are themselves optimizers, and how these optimizers can be misaligned with the original goal the AI system was trained to achieve. Let’s unpack this idea step-by-step.

Optimizers

An optimizer is anything that minimizes or maximizes something. For example, when you ask Google Maps to navigate you to some destination, the application gives you the shortest possible route — it minimizes the total travel time. In other words, the route planning algorithm in Google Maps is an optimizer.

Businesses and organizations are optimizers, too. For instance:

An ideal government wants to maximize the well-being of its citizens.
A publicly traded company wants to maximize shareholder value.
A charity might want to minimize the number of people dying from easily preventable diseases.
An activist group might wants to minimize air pollution in a given region.

Most AI models are trained by a similar optimizer. Let’s say we want to train an AI model that predicts the average temperature for the next day. We can evaluate the quality of such a model by calculating how big an error the model makes when we use it on historical data. For example, if the model predicts an average temperature of 19 degrees Celsius and the real temperature that day was 17 degrees Celsius, the model made an error of 2 degrees. During training, the optimizer starts with a randomly configured AI model. Then it minimizes the total error the model makes on some example dataset by tweaking its various parameters.

Mesa Optimizers

Risks from Learned Optimization suggests that when we train an AI model using some optimizer, the trained model itself can become an optimizer. The authors call the optimizer embedded in the trained AI model a mesa optimizer (the word “mesa” is Greek for “below”; it’s the opposite of “meta” or “above”).

To grasp how a mesa optimizer looks, imagine you want to train your dog to follow certain commands. This is an optimization problem: you want to maximize the percentage of cases when the dog correctly follows the command you give it. You are the optimizer, and the dog is an equivalent of an AI model. As you are training the dog by giving it treats for making progress, the dog might start optimizing for a different goal: getting the largest amount of treats. It becomes a mesa optimizer.

An example related to AI might be a system trained to find the shortest path through a maze. The base optimizer might train such an AI system by rewarding it with a positive score if it makes progress. This training process can lead to an AI system that itself implements some well-known optimization algorithm for finding the shortest path (for example, breadth-first search).

In the maze example, the base optimizer and the mesa optimizer are aligned with each other. They have the same goal — finding the shortest path through a maze.

But such an alignment is by no means guaranteed. To demonstrate this, we can return to the dog training example. The dog developed its own goal that is only partially compatible with ours. As long as following our commands is the best way of getting the most treats, the dog and the trainer are aligned. But let’s say that the dog then discovers the place where all the treats are stored and eats enough to feel satisfied. Then it might stop caring about the trainer’s commands. The base optimizer (the trainer) and the mesa optimizer (the dog) are now misaligned.

Proxies

Misalignment usually happens when the goal of the mesa optimizer is only a proxy of the base optimizer’s goal — it more or less correlates with the main goal, but it is distinct from it.

Proxies are one of the reasons behind the less-than-ideal behavior of many organizations. We usually measure the performance of organizations using KPIs — key performance indicators. For example, publicly traded companies might use KPIs like quarterly revenue, and the research output of universities is measured by the number of papers in high-impact journals or by the number of citations.

But all these KPIs are just proxies. They more or less correlate with what we want from our organizations, but they don’t describe our wishes exactly. It’s easy to imagine how this leads to suboptimal results. For example, the emphasis on quarterly financial indicators in companies shifts focus towards short-term gains, and the emphasis on the number of publications or the number of citations incentivises scientists to deploy methods like salami slicing.

We can say that companies and research universities are mesa optimizers acting on behalf of their shareholders or the entire society. As mesa optimizers, these organizations learned to optimize for goals that are potentially misaligned with what society and the shareholders actually want.

What makes mesa optimizers more likely?

After defining mesa optimizers, Risks from Learned Optimization explores two important questions, the first of which is: What factors make the emergence of mesa optimizers in AI systems more likely? The paper lists many factors, but I don’t intend to overwhelm you, so I’ll focus on just one: task complexity.

If we want to train an AI model that solves a simple task, the trained model will likely learn a simple algorithm. On the other hand, if the task is complex, the model might need to employ a more complex approach, including optimization.

Can we say something similar about businesses and organizations? Although I have no proof of this, I’d say that such an analogy is at least plausible and worth investigating.

Say we have two organizations: a small amateur sports club that is ultimately controlled by its members and a research university. Members of the sports club have a simple goal: engaging in a given sport as a hobby. It’s hard to imagine how such a sports club could become an optimizer that wants to minimize or maximize something.

On the other hand, a research university involves many stakeholders with different goals: the society as a whole, researchers, teachers, students, and sponsors. The goal of a university is complex and has many components. It includes education, research, community events, or regional economic growth. And as I already demonstrated, universities might become optimizers with their own goals misaligned with what the stakeholders want.

What makes misalignment more likely?

If mesa optimizers were always aligned with the base optimizer, their existence wouldn’t be an issue. But as I already illustrated in several examples, this alignment is not guaranteed. The second question examined in the paper is what factors increase the probability of a mesa optimizer becoming misaligned. To stay concise, I’ll again focus on a single factor: unidentifiability of the real goal.

Say we are training an image classifier to distinguish huskies from German shepherds. But the example images we use to train this classifier always depict huskies in snowy environments and German shepherds surrounded by greenery. The classifier becomes confused. What do we want from it? Do we want it to distinguish between the two dog breeds or the two types of environment? If the classifier happens to develop into a mesa optimizer, it might randomly choose to focus on the environment, not the dogs.

When it comes to organizations, one of the reasons why we use KPIs as proxies is that exactly describing what we want from an organization (but also from an AI system) is difficult, maybe even impossible. Can we exactly define what a university is supposed to do? Better yet, can we formulate the goal mathematically so that we can objectively evaluate whether a university is doing well? If you have a good solution, I’d like to hear about it. For now, we are stuck with KPIs, whose use often leads to outcomes no one wished for, or in other words, misalignment.

Deceptive alignment

Let me end the discussion on Risks from Learned Optimization by saying a few words on deceptive alignment. Deceptive alignment is an especially bad form of misalignment: the AI system might appear aligned during training, but once we deploy the system in the real world, it suddenly goes rogue.

One analogue for deceptive alignment in the world of organizations might be the concept of enshittification. Many early-stage companies seem to be exceptionally user-friendly. Think early Google, Uber, or Netflix. At the beginning, these companies looked like they genuinely cared about user satisfaction. But then something changed. Maybe investors actually wanted to start seeing some returns. So the focus shifted. Suddenly, the companies started collecting much more personal data. They started serving more ads. Prices went up. Account sharing was forbidden. The once consumer-friendly companies became misaligned with what the users actually want.

AI safety is not only about AI safety

I’ve previously stated that AI safety utilizes knowledge from many other fields, including philosophy, economics, and decision theory. I hope that as you are reaching the end of this article, you are being convinced that the relationship works both ways. Just as AI safety research benefits from other fields, other fields can benefit from AI safety research.

The world of organizations and businesses can potentially benefit the most. Many are discussing the arrival of artificial superintelligence. But in some sense, it’s already here. Organizations are made of more than one person. If you agree that two or more people are, at least on average, more intelligent than a single person, then any organization is by definition superintelligent. Solving the alignment problem in the context of AI might help us solve the alignment problem in the context of organizations and vice versa.

What do you think? Do you think the analogies between AI systems and organizations make sense? Where are they potentially failing?