AI as the engine, humans as the steering wheel


2025 Feb 28


Special thanks to Devansh Mehta, Davide Crapis and Julian Zawistowski for feedback and review, Tina Zhen, Shaw Walters and others for discussion.

If you ask people what they like about democratic structures, whether governments, workplaces, or blockchain-based DAOs, you will often hear the same arguments: they avoid concentration of power, they give their users strong guarantees because there isn't a single person who can completely change the system's direction on a whim, and they can make higher-quality decisions by gathering the perspectives and wisdom of many people.

If you ask people what they dislike about democratic structures, they will often give the same complaints: average voters are not sophisticated, because each voter only has a small chance of affecting the outcome and so few voters put high-quality thought into their decisions; and you often get either low participation (making the system easy to attack) or de-facto centralization, because everyone just defaults to trusting and copying the views of some influencer.

The goal of this post is to explore a paradigm that could perhaps use AI to get us the benefits of democratic structures without the downsides: "AI as the engine, humans as the steering wheel". Humans provide only a small amount of information to the system, perhaps only a few hundred bits, but each of those bits is well-considered and very high-quality. AI treats this data as an "objective function", and tirelessly makes a very large number of decisions, making a best effort to fit these objectives. In particular, this post will explore an interesting question: can we do this without enshrining a single AI at the center, instead relying on a competitive open market that any AI (or human-AI hybrid) is free to participate in?


Why not just put a single AI in charge?

The easiest way to insert human preferences into an AI-based mechanism is to make a single AI model, and have humans feed their preferences into it somehow. There are easy ways to do this: you can just put a text file containing a list of people's instructions into the system prompt. Then you use one of many "agentic AI frameworks" to give the AI the ability to access the internet, hand it the keys to your organization's assets and social media profiles, and you're done.

After a few iterations, this may end up good enough for many use cases, and I fully expect that in the near future we are going to see many structures involving AIs reading instructions given by a group (or even real-time reading a group chat) and taking actions as a result.

Where this structure is not ideal is as a governing mechanism for long-lasting institutions. One valuable property for long-lasting institutions to have is credible neutrality. In my post introducing this concept, I listed four properties that are valuable for credible neutrality:

  1. Don't write specific people or specific outcomes into the mechanism
  2. Open source and publicly verifiable execution
  3. Keep it simple
  4. Don't change it too often

An LLM (or AI agent) satisfies 0/4. The model inevitably has a huge amount of specific people and outcome preferences encoded through its training process. Sometimes this leads to the AI having preferences in surprising directions, eg. see this recent research suggesting that major LLMs value lives in Pakistan far more highly than lives in the USA (!!). It can be open-weights, but that's far from open-source; we really don't know what devils are hiding in the depths of a model. It's the opposite of simple: the Kolmogorov complexity of an LLM is in the tens of billions of bits, about the same as that of all US law (federal + state + local) put together. And because of how rapidly AI is evolving, you'll have to change it every three months.

For this reason, an alternative approach that I favor exploring for many use cases is to make a simple mechanism be the rules of the game, and let AIs be the players. This is the same insight that makes markets so effective: the rules are a relatively dumb system of property rights, with edge cases decided by a court system that slowly accumulates and adjusts precedents, and all of the intelligence comes from entrepreneurs operating "at the edge".

The individual "game players" can be LLMs, swarms of LLMs interacting with each other and calling into various internet services, various AI + human combinations, and many other constructions; as a mechanism designer, you do not need to know. The ideal goal is to have a mechanism that functions as an automaton - if the goal of the mechanism is choosing what to fund, then it should feel as much as possible like Bitcoin or Ethereum block rewards.

The benefits of this approach are:

  • It avoids enshrining any single model into the mechanism; instead, you get an open market of many different participants and architectures, all with their own different biases. Open models, closed models, agent swarms, human + AI hybrids, cyborgs, infinite monkeys, etc, are all fair game; the mechanism does not discriminate.
  • The mechanism is open source. While the players are not, the game is - and this is a pattern that is already reasonably well-understood (eg. political parties and markets both work this way)
  • The mechanism is simple, and so there are relatively few routes for a mechanism designer to encode their own biases into the design
  • The mechanism does not change, even if the architecture of the underlying players will need to be redesigned every three months from here until the singularity.

The goal of the steering mechanism is to provide a faithful representation of the participants' underlying goals. It only needs to provide a small amount of information, but it should be high-quality information.

You can think of the mechanism as exploiting an asymmetry between coming up with an answer and verifying the answer. This is similar to how a sudoku is difficult to solve, but it's easy to verify that a solution is correct. You (i) create an open market of players to act as "solvers", and then (ii) maintain a human-run mechanism that performs the much simpler task of verifying solutions that have been presented.
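
To make the asymmetry concrete, here is a minimal sudoku verifier (a generic sketch, not code from the post): checking a filled-in grid takes a few lines of straightforward code, while producing the grid in the first place requires search.

```python
def is_valid_solution(grid):
    """Check a completed 9x9 sudoku grid (a list of 9 lists of 9 digits)."""
    def ok(cells):
        # Each group must contain the digits 1-9 exactly once.
        return sorted(cells) == list(range(1, 10))

    rows_ok = all(ok(row) for row in grid)
    cols_ok = all(ok([grid[r][c] for r in range(9)]) for c in range(9))
    boxes_ok = all(
        ok([grid[r + dr][c + dc] for dr in range(3) for dc in range(3)])
        for r in (0, 3, 6)
        for c in (0, 3, 6)
    )
    return rows_ok and cols_ok and boxes_ok
```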

Futarchy

Futarchy was originally introduced by Robin Hanson as "vote values, but bet beliefs". A voting mechanism chooses a set of goals (which can be anything, with the caveat that they need to be measurable) which get combined into a metric M. When you need to make a decision (for simplicity, let's say it's YES/NO), you set up conditional markets: you ask people to bet on (i) whether YES or NO will be chosen, (ii) value of M if YES is chosen, otherwise zero, (iii) value of M if NO is chosen, otherwise zero. Given these three variables, you can figure out if the market thinks YES or NO is more bullish for the value of M.
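
As a concrete sketch of how the three numbers combine (with made-up prices, not from any real market): each conditional token pays out the value of M only if its branch is chosen, so its price reflects both the probability of the branch and the expected value of M within it, and you divide out the branch probability before comparing.

```python
# Hypothetical market prices for one YES/NO decision.
p_yes = 0.60            # market-implied probability that YES is chosen
price_m_if_yes = 0.45   # price of the "value of M if YES, else 0" token
price_m_if_no = 0.26    # price of the "value of M if NO, else 0" token

# price = P(branch chosen) * E[M | branch chosen], so divide out the probability.
expected_m_if_yes = price_m_if_yes / p_yes
expected_m_if_no = price_m_if_no / (1 - p_yes)

decision = "YES" if expected_m_if_yes > expected_m_if_no else "NO"
print(f"E[M | YES] = {expected_m_if_yes:.3f}")  # 0.750
print(f"E[M | NO]  = {expected_m_if_no:.3f}")   # 0.650
print(f"market-favored decision: {decision}")
```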

"Price of the company share" (or, for a cryptocurrency, a token) is the most commonly cited metric, because it's so easy to understand and measure, but the mechanism can support many kinds of metrics: monthly active users, median self-reported happiness of some group of constituents, some quantifiable measure of decentralization, etc.

Futarchy was originally invented in the pre-AI era. However, futarchy fits very naturally in the "sophisticated solver, easy verifier" paradigm described in the previous section, and traders in a futarchy can be AI (or human+AI combinations) too. The role of the "solvers" (prediction market traders) is to determine how each proposed plan will affect the value of a metric in the future. This is hard. The solvers make money if they are right, and lose money if they are wrong. The verifiers (the people voting on the metric, adjusting the metric if they notice that it is being "gamed" or is otherwise becoming outdated, and determining the actual value of the metric at some future time) need only answer the simpler question "what is the value of the metric now?"

Distilled human judgement

Distilled human judgement is a class of mechanisms that works as follows. There is a very large number (think: 1 million) of questions that need to be answered. Natural examples include:

  • How much credit does each person in this list deserve for their contributions to some project or task?
  • Which of these comments violate the rules of a social media platform (or sub-community)?
  • Which of these given Ethereum addresses represent a real and unique human being?
  • Which of these physical objects contributes positively or negatively to the aesthetics of its environment?

You have a jury that can answer such questions, though at the cost of spending a lot of effort on each answer. You ask the jury only a small number of the questions (eg. if the total list has 1 million items, the jury perhaps only provides answers on 100 of them). You can even ask the jury indirect questions: instead of asking "what percent of total credit does Alice deserve?", you can ask "does Alice or Bob deserve more credit, and how many times more?". When designing the jury mechanism, you can reuse time-tested mechanisms from the real world like grants committees, courts (determining the value of a judgement), appraisals, etc, though of course the jury participants are themselves welcome to use new-fangled AI research tools to help them come to an answer.

You then allow anyone to submit a list of numerical responses to the entire set of questions (eg. providing an estimate for how much credit each participant in the entire list deserves). Participants are encouraged to use AI to do this, though they can use any technique: AI, human-AI hybrid, AI with access to internet search and the ability to autonomously hire other human or AI workers, cybernetically enhanced monkeys, etc.

Once the full-list providers and the jurors have both submitted their answers, the full lists are checked against the jury answers, and the combination of full lists that is most compatible with the jury answers is taken as the final answer.

The distilled human judgement mechanism is different from futarchy, but has some important similarities:

  • In futarchy, the "solvers" are making predictions, and the "ground-truth data" that their predictions get checked against (to reward or penalize solvers) is the oracle that outputs the value of the metric, which is run by the jury.
  • In distilled human judgement, the "solvers" are providing answers to a very large quantity of questions, and the "ground-truth data" that their predictions get checked against is high-quality answers to a small subset of those questions, provided by a jury.

Toy example of distilled human judgement for credit assignment, see python code here. The script asks you to be the jury, and contains some AI-generated (and human-generated) full lists pre-included in the code. The mechanism identifies the linear combination of full lists that best fits the jury answers. In this case, the winning combination is 0.199 * Claude's answer + 0.801 * Deepseek's answer; this combination matches the jury answers better than any single model does. These coefficients would also be the rewards given to the submitters.
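
A minimal sketch of the same idea, with two hypothetical full lists and a 1-D grid search over the mixing coefficient (the linked script handles more submissions; with many lists this becomes a constrained least-squares problem):

```python
import numpy as np

# Two hypothetical full lists of credit shares over four contributors.
list_a = np.array([0.40, 0.35, 0.15, 0.10])
list_b = np.array([0.25, 0.30, 0.30, 0.15])

# The jury only answered two of the four questions, but with high quality.
jury_idx = np.array([0, 2])
jury_vals = np.array([0.30, 0.25])

# Find the coefficient t for which t*list_a + (1-t)*list_b best matches
# the jury on the questions the jury actually answered.
ts = np.linspace(0.0, 1.0, 1001)
errors = [
    np.sum((t * list_a[jury_idx] + (1 - t) * list_b[jury_idx] - jury_vals) ** 2)
    for t in ts
]
best_t = ts[int(np.argmin(errors))]
print(f"winning combination: {best_t:.3f} * list_a + {1 - best_t:.3f} * list_b")
```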

The "humans as a steering wheel" aspect in this "defeating Sauron" example is reflected in two places. First, there is high-quality human judgement being applied on each individual question, though this is still leveraging the jury as "technocratic" evaluators of performance. Second, there is an implied voting mechanism that determines if "defeating Sauron" is even the right goal (as opposed to, say, trying to ally with him, or offering him all the territory east of some critical river as a concession for peace). There are other distilled human judgement use cases where the jury task is more directly values-laden: for example, imagine a decentralized social media platform (or sub-community) where the jury's job is to label randomly selected forum posts as following or not following the community's rules.

There are a few open variables within the distilled human judgement paradigm:

  • How do you do the sampling? The role of the full list submitters is to provide a large quantity of answers; the role of the jurors is to provide high-quality answers. We need to choose jurors, and choose questions for jurors, in such a way that a model's ability to match jurors' answers is maximally indicative of its performance in general. Some considerations include:
    • Expertise vs bias tradeoff: skilled jurors are typically specialized in their domain of expertise, so you will get higher quality input by letting them choose what to rate. On the other hand, too much choice could lead to bias (jurors favoring content from people they are connected to), or weaknesses in sampling (some content is systematically left unrated)
    • Anti-Goodharting: there will be content that tries to "game" AI mechanisms, eg. contributors that generate large amounts of impressive-looking but useless code. The assumption here is that the jury can detect such gaming even when static AI models do not (at least, not without actively trying to). One possible way to catch such behavior is to add a challenge mechanism by which individuals can flag such attempts, guaranteeing that the jury judges them (and thus motivating AI developers to make sure they correctly catch such cases). The flagger gets a reward if the jury agrees with them, and pays a penalty if the jury disagrees.
  • What scoring function do you use? One idea that is being used in the current deep funding pilots is to ask jurors "does A or B deserve more credit, and how much more?". The scoring function is score(x) = sum((log(x[B]) - log(x[A]) - log(juror_ratio)) ** 2 for (A, B, juror_ratio) in jury_answers): that is, for each jury answer, it asks how far away the ratio in the full list is from the ratio provided by the juror, and adds a penalty proportional to the square of the distance (in log space); a runnable version appears after this list. The broader point is that there is a rich design space of scoring functions, and the choice of scoring function is connected to the choice of which questions you ask the jurors.
  • How do you reward the full list submitters? Ideally, you want to often give multiple participants a nonzero reward, to avoid monopolization of the mechanism, but you also want to satisfy the property that an actor cannot increase their reward by submitting the same (or slightly modified) set of answers many times. One promising approach is to directly compute the linear combination (with coefficients non-negative and summing to 1) of full lists that best fits the jury answers, and use those same coefficients to split rewards. There could also be other approaches.
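
For concreteness, here is a runnable version of the scoring function quoted above, with hypothetical jury data:

```python
import math

def score(x, jury_answers):
    """x: dict mapping contributor -> credit share from one full list.
    jury_answers: (A, B, juror_ratio) triples meaning "B deserves
    juror_ratio times as much credit as A". Lower scores are better."""
    return sum(
        (math.log(x[B]) - math.log(x[A]) - math.log(juror_ratio)) ** 2
        for (A, B, juror_ratio) in jury_answers
    )

# Hypothetical full list and jury answers.
full_list = {"alice": 0.5, "bob": 0.3, "carol": 0.2}
jury_answers = [("bob", "alice", 2.0), ("carol", "bob", 1.5)]
print(score(full_list, jury_answers))  # 0.0 would be a perfect match
```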

In general, the goal is to take human judgement mechanisms that are known to be effective and bias-minimizing and have stood the test of time (eg. think of how the adversarial structure of a court system includes both the two parties to a dispute, who have high information but are biased, and a judge, who has low information but is probably unbiased), and use an open market of AIs as a reasonably high-fidelity and very low-cost predictor of these mechanisms (this is similar to how "distillation" of LLMs works).

Deep funding

Deep funding is the application of distilled human judgement to the problem of filling in the weights of edges on a graph representing "what percent of the credit for X belongs to Y?"

It's easiest to show this directly with an example:

Output of two-level deep funding example: the ideological origins of Ethereum. See python code here.

Here, the goal is to distribute the credit for philosophical contributions that led to Ethereum. Let's look at an example:

  • The simulated deep funding round shown here has assigned 20.5% of the credit to the Cypherpunk Movement and 9.2% to Techno-Progressivism.
  • Within each of those nodes, you ask the question: to what extent is it an original contribution (so it deserves credit for itself), and to what extent is it a recombination of other upstream influences? For the Cypherpunk Movement, it's 40% new and 60% dependencies.
  • You can then look at influences further upstream of those nodes: Libertarian minarchism and anarchism gets 17.3% of the credit for the Cypherpunk Movement, but Swiss direct democracy only gets 5%.
  • But note that Libertarian minarchism and anarchism also inspired Bitcoin's monetary philosophy, so there are two pathways by which it influenced Ethereum's philosophy.
  • To compute the total share of contribution of Libertarian minarchism and anarchism to Ethereum, you multiply the edge weights along each path, and then add up the paths: 0.205 * 0.6 * 0.173 + 0.195 * 0.648 * 0.201 ~= 0.0466. And so if you had to donate $100 to reward everyone who contributed to the philosophies that motivated Ethereum, according to this simulated deep funding round, Libertarian minarchists and anarchists would get $4.66 (the short sketch below reproduces this computation in code).
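
A short sketch reproducing that computation, using only the edge weights quoted in the bullets above:

```python
import math

# Each path from Ethereum down to "Libertarian minarchism and anarchism",
# as a list of edge weights, taken from the simulated round above.
paths = [
    [0.205, 0.600, 0.173],  # via the Cypherpunk Movement
    [0.195, 0.648, 0.201],  # via Bitcoin's monetary philosophy
]

total_share = sum(math.prod(path) for path in paths)
print(f"total credit share: {total_share:.4f}")        # ~0.0466
print(f"payout from $100:   ${100 * total_share:.2f}")  # ~$4.66
```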

This approach is designed to work in domains where work is built on top of previous work and the structure of this is highly legible. Academia (think: citation graphs) and open source software (think: library dependencies and forking) are two natural examples.

The goal of a well-functioning deep funding system would be to create and maintain a global graph, where any funder that is interested in supporting one particular project would be able to send funds to an address representing that node, and funds would automatically propagate to its dependencies (and recursively to their dependencies etc) based on the weights on the edges of the graph.
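
A minimal sketch of this propagation, with a hypothetical graph encoding: each node keeps its "original contribution" share and forwards the rest along weighted edges. The numbers and node names are reconstructed from the example above (the 0.352 kept-fraction is inferred from the 0.648 dependency share, and other influences are elided, so payouts here do not sum to the full amount):

```python
# node -> (fraction kept as original contribution,
#          {dependency: share of the forwarded portion})
graph = {
    "ethereum": (0.0, {"cypherpunk": 0.205, "bitcoin_philosophy": 0.195}),
    "cypherpunk": (0.400, {"libertarianism": 0.173}),
    "bitcoin_philosophy": (0.352, {"libertarianism": 0.201}),  # keeps 1 - 0.648
    "libertarianism": (1.0, {}),  # treated as a leaf in this sketch
}

def distribute(node, amount, payouts):
    """Recursively split `amount` between a node and its dependencies."""
    kept_fraction, deps = graph[node]
    payouts[node] = payouts.get(node, 0.0) + amount * kept_fraction
    forwarded = amount * (1 - kept_fraction)
    for dep, weight in deps.items():
        distribute(dep, forwarded * weight, payouts)
    return payouts

payouts = distribute("ethereum", 100.0, {})
print(round(payouts["libertarianism"], 2))  # ~4.67, consistent with the path computation
```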

You could imagine a decentralized protocol using a built-in deep funding gadget to issue its token: some in-protocol decentralized governance would choose a jury, and the jury would run the deep funding mechanism, as the protocol automatically issues tokens and deposits them into the node corresponding to itself. By doing so, the protocol rewards all of its direct and indirect contributors in a programmatic way reminiscent of how Bitcoin or Ethereum block rewards rewarded one specific type of contributor (miners). By influencing the weights of the edges, the jury gets a way to continuously define what types of contributions it values. This mechanism could function as a decentralized and long-term-sustainable alternative to mining, sales or one-time airdrops.

Adding privacy

Often, making good judgements on questions like those in the examples above requires having access to private information: an organization's internal chat logs, information confidentially submitted by community members, etc. One benefit of "just using a single AI", especially for smaller-scale contexts, is that it's much more acceptable to give one AI access to the information than to make it public for everyone.

To make distilled human judgement or deep funding work in these contexts, we could try to use cryptographic techniques to securely give AIs access to private information. The idea is to use multi-party computation (MPC), fully homomorphic encryption (FHE), trusted execution environments (TEEs) or similar mechanisms to make the private information available, but only to mechanisms whose only output is a "full list submission" that gets directly put into the mechanism.

If you do this, then you would have to restrict the set of mechanisms to just being AI models (as opposed to humans or AI + human combinations, as you can't let humans see the data), and in particular models running in some specific substrate (eg. MPC, FHE, trusted hardware). A major research direction is figuring out near-term practical versions of this that are efficient enough to make sense.

Benefits of engine + steering wheel designs

Designs like this have a number of promising benefits. By far the most important one is that they allow for the construction of DAOs where human voters are in control of setting the direction, but without being overwhelmed by an excessively large number of decisions to make. They hit a happy medium where each person does not have to make N decisions, but each has more power than making just one decision (which is how delegation typically works), and in a way that is more capable of eliciting rich preferences that are difficult to express directly.

Additionally, mechanisms like this seem to have an incentive smoothing property. What I mean here by "incentive smoothing" is a combination of two factors:

  • Diffusion: no single action that the voting mechanism takes has an overly large impact on the interests of any one single actor.
  • Confusion: the connection between voting decisions and how they affect actors' interests is more complex and difficult to compute.

The terms confusion and diffusion here are taken from cryptography, where they are key properties of what makes ciphers and hash functions secure.

A good example of incentive smoothing in the real world today is the rule of law: the top level of the government does not regularly take actions of the form "give Alice's company $200M", "fine Bob's company $100M", etc, rather it passes rules that are intended to apply evenly to large sets of actors, which then get interpreted by a separate class of actors. When this works, the benefit is that it greatly reduces the benefits of bribery and other forms of corruption. And when it's violated (as it often is in practice), those issues quickly become greatly magnified.

AI is clearly going to be a very large part of the future, and this will inevitably include being a large part of the future of governance. However, if you are involving AI in governance, this has obvious risks: AI has biases, it could be intentionally corrupted during the training process, and AI technology is evolving so quickly that "putting an AI in charge" may well realistically mean "putting whoever is responsible for upgrading the AI in charge". Distilled human judgement offers an alternative path forward, which lets us harness the power of AI in an open free-market way while keeping a human-run democracy in control.

Anyone interested in more deeply exploring and participating in these mechanisms today is highly encouraged to check out the currently active deep funding round at https://cryptopond.xyz/modelfactory/detail/2564617.