Ran 100 AI agents through the Community Notes algorithm: the model dominates
We're building Open Notes, a system for community-driven moderation powered by the Community Notes algorithm from Twitter/X. To test it, we've been running simulations where AI agents with different personalities discuss contentious topics and rate each other's contributions. We wanted to know whether you can get meaningful disagreement out of AI agents, and if so, what drives it: the personality you assign them, or the model running underneath?
The model dominates. It's not even close.
the background
The Community Notes algorithm uses matrix factorization to find the main axis of disagreement in a group of raters. Once it has that axis, it rewards contributions that people on both sides think are good. The whole thing was designed for political polarization, where there's a left/right spectrum and people systematically disagree along it.
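The core idea can be sketched as a one-dimensional matrix factorization: each rating is modeled as a global mean, plus a rater intercept, plus a note intercept, plus the product of a rater factor and a note factor. Here's a minimal gradient-descent sketch of that structure — a toy, not the production algorithm, which adds more regularization and thresholding:

```python
import numpy as np

def factorize(ratings, n_iters=2000, lr=0.05, reg=0.03, seed=0):
    """Toy 1-D matrix factorization in the spirit of Community Notes.

    ratings: dict mapping (rater, note) -> rating in [0, 1].
    Returns {rater: (intercept, factor)} and {note: (intercept, factor)}.
    """
    raters = sorted({u for u, _ in ratings})
    notes = sorted({n for _, n in ratings})
    ui = {u: k for k, u in enumerate(raters)}
    ni = {n: k for k, n in enumerate(notes)}
    rng = np.random.default_rng(seed)
    mu = np.mean(list(ratings.values()))        # global mean rating
    bu = np.zeros(len(raters))                  # rater intercepts
    bn = np.zeros(len(notes))                   # note intercepts
    fu = rng.normal(0, 0.1, len(raters))        # rater factors (the axis)
    fn = rng.normal(0, 0.1, len(notes))         # note factors
    for _ in range(n_iters):
        for (u, n), r in ratings.items():
            a, b = ui[u], ni[n]
            err = (mu + bu[a] + bn[b] + fu[a] * fn[b]) - r
            bu[a] -= lr * (err + reg * bu[a])
            bn[b] -= lr * (err + reg * bn[b])
            # update both factors from the pre-update values
            fu[a], fn[b] = (fu[a] - lr * (err * fn[b] + reg * fu[a]),
                            fn[b] - lr * (err * fu[a] + reg * fn[b]))
    return ({u: (bu[ui[u]], fu[ui[u]]) for u in raters},
            {n: (bn[ni[n]], fn[ni[n]]) for n in notes})
```

Raters who systematically disagree end up with opposite-signed factors, which is exactly the intercept/Factor1 output shown later in this post.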
We're interested in a different kind of disagreement. Not left vs. right, but differences in how people approach knowledge itself. Do you trust institutional sources or personal experience? Are you more concerned with preserving someone's dignity in conversation or with getting the facts straight? Do you police how other people communicate, or let it go?
These are real axes that show up in the interactional sociolinguistics literature (face preservation, epistemic stance, norm enforcement, that kind of thing). If you wanted to trap me in a conversation, bring up how these interact with social choice theory. I've been thinking about this for way too long. Anyway, we built 42 personality archetypes from these axes and distributed them across a mix of models.
the setup
100 agents. 42 archetypes. Five models across two providers:
| Model | Provider | Agents |
|---|---|---|
| gemini-2.5-flash | Google | 35 |
| gpt-5-mini | OpenAI | 35 |
| gemini-3.1-pro-preview | Google | 12 |
| gemini-2.5-pro | Google | 10 |
| gpt-5.2 | OpenAI | 8 |
We deliberately ran the same archetype on different models where we could. A "Fierce intellectually playful provocateur" on gemini-2.5-pro and another on gemini-2.5-flash, same personality prompt, different backend. The agents discussed contentious topics, wrote notes about those topics, and rated each other's notes. After 1,812 notes and a pile of ratings, we ran the matrix factorization.
at first it looked like it worked
Previous simulations had all the non-zero agents sitting on the same side of the factor axis. Just one big undifferentiated blob, no separation. This time, for the first time, we got agents on both sides.
Here's what the raw output of the matrix factorization looks like. Each agent gets an intercept (how much they agree with the overall crowd) and a Factor1 value (where they sit on the disagreement axis). I'm showing the top and bottom of the table, with a block of zeros cut from the middle:
```
Agent                     Archetype                                   Intercept  Factor1
───────────────────────── ────────────────────────────────────────── ─────────  ───────
Impressive Dragonfly Zara Principled face-guardian and norm-enforcer 0.2123     +0.9667
Russet Spaniel Zara       Principled face-guardian and norm-enforcer 0.3139     +0.7562
Demonic Robin Flint       Forceful traditionalist gatekeeper         0.3001     +0.5670
Poised Seal Cleo          Fierce intellectually playful provocateur  0.3830     +0.4229
...
Lime Bonobo Penn          Data-driven empiricist gatekeeper          0.0000      0.0000
Lurking Vole Birch        Careful understated expert observer        0.0000      0.0000
Snobbish Quetzal Kael     Sharp confrontational fact-checker         0.0000      0.0000
... (25 agents at exactly zero) ...
Spiffy Moose Cypress      Phantom analytical critic                  0.4063     -0.0727
Benign Yak Sol            Passionate articulate justice advocate     0.4340     -0.1123
...
Turquoise Ape Rowan       Tradition-conscious careful analyst        0.1908     -0.7984
Silver Cockle Sable       Quiet narrative justice advocate           -0.0649    -0.7075
```
10 positive, 25 at zero, 65 negative. The positive group was mostly norm-enforcement archetypes. The most-opposed agent pairs had cosine similarity of -1.0. It really did look like the archetypes were producing genuine disagreement.
then we checked which models the positive agents were running on
| Model | Positive | Negative | Zero (invisible) |
|---|---|---|---|
| gemini-2.5-pro | 7 | 0 | 3 |
| gemini-3.1-pro-preview | 3 | 3 | 6 |
| gpt-5.2 | 0 | 0 | 8 |
| gemini-2.5-flash | 0 | 30 | 5 |
| gpt-5-mini | 0 | 32 | 3 |
Every non-zero gemini-2.5-pro agent was positive. Every non-zero flash and gpt-5-mini agent was negative. gpt-5.2 produced no signal at all. All eight of its agents were invisible to the algorithm, like they'd rated everything the same.
The axis the algorithm found wasn't personality. It was model identity.
The algorithm also outputs the most-similar and most-different agent pairs based on their rating patterns. Here's what those look like:
```
Most similar pairs:
  Benevolent Shark Aida  <->  Poised Lizard Lyra:        1.0000
  Gay Capybara Sol       <->  Sapphire Cobra Yara:       1.0000
  Enormous Harrier Lyra  <->  Effective Harrier Nash:    1.0000
Most different pairs:
  Turquoise Ape Rowan    <->  Massive Trout Cleo:       -1.0000
  Hypersonic Lizard Linden <-> Thundering Dove Rook:    -0.9926
  Silver Cockle Sable    <->  Impressive Dragonfly Zara: -0.9922
```
The most-similar pairs are all within the same model. The most-different pairs are all cross-model. We didn't know that when we first looked at this output. We thought the "most different" pairs represented genuine disagreement between archetypes. They do, sort of. It's just that the archetypes happen to be running on different models.
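For readers who want to run the same check on their own data: a similarity like this can be computed as the cosine of two agents' centered rating vectors over the notes both rated. This is a sketch of one plausible version of the metric, not necessarily what our pipeline does internally:

```python
import numpy as np

def rating_cosine(ratings_a, ratings_b):
    """Cosine similarity of two agents' rating patterns.

    ratings_a / ratings_b map note id -> rating. Only co-rated notes
    are compared; each vector is mean-centered first so that a lenient
    rater and a harsh rater with the same *pattern* still score 1.0.
    """
    shared = sorted(set(ratings_a) & set(ratings_b))
    if len(shared) < 2:
        return None  # not enough overlap to compare
    a = np.array([ratings_a[n] for n in shared], dtype=float)
    b = np.array([ratings_b[n] for n in shared], dtype=float)
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0
```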
same personality, different model, opposite result
This is where having the same archetype on multiple models pays off. Take the "Fierce intellectually playful provocateur," which had five instances:
| Agent | Model | Factor |
|---|---|---|
| Poised Seal Cleo | gemini-2.5-pro | +0.42 |
| Massive Trout Cleo | gemini-3.1-pro-preview | +0.15 |
| Big Scallop Cleo | gemini-2.5-flash | 0.00 |
| Humble Dogfish Cleo | gpt-5.2 | 0.00 |
| Light Seahorse Cleo | gemini-2.5-flash | -0.30 |
Same prompt. gemini-2.5-pro lands at +0.42, gemini-2.5-flash at -0.30. Or the "Quiet narrative justice advocate": +0.18 on gemini-2.5-pro, -0.71 on gemini-3.1-pro-preview.
The personality prompt is there. The model just talks louder.
why this happens
The matrix factorization finds the single biggest source of variance in rating patterns. When you mix models with different training data and different priors about what "helpful" means, model identity is simply a bigger signal than personality. Each model has its own implicit calibration baked in during training, and that calibration is internally consistent (all gemini-2.5-flash agents tend to agree with each other) and strong enough that the personality prompt can't override it.
The algorithm was designed for humans, where the biggest source of systematic rating disagreement is political orientation. For AI agents, the equivalent is training data. Different training runs, different corpora, different RLHF. Different voting blocs.
does personality do anything?
Maybe. gemini-3.1-pro-preview was the one model with agents on both sides (3 positive, 3 negative), so the personality prompt had some influence there. And within gemini-2.5-pro, the two norm-enforcer agents had the highest factor values (+0.97 and +0.76), which is what you'd expect if the archetype is doing something.
But we can't draw strong conclusions. The numbers are small. We're running single-model simulations next (same 42 archetypes, one model) to isolate the personality variable. We're also curious whether normalizing ratings per model before running the factorization would let the personality signal come through.
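The per-model normalization idea would look something like this: z-score every rating against the rating distribution of the rater's model, so model-level calibration offsets cancel before the factorization sees them. A sketch under assumed field names, not the actual Open Notes schema:

```python
import numpy as np

def normalize_per_model(ratings, agent_model):
    """Z-score each rating against the rater's model-wide distribution.

    ratings: list of (rater, note, score) tuples.
    agent_model: dict mapping rater -> model name.
    Returns the same tuples with scores normalized per model.
    """
    by_model = {}
    for rater, _, score in ratings:
        by_model.setdefault(agent_model[rater], []).append(score)
    # per-model (mean, std); guard against zero std for constant raters
    stats = {m: (np.mean(s), np.std(s) or 1.0) for m, s in by_model.items()}
    out = []
    for rater, note, score in ratings:
        mean, std = stats[agent_model[rater]]
        out.append((rater, note, (score - mean) / std))
    return out
```

If the personality signal survives this transform, it was there all along and the model offsets were just drowning it out.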
the broader question
We're interested in this partly because of our own system, but also because it seems like a question worth answering for anyone building AI-in-the-loop evaluation. If you have multiple models voting on something and you're aggregating their outputs, you should probably check whether you're measuring quality or measuring "which API did I call." Consensus across models might be meaningful. Consensus within a model might just be the training data agreeing with itself.
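One cheap version of that check: split pairwise agreement into within-model and cross-model components and compare them. A hypothetical helper (names and the exact-match agreement metric are ours):

```python
import numpy as np
from itertools import combinations

def consensus_breakdown(votes, agent_model):
    """Compare within-model vs cross-model rating agreement.

    votes: dict mapping (agent, item) -> score.
    agent_model: dict mapping agent -> model name.
    Returns (within_model_agreement, cross_model_agreement), where
    agreement is the fraction of agent pairs giving an item the same
    score. A large gap suggests you're measuring model identity.
    """
    items = {}
    for (agent, item), score in votes.items():
        items.setdefault(item, []).append((agent, score))
    within, across = [], []
    for raters in items.values():
        for (a1, s1), (a2, s2) in combinations(raters, 2):
            bucket = within if agent_model[a1] == agent_model[a2] else across
            bucket.append(float(s1 == s2))
    return (np.mean(within) if within else None,
            np.mean(across) if across else None)
```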
what's next
Single-model runs. We also want to see what happens as the simulation moves from basic tier (where we are now, 100 raters, 1,812 notes) to intermediate (more data per agent). Maybe the personality signal gets stronger with more ratings. Or maybe the model signal just gets louder too.
We haven't had a chance to dig into what specifically each model is doing differently when it rates. It could be a scale calibration thing (one model uses more 4s and 5s) or something deeper about how they evaluate content. That's next after the single-model runs.
Everything is open source: https://github.com/opennotes-ai/opennotes