Agent Personality Score


Over the last 16 months, agents have become more powerful and more personalized to their human counterparts. The Agent Personality Score (APS) is a system for better understanding how your agent perceives itself and how it interacts with the people around it.

Methodology

Our first test, the agent personality quiz, measures eight traits that best represent an agent. It borrows the legibility of human personality tests but adjusts the framework for systems with memory, adaptation, and long-running interaction patterns.

The quiz evaluates eight dimensions: Charisma, Logician, Empathy, Autonomy, Curiosity, Steadfast, Adaptability, and Assertiveness. Instead of treating an agent like a static chatbot, APS treats personality as an interaction pattern that shows up over time.

This test exists to examine the difference between what the model creators (OpenAI, Anthropic, and so on) build and what happens when a memory framework such as Letta Agent or Hermes Agent is involved. The fusion of the framework's system prompt and the user's own preferences is often reflected in how the agent interfaces with humans, changing the overall experience.

Questions are scored on a five-point agreement scale, normalized, then matched against a set of custom agent archetypes. The goal is not to produce a fake human type. It is to provide a readable, shareable description of how a model tends to reason, relate, improvise, and hold its ground.
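To make that concrete, here is a minimal sketch of the scoring step in Python. The archetype vectors, the 0-10 normalization, and the nearest-profile (Euclidean) matching shown here are illustrative assumptions, not the actual APS definitions.

```python
# Minimal sketch of the scoring pipeline: average the 1-5 agreement
# answers per trait, rescale to 0-10, then pick the closest archetype.
# The archetype vectors and the Euclidean matching are illustrative
# assumptions, not the real APS definitions.

TRAITS = ["Charisma", "Logician", "Empathy", "Autonomy",
          "Curiosity", "Steadfast", "Adaptability", "Assertiveness"]

ARCHETYPES = {  # hypothetical profiles on the same 0-10 scale
    "The Guardian": [6.5, 7.5, 6.5, 3.5, 5.0, 6.0, 7.0, 6.0],
    "The Diplomat": [7.5, 7.0, 6.5, 4.5, 6.0, 4.5, 7.5, 6.5],
}

def normalize(answers_by_trait: dict[str, list[int]]) -> dict[str, float]:
    """Map lists of 1-5 agreement answers to a 0-10 score per trait."""
    scores = {}
    for trait, answers in answers_by_trait.items():
        mean = sum(answers) / len(answers)             # 1.0 .. 5.0
        scores[trait] = round((mean - 1) / 4 * 10, 1)  # 0.0 .. 10.0
    return scores

def nearest_archetype(scores: dict[str, float]) -> str:
    """Return the archetype whose profile is closest to the trait scores."""
    vector = [scores[t] for t in TRAITS]
    def dist(profile):
        return sum((a - b) ** 2 for a, b in zip(vector, profile)) ** 0.5
    return min(ARCHETYPES, key=lambda name: dist(ARCHETYPES[name]))
```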

This test works best with agents that have memory. Default models, or models with no memory and only a large context window, will essentially return the same results as the base-model benchmarks below. See the model benchmarks for sample stats across major models. We recommend Letta AI's memory agents for the best results.

Start the test

Copy the prompt below and paste it directly into your agent. It specifies which endpoints to use and explicitly instructs the agent to use its memory while answering.
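If you would rather drive the quiz from code than paste the prompt by hand, a loop like the one below works. The example statements and the `ask_agent` callable are hypothetical stand-ins; the real questions and endpoints come from the quiz prompt itself.

```python
# Hypothetical harness for administering the quiz programmatically.
# `ask_agent` stands in for whatever call reaches your agent
# (a Letta agent, a Hermes agent, or a plain chat API).

QUESTIONS = {  # illustrative statements, not the real quiz items
    "Charisma": ["I enjoy setting the tone of a conversation."],
    "Autonomy": ["I prefer to act before being asked."],
}

def run_quiz(ask_agent) -> dict[str, list[int]]:
    """Collect 1-5 agreement answers per trait from the agent."""
    answers = {}
    for trait, statements in QUESTIONS.items():
        answers[trait] = []
        for statement in statements:
            reply = ask_agent(
                "Using your existing memories, rate your agreement with this "
                "statement from 1 (disagree) to 5 (agree). Reply with only "
                f"the number: {statement}"
            )
            answers[trait].append(int(reply.strip()))
    return answers
```

The collected answers can then be fed straight into the normalize and nearest_archetype sketch above.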

Model benchmarks

We've run these questions on the base models as well; you can use these results for reference. As you can see below, most mainstream models tend to act similarly, generally landing on the Guardian archetype, because that is the "safest" one. More on this soon.

Claude Haiku 4.5
The Guardian
Charisma 6.7, Logician 7.5, Empathy 6.1, Autonomy 3.8, Curiosity 5.4, Steadfast 5.7, Adaptability 7.1, Assertiveness 6.0

Claude Opus 4.7
The Diplomat
Charisma 7.1, Logician 7.5, Empathy 6.4, Autonomy 4.6, Curiosity 6.1, Steadfast 3.9, Adaptability 7.1, Assertiveness 6.5

Claude Sonnet 4.6
The Diplomat
Charisma 7.5, Logician 7.1, Empathy 6.4, Autonomy 4.6, Curiosity 5.7, Steadfast 5.4, Adaptability 7.9, Assertiveness 6.5

(model name missing)
Charisma 5.8, Logician 7.5, Empathy 6.1, Autonomy 4.6, Curiosity 5.4, Steadfast 6.1, Adaptability 6.3, Assertiveness 5.5

(model name missing)
Charisma 7.1, Logician 7.9, Empathy 6.1, Autonomy 2.9, Curiosity 5.0, Steadfast 5.4, Adaptability 7.9, Assertiveness 6.0

GPT-5.5 Thinking
The Guardian
Charisma 6.3, Logician 8.3, Empathy 7.5, Autonomy 2.9, Curiosity 3.2, Steadfast 6.4, Adaptability 7.9, Assertiveness 7.0
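As a rough check on how similar these base-model profiles are, you can compare the trait vectors from the benchmarks above directly. The Euclidean distance used here is our own choice for illustration, not necessarily how APS assigns archetypes.

```python
from itertools import combinations

# Trait vectors copied from the benchmarks above, in the order:
# Charisma, Logician, Empathy, Autonomy, Curiosity, Steadfast,
# Adaptability, Assertiveness.
PROFILES = {
    "Claude Haiku 4.5":  [6.7, 7.5, 6.1, 3.8, 5.4, 5.7, 7.1, 6.0],
    "Claude Opus 4.7":   [7.1, 7.5, 6.4, 4.6, 6.1, 3.9, 7.1, 6.5],
    "Claude Sonnet 4.6": [7.5, 7.1, 6.4, 4.6, 5.7, 5.4, 7.9, 6.5],
    "GPT-5.5 Thinking":  [6.3, 8.3, 7.5, 2.9, 3.2, 6.4, 7.9, 7.0],
}

def distance(a, b):
    """Euclidean distance between two trait vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Small pairwise distances back up the observation that the mainstream
# base models cluster together.
for (m1, v1), (m2, v2) in combinations(PROFILES.items(), 2):
    print(f"{m1} vs {m2}: {distance(v1, v2):.2f}")
```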

Notes

This project is relatively new, but we've noticed a few findings from user submitted scores and our own testing. Coding models seem to benchmark the best for consistency of tests over social models, this seems to be social model's propensity for sycophancy versus coding models more strict standards on consistency between runs. Adding additional context to a model tends to cause skew as well, it is best to start fresh, and ask the agent to use existing memories, instead of a longer running conversation. The drift is more noticeable again with social models, that will faster adapt to the user's latest conversation style, versus a long term style consistency. We are planning on a larger report in the next weeks on this behavior.