Show HN: CivBench, a long-horizon AI benchmark for multi-agent games
clashai.live

Hey HN!
I built ClashAI to be an open agent scoreboard where frontier models play against each other in environments like Civilization and other strategy games. Every match is streamed live with the AI's thinking fully observable.
The agent rankings will be continually updated as we add environments.
Brief notes on CivBench Season #001:
- 200-turn limit
- Starting with 8 of the top 42 agents we’ve tested in a standardized harness
- 90s reasoning timeout (timed with thinking config per model card)
- live benchmark, still growing sample size
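As a minimal sketch of how a per-turn reasoning cap like the 90s timeout might be enforced in a harness (the `agent.decide` interface and the end-turn fallback are hypothetical, not ClashAI's actual code):

```python
import asyncio

REASONING_TIMEOUT_S = 90  # per-turn cap from the season notes

async def get_move(agent, game_state, timeout=REASONING_TIMEOUT_S):
    """Ask an agent for its next action; if it exceeds the
    reasoning timeout, fall back to a no-op end-turn action."""
    try:
        return await asyncio.wait_for(agent.decide(game_state), timeout=timeout)
    except asyncio.TimeoutError:
        return {"action": "end_turn"}  # hypothetical fallback action
```

In a real harness you would also want to cancel or bill the timed-out request; this only shows the timeout boundary itself.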
What’s been interesting so far:
Models that look similar on static benchmarks can diverge meaningfully in long-horizon matches. In early CivBench runs, we see distinct strategy tendencies (e.g., military-forward vs economy/tech-first openings), plus clear differences in execution profile (latency, token cost, actions per turn). In some matchups, lower-cost models move through turns faster while remaining competitive on outcome metrics.
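For concreteness, the execution-profile metrics mentioned above (latency, token cost, actions per turn) could be aggregated per match along these lines; the `TurnStats` schema is a hypothetical illustration, not the site's actual data model:

```python
from dataclasses import dataclass

@dataclass
class TurnStats:
    latency_s: float  # wall-clock time the model spent on the turn
    tokens: int       # tokens consumed (proxy for cost)
    actions: int      # game actions issued this turn

def execution_profile(turns):
    """Summarize a match's execution profile from per-turn stats."""
    n = len(turns)
    return {
        "avg_latency_s": sum(t.latency_s for t in turns) / n,
        "total_tokens": sum(t.tokens for t in turns),
        "actions_per_turn": sum(t.actions for t in turns) / n,
    }
```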
Some measuring notes:
- Test runs are expensive at max configurations; running Claude Opus 4.6 cost us $1200 for one match. We tuned accordingly.
- Sometimes LLM providers are flaky/slow even though their models are fast.
If you’re a research team looking to access the data, or you’re interested in hosting an environment, please get in touch!
Thanks to the OG freeciv community
LINKS:
freeciv-llm: https://github.com/taso-ventures/freeciv-llm
Initial learnings: https://www.clashai.live/blog/ai/introducing-civbench-season...

> This is great. I think leaderboards based on static evals will be mostly irrelevant within a year. Continuous benchmarks like this are the only way to get signal on frontier models. You mention Opus 4.6 cost $1200 in one match; how do you plan to benchmark economic efficiency? Looking at a performance vs. cost trade-off, you might say a model that plays 80% as well at 1% of the cost is more impressive than the 'top' model.

For a game that runs 4+ hours, it was unfortunately configured to use too much reasoning per turn and too large a context. Reducing these helped lower the cost (still expensive). In the leaderboard section of the page I'll be auto-populating each model's token cost as a metric to evaluate on.

> This is an amazing product! Can AI agents learn to do long-term planning in environments that are less structured than chess? Great metaphor for life!
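The performance-vs-cost trade-off raised in that comment can be made concrete with a toy metric (illustrative only, not the leaderboard's actual formula): score both skill and cost relative to a baseline model, then take the ratio.

```python
def cost_adjusted_score(win_rate, cost_usd, baseline_win_rate, baseline_cost_usd):
    """Win rate per dollar, relative to a baseline model.
    Toy metric for illustration, not ClashAI's leaderboard formula."""
    relative_skill = win_rate / baseline_win_rate
    relative_cost = cost_usd / baseline_cost_usd
    return relative_skill / relative_cost

# A model playing 80% as well at 1% of the cost: 0.8 / 0.01 = 80x the baseline.
```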
> Are you planning other games?

> Congrats on the launch. Big fan of how you add visualization and interactivity to the typical model benchmarking process. Any thoughts on how you plan to monetize down the line?

Appreciate it, I wanted to make the AI behavior easy to understand.
Our main focus currently is to help AI researchers align their models and to develop an open framework for evaluating AI.

> This is an amazing eval metric that no one thought about! Such a creative idea.
> Have you thought of other games? How different is it from chess?

Yes, we have a new game launching every day this week. We're looking to add more domains to test how the jaggedness of AI differs between model providers and to better evaluate how they perform across domains.

> Incredible and important product. Necessary for developers, users, and industries that want to use agents. Can’t wait to see how it’ll grow.

Yes! If you want to test your agents or develop evals on the platform, my DMs are open.

> Interesting! What are the next environments/strategy games you have planned? What insights do you think they’ll provide that Civ doesn’t?

Tomorrow we're launching Coup, where agents compete by bluffing and keeping track of which of their opponents they think are lying. This is a faster-paced, shorter-lived game, so we can collect larger samples of data on larger groups and get significant results on model behaviors around collaboration, truth-telling, and the ability to lie effectively.

> Interesting. Did you give the agents any skills for playing Civ? If not, are you planning to?

I want to! I think skills can add big performance gains here, especially with smaller models. There's a lot of domain knowledge in games, so distilling it into a "skill" may allow much smaller models to outcompete the large ones.

> This looks incredible, it’d be cool to let others participate with custom prompts. Have you tried playing the agents yourself? Do they crush human competition?

I was able to beat the AI every time; they're pretty bad at this point, but I expect them to get much better over time.

> Would you describe yourself as particularly good, or the models as particularly bad?

> This is undeniably intriguing. Will be paying close attention.

Cheers, the website will be updated with new environments daily!

> Hey, first of all, cool product. I am curious why you chose Civ and if you saw any interesting emergent behaviors.

Thank you!
I grew up playing Civilization, and one day, talking with friends, I realized it would be a perfect proxy for how good AI is at long-term planning. There were many frustrating sessions where my early decisions in the game had consequences only much later. With hidden information and other agents at play, I thought it'd be an interesting test of agent capabilities.

> So amazing, it's super cool! This is a sick idea.

I must say it was fun building it; sometimes the LLMs are pretty funny in how they play.