Show HN: jj-benchmark – Evaluating AI agents on Jujutsu version control

tabbyml.github.io

5 points by wsxiaoys 12 days ago · 2 comments · 2 min read

Hi HN, Meng from TabbyML here.

We decided to build this simply because we find Jujutsu (jj) really interesting, and many folks on our team have started trying it out recently. Since it introduces a very different workflow compared to traditional Git, we thought it would be a fun challenge to see how well current AI coding agents can actually use it.

To build this, we created a semi-automated pipeline. We used AI to research the official Jujutsu documentation and websites, which then helped us bootstrap a dataset of 63 distinct evaluation tasks. Each task includes instructions, bootstrap scripts, and tests. We then ran the evaluations using the Harbor framework and our Pochi agent.
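For illustration, here's a minimal sketch of what one task record in such a dataset might look like. The field names, the task name, and the jj commands shown are assumptions for illustration only, not the benchmark's actual schema:

```python
# Hypothetical sketch of a single jj-benchmark task record.
# Field names, task name, and commands are illustrative assumptions,
# not the benchmark's real schema.
task = {
    "name": "describe-and-new",
    # Natural-language instructions handed to the agent.
    "instructions": "Describe the current change, then start a new change on top.",
    # Shell commands that set up the repo state before the agent runs.
    "bootstrap": [
        "jj git init repo",
        "cd repo && echo hello > file.txt",
    ],
    # Checks run after the agent finishes; the task passes only if all succeed.
    "tests": [
        "jj log -r @- --no-graph -T description | grep -q 'Add file.txt'",
    ],
}

def task_passed(test_results: list[bool]) -> bool:
    """A task counts as passed only when every test check succeeded."""
    return all(test_results)

print(task_passed([True, True]))   # True
print(task_passed([True, False])) # False
```

The split between bootstrap and tests mirrors the post's description: setup is scripted, the agent acts in between, and pass/fail is decided purely by the checks.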

Some interesting insights from our initial leaderboard:

- Claude 4.6 Sonnet is the clear winner: it achieved a 92% success rate (58/63 tasks passed), beating Opus and OpenAI's top models. It seems exceptionally good at parsing jj's novel CLI rules.
- The speed vs. accuracy trade-off: GPT-5.4 sits at #5 with an 81% success rate, but it is incredibly fast, averaging just 77.6s per task. In contrast, Gemini-3.1-pro reached 84% but took over 3x as long (267.6s on average).
- Open-weights / regional models are competitive: models like Kimi-k2.5 (79%) put up a very respectable fight on a relatively niche tool.

The benchmark isn't completely solved yet, but the fact that top models can successfully navigate a relatively new version control system by reasoning through the tasks is pretty exciting.
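As a quick sanity check on the headline number (the post only gives per-task counts for the top entry, so the other percentages can't be re-derived here):

```python
# Success rate for the top leaderboard entry: 58 of 63 tasks passed.
passed, total = 58, 63
rate = passed / total * 100
print(f"{rate:.1f}%")  # 92.1%, reported as 92%
```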

If there are specific jj edge cases you think we should add to the dataset, feel free to open up a PR!

Davidiusdadi 8 days ago

Thanks for creating this.

Would be interesting to know how much a jj-specific SKILL.md would raise the score.

Maybe this does not fit the evaluation framework but I'd still be interested in your experience / setup with e.g. a terminal based coding agent such as claude code.

  • wsxiaoys (OP) a day ago

    > Would be interesting to know how much a jj-specific SKILL.md would raise the score.

    That is definitely something we're interested in; we will try running this evaluation with skills soon.

    > This might not fit the evaluation framework, but I'd still be interested in your experience/setup with terminal-based coding agents like Claude Code.

    We have adopted Harbor as our evaluation framework, so evaluating Claude Code is straightforward: https://harborframework.com/docs/agents#installed-agents
