LoCoMo AI Benchmark: 6.4% of answer key wrong, judge accepts 63% of fake answers

github.com

3 points by dial481 a month ago · 3 comments

dial481OP a month ago

We audited the LoCoMo benchmark (one of the most-cited evals for LLM agent memory) and found 99 score-corrupting errors in 1,540 questions (6.4%). Separately, we tested the LLM judge with adversarially generated wrong answers: it accepted 62.81% of vague-but-topical wrong answers. Some published system scores barely clear that bar. The full audit includes the methodology, documentation of all 99 errors, and reproducible scripts.
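For anyone who wants to check the headline arithmetic, here is a minimal sketch. Only the counts (99 and 1,540) come from the audit; the function names are made up for illustration, and the "ceiling" rests on a simplifying assumption that every flawed key entry marks an otherwise correct answer wrong, which may not hold for every error.

```python
def error_rate(errors: int, total: int) -> float:
    """Fraction of questions whose answer key is flawed."""
    return errors / total


def accuracy_ceiling(errors: int, total: int) -> float:
    """Best score a perfect system could get, ASSUMING each flawed
    key entry penalizes a correct answer (a simplification)."""
    return 1.0 - error_rate(errors, total)


ERRORS = 99      # score-corrupting errors reported by the audit
TOTAL = 1540     # questions in the benchmark

rate = error_rate(ERRORS, TOTAL)
ceiling = accuracy_ceiling(ERRORS, TOTAL)

print(f"answer-key error rate: {rate:.1%}")    # 6.4%
print(f"assumed score ceiling: {ceiling:.1%}")  # 93.6%
```

Under that assumption, differences between systems smaller than the error rate are hard to interpret without first correcting the key.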

  • PaulHoule a month ago

    I've worked in IR, and this has been true of TREC data sets from the beginning; it has also been true of visual data sets. The first step in building a world-beating commercial system has been to clean up the garbage in open evals, which raises the possible accuracy ceiling.

    • dial481OP a month ago

      That's encouraging to hear from someone with IR experience, thanks. Agree completely.
