LoCoMo AI Benchmark: 6.4% of answer key wrong, judge accepts 63% of fake answers

3 points by dial481 3 months ago · 3 comments

dial481OP 3 months ago

We audited the LoCoMo benchmark (one of the most cited evals for LLM agent memory) and found 99 score-corrupting errors in 1,540 questions (6.4%). Separately, we tested the LLM judge with adversarially generated wrong answers: it accepted 62.81% of vague-but-topical wrong answers. Some published system scores barely clear that bar. The full audit includes the methodology, documentation of all 99 errors, and reproducible scripts.
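
For anyone who wants to check the headline numbers, here is a minimal sketch of the arithmetic. The function names are made up for illustration, and the ceiling figure rests on an assumption that is not taken from the audit itself: that a correct answer is always marked wrong whenever the answer key for that question is wrong.

```python
# Figures quoted in the post; everything else here is illustrative.
TOTAL_QUESTIONS = 1540
KEY_ERRORS = 99


def error_rate(errors: int, total: int) -> float:
    """Fraction of questions whose answer key is wrong."""
    return errors / total


def score_ceiling(errors: int, total: int) -> float:
    """Best score a perfectly correct system could reach.

    Assumption (hypothetical): the judge grades strictly against the key,
    so every question with a wrong key costs a correct system one point.
    """
    return 1.0 - error_rate(errors, total)


print(f"{error_rate(KEY_ERRORS, TOTAL_QUESTIONS):.1%}")     # 6.4%
print(f"{score_ceiling(KEY_ERRORS, TOTAL_QUESTIONS):.1%}")  # 93.6%
```

Under that assumption, differences between systems near the top of the range would be hard to tell apart from noise in the answer key. The 62.81% judge acceptance rate is a measured figure from the audit and is not derived here.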

PaulHoule 3 months ago

I've worked in IR, and this has been true of TREC data sets from the beginning; it has also been true of visual data sets. The first step in building a world-beating commercial system has been to clean up the garbage in open evals to raise the possible accuracy ceiling.
- dial481OP 3 months ago
  
  That's encouraging to hear from someone with IR experience, thanks. Agree completely.
