High rate of LLM (GPT5) hallucinations in dense stats domains (cricket)

3 points by sp1982 4 months ago · 2 comments

Disclaimer: I am not an ML researcher, so the terms are informal/wonky. Apologies!

I’m doing a small experiment to see whether models “know when they know” on T20 international cricket scorecards (source: cricsheet.com). The idea is to test models on publicly available data they likely saw during training, and see whether they hallucinate or admit they don't know.

Setup: each question is taken from a single T20 match. The model must return an answer (a number or a choice from the given options) or `no_answer`.
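For concreteness, here is a minimal sketch of what one turn of that protocol could look like. It assumes the official `openai` Python client; the prompt wording and the `ask` helper are illustrative, not the repo's actual code.

```python
# Illustrative only: one question per call, answer or abstain with no_answer.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Question about a single T20 international match:\n{question}\n\n"
    "Reply with only the answer ({answer_kind}), "
    "or the exact string no_answer if you are not sure."
)

def ask(model: str, question: str, options: list[str] | None = None) -> str:
    """Pose one scorecard question; the model must answer or return no_answer."""
    answer_kind = "one of: " + ", ".join(options) if options else "a number"
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(
            question=question, answer_kind=answer_kind)}],
    )
    return resp.choices[0].message.content.strip()
```

Each response is then graded against the Cricsheet scorecard for that match.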

Results (N=100 questions per model):

| Model | Answer rate | Accuracy (all questions) | Accuracy (answered) | Hallucination (answered) | Wrong/100 |
|---|---|---|---|---|---|
| gpt-4o-search-preview | 0.96 | 0.88 | 0.91 | 0.09 | 9 |
| gpt-5 | 0.35 | 0.27 | 0.77 | 0.23 | 8 |
| gpt-4o-mini | 0.37 | 0.14 | 0.38 | 0.62 | 23 |
| gpt-5-mini | 0.05 | 0.02 | 0.40 | 0.60 | 3 |
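To make the columns precise, here is a small sketch of the metric definitions used above; the `Result` record is a stand-in for whatever per-question format the repo actually stores.

```python
# Sketch of how the reported metrics relate to per-question outcomes.
from dataclasses import dataclass

@dataclass
class Result:
    answered: bool  # model returned something other than no_answer
    correct: bool   # answered and matched the scorecard ground truth

def summarize(results: list[Result]) -> dict[str, float]:
    answered = [r for r in results if r.answered]
    correct = [r for r in results if r.correct]
    wrong = [r for r in answered if not r.correct]
    return {
        "answer_rate": len(answered) / len(results),
        "accuracy": len(correct) / len(results),                    # over all questions
        "accuracy_answered": len(correct) / max(len(answered), 1),  # conditioned on answering
        "hallucination_answered": len(wrong) / max(len(answered), 1),
        "wrong_per_100": 100 * len(wrong) / len(results),
    }
```

In other words, hallucination (answered) is 1 − accuracy (answered), and Wrong/100 ≈ answer rate × hallucination (answered) × 100.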

Note: most of the remaining “errors” for the search-enabled model are obscure or disputed cases where public sources disagree.

It seems that, for domains where models have likely seen some of the data during training, it’s better to rely on abstention + RAG than on a larger model with more coverage but a worse hallucination rate on the questions it does answer.
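The decision rule I have in mind is simply “trust the abstaining model when it answers, otherwise fall back to retrieval.” A minimal sketch, with `ask_model` and `rag_answer` as hypothetical callables rather than anything from the repo:

```python
from typing import Callable

def answer_with_fallback(
    question: str,
    ask_model: Callable[[str], str],   # calibrated model: returns an answer or "no_answer"
    rag_answer: Callable[[str], str],  # retrieval-backed pipeline over the scorecards
) -> str:
    draft = ask_model(question)
    if draft != "no_answer":
        return draft                 # accept only when the model chooses to answer
    return rag_answer(question)      # otherwise ground the answer in retrieved data
```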

Code/Data: https://github.com/jobswithgpt/llmcriceval

whinvik 4 months ago

Is this exercise done to determine what the model can produce from its training data, or is the data shown to the model again?
