Researchers describe how to tell if ChatGPT is confabulating

But the LLM could also run into a similar situation when it has multiple options for phrasing the right answer. To use an example from the researchers’ paper, “Paris,” “It’s in Paris,” and “France’s capital, Paris” are all valid answers to “Where’s the Eiffel Tower?” So, statistical uncertainty, termed entropy in this context, can arise either when the LLM isn’t certain about how to phrase the right answer or when it can’t identify the right answer.
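To make the distinction concrete, here is a rough sketch in Python (not from the paper; the answers and probabilities are made up) showing how entropy computed over raw answer strings treats every phrasing of “Paris” as a separate outcome, inflating the measured uncertainty even though the model effectively knows the answer:

```python
import math

# Hypothetical sampled answers with made-up probabilities (not real model output).
# The first three phrasings are all correct; the last one is wrong.
answers = {
    "Paris": 0.4,
    "It's in Paris": 0.3,
    "France's capital, Paris": 0.2,
    "Rome": 0.1,
}

# Naive entropy treats every distinct string as a separate outcome.
naive_entropy = -sum(p * math.log(p) for p in answers.values())
print(f"naive entropy: {naive_entropy:.2f} nats")  # ~1.28: high, despite the answer being known
```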

This means it’s not a great idea to simply force the LLM to return “I don’t know” whenever its answers show high entropy; when the uncertainty is only about phrasing, we’d be blocking a lot of correct answers.

So instead, the researchers focus on what they call semantic entropy. This takes all of the statistically likely answers generated by the LLM and determines how many of them are semantically equivalent. If a large number share the same meaning, then the LLM is likely uncertain about phrasing but has the right answer. If not, then it’s presumably in a situation where it’s prone to confabulation and should be prevented from doing so.
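Continuing the made-up numbers from above, here is a minimal sketch of that idea: pool the probability of answers that share a meaning, then compute entropy over the resulting clusters rather than over the individual strings. (The clustering is done by hand here; the researchers’ automatic method is described below.)

```python
import math

# The same hypothetical answers, now grouped by meaning rather than by exact wording.
clusters = {
    "the Eiffel Tower is in Paris": 0.4 + 0.3 + 0.2,  # three phrasings of one fact
    "the Eiffel Tower is in Rome": 0.1,
}

semantic_entropy = -sum(p * math.log(p) for p in clusters.values())
print(f"semantic entropy: {semantic_entropy:.2f} nats")  # ~0.33: the meaning is nearly certain
```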

Extracting meaning

How does this work in practice? The description is remarkably straightforward:

Our method works by sampling several possible answers to each question and clustering them algorithmically into answers that have similar meanings, which we determine on the basis of whether answers in the same cluster entail each other bidirectionally. That is, if sentence A entails that sentence B is true and vice versa, then we consider them to be in the same semantic cluster.

If a single cluster predominates, then the AI is selecting an answer from within a collection of options that share the same factual content. If there are multiple clusters, then the AI is choosing among collections of answers that each carry different factual content, a situation that’s likely to result in confabulation.
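As a rough illustration (a simplified sketch, not the paper’s exact procedure), the clustering step can be written as a greedy loop over sampled answers, where `entails(a, b)` is a stand-in for whatever model judges entailment; a new answer joins a cluster only if entailment holds in both directions against that cluster’s first member:

```python
def cluster_by_meaning(answers, entails):
    """Group sampled answers into clusters of bidirectionally entailing sentences."""
    clusters = []  # each cluster is a list of answers judged to share one meaning
    for answer in answers:
        for cluster in clusters:
            representative = cluster[0]
            # Same cluster only if entailment holds in both directions.
            if entails(answer, representative) and entails(representative, answer):
                cluster.append(answer)
                break
        else:  # no existing cluster matched, so this answer starts a new one
            clusters.append([answer])
    return clusters
```

One dominant cluster then signals uncertainty about phrasing; many small clusters signal the kind of disagreement about facts that precedes confabulation.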

Beyond its conceptual simplicity, implementing a system based on these ideas is also straightforward. Most major LLMs can produce the set of statistically likely answers to a query that’s needed to evaluate semantic entropy. And there are already LLMs and other software, known as natural language inference tools, built to determine whether two sentences entail each other. Because those tools exist off the shelf, no supervised training is needed: the system doesn’t have to be fed examples of confabulations in order to learn to estimate the semantic entropy of a set of potential answers.
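For the entailment check itself, an off-the-shelf NLI model can stand in for the `entails` predicate used in the sketch above. The snippet below uses the Hugging Face transformers library; the specific model name is an illustrative choice, not necessarily the one the researchers used:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "microsoft/deberta-large-mnli"  # illustrative choice of NLI model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def entails(premise: str, hypothesis: str) -> bool:
    """Return True if the NLI model judges that `premise` entails `hypothesis`."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    label = model.config.id2label[logits.argmax(dim=-1).item()]
    return label.upper() == "ENTAILMENT"
```

Plugging this into the clustering sketch above yields a rough but complete semantic-entropy check, with no additional training required.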