How Confident are AI Classifiers About Their Own Confidence?

Injury classification using the 2024 NEISS

Python

Large Language Models

AI ‘Confidence’ Scores

LLMs have found a lot of practical uses as text classifiers across a ton of different areas. In my day job we commonly use them to sort and classify documents into different categories for routing or processing. I would say these have largely supplanted the previous generation of BERT-style neural network models for natural language processing. Indeed, LLMs are typically quite good at a variety of classification tasks with fairly minimal instructions and little or no task-specific training.

How confident is “confident”?

However, getting classification “probabilities” from an LLM is a bit more challenging than from conventional machine learning or neural network approaches. LLMs don’t have an analogous fit.predict_proba(X) to retrieve classification probabilities. There are really two ways that I have seen this done for LLM classification models. Broadly speaking, they are:

Prompt the LLM to estimate its confidence in the classification and return the result in the output
Directly extract token-level probabilities from the model output

The first is definitely the most common one I’ve seen, while the latter is slightly more direct and more unusual (but directly supported by OpenAI!). My friend, Andy Wheeler, has a good example of doing the latter in his book (Wheeler 2026). With this in mind, for this post I wanted to explore how close to reality the AI-generated confidence scores actually are.
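To make the second approach concrete, here is a minimal sketch of turning token-level log probabilities into a confidence score. The labels and numbers in `top_logprobs` are made up for illustration; in practice they would come from the model response (for OpenAI, by requesting `logprobs` and `top_logprobs` on the completion). The helper name `logprobs_to_probs` is my own, not part of any library.

```python
import math

# Hypothetical log probabilities for the first output token of a
# classification response; labels and values are invented for illustration.
top_logprobs = {"head": -0.105, "face": -2.659, "neck": -4.200}


def logprobs_to_probs(logprobs):
    """Convert token log probabilities to plain probabilities via exp()."""
    return {token: math.exp(lp) for token, lp in logprobs.items()}


probs = logprobs_to_probs(top_logprobs)

# The "confidence" is the probability of the most likely label token
top_label = max(probs, key=probs.get)
top_prob = probs[top_label]
print(top_label, round(top_prob, 3))
```

One caveat: when a label spans several tokens, the probability of the first token is only an approximation of the probability of the whole label.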

Calibrating probabilities

If we want to generate calibrated probabilities, the approach is slightly different in the multi-class scenario compared to the binary one. There are a few approaches, but the simplest seems to be a proposed “top versus all” approach (Le Coz, Herbin, and Adjed 2024). In short, this is really just a generalization of the binary case where we calibrate based on the correctness of the highest-probability class. In cases where we have many possible categories (here, over 20), it is difficult and probably not useful to directly calibrate each category. Rather, we use a model to calibrate the primary token probability. A simple way to do this is to use isotonic regression, which fits a non-decreasing step function that maps the raw token probabilities to observed accuracy:

# get the correct labels and token probabilities
correct <- res$is_correct
pred_prob <- res$body_part_logprob_prob_1

# Do a top-versus-all calibration
# using isotonic step function
tva_iso <- isoreg(pred_prob, correct)
tva_predict <- as.stepfun(tva_iso)
calibrated_conf_iso <- tva_predict(pred_prob)

# check what a token probability of .85 now maps to
print(tva_predict(.85))

Now we see that the model re-maps probabilities to bins corresponding to observed accuracy. For example, an original token probability of .85 returns a calibrated probability of .61. The plot below shows how the original probabilities are remapped to calibrated bins. In an actual production environment I would use a sample of cases to build the calibration model, validate it on a hold-out set, then apply it to future classifications.
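The fit-then-validate workflow can be sketched in plain Python. This is an illustration under stated assumptions, not the analysis from this post: the data are simulated (a deliberately overconfident classifier), the function names are my own, and the isotonic fit is a small pool-adjacent-violators implementation whose lookup rule at the step boundaries is a simplification rather than a copy of R’s `as.stepfun`.

```python
import bisect
import random


def fit_isotonic(x, y):
    """Fit a non-decreasing step function with pool-adjacent-violators.

    Returns (uppers, means): the largest x in each block and the block's mean y.
    """
    blocks = []  # each block is [sum of y, count, largest x]
    for xi, yi in sorted(zip(x, y)):
        blocks.append([float(yi), 1, xi])
        # Pool with the previous block while x is tied or monotonicity is violated
        while len(blocks) > 1 and (
            blocks[-2][2] == blocks[-1][2]
            or blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [b[2] for b in blocks], [b[0] / b[1] for b in blocks]


def predict_isotonic(uppers, means, p):
    """Look up the calibrated value for a raw probability p."""
    i = min(bisect.bisect_left(uppers, p), len(means) - 1)
    return means[i]


def brier(probs, outcomes):
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Simulated, deliberately overconfident classifier: the raw top-token
# probability is p, but the label is correct only with probability p ** 2.
rng = random.Random(0)
raw = [rng.uniform(0.5, 1.0) for _ in range(4000)]
correct = [1 if rng.random() < p ** 2 else 0 for p in raw]

# First half builds the calibration map, second half is the hold-out set
fit_x, fit_y = raw[:2000], correct[:2000]
hold_x, hold_y = raw[2000:], correct[2000:]

uppers, means = fit_isotonic(fit_x, fit_y)
hold_cal = [predict_isotonic(uppers, means, p) for p in hold_x]

print("hold-out Brier, raw:       ", round(brier(hold_x, hold_y), 4))
print("hold-out Brier, calibrated:", round(brier(hold_cal, hold_y), 4))
```

The Brier score is just one convenient check. The point is that the comparison is made on cases the calibration map never saw, so an improvement there is evidence the mapping will carry over to future classifications.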

References

Le Coz, Adrien, Stéphane Herbin, and Faouzi Adjed. 2024. “Confidence Calibration of Classifiers with Many Classes.” Advances in Neural Information Processing Systems 37: 77686–725.

Wheeler, Andrew. 2026. “Large Language Models for Mortals: A Practical Guide for Analysts with Python.” CrimRxiv.