Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity
arxiv.org
So, I posted this link. I actually did so assuming it had likely already been submitted, and I wanted to discuss it with people more qualified and educated in the subject than I am. The authors of this paper are definitely more qualified to publish such a paper than I am; I'm not an ML scientist and I am not trying to pose as one. The paper made me feel a sort of way, and brought a bunch of questions to mind that I didn't find answers to in the paper but, as I'm willing to suppose, maybe I'm not even qualified to read such a paper. I considered messaging the authors someplace like Twitter or in review/feedback on the arXiv submission (which I probably don't have access to with my user anyway, but I digress). I decided that might make me seem like a hostile critic or, more likely, I'd just come off as an unqualified idiot.
So... HN came quickly to mind as a place where I can share a thought, considered opinion, ask questions, with potential to have them be answered by very smart and knowledgeable folks on a neutral ground. If you've made it this far into my comment, I already appreciate you. :)
Ok so... I've already disclaimed any authority, so I will get to my point and see what you guys can tell me. I read the paper (it is 80+ pages, so admittedly I skimmed some math, but I also re-read some passages to feel more certain that I understood what they are saying).
I understand the phenomenon, and have no reason to doubt anything they put in the paper. But, as I mentioned, while reading it I had some intangible gut "feelings" that seeing that they have math to back what they're saying could not resolve for me. Maybe this is just because I don't understand the proofs. Still, I realized when I stopped reading that it actually wasn't anything that they said; it was what, to my naive brain, was not said, and I felt like it should have been.
I'll try to get to the point. I completely buy that reframing prompts can reduce mode collapse. But, as I understand it, the chat interface in front of the backend API of any LLM tested does not have insight into logits, probs, etc. The parameters passed with the prompt request, and the probabilities returned with the generations (if the API request asks for them), do not leak into the chat conversation context in any way. So when you prompt an LLM to return a probability, it's responding with, essentially, the language about probabilities it learned during its training. And it seems rather unlikely that many training datasets contain actual factual information about their own contents' distributions, from which the model could "learn" any useful probabilistic information about its own training data during training or RLHF.
So, a part of the paper I re-read more than once says (in 4.2): "Our method is training-free, model-agnostic, and requires no logit access." This statement is unequivocally, obviously true and honest, but (and I'm not trying to be rude or mean, I just feel like there is something subtle I'm missing or misunderstanding) that statement could also be true and honest if it said "Our method has no logit access, because the chat interface isn't designed that way". Here's what immediately follows in my mind: "the model learned how humans write about probabilities and will output a number that may be near to (or far away from) the actual probability of the token/word/sentence/whathaveyou, and we observed that if you prompt the model in a way that causes it to output a number that looks like a probability (some digits, a decimal somewhere), along with the requested five jokes, it has an effect on the 'creativity' of the list of five jokes it gives you."
So, naturally, one wonders what correlation, if any, there actually is between the numbers the LLM generates as "hallucinated" (I'm not trying to use the word in a loaded way; it's just the term that everyone understands for this meaning, with no sentiment behind my usage here) probabilities for the jokes it generated, and the actual probabilities thereof. I did see that they measured empirical frequencies of generated answers across runs and compared that empirical histogram to a proxy pretraining distribution, and that they acknowledge, and clearly state, that they did no comparison or correlation of the "probabilities" output by the model. So without continuing to belabor that point, this is probably core to my confusion about the framing of what the paper says that the phenomenon indicates.
It is hard for me to stop asking all the slight variations on these questions that lead me to write this, but I will stop, and try to get to a TL;DR I think dear HN readers may appreciate more than my exposition of befuddlement bordering on dubiousness:
I guess the TLDR of my comment is that I am curious if the authors examined any relationship between the LLM verbalized "probabilities" and actual model sampling likelihoods (logprobs or selection frequency). I am not convinced that the verbalized "probabilities" themselves are doing any work other than functioning as token noise or prompt reframing.
I didn't see a control for, or even a comparison to/against multi-slot prompts with arbitrary labels or non-semantic "decorative" annotation. In my experience poking and prodding LLMs as a user, desiring to influence generations in specific and sometimes unknown ways, even lightweight slotting without probability language substantially reduces repetition, which makes me wonder how much of the gain from VS is attributable to task reframing, as opposed to the probability verbalization itself.
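To make the first of those questions concrete, here is a rough sketch of the kind of check I have in mind. It is only an illustration under my own assumptions: the function names are made up, Pearson correlation is my choice and not something taken from the paper, and the inputs would have to come from whatever harness collects the model's responses (the verbalized numbers from one prompt, the samples from many independent runs).

```python
from collections import Counter
from math import sqrt


def empirical_frequencies(samples):
    """Map each distinct response to its share of all sampled responses."""
    counts = Counter(samples)
    total = len(samples)
    return {resp: n / total for resp, n in counts.items()}


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def verbalized_vs_empirical(verbalized, samples):
    """Correlate verbalized 'probabilities' with observed selection frequency.

    verbalized: dict of response -> number the model wrote next to it
    samples:    list of responses from repeated independent runs
    Responses that never show up in the samples count as frequency 0.
    """
    freqs = empirical_frequencies(samples)
    keys = sorted(verbalized)
    xs = [verbalized[k] for k in keys]
    ys = [freqs.get(k, 0.0) for k in keys]
    return pearson(xs, ys)
```

A correlation near zero would fit the "token noise" reading; a high one would suggest the numbers track something real. The same harness could be pointed at a control prompt with arbitrary labels in place of probabilities to compare repetition rates.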
This may not even be a topic of interest for anyone, and maybe nobody will even see my comment/questions, so I'll stop for now... but if anyone has insights, clarifications, or can point out where I'm being dense, I actually have quite a bit more to say and ask about this paper.
I can't really explain why I just had to see if I could get another insightful opinion on this paper. I usually don't have such a strong reaction when reading academic papers I may not fully understand, but there's some gap in my knowledge (or, less likely, there's something off about the framing of the phenomenon described), and it's causing me to really hope for discussion, so I can ask my perhaps even less-qualified questions pertaining to what boils down to mostly just my intuition (or maybe incomprehension. Heh.)
Thanks so much if you've read this and even more if you can talk to me about what I've used too many words to try to convey here.
Hello! I'm one of the main authors of the paper. Thanks for engaging with our work so thoughtfully – that's a very clear and valid question.
We didn't get around to addressing this within the paper itself – 80 pages is a lot, and deadlines, etc. But I have unpublished experiments that show that in a reasonably broad setting I'm doing some work in, verbalized probabilities are restoring a distribution that looks almost identical to the base distribution. It is not possible to demonstrate this on frontier models, since their public models are already mode-collapsed, and they don't share the base model or logprobs anyway. But I've established this to my personal satisfaction on large local models which offer base / post-trained pairs.
To share some intuition on why one might believe this is occurring: there are a bunch of tasks implicit in the pre-training corpus that encourage the model to learn this capability. Consider sentences in news and research articles like: "Scientists discover that [doing something] increases [some outcome] on [some population] by X%". It seems quite natural that the model might learn a pathway by which it can translate its base probabilities into the equivalent numeric tokens in order to "beat" the task of reducing loss on the "X%" prediction. I can even almost visualize how this works mechanically in terms of what the upper layers of an MLP would do to learn this, i.e. translating from weights into specific token slots. And this is almost certainly more parameter-efficient than constructing an entire separate emulated reality for filling in X. Although I'm not ruling out that the latter might still be happening – perhaps some future interp research might be able to validate this!
I'm actually working on a paper that packs up some of the above findings in passing. But if helpful in the meantime, this is also building on related work by Tian et al. 2023, "Just Ask for Calibration" [1] and Meister et al. 2024, "Benchmarking Distributional Alignment of LLMs" [2], which give some extra confidence here. Their findings indicate that, whether or not verbalized probabilities are rooted in the model's base probabilities, they seem to be useful for the purposes that people care about. (Oh, and you can probably set up an experiment to verify this independently with vLLM in a few Claude Code requests!)
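For anyone attempting that independent check, the comparison step might look something like the sketch below. It is an assumption-laden illustration, not the setup used in the paper or in the unpublished experiments: the helper names are hypothetical, total variation distance is just one reasonable metric, and the two input dictionaries (base-model probabilities per candidate, and the numbers the post-trained model verbalizes) would have to be collected separately, for example from a local base / post-trained pair.

```python
def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return {k: v / total for k, v in weights.items()}


def total_variation(p, q):
    """Total variation distance between two distributions over responses.

    Returns 0.0 for identical distributions and 1.0 for distributions
    with no shared support. Missing keys are treated as probability 0.
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def compare_to_base(base_probs, verbalized_probs):
    """Distance between base-model and verbalized distributions.

    Both inputs are renormalized first, since verbalized numbers
    often do not sum to exactly 1.
    """
    return total_variation(normalize(base_probs), normalize(verbalized_probs))
```

A small distance on held-out prompts would support the claim that verbalized probabilities approximate the base distribution; a large one would point toward the reframing explanation.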
Hope that was helpful – feel free to ping with follow-ups! (Although replies might be a little delayed, I happened to see this at a good time; having quite a crunchy week)