LLMs know more than what they say
And how to use LLM interpretability research for applied evaluation
arjunbansal.substack.com

I believe the paper that should be cited is "The Internal State of an LLM Knows When It's Lying", published last year: https://arxiv.org/abs/2304.13734
Thanks for sharing the reference. It's definitely related! But the approach in that paper is different in a couple of very significant ways:
1. They train a classifier on top of the activations of the base LLM, so you would need access to the activations of the LLM you are using for the task. We instead use the activations of an open evaluator LLM (e.g. Llama-3.1-8B-Instruct).
2. Our features are projections - a few numbers that derive their predictive power from the linear directions we project onto. That allows us to fit our classifier with tens of samples, instead of the thousands needed to fit a feedforward neural network on top of the high-dimensional activation space (rough sketch below).
It would be interesting to compare the accuracy of the two approaches on the same benchmark, but our approach is broadly applicable even when you aren't working with an open LLM, and doesn't require a ton of training data.
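To make the projection idea concrete, here is a minimal sketch (not the authors' code): it assumes evaluator-LLM activations are already extracted as vectors and that a few unit directions have been derived from contrast pairs, then fits a tiny classifier on the resulting low-dimensional features. All names, shapes, and the toy data are illustrative assumptions.

```python
# Minimal sketch of an LSR-style readout: project evaluator activations onto a
# few linear directions (derived from contrast pairs), then fit a tiny
# classifier on tens of labeled samples. Random toy data stands in for real
# evaluator-LLM activations; shapes and names are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hidden_dim, n_directions = 4096, 3          # e.g. an 8B model's hidden size
directions = rng.normal(size=(n_directions, hidden_dim))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Stand-ins for evaluator activations on ~40 human-labeled examples.
train_acts = rng.normal(size=(40, hidden_dim))
train_labels = rng.integers(0, 2, size=40)  # 1 = hallucinated, 0 = faithful

def project(acts, dirs):
    """Reduce high-dimensional activations to a few projection features."""
    return acts @ dirs.T                    # (n_samples, n_directions)

clf = LogisticRegression().fit(project(train_acts, directions), train_labels)

test_acts = rng.normal(size=(10, hidden_dim))
hallucination_prob = clf.predict_proba(project(test_acts, directions))[:, 1]
print(hallucination_prob)
```

Because the classifier only sees a handful of projection features rather than the full activation vector, a few dozen labels are enough to fit it without overfitting.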
I didn't mean to suggest it as a competitor to the method presented (LSR: latent space readout). It is old, after all. LSR's use of an evaluator LLM and its ability to work with small samples (because it works with linear directions) do seem novel and useful to me.
An advantage of being aware of early papers is that they accumulate citations, so you can often find good work in their reverse citations. I had a brief look and the following seem interesting:
GRATH: Gradual Self-Truthifying for Large Language Models https://arxiv.org/abs/2401.12292
TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space https://arxiv.org/abs/2402.17811
In contrast, Lynx may be a better model and HaluBench may be a better benchmark, but the paper is too new, so it has zero reverse citations on Google Scholar at the moment. Interestingly, the Lynx paper does cite Azaria 2023, although in a very cursory way.
"An advantage of being aware of early papers is that it accumulates citations so you can often find good works in reverse citations." - absolutely.
Thanks again for sharing interesting references. Cool that GRATH uses contrast pairs in an iterative process with DPO, and that TruthX is steering (using the term broadly) with an architecture created to determine the inference-time edits.
One thing about Lynx and HaluBench - as we understand it, HaluBench is the test set for Lynx's training data. They do have a couple of held-out data sources besides the four they train with, but as far as we could tell from their paper they use the same hallucination-inducing function. Would be curious to hear your thoughts on that.
and yeah, we fully acknowledge that our list of references is not comprehensive :) thanks again for sharing!
This is cool! I have not read up on evaluation techniques that use LLM-as-a-Judge, so I hadn't heard of the term "evaluator LLM" before.
Questions that came to mind:
- How are you deciding on the positive/negative concept pairs to generate your latent "evaluation direction?"
- What layer of activations on the evaluator model do you use—the output layer?
- What base model are you using for solving the HaluEval task?
- I notice that LSR on Lynx 8B actually does worse than naive completion, and is in fact worse than LSR on Llama-3-8B-Instruct. Why do you think that is?
Ty!
- We generate contrast pairs (for this post, using gpt-4o) and do some post-processing for quality (synthetic data). The impact of different types of contrast pairs is a continuing area of research for us.
- We treat the evaluator model layers as hyperparameters, similar to the steering research (some of which we cite in our “non-comprehensive list of references”). We also see that the middle layers tend to be most effective (a rough sketch of such a layer sweep follows this list).
- For the base model, we use both Llama-3-8B-Instruct and Llama-3.1-8B-Instruct to show LSR taking advantage of the improved base model (maybe I misunderstood the question?)
- Re: Lynx being worse with LSR, it depends on the data source. It's worse for HaluEval, but you can see in the PubMedQA table that it's slightly better there. That's consistent with the analysis in Contrastive Activation Addition https://arxiv.org/pdf/2312.06681 (section 6: sometimes the impacts of fine-tuning and latent space steering are cumulative, sometimes the opposite). Would love to know if anyone has seen research as to why.
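Here is a rough sketch of that layer sweep, under stated assumptions: the direction at each layer is computed as a difference of means over contrast-pair activations (a common construction in the steering literature; the post does not say which construction it uses), and the layer is chosen on a small labeled validation set. Random data stands in for real per-layer activations.

```python
# Hypothetical layer sweep: derive a candidate direction at every layer from
# contrast pairs via difference of means, then treat the layer as a
# hyperparameter and keep the one that separates a small validation set best.
# Toy data stands in for real per-layer evaluator activations.
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_pairs, hidden_dim = 32, 20, 4096

# Activations for the positive/negative half of each contrast pair, per layer.
pos_acts = rng.normal(size=(n_layers, n_pairs, hidden_dim)) + 0.1
neg_acts = rng.normal(size=(n_layers, n_pairs, hidden_dim))

# Small labeled validation set (tens of samples), with activations per layer.
val_acts = rng.normal(size=(n_layers, 30, hidden_dim))
val_labels = rng.integers(0, 2, size=30)

best_layer, best_acc = None, -1.0
for layer in range(n_layers):
    direction = pos_acts[layer].mean(0) - neg_acts[layer].mean(0)
    direction /= np.linalg.norm(direction)
    scores = val_acts[layer] @ direction               # 1-D projection feature
    preds = (scores > np.median(scores)).astype(int)   # crude threshold
    agree = (preds == val_labels).mean()
    acc = max(agree, 1 - agree)                        # direction sign is arbitrary
    if acc > best_acc:
        best_layer, best_acc = layer, acc

print(f"selected layer {best_layer} (val accuracy {best_acc:.2f})")
```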
Thanks for the reply!
Re question #3: I'm not sure I understand why you need to vary the base model or how doing so would allow LSR to take advantage? Isn't your LSR technique used on the activations of the evaluator model?
As a note of feedback, I found the original article a bit hard to understand even with multiple reads. I would have really benefited from a traditional "methodology" section like in an ML paper! The graphs upfront don't make sense to someone who isn't familiar with the problem setting, and even now I'm not sure if the x-axis in the HaluEval Benchmark bar chart refers to the base model or the evaluator model. Maybe it's just me.
Re #3 - my bad, I was mixing terminology in my answer above. It’s the “base model” for the evaluator model (vs. a fine-tuned evaluator model). We just use the labeled HaluBench dataset as the outputs to be evaluated, so there is no base model for the HaluEval task.
Thanks for the feedback, really helpful. We may edit for clarity.
Ah understood. Makes sense now!
If what I understand is correct - that they project the LLM's internal activations onto meaningful linear directions derived from contrasting examples - I guess this is similar to how we began to derive a lot more value from embeddings by using the embedding values for various things.
yes that's correct! we project an evaluator LLM's internal activations onto meaningful linear directions derived from contrasting examples. the strongest connection is to LLM interpretability (the existence of meaningful linear directions) and steering research (computing directions from contrast pairs). This has been done with base model activations to understand base model behavior, but we show you can boost evaluation accuracy this way too, with a small amount of human feedback.
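For readers wondering where those activations come from, here is a hedged sketch using Hugging Face transformers (a tooling assumption on my part, not necessarily what the authors use): it pulls per-layer hidden states for an evaluation prompt from an open evaluator model, and the last-token vector at a chosen layer is what would be projected onto the learned directions.

```python
# Sketch (tooling assumption, not necessarily the authors' stack): extract
# per-layer hidden states of an open evaluator LLM for an evaluation prompt.
# The last-token vector at a chosen (often middle) layer is what gets
# projected onto the contrast-pair directions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # example evaluator; gated repo
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = (
    "Context: ...\nQuestion: ...\nAnswer: ...\n"
    "Is the answer supported by the context?"
)
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

layer = len(out.hidden_states) // 2           # middle layer, a typical choice
activation = out.hidden_states[layer][0, -1]  # last-token hidden state, (hidden_dim,)
# `activation` is then projected onto the learned directions as in the sketch above.
```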
This is pretty interesting work; I am curious what emerges from this kind of thing in the future. I have been working on something 'similar' that involves the same problem - how do you evaluate a response without a source of truth - and this gets at that.
I was looking at it from the standpoint of the embeddings of the output at different temperatures.
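One possible reading of that idea (an interpretation, not the commenter's actual code): sample the same prompt at several temperatures, embed each completion, and treat pairwise embedding similarity as a rough consistency signal. `sample_completion` below is a hypothetical stand-in for whatever generation API is used.

```python
# Hypothetical consistency check based on embeddings of outputs sampled at
# different temperatures: low pairwise similarity can flag answers with no
# stable grounding. `sample_completion` is a placeholder to be replaced with
# a real generation call.
import numpy as np
from sentence_transformers import SentenceTransformer

def sample_completion(prompt: str, temperature: float) -> str:
    """Hypothetical stand-in: call your LLM of choice here."""
    raise NotImplementedError

def consistency_score(prompt: str, temperatures=(0.3, 0.7, 1.0, 1.3)) -> float:
    completions = [sample_completion(prompt, temperature=t) for t in temperatures]
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    embs = embedder.encode(completions, normalize_embeddings=True)
    sims = embs @ embs.T                                   # pairwise cosine similarity
    upper = sims[np.triu_indices(len(completions), k=1)]   # off-diagonal pairs only
    return float(upper.mean())                             # near 1.0 = answers agree closely
```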
This is a stupid argument. I wish the author understood an ounce of how LLMs work. Of course they know more than what they say. That's because LLMs are nothing but probabilistic structures. They mix and match and take a probabilistic approach. Therefore, they are always making a choice between multiple options.
I wish there were a global mandatory course before these substacky authors write for fame.
This looks cool, but I'm confused as to how this is surfaced in your product; Llama-3-8B is not present in your model list.
I thought maybe you offer hallucination detection, but I don't see that either. RAG evals also aren't visible.
Thanks for the q.
The LSR version of AutoFeedback is on our Pro tier. You can sign up for more info on that here: https://tally.so/r/w2XVQA
In the free trial of Log10.io you can use the ICL (self-improving) version of AutoFeedback for hallucination detection and RAG evals. In general, you can customize evals for a wide range of use cases using AutoFeedback: https://docs.log10.io/feedback/auto_feedback https://github.com/log10-io/log10-cookbook/tree/main https://arjunbansal.substack.com/p/hybrid-evaluation-scaling...
As for models, we support evaluating a wide range of models including Llama; in addition we support OpenAI, Anthropic, Gemini, Mistral, MosaicML, Together, and self-hosted models. We are also compatible with frameworks such as LangChain, Magentic, and LiteLLM. For how to integrate different models, please see here: https://github.com/log10-io/log10
Also wanted to address the confusion about the role of Llama-3-8B and Llama-3.1-8B. In the blog post we use these models as examples of an evaluator LLM. We select what is best for your custom eval under the hood; LSR is just one example of the research powering our custom evals.