LLMs know more than what they say

arjunbansal.substack.com

148 points by nqnielsen a year ago · 18 comments

sanxiyn a year ago

I believe the paper to be cited is "The Internal State of an LLM Knows When It's Lying", published last year: https://arxiv.org/abs/2304.13734

  • ruby314 a year ago

    Thanks for sharing the reference. It's definitely related! But the approach in that paper is different in a couple of very significant ways:

    1. They train a classifier on top of the activations of the base LLM, so you would need access to the activations of the LLM you are using for the task. We use the activations of an open evaluator LLM (e.g. Llama-3.1-8B-Instruct).

    2. Our features are projections: a few numbers that derive their predictive power from the linear directions we project onto. That lets us fit our classifier with tens of samples instead of the thousands needed to fit a feedforward neural network on top of the high-dimensional activation space (toy sketch at the end of this comment).

    It would be interesting to compare the accuracy of the two approaches on the same benchmark, but our approach is broadly applicable even when you aren't working with an open LLM, and it doesn't require a ton of training data.
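
    A toy sketch of the idea in point 2 (made-up data and numpy/scikit-learn stand-ins, not the actual LSR code): project each sample's activations onto a handful of fixed directions, then fit a small classifier on those few numbers.

    ```python
    # Toy sketch: classify on activation projections instead of raw activations.
    # `acts` stands in for evaluator-LLM hidden states and `directions` for unit
    # vectors derived from contrast pairs; both are random placeholders here.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n_samples, d_model, n_directions = 40, 4096, 3   # tens of labels, a few directions

    acts = rng.normal(size=(n_samples, d_model))
    directions = rng.normal(size=(n_directions, d_model))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    labels = rng.integers(0, 2, size=n_samples)      # placeholder "hallucinated?" labels

    features = acts @ directions.T                    # each sample collapses to 3 numbers
    clf = LogisticRegression().fit(features, labels)  # cheap to fit on ~40 samples
    print(clf.score(features, labels))
    ```

    The classifier only ever sees a 3-dimensional input, which is why a few dozen labeled samples are enough.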

    • sanxiyn a year ago

      I didn't mean to suggest it as a competitor to the method presented (LSR: latent space readout); it is old, after all. LSR's use of an evaluator LLM and its ability to work with small samples (because it works with linear directions) do seem novel and useful to me.

      An advantage of being aware of early papers is that they accumulate citations, so you can often find good work in the reverse citations. I had a brief look and the following seem interesting:

      GRATH: Gradual Self-Truthifying for Large Language Models https://arxiv.org/abs/2401.12292

      TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space https://arxiv.org/abs/2402.17811

      In contrast, Lynx may be a better model and HaluBench may be a better benchmark, but the paper is too new, so it has zero reverse citations on Google Scholar at the moment. Interestingly, the Lynx paper does cite Azaria 2023, although in a very cursory way.

      • ruby314 a year ago

        "An advantage of being aware of early papers is that it accumulates citations so you can often find good works in reverse citations." - absolutely.

        Thanks again for sharing interesting references. Cool that GRATH uses contrast pairs in an iterative process with DPO, and that TruthX is steering (using the term broadly), creating an architecture to determine the inference-time edits.

        One thing about Lynx and HaluBench: as we understand it, HaluBench is the test set corresponding to Lynx's training data. They do have a couple of held-out data sources besides the four they train with, but as far as we could tell from their paper, they use the same hallucination-inducing function. I'd be curious to hear your thoughts on that.

      • ruby314 a year ago

        and yeah, we fully acknowledge that our list of references is not comprehensive :) thanks again for sharing!

    • eric_gu a year ago

      This is cool! I have not read up on evaluation techniques that use LLM-as-a-Judge, so I hadn't heard of the term "evaluator LLM" before.

      Questions that came to mind:

      - How are you deciding on the positive/negative concept pairs to generate your latent "evaluation direction?"

      - What layer of activations on the evaluator model do you use—the output layer?

      - What base model are you using for solving the HaluEval task?

      - I notice that LSR on Lynx 8B actually does worse than naive completion, and is in fact worse than LSR on Llama-3-8B-Instruct. Why do you think that is?

      • ruby314 a year ago

        Ty!

        - We generate contrast pairs (for this post, using gpt-4o) and do some post-processing for quality (synthetic data). The impact of different types of contrast pairs is an ongoing area of research for us.

        - We treat the evaluator model's layers as hyperparameters, similar to the steering research (some of which we cite in our “non-comprehensive list of references”). We also see that the middle layers tend to be most effective (see the layer-sweep sketch after this list).

        - For the base model, we use both Llama-3-8B-Instruct and Llama-3.1-8B-Instruct to show LSR taking advantage of the improved base model (maybe I misunderstood the question?)

        - Re: Lynx being worse with LSR, it depends on the data source. It's worse for HaluEval, but you can see in the PubMedQA table that it’s slightly better there. That’s consistent with the analysis in Contrastive Activation Addition https://arxiv.org/pdf/2312.06681 (section 6: sometimes the impacts of fine-tuning and latent space steering are cumulative, sometimes the opposite). Would love to know if anyone has seen research as to why.
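
        The layer sweep mentioned above looks roughly like this (gpt2 as a tiny stand-in for the evaluator model and two hand-written contrast pairs; not the actual pipeline):

        ```python
        # Sketch: sweep evaluator-model layers and check how well the contrast-pair
        # direction (difference of class means) separates truthful vs. hallucinated text.
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        tok = AutoTokenizer.from_pretrained("gpt2")
        model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
        model.eval()

        pos = ["Paris is the capital of France.", "Water boils at 100 C at sea level."]
        neg = ["Paris is the capital of Germany.", "Water boils at 40 C at sea level."]

        @torch.no_grad()
        def last_token_states(texts):
            # hidden state of the final token at every layer -> (n_layers, n_texts, d_model)
            per_text = []
            for t in texts:
                ids = tok(t, return_tensors="pt")
                hidden = model(**ids).hidden_states      # tuple: one tensor per layer
                per_text.append(torch.stack([h[0, -1] for h in hidden]))
            return torch.stack(per_text, dim=1)

        pos_h, neg_h = last_token_states(pos), last_token_states(neg)
        for layer in range(pos_h.shape[0]):
            direction = pos_h[layer].mean(0) - neg_h[layer].mean(0)
            direction = direction / direction.norm()
            # gap between the two classes' mean projections onto this layer's direction
            gap = (pos_h[layer] @ direction).mean() - (neg_h[layer] @ direction).mean()
            print(f"layer {layer:2d}: separation {gap.item():.3f}")
        ```

        With real contrast pairs you would pick the layer (typically a middle one) whose direction best separates a held-out set, not the pairs used to compute it.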

        • eric_gu a year ago

          Thanks for the reply!

          Re question #3: I'm not sure I understand why you need to vary the base model or how doing so would allow LSR to take advantage? Isn't your LSR technique used on the activations of the evaluator model?

          As a note of feedback, I found the original article a bit hard to understand even with multiple reads. I would have really benefited from a traditional "methodology" section like in an ML paper! The graphs upfront don't make sense to someone who isn't familiar with the problem setting, and even now I'm not sure if the x-axis in the HaluEval Benchmark bar chart refers to the base model or the evaluator model. Maybe it's just me.

          • ruby314 a year ago

            Re #3 - my bad, I was mixing terminology in my answer above. It’s the “base model” for the evaluator model (vs. a fine-tuned evaluator model). We're just using the labeled HaluBench dataset as the outputs to be evaluated, so there's no base model for the HaluEval task.

            Thanks for the feedback, really helpful. We may edit for clarity.

autokad a year ago

If I understand correctly that they project the LLM's internal activations onto meaningful linear directions derived from contrasting examples, then I guess this is similar to how we began to derive a lot more value from embeddings by using the embedding values for various things.

  • ruby314 a year ago

    Yes, that's correct! We project an evaluator LLM's internal activations onto meaningful linear directions derived from contrasting examples. The strongest connection is to LLM interpretability (the existence of meaningful linear directions) and steering research (computation from contrast pairs). This has been done with base model activations to understand base model behavior, but we show you can boost evaluation accuracy this way too, with a small amount of human feedback.

    • autokad a year ago

      This is pretty interesting work; I'm curious what emerges from this kind of thing in the future. I have been working on something 'similar' that involves the same problem (how do you evaluate a response without a source of truth?), and this gets at that.

      I was looking at it from the standpoint of the embeddings of the output at different temperatures.
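
      A crude sketch of that idea (hypothetical completions and sentence-transformers as a stand-in embedder): sample the same prompt at a few temperatures, embed each output, and treat low average pairwise similarity as a sign the answers aren't consistent.

      ```python
      # Sketch: embedding-dispersion check across temperature-sampled outputs.
      # The strings stand in for completions of the same prompt at different temperatures.
      from sentence_transformers import SentenceTransformer

      outputs_by_temperature = {
          0.2: "The Eiffel Tower is about 330 metres tall.",
          0.7: "The Eiffel Tower stands roughly 330 m high.",
          1.2: "The Eiffel Tower is around 500 metres tall.",  # drifting answer
      }

      embedder = SentenceTransformer("all-MiniLM-L6-v2")
      vecs = embedder.encode(list(outputs_by_temperature.values()),
                             normalize_embeddings=True)

      # mean pairwise cosine similarity; lower means the samples disagree more
      sims = vecs @ vecs.T
      n = len(vecs)
      consistency = (sims.sum() - n) / (n * (n - 1))
      print(f"consistency across temperatures: {consistency:.3f}")
      ```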

uiDevofNW a year ago

This is a stupid argument. I wish the author understood an ounce of how LLMs work. Of course they know more than what they say; that's because LLMs are nothing but probabilistic structures. They mix and match and take a probabilistic approach, so they are always making a choice between multiple options.

I wish there was a global mandatory course before these substacky authors write for fame.

tarasglek a year ago

This looks cool, but I'm confused as to how this is surfaced in your product; llama-8 is not present in your model list.

I thought maybe you offer hallucination detection, but I don't see that either, and RAG evals aren't visible.

nqnielsenOP a year ago

And how to use LLM interpretability research for applied evaluation
