Deep Dive into G-Eval: How LLMs Evaluate Themselves

medium.com

11 points by zlatkov 4 months ago · 6 comments

sirlapogkahn 4 months ago

We’ve tried G-Eval, but it hasn’t been very useful in practice. If we run the same input through the same model with the same G-Eval setup 10 times, we get significantly different results, so you can’t really draw any conclusions from them.
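The run-to-run variance described above is easy to quantify: call the judge N times on the same input and look at the spread of scores. Below is a minimal sketch of such a harness. The `judge_once` function is a stand-in, not a real G-Eval call — a real implementation would prompt an LLM with the evaluation criteria and parse a score; here seeded noise simulates run-to-run drift so the example is self-contained.

```python
import random
import statistics

def judge_once(text: str, rng: random.Random) -> float:
    """Stand-in for one LLM-judge call. A real G-Eval judge would send
    the text plus evaluation criteria to an LLM and return a score in
    [0, 1]; here we simulate the nondeterminism with seeded noise."""
    base = 0.7  # hypothetical "true" quality of this text (illustrative)
    return min(1.0, max(0.0, base + rng.gauss(0, 0.1)))

def repeated_eval(text: str, n: int = 10, seed: int = 0):
    """Score the same input n times and report mean and spread."""
    rng = random.Random(seed)
    scores = [judge_once(text, rng) for _ in range(n)]
    return statistics.mean(scores), statistics.stdev(scores)

if __name__ == "__main__":
    mean, spread = repeated_eval("some model output", n=10)
    print(f"mean={mean:.2f} stdev={spread:.2f}")
```

If the standard deviation is large relative to the gap between the systems you’re comparing, single-run scores are not meaningful; averaging over repeated runs (or reporting the spread alongside the mean) is the usual mitigation.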

kirchoni 4 months ago

Interesting overview, though I still wonder how stable G-Eval really is across different model families. Auto-CoT helps with consistency, but I’ve seen drift even between API versions of the same model.

  • zlatkovOP 4 months ago

    That's true. Even small API or model version updates can shift evaluation behavior. G-Eval helps reduce that variance, but it doesn’t eliminate it completely. I think long-term stability will probably require some combination of fixed reference models and calibration datasets.

eeasss 4 months ago

Are there any LLMs in particular that work best with G-Eval?

  • lyuata 4 months ago

    LLM Benchmark leaderboard for common evals sounds like a fun idea to me.

  • zlatkovOP 4 months ago

    I haven’t come across any research showing that a specific LLM consistently outperforms others for this. It generally works best with strong reasoning models that produce consistent outputs.
