Deep Dive into G-Eval: How LLMs Evaluate Themselves


Alexander Zlatkov

When you ship a typical web service, you instrument it with metrics and traces. Application Performance Monitoring (APM) tells you if requests are fast, if error rates spike, and if your system is behaving as expected in production.

You also write unit tests and integration tests to catch regressions before they go live. Those two pillars, pre‑release testing and live observability, are what give you confidence in traditional software.

Now we’re integrating LLMs into everything from chatbots to workflow automation.

They’re not deterministic functions.
They don’t throw exceptions when they hallucinate.
They can produce fluent nonsense that looks plausible until you look closely.

What we need is a way to observe and test the qualitative behavior of these models and treat their outputs as first‑class citizens in our quality pipeline.

That’s where LLM evaluations (evals) come in. Evals act like unit tests and health checks for model outputs: they tell you whether the answer you got is accurate, relevant, helpful, or safe.

In the rest of this article, we’ll explore how evals work, why LLM‑as‑a‑Judge has emerged as a powerful technique for running them at scale, and how frameworks like G‑Eval implement it in practice.

What are Evals?

When you ship an LLM‑powered feature, you need a way to measure how well the model and the corresponding prompts are doing on the tasks you care about.

That’s what evals are: structured tests for LLMs. Instead of checking whether a function returns the correct value, you ask questions like:

  • “Did the summary capture the key facts of the article?”
  • “Was the generated SQL valid and efficient?”
  • “Is the chatbot’s answer helpful and non‑toxic?”

Evals can take many forms:

  • Reference‑Based Metrics compare the model’s output against a ground truth. Classic examples include BLEU and ROUGE for translation and summarization.
  • Unit‑Style Tests run specific prompts and assert that certain patterns or keywords appear in the response.
  • Human‑in‑the‑Loop Reviews involve annotators rating outputs for quality, factuality, or safety.
  • LLM‑as‑a‑Judge (discussed next) automates the scoring against custom evaluation criteria you define.
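As a concrete illustration of the unit-style form, a test might run a fixed prompt and assert on patterns in the output. This is a minimal sketch; `generate` stands in for whatever model client you use:

```python
# A unit-style eval: run a fixed prompt and assert that expected
# patterns appear in the response. `generate` is a placeholder
# for your actual model call.
def eval_sql_generation(generate) -> bool:
    output = generate("Write a SQL query that counts users registered in 2024.")
    return "SELECT" in output.upper() and "COUNT" in output.upper()

# Example with a stubbed model:
fake_generate = lambda prompt: "SELECT COUNT(*) FROM users WHERE year = 2024;"
print(eval_sql_generation(fake_generate))  # True
```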

LLM‑as‑a‑Judge: Using Models to Evaluate Models

Once you’ve decided what you want to measure, the next question is how to measure it in a scalable and reliable manner. That’s where LLM‑as‑a‑Judge comes in.

Instead of relying on humans to rate every output or on brittle metrics like BLEU and ROUGE, you can ask an LLM to be the evaluator (judge).

Research has shown that, when used correctly, state-of-the-art LLMs (even older ones such as GPT-4) can agree with human judgments up to 85% of the time. The approach also makes it possible to run thousands of tests whenever you tweak a prompt, model, or chain.

These are the types of LLM-as-a-Judge for different use cases:

  1. Single‑Output: takes in one LLM response (or one full conversation with an LLM) at a time. It assigns a score based on specific evaluation criteria such as correctness, coherence, or helpfulness.
    This method can be reference-based, where the judge compares the answer to an expected “golden dataset,” or referenceless, where it judges quality without a reference.
  2. Pairwise: The judge compares two model responses to the same prompt and decides which one is better according to chosen criteria. Instead of assigning a numeric score, it outputs a winner (and sometimes a tie). It’s essentially an A/B test powered by an LLM, ideal for comparing models, prompts, or fine-tuning strategies to see which produces higher-quality results.
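A pairwise judge can be sketched as a prompt that asks for a winner. The prompt wording and the `call_llm` client below are illustrative assumptions, not a fixed API:

```python
# Hypothetical pairwise-judge template; adapt the criteria to your use case.
PAIRWISE_PROMPT = """
You are comparing two responses to the same prompt.
Criteria: helpfulness and factual accuracy.

Prompt:
{prompt}

Response A:
{a}

Response B:
{b}

Which response is better? Answer with exactly one of: A, B, TIE.
"""

def pairwise_judge(call_llm, prompt: str, a: str, b: str) -> str:
    # `call_llm` is a placeholder for your LLM client of choice.
    verdict = call_llm(PAIRWISE_PROMPT.format(prompt=prompt, a=a, b=b))
    return verdict.strip().upper()
```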


To implement either type of judge in your own stack:

  • Pick a Model: The judge can be the same model you’re evaluating or a more capable one. Many teams use GPT‑5, Claude, or specialized scorer models.
  • Define a Clear Rubric: Write a short set of criteria explaining what “good” means. This might be factuality, coherence, safety, conciseness, or domain‑specific guidelines.
  • Provide the Inputs: Pass in the original prompt, the candidate outputs, and any reference text or metadata the judge needs to make an informed decision.

The judge then produces a metric score, based on the evaluation criteria you define. For example, you could use a prompt like this to have an LLM judge assess the coherence of a summary:

prompt = """ 
You will receive a summary (LLM output) of a news article.
Your task is to rate the summary on how coherent it is to the original text (input).

Original Text:
{input}

Summary:
{llm_output}

Score:
"""

By collecting these metric scores, you can build a comprehensive set of LLM evaluation results for:

  • Automated Regression Testing: Run prompts through your system and see if quality metrics drift. Keep in mind that the consistency of LLM scoring, that is, how reliably the model assigns the same score to the same answer across runs, can be a challenge. We discuss ways to address this later in the article.
  • Model Comparison: Quickly benchmark a new model against your production model across a battery of tasks.
  • Prompt and Chain Optimization: Evaluate different system prompts, tool call orders, or memory strategies to find the best configuration.

Used thoughtfully, this technique becomes a powerful complement to traditional metrics and human reviews. Later in the article, we’ll explore the limitations and pitfalls of LLM‑as‑a‑Judge: when it works, when it doesn’t, and how to mitigate its biases.

What is G-Eval?

G-Eval is a framework that applies the LLM-as-a-Judge paradigm using a structured chain-of-thought (CoT) process to evaluate LLM outputs against any user-defined criteria. Instead of relying on fixed benchmarks or static reference datasets, G-Eval converts natural-language evaluation instructions into step-by-step reasoning tasks that an LLM can follow to produce consistent, interpretable judgments.

Originally introduced in the paper NLG Evaluation using GPT-4 with Better Human Alignment, G-Eval was designed to overcome the rigidity of traditional metrics like BLEU and ROUGE, which struggle with open-ended or subjective tasks that depend on nuance, style, or semantic alignment.

G-Eval is composed of three main components: the prompt, automatic CoT (Chain of Thought) reasoning, and the scoring function.

1. The Prompt

The initial prompt is the only user-defined input to G-Eval in most out-of-the-box implementations of the metric. This prompt is a natural language instruction that defines:

  • The Evaluation Task
  • The Evaluation Criteria
TASK_INTRODUCTION = """

You will be given one summary written for a news article. Your task is to rate the summary on one metric.
Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing and refer to it as needed.

"""

The initial prompt should also include evaluation criteria for the model to use. We may also define the scoring system we wish the model to use. For example:

COHERENCE_EVALUATION_CRITERIA = """  

Coherence (1-5) - the collective quality of all sentences.
We align this dimension with the DUC quality question of structure and coherence whereby the summary should be well-structured and well-organized. The summary should not just be a heap of related information, but should build from sentence to a coherent body of information about a topic.

"""

As the only user-defined input to G-Eval, you’ll spend most of your time optimizing these two variables.

2. Auto Chain-of-Thought (CoT) Reasoning

Traditionally, Chain-of-Thought (CoT) prompting involves breaking down reasoning steps one by one to help an LLM think more clearly and make better decisions, which improves its outputs. Doing this manually can be slow and labor-intensive.

Modern LLMs can now handle this automatically. When prompted, they can generate their own reasoning steps, a process known as Auto CoT. This not only gives G-Eval a stronger, more structured reasoning flow (and better results), but also makes it far more scalable than manual methods.

This automatic reasoning step is especially valuable for complex evaluations that require multiple layers of logic or interdependent checks.

To create a prompt for the CoT, we take the TASK_INTRODUCTION and COHERENCE_EVALUATION_CRITERIA prompts and append “Evaluation Steps:” at the end:

COT_PROMPT = f"""{TASK_INTRODUCTION.strip()} 

Evaluation Criteria: {COHERENCE_EVALUATION_CRITERIA.strip()}

Evaluation Steps:

"""

Which will result in the following prompt:

“You will be given one summary written for a news article. Your task is to rate the summary on one metric.   

Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing and refer to it as needed.

Evaluation Criteria: Coherence (1-5) - the collective quality of all sentences. We align this dimension with the DUC quality question of structure and coherence whereby the summary should be well-structured and well-organized. The summary should not just be a heap of related information, but should build from sentence to a coherent body of information about a topic.

Evaluation Steps:”

Now we can call the LLM with this prompt, which results in our actual EVALUATION_STEPS:

"""
1. Read the summary and the source document carefully.

2. Compare the summary to the source document and identify the main points of the article.

3. Assess how well the summary covers the main points of the article, and how much irrelevant or redundant information it contains.

4. Assign a relevance score from 1 to 5.
"""

It’s important to note again that these EVALUATION_STEPS are a result of our CoT prompt and are automatically generated by the LLM.

3. Scoring Function

The user-defined task introduction and evaluation criteria, along with the auto-generated evaluation steps, are now concatenated with the original context and target text, and the result is passed to the scoring function. The scoring function calls the evaluating LLM, which is prompted to output a score using a form-filling paradigm. The resulting input template might look like the following example from the original G-Eval paper:

SCORING_FUNCTION = f"""{TASK_INTRODUCTION.strip()} 

Evaluation Criteria:
{COHERENCE_EVALUATION_CRITERIA.strip()}

Evaluation Steps:
{EVALUATION_STEPS.strip()}

Example:
Source Text:
{document}
Summary:
{summary}

Evaluation Form (scores ONLY):
- Coherence:
"""

Where document and summary are provided for the specific task.

So finally, the prompt will look something like this:

“You will be given one summary written for a news article. Your task is to rate the summary on one metric.   

Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing and refer to it as needed.

Evaluation Criteria: Coherence (1-5) - the collective quality of all sentences. We align this dimension with the DUC quality question of structure and coherence whereby the summary should be well-structured and well-organized. The summary should not just be a heap of related information, but should build from sentence to a coherent body of information about a topic.

Evaluation Steps:

1. Read the summary and the source document carefully.

2. Compare the summary to the source document and identify the main points of the article.

3. Assess how well the summary covers the main points of the article, and how much irrelevant or redundant information it contains.

4. Assign a relevance score from 1 to 5.

Example:

Source Text:

Some random source text to be summarized.

Summary:

Summary of random text.

Evaluation Form (scores ONLY):

- Coherence:“

Practical Challenges

One practical challenge with the scoring returned by the LLM is that the system tends to produce outputs clustered around a dominant integer value.

As shown in this paper, models sparsely predict scores within the 1–100 range on the SummEval dataset. Frequencies of some scores, such as 90 and 95, are far higher than scores such as 92 or 19, and much of the range is almost entirely ignored (1–60).


Large Language Models are Inconsistent and Biased Evaluators

This compression of responses leads to low score variance, which in turn weakens the correlation between automated evaluations and human judgments.

A second issue concerns score granularity. Despite explicit instructions to output decimal or fine-grained scores, the LLM typically outputs only discrete integer values. As a result, many model outputs receive identical scores, masking subtle quality differences between texts that would be perceptible to human evaluators.

The graph above also shows that the score distribution has several peaks at round numbers such as 70, 80, and 90 (similarly for 75, 85, and 95), indicating that LLM evaluators exhibit round-number bias, just like humans.

To mitigate these problems, G-Eval implements a refinement that leverages the token probability distribution produced by the LLM.

What is the Token Probability Distribution

When an LLM generates text, it doesn’t simply “choose” words; it predicts a probability distribution over all possible next tokens in its vocabulary.

For example, if you prompt it with:

“Rate the summary for coherence (1–5):”

The model internally computes something like:

+-------+-------------+
| Token | Probability |
+-------+-------------+
| "1"   | 0.05        |
| "2"   | 0.10        |
| "3"   | 0.45        |
| "4"   | 0.35        |
| "5"   | 0.05        |
+-------+-------------+

These probabilities represent the model’s confidence that each token (score) is the correct output, based on the prompt and context.

Normally, only the most likely token (here, 3) would be selected as the output. But that throws away valuable information. The fact that 4 was nearly as likely (35%) still tells us something about the model’s uncertainty.

How G-Eval utilizes these distributions

The authors of the G-Eval paper realized that using just the single chosen integer (like 3) caused low variance and poor correlation with human judgments. So, they proposed a probability-weighted scoring scheme using the whole distribution instead of the single highest probability value.

Mathematically, they define the final continuous score as:

score = Σᵢ p(sᵢ) × sᵢ

where:

  • sᵢ = score value (e.g., 1–5)
  • p(sᵢ) = model’s predicted probability for each score token

Using the earlier example:

score = (0.05 * 1) + (0.1 * 2) + (0.45 * 3) + (0.35 * 4) + (0.05 * 5)
which results in 3.25

That gives a continuous, fine-grained score instead of just 3.
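In code, the probability-weighted score over the distribution above is simply an expected value:

```python
# Expected value over the score-token distribution: sum of p(s_i) * s_i.
def weighted_score(distribution: dict) -> float:
    return sum(float(token) * prob for token, prob in distribution.items())

probs = {"1": 0.05, "2": 0.10, "3": 0.45, "4": 0.35, "5": 0.05}
print(round(weighted_score(probs), 2))  # 3.25
```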

These probabilities can be obtained from the APIs of the LLM you use, or if not supported, there are other techniques to calculate them.

Accessing The Probabilities

Accessing token probabilities can differ between older and newer models, and we’ll cover both approaches.

Earlier generations of LLMs, such as GPT-3.5, provided direct access to token-level log-probabilities through their APIs. These log-probabilities, returned via the parameters logprobs=true and top_logprobs=k, quantified the model’s confidence in each possible next token. In practice, this information allowed evaluators to reconstruct the probability distribution over discrete scores or response options, thereby producing continuous, confidence-weighted evaluations rather than categorical outputs.

For example, when assessing coherence on a 1–5 scale, the evaluator could call an API such as:

{
  "model": "text-davinci-003",
  "prompt": "Rate the summary for coherence (1–5):",
  "max_tokens": 1,
  "logprobs": true,
  "top_logprobs": 5
}

The response would return:

{
  "id": "cmpl-7eGdJ8B3",
  "object": "text_completion",
  "created": 1684560000,
  "model": "text-davinci-003",
  "choices": [
    {
      "text": "3",
      "logprobs": {
        "tokens": ["3"],
        "token_logprobs": [-0.58],
        "top_logprobs": [
          {
            "1": -2.99,
            "2": -1.90,
            "3": -0.58,
            "4": -1.61,
            "5": -2.99
          }
        ],
        "text_offset": [32]
      },
      "finish_reason": "stop",
      "index": 0
    }
  ]
}

Note that the API returns log-probabilities (the logarithm of each probability) rather than raw probabilities; this is the standard convention, for reasons we cover later in the article.
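As a sketch, the top_logprobs from a response like the one above can be exponentiated and renormalized into a score distribution, then turned into a continuous score:

```python
import math

# top_logprobs for the score token, taken from the example response above
top_logprobs = {"1": -2.99, "2": -1.90, "3": -0.58, "4": -1.61, "5": -2.99}

# Convert log-probabilities back to probabilities, then renormalize
# over the five score tokens (they may not sum to exactly 1).
probs = {tok: math.exp(lp) for tok, lp in top_logprobs.items()}
total = sum(probs.values())
probs = {tok: p / total for tok, p in probs.items()}

# Probability-weighted continuous score
score = sum(int(tok) * p for tok, p in probs.items())
print(round(score, 2))  # 3.05
```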

Deprecation in Newer Models

Newer models, including GPT-5, have deprecated access to token log-probabilities.
The Chat Completions API for these models no longer supports the parameters logprobs and top_logprobs.

This deprecation prevents evaluators from directly retrieving a model’s token-level confidence distribution.

When direct access to token-level log-probabilities is unavailable, the evaluator can still approximate the underlying probability distribution through sampling, as proposed in the paper:

  1. Generate Multiple Completions: For each evaluation prompt, produce several outputs (for example, n=20) while increasing randomness in generation. This is done by setting:
    - temperature = 1. A higher temperature (close to 1) makes the model sample less confidently, exploring more possible outputs; a lower value (close to 0) makes it deterministic and focused on the most likely token.
    - top_p = 1, which refers to nucleus sampling. This parameter defines how much of the total probability mass to consider when sampling the next token. Setting it to 1 means all tokens are eligible, ensuring maximum diversity in the generated completions.
  2. Count Frequency of Each Score Token: Record how many times each discrete rating (e.g., “1”–“5”) appears across the samples.
  3. Normalize the Counts: This converts raw counts into estimated probabilities.
    For example, if a score of 3 appears 10 times out of 20 samples, then p(3) = 10/20 = 0.5.
  4. Compute the Continuous Score: Combine these normalized probabilities into a weighted average.
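The steps above can be sketched as follows, with `samples` standing in for 20 ratings returned by the judge at temperature=1:

```python
from collections import Counter

# Estimate the continuous score from sampled discrete ratings:
# count each score token, normalize to probabilities, and take
# the probability-weighted average.
def estimate_score(samples: list) -> float:
    counts = Counter(samples)
    return sum(int(tok) * count for tok, count in counts.items()) / len(samples)

# e.g. 20 sampled ratings: ten 3s, seven 4s, three 2s
samples = ["3"] * 10 + ["4"] * 7 + ["2"] * 3
print(estimate_score(samples))  # 3.2
```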

This Monte Carlo-style method reconstructs an empirical approximation of the model’s confidence distribution. It preserves G-Eval’s core principle of confidence-weighted evaluation, allowing fine-grained scoring even when true log-probabilities cannot be retrieved from the API.

Why Use Log-Probabilities

This section is optional and a bit more mathematical; skipping it won’t change your fundamental understanding of G-Eval.

Log-probabilities are used because they are cheaper to compute with and more numerically stable.

Addition is cheaper for computers than multiplication, so computing the probability of a token sequence is easier when you add the log-probabilities of each token instead of multiplying their raw probabilities.

Let’s say we want the probability of the sentence:

“The cat sat”

An LLM doesn’t treat this whole sentence as one atomic event.
Instead, it breaks it into tokens:

t1 = "The", t2 = "cat", t3 = "sat"

Then it asks:

  • What’s the probability of “The”? → P(t1)
  • Given “The”, what’s the probability of “cat” afterwards? → P(t2 | t1)
  • Given “The cat”, what’s the probability of “sat” afterwards? → P(t3 | t1, t2)

The chain rule from probability theory says that the joint probability of several events happening together can be written as a product of conditional probabilities, so:

P("The", "cat", "sat") = P("The") * P("cat" | "The") * P("sat" | "The", "cat")

Since probabilities are small numbers and multiplying many small numbers can lead to underflow, models use log-probabilities.

Log allows models to use addition instead of multiplication due to the following property:

log(a * b) = log(a) + log(b)

So, if we take the example above, we can do the following:

log(P("The", "cat", "sat")) = log(P("The")) + log(P("cat" | "The")) + log(P("sat" | "The", "cat"))

This means instead of multiplying many small numbers, the models can add log values, which are negative decimals (as seen in the model output in the previous paragraph).
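A quick numeric check of this identity, using made-up token probabilities:

```python
import math

# Illustrative (made-up) conditional probabilities for "The", "cat", "sat"
p = [0.8, 0.5, 0.6]

product = p[0] * p[1] * p[2]           # multiply raw probabilities
log_sum = sum(math.log(x) for x in p)  # add log-probabilities

# log of the product equals the sum of the logs
print(math.isclose(math.log(product), log_sum))  # True
```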

LLM-as-a-Judge Limitations and How G-Eval Solves Them

Let’s see how each limitation is addressed:

  1. Inconsistent scoring:

LLMs are inherently non-deterministic. The same response can yield different scores across evaluation runs. This variability stems from random sampling in model generation, temperature settings, and hidden context sensitivity. As a result, evaluations may fluctuate between runs, making it difficult to benchmark model performance accurately or reproduce results with confidence.

How G-Eval solves it: G-Eval uses Auto-CoT to break down evaluations into structured steps. CoT itself was not a novel concept, as it was first introduced by Wei et al. (2022) as a prompting technique to encourage LLMs to engage in intermediate reasoning steps before arriving at final answers.

G-Eval was the first framework to use CoT reasoning for evaluation by prompting the LLM judge to explicitly generate its own step-by-step evaluation process. By breaking a criterion into smaller, well-defined checks, the model assesses outputs through multiple focused sub-criteria instead of one broad judgment.

This structure improves robustness and reduces randomness across runs, while simpler sub-criteria help minimize bias and sharpen accuracy. The result is a more consistent and reproducible evaluation process overall.

2. Lack of Fine-Grained Judgment:

While LLMs can provide broad, high-level assessments like assigning a score from 1 to 5, they often struggle with more detailed, fine-grained evaluations. The same output might receive slightly different ratings depending on phrasing or length, leading to noisy, imprecise scoring. This becomes especially problematic when subtle distinctions such as tone, factual nuance, or reasoning quality are critical.

How G-Eval solves it: G-Eval applies probability normalization, using token-level confidence values to compute a probability-weighted score that captures fine-grained variations in model output quality.

By weighting each judgment based on its log-probability rather than relying on raw model scores, G-Eval minimizes bias and sharpens the model’s ability to distinguish between outputs of similar quality.

3. Verbosity Bias:

LLMs tend to favor verbose answers, often perceiving longer outputs as more informative or higher quality. This bias can distort evaluation results by rewarding unnecessary length and penalizing concise, direct responses. In use cases where clarity and brevity matter, verbosity bias introduces unwanted skew into performance comparisons.

How G-Eval solves it: Because G-Eval is fully customizable, you can define evaluation criteria that penalize verbosity, reward conciseness, or remain neutral, depending on your goals and use case. When these criteria are written clearly and kept simple, they help minimize verbosity bias and other unwanted tendencies in LLM judgments.

4. Narcissistic Bias:

LLMs may show a preference toward outputs that resemble their own generation style or model family, a phenomenon known as self-preference bias. Research indicates that models like GPT-4 or Claude-v1 can favor their own responses by a noticeable margin during comparative evaluations. This bias undermines objectivity, particularly in scenarios involving cross-model benchmarking or self-evaluation.


Narcissistic bias among LLMs (Liu et al.)

This is the only limitation that is not eliminated by G-Eval.

Making G-Eval Production Scale

Back in the day, when teams first started instrumenting web services, everyone built their own dashboards. They tracked latency and error rates by hand until APM tools like Datadog and Sentry came along and made it obvious that no one should be maintaining such tooling on their own.

We’re now at the same inflection point with LLM and agentic systems. Evaluations like G-Eval give you the “unit test” layer. They tell you if a model output is correct, coherent, or safe before deployment. But once your application hits production, you need the observability layer: a way to continuously monitor how your LLM or agent behaves in the wild, trace decisions across multi-step chains, and detect regressions before users do.

At production scale, building observability for LLMs and agents is quite the engineering challenge. You’re orchestrating complex multi-step traces, logging every model call, versioning prompts and datasets, tracking latency and cost, and correlating evaluation results across thousands of runs. You also have to make it secure because you’re often dealing with sensitive data and user content flowing through these systems. Ensuring encryption, access control, and compliance adds another layer of complexity. It’s the same class of challenge that gave rise to entire product categories like Application Performance Monitoring (APM).

That’s exactly why our small but fast-moving team at Progress is building an LLM & Agentic Observability Platform. Although Progress is a big company, our small team is acting like a startup within, moving super-fast and spending most of our time talking with users and iterating rapidly. The Progress Observability Platform combines metrics, traces, evaluations, and analytics all into one cohesive platform so you can monitor and improve your AI systems with the same rigor as conventional software.


Progress Observability Platform

We’re currently in early access, and if you’re building or already running production-grade LLM or agent apps, we’d love to have you try it out. Your feedback will mean the world to us.

Request early access here.
