Evaluating job search ranking with LLM judged NDCG

Job search queries come in different shapes. Some are broad, such as product manager, software engineer, or sales development representative. Some are narrower role or skill queries, such as SQL Server DBA, biomedical engineer, or data platform engineer. Some are rare exact match queries, such as haskell, elm, or ocaml. Others combine role, skill, seniority, domain, or other constraints, such as senior backend engineer fintech python.

We wanted one scoring method that works across these cases. The question we care about is simple: for this query, did the product rank the strongest available matches near the top?

To answer that, we use LLM judged NDCG.

Evaluation case

Each evaluation case consists of one query, one location filter, and one frozen snapshot of eligible jobs. We call that snapshot the frozen eligible corpus. The location filter defines the set of jobs that could have been shown. Once that set is fixed, the evaluation is about ranking quality.

For each case, the evaluator runs the same steps. It loads every eligible job from the frozen corpus and sends each job to an LLM judge with compact evidence from the posting. The judge returns a 0 to 100 relevance score. The evaluator then sorts all judged jobs by score to create the ideal ranking under the judge, and compares the product's top results against that ideal ranking with NDCG.
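The judging and sorting steps can be sketched in a few lines. This is a minimal illustration, not the evaluator itself: build_ideal_ranking, toy_judge, and the three sample jobs are hypothetical names and data, and the keyword-overlap judge only stands in for the real LLM call.

```python
from typing import Callable, Dict, List, Tuple


def build_ideal_ranking(
    eligible_jobs: Dict[str, str],
    judge: Callable[[str, str], float],
    query: str,
) -> List[Tuple[str, float]]:
    """Score every eligible job, then sort by judge score, highest first."""
    scored = [(job_id, judge(query, evidence)) for job_id, evidence in eligible_jobs.items()]
    # Break ties by job id so the ideal ranking is reproducible across runs.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


def toy_judge(query: str, evidence: str) -> float:
    """Stand-in for the LLM judge (assumption): share of query terms found, scaled to 0-100."""
    terms = query.lower().split()
    hits = sum(1 for term in terms if term in evidence.lower())
    return 100.0 * hits / len(terms)


# Hypothetical frozen corpus: job id -> compact evidence from the posting.
corpus = {
    "job-a": "Senior backend engineer, Python, payments",
    "job-b": "Frontend engineer, React",
    "job-c": "Backend engineer, Go",
}

ideal = build_ideal_ranking(corpus, toy_judge, "backend python")
print(ideal)
```

Passing the judge in as a function keeps the sorting logic independent of which model or prompt produces the scores.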

Eligibility constraints, such as location and active state filters, are handled before judging. The judge scores only how well a job matches the intent of the query. For example, for the query k8s with the location filter set to San Francisco, the evaluator scores the eligible corpus, identifies the strongest Kubernetes related matches under the judge, and checks whether the product ranked them near the top.

Why NDCG

Precision and recall force an extra decision: what score counts as relevant? In this setup, that means choosing a cutoff such as 70 or 90, or constructing some other target set for each query.

That loses information because the judge already gives us a graded 0 to 100 score. If one page has scores 95, 94, 93, 92, 91 and another has 75, 74, 73, 72, 71, both pages get precision@5 = 1.0 with a cutoff of 70. Those pages are clearly different.
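The two pages above make a quick check. The sketch below uses a hypothetical precision_at_k helper to show that a cutoff of 70 scores both pages identically, while the raw judge scores differ by 20 points on average.

```python
from typing import Sequence


def precision_at_k(scores: Sequence[float], k: int, cutoff: float) -> float:
    """Fraction of the top k results whose judge score meets the cutoff."""
    return sum(1 for score in scores[:k] if score >= cutoff) / k


strong_page = [95, 94, 93, 92, 91]
weak_page = [75, 74, 73, 72, 71]

p_strong = precision_at_k(strong_page, k=5, cutoff=70)
p_weak = precision_at_k(weak_page, k=5, cutoff=70)

mean_strong = sum(strong_page) / len(strong_page)
mean_weak = sum(weak_page) / len(weak_page)

print(p_strong, p_weak)        # both pages look identical under the cutoff
print(mean_strong, mean_weak)  # the graded scores tell them apart
```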

NDCG preserves the score differences and the ordering. It uses the full graded score to construct the ideal ranking under the judge, then compares the product's ranked page against that ideal. The scoring question becomes: how much of the ideal top k score did the product capture, with earlier ranks weighted more heavily?

Why keep the 0 to 100 score

Coarse relevance buckets lose too much information for this use case. A 0 to 3 label scheme may be enough when there are only a few highly relevant results. Job search often has hundreds of strong matches. If every excellent job collapses into the same bucket, the metric cannot distinguish weaker excellent matches from stronger excellent matches.

The range that matters most for us is often around 75 to 85. Those jobs are usually plausible matches, but not all plausible matches are equally good. If that whole range collapses into one "good" bucket, the metric loses the ability to tell whether the product put the stronger good matches ahead of the weaker ones. With linear gain NDCG over the full 0 to 100 score, that ordering still matters.
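A small sketch of that effect, under assumed bucket thresholds. The to_bucket function and its cutoffs are illustrative only, and the toy ndcg normalizes against the same three jobs rather than a full corpus.

```python
import math
from typing import List


def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain with a log2(rank + 1) discount, ranks starting at 1."""
    return sum(gain / math.log2(rank + 1) for rank, gain in enumerate(gains, start=1))


def ndcg(gains: List[float]) -> float:
    """NDCG of a page against the best ordering of the same gains."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def to_bucket(score: float) -> int:
    """Hypothetical 0 to 3 label scheme; the thresholds are assumptions."""
    if score >= 90:
        return 3
    if score >= 70:
        return 2
    if score >= 40:
        return 1
    return 0


best_first = [85, 80, 75]
worst_first = [75, 80, 85]

# Every score lands in bucket 2, so bucketed NDCG cannot see the reversal.
bucketed_best = ndcg([to_bucket(s) for s in best_first])
bucketed_worst = ndcg([to_bucket(s) for s in worst_first])

# Linear gain over the raw scores penalizes the reversed page.
linear_best = ndcg(best_first)
linear_worst = ndcg(worst_first)

print(bucketed_best, bucketed_worst, linear_best, linear_worst)
```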

We do not treat every one point difference as exact truth. The score is a judge signal, not a physical measurement. But preserving the full score gives the evaluator resolution inside the large set of good and excellent matches.

Why linear gain

NDCG needs a gain function. A common formulation uses exponential gain, gain(rel) = 2^rel - 1. That works when relevance labels are small ordinal values like 0, 1, 2, 3.

It does not work directly with 0 to 100 scores. Each extra point roughly doubles the gain, so the top end would dominate the metric unless we first bucketed or rescaled the scores, which would undo the reason we kept the 0 to 100 signal.
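To put numbers on that, the sketch below compares how much more a 95 is worth than a 90 under each gain function. The function names are illustrative.

```python
def exponential_gain(score: int) -> int:
    """The common 2^rel - 1 gain, applied directly to a 0 to 100 score."""
    return 2 ** score - 1


def linear_gain(score: int) -> int:
    """Linear gain: the score is the gain."""
    return score


# How many times more gain does a 95 carry than a 90?
exp_ratio = exponential_gain(95) / exponential_gain(90)
linear_ratio = linear_gain(95) / linear_gain(90)

print(round(exp_ratio, 2))     # about 32: five points apart, five doublings
print(round(linear_ratio, 3))  # a little over 1.05
```

Under exponential gain, a five point gap inside the excellent range would swamp everything else on the page, which is far more confidence than the judge scores can support.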

So we use linear gain: gain(score) = score. A score of 100 contributes twice as much raw gain as a score of 50, and a 95 contributes slightly more than a 90.

That tradeoff is intentional. We want the metric to preserve ordering inside strong matches without making small score differences explode. Linear gain also keeps the metric inspectable: the NDCG result is tied directly to the judge scores shown in the report.

The important caveat is judge reliability. If the judge cannot consistently distinguish 91 from 93, then the metric should not be read as proving that one job is meaningfully better than the other. The useful signal is usually at the ranking and page level: did the product put stronger scored jobs near the top, and did it miss jobs that the judge scored much higher?

Metric

We use linear gain NDCG over the LLM's 0 to 100 relevance scores.

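DCG@k = sum for i = 1..k of score_i / log2(i + 1)
NDCG@k = DCG@k(product_top_k) / DCG@k(ideal_top_k)

Here i is the 1-based rank on the page and score_i is the judge score of the job at that rank.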

The denominator is the maximum possible discounted score for the top k page, using the highest scoring jobs from the frozen judged corpus. So NDCG@10 asks how close the product's top 10 was to the highest scoring top 10 under the judge.

A score of 1.0 means the product's top k captured the full ideal discounted score, which happens when it shows the highest scoring jobs in score order. A lower score means higher scoring jobs were missing or ranked too low. We report NDCG@10 and NDCG@26. The first measures the top of the page. The second matches the larger result page shown in the product.
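A minimal sketch of the metric as defined above. The function names and the sample scores are assumptions for illustration. The one structural point it tries to show is that the denominator is built from the whole judged corpus, not from the jobs the product happened to return.

```python
import math
from typing import Sequence


def dcg_at_k(scores: Sequence[float], k: int) -> float:
    """Linear gain DCG: judge score divided by log2(rank + 1), ranks starting at 1."""
    return sum(score / math.log2(rank + 1) for rank, score in enumerate(scores[:k], start=1))


def ndcg_at_k(product_scores: Sequence[float], corpus_scores: Sequence[float], k: int) -> float:
    """Product DCG@k over the DCG@k of the k highest scoring jobs in the judged corpus."""
    ideal = dcg_at_k(sorted(corpus_scores, reverse=True), k)
    if ideal == 0:
        return 0.0  # no job in the corpus scored above zero
    return dcg_at_k(product_scores, k) / ideal


# Hypothetical judge scores for every job in a tiny frozen corpus.
corpus_scores = [95, 90, 80, 60, 30]

# Judge scores of the jobs the product showed, in the order it ranked them.
product_scores = [90, 95, 60]

page_ndcg = ndcg_at_k(product_scores, corpus_scores, k=3)
print(round(page_ndcg, 4))
```

The sample page loses score twice: the top two jobs are swapped, and the 80 point job is missing in favor of a 60 point job.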

Corpus strength

Because NDCG is normalized by the ideal top k, it measures ranking quality relative to the available corpus. The raw ideal scores tell us how strong that corpus was for the query.

If the ideal top 10 scores are 45, 44, 43, 42, 41, 40, 39, 38, 37, 36 and the product returns exactly that order, NDCG@10 is 1.0. That score says the product matched the highest scoring page available under the judge. The raw scores say the available matches were weak.
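The same example in code, as a sketch. The mean of the ideal scores is used here as one simple way to summarize corpus strength, and that choice is an assumption.

```python
import math
from typing import List


def dcg(scores: List[float]) -> float:
    """Linear gain DCG with a log2(rank + 1) discount, ranks starting at 1."""
    return sum(score / math.log2(rank + 1) for rank, score in enumerate(scores, start=1))


ideal_top_10 = [45, 44, 43, 42, 41, 40, 39, 38, 37, 36]
product_top_10 = list(ideal_top_10)  # the product returned exactly the ideal order

ndcg_at_10 = dcg(product_top_10) / dcg(ideal_top_10)
mean_ideal_score = sum(ideal_top_10) / len(ideal_top_10)

print(ndcg_at_10)        # ranking quality: perfect relative to what was available
print(mean_ideal_score)  # corpus strength: the best available matches were weak
```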

So the report keeps both views: ranking quality, meaning whether the product ranked the strongest available jobs well, and corpus strength, meaning whether this evaluation case contained strong jobs in the first place.

Judge reliability

The metric depends on the judge. The judge does not need to be perfect, but it does need to be stable enough for the score to be useful. If the same job moves from 92 to 78 across runs with the same evidence, the ranking metric is measuring judge noise as much as product quality.

So the evaluation keeps the judge auditable: prompt version, model version, query, job id, evidence payload, judge score, judge reason, and frozen corpus identifier.
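One way to hold those fields together is a single immutable record per judgment. The sketch below is an assumption about shape, not the stored format: the class name, field names, and every sample value are placeholders.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class JudgmentRecord:
    """One judged (query, job) pair with everything needed to audit the score."""
    prompt_version: str
    model_version: str
    query: str
    job_id: str
    evidence_payload: str
    judge_score: float
    judge_reason: str
    corpus_id: str


# All values below are placeholders for illustration.
record = JudgmentRecord(
    prompt_version="prompt-v1",
    model_version="judge-model-v1",
    query="k8s",
    job_id="job-123",
    evidence_payload="Platform engineer, Kubernetes, Terraform",
    judge_score=88.0,
    judge_reason="Role centers on running Kubernetes clusters.",
    corpus_id="corpus-snapshot-1",
)

# One JSON line per judgment makes it easy to diff runs and sample for review.
line = json.dumps(asdict(record), sort_keys=True)
print(line)
```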

The practical checks are whether repeated runs produce similar score distributions, whether high scoring jobs remain high scoring under small prompt changes, whether obvious misses show up as large score gaps instead of one point differences, and whether sampled judgments look reasonable to a human reviewer.
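The first of those checks can be approximated with a per-job spread across repeated runs. This is a sketch under assumptions: the function names are hypothetical, the sample scores are made up, and the 10 point threshold is an arbitrary choice.

```python
from typing import Dict, List


def score_spread(runs: List[Dict[str, float]]) -> Dict[str, float]:
    """Max minus min judge score per job, over jobs that appear in every run."""
    shared = set(runs[0]).intersection(*runs[1:])
    return {
        job_id: max(run[job_id] for run in runs) - min(run[job_id] for run in runs)
        for job_id in sorted(shared)
    }


def unstable_jobs(runs: List[Dict[str, float]], max_spread: float = 10.0) -> List[str]:
    """Jobs whose score moved by more than max_spread points between runs."""
    return [job_id for job_id, spread in score_spread(runs).items() if spread > max_spread]


# Two hypothetical runs of the judge over the same evidence.
run_1 = {"job-a": 92, "job-b": 85, "job-c": 40}
run_2 = {"job-a": 78, "job-b": 87, "job-c": 41}

flagged = unstable_jobs([run_1, run_2])
print(flagged)
```

A job that swings from 92 to 78 is flagged, while one or two point movements pass, which matches the idea that only large gaps should be treated as signal.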

The goal is not to claim that a 94 is objectively better than a 93. The goal is to make the ranking comparison stable enough that large misses and bad ordering decisions are visible.

Report contents

The headline metrics are NDCG@10 and NDCG@26. The report also keeps the artifacts needed to interpret the score: product top k jobs, product top k scores, ideal top k jobs under the judge, ideal top k scores, missed high scoring jobs, judge reasons, prompt version, model version, and frozen corpus identifier.
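The missed high scoring jobs can be derived from the other artifacts. The sketch below assumes one possible definition: a job is missed if it is in the ideal top k under the judge but absent from the product's top k. The function name and sample data are hypothetical.

```python
from typing import Dict, List, Tuple


def missed_high_scoring(
    judged: Dict[str, float],
    product_top_k: List[str],
    k: int,
) -> List[Tuple[str, float]]:
    """Jobs in the ideal top k that the product did not show in its top k."""
    # Sort by score descending, breaking ties by job id for a stable result.
    ideal_top_k = sorted(judged.items(), key=lambda pair: (-pair[1], pair[0]))[:k]
    shown = set(product_top_k[:k])
    return [(job_id, score) for job_id, score in ideal_top_k if job_id not in shown]


# Hypothetical judge scores for the frozen corpus, and the product's ranked page.
judged = {"job-a": 96, "job-b": 91, "job-c": 88, "job-d": 62, "job-e": 55}
product_page = ["job-b", "job-d", "job-a"]

missed = missed_high_scoring(judged, product_page, k=3)
print(missed)
```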

The raw scores provide context for interpreting NDCG. They are not separate headline metrics.

Recent baseline

On saved San Francisco traces from May 18, 2026, the evaluator covered 11 cases and 286,341 judged jobs. The aggregate linear gain NDCG results were 0.8799 for mean NDCG@10 and 0.8510 for mean NDCG@26.

The aggregate is useful as a regression signal, not as a final verdict. The more useful view is case level movement as the hybrid search algorithm evolves: which query improved, which query regressed, which high scoring jobs were missed, and whether the misses came from eligibility, candidate generation, or ranking.
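Case level movement is straightforward to compute once per-case NDCG is stored for two runs. In the sketch below, case_deltas is a hypothetical helper and the NDCG values are placeholders chosen for illustration, not measured results.

```python
from statistics import mean
from typing import Dict, List, Tuple


def case_deltas(baseline: Dict[str, float], candidate: Dict[str, float]) -> List[Tuple[str, float]]:
    """Per-query NDCG change for queries in both runs, largest regression first."""
    shared = sorted(set(baseline) & set(candidate))
    deltas = [(query, candidate[query] - baseline[query]) for query in shared]
    return sorted(deltas, key=lambda pair: pair[1])


# Placeholder NDCG@10 values per query for two versions of the ranker.
baseline = {"haskell": 0.90, "product manager": 0.80, "SQL Server DBA": 0.85}
candidate = {"haskell": 0.95, "product manager": 0.70, "SQL Server DBA": 0.85}

deltas = case_deltas(baseline, candidate)
regressed = [query for query, delta in deltas if delta < 0]
mean_shift = mean(candidate.values()) - mean(baseline.values())

print(deltas)
print(regressed)
print(round(mean_shift, 4))
```

In this toy data the mean moves only slightly, while the per-case view shows one query improving and another regressing by twice as much.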

The interesting cases are usually the ones where the product missed jobs that the judge scored much higher than the jobs shown. That is where NDCG turns into a debugging tool rather than just a score.

Takeaway

The evaluation asks: for this query, how much of the ideal top k score under the judge did the product capture?

The denominator comes from the frozen eligible corpus. The ideal ranking comes from sorting the 0 to 100 LLM judge scores over that corpus. The metric is linear gain NDCG against that ideal ranking.

That gives us one scoring framework for broad queries, narrower role or skill queries, rare exact match queries, and mixed intent queries, while keeping corpus strength and ranking quality separate.