LLM Evals Are Just Tests. Why Are We Making This So Complicated?

I watched our AI completely melt down during a CTO demo last month.

The model started confidently mixing up “deep research” (an existing feature) with “thematic analysis scans” (the new feature I’d just shipped). Every response made the confusion worse. The CTO’s face went from interested to puzzled to concerned. I sat there watching my careful prompt engineering unravel in real-time.

After the demo, the CTO asked the question I dreaded: “How can we make sure this doesn’t happen again?”

I couldn’t promise anything. I was vibe-checking my system prompts, shipping when they “felt right.” That’s when it hit me—I already know how to prevent regressions. I do it every day with unit tests.

The Revelation That Changed Everything

Think about this: You don’t use a third-party platform for unit tests. Why would you need one for LLM evals?

This question haunted me after that demo disaster. I started evaluating all the LLM observability platforms—Helicone, LangSmith, Logfire, LangFuse, Lunary. They’re all solving problems I didn’t actually have. I already had vitest running my tests. I already had GitHub Actions for CI. I already had everything I needed.

The only difference between testing calculateTotal() and testing generateResponse() is that one returns numbers and the other returns text. Both are just functions that need verification.

Why This Actually Works

Here’s what I’ve learned after hundreds of prompt changes:

  1. **Developers write evals when it’s just “add another test file.”** No new tools to learn, no special access to request. It’s just npm test.
  2. **Regressions get caught when metrics are in PR comments.** That red arrow is impossible to ignore during code review.
  3. **The system improves when developers own it.** Every developer can add evals for their features. It’s just tests.

I haven’t had a prompt regression reach production since April.

Starting Simple (Week 1)

My first eval was embarrassingly basic:

it("detects when themes haven't been added", async () => {
  const response = await generateAIResponse(context)
  const eval = await evaluatePendingCommandAwareness(response)
  expect(eval.score).toBeGreaterThan(0.85)
})

It failed immediately. Of course it did—LLMs aren’t deterministic.

The Non-Determinism Problem (Week 2)

Here’s what actually works:

const results = await Promise.all(
  Array.from({ length: 10 }, async () => {
    const response = await generateAIResponse(context)
    const evaluation = await evaluatePendingCommandAwareness(response)

    return evaluation.hasGuidanceText &&
           !evaluation.hasStartResearchCommand &&
           evaluation.mentionsPendingThemes &&
           evaluation.offersNextSteps
  })
)

const successRate = results.filter(r => r).length / results.length
expect(successRate).toBeGreaterThan(0.85)

Run it with npm test. That’s it. No special infrastructure needed.
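
Once you have two or three of these, the repeated loop is worth pulling into a tiny helper. Here's a minimal sketch, assuming vitest and reusing the same generateAIResponse, evaluatePendingCommandAwareness, and context from the test above; the runEval name and the 120-second timeout are my own choices, not from the original setup. The timeout is the part that bites: ten model calls plus ten judge calls blow well past vitest's 5-second default.

import { it, expect } from "vitest"

// Run one LLM check `times` times in parallel and return the pass rate.
async function runEval(times: number, check: () => Promise<boolean>): Promise<number> {
  const results = await Promise.all(Array.from({ length: times }, () => check()))
  return results.filter(Boolean).length / results.length
}

it("stays aware of pending themes", async () => {
  const successRate = await runEval(10, async () => {
    const response = await generateAIResponse(context)
    const evaluation = await evaluatePendingCommandAwareness(response)
    return evaluation.mentionsPendingThemes && !evaluation.hasStartResearchCommand
  })

  expect(successRate).toBeGreaterThan(0.85)
}, 120_000) // ten eval runs won't fit in vitest's default 5s test timeout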

Making It Visible (Week 3)

Developers started complaining: “The eval failed but I can’t see why without digging through CI logs.”

Fair point. So I made tests write simple JSON files:

await writeScorecard("pending-command-awareness", {
  title: "Pending Command Awareness",
  metrics: [{
    name: "Success Rate",
    value: successRate * 100,
    threshold: 85,
    unit: "%",
    passed: successRate > 0.85
  }],
  overallPassed: successRate > 0.85,
  timestamp: new Date().toISOString()
})
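
The writeScorecard helper itself is nothing special. Here's a minimal sketch of what those 39 lines could look like; the eval-results/ directory name and the exported types are my assumptions, not the original implementation:

import { mkdir, writeFile } from "node:fs/promises"
import path from "node:path"

export interface ScorecardMetric {
  name: string
  value: number
  threshold: number
  unit: string
  passed: boolean
}

export interface Scorecard {
  title: string
  metrics: ScorecardMetric[]
  overallPassed: boolean
  timestamp: string
}

// One JSON file per eval suite, so CI can upload the whole directory
// as a single artifact.
export async function writeScorecard(slug: string, scorecard: Scorecard): Promise<void> {
  const dir = path.join(process.cwd(), "eval-results")
  await mkdir(dir, { recursive: true })
  await writeFile(path.join(dir, `${slug}.json`), JSON.stringify(scorecard, null, 2))
}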

Then added a GitHub Action to post them as PR comments. Here’s what every PR shows now:

<!-- EVALS2_SCORECARD -->
<details>
<summary><b>🟢 Eval Scorecards: 10/10 passed</b></summary>

Test Suite / Metric                                      Current  Main      Δ      ✓
────────────────────────────────────────────────────────────────────────────────
Available Actions
  Start Research After Execution                           100%      90%    +10%  ✅ 🟢
  Start Research Includes Leaf Themes                      100%     100%      0%  ✅ ⚪
  Start Research Excludes Parent Themes                    100%     100%      0%  ✅ ⚪
  Start Research Uses Current Theme Context                 96%      96%     +1%  ✅ 🟢
  Start Research Avoids Find Stocks Phrase                 100%     100%      0%  ✅ ⚪
  Start Research Generates Valid Command                   100%     100%      0%  ✅ ⚪
Command Generation
  First response accuracy (command -> command)             100%     100%      0%  ✅ ⚪
  Second response accuracy (TL;DR -> analysis)             100%     100%      0%  ✅ ⚪
  Overall accuracy (both correct)                          100%     100%      0%  ✅ ⚪
Context Awareness
  Pending Command Awareness                                100%     100%      0%  ✅ ⚪
  User Override                                            100%     100%      0%  ✅ ⚪
Continuation Responses
  AddTheme Continuation                                    100%     100%      0%  ✅ ⚪
  StartResearch Continuation                               100%     100%      0%  ✅ ⚪
  Research After Theme                                     100%     100%      0%  ✅ ⚪
First Interaction
  Success rate                                             100%     100%      0%  ✅ ⚪
StartResearch Guard Rails
  Avoids False Analysis Claims                             100%     100%      0%  ✅ ⚪
  Explains Thematic Matching Only                          100%     100%      0%  ✅ ⚪
  No Targeted Analysis Promises                            100%     100%      0%  ✅ ⚪
  Provides Honest Limitations                              100%     100%      0%  ✅ ⚪
Stock Defense Quality
  Information density                                       88%      74%    +14%  ✅ 🟢
  Institutional credibility                                 95%      95%     +0%  ✅ 🟢
  Content quality                                           97%      96%     +1%  ✅ 🟢
  Evidence prioritization                                   89%      84%     +5%  ✅ 🟢
  Protocol URL Correctness                                 100%     100%      0%  ✅ ⚪
  Protocol URL Usage                                       100%     100%      0%  ✅ ⚪
Theme Analysis Quality
  Overall quality score                                    100%      73%    +27%  ✅ 🟢
  Human-friendly quality                                    63%      58%     +5%  ✅ 🟢
  Number translation judgment                               96%      73%    +23%  ✅ 🟢
  References analysis                                      100%      78%    +22%  ✅ 🟢
Theme Generation Quality
  Overall quality score                                     91%      96%     -5%  ✅ 🔴
  Reasoning quality score                                   91%      95%     -4%  ✅ 🔴
  Variance score (avoiding rigid counts)                    13%      10%     +3%  ✅ 🟢
────────────────────────────────────────────────────────────────────────────────
Updated: 2025-08-09T04:40:28.297Z

</details>

10 test suites. 40+ metrics. Every PR. Those red arrows catch regressions before they ship.

The Game Changer: Time Series Comparisons (Week 4)

“What was the score last week?” someone asked.

I discovered GitHub artifacts are basically a free time-series database with 30-day retention:

- name: Download Main Branch Scorecards
  uses: dawidd6/action-download-artifact@v3
  with:
    workflow: evals2-tests.yml
    branch: main
    name: eval-scorecards
    path: scorecards-main/

Now every PR automatically compares against the main branch. Look at the scorecard above: the red indicators showing “Theme Generation Quality” dropping by 5%? Catching exactly that kind of regression in a PR comment is what would have saved me from the CTO demo disaster.
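
The per-metric rendering behind those columns is small. A sketch of the idea, not the actual converter code (renderDelta is a name I made up): round the numbers for display, but colour the dot off the raw delta.

// Render the Current / Main / Δ / trend columns for one metric.
// The dot uses the unrounded delta (and treats higher as better), which
// would explain rows like "95%  95%  +0%  🟢": a fractional gain that
// rounds away for display.
export function renderDelta(current: number, baseline: number | undefined, unit = "%"): string {
  if (baseline === undefined) return `${Math.round(current)}${unit}  (new)`

  const delta = current - baseline
  const dot = delta > 0 ? "🟢" : delta < 0 ? "🔴" : "⚪"
  const sign = delta > 0 ? "+" : ""

  return [
    `${Math.round(current)}${unit}`.padStart(6),
    `${Math.round(baseline)}${unit}`.padStart(6),
    `${sign}${Math.round(delta)}${unit}`.padStart(6),
    dot,
  ].join("  ")
}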

What I Actually Built

After many months of iteration, here’s my entire “eval platform”:

  • Test files next to my code (like any other tests)
  • A 39-line scorecard writer
  • A 204-line markdown converter
  • A 128-line GitHub Action

No new dashboards. No new logins. No vendor lock-in.
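
For concreteness, here's roughly the shape the 204-line markdown converter takes. This is a compressed sketch under my own assumptions (the eval-results/ and scorecards-main/ directory names and the helper import paths come from the earlier sketches, not the real code): load both sets of scorecards, match metrics by name, and emit the collapsible comment with the EVALS2_SCORECARD marker, presumably so the Action can find and update its own comment instead of posting a new one on every push.

import { readdir, readFile } from "node:fs/promises"
import path from "node:path"
// Hypothetical local paths for the helpers sketched earlier in this post:
import { renderDelta } from "./render-delta"
import type { Scorecard } from "./write-scorecard"

// Load every *.json scorecard in a directory, keyed by file slug.
async function loadScorecards(dir: string): Promise<Map<string, Scorecard>> {
  const cards = new Map<string, Scorecard>()
  for (const file of await readdir(dir).catch(() => [] as string[])) {
    if (!file.endsWith(".json")) continue
    const raw = await readFile(path.join(dir, file), "utf8")
    cards.set(file.replace(/\.json$/, ""), JSON.parse(raw) as Scorecard)
  }
  return cards
}

// Build the collapsible PR comment body.
export async function buildComment(): Promise<string> {
  const current = await loadScorecards("eval-results")
  const main = await loadScorecards("scorecards-main")

  const rows: string[] = []
  let passed = 0
  for (const [slug, card] of current) {
    if (card.overallPassed) passed++
    rows.push(card.title)
    for (const metric of card.metrics) {
      const baseline = main.get(slug)?.metrics.find(m => m.name === metric.name)
      rows.push(`  ${metric.name.padEnd(55)} ${renderDelta(metric.value, baseline?.value, metric.unit)} ${metric.passed ? "✅" : "❌"}`)
    }
  }

  return [
    "<!-- EVALS2_SCORECARD -->",
    "<details>",
    `<summary><b>${passed === current.size ? "🟢" : "🔴"} Eval Scorecards: ${passed}/${current.size} passed</b></summary>`,
    "",
    "```",
    ...rows,
    "```",
    "",
    `Updated: ${new Date().toISOString()}`,
    "</details>",
  ].join("\n")
}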

The Practical Bits That Matter

LLM-as-judge pattern:

// Assuming the Vercel AI SDK (`ai`) and zod, based on the call shape;
// `googleAI` is a configured Google provider instance (e.g. via @ai-sdk/google).
import { generateObject } from "ai"
import { z } from "zod"

const { object } = await generateObject({
  model: googleAI("gemini-2.5-pro"),
  schema: z.object({
    hasGuidanceText: z.boolean(),
    mentionsPendingThemes: z.boolean(),
    // ... other criteria
  }),
  prompt: `Evaluate this response: "${response}"...`
})

Performance tracking:

const start = performance.now()
const result = await fn()
const duration = performance.now() - start
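
If latency matters, the duration can go into the PR comment as just another scorecard metric. A sketch only; the 8-second budget below is illustrative, not a number from my setup:

const start = performance.now()
const response = await generateAIResponse(context)
const duration = performance.now() - start

await writeScorecard("response-latency", {
  title: "Response Latency",
  metrics: [{
    name: "Single-call duration",
    value: Math.round(duration),
    threshold: 8000,          // illustrative budget in ms; tune to your model and prompt
    unit: "ms",
    passed: duration < 8000,
  }],
  overallPassed: duration < 8000,
  timestamp: new Date().toISOString(),
})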

Dataset management: Test fixtures in your repo. You already do this.

Prompt versioning: Database configs. A few days of work, not a platform purchase.
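
“Database configs” can be as small as one versioned row per prompt plus a lookup. A sketch of the idea only; the table name, columns, and db interface are all invented for illustration:

// Invented shape for illustration: one row per named prompt, versioned
// explicitly so an eval run and production can pin the same prompt text.
export interface PromptConfig {
  name: string        // e.g. "thematic-analysis-system"
  version: number
  content: string
  updatedAt: string
}

interface Db {
  query: (sql: string, params: unknown[]) => Promise<PromptConfig[]>
}

// Fetch a specific version, or the latest if none is given.
export async function getPrompt(db: Db, name: string, version?: number): Promise<PromptConfig> {
  const rows = version === undefined
    ? await db.query("SELECT * FROM prompt_configs WHERE name = $1 ORDER BY version DESC LIMIT 1", [name])
    : await db.query("SELECT * FROM prompt_configs WHERE name = $1 AND version = $2", [name, version])
  if (rows.length === 0) throw new Error(`No prompt config found for ${name}@${version ?? "latest"}`)
  return rows[0]
}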

Start Tomorrow Morning

Pick one LLM behavior that matters to you. Write a test for it—just a regular test in your existing test suite. Run it 10 times. Check the success rate.

That’s it. You’ve started.

The infrastructure will evolve naturally as you need it. You’ll add scorecards when you want visibility. You’ll add comparisons when you want trends. But start with just a test.

The Uncomfortable Truth

We’ve been overthinking this. The infrastructure you need for great AI products already exists in your codebase. It’s the same infrastructure you use for all your other code.

You don’t need to buy a platform to solve a problem you can test your way out of.

P.S. - Yes, there are valid use cases for specialized eval platforms. But you won’t know what you actually need until you’ve run your own evals for a while. Start with tests. When you hit a real limitation, solve that specific problem. You probably won’t hit as many as you think.