I watched our AI completely melt down during a CTO demo last month.
The model started confidently mixing up “deep research” (an existing feature) with “thematic analysis scans” (the new feature I’d just shipped). Every response made the confusion worse. The CTO’s face went from interested to puzzled to concerned. I sat there watching my careful prompt engineering unravel in real-time.
After the demo, the CTO asked the question I dreaded: “How can we make sure this doesn’t happen again?”
I couldn’t promise anything. I was vibe-checking my system prompts, shipping when they “felt right.” That’s when it hit me—I already know how to prevent regressions. I do it every day with unit tests.
The Revelation That Changed Everything
Think about this: You don’t use a third-party platform for unit tests. Why would you need one for LLM evals?
This question haunted me after that demo disaster. I started evaluating all the LLM observability platforms—Helicone, LangSmith, Logfire, LangFuse, Lunary. They’re all solving problems I didn’t actually have. I already had vitest running my tests. I already had GitHub Actions for CI. I already had everything I needed.
The only difference between testing calculateTotal() and testing generateResponse() is that one returns numbers and the other returns text. Both are just functions that need verification.
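To make that concrete, here's a rough sketch of the two kinds of test sitting in the same vitest file. The module paths and the judgeRelevance helper are placeholders, not real APIs from my codebase; the point is that the harness is identical.

import { describe, it, expect } from "vitest"
import { calculateTotal } from "./cart"          // placeholder path
import { generateAIResponse } from "./ai"        // placeholder path
import { judgeRelevance } from "./evals/judges"  // placeholder LLM-as-judge helper

describe("same harness, different functions", () => {
  // Deterministic function: assert on the exact value.
  it("calculateTotal sums line items", () => {
    expect(calculateTotal([{ price: 2 }, { price: 3 }])).toBe(5)
  })

  // LLM function: assert on a judged property of the text instead.
  it("generateResponse stays on topic", async () => {
    const response = await generateAIResponse("What does the thematic scan do?")
    const verdict = await judgeRelevance(response)
    expect(verdict.score).toBeGreaterThan(0.8)
  })
})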
Why This Actually Works
Here’s what I’ve learned after hundreds of prompt changes:
- **Developers write evals when it's just "add another test file."** No new tools to learn, no special access to request. It's just npm test.
- **Regressions get caught when metrics are in PR comments.** That red arrow is impossible to ignore during code review.
- **The system improves when developers own it.** Every developer can add evals for their features. It's just tests.
I haven’t had a prompt regression reach production since April.
Starting Simple (Week 1)
My first eval was embarrassingly basic:
it("detects when themes haven't been added", async () => {
const response = await generateAIResponse(context)
const eval = await evaluatePendingCommandAwareness(response)
expect(eval.score).toBeGreaterThan(0.85)
})
It failed immediately. Of course it did—LLMs aren’t deterministic.
The Non-Determinism Problem (Week 2)
Here’s what actually works:
const results = await Promise.all(
  Array.from({ length: 10 }, async () => {
    const response = await generateAIResponse(context)
    const result = await evaluatePendingCommandAwareness(response)
    return result.hasGuidanceText &&
      !result.hasStartResearchCommand &&
      result.mentionsPendingThemes &&
      result.offersNextSteps
  })
)
const successRate = results.filter(r => r).length / results.length
expect(successRate).toBeGreaterThan(0.85)
Run it with npm test. That’s it. No special infrastructure needed.
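One caveat worth planning for: LLM calls are slow, and vitest kills a test after five seconds by default, so a test that makes ten calls needs a longer timeout. A minimal config sketch (the file name, glob, and numbers are conventions I'd suggest, not requirements):

// vitest.evals.config.ts
import { defineConfig } from "vitest/config"

export default defineConfig({
  test: {
    include: ["src/**/*.eval.test.ts"], // keep eval files separate from fast unit tests
    testTimeout: 120_000,               // ten sequential LLM calls can take a while
    maxConcurrency: 2,                  // stay under provider rate limits
  },
})

A package.json script like "test:evals": "vitest run --config vitest.evals.config.ts" keeps it one command away, and CI just runs both scripts.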
Making It Visible (Week 3)
Developers started complaining: “The eval failed but I can’t see why without digging through CI logs.”
Fair point. So I made tests write simple JSON files:
await writeScorecard("pending-command-awareness", {
  title: "Pending Command Awareness",
  metrics: [{
    name: "Success Rate",
    value: successRate * 100,
    threshold: 85,
    unit: "%",
    passed: successRate > 0.85
  }],
  overallPassed: successRate > 0.85,
  timestamp: new Date().toISOString()
})
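writeScorecard itself is the 39-line helper mentioned later in this post. A stripped-down sketch of what such a writer looks like, assuming a scorecards/ output directory (the types and path are my own conventions, nothing standardized):

// scorecard.ts: a rough sketch of the writer, not the exact implementation
import { mkdir, writeFile } from "node:fs/promises"
import path from "node:path"

interface Metric {
  name: string
  value: number
  threshold: number
  unit: string
  passed: boolean
}

interface Scorecard {
  title: string
  metrics: Metric[]
  overallPassed: boolean
  timestamp: string
}

const OUTPUT_DIR = process.env.SCORECARD_DIR ?? "scorecards" // assumed location

export async function writeScorecard(id: string, scorecard: Scorecard) {
  await mkdir(OUTPUT_DIR, { recursive: true })
  await writeFile(path.join(OUTPUT_DIR, `${id}.json`), JSON.stringify(scorecard, null, 2))
}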
Then added a GitHub Action to post them as PR comments. Here’s what every PR shows now:
<!-- EVALS2_SCORECARD -->
<details>
<summary><b>🟢 Eval Scorecards: 10/10 passed</b></summary>

| Test Suite / Metric | Current | Main | Δ | ✓ |
|---|---|---|---|---|
| **Available Actions** | | | | |
| Start Research After Execution | 100% | 90% | +10% | ✅ 🟢 |
| Start Research Includes Leaf Themes | 100% | 100% | 0% | ✅ ⚪ |
| Start Research Excludes Parent Themes | 100% | 100% | 0% | ✅ ⚪ |
| Start Research Uses Current Theme Context | 96% | 96% | +1% | ✅ 🟢 |
| Start Research Avoids Find Stocks Phrase | 100% | 100% | 0% | ✅ ⚪ |
| Start Research Generates Valid Command | 100% | 100% | 0% | ✅ ⚪ |
| **Command Generation** | | | | |
| First response accuracy (command -> command) | 100% | 100% | 0% | ✅ ⚪ |
| Second response accuracy (TL;DR -> analysis) | 100% | 100% | 0% | ✅ ⚪ |
| Overall accuracy (both correct) | 100% | 100% | 0% | ✅ ⚪ |
| **Context Awareness** | | | | |
| Pending Command Awareness | 100% | 100% | 0% | ✅ ⚪ |
| User Override | 100% | 100% | 0% | ✅ ⚪ |
| **Continuation Responses** | | | | |
| AddTheme Continuation | 100% | 100% | 0% | ✅ ⚪ |
| StartResearch Continuation | 100% | 100% | 0% | ✅ ⚪ |
| Research After Theme | 100% | 100% | 0% | ✅ ⚪ |
| **First Interaction** | | | | |
| Success rate | 100% | 100% | 0% | ✅ ⚪ |
| **StartResearch Guard Rails** | | | | |
| Avoids False Analysis Claims | 100% | 100% | 0% | ✅ ⚪ |
| Explains Thematic Matching Only | 100% | 100% | 0% | ✅ ⚪ |
| No Targeted Analysis Promises | 100% | 100% | 0% | ✅ ⚪ |
| Provides Honest Limitations | 100% | 100% | 0% | ✅ ⚪ |
| **Stock Defense Quality** | | | | |
| Information density | 88% | 74% | +14% | ✅ 🟢 |
| Institutional credibility | 95% | 95% | +0% | ✅ 🟢 |
| Content quality | 97% | 96% | +1% | ✅ 🟢 |
| Evidence prioritization | 89% | 84% | +5% | ✅ 🟢 |
| Protocol URL Correctness | 100% | 100% | 0% | ✅ ⚪ |
| Protocol URL Usage | 100% | 100% | 0% | ✅ ⚪ |
| **Theme Analysis Quality** | | | | |
| Overall quality score | 100% | 73% | +27% | ✅ 🟢 |
| Human-friendly quality | 63% | 58% | +5% | ✅ 🟢 |
| Number translation judgment | 96% | 73% | +23% | ✅ 🟢 |
| References analysis | 100% | 78% | +22% | ✅ 🟢 |
| **Theme Generation Quality** | | | | |
| Overall quality score | 91% | 96% | -5% | ✅ 🔴 |
| Reasoning quality score | 91% | 95% | -4% | ✅ 🔴 |
| Variance score (avoiding rigid counts) | 13% | 10% | +3% | ✅ 🟢 |
Updated: 2025-08-09T04:40:28.297Z
</details>
10 test suites. 40+ metrics. Every PR. Those red arrows catch regressions before they ship.
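The GitHub Action is mostly glue: read the scorecard JSON files, render them as markdown, and post the result as a PR comment (the EVALS2_SCORECARD marker at the top is the usual trick for finding and updating an existing comment instead of stacking new ones on every push). The markdown converter is the 204-line piece mentioned below, and its core is plain string building. A simplified sketch, reusing the Scorecard shape from the writer above; the Main and Δ columns come from the comparison step in the next section:

import { readdir, readFile } from "node:fs/promises"
import path from "node:path"

// Renders every scorecard JSON in a directory as one markdown table body.
export async function renderScorecards(dir: string): Promise<string> {
  const rows = ["| Test Suite / Metric | Current | ✓ |", "|---|---|---|"]
  for (const file of (await readdir(dir)).filter((f) => f.endsWith(".json"))) {
    const card = JSON.parse(await readFile(path.join(dir, file), "utf8"))
    rows.push(`| **${card.title}** | | |`)
    for (const m of card.metrics) {
      rows.push(`| ${m.name} | ${m.value.toFixed(0)}${m.unit} | ${m.passed ? "✅" : "❌"} |`)
    }
  }
  return ["<!-- EVALS2_SCORECARD -->", ...rows].join("\n")
}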
The Game Changer: Time Series Comparisons (Week 4)
“What was the score last week?” someone asked.
I discovered that GitHub Actions artifacts are basically a free time-series database (retention is configurable; ours keep 30 days of history):
- name: Download Main Branch Scorecards
  uses: dawidd6/action-download-artifact@v3
  with:
    workflow: evals2-tests.yml
    branch: main
    name: eval-scorecards
    path: scorecards-main/
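With the main-branch scorecards sitting in scorecards-main/, the comparison is a small function: load the baseline card with the same id, diff each metric, and pick an indicator. Roughly what that step looks like; the file layout matches the action above, the rest is illustrative:

import { readFile } from "node:fs/promises"

// Looks up the same metric in the main-branch scorecard (downloaded by the
// action above) and returns the delta string and indicator used in the table.
export async function compareMetric(id: string, name: string, current: number) {
  let baseline: number | undefined
  try {
    const card = JSON.parse(await readFile(`scorecards-main/${id}.json`, "utf8"))
    baseline = card.metrics.find((m: { name: string }) => m.name === name)?.value
  } catch {
    // First run, or the baseline artifact has expired: nothing to compare against.
  }
  if (baseline === undefined) return { delta: "n/a", indicator: "⚪" }
  const delta = current - baseline
  return {
    delta: `${delta >= 0 ? "+" : ""}${delta.toFixed(0)}%`,
    indicator: delta > 0 ? "🟢" : delta < 0 ? "🔴" : "⚪",
  }
}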
Now every PR automatically compares against the main branch. Look at that scorecard above: those red indicators next to "Theme Generation Quality" dropping by 5%? Catching that kind of regression in review is exactly what would have saved me from the CTO demo disaster.
What I Actually Built
After many months of iteration, here’s my entire “eval platform”:
- Test files next to my code (like any other tests)
- A 39-line scorecard writer
- A 204-line markdown converter
- A 128-line GitHub Action
No new dashboards. No new logins. No vendor lock-in.
The Practical Bits That Matter
LLM-as-judge pattern:
const { object } = await generateObject({
  model: googleAI("gemini-2.5-pro"),
  schema: z.object({
    hasGuidanceText: z.boolean(),
    mentionsPendingThemes: z.boolean(),
    // ... other criteria
  }),
  prompt: `Evaluate this response: "${response}"...`
})
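That judged object is essentially what evaluatePendingCommandAwareness returns in the earlier tests. A plausible shape for the whole function, with the prompt abridged and googleAI standing in for whichever configured provider instance you already have:

import { generateObject } from "ai" // Vercel AI SDK
import { z } from "zod"
import { googleAI } from "./models" // assumed: your own provider helper

// Sketch: the Week 2 criteria come straight out of the judge's structured output.
export async function evaluatePendingCommandAwareness(response: string) {
  const { object } = await generateObject({
    model: googleAI("gemini-2.5-pro"),
    schema: z.object({
      hasGuidanceText: z.boolean(),
      hasStartResearchCommand: z.boolean(),
      mentionsPendingThemes: z.boolean(),
      offersNextSteps: z.boolean(),
    }),
    prompt: `The user added themes but has not started research yet.
Evaluate this assistant response against each criterion: "${response}"`,
  })
  return object
}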
Performance tracking:
const start = performance.now()
const result = await fn()
const duration = performance.now() - start
Dataset management: Test fixtures in your repo. You already do this.
Prompt versioning: Database configs. A few days of work, not a platform purchase.
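Both of those amount to a few small files rather than a product. Roughly the shape I mean, with the paths, names, and schema as illustrative assumptions:

import { readFile } from "node:fs/promises"

// Dataset management: fixtures are plain JSON files committed next to the eval tests.
export async function loadFixture<T>(name: string): Promise<T> {
  return JSON.parse(await readFile(`src/evals/fixtures/${name}.json`, "utf8"))
}

// Prompt versioning: one row per prompt, loaded at runtime with whatever ORM you already use.
export interface PromptConfig {
  key: string // e.g. "thematic-analysis-system"
  version: number
  template: string
  updatedAt: string
}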
Start Tomorrow Morning
Pick one LLM behavior that matters to you. Write a test for it—just a regular test in your existing test suite. Run it 10 times. Check the success rate.
That’s it. You’ve started.
The infrastructure will evolve naturally as you need it. You’ll add scorecards when you want visibility. You’ll add comparisons when you want trends. But start with just a test.
The Uncomfortable Truth
We’ve been overthinking this. The infrastructure you need for great AI products already exists in your codebase. It’s the same infrastructure you use for all your other code.
You don’t need to buy a platform to solve a problem you can test your way out of.
P.S. - Yes, there are valid use cases for specialized eval platforms. But you won’t know what you actually need until you’ve run your own evals for a while. Start with tests. When you hit a real limitation, solve that specific problem. You probably won’t hit as many as you think.