NehmeAILabs/llm-sanity-checks


LLM Sanity Checks

A practical guide to not over-engineering your AI stack.

Before you reach for a frontier model, ask yourself: does this actually need a trillion-parameter model?

Most tasks don't. This repo helps you figure out which ones.


The Decision Tree

                         You have a task.
                                │
                                ▼
              ┌─────────────────────────────────┐
              │  Can regex, rules, or lookup    │
              │  tables solve it?               │──── YES ───► Use that. Stop.
              └─────────────────────────────────┘
                                │ NO
                                ▼
              ┌─────────────────────────────────┐
              │  Is it structured/tabular data? │──── YES ───► XGBoost, random forest,
              │  (predictions from features)    │              or logistic regression.
              └─────────────────────────────────┘              Often beats LLMs here.
                                │ NO
                                ▼
              ┌─────────────────────────────────┐
              │  Is it search/retrieval?        │──── YES ───► BM25 first. Add vector
              │                                 │              search if needed.
              └─────────────────────────────────┘
                                │ NO
                                ▼
              ┌─────────────────────────────────┐
              │  Does it need external knowledge│──── YES ───► <100 pages? Stuff in context.
              │  beyond the input text?         │              Larger? Then consider RAG.
              └─────────────────────────────────┘
                                │ NO
                                ▼
              ┌─────────────────────────────────┐
              │  Is the task simple?            │──── YES ───► Small model (1B-8B).
              │  (classification, extraction,   │              Test it first.
              │   summarization)                │
              └─────────────────────────────────┘
                                │ NO
                                ▼
                      You might need a frontier
                      model. But measure first.
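
The same tree as a quick Python sketch (the TaskProfile fields and the wording of the recommendations are illustrative, not part of any library):

from dataclasses import dataclass

@dataclass
class TaskProfile:
    # Describe the task honestly before picking a tool. Fields are illustrative.
    rules_can_solve: bool = False        # regex, rules, or lookup tables would do it
    tabular: bool = False                # predictions from structured features
    retrieval: bool = False              # search over a corpus
    needs_external_knowledge: bool = False
    knowledge_pages: int = 0             # rough size of the knowledge base
    simple_nlp: bool = False             # classification, extraction, summarization

def recommend(task: TaskProfile) -> str:
    if task.rules_can_solve:
        return "Regex, rules, or a lookup table. Stop."
    if task.tabular:
        return "XGBoost, random forest, or logistic regression."
    if task.retrieval:
        return "BM25 first. Add vector search if needed."
    if task.needs_external_knowledge:
        return "Stuff it in context." if task.knowledge_pages < 100 else "Consider RAG."
    if task.simple_nlp:
        return "Small model (1B-8B). Test it first."
    return "Maybe a frontier model. Measure first."

print(recommend(TaskProfile(simple_nlp=True)))   # Small model (1B-8B). Test it first.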

Quick Checks

Check 1: Can you describe the task in one sentence?

If yes → probably a small model task.

  • "Extract the company name from this email" → Gemma 4B
  • "Classify this support ticket as billing/technical/other" → Gemma 1B
  • "Summarize this paragraph in 2 sentences" → Phi-4

If no → you might have an architecture problem, not a model problem.

Check 2: What's your accuracy requirement?

Accuracy needed   Model size   Why
70-80%            1B-4B        Good enough for suggestions, drafts, triage
85-95%            4B-12B       Production-ready for most tasks
95%+                           Consider fine-tuning a small model, not scaling up

Scaling to frontier models rarely buys you more than 5% accuracy on simple tasks. That 5% costs 50x more.

Check 3: How many output tokens do you need?

Output tokens are the bottleneck. They determine latency and cost.

Output type          Tokens   Consider
Yes/No, True/False   1-5      Tiny model, or even logit bias
Category label       1-10     1B model is enough
Short extraction     10-50    Delimiter-separated output > JSON
Paragraph            50-200   4B-8B models
Long generation      500+     Maybe you need a bigger model
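
For the Yes/No row, one way to guarantee a single-token answer is to cap max_tokens at 1 and bias the logits toward the two labels. A minimal sketch, assuming an OpenAI-compatible endpoint that supports logit_bias and the tiktoken package (model name and prompt are placeholders):

import tiktoken
from openai import OpenAI

MODEL = "gpt-4o-mini"                     # placeholder; use whatever you actually run
client = OpenAI()                         # assumes OPENAI_API_KEY is set
enc = tiktoken.encoding_for_model(MODEL)  # tokenizer must match the model

# Push the logits of " Yes" and " No" up so the single output token is one of them.
bias = {str(tid): 100 for label in (" Yes", " No") for tid in enc.encode(label)}

resp = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user",
               "content": "Is this ticket about billing? Answer Yes or No.\n\n<ticket text>"}],
    max_tokens=1,
    logit_bias=bias,
)
print(resp.choices[0].message.content)    # "Yes" or "No"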

The JSON Tax

Everyone defaults to JSON for structured output. But JSON has overhead:

# JSON output (35 tokens)
{"name": "John Smith", "company": "Acme Corp", "title": "CTO", "status": "lead"}

# Delimiter output (11 tokens)
John Smith::Acme Corp::CTO::lead

For simple extraction tasks:

  • 3x fewer output tokens
  • 3x faster inference
  • 3x cheaper

When to use JSON: nested structures, optional fields, API contracts. When to use delimiters: simple extraction, high-volume pipelines.
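
A minimal sketch of the delimiter side (prompt wording and field order are illustrative):

FIELDS = ["name", "company", "title", "status"]

PROMPT = (
    "Extract the contact's name, company, title, and status from the email below.\n"
    "Reply with one line: the four values separated by '::', in that order.\n"
    "Use 'unknown' for anything missing. No other text.\n\n{email}"
)

def parse_line(line: str) -> dict:
    parts = [p.strip() for p in line.strip().split("::")]
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}: {line!r}")
    return dict(zip(FIELDS, parts))

print(parse_line("John Smith::Acme Corp::CTO::lead"))
# {'name': 'John Smith', 'company': 'Acme Corp', 'title': 'CTO', 'status': 'lead'}

The length check doubles as a cheap verifier for the cascade pattern further down.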

Read more: The JSON Tax →


Model Selection Cheat Sheet

Tiny (1B-4B params)

Best for: classification, yes/no, simple extraction

Model        Params   Good at
Gemma 3 1B   1B       Classification, simple Q&A
Phi-4-mini   3.8B     Reasoning, function calling
Gemma 3 4B   4B       Best tiny all-rounder

Small (8B-17B params)

Best for: most production tasks, RAG, extraction, summarization

Model           Params   Good at
Qwen 3 8B       8B       Multilingual, reasoning
Gemma 3 12B     12B      Quality/speed balance
Phi-4           14B      Reasoning, math
Llama 4 Scout   17B      Multimodal, long context (10M tokens)

Medium (27B-70B params)

Best for: complex reasoning, long context, multi-step tasks

Model              Params              Good at
Gemma 3 27B        27B                 Near-frontier quality
Qwen 3 32B         32B                 Complex tasks
Llama 4 Maverick   400B (17B active)   MoE, strong all-rounder

Frontier (100B+ dense params)

Best for: novel tasks, complex reasoning, when nothing else works

Before you use these, ask: have you tried a smaller model?


Anti-Patterns

❌ "We use GPT-5 for everything"

That's not a flex. That's a $50K/month cloud bill waiting to happen.

❌ "We need the best model for our enterprise customers"

Your enterprise customers care about latency, reliability, and cost. Not model prestige.

❌ "Small models aren't accurate enough"

Did you test? With the right prompt? On your actual data?

❌ "We'll optimize later"

You'll optimize never. The technical debt compounds. Start right-sized.

❌ "JSON output is industry standard"

For simple extraction, it's industry waste. See: The JSON Tax.

❌ "We need RAG for our documents"

For small document sets? No, you don't.

Context windows are now 2M-10M tokens. That's thousands of pages. If your knowledge base is <100 pages, just stuff it in context. Preprocess, convert to markdown, include directly.
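
A minimal context-stuffing sketch, assuming your docs are already converted to markdown in a docs/ folder (paths and prompt wording are illustrative):

from pathlib import Path

def build_context(docs_dir: str = "docs") -> str:
    """Concatenate every markdown doc into one block to prepend to the prompt."""
    parts = []
    for path in sorted(Path(docs_dir).glob("**/*.md")):
        parts.append(f"## {path.name}\n\n{path.read_text(encoding='utf-8')}")
    return "\n\n".join(parts)

SYSTEM_PROMPT = "Answer strictly from the documentation below.\n\n" + build_context()
# Pass SYSTEM_PROMPT as the system message to whatever model you call.
# No chunking, no embeddings, no vector store.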

RAG adds complexity: chunking strategies, embedding models, vector databases, retrieval tuning, reranking. All that infrastructure for documents that fit in a single prompt.

When RAG makes sense:

  • Millions of documents
  • Frequently changing content (re-embedding is cheaper than re-caching)
  • Cost-sensitive at scale (RAG is 8-80x cheaper per query for large, static corpora)

When to skip RAG:

  • <100 pages of docs
  • Static content that rarely changes
  • You value simplicity over marginal retrieval precision

Patterns

✅ Cascade Architecture

Start with smallest model. Verify output. Escalate only on failure.

Input Text
    │
    ▼
┌─────────┐              ┌────────────┐    verified?     ┌──────────┐
│ Gemma 4B│ ───────────► │  Verifier  │ ───────────────► │ Return   │
└─────────┘   output     └────────────┘   yes            └──────────┘
                              │ no
                              ▼
┌─────────────┐          ┌────────────┐    verified?     ┌──────────┐
│ Llama Scout │ ───────► │  Verifier  │ ───────────────► │ Return   │
└─────────────┘          └────────────┘                  └──────────┘

Verifier can be: format validation, a classifier, or FlashCheck for grounding checks.
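
The basic loop, as an illustrative sketch (call_model and the verifier are placeholders; this is not the repo's examples/cascade.py):

def call_model(model: str, prompt: str) -> str:
    """Placeholder: call whatever inference stack you use (vLLM, Ollama, an API)."""
    raise NotImplementedError

def verify(output: str) -> bool:
    """Cheapest check that catches most failures: here, plain format validation."""
    return len(output.split("::")) == 4

CASCADE = ["gemma-3-4b", "llama-4-scout"]    # smallest first, escalate on failure

def extract(prompt: str) -> str:
    last = ""
    for model in CASCADE:
        last = call_model(model, prompt)
        if verify(last):
            return last                      # the small model was enough; stop here
    return last                              # every tier failed verification; flag for review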

See examples/cascade.py for a working extraction example.

✅ Task-Specific Models

One model per task type, sized appropriately.

  • Classification → 1B
  • Extraction → 4B
  • Summarization → 8B
  • Complex reasoning → Frontier
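
As configuration, that's just a lookup (model names are placeholders; swap in whatever you actually run):

# Route each task type to the smallest model that handles it.
MODEL_BY_TASK = {
    "classification": "gemma-3-1b",
    "extraction":     "gemma-3-4b",
    "summarization":  "qwen-3-8b",
    "reasoning":      "your-frontier-model",
}

def model_for(task_type: str) -> str:
    return MODEL_BY_TASK[task_type]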

✅ Measure First, Scale Never

Before adding a bigger model:

  1. Benchmark current model on 100 real examples (see the sketch after this list)
  2. Identify failure modes
  3. Try prompt engineering
  4. Try fine-tuning small model
  5. Only then consider scaling
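
A minimal loop for step 1, assuming a list of (input, expected) pairs from real traffic and a call_model function you provide:

def benchmark(examples, call_model, normalize=str.strip):
    """examples: list of (input_text, expected_output) pairs."""
    failures = []
    for text, expected in examples:
        got = call_model(text)
        if normalize(got) != normalize(expected):
            failures.append((text, expected, got))
    accuracy = 1 - len(failures) / len(examples)
    return accuracy, failures

# Read the failures before reaching for a bigger model: they tell you whether
# prompting, fine-tuning, or scaling is the actual fix.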

✅ Simple Tools Over Browser Automation

For research tasks, don't reach for computer use or Puppeteer.

Search API → Fetch URL → HTML to Markdown → LLM synthesis

Three tools. No browser. No screenshots. No vision model.

Browser automation is only for: login walls, dynamic forms, actions (booking, purchasing).
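
A sketch of that pipeline, assuming the requests and markdownify packages and a search API of your choice (search and synthesize are placeholders you provide):

import requests
from markdownify import markdownify as to_markdown

def fetch_as_markdown(url: str) -> str:
    html = requests.get(url, timeout=15).text
    return to_markdown(html)

def research(question: str, search, synthesize, top_k: int = 3) -> str:
    """search(question) -> list of URLs; synthesize(question, notes) -> one LLM call."""
    notes = [fetch_as_markdown(url) for url in search(question)[:top_k]]
    return synthesize(question, "\n\n---\n\n".join(notes))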

See patterns/agents.md for the full agent decision tree.


More Patterns


Tools

RightSize

Test your prompts against multiple model sizes. See what's actually needed.

→ Try RightSize

FlashCheck

Verify LLM outputs with tiny specialized models. Sub-10ms verification.

→ Learn about FlashCheck


Contributing

Found a pattern that works? Open a PR.

Keep it practical. Keep it measured. No vibes-based claims.


License

MIT. Use it. Share it. Don't over-engineer it.


Built by Nehme AI Labs — AI architecture consultancy.