LLM Sanity Checks
A practical guide to not over-engineering your AI stack.
Before you reach for a frontier model, ask yourself: does this actually need a trillion-parameter model?
Most tasks don't. This repo helps you figure out which ones.
The Decision Tree
```
You have a task.
                 │
                 ▼
┌─────────────────────────────────┐
│ Can regex, rules, or lookup     │
│ tables solve it?                │──── YES ───► Use that. Stop.
└─────────────────────────────────┘
                 │ NO
                 ▼
┌─────────────────────────────────┐
│ Is it structured/tabular data?  │──── YES ───► XGBoost, random forest,
│ (predictions from features)     │              or logistic regression.
└─────────────────────────────────┘              Often beats LLMs here.
                 │ NO
                 ▼
┌─────────────────────────────────┐
│ Is it search/retrieval?         │──── YES ───► BM25 first. Add vector
│                                 │              search if needed.
└─────────────────────────────────┘
                 │ NO
                 ▼
┌─────────────────────────────────┐
│ Does it need external knowledge │──── YES ───► <100 pages? Stuff in context.
│ beyond the input text?          │              Larger? Then consider RAG.
└─────────────────────────────────┘
                 │ NO
                 ▼
┌─────────────────────────────────┐
│ Is the task simple?             │──── YES ───► Small model (1B-8B).
│ (classification, extraction,    │              Test it first.
│  summarization)                 │
└─────────────────────────────────┘
                 │ NO
                 ▼
You might need a frontier
model. But measure first.
```
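In code, the pre-LLM branches of the tree are just a few ordered checks. A minimal sketch, where the boolean flags and the `route()` helper are hypothetical placeholders for your own task analysis:

```python
import re

# Example of the "regex solves it" branch: a pure-regex extraction task.
ORDER_ID = re.compile(r"\bORD-\d{6}\b")

def route(task: dict) -> str:
    """Walk the decision tree; the flags are placeholders for your own analysis."""
    if task.get("rules_or_regex_possible"):
        return "regex / rules / lookup table"
    if task.get("tabular_features"):
        return "XGBoost, random forest, or logistic regression"
    if task.get("is_retrieval"):
        return "BM25 first; add vector search only if recall is poor"
    if task.get("needs_external_knowledge"):
        return "stuff docs in context" if task.get("corpus_pages", 0) < 100 else "consider RAG"
    if task.get("is_simple"):  # classification, extraction, summarization
        return "small model (1B-8B), tested on real examples"
    return "maybe a frontier model -- but measure first"

print(ORDER_ID.search("Re: refund for order ORD-482913").group())  # ORD-482913
print(route({"rules_or_regex_possible": True}))                    # regex / rules / lookup table
```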
Quick Checks
Check 1: Can you describe the task in one sentence?
If yes → probably a small model task.
- "Extract the company name from this email" → Gemma 4B
- "Classify this support ticket as billing/technical/other" → Gemma 1B
- "Summarize this paragraph in 2 sentences" → Phi-4
If no → you might have an architecture problem, not a model problem.
Check 2: What's your accuracy requirement?
| Accuracy needed | Model size | Why |
|---|---|---|
| 70-80% | 1B-4B | Good enough for suggestions, drafts, triage |
| 85-95% | 4B-12B | Production-ready for most tasks |
| 95%+ | Small, fine-tuned | Fine-tuning a small model beats scaling up for the last few points |
Scaling to frontier models rarely buys you more than 5% accuracy on simple tasks. That 5% costs 50x more.
Check 3: How many output tokens do you need?
Output tokens are the bottleneck: they are generated one at a time, so they drive most of your latency and cost.
| Output type | Tokens | Consider |
|---|---|---|
| Yes/No, True/False | 1-5 | Tiny model, or even logit bias (see the sketch below) |
| Category label | 1-10 | 1B model is enough |
| Short extraction | 10-50 | Delimiter-separated output > JSON |
| Paragraph | 50-200 | 4B-8B models |
| Long generation | 500+ | Maybe you need a bigger model |
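For the single-label rows, you can enforce the token budget at the API level. A sketch assuming an OpenAI-compatible chat endpoint and the tiktoken tokenizer; `logit_bias` support, token IDs, and the model name vary by provider, so treat the values as illustrative:

```python
# Force a one-token Yes/No answer: hard-cap output length and bias decoding
# toward the two label tokens. Token IDs differ per tokenizer, so compute them.
import tiktoken
from openai import OpenAI

enc = tiktoken.get_encoding("o200k_base")
yes_id, no_id = enc.encode("Yes")[0], enc.encode("No")[0]

client = OpenAI()  # assumes OPENAI_API_KEY (or a compatible base_url) is configured
resp = client.chat.completions.create(
    model="gpt-4o-mini",                   # placeholder: any small model your provider serves
    messages=[{"role": "user",
               "content": "Is this email spam? Answer Yes or No.\n\n<email text>"}],
    max_tokens=1,                          # hard cap: one output token
    logit_bias={yes_id: 100, no_id: 100},  # steer decoding toward the two labels
)
print(resp.choices[0].message.content)     # "Yes" or "No"
```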
The JSON Tax
Everyone defaults to JSON for structured output. But JSON has overhead:
```
# JSON output (35 tokens)
{"name": "John Smith", "company": "Acme Corp", "title": "CTO", "status": "lead"}

# Delimiter output (11 tokens)
John Smith::Acme Corp::CTO::lead
```
For simple extraction tasks:
- 3x fewer output tokens
- 3x faster inference
- 3x cheaper
When to use JSON: nested structures, optional fields, API contracts. When to use delimiters: simple extraction, high-volume pipelines.
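Token counts depend on the tokenizer, so measure the overhead on your own outputs. A sketch using tiktoken for counting, plus a hypothetical `parse_delimited()` helper with a hard field-count check:

```python
# Count tokens for both formats and parse the delimiter output back into a dict.
import json
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

json_out = '{"name": "John Smith", "company": "Acme Corp", "title": "CTO", "status": "lead"}'
delim_out = "John Smith::Acme Corp::CTO::lead"
print(len(enc.encode(json_out)), len(enc.encode(delim_out)))  # roughly 3x difference

FIELDS = ["name", "company", "title", "status"]

def parse_delimited(line: str) -> dict:
    """Turn 'a::b::c::d' into a dict; fail loudly if the field count is wrong."""
    values = line.strip().split("::")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))

assert parse_delimited(delim_out) == json.loads(json_out)
```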
Model Selection Cheat Sheet
Tiny (1B-4B params)
Best for: classification, yes/no, simple extraction
| Model | Params | Good at |
|---|---|---|
| Gemma 3 1B | 1B | Classification, simple Q&A |
| Phi-4-mini | 3.8B | Reasoning, function calling |
| Gemma 3 4B | 4B | Best tiny all-rounder |
Small (8B-17B params)
Best for: most production tasks, RAG, extraction, summarization
| Model | Params | Good at |
|---|---|---|
| Qwen 3 8B | 8B | Multilingual, reasoning |
| Gemma 3 12B | 12B | Quality/speed balance |
| Phi-4 | 14B | Reasoning, math |
| Llama 4 Scout | 17B active (109B total) | Multimodal, long context (10M tokens) |
Medium (27B-70B params, or MoE equivalents)
Best for: complex reasoning, long context, multi-step tasks
| Model | Params | Good at |
|---|---|---|
| Gemma 3 27B | 27B | Near-frontier quality |
| Qwen 3 32B | 32B | Complex tasks |
| Llama 4 Maverick | 400B (17B active) | MoE, strong all-rounder |
Frontier (100B+ dense params)
Best for: novel tasks, complex reasoning, when nothing else works
Before you use these, ask: have you tried a smaller model?
Anti-Patterns
❌ "We use GPT-5 for everything"
That's not a flex. That's a $50K/month cloud bill waiting to happen.
❌ "We need the best model for our enterprise customers"
Your enterprise customers care about latency, reliability, and cost. Not model prestige.
❌ "Small models aren't accurate enough"
Did you test? With the right prompt? On your actual data?
❌ "We'll optimize later"
You'll optimize never. The technical debt compounds. Start right-sized.
❌ "JSON output is industry standard"
For simple extraction, it's industry waste. See: The JSON Tax.
❌ "We need RAG for our documents"
For small document sets? No you don't.
Context windows are now 2M-10M tokens. That's thousands of pages. If your knowledge base is <100 pages, just stuff it in context. Preprocess, convert to markdown, include directly.
RAG adds complexity: chunking strategies, embedding models, vector databases, retrieval tuning, reranking. All that infrastructure for documents that fit in a single prompt.
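For comparison, here is roughly what context stuffing looks like. A minimal sketch assuming your documents are already converted to markdown files in a `docs/` folder and you call an OpenAI-compatible client; the model name is a placeholder:

```python
# Context stuffing: load a small markdown knowledge base and put it straight
# into the prompt. No chunking, no embeddings, no vector DB.
from pathlib import Path
from openai import OpenAI

def load_knowledge_base(folder: str = "docs") -> str:
    parts = []
    for path in sorted(Path(folder).glob("*.md")):
        parts.append(f"## {path.name}\n\n{path.read_text(encoding='utf-8')}")
    return "\n\n".join(parts)

def answer(question: str) -> str:
    kb = load_knowledge_base()
    client = OpenAI()
    resp = client.chat.completions.create(
        model="any-long-context-model",  # placeholder: pick one with enough context
        messages=[
            {"role": "system", "content": "Answer using only the documents below.\n\n" + kb},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```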
When RAG makes sense:
- Millions of documents
- Frequently changing content (re-embedding is cheaper than re-caching)
- Cost-sensitive at scale (RAG is 8-80x cheaper per query for large, static corpora)
When to skip RAG:
- <100 pages of docs
- Static content that rarely changes
- You value simplicity over marginal retrieval precision
Patterns
✅ Cascade Architecture
Start with smallest model. Verify output. Escalate only on failure.
```
Input Text
     │
     ▼
┌──────────┐              ┌──────────┐  verified?   ┌──────────┐
│ Gemma 4B │ ───────────► │ Verifier │ ───────────► │  Return  │
└──────────┘    output    └──────────┘     yes      └──────────┘
                               │ no
                               ▼
┌─────────────┐           ┌──────────┐  verified?   ┌──────────┐
│ Llama Scout │ ────────► │ Verifier │ ───────────► │  Return  │
└─────────────┘           └──────────┘              └──────────┘
```
Verifier can be: format validation, a classifier, or FlashCheck for grounding checks.
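A minimal sketch of the cascade (separate from the repo's examples/cascade.py), using a regex format check as the verifier and a hypothetical `call_model()` standing in for your inference backend:

```python
# Try the small model first, verify the output format, escalate only on failure.
import re

EXPECTED = re.compile(r"^[^:]+::[^:]+::[^:]+::[^:]+$")  # name::company::title::status

def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError("wire this to your inference backend (Ollama, vLLM, an API, ...)")

def verify(output: str) -> bool:
    """Cheap verifier: here a format check; could be a classifier or FlashCheck."""
    return bool(EXPECTED.match(output.strip()))

def extract(text: str) -> str:
    prompt = f"Extract name::company::title::status from:\n{text}"
    for model in ("gemma-3-4b", "llama-4-scout"):  # smallest first
        output = call_model(model, prompt)
        if verify(output):
            return output
    raise ValueError("both tiers failed verification; flag for human review")
```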
See examples/cascade.py for a working extraction example.
✅ Task-Specific Models
One model per task type, sized appropriately (a minimal routing sketch follows the list).
- Classification → 1B
- Extraction → 4B
- Summarization → 8B
- Complex reasoning → Frontier
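As configuration, this is just a dictionary. A sketch with example model names from the cheat sheet above; `generate()` is a hypothetical wrapper around whatever backend you use:

```python
# Per-task routing as plain configuration.
MODEL_BY_TASK = {
    "classification": "gemma-3-1b",
    "extraction":     "gemma-3-4b",
    "summarization":  "qwen-3-8b",
    "reasoning":      "frontier",   # escalate only for this bucket
}

def generate(model: str, prompt: str) -> str:
    raise NotImplementedError("wire this to your inference backend")

def run(task_type: str, prompt: str) -> str:
    return generate(MODEL_BY_TASK[task_type], prompt)
```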
✅ Measure First, Scale Never
Before adding a bigger model:
1. Benchmark the current model on 100 real examples (see the sketch after this list)
2. Identify failure modes
3. Try prompt engineering
4. Try fine-tuning a small model
5. Only then consider scaling
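A sketch of step 1, assuming an `examples.jsonl` file of labeled production examples (one `{"input": ..., "expected": ...}` object per line) and a hypothetical `call_model()` helper:

```python
# Accuracy over ~100 labeled examples, plus the most common failure modes.
import json
from collections import Counter

def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError("wire this to your inference backend")

def benchmark(model: str, path: str = "examples.jsonl") -> None:
    examples = [json.loads(line) for line in open(path, encoding="utf-8") if line.strip()]
    failures = Counter()
    correct = 0
    for ex in examples:
        got = call_model(model, ex["input"]).strip()
        if got == ex["expected"]:
            correct += 1
        else:
            failures[(ex["expected"], got)] += 1  # track failure modes, not just a score
    print(f"{model}: {correct}/{len(examples)} = {correct / len(examples):.0%}")
    for (expected, got), n in failures.most_common(5):
        print(f"  {n}x expected {expected!r}, got {got!r}")

# benchmark("gemma-3-4b")  # run once call_model() is wired up
```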
✅ Simple Tools Over Browser Automation
For research tasks, don't reach for computer use or Puppeteer.
Search API → Fetch URL → HTML to Markdown → LLM synthesis
Three tools. No browser. No screenshots. No vision model.
Browser automation is only for: login walls, dynamic forms, actions (booking, purchasing).
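A sketch of the three-tool pipeline using `requests` and `html2text`; the `search()` step is a placeholder because every search API differs, and the synthesis call assumes an OpenAI-compatible client with a placeholder model name:

```python
# search -> fetch -> HTML to markdown -> LLM synthesis. No browser, no screenshots.
import requests
import html2text
from openai import OpenAI

def search(query: str) -> list[str]:
    raise NotImplementedError("call your search API here; return a list of URLs")

def fetch_markdown(url: str) -> str:
    html = requests.get(url, timeout=15).text
    return html2text.html2text(html)  # strips tags, keeps headings and links

def research(question: str) -> str:
    pages = [fetch_markdown(u) for u in search(question)[:3]]
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gemma-3-12b",  # placeholder: any small model your provider serves
        messages=[{"role": "user",
                   "content": f"Answer the question from these pages.\n\nQuestion: {question}\n\n"
                              + "\n\n---\n\n".join(pages)}],
    )
    return resp.choices[0].message.content
```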
See patterns/agents.md for the full agent decision tree.
More Patterns
- Extraction patterns — regex → NER → small LLM ladder
- Agent patterns — when to use simple tools vs browser automation
Tools
RightSize
Test your prompts against multiple model sizes. See what's actually needed.
FlashCheck
Verify LLM outputs with tiny specialized models. Sub-10ms verification.
Contributing
Found a pattern that works? Open a PR.
Keep it practical. Keep it measured. No vibes-based claims.
License
MIT. Use it. Share it. Don't over-engineer it.
Built by Nehme AI Labs — AI architecture consultancy.