Automated LLM Selection & Evaluation — find the best model for your task in minutes.
🎬 Demo
Watch the LLM Evaluator Tool in action — from task input to ranked results in minutes.
✨ What It Does
LLM Evaluator Tool automates the process of selecting and benchmarking the best LLMs for any task you define. It uses Gemini 3.1 Pro (via OpenRouter) as a Judge LLM to fairly evaluate candidate models across multiple dimensions.
Core Workflow
Your Task Description
│
▼
┌───────────────────┐
│ 1. Test Suite │ Judge LLM generates tailored test cases
│ Generation │
└────────┬──────────┘
│
▼
┌───────────────────┐
│ 2. Model │ Discovers top LLMs for your task category
│ Discovery │
└────────┬──────────┘
│
▼
┌───────────────────┐
│ 3. Benchmarking │ Runs all tests across all candidate models
│ Execution │ (captures responses + latency)
└────────┬──────────┘
│
▼
┌───────────────────┐
│ 4. Evaluation │ Judge scores on accuracy, hallucination,
│ (Judge LLM) │ grounding, tool-calling, clarity
└────────┬──────────┘
│
▼
┌───────────────────┐
│ 5. Ranking & │ Top 3 models + latency stats +
│ Prompt Opt. │ optimized system prompt
└───────────────────┘
🚀 Quick Start
1. Clone the Repository
git clone https://github.com/gauravvij/llm-evaluator.git
cd llm-evaluator2. Set Up a Virtual Environment
python3 -m venv venv source venv/bin/activate # Windows: venv\Scripts\activate pip install -r requirements.txt
3. Configure Your OpenRouter API Key
The tool requires an OpenRouter API key. Never hardcode your key — use one of the two methods below:
Option A — Environment Variable (Recommended)
export OPENROUTER_API_KEY="sk-or-v1-your-key-here"
Add this to your ~/.bashrc or ~/.zshrc to persist across sessions.
Option B — .env File
Copy the example file and fill in your key:
Edit .env:
OPENROUTER_API_KEY=sk-or-v1-your-key-here
🔑 Get your free API key at openrouter.ai/keys
🖥️ Usage
Interactive Mode
You'll be prompted to enter your task description.
CLI Mode
# Evaluate LLMs for a coding task python main.py --task "Python software engineering assistant" # Math tutoring with 3 test cases python main.py --task "Math tutoring for high school students" --num-tests 3 # Customer support with 4 candidates, no report saved python main.py --task "Customer support chatbot" --max-candidates 4 --no-save # Custom output directory python main.py --task "Creative writing assistant" --output-dir ./results
All CLI Options
| Flag | Short | Default | Description |
|---|---|---|---|
--task |
-t |
(prompted) | Natural language task description |
--num-tests |
-n |
5 |
Number of test cases to generate |
--max-candidates |
-c |
6 |
Max candidate models to evaluate |
--output-dir |
-o |
./analysis |
Directory to save JSON report |
--no-save |
— | False |
Skip saving the JSON report |
📊 Sample Output
╔══════════════════════════════════════════════════════╗
║ LLM Evaluator Tool — Evaluation Report ║
╚══════════════════════════════════════════════════════╝
Task: Python software engineering assistant
── Step 1/5 — Generating Test Suite ──────────────────
✓ Generated 5 test cases
── Step 2/5 — Discovering Candidate Models ───────────
✓ Found 6 candidate models
── Step 3/5 — Running Benchmark ──────────────────────
✓ 30 responses collected
── Step 4/5 — Evaluating Responses ───────────────────
✓ Judge scored all responses
── Step 5/5 — Ranking & Prompt Optimization ──────────
🏆 Top 3 Models for Your Task:
#1 google/gemini-2.5-pro Score: 92.4 Latency: 1.8s avg
#2 openai/gpt-4.1 Score: 89.1 Latency: 2.1s avg
#3 anthropic/claude-sonnet-4-5 Score: 87.6 Latency: 1.5s avg
📝 Optimized System Prompt (for gemini-2.5-pro):
"You are an expert Python software engineer..."
🏗️ Project Structure
llm-evaluator/
├── main.py # CLI entry point
├── requirements.txt # Python dependencies
├── .env.example # Environment variable template
├── .gitignore
└── src/
├── config.py # Configuration & API key loading
├── openrouter_client.py # OpenRouter API client
├── suite_generator.py # Test suite generation (Judge LLM)
├── model_discovery.py # Candidate model discovery
├── benchmarker.py # Parallel benchmarking engine
├── evaluator.py # Multi-dimensional evaluation
├── prompt_optimizer.py # Optimized prompt generation
└── reporter.py # Rich CLI output & JSON reports
🔒 Security
- No API keys are ever hardcoded in this codebase.
- Keys are loaded exclusively from the
OPENROUTER_API_KEYenvironment variable or a local.envfile. - The
.gitignoreexcludes.envfiles and any local config containing secrets. - Never commit your
.envfile or share your API key publicly.
🤝 Contributing
- Fork the repository
- Create a feature branch:
git checkout -b feature/my-feature - Commit your changes:
git commit -m 'Add my feature' - Push to the branch:
git push origin feature/my-feature - Open a Pull Request
📄 License
MIT License
