# rightsize-cli

**The Biggest Model for Every Task? That's Just Lazy.**
Stop overpaying for AI. Benchmark your prompts against 200+ models via OpenRouter to find the cheapest one that still works.
This is the production-grade CLI version of the RightSize web tool.
## Installation

```bash
# Using pip
pip install rightsize-cli

# Using uv
uv pip install rightsize-cli
```
## Quick Start

```bash
# Set your OpenRouter API key
export RIGHTSIZE_OPENROUTER_API_KEY="sk-or-..."

# List available models
rightsize-cli models

# Run a benchmark
rightsize-cli benchmark test_cases.csv \
  -t prompt.j2 \
  -m google/gemma-3-12b-it \
  -m deepseek/deepseek-chat-v3.1 \
  -j google/gemini-3-flash-preview \
  -b google/gemini-2.5-flash

# Open interactive visualization in the browser
rightsize-cli benchmark data/test_cases.csv \
  -t prompts/classify.j2 \
  -m google/gemma-3-12b-it \
  -m deepseek/deepseek-chat-v3.1 \
  -j google/gemini-3-flash-preview \
  -b google/gemini-2.5-flash \
  --visualize
```
## Run without installing (uvx)

```bash
# Set API key
export RIGHTSIZE_OPENROUTER_API_KEY="sk-or-..."

# List models
uvx rightsize-cli models

# Run benchmark
uvx rightsize-cli benchmark test_cases.csv \
  -t prompt.j2 \
  -m google/gemma-3-12b-it \
  -m deepseek/deepseek-chat-v3.1 \
  -j google/gemini-3-flash-preview \
  -b google/gemini-2.5-flash

# Run benchmark + open web visualization
uvx rightsize-cli benchmark test_cases.csv \
  -t prompt.j2 \
  -m google/gemma-3-12b-it \
  -m deepseek/deepseek-chat-v3.1 \
  -j google/gemini-3-flash-preview \
  -b google/gemini-2.5-flash \
  --visualize
```
## Output

```
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃ Model                       ┃ Accuracy ┃ Latency (p95) ┃ Cost/1k  ┃ Savings  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│ google/gemma-3-12b-it       │ 71.0%    │ 4200ms        │ $0.0028  │ +93.7%   │
│ deepseek/deepseek-chat-v3.1 │ 95.0%    │ 800ms         │ $0.0180  │ +60.0%   │
│ google/gemini-2.5-flash     │ 100.0%   │ 1900ms        │ $0.0450  │ —        │
└─────────────────────────────┴──────────┴───────────────┴──────────┴──────────┘
```
## How It Works

1. **You provide test cases** - A CSV with inputs and expected outputs
2. **Candidate models compete** - All models run the same prompts in parallel (a rough sketch follows this list)
3. **LLM-as-Judge scores** - A judge model compares each output to your expected output
4. **You see the results** - Cost, accuracy, latency; pick the cheapest model that meets your bar
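The fan-out in step 2 is just many concurrent chat-completion calls. The tool's internals aren't shown here; the sketch below is a minimal illustration of the idea, assuming only OpenRouter's public `/chat/completions` endpoint and model IDs taken from the Quick Start (the prompt string is a placeholder).

```python
import asyncio
import os

import httpx

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
API_KEY = os.environ["RIGHTSIZE_OPENROUTER_API_KEY"]

async def complete(client: httpx.AsyncClient, model: str, prompt: str) -> str:
    # One chat-completion call against OpenRouter.
    resp = await client.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

async def main() -> None:
    models = ["google/gemma-3-12b-it", "deepseek/deepseek-chat-v3.1"]
    prompt = "Classify this support ticket ..."  # rendered from your template

    async with httpx.AsyncClient(timeout=60) as client:
        # Every candidate model answers the same prompt concurrently.
        outputs = await asyncio.gather(*(complete(client, m, prompt) for m in models))

    for model, output in zip(models, outputs):
        print(f"{model}: {output}")

asyncio.run(main())
```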
## CSV Format

Two columns: `input_data` and `expected_output`:

```csv
input_data,expected_output
"My order hasn't arrived.",billing::high
"How do I reset my password?",account::high
"I want a refund!",refund::high
```
The judge model compares each model's output to `expected_output` and scores:

- **1.0** - Exact or semantic match
- **0.8** - Very close with minor differences
- **0.5** - Partially correct
- **0.0** - Wrong or irrelevant
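The exact judge prompt is internal to the tool. As a rough illustration of how an LLM-as-judge rubric like the one above gets rendered per test case, here is a hypothetical version (the wording and helper name are assumptions, not the tool's actual prompt):

```python
JUDGE_PROMPT = """\
You are grading a model's answer against an expected answer.

EXPECTED: {expected_output}
ACTUAL: {actual_output}

Score the match:
1.0 = exact or semantic match
0.8 = very close with minor differences
0.5 = partially correct
0.0 = wrong or irrelevant

Reply with the score only.
"""

def render_judge_prompt(expected_output: str, actual_output: str) -> str:
    # Fill the rubric with one test case's expected/actual pair.
    return JUDGE_PROMPT.format(
        expected_output=expected_output, actual_output=actual_output
    )

print(render_judge_prompt("billing::high", "billing::medium"))
```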
### Best practices for test data

- **Use minimal output formats** - Delimiter-separated (`category::confidence`) keeps responses short and costs low
- **Consistent task type** - All rows should be the same kind of task
- **Representative samples** - Use real data from your production use case
- **Clear expected outputs** - Unambiguous so the judge can score fairly
- **10-20 test cases** - Enough to be statistically meaningful, fast to run
## Prompt Templates

Templates wrap your inputs with instructions. Both Jinja2 (`.j2`) and Python f-string templates are supported.

Example: Classification template (`prompt.j2`):

```
Classify this support ticket.

CATEGORIES: billing, account, refund, subscription, technical
CONFIDENCE: high, medium, low

OUTPUT FORMAT: <category>::<confidence>

OUTPUT ONLY the format above. No explanation. No punctuation. No other text.

TICKET: {{ input_data }}

OUTPUT:
```

Example: Extraction template (`extract.j2`):

```
Extract the email from this text.

OUTPUT FORMAT: <email or NONE>

OUTPUT ONLY the format above. No explanation. No other text.

TEXT: {{ input_data }}

OUTPUT:
```
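The f-string flavor works the same way with single-brace placeholders. A minimal sketch of the classification template in f-string form (the exact filename convention is an assumption; only `.j2` is named in this README):

```
Classify this support ticket.

OUTPUT FORMAT: <category>::<confidence>

OUTPUT ONLY the format above. No explanation. No other text.

TICKET: {input_data}

OUTPUT:
```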
### Template variable

| Variable | Description |
|---|---|
| `input_data` | The value from your CSV's `input_data` column |
## CLI Reference

### rightsize-cli benchmark

```bash
rightsize-cli benchmark <csv_file> [OPTIONS]
```
| Option | Short | Default | Description |
|---|---|---|---|
| `--template` | `-t` | (required) | Path to prompt template file |
| `--model` | `-m` | (required) | Model ID to test (repeat for multiple) |
| `--judge` | `-j` | (required) | Model for judging outputs |
| `--baseline` | `-b` | None | Baseline model for savings calculation |
| `--concurrency` | `-c` | 10 | Max parallel requests |
| `--output` | `-o` | `table` | Output format: `table`, `json`, `csv` |
| `--verbose` | `-v` | False | Show detailed outputs and judge scores |
| `--visualize` | `-V` | False | Open interactive web visualization |
### rightsize-cli models
List all available models and their pricing:
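```bash
rightsize-cli models
```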
## Configuration

Set via environment variables or a `.env` file:
| Variable | Required | Default | Description |
|---|---|---|---|
| `RIGHTSIZE_OPENROUTER_API_KEY` | Yes | - | Your OpenRouter API key |
| `RIGHTSIZE_MAX_CONCURRENCY` | No | 10 | Default concurrency |
| `RIGHTSIZE_TIMEOUT_SECONDS` | No | 60 | Request timeout |
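For example, a `.env` file in your working directory might look like this (the non-key values here are arbitrary illustrative picks, not recommendations):

```bash
# .env
RIGHTSIZE_OPENROUTER_API_KEY=sk-or-...
RIGHTSIZE_MAX_CONCURRENCY=20
RIGHTSIZE_TIMEOUT_SECONDS=120
```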
## Examples

### Compare cheap models against a baseline

```bash
uvx rightsize-cli benchmark test_cases.csv \
  -t prompt.j2 \
  -m google/gemma-3-12b-it \
  -m google/gemma-3-27b-it \
  -m qwen/qwen3-8b \
  -m meta-llama/llama-3.3-70b-instruct \
  -j google/gemini-3-flash-preview \
  -b google/gemini-2.5-flash
```
### Use a stronger judge model

```bash
uvx rightsize-cli benchmark test_cases.csv \
  -t prompt.j2 \
  -m google/gemma-3-12b-it \
  -m deepseek/deepseek-chat-v3.1 \
  -j anthropic/claude-sonnet-4 \
  -b google/gemini-2.5-flash
```
### Export results to JSON

```bash
uvx rightsize-cli benchmark test_cases.csv \
  -t prompt.j2 \
  -m google/gemma-3-12b-it \
  -m deepseek/deepseek-chat-v3.1 \
  -j google/gemini-3-flash-preview \
  -b google/gemini-2.5-flash \
  -o json > results.json
```

### Debug with verbose mode
```bash
uvx rightsize-cli benchmark test_cases.csv \
  -t prompt.j2 \
  -m google/gemma-3-12b-it \
  -j google/gemini-3-flash-preview \
  -b google/gemini-2.5-flash \
  -v
```
## Tips

- **Use minimal output formats** - `category::confidence` is cheaper than JSON, and JSON is cheaper than prose
- **End prompts with "OUTPUT:"** - Primes the model to respond immediately without preamble
- **Start with 10-20 test cases** - Enough to be representative, fast to iterate
- **Set a quality bar** - Decide what accuracy % is acceptable (e.g., 95%+); see the sketch after this list
- **Consider latency** - Sometimes a slower cheap model isn't worth it
- **Iterate on prompts** - A better prompt can make cheaper models work better
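One way to apply that quality bar is to filter the JSON export programmatically. A minimal sketch, assuming a `results.json` produced with `-o json`, a top-level list of per-model objects, and hypothetical field names (`model`, `accuracy`, `cost_per_1k`); check the actual output for the real schema:

```python
import json

QUALITY_BAR = 0.95  # minimum acceptable accuracy

with open("results.json") as f:
    results = json.load(f)  # assumed: a list of per-model result objects

# Keep the models that clear the bar, then take the cheapest survivor.
passing = [r for r in results if r["accuracy"] >= QUALITY_BAR]
best = min(passing, key=lambda r: r["cost_per_1k"])
print(f"Cheapest model above the bar: {best['model']}")
```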
## Development

```bash
# Clone the repo
git clone https://github.com/NehmeAILabs/rightsize-cli.git
cd rightsize-cli

# Install in dev mode
uv pip install -e .

# Run locally
rightsize-cli models
```
## License
MIT