rightsize-cli

The Biggest Model for Every Task? That's Just Lazy.

Stop overpaying for AI. Benchmark your prompts against 200+ models via OpenRouter to find the cheapest one that still works.

This is the production-grade CLI version of the RightSize web tool.

Installation

# Using pip
pip install rightsize-cli

# Using uv
uv pip install rightsize-cli

Quick Start

# Set your OpenRouter API key
export RIGHTSIZE_OPENROUTER_API_KEY="sk-or-..."

# List available models
rightsize-cli models

# Run a benchmark
rightsize-cli benchmark test_cases.csv \
  -t prompt.j2 \
  -m google/gemma-3-12b-it \
  -m deepseek/deepseek-chat-v3.1 \
  -j google/gemini-3-flash-preview \
  -b google/gemini-2.5-flash

# Open interactive visualization in the browser
rightsize-cli benchmark data/test_cases.csv \
  -t prompts/classify.j2 \
  -m google/gemma-3-12b-it \
  -m deepseek/deepseek-chat-v3.1 \
  -j google/gemini-3-flash-preview \
  -b google/gemini-2.5-flash \
  --visualize

Run without installing (uvx)

# Set API key
export RIGHTSIZE_OPENROUTER_API_KEY="sk-or-..."

# List models
uvx rightsize-cli models

# Run benchmark
uvx rightsize-cli benchmark test_cases.csv \
  -t prompt.j2 \
  -m google/gemma-3-12b-it \
  -m deepseek/deepseek-chat-v3.1 \
  -j google/gemini-3-flash-preview \
  -b google/gemini-2.5-flash

# Run benchmark + open web visualization
uvx rightsize-cli benchmark test_cases.csv \
  -t prompt.j2 \
  -m google/gemma-3-12b-it \
  -m deepseek/deepseek-chat-v3.1 \
  -j google/gemini-3-flash-preview \
  -b google/gemini-2.5-flash \
  --visualize

Output

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃ Model                       ┃ Accuracy ┃ Latency (p95) ┃ Cost/1k  ┃ Savings  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│ google/gemma-3-12b-it       │    71.0% │        4200ms │  $0.0028 │   +93.7% │
│ deepseek/deepseek-chat-v3.1 │    95.0% │         800ms │  $0.0180 │   +60.0% │
│ google/gemini-2.5-flash     │   100.0% │        1900ms │  $0.0450 │       —  │
└─────────────────────────────┴──────────┴───────────────┴──────────┴──────────┘
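
The Savings column appears to be measured against the baseline (-b) model's Cost/1k figure. A back-of-envelope check against the deepseek row, assuming savings = (baseline - model) / baseline:

baseline_cost = 0.0450  # google/gemini-2.5-flash (Cost/1k)
model_cost = 0.0180     # deepseek/deepseek-chat-v3.1 (Cost/1k)
savings = (baseline_cost - model_cost) / baseline_cost
print(f"{savings:+.1%}")  # +60.0%, matching the table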

How It Works

  1. You provide test cases - A CSV with inputs and expected outputs
  2. Candidate models compete - All models run the same prompts in parallel
  3. LLM-as-Judge scores - A judge model compares each output to your expected output
  4. You see the results - Cost, accuracy, and latency side by side; pick the cheapest model that meets your bar (the loop is sketched below)
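
A minimal sketch of this loop in Python, assuming OpenRouter's OpenAI-compatible endpoint and the openai and jinja2 packages (illustrative only, not the CLI's actual internals):

# Sketch of the benchmark loop: render the template per row, send it to
# each candidate model, collect outputs for judging. The real CLI runs
# requests concurrently (-c); this is sequential for clarity.
import csv
import os

from jinja2 import Template
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1",
                api_key=os.environ["RIGHTSIZE_OPENROUTER_API_KEY"])
candidates = ["google/gemma-3-12b-it", "deepseek/deepseek-chat-v3.1"]

with open("prompt.j2") as f:
    template = Template(f.read())

with open("test_cases.csv") as f:
    for row in csv.DictReader(f):
        prompt = template.render(input_data=row["input_data"])
        for model in candidates:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            output = resp.choices[0].message.content.strip()
            # output is then scored against row["expected_output"] by
            # the judge model (see CSV Format below)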

CSV Format

The CSV needs exactly two columns, input_data and expected_output:

input_data,expected_output
"My order hasn't arrived.",billing::high
"How do I reset my password?",account::high
"I want a refund!",refund::high

The judge model compares each model's output to expected_output and assigns a score:

  • 1.0 - Exact or semantic match
  • 0.8 - Very close with minor differences
  • 0.5 - Partially correct
  • 0.0 - Wrong or irrelevant
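
The exact prompt the judge receives isn't shown in this README; one plausible shape that matches the rubric above (purely illustrative):

# Hypothetical judge step; the actual judge prompt inside rightsize-cli
# may differ.
import os

from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1",
                api_key=os.environ["RIGHTSIZE_OPENROUTER_API_KEY"])

JUDGE_PROMPT = """Grade the model's answer against the reference answer.
EXPECTED: {expected}
ACTUAL: {actual}
Reply with exactly one number: 1.0 (exact or semantic match),
0.8 (very close), 0.5 (partially correct), or 0.0 (wrong or irrelevant)."""

def judge_score(judge_model: str, expected: str, actual: str) -> float:
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            expected=expected, actual=actual)}],
    )
    try:
        return float(resp.choices[0].message.content.strip())
    except ValueError:
        return 0.0  # treat an unparseable judge reply as a failure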

Best practices for test data

  1. Use minimal output formats - Delimiter-separated output (category::confidence) keeps responses short and costs low (see the parsing snippet after this list)
  2. Consistent task type - All rows should be the same kind of task
  3. Representative samples - Use real data from your production use case
  4. Clear expected outputs - Unambiguous so the judge can score fairly
  5. 10-20 test cases - Enough to compare models meaningfully, fast to run
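
For instance, the category::confidence format from point 1 parses in one line:

# Splitting the minimal category::confidence format from point 1.
output = "billing::high"
category, confidence = output.strip().split("::")
assert confidence in {"high", "medium", "low"}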

Prompt Templates

Templates wrap your inputs with instructions; both Jinja2 (.j2) and Python f-string templates are supported.

Example: Classification template (prompt.j2):

Classify this support ticket.

CATEGORIES: billing, account, refund, subscription, technical
CONFIDENCE: high, medium, low

OUTPUT FORMAT: <category>::<confidence>
OUTPUT ONLY the format above. No explanation. No punctuation. No other text.

TICKET: {{ input_data }}

OUTPUT:

Example: Extraction template (extract.j2):

Extract the email from this text.

OUTPUT FORMAT: <email or NONE>
OUTPUT ONLY the format above. No explanation. No other text.

TEXT: {{ input_data }}

OUTPUT:

Template variable

Variable     Description
input_data   The value from your CSV's input_data column
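
To preview exactly what a candidate model will see, you can render a template yourself with the jinja2 package (the CLI does this internally; this is just for inspection):

from jinja2 import Template

with open("prompt.j2") as f:
    template = Template(f.read())

print(template.render(input_data="My order hasn't arrived."))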

CLI Reference

rightsize-cli benchmark

rightsize-cli benchmark <csv_file> [OPTIONS]

Option         Short  Default     Description
--template     -t     (required)  Path to prompt template file
--model        -m     (required)  Model ID to test (repeat for multiple)
--judge        -j     (required)  Model for judging outputs
--baseline     -b     None        Baseline model for savings calculation
--concurrency  -c     10          Max parallel requests
--output       -o     table       Output format: table, json, csv
--verbose      -v     False       Show detailed outputs and judge scores
--visualize    -V     False       Open interactive web visualization

rightsize-cli models

List all available models and their pricing:

rightsize-cli models

Configuration

Set via environment variables or .env file:

Variable                      Required  Default  Description
RIGHTSIZE_OPENROUTER_API_KEY  Yes       -        Your OpenRouter API key
RIGHTSIZE_MAX_CONCURRENCY     No        10       Default concurrency
RIGHTSIZE_TIMEOUT_SECONDS     No        60       Request timeout
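
For example, a .env file might look like this (values are illustrative):

RIGHTSIZE_OPENROUTER_API_KEY=sk-or-...
RIGHTSIZE_MAX_CONCURRENCY=5
RIGHTSIZE_TIMEOUT_SECONDS=120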

Examples

Compare cheap models against a baseline

uvx rightsize-cli benchmark test_cases.csv \
  -t prompt.j2 \
  -m google/gemma-3-12b-it \
  -m google/gemma-3-27b-it \
  -m qwen/qwen3-8b \
  -m meta-llama/llama-3.3-70b-instruct \
  -j google/gemini-3-flash-preview \
  -b google/gemini-2.5-flash

Use a stronger judge model

uvx rightsize-cli benchmark test_cases.csv \
  -t prompt.j2 \
  -m google/gemma-3-12b-it \
  -m deepseek/deepseek-chat-v3.1 \
  -j anthropic/claude-sonnet-4 \
  -b google/gemini-2.5-flash

Export results to JSON

uvx rightsize-cli benchmark test_cases.csv \
  -t prompt.j2 \
  -m google/gemma-3-12b-it \
  -m deepseek/deepseek-chat-v3.1 \
  -j google/gemini-3-flash-preview \
  -b google/gemini-2.5-flash \
  -o json > results.json
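
The JSON structure isn't documented here, so inspect it before building on it:

import json

with open("results.json") as f:
    results = json.load(f)

print(json.dumps(results, indent=2)[:500])  # peek at the structure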

Debug with verbose mode

uvx rightsize-cli benchmark test_cases.csv \
  -t prompt.j2 \
  -m google/gemma-3-12b-it \
  -j google/gemini-3-flash-preview \
  -b google/gemini-2.5-flash \
  -v

Tips

  1. Use minimal output formats - category::confidence is cheaper than JSON, and JSON is cheaper than prose
  2. End prompts with "OUTPUT:" - Primes the model to respond immediately without preamble
  3. Start with 10-20 test cases - Enough to be representative, fast to iterate
  4. Set a quality bar - Decide what accuracy is acceptable (e.g., 95%+); see the selection snippet after this list
  5. Consider latency - Sometimes a slower cheap model isn't worth it
  6. Iterate on prompts - A better prompt can make cheaper models work better
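
Putting tips 4 and 5 into code, model selection is a filter plus a min over the results (a sketch using the numbers from the sample table above):

# Pick the cheapest model that clears a 95% accuracy bar, given
# (model, accuracy, cost_per_1k) tuples like the Output table.
results = [
    ("google/gemma-3-12b-it", 0.71, 0.0028),
    ("deepseek/deepseek-chat-v3.1", 0.95, 0.0180),
    ("google/gemini-2.5-flash", 1.00, 0.0450),
]
passing = [r for r in results if r[1] >= 0.95]
winner = min(passing, key=lambda r: r[2])
print(winner[0])  # deepseek/deepseek-chat-v3.1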

Development

# Clone the repo
git clone https://github.com/NehmeAILabs/rightsize-cli.git
cd rightsize-cli

# Install in dev mode
uv pip install -e .

# Run locally
rightsize-cli models

License

MIT