# llm-benchmark
Everywhere-Ready LLM Code Optimizer & Self-Validating Benchmark Suite
Ship "optimized by AI" code with confidence. llm-benchmark automatically generates, validates, and benchmarks LLM-optimized variants of your functions across multiple providers.
## Features

- **Multi-Provider Support** - OpenAI, Anthropic, Azure, Ollama, and more
- **Polyglot** - JavaScript, TypeScript, Python, Rust, Go, and growing
- **Self-Validating** - Ensures functional equivalence before benchmarking
- **Rich Benchmarks** - Ops/sec, percentiles, memory usage, cost analysis
- **Beautiful TUI** - Real-time progress, results visualization
- **Extensible** - Plugin architecture for languages and providers
- **Zero Lock-in** - Export to JSON, CSV, JUnit, HTML
## Quick Start

```bash
# Install globally
npm install -g llm-benchmark

# Or use npx
npx llm-benchmark demo

# Optimize a function (must be exported)
llm-benchmark optimizeProcess.js

# With specific providers
llm-benchmark optimizeProcess.js --providers openai:gpt-4o anthropic:claude-3

# Named export
llm-benchmark utils.js myFunction

# CI mode (no interactive UI)
llm-benchmark optimizeProcess.js --ci
```
Note: Your function must be exported (either as default export or named export) for the tool to find it.
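For instance, either export style is picked up by the commands above (a minimal illustration):

```js
// optimizeProcess.js — default export, run with: llm-benchmark optimizeProcess.js
export default function optimizeProcess(records) {
  /* ... */
}

// utils.js — named export, run with: llm-benchmark utils.js myFunction
export function myFunction(records) {
  /* ... */
}
```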
## Prerequisites

- Node.js ≥ 18
- API keys for your chosen providers (OpenAI, Anthropic, etc.)
## Configuration

Create `llm-benchmark.yaml` in your project:

```yaml
providers:
  - openai:gpt-4o
  - anthropic:claude-3-sonnet

validation:
  mode: record-replay # or 'static' or 'property-based'
  cases: ./test-cases.json

bench:
  runs: 5000
  warmup: 20

langPlugins:
  - js
  - py
  - rust
```
Set up your `.env`:

```bash
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
```
## Example
Given this function:
```js
// optimizeProcess.js
export default function optimizeProcess(records) {
  const valid = records.filter((r) => r.status === 'active' && r.value > 0);
  const transformed = valid.map((r) => ({
    ...r,
    value: r.value * 1.1,
    category: r.category.toUpperCase(),
  }));
  return Object.values(
    transformed.reduce((acc, r) => {
      if (!acc[r.category]) {
        acc[r.category] = { count: 0, total: 0 };
      }
      acc[r.category].count++;
      acc[r.category].total += r.value;
      return acc;
    }, {}),
  );
}
```
### Running the Benchmark

```bash
# Step 1: Navigate to the example directory
cd examples/js

# Step 2: Install dependencies (if needed)
npm install

# Step 3: Run the benchmark
llm-benchmark optimizeProcess.js

# Or run from the monorepo root
cd ../..
node packages/core/bin/llm-benchmark.js examples/js/optimizeProcess.js
```
### Sample Output

```text
LLM Benchmark

Generating optimized variants...
  ✓ openai:gpt-4o completed
  ✓ anthropic:claude-3-sonnet completed

✓ Validating variants...
  ✓ All variants passed 100 test cases

Running benchmarks...

Benchmark Results
──────────────────────────────────────────────────────────────
 Variant               Ops/sec    Improvement   P95 (ms)   σ
──────────────────────────────────────────────────────────────
 openai.gpt_4o         125,420    +34.2%        0.045      ±2.1%
 anthropic.claude_3    118,230    +26.5%        0.048      ±1.8%
 original               93,420    baseline      0.062      ±2.3%
──────────────────────────────────────────────────────────────

✓ All variants passed validation (1,000 test cases)
Total cost: $0.0234
Results saved to: ./results.json
```
### Generated Optimized Code
The tool will generate optimized variants like:
```js
// optimizeProcess.openai.gpt-4o.js
export default function optimizeProcess(records) {
  const grouped = {};

  // Single pass through records
  for (let i = 0; i < records.length; i++) {
    const record = records[i];
    if (record && record.status === 'active' && record.value > 0) {
      const category = record.category.toUpperCase();
      const transformedValue = record.value * 1.1;

      if (!grouped[category]) {
        grouped[category] = { total: 0, count: 0 };
      }

      grouped[category].total += transformedValue;
      grouped[category].count++;
    }
  }

  return Object.values(grouped);
}
```
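If you want to spot-check functional equivalence yourself outside the tool, you can run both versions on a small sample in plain Node. This is a minimal sketch: the file names assume the example above, `check-equivalence.mjs` is a hypothetical helper (not part of llm-benchmark), and it assumes the project is ESM.

```js
// check-equivalence.mjs — hypothetical helper, not shipped with llm-benchmark
// Assumes package.json has "type": "module" so the .js files load as ESM.
import assert from 'node:assert/strict';
import original from './optimizeProcess.js';
import optimized from './optimizeProcess.openai.gpt-4o.js';

const records = [
  { status: 'active', value: 100, category: 'electronics' },
  { status: 'inactive', value: 50, category: 'books' },
  { status: 'active', value: 20, category: 'electronics' },
];

// Both implementations should return deeply equal results for the same input
assert.deepEqual(optimized(records), original(records));
console.log('Variants agree on the sample input');
```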
## Architecture

```text
llm-benchmark
├── packages/
│   ├── core/       # CLI and orchestration
│   ├── adapters/   # Provider adapters (OpenAI, Anthropic, etc.)
│   └── plugins/    # Language plugins (JS, Python, Rust, etc.)
├── examples/       # Example projects
└── docs/           # Documentation
```
## Plugins

### Language Plugins

- ✅ JavaScript/TypeScript
- ✅ Python
- ✅ Rust
- 🚧 Go
- 🚧 Java
- 🚧 C/C++
### Provider Adapters

- ✅ OpenAI (GPT-4, GPT-3.5)
- ✅ Anthropic (Claude 3)
- 🚧 Azure OpenAI
- 🚧 Google Vertex AI
- 🚧 Ollama (local models)
- 🚧 Cohere
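The actual plugin contract lives in `packages/plugins`; purely as an illustration of the kind of hooks a language plugin provides (the names below are hypothetical, not the real API), it might look like:

```js
// my-lang-plugin.js — illustrative sketch only; hook names are hypothetical
export default {
  name: 'my-lang',
  extensions: ['.ml'],

  // Pull the target function's source out of a file so it can be sent to a provider
  extractFunction(source, functionName) {
    // parse `source` and return the named function's code
  },

  // Execute one variant against a single test case and return its result
  async runCase(variantPath, functionName, args) {
    // spawn the language's runtime, call the function, capture the output
  },
};
```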
## Advanced Usage

### Command Reference

```bash
# Generate variants only
llm-benchmark generate <file> [function]

# Validate existing variants
llm-benchmark validate <file> [function]

# Benchmark validated variants
llm-benchmark bench <file> [function]

# Preview prompts
llm-benchmark prompt <file> [function]
```
### Global Options

- `--config <path>` - Config file path (default: `llm-benchmark.json`)
- `--providers <providers...>` - Override configured providers
- `--runs <number>` - Override benchmark iterations
- `--ci` - CI mode (no interactive UI)
- `--no-color` - Disable colored output
- `--debug` - Enable debug logging
## Validation Modes

### Static Test Cases
Provide test cases in JSON/YAML:
```json
{
  "cases": [
    {
      "input": [{ "status": "active", "value": 100, "category": "electronics" }],
      "output": { "ELECTRONICS": { "count": 1, "total": 110 } }
    }
  ]
}
```
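Conceptually, a variant passes a static case when calling it with the case's input produces a result that deep-equals the recorded output. The sketch below only illustrates that idea; it is not the tool's internal code, and how `input` maps onto arguments is up to the language plugin.

```js
import assert from 'node:assert/strict';

// Rough sketch: a static case passes when the function's result
// deep-equals the recorded output for the case's input.
function passesCase(fn, testCase) {
  const actual = fn(testCase.input);
  assert.deepEqual(actual, testCase.output);
  return true;
}
```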
### Record-Replay
Automatically capture real execution:
```yaml
validation:
  mode: record-replay
  recordingEnabled: true
```
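Conceptually, record-replay wraps the function during a normal run, captures each argument/result pair, and later replays those pairs against every generated variant. A minimal sketch of the recording idea (illustrative only, not llm-benchmark internals):

```js
// Wrap a function so that every call is captured as a replayable test case
function withRecording(fn, cases = []) {
  const recorded = (...args) => {
    const result = fn(...args);
    cases.push({ input: args, output: result }); // captured for later replay
    return result;
  };
  recorded.cases = cases;
  return recorded;
}
```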
### Property-Based Testing
Generate test inputs with invariants:
```yaml
validation:
  mode: property-based
  propertyTests:
    invariants:
      - 'output.total >= 0'
      - 'output.count === input.length'
```
## Output Formats

- **JSON** - Detailed results with metadata
- **CSV** - Spreadsheet-friendly format
- **JUnit XML** - CI integration
- **HTML** - Interactive report
## Contributing
We welcome contributions! See CONTRIBUTING.md for guidelines.
```bash
# Clone the repo
git clone https://github.com/ajaxdavis/llm-benchmark.git
cd llm-benchmark

# Install dependencies
pnpm install

# Run tests
pnpm test

# Build all packages
pnpm build
```
## License

MIT © Ajax Davis
## Acknowledgments
Built with:
- Commander.js - CLI framework
- Ink - React for CLIs
- Benchmark.js - Benchmarking library
Made with ❤️ by developers, for developers
