
🚀 llm-benchmark

Everywhere-Ready LLM Code Optimizer & Self-Validating Benchmark Suite


Ship "optimized by AI" code with confidence. llm-benchmark automatically generates, validates, and benchmarks LLM-optimized variants of your functions across multiple providers.


✨ Features

  • 🤖 Multi-Provider Support - OpenAI, Anthropic, Azure, Ollama, and more
  • 🌐 Polyglot - JavaScript, TypeScript, Python, Rust, Go, and growing
  • ✅ Self-Validating - Ensures functional equivalence before benchmarking
  • 📊 Rich Benchmarks - Ops/sec, percentiles, memory usage, cost analysis
  • 🎨 Beautiful TUI - Real-time progress, results visualization
  • 🔌 Extensible - Plugin architecture for languages and providers
  • 📦 Zero Lock-in - Export to JSON, CSV, JUnit, HTML

🚀 Quick Start

# Install globally
npm install -g llm-benchmark

# Or use npx
npx llm-benchmark demo

# Optimize a function (must be exported)
llm-benchmark optimizeProcess.js

# With specific providers
llm-benchmark optimizeProcess.js --providers openai:gpt-4o anthropic:claude-3

# Named export
llm-benchmark utils.js myFunction

# CI mode (no interactive UI)
llm-benchmark optimizeProcess.js --ci

Note: Your function must be exported (either as default export or named export) for the tool to find it.
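
For reference, both export styles look like this:

// optimizeProcess.js: default export, picked up automatically
export default function optimizeProcess(records) {
  // ...
}

// utils.js: named export, referenced explicitly as `llm-benchmark utils.js myFunction`
export function myFunction(input) {
  // ...
}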

📋 Prerequisites

  • Node.js ≥ 18
  • API keys for your chosen providers (OpenAI, Anthropic, etc.)

🔧 Configuration

Create llm-benchmark.yaml in your project:

providers:
  - openai:gpt-4o
  - anthropic:claude-3-sonnet

validation:
  mode: record-replay # or 'static' or 'property-based'
  cases: ./test-cases.json

bench:
  runs: 5000
  warmup: 20

langPlugins:
  - js
  - py
  - rust

Set up your .env:

OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...

📚 Example

Given this function:

// optimizeProcess.js
export default function optimizeProcess(records) {
  const valid = records.filter((r) => r.status === 'active' && r.value > 0);
  const transformed = valid.map((r) => ({
    ...r,
    value: r.value * 1.1,
    category: r.category.toUpperCase(),
  }));

  return Object.values(
    transformed.reduce((acc, r) => {
      if (!acc[r.category]) {
        acc[r.category] = { count: 0, total: 0 };
      }
      acc[r.category].count++;
      acc[r.category].total += r.value;
      return acc;
    }, {}),
  );
}
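
To make the behavior concrete, here is what the function does to a small input (plain Node.js, nothing tool-specific):

// demo.mjs: quick manual run of the example function
import optimizeProcess from './optimizeProcess.js';

const records = [
  { status: 'active', value: 100, category: 'electronics' },
  { status: 'inactive', value: 50, category: 'toys' }, // dropped by the filter
];

console.log(optimizeProcess(records));
// [ { count: 1, total: 110.00000000000001 } ]   (100 * 1.1 under floating point)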

Running the Benchmark

# Step 1: Navigate to the example directory
cd examples/js

# Step 2: Install dependencies (if needed)
npm install

# Step 3: Run the benchmark
llm-benchmark optimizeProcess.js

# Or run from the monorepo root
cd ../..
node packages/core/bin/llm-benchmark.js examples/js/optimizeProcess.js

Sample Output

🚀 LLM Benchmark

📝 Generating optimized variants...
  ✓ openai:gpt-4o completed
  ✓ anthropic:claude-3-sonnet completed

✅ Validating variants...
  ✓ All variants passed 100 test cases

📊 Running benchmarks...

🏆 Benchmark Results
────────────────────────────────────────────────────────────────────────
Variant                     Ops/sec      Improvement   P95 (ms)   σ
────────────────────────────────────────────────────────────────────────
🔥 openai.gpt_4o           125,420      +34.2%        0.045      ±2.1%
   anthropic.claude_3      118,230      +26.5%        0.048      ±1.8%
   original                 93,420      baseline      0.062      ±2.3%
────────────────────────────────────────────────────────────────────────

✅ All variants passed validation (1,000 test cases)
💰 Total cost: $0.0234
📄 Results saved to: ./results.json

Generated Optimized Code

The tool will generate optimized variants like:

// optimizeProcess.openai.gpt-4o.js
export default function optimizeProcess(records) {
  const grouped = {};

  // Single pass through records
  for (let i = 0; i < records.length; i++) {
    const record = records[i];
    if (record && record.status === 'active' && record.value > 0) {
      const category = record.category.toUpperCase();
      const transformedValue = record.value * 1.1;

      if (!grouped[category]) {
        grouped[category] = { total: 0, count: 0 };
      }

      grouped[category].total += transformedValue;
      grouped[category].count++;
    }
  }

  return Object.values(grouped);
}
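
Independently of the built-in validation, you can spot-check a generated variant against the original with a few lines of Node.js; this is a minimal sketch, assuming the variant file name shown above:

// spot-check.mjs: manual equivalence check, separate from the tool's validator
import assert from 'node:assert/strict';
import original from './optimizeProcess.js';
import variant from './optimizeProcess.openai.gpt-4o.js';

const sample = [
  { status: 'active', value: 100, category: 'electronics' },
  { status: 'active', value: 25, category: 'toys' },
  { status: 'inactive', value: 99, category: 'toys' },
];

// both versions should group and aggregate identically
assert.deepStrictEqual(variant(sample), original(sample));
console.log('variant matches the original on the sample input');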

๐Ÿ—๏ธ Architecture

llm-benchmark
โ”œโ”€โ”€ packages/
โ”‚   โ”œโ”€โ”€ core/              # CLI and orchestration
โ”‚   โ”œโ”€โ”€ adapters/         # Provider adapters (OpenAI, Anthropic, etc.)
โ”‚   โ””โ”€โ”€ plugins/          # Language plugins (JS, Python, Rust, etc.)
โ”œโ”€โ”€ examples/             # Example projects
โ””โ”€โ”€ docs/                # Documentation

🧩 Plugins

Language Plugins

  • ✅ JavaScript/TypeScript
  • ✅ Python
  • ✅ Rust
  • 🚧 Go
  • 🚧 Java
  • 🚧 C/C++

Provider Adapters

  • ✅ OpenAI (GPT-4, GPT-3.5)
  • ✅ Anthropic (Claude 3)
  • 🚧 Azure OpenAI
  • 🚧 Google Vertex AI
  • 🚧 Ollama (local models)
  • 🚧 Cohere

🛠️ Advanced Usage

Command Reference

# Generate variants only
llm-benchmark generate <file> [function]

# Validate existing variants
llm-benchmark validate <file> [function]

# Benchmark validated variants
llm-benchmark bench <file> [function]

# Preview prompts
llm-benchmark prompt <file> [function]

Global Options

  • --config <path> - Config file path (default: llm-benchmark.json)
  • --providers <providers...> - Override configured providers
  • --runs <number> - Override benchmark iterations
  • --ci - CI mode (no interactive UI)
  • --no-color - Disable colored output
  • --debug - Enable debug logging

🧪 Validation Modes

Static Test Cases

Provide test cases in JSON/YAML:

{
  "cases": [
    {
      "input": [{ "status": "active", "value": 100, "category": "electronics" }],
      "output": { "ELECTRONICS": { "count": 1, "total": 110 } }
    }
  ]
}

Record-Replay

Automatically capture real execution:

validation:
  mode: record-replay
  recordingEnabled: true
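
The idea behind record-replay: wrap the real function once, persist each live call as a case, then replay those cases against every variant. A rough hand-rolled sketch of the recording half (the wrapper and file path below are illustrative, not part of the tool's API):

// record-sketch.mjs: illustrative only, not the tool's internals
import { appendFileSync } from 'node:fs';

export function withRecording(fn, logPath = './recorded-cases.jsonl') {
  return (...args) => {
    const output = fn(...args);
    // persist each real invocation as a replayable case
    appendFileSync(logPath, JSON.stringify({ input: args, output }) + '\n');
    return output;
  };
}

Replay is then just calling each variant with the recorded input and checking that it reproduces the recorded output.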

Property-Based Testing

Generate test inputs with invariants:

validation:
  mode: property-based
  propertyTests:
    invariants:
      - 'output.total >= 0'
      - 'output.count === input.length'
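
With property-based validation, inputs are generated rather than recorded, and every variant must keep the configured invariants true. A hand-rolled version of the same idea (the generator and invariant below are illustrative):

// property-sketch.mjs: illustrative only
import variant from './optimizeProcess.openai.gpt-4o.js';

function randomRecord() {
  return {
    status: Math.random() < 0.5 ? 'active' : 'inactive',
    value: Math.floor(Math.random() * 200) - 50, // negatives exercise the filter
    category: ['toys', 'electronics', 'books'][Math.floor(Math.random() * 3)],
  };
}

for (let i = 0; i < 1000; i++) {
  const input = Array.from({ length: 20 }, randomRecord);
  const output = variant(input);
  // invariant: every aggregated total stays non-negative
  if (!output.every((group) => group.total >= 0)) {
    throw new Error('invariant violated for input ' + JSON.stringify(input));
  }
}
console.log('1000 generated inputs satisfied the invariant');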

📊 Output Formats

  • JSON - Detailed results with metadata
  • CSV - Spreadsheet-friendly format
  • JUnit XML - CI integration
  • HTML - Interactive report
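
In CI, the JSON export is handy for a simple regression gate. The snippet below is a hypothetical consumer: the field names (variants, name, opsPerSec) are assumptions, so check them against your own results.json:

// gate-sketch.mjs: hypothetical; verify field names against your results.json
import { readFile } from 'node:fs/promises';

const results = JSON.parse(await readFile('./results.json', 'utf8'));
const baseline = results.variants.find((v) => v.name === 'original');
const best = results.variants.reduce((a, b) => (a.opsPerSec > b.opsPerSec ? a : b));

if (!baseline || best.opsPerSec <= baseline.opsPerSec) {
  console.error('No variant beat the original; failing the build.');
  process.exit(1);
}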

🤝 Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

# Clone the repo
git clone https://github.com/ajaxdavis/llm-benchmark.git
cd llm-benchmark

# Install dependencies
pnpm install

# Run tests
pnpm test

# Build all packages
pnpm build

📜 License

MIT © Ajax Davis

🙏 Acknowledgments

Made with ❤️ by developers, for developers