Periscope

A comprehensive framework for load testing and benchmarking OpenAI API endpoints using K6, with a focus on measuring performance metrics for completions and embeddings.

Overview

This framework provides a Docker-based environment for performance testing of OpenAI API endpoints. It includes preconfigured K6, InfluxDB, and Grafana services, along with customized scripts designed specifically for testing various aspects of OpenAI's API services. The framework allows you to measure key metrics like response times, token usage efficiency, throughput, and error rates under different load scenarios.

Features

Containerized Environment: Docker and Docker Compose based deployment
Metrics Visualization: Pre-configured Grafana dashboards for test results
Core API Testing:
- Chat completions testing
- Embeddings generation testing (single and batch)
- Code completion with prefix caching
Performance Testing Patterns:
- Smoke tests for basic functionality validation
- Stress tests for identifying breaking points
- Spike tests for sudden load surges
- Soak tests for long-duration stability
- Recovery tests for measuring system stabilization
- Prefill-heavy tests for context processing efficiency
- Decode-heavy tests for output generation throughput
Extensible Framework: Modular design for custom test script creation
Comprehensive Metrics: Token usage, latency, throughput, and processing rates
Automated Workflows: Makefile-based command system for test management

Prerequisites

Docker and Docker Compose
OpenAI API Key
Bash or compatible shell environment

Installation

Clone this repository:

git clone https://github.com/wizenheimer/periscope.git
cd periscope

Initialize the environment:

make setup

Start the infrastructure services:

make start

Verify services are running:

make status

Architecture

The framework consists of three main components:

K6: Open-source load testing tool that executes the test scripts
InfluxDB: Time-series database that stores test metrics
Grafana: Visualization platform that displays real-time and historical test results

These components are orchestrated using Docker Compose, with configuration files for seamless integration.

Directory Structure

k6-openai-testing/
├── docker-compose.yaml    # Container orchestration
├── Makefile               # Simplified command interface
├── grafana/               # Grafana configuration
│   ├── grafana-dashboard.yaml
│   └── grafana-datasource.yaml
├── dashboards/            # Dashboard templates
│   ├── k6-load-testing-results_rev3.json
│   └── k6-openai-tokens_rev1.json
├── scripts/               # Test scripts
│   ├── config.js          # Shared configuration
│   ├── openai-completions.js
│   ├── openai-embeddings.js
│   ├── openai-benchmark.js
│   ├── openai-prefix-caching.js
│   ├── helpers/           # Utilities
│   │   ├── openaiGeneric.js
│   │   ├── utils.js
│   │   └── http.js
│   └── payloads/          # Request templates
│       ├── completions.js
│       └── embeddings.js
└── README.md

Usage

Basic Usage

Set your OpenAI API key:
```
export OPENAI_API_KEY=your_api_key
```
Run a test, for example `make test-completions`.
View results in Grafana with `make grafana-dashboard`.

Running Different Tests

The framework includes several specialized test scripts:

# Test chat completions
make test-completions

# Test embedding generation
make test-embeddings

# Test code completion with prefix caching
make test-prefix-caching

# Run comprehensive benchmark
make test-benchmark

# Run all tests sequentially
make test-all

Makefile Reference

Basic Setup and Operation

# Initial setup
make setup

# Start services
make start

# Check status
make status

# Open Grafana dashboard
make grafana-dashboard

Running Tests

# Set your OpenAI API key (replace with your actual key)
export OPENAI_API_KEY=sk-your-api-key

# Run completions test
make test-completions

# Run embeddings test
make test-embeddings

# Run benchmark test
make test-benchmark

# Run prefix caching test
make test-prefix-caching

# Run all tests sequentially
make test-all

# Run a specific script
make test script=custom-script.js

Configuration Options

You can override any configuration option either through environment variables or by passing them as arguments:

# Override with environment variables
export OPENAI_COMPLETION_MODEL=gpt-4
make test-completions

# Or pass directly as arguments
make test-completions OPENAI_COMPLETION_MODEL=gpt-4 MAX_TOKENS=128

Maintenance

# View logs
make logs

# Restart services
make restart

# Stop services
make stop

# Clean up (stop and remove containers)
make clean

# Full purge (remove containers, volumes, and data)
make purge

Getting Help

# Show all available commands and configuration
make help

Configuration Options

| Option | Description | Default |
| --- | --- | --- |
| `OPENAI_API_KEY` | Your OpenAI API key | "your-api-key-here" |
| `OPENAI_BASE_URL` | Base URL for OpenAI API | "https://api.openai.com" |
| `OPENAI_COMPLETION_MODEL` | Model for completion requests | "gpt-3.5-turbo" |
| `OPENAI_EMBEDDING_MODEL` | Model for embedding requests | "text-embedding-3-small" |
| `OPENAI_CODING_MODEL` | Model for code completion requests | "gpt-3.5-turbo" |
| `MAX_TOKENS` | Maximum tokens to generate | 64 |
| `VUS` | Number of virtual users | 1 |
| `ENABLE_BATCH_MODE` | Enable batch embedding requests | "false" |
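
As a rough sketch of how these defaults and overrides could fit together, the helper below resolves each option from an environment-like object and falls back to the documented defaults. The function name `resolveConfig` and the shape of the returned object are assumptions for illustration; the repository's actual `scripts/config.js` may be organized differently. In a k6 script, the environment object would be the built-in `__ENV`.

```javascript
// Hypothetical helper: resolve options from an env-like object,
// falling back to the documented defaults.
function resolveConfig(env) {
  const pick = (name, fallback) =>
    env[name] !== undefined && env[name] !== "" ? env[name] : fallback;

  return {
    openai: {
      key: pick("OPENAI_API_KEY", "your-api-key-here"),
      url: pick("OPENAI_BASE_URL", "https://api.openai.com"),
    },
    completionModel: pick("OPENAI_COMPLETION_MODEL", "gpt-3.5-turbo"),
    embeddingModel: pick("OPENAI_EMBEDDING_MODEL", "text-embedding-3-small"),
    codingModel: pick("OPENAI_CODING_MODEL", "gpt-3.5-turbo"),
    // Environment values arrive as strings, so numeric options are parsed.
    maxTokens: parseInt(pick("MAX_TOKENS", "64"), 10),
    vus: parseInt(pick("VUS", "1"), 10),
    enableBatchMode: pick("ENABLE_BATCH_MODE", "false") === "true",
  };
}

// In k6, pass the built-in __ENV object instead: resolveConfig(__ENV)
const config = resolveConfig({
  OPENAI_COMPLETION_MODEL: "gpt-4",
  MAX_TOKENS: "128",
});
console.log(config.completionModel, config.maxTokens);
```

Parsing in one place keeps the test scripts free of repeated string-to-number conversions.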

Example Workflows

Testing Model Performance

Compare performance metrics between different models:

# Test with GPT-3.5 Turbo
make test-completions OPENAI_COMPLETION_MODEL=gpt-3.5-turbo

# Test with GPT-4
make test-completions OPENAI_COMPLETION_MODEL=gpt-4

Load Testing

Test how the API performs under increased load:

# Test with 1 virtual user
make test-completions VUS=1

# Test with 5 virtual users
make test-completions VUS=5

Testing with Different Input Sizes

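Test how the token limit affects performance. `MAX_TOKENS` caps the tokens generated per request; in the prefix-caching test each completion is appended to the next prompt, so a larger limit also grows the input over successive requests:

# Test with a small token limit (the default is 64)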
make test-prefix-caching MAX_TOKENS=16

# Test with larger token limit
make test-prefix-caching MAX_TOKENS=128

Testing Alternative API Endpoints

Test against a compatible alternative API:

make test-completions OPENAI_BASE_URL=https://alternative-api.example.com

Scripts Explained

openai-completions.js

Tests the chat completions endpoint with various prompts. Measures response time, token usage, and generation throughput.
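
To make the measured quantities concrete, here is a minimal sketch of how throughput could be derived from a single response. It assumes the response body carries the standard OpenAI `usage` object (`prompt_tokens`, `completion_tokens`, `total_tokens`); the helper name `completionStats` is hypothetical and is not taken from the repository's scripts.

```javascript
// Hypothetical helper: derive throughput figures from one chat completion.
// `usage` is the usage object of an OpenAI-style response body;
// `durationMs` is the request duration in milliseconds
// (in k6, the response's timings.duration).
function completionStats(usage, durationMs) {
  const seconds = durationMs / 1000;
  return {
    promptTokens: usage.prompt_tokens,
    completionTokens: usage.completion_tokens,
    totalTokens: usage.total_tokens,
    // Generated tokens per second of wall-clock request time.
    // This includes prompt processing, so it understates pure decode speed.
    tokensPerSecond: seconds > 0 ? usage.completion_tokens / seconds : 0,
  };
}

const stats = completionStats(
  { prompt_tokens: 20, completion_tokens: 64, total_tokens: 84 },
  2000
);
console.log(stats.tokensPerSecond); // 32
```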

openai-embeddings.js

Tests the embeddings endpoint for single text embedding requests. Measures embedding generation latency and tracks vector dimensions for individual text items.

openai-batch-embeddings.js

Tests the embeddings endpoint specifically for batch processing (multiple texts in a single request). Measures batch processing efficiency, per-text latency, and compares performance across different batch sizes.
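
A minimal sketch of the batch idea, assuming an OpenAI-style embeddings endpoint that accepts an array of strings as `input`. The helper names `buildBatchPayload` and `perTextLatencyMs` are hypothetical.

```javascript
// Hypothetical helpers for batch embedding requests.
// Build the request body: several texts travel in a single request.
function buildBatchPayload(model, texts) {
  return { model: model, input: texts };
}

// Average latency attributed to each text in a batch.
function perTextLatencyMs(durationMs, batchSize) {
  if (batchSize <= 0) {
    throw new Error("batchSize must be positive");
  }
  return durationMs / batchSize;
}

const payload = buildBatchPayload("text-embedding-3-small", [
  "alpha",
  "beta",
  "gamma",
  "delta",
]);
console.log(perTextLatencyMs(300, payload.input.length)); // 75
```

Comparing the per-text figure across batch sizes shows whether larger batches amortize the fixed request overhead.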

openai-prefix-caching.js

Simulates an IDE-like code completion scenario where each completion is appended to the prefix for the next request. Tests continuous usage patterns and measures token efficiency.
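
The prefix-growing loop described above can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation: `runPrefixSession` is a hypothetical name, and `complete` stands in for whatever function actually calls the API.

```javascript
// Hypothetical sketch of the prefix-growing loop.
// `complete` is any function that takes the current prefix and returns
// the completion text; in a real test it would call the API.
function runPrefixSession(complete, initialPrefix, rounds) {
  let prefix = initialPrefix;
  const prefixLengths = [];
  for (let i = 0; i < rounds; i++) {
    prefixLengths.push(prefix.length);
    const completion = complete(prefix);
    // Append the completion so the next request shares the earlier prefix.
    prefix += completion;
  }
  return { finalPrefix: prefix, prefixLengths: prefixLengths };
}

// Stub completion function used only for illustration.
const fakeComplete = (prefix) => "\n  // line " + prefix.split("\n").length;
const session = runPrefixSession(fakeComplete, "function add(a, b) {", 3);
console.log(session.prefixLengths);
```

Because every request begins with the text of the previous one, a server that caches prefixes can skip reprocessing the shared portion.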

openai-benchmark.js

Comprehensive benchmark that tests both completions and embeddings endpoints with increasing numbers of virtual users. Provides comparative performance metrics.

Custom Tests

You can easily create custom test scripts by using the provided helpers and utilities:

Create a new JavaScript file in the scripts directory:

// scripts/my-custom-test.js
import * as oai from "./helpers/openaiGeneric.js";
import config from "./config.js";

export const options = {
  vus: 1,
  duration: "30s",
};

const client = oai.createClient({
  url: config.openai.url,
  options: {
    model: "gpt-3.5-turbo",
  },
  headers: {
    Authorization: `Bearer ${config.openai.key}`,
  },
});

export default function () {
  const response = client.chatComplete({
    messages: [
      {
        role: "user",
        content: "Generate a random number between 1 and 100",
      },
    ],
  });

  console.log(oai.getContent(response));
}

Run your custom test:
```
make test script=my-custom-test.js
```

Specialized Performance Test Scripts

The framework includes a comprehensive set of specialized test scripts for different performance testing scenarios:

Test Types

Smoke Tests - Basic functionality verification with minimal load

make test-completions-smoke
make test-embeddings-smoke
make test-smoke-all

Stress Tests - Testing system behavior under high load to find breaking points

make test-completions-stress
make test-embeddings-stress
make test-stress-all

Spike Tests - Testing system reaction to sudden, dramatic increases in load

make test-completions-spike
make test-embeddings-spike
make test-spike-all

Soak Tests - Long-duration testing to identify issues that appear over time
```
make test-completions-soak
make test-embeddings-soak
make test-soak-all
```

Recovery Tests - Testing how the system recovers after failure or high load

make test-completions-recovery
make test-embeddings-recovery
make test-recovery-all

Test Patterns Explained

Smoke Tests: Minimal load (1 VU, few iterations) to verify basic functionality is working correctly before running more intensive tests.
Stress Tests: Gradually increasing load until performance degradation or failures occur, to identify maximum operational capacity.
Spike Tests: Sudden jumps to high user counts, then returning to baseline, to evaluate how the API handles unexpected traffic surges.
Soak Tests: Moderate but consistent load maintained for extended periods, to catch issues that only appear over time (memory leaks, gradual degradation).
Recovery Tests: High load followed by a return to normal levels, to measure how quickly the system stabilizes after stress.
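
The five patterns map naturally onto k6 `vus`, `iterations`, and `stages` options. The profiles below are a sketch only: every duration and VU target is an illustrative assumption, not a value taken from the repository's scripts, and `profiles` and `peakVus` are hypothetical names.

```javascript
// Hypothetical load profiles for each pattern.
const profiles = {
  smoke: { vus: 1, iterations: 5 },
  stress: {
    stages: [
      { duration: "2m", target: 10 },
      { duration: "2m", target: 20 },
      { duration: "2m", target: 40 }, // keep raising load
      { duration: "1m", target: 0 },
    ],
  },
  spike: {
    stages: [
      { duration: "1m", target: 2 }, // baseline
      { duration: "10s", target: 50 }, // sudden jump
      { duration: "1m", target: 50 },
      { duration: "10s", target: 2 }, // back to baseline
      { duration: "1m", target: 2 },
    ],
  },
  soak: {
    stages: [
      { duration: "5m", target: 5 },
      { duration: "2h", target: 5 }, // moderate load held for a long time
      { duration: "5m", target: 0 },
    ],
  },
  recovery: {
    stages: [
      { duration: "1m", target: 30 }, // push to high load
      { duration: "3m", target: 30 },
      { duration: "30s", target: 3 }, // drop to normal load
      { duration: "5m", target: 3 }, // observe stabilization
      { duration: "30s", target: 0 },
    ],
  },
};

// Highest VU target a profile reaches.
function peakVus(profile) {
  if (!profile.stages) return profile.vus;
  return Math.max(...profile.stages.map((s) => s.target));
}

// In a k6 script, a profile would be exported as the options object,
// for example: export const options = profiles.spike;
console.log(peakVus(profiles.spike)); // 50
```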

Recommended Testing Sequence

For a comprehensive evaluation of the API's performance characteristics:

Start with smoke tests to verify basic functionality
Run stress tests to identify performance limits
Run spike tests to assess resilience to sudden load
Run recovery tests to measure stabilization capabilities
Run soak tests to verify long-term stability

Customizing Test Duration

For soak tests and other long-running tests, you may want to modify the duration:

# Edit the test file to change the duration settings
# Or pass duration parameters as environment variables
SOAK_DURATION=60m make test-completions-soak
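
A sketch of how a script might consume `SOAK_DURATION`, assuming the variable reaches the k6 process (where it would be read from `__ENV`). The `"30m"` fallback, the VU count, and the helper name `soakOptions` are assumptions for illustration.

```javascript
// Hypothetical sketch: build soak-test options from SOAK_DURATION.
function soakOptions(env) {
  const requested = env.SOAK_DURATION || "30m";
  // Accept simple k6-style durations such as "90s", "60m" or "2h";
  // anything else falls back to the assumed default.
  const duration = /^\d+(s|m|h)$/.test(requested) ? requested : "30m";
  return {
    stages: [
      { duration: "2m", target: 5 }, // ramp up
      { duration: duration, target: 5 }, // hold steady load
      { duration: "2m", target: 0 }, // ramp down
    ],
  };
}

console.log(soakOptions({ SOAK_DURATION: "60m" }).stages[1].duration); // 60m
```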

Interpreting Test Results

Each specialized test outputs different metrics relevant to its test pattern:

Smoke Tests: Basic response validation and error rates
Stress Tests: Identifies breaking points and maximum throughput
Spike Tests: Measures failure rates during load spikes and recovery times
Soak Tests: Tracks performance stability over time and error accumulation
Recovery Tests: Measures stabilization time after stress periods

All these metrics are visualized in the Grafana dashboard for easy analysis.

Specialized Heavy Workload Tests

The framework includes specialized tests designed to evaluate performance under different workload patterns:

Prefill-Heavy Tests

Prefill-heavy tests focus on scenarios with large input contexts but relatively shorter outputs. These tests evaluate how effectively the model processes and understands extensive context.

make test-completions-prefill-heavy

This test simulates:

Long document analysis
Multi-turn conversations with extensive history
Complex questions requiring deep context understanding
Legal documents, research papers, or literature analysis

Key metrics:

Prefill processing time
Token processing rate (tokens/second)
Performance with increasing context size
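
A prefill-heavy request pairs a long context with a small output budget. The builder below is a hypothetical sketch: the function names, the system message, and the `max_tokens` value of 32 are assumptions, not the repository's settings.

```javascript
// Hypothetical builder for a prefill-heavy chat request:
// a long context with a short answer budget.
function buildPrefillHeavyPayload(model, paragraph, repeats, question) {
  const context = Array(repeats).fill(paragraph).join("\n\n");
  return {
    model: model,
    max_tokens: 32, // keep the output short so prefill dominates
    messages: [
      { role: "system", content: "Answer using only the document provided." },
      { role: "user", content: context + "\n\nQuestion: " + question },
    ],
  };
}

// Prompt tokens processed per second of total request time. This is an
// approximation: a non-streaming request cannot separate prefill time
// from decode time.
function promptTokensPerSecond(promptTokens, durationMs) {
  return durationMs > 0 ? promptTokens / (durationMs / 1000) : 0;
}

console.log(promptTokensPerSecond(6000, 1500)); // 4000
```

Raising `repeats` grows the context, which is one way to observe how performance changes with input size.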

Decode-Heavy Tests

Decode-heavy tests focus on scenarios that require generating lengthy, detailed outputs from relatively concise prompts. These tests evaluate the model's token generation speed and throughput.

make test-completions-decode-heavy

This test simulates:

Detailed explanations and tutorials
Creative writing tasks
Comprehensive guides and analyses
Step-by-step instructions

Key metrics:

Output generation time
Token generation rate (tokens/second)
Performance with varying output lengths
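
A decode-heavy request is the mirror image: a short prompt with a large output budget. The sketch below assumes the budget comes from a `MAX_OUTPUT_TOKENS` environment value; the fallback of 1000 and the helper names are hypothetical.

```javascript
// Hypothetical builder for a decode-heavy chat request:
// a short prompt with a large output budget.
function buildDecodeHeavyPayload(model, prompt, env) {
  const parsed = parseInt(env.MAX_OUTPUT_TOKENS, 10);
  return {
    model: model,
    // Fall back to an assumed default of 1000 when unset or invalid.
    max_tokens: Number.isInteger(parsed) && parsed > 0 ? parsed : 1000,
    messages: [{ role: "user", content: prompt }],
  };
}

// Average time spent per generated token, in milliseconds.
function msPerOutputToken(completionTokens, durationMs) {
  return completionTokens > 0 ? durationMs / completionTokens : 0;
}

const decodePayload = buildDecodeHeavyPayload(
  "gpt-3.5-turbo",
  "Write a detailed tutorial on binary search.",
  { MAX_OUTPUT_TOKENS: "2000" }
);
console.log(decodePayload.max_tokens); // 2000
console.log(msPerOutputToken(2000, 40000)); // 20
```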

Running All Heavy Tests

To run both types of heavy workload tests:

make test-completions-heavy-all

Customizing Heavy Tests

You can customize these tests with environment variables:

# Set maximum output tokens for decode-heavy test
MAX_OUTPUT_TOKENS=2000 make test-completions-decode-heavy

# Use a specific model for heavy tests
OPENAI_COMPLETION_MODEL=gpt-4 make test-completions-heavy-all

Use Cases

These specialized tests are particularly valuable for:

Model Comparison: Compare how different models handle prefill vs. decode tasks
Pricing Optimization: Understand performance tradeoffs between models to optimize cost
Application Design: Make informed decisions about prompt design based on performance characteristics
Resource Planning: Plan infrastructure based on expected workload patterns

Troubleshooting

Common Issues

API Connection Errors
- Verify your API key is correct
- Check if you're hitting rate limits
- Ensure your network allows connections to the OpenAI API
Container Issues
- Try restarting the services: make restart
- Check logs for errors: make logs
- Verify Docker is running properly
Grafana Dashboard Not Showing Data
- Ensure InfluxDB is running: make status
- Verify tests are outputting data to InfluxDB
- Try restarting Grafana: docker-compose restart k6-grafana

Advanced Usage

Custom Load Patterns

K6 supports various load patterns that can be defined in your test scripts:

export const options = {
  // Ramping pattern
  stages: [
    { duration: "1m", target: 5 }, // Ramp up to 5 VUs
    { duration: "3m", target: 5 }, // Stay at 5 VUs
    { duration: "1m", target: 0 }, // Ramp down to 0 VUs
  ],

  // Or use fixed VUs
  // vus: 10,
  // duration: '5m',
};
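
With VU-based stages, each virtual user waits for its response before sending the next request, so the request rate drops as latency rises. For slow inference endpoints it can be useful to fix the arrival rate instead. The sketch below uses k6's `ramping-arrival-rate` executor; the scenario name `steady_arrivals` and all rates are illustrative assumptions.

```javascript
// Sketch of an open-model load pattern using k6 scenarios.
const options = {
  scenarios: {
    steady_arrivals: {
      executor: "ramping-arrival-rate",
      startRate: 1, // iterations started per timeUnit at the beginning
      timeUnit: "1s",
      preAllocatedVUs: 10, // VUs created up front
      maxVUs: 50, // upper bound if iterations run long
      stages: [
        { duration: "1m", target: 5 }, // ramp to 5 iterations per second
        { duration: "3m", target: 5 }, // hold
        { duration: "1m", target: 0 }, // ramp down
      ],
    },
  },
};

// In a k6 script, declare this with: export const options = { ... };
console.log(options.scenarios.steady_arrivals.executor);
```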

Capturing Custom Metrics

You can define custom metrics in your test scripts:

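import { Trend } from "k6/metrics";

// Define custom metrics
const promptLength = new Trend("prompt_length");
const responseLength = new Trend("response_length");

export default function () {
  // Your test logic: send the prompt and read back the response text.
  // The placeholder strings below stand in for a real request and reply.
  const prompt = "Generate a random number between 1 and 100";
  const response = "42"; // Replace with the text returned by your client

  // Record metrics (lengths in characters)
  promptLength.add(prompt.length);
  responseLength.add(response.length);
}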

Testing with Thresholds

Define pass/fail criteria for your tests:

export const options = {
  thresholds: {
    http_req_duration: ["p(95)<500"], // 95% of requests must complete below 500ms
    http_req_failed: ["rate<0.01"], // Error rate must be less than 1%
  },
};
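
To illustrate what a `p(95)<500` threshold checks, here is a small nearest-rank percentile sketch. k6 computes percentiles internally and may interpolate differently, so treat this as an explanation of the idea rather than a reproduction of k6's arithmetic; `percentile` and `passesP95` are hypothetical names.

```javascript
// Illustrative nearest-rank percentile over a list of samples.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// True when the 95th percentile is below the limit.
function passesP95(durationsMs, limitMs) {
  return percentile(durationsMs, 95) < limitMs;
}

const durations = [120, 180, 210, 250, 300, 320, 350, 400, 450, 900];
console.log(passesP95(durations, 500)); // false
```

In this sample, nine of ten requests finish under 500 ms, but the single 900 ms request sits at the 95th percentile, so the check fails.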

License

This project is licensed under the MIT License. It is provided "as is" and comes with absolutely no guarantees. If it breaks your system, well, that's kind of the point, isn't it? Congratulations, you're now doing perf testing!

Use at your own risk. Side effects may include improved system resilience, fewer 3 AM panic attacks, and an irresistible urge to push big red buttons.

Credits

Consider this my small act of rebellion against the "just eyeball the performance" approach to perf testing. Feel free to star the repo; each star will be printed and taped to my manager's door.

A Developer With Too Much Time and Not Enough Approval