
Unified Evaluation CLI

Evals is an open-source evaluation framework for benchmarking and assessing AI model performance on the Pulze AI platform. It provides a comprehensive command-line interface for running evaluations across multiple benchmarks, including FinanceBench, MMLU, Pulze-v0.1, and Marketing.

Features

  • Multi-benchmark support: FinanceBench, MMLU, Pulze-v0.1, and Marketing
  • Flexible model support: Works with Pulze API, OpenAI API, and other OpenAI-compatible endpoints
  • Template-based evaluation: Support for custom evaluation templates
  • Comprehensive leaderboards: View results across benchmarks and subjects
  • Export capabilities: Export results to CSV and HTML formats
  • Modular architecture: Easy to extend with new benchmarks and evaluators

Installation

  1. Clone the repository:
git clone <repository-url>
cd evals
  2. Install dependencies using Poetry:
# Install Poetry if you haven't already
curl -sSL https://install.python-poetry.org | python3 -

# Install project dependencies
poetry install

# For development dependencies as well
poetry install --with dev
  3. Set up environment variables:
# Copy the template and fill in your API keys
cp .env.template .env
# Edit .env with your API keys

Configuration

The system uses environment variables for configuration:

Required Environment Variables

At least one API key must be provided:

  • PULZE_API_KEY: Your Pulze API key
  • OPENAI_API_KEY: Your OpenAI API key (also used for other OpenAI-compatible endpoints)

Optional Environment Variables

  • PULZE_BASE_URL: Pulze API base URL (default: https://api.pulze.ai/v1)
  • OPENAI_BASE_URL: OpenAI API base URL (default: https://api.openai.com/v1)
  • DEFAULT_TEMPLATE: Default evaluation template (default: default)
  • DEFAULT_RATER_MODEL: Default rater model (default: gpt-4)
  • RESULTS_DIR: Results directory (default: results)
  • MAX_RETRIES: Maximum API retries (default: 3)
  • REQUEST_TIMEOUT: Request timeout in seconds (default: 60)
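
As a rough illustration of how these settings fit together, the sketch below reads the documented variables with their defaults using os.getenv and enforces the "at least one API key" rule. The structure is illustrative only; the actual ConfigLoader may organize this differently.

import os

# Illustrative only -- the real ConfigLoader may structure this differently.
config = {
    "pulze_api_key": os.getenv("PULZE_API_KEY"),
    "openai_api_key": os.getenv("OPENAI_API_KEY"),
    "pulze_base_url": os.getenv("PULZE_BASE_URL", "https://api.pulze.ai/v1"),
    "openai_base_url": os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
    "default_template": os.getenv("DEFAULT_TEMPLATE", "default"),
    "default_rater_model": os.getenv("DEFAULT_RATER_MODEL", "gpt-4"),
    "results_dir": os.getenv("RESULTS_DIR", "results"),
    "max_retries": int(os.getenv("MAX_RETRIES", "3")),
    "request_timeout": int(os.getenv("REQUEST_TIMEOUT", "60")),
}

# At least one API key must be set.
if not (config["pulze_api_key"] or config["openai_api_key"]):
    raise RuntimeError("Set at least one of PULZE_API_KEY or OPENAI_API_KEY")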

Environment Template

You can inspect the current configuration to see which settings are expected:

python eval_cli.py config  # shows the current configuration

Or create an environment template:

from utils import ConfigLoader
ConfigLoader.create_env_template()

Usage

Basic Commands

1. List Available Benchmarks

python eval_cli.py list

This shows all available benchmarks and their subjects.

2. Run Evaluations

FinanceBench Evaluation:

python eval_cli.py run --benchmark financebench --model pulze/llama-3.1-70b-instruct --rater gpt-4

MMLU Marketing Evaluation:

python eval_cli.py run --benchmark mmlu --subject marketing --model openai/gpt-4

Pulze Evaluation with Template:

python eval_cli.py run --benchmark pulze --subject writing_marketing_materials --model pulze/llama-3.1-70b-instruct --template pulze_multi_dimensional_evaluation

Marketing Benchmark:

python eval_cli.py run --benchmark marketing --subject writing_marketing_materials --model anthropic/claude-sonnet-4-0

3. View Leaderboards

Single Benchmark Leaderboard:

python eval_cli.py leaderboard --benchmark marketing

Cross-Benchmark Leaderboard:

python eval_cli.py leaderboard --all

Export Leaderboard:

python eval_cli.py leaderboard --benchmark mmlu --export html
python eval_cli.py leaderboard --benchmark financebench --export csv

4. Show Configuration

python eval_cli.py config

Advanced Usage

Model Naming Conventions

The system automatically routes to the appropriate API based on model names:

  • pulze/model-name → Pulze API
  • openai/model-name → OpenAI API
  • anthropic/model-name → OpenAI-compatible endpoint
  • gpt-3.5-turbo, gpt-4, etc. → OpenAI API
  • Other models → Default to OpenAI-compatible endpoint
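
The routing rules above amount to a simple prefix check. The sketch below shows the general idea; the function name and return shape are illustrative, not the actual implementation.

import os

def resolve_endpoint(model: str):
    """Illustrative only: pick (base_url, api_key) for a given model name."""
    if model.startswith("pulze/"):
        return (os.getenv("PULZE_BASE_URL", "https://api.pulze.ai/v1"),
                os.getenv("PULZE_API_KEY"))
    # openai/..., bare gpt-* names, anthropic/..., and everything else
    # go to the (possibly overridden) OpenAI-compatible endpoint.
    return (os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
            os.getenv("OPENAI_API_KEY"))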

Subject-Specific Evaluations

Each benchmark supports different subjects:

MMLU Subjects (57 available):

  • marketing, business_ethics, management, economics, etc.

Pulze-v0.1 Subjects (56 available):

  • writing_marketing_materials, creative_writing, data_analysis, etc.

Marketing Subjects (3 available):

  • writing_marketing_materials, Ulta Beauty, marketing

FinanceBench:

  • Single subject: financial_analysis

Template System

Templates define how evaluations are conducted:

  • Default: Basic question-answer evaluation
  • pulze_multi_dimensional_evaluation: Multi-dimensional scoring
  • Custom templates can be added to the pulze-v0.1/templates/ directory

Benchmarks

1. FinanceBench

  • Purpose: Financial document analysis
  • Format: Question-answer with evidence from financial documents
  • Scoring: AI-rated responses against expected answers
  • Data: financebench/data/financebench_open_source.jsonl

2. MMLU (Massive Multitask Language Understanding)

  • Purpose: Multiple-choice questions across 57 subjects
  • Format: 4-choice multiple-choice questions
  • Scoring: Exact match (1.0 for correct, 0.0 for incorrect)
  • Data: mmlu/data/mmlu_*.jsonl
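
Because MMLU is scored by exact match, the scoring step reduces to a comparison like the sketch below. The function name and normalization are illustrative; the evaluator's actual parsing of model output may be more involved.

def score_mmlu(model_choice: str, expected_choice: str) -> float:
    """Exact match: 1.0 if the selected option matches the expected one, else 0.0."""
    return 1.0 if model_choice.strip().upper() == expected_choice.strip().upper() else 0.0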

3. Pulze-v0.1

  • Purpose: Template-based evaluations across 56 subjects
  • Format: Configurable prompts with template-based scoring
  • Scoring: AI-rated based on template criteria
  • Data: pulze-v0.1/data/*.jsonl
  • Templates: pulze-v0.1/templates/*.json

4. Marketing

  • Purpose: Marketing-specific evaluations
  • Format: Marketing scenarios and tasks
  • Scoring: Multi-dimensional marketing criteria
  • Data: marketing/*.jsonl

Results and Leaderboards

Results Storage

Results are stored in JSONL format in the results/ directory:

results/
├── financebench/
│   └── model_template_rater_timestamp.jsonl
├── mmlu/
│   └── model_template_rater_subject_timestamp.jsonl
├── pulze/
│   └── model_template_rater_subject_timestamp.jsonl
└── marketing/
    └── model_template_rater_subject_timestamp.jsonl

Leaderboard Features

  • Benchmark-specific leaderboards: Compare models on individual benchmarks
  • Cross-benchmark leaderboards: Compare models across multiple benchmarks
  • Subject breakdowns: Detailed performance by subject
  • Export options: HTML and CSV formats
  • Sorting options: By average score, total score, or count

Result Format

Each result entry contains:

{
  "question_id": "unique_id",
  "question": "question_text",
  "model_answer": "model_response",
  "expected_answer": "expected_answer",
  "score": 0.85,
  "benchmark": "benchmark_name",
  "subject": "subject_name",
  "model": "model_name",
  "template": "template_name",
  "rater_model": "rater_model_name",
  "timestamp": "2024-01-01T12:00:00"
}
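
Because results are plain JSONL files with the fields above, you can post-process them directly. The snippet below is a hypothetical example that scans the results directory and builds a simple per-model leaderboard sorted by average score.

import json
from collections import defaultdict
from pathlib import Path

# Collect scores per model from every results file.
scores = defaultdict(list)
for path in Path("results").rglob("*.jsonl"):
    with path.open() as fh:
        for line in fh:
            entry = json.loads(line)
            scores[entry["model"]].append(entry["score"])

# Sort models by average score, highest first.
leaderboard = sorted(
    ((model, sum(vals) / len(vals), len(vals)) for model, vals in scores.items()),
    key=lambda row: row[1],
    reverse=True,
)
for model, avg, count in leaderboard:
    print(f"{model}: {avg:.3f} average over {count} items")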

Architecture

Modular Design

├── eval_cli.py              # Main CLI interface
├── evaluators/              # Benchmark evaluators
│   ├── __init__.py         # Evaluator factory
│   ├── base_evaluator.py   # Abstract base class
│   ├── financebench_evaluator.py
│   ├── mmlu_evaluator.py
│   ├── pulze_evaluator.py
│   └── marketing_evaluator.py
└── utils/                   # Utility modules
    ├── __init__.py
    ├── config_loader.py     # Configuration management
    ├── results_manager.py   # Results storage/loading
    └── leaderboard_generator.py  # Leaderboard generation

Adding New Benchmarks

  1. Create a new evaluator class inheriting from BaseEvaluator
  2. Implement required methods: load_data(), get_available_subjects(), evaluate_item()
  3. Add the evaluator to evaluators/__init__.py
  4. Update CLI choices in eval_cli.py
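
A minimal skeleton for a new evaluator might look like the following. The import path and method signatures are assumptions here; adapt them to what base_evaluator.py actually defines.

from evaluators.base_evaluator import BaseEvaluator


class MyBenchmarkEvaluator(BaseEvaluator):
    """Hypothetical evaluator skeleton; adjust signatures to match BaseEvaluator."""

    def load_data(self, subject=None):
        # Read your benchmark's JSONL file(s) and return the items to evaluate.
        ...

    def get_available_subjects(self):
        # Return the list of subjects this benchmark supports.
        return ["my_subject"]

    def evaluate_item(self, item, model_answer):
        # Return a score (and any metadata) for a single item.
        ...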

Adding New Templates

  1. Create a JSON template file in pulze-v0.1/templates/
  2. Define system_prompt, prompt_template, and evaluation_criteria
  3. The template will be automatically available for Pulze evaluations
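
For example, the sketch below writes a hypothetical template containing the three documented fields. Only the field names come from the documentation; the field contents and filename are illustrative.

import json
from pathlib import Path

# Hypothetical template; only the field names are taken from the docs.
template = {
    "system_prompt": "You are an expert evaluator for marketing content.",
    "prompt_template": "Task: {question}\n\nRespond as a marketing specialist.",
    "evaluation_criteria": [
        "Factual accuracy",
        "Clarity and structure",
        "Relevance to the marketing task",
    ],
}

Path("pulze-v0.1/templates/custom_marketing.json").write_text(json.dumps(template, indent=2))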

Examples

Complete Evaluation Workflow

# 1. Check configuration
python eval_cli.py config

# 2. List available benchmarks
python eval_cli.py list

# 3. Run evaluations
python eval_cli.py run --benchmark mmlu --subject marketing --model pulze/llama-3.1-70b-instruct
python eval_cli.py run --benchmark marketing --model openai/gpt-4

# 4. View results
python eval_cli.py leaderboard --benchmark mmlu
python eval_cli.py leaderboard --all

# 5. Export results
python eval_cli.py leaderboard --benchmark marketing --export html

Batch Evaluation Script

#!/bin/bash
# Evaluate multiple models on marketing subjects

models=("pulze/llama-3.1-70b-instruct" "openai/gpt-4" "anthropic/claude-sonnet-4-0")

for model in "${models[@]}"; do
    echo "Evaluating $model on marketing..."
    python eval_cli.py run --benchmark marketing --model "$model"
    python eval_cli.py run --benchmark mmlu --subject marketing --model "$model"
done

# Generate comprehensive leaderboard
python eval_cli.py leaderboard --all --export html

Troubleshooting

Common Issues

  1. API Key Errors: Ensure your API keys are set in environment variables
  2. Model Not Found: Check model name format and API endpoint
  3. Template Not Found: Verify template exists in pulze-v0.1/templates/
  4. Subject Not Found: Use python eval_cli.py list to see available subjects

Debug Mode

For more verbose output, add the project root to PYTHONPATH and run Python in verbose mode (which traces module imports):

export PYTHONPATH=.
python -v eval_cli.py run --benchmark mmlu --subject marketing --model gpt-4
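
Alternatively, if you are extending the code yourself, you can turn on standard Python logging at the top of your script. This assumes the evaluators use the standard logging module; it is not a documented switch of the CLI.

import logging

# Assumption: the project logs through the standard logging module.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)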

Configuration Validation

python eval_cli.py config

This will show your current configuration and highlight any issues.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Submit a pull request

Development Setup

# Install development dependencies
poetry install --with dev

# Activate the virtual environment
poetry shell

# Run tests
poetry run pytest tests/

# Format code
poetry run black .
poetry run isort .

# Run linting
poetry run flake8 .

License

[Add your license information here]