
Mixpeek Benchmarks

Multimodal Benchmarks

The open evaluation suite for multimodal retrieval systems.

Standard datasets, queries, and relevance judgments for benchmarking retrieval across video, image, audio, and document modalities, particularly in regulated and high-stakes domains.

🎯 Quick Start

Choose your benchmark and get started in 60 seconds:

| Benchmark | Domain | Learn More | Leaderboard |
|---|---|---|---|
| Financial Documents | SEC filings, earnings reports | mxp.co/finance | View → |
| Medical Devices | IFUs, regulatory docs | mxp.co/device | View → |
| Curriculum Search | Educational videos, lectures | mxp.co/learning | View → |

Run Any Benchmark

# Finance benchmark
cd finance && python run.py --quick

# Medical device benchmark
cd device && python run.py --quick

# Curriculum benchmark
cd learning && python run.py --quick

Each benchmark runs in ~1 second with demo data. See QUICKSTART.md for the full guide.

Why This Exists

Most retrieval benchmarks assume text-only search on clean web data. Real-world multimodal retrieval is harder:

  • Medical device IFUs with nested tables, diagrams, and regulatory language
  • SEC filings with embedded charts, footnotes, and cross-references
  • Educational videos requiring temporal understanding and code-lecture alignment
  • Regulatory documents spanning technical specs, clinical data, and safety reports

This repo provides ground-truth evaluation sets for these verticals, so you can measure what actually matters.

📊 Benchmarks Overview

All benchmarks are available now and include sample queries with human-annotated relevance judgments.

| Benchmark | Best NDCG@10 | Status | Documentation |
|---|---|---|---|
| Finance | 0.78 | ✅ Available | README · Leaderboard |
| Device | 0.78 | ✅ Available | README · Leaderboard |
| Learning | 0.84 | ✅ Available | README · Leaderboard |

πŸ“ Structure

benchmarks/
├── shared/                      # Shared utilities
│   ├── metrics.py               # Standard evaluation metrics
│   ├── evaluator.py             # Benchmark runner
│   └── __init__.py
│
├── finance/                     # Financial document benchmark
│   ├── run.py                   # Main benchmark script
│   ├── README.md                # Full documentation
│   ├── LEADERBOARD.md           # Results leaderboard
│   └── results/                 # Benchmark results
│
├── device/                      # Medical device benchmark
│   ├── run.py
│   ├── README.md
│   ├── LEADERBOARD.md
│   └── results/
│
└── learning/                    # Curriculum search benchmark
    ├── run.py
    ├── README.md
    ├── LEADERBOARD.md
    └── results/

🚀 Quick Start

1. Install Dependencies

# Install shared dependencies
pip install numpy

2. Run a Benchmark

# Run with demo data (no setup required)
cd finance && python run.py --quick

# Run with your own data
cd finance && python run.py --data-dir /path/to/documents

3. Evaluate Your Retriever

All benchmarks use a standard interface:

from shared import BenchmarkEvaluator, Query, RelevanceJudgment

# Your retrieval function
def my_retriever(query: str) -> list[str]:
    # Returns ranked list of document IDs
    ...

# Create evaluator
evaluator = BenchmarkEvaluator(
    name="my-system",
    retriever_fn=my_retriever,
    k_values=[5, 10, 20]
)

# Run benchmark
queries = [...]  # Load your queries
judgments = [...]  # Load ground truth
report = evaluator.run(queries, judgments)

# Print results
evaluator.print_summary(report)
evaluator.save_report(report, "results.json")
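
The only contract the evaluator imposes on your system is the retriever_fn signature shown above: take a query string, return a ranked list of document IDs. As a minimal sketch (a toy keyword-overlap scorer over a hypothetical in-memory corpus, not a real retriever), something like this already plugs into BenchmarkEvaluator:

# Toy stand-in for my_retriever, purely illustrative: scores a tiny
# in-memory corpus by keyword overlap and returns doc IDs ranked
# best-first. Swap in your actual vector store or search API here.
CORPUS = {
    "doc_10k_2024": "annual report revenue risk factors liquidity",
    "doc_ifu_x200": "device instructions contraindications sterilization",
    "doc_lecture_03": "gradient descent lecture transcript backpropagation",
}

def my_retriever(query: str) -> list[str]:
    terms = set(query.lower().split())
    scored = sorted(
        CORPUS.items(),
        key=lambda item: len(terms & set(item[1].split())),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored]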

πŸ“ Standard Metrics

All benchmarks use consistent evaluation metrics:

  • NDCG@k - Ranking quality (primary metric)
  • Recall@k - Coverage of relevant documents
  • MRR - Position of first relevant result
  • Precision@k - Accuracy at cutoff
  • MAP - Mean Average Precision
  • Latency (p95) - 95th percentile response time

Detailed metric definitions in shared/metrics.py
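
For orientation, the two headline metrics are conventionally computed as in the standalone sketch below. This is an illustrative (linear-gain) implementation, not a copy of shared/metrics.py, which remains the source of truth for these benchmarks.

# Illustrative implementations of NDCG@k and MRR for a single query,
# given a ranked list of doc IDs and a {doc_id: graded relevance} map.
import math

def ndcg_at_k(ranked_ids: list[str], judgments: dict[str, int], k: int) -> float:
    """NDCG@k: DCG of the returned ranking divided by DCG of the ideal ranking."""
    dcg = sum(
        judgments.get(doc_id, 0) / math.log2(rank + 2)  # rank 0 -> discount log2(2) = 1
        for rank, doc_id in enumerate(ranked_ids[:k])
    )
    ideal_gains = sorted(judgments.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 2) for rank, gain in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0

def mrr(ranked_ids: list[str], judgments: dict[str, int]) -> float:
    """Reciprocal rank of the first result with a positive relevance grade."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if judgments.get(doc_id, 0) > 0:
            return 1.0 / rank
    return 0.0

# One query: the system returned d3, d1, d7; human grades say d1 = 2, d3 = 1.
print(ndcg_at_k(["d3", "d1", "d7"], {"d1": 2, "d3": 1}, k=10))  # ~0.86
print(mrr(["d3", "d1", "d7"], {"d1": 2, "d3": 1}))              # 1.0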

πŸ† Leaderboards

Each benchmark maintains its own leaderboard (LEADERBOARD.md in that benchmark's directory).

Submit Your Results

Beat the baseline? Submit your results:

  1. Run benchmark: cd finance && python run.py
  2. Results in: finance/results/benchmark_results.json
  3. Open PR with results + system description
  4. Appear on the leaderboard!

See individual benchmark READMEs for detailed submission instructions.

📚 Documentation

Contributing a Benchmark

We welcome contributions from researchers and practitioners working on vertical-specific retrieval.

Requirements

  1. Minimum 100 queries with relevance judgments (see the sketch after this list)
  2. Clear licensing for underlying data
  3. Reproducible baseline using at least one open retriever
  4. Documentation describing the domain and evaluation protocol
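
To make the first requirement concrete, a single query and one of its graded judgments conceptually look like the sketch below. The field names and grading scale here are illustrative assumptions, not a required schema; see the existing benchmark directories for the actual formats.

# Hypothetical example of a contributed query and one graded relevance
# judgment -- field names and the 0-2 grading scale are illustrative only.
example_query = {
    "query_id": "device-017",
    "text": "What sterilization methods are approved for the X200 handpiece?",
}
example_judgment = {
    "query_id": "device-017",
    "doc_id": "ifu_x200_rev3_section4",
    "relevance": 2,  # e.g. 0 = not relevant, 1 = partially relevant, 2 = highly relevant
}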

Submission Process

  1. Fork this repo
  2. Add your benchmark under a new directory
  3. Include all required files (see structure above)
  4. Open a PR with benchmark description

See CONTRIBUTING.md for full guidelines.

Citation

If you use these benchmarks in your research:

@misc{mixpeek-multimodal-benchmarks,
  title={Multimodal Benchmarks: Evaluation Suite for Vertical Retrieval Systems},
  author={Mixpeek},
  year={2025},
  url={https://github.com/mixpeek/multimodal-benchmarks}
}

License

Benchmark code: MIT License

Datasets: Individual licensing per benchmark (see each benchmark's LICENSE file)


Built by Mixpeek – multimodal AI infrastructure for regulated industries.