Multimodal Benchmarks
The open evaluation suite for multimodal retrieval systems.
Standard datasets, queries, and relevance judgments for benchmarking retrieval across video, image, audio, and document modalities, particularly in regulated and high-stakes domains.
Quick Start
Choose your benchmark and get started in 60 seconds:
| Benchmark | Domain | Learn More | Leaderboard |
|---|---|---|---|
| Financial Documents | SEC filings, earnings reports | mxp.co/finance | View → |
| Medical Devices | IFUs, regulatory docs | mxp.co/device | View → |
| Curriculum Search | Educational videos, lectures | mxp.co/learning | View → |
Run Any Benchmark
```bash
# Finance benchmark
cd finance && python run.py --quick

# Medical device benchmark
cd device && python run.py --quick

# Curriculum benchmark
cd learning && python run.py --quick
```
Each runs in ~1 second with demo data. See QUICKSTART.md for the full guide.
Why This Exists
Most retrieval benchmarks assume text-only search on clean web data. Real-world multimodal retrieval is harder:
- Medical device IFUs with nested tables, diagrams, and regulatory language
- SEC filings with embedded charts, footnotes, and cross-references
- Educational videos requiring temporal understanding and code-lecture alignment
- Regulatory documents spanning technical specs, clinical data, and safety reports
This repo provides ground-truth evaluation sets for these verticals, so you can measure what actually matters.
Benchmarks Overview
All benchmarks are available now and include sample queries with human-annotated relevance judgments.
| Benchmark | Best NDCG@10 | Status | Documentation |
|---|---|---|---|
| Finance | 0.78 | Available | README · Leaderboard |
| Device | 0.78 | Available | README · Leaderboard |
| Learning | 0.84 | Available | README · Leaderboard |
Structure
```
benchmarks/
├── shared/            # Shared utilities
│   ├── metrics.py     # Standard evaluation metrics
│   ├── evaluator.py   # Benchmark runner
│   └── __init__.py
│
├── finance/           # Financial document benchmark
│   ├── run.py         # Main benchmark script
│   ├── README.md      # Full documentation
│   ├── LEADERBOARD.md # Results leaderboard
│   └── results/       # Benchmark results
│
├── device/            # Medical device benchmark
│   ├── run.py
│   ├── README.md
│   ├── LEADERBOARD.md
│   └── results/
│
└── learning/          # Curriculum search benchmark
    ├── run.py
    ├── README.md
    ├── LEADERBOARD.md
    └── results/
```
Quick Start
1. Install Dependencies
```bash
# Install shared dependencies
pip install numpy
```
2. Run a Benchmark
```bash
# Run with demo data (no setup required)
cd finance && python run.py --quick

# Run with your own data
cd finance && python run.py --data-dir /path/to/documents
```
3. Evaluate Your Retriever
All benchmarks use a standard interface:
```python
from shared import BenchmarkEvaluator, Query, RelevanceJudgment

# Your retrieval function
def my_retriever(query: str) -> list[str]:
    # Returns ranked list of document IDs
    ...

# Create evaluator
evaluator = BenchmarkEvaluator(
    name="my-system",
    retriever_fn=my_retriever,
    k_values=[5, 10, 20]
)

# Run benchmark
queries = [...]    # Load your queries
judgments = [...]  # Load ground truth
report = evaluator.run(queries, judgments)

# Print results
evaluator.print_summary(report)
evaluator.save_report(report, "results.json")
```
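To make the queries and judgments inputs concrete, here is a minimal sketch of what constructing them might look like. The field names (query_id, text, doc_id, relevance) are assumptions for illustration only; check the Query and RelevanceJudgment definitions in shared/ for the actual schema.

```python
# Illustrative only: the field names below are assumptions, not the actual schema.
# See the Query and RelevanceJudgment definitions in shared/ for the real fields.
queries = [
    Query(query_id="q1", text="What drove Q3 revenue growth?"),
    Query(query_id="q2", text="Where are segment-level margins reported?"),
]
judgments = [
    RelevanceJudgment(query_id="q1", doc_id="10-K-2024-item-7", relevance=2),
    RelevanceJudgment(query_id="q1", doc_id="q3-earnings-call", relevance=1),
    RelevanceJudgment(query_id="q2", doc_id="10-Q-2024-segment-note", relevance=2),
]
```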
Standard Metrics
All benchmarks use consistent evaluation metrics:
- NDCG@k - Ranking quality (primary metric)
- Recall@k - Coverage of relevant documents
- MRR - Position of first relevant result
- Precision@k - Accuracy at cutoff
- MAP - Mean Average Precision
- Latency (p95) - 95th percentile response time
Detailed metric definitions live in shared/metrics.py; a minimal NDCG@k sketch is shown below.
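For intuition, here is a self-contained sketch of how NDCG@k is commonly computed from graded judgments. It is illustrative only; shared/metrics.py is the authoritative implementation and may use a different gain/discount convention.

```python
import math

def ndcg_at_k(ranked_ids: list[str], judgments: dict[str, int], k: int = 10) -> float:
    """NDCG@k for one query: DCG of the returned ranking divided by the DCG
    of an ideal ordering of the judged documents."""
    def dcg(relevances: list[int]) -> float:
        # Common convention: gain (2^rel - 1), discount log2(rank + 1) with 1-based ranks
        return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(relevances))

    gains = [judgments.get(doc_id, 0) for doc_id in ranked_ids[:k]]  # unjudged docs count as 0
    ideal = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy example with graded judgments (0 = not relevant, 2 = highly relevant)
judgments = {"doc-3": 2, "doc-7": 1, "doc-9": 1}
print(ndcg_at_k(["doc-3", "doc-1", "doc-7", "doc-2"], judgments, k=10))
```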
Leaderboards
Each benchmark maintains its own leaderboard:
- Financial Documents → Best: 0.78 NDCG@10
- Medical Devices → Best: 0.78 NDCG@10
- Curriculum Search → Best: 0.84 NDCG@10
Submit Your Results
Beat the baseline? Submit your results:
- Run the benchmark: cd finance && python run.py
- Find results in finance/results/benchmark_results.json
- Open a PR with your results and a system description
- Appear on the leaderboard!
See individual benchmark READMEs for detailed submission instructions.
Documentation
- Quick Start Guide - Get started in 60 seconds
- Finance Benchmark - SEC filings, financial docs
- Device Benchmark - Medical device IFUs, regulatory docs
- Learning Benchmark - Educational videos, lectures
Contributing a Benchmark
We welcome contributions from researchers and practitioners working on vertical-specific retrieval.
Requirements
- Minimum 100 queries with relevance judgments (one possible file layout is sketched after this list)
- Clear licensing for underlying data
- Reproducible baseline using at least one open retriever
- Documentation describing the domain and evaluation protocol
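This README does not mandate a particular on-disk format for ground truth. As one possibility, purely illustrative, a JSONL layout keeps queries and graded judgments easy to review and diff in a PR:

```python
# Purely illustrative layout for a contributed benchmark's ground truth.
# Field names here are assumptions, not a repo requirement.
import json

queries = [
    {"query_id": "q001", "text": "contraindications for implantable device model X"},
    {"query_id": "q002", "text": "sterilization procedure before first use"},
]
judgments = [
    {"query_id": "q001", "doc_id": "ifu-model-x-2023", "relevance": 2},
    {"query_id": "q001", "doc_id": "regulatory-summary-x", "relevance": 1},
]

# One JSON object per line keeps the files easy to diff during review.
with open("queries.jsonl", "w") as f:
    f.writelines(json.dumps(q) + "\n" for q in queries)
with open("judgments.jsonl", "w") as f:
    f.writelines(json.dumps(j) + "\n" for j in judgments)
```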
Submission Process
- Fork this repo
- Add your benchmark under a new directory
- Include all required files (see structure above)
- Open a PR with a short benchmark description
See CONTRIBUTING.md for full guidelines.
Citation
If you use these benchmarks in your research:
```bibtex
@misc{mixpeek-multimodal-benchmarks,
  title={Multimodal Benchmarks: Evaluation Suite for Vertical Retrieval Systems},
  author={Mixpeek},
  year={2025},
  url={https://github.com/mixpeek/multimodal-benchmarks}
}
```
License
Benchmark code: MIT License
Datasets: Individual licensing per benchmark (see each benchmark's LICENSE file)
Built by Mixpeek · Multimodal AI infrastructure for regulated industries.
