Show HN: BenchFlow – run AI benchmarks as an API

github.com

24 points by xdotli 10 months ago · 1 comment · 2 min read

I built BenchFlow, an open-source framework that lets you integrate and evaluate AI tasks using Docker-based benchmarks. You can try it out right now by cloning the repo and running a benchmark in minutes.

As an AI researcher, I was frustrated with how much time my team spent setting up benchmark environments rather than actually improving our models. We'd spend weeks configuring environments, only to find inconsistencies when comparing results with other teams. BenchFlow started as an internal tool to standardize our evaluation process, and we decided to open-source it after seeing how much time it saved us.

Unlike other benchmarking tools that focus on specific domains, BenchFlow provides a unified interface for any AI task. The Docker-based approach ensures consistent environments across different machines and teams. You don't need to worry about dependency conflicts or environment setup - just implement a simple interface and you're ready to go.

How to try it out? Check the GitHub link below, but here's a quick preview:

1. pip install benchflow
2. Load a benchmark and define how to call your agents/models
3. Run it and get the result
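
To make step 2 a bit more concrete, here is a rough sketch of what "define how to call your agent" and "run it and get the result" could look like in plain Python. It deliberately does not import benchflow: `Task`, `Agent`, `EchoAgent`, and `run_benchmark` are made-up names for illustration only, not the library's actual API, and there is no Docker involved. Check the repo for the real interface.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical benchmark task: a prompt plus the expected answer."""
    task_id: str
    prompt: str
    expected: str


class Agent(ABC):
    """Hypothetical agent interface: map a task prompt to an answer string."""

    @abstractmethod
    def call(self, prompt: str) -> str:
        ...


class EchoAgent(Agent):
    """Toy agent that answers with the last word of the prompt."""

    def call(self, prompt: str) -> str:
        return prompt.split()[-1]


def run_benchmark(agent: Agent, tasks: list[Task]) -> dict:
    """Call the agent on every task and score it by exact match."""
    per_task = {t.task_id: agent.call(t.prompt) == t.expected for t in tasks}
    accuracy = sum(per_task.values()) / len(per_task) if per_task else 0.0
    return {"per_task": per_task, "accuracy": accuracy}


# Two illustrative tasks (invented here, not taken from any real benchmark).
TASKS = [
    Task("t1", "Repeat the final word: apple", "apple"),
    Task("t2", "What is 2 + 2? Answer with a number", "4"),
]

report = run_benchmark(EchoAgent(), TASKS)
print(report)
```

The idea is that your side of the contract stays as small as the `call` method, while the harness owns the task loop and the scoring. In this toy version the harness is a single function; in a containerized setup that loop would run inside the benchmark's own environment.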

Available benchmarks you can try today:

- MMLU-PRO: Test your model's knowledge across 57 subjects
- Bird: Evaluate business intelligence reasoning capabilities
- WebArena: See how your agent performs on web-based tasks
- MedQA-CS: Test medical question answering abilities

The framework handles all the containerization, task distribution, and result collection, so you can focus on improving your models rather than managing infrastructure.

I'd love to hear your feedback and see how you use it. What benchmarks would you like to see added next?

Please give us a star if you can, thanks!

GitHub: https://github.com/benchflow-ai/benchflow

Website: https://benchflow.ai/

Benchmark Hub: https://benchflow.ai/benchmarks

Inspo: https://github.com/ServiceNow/BrowserGym

dnnssl2 10 months ago

What can this handle? Code? Browser? Computer Use?
