Show HN: FC-Eval – CLI to Benchmark Local or Cloud LLMs on Function Calling

3 points by gauravvij137 3 months ago · 0 comments · 1 min read

I built FC-Eval to have a repeatable way to evaluate how well different LLMs handle function calling before using them in agent workflows.

It runs models through 30 test cases covering single-turn, multi-turn, and agentic scenarios, modeled loosely after the Berkeley Function Calling Leaderboard methodology.

Validation uses AST matching rather than string comparison to avoid false positives from formatting variations.
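To illustrate why AST matching is more forgiving than string comparison, here is a minimal sketch in Python. It is not FC-Eval's actual validator: the helper names `parse_call` and `calls_match` are hypothetical, and it assumes the model's call is rendered as Python-style call syntax with literal arguments.

```python
import ast


def parse_call(source: str):
    """Parse a call such as get_weather(city="Paris") into (name, args, kwargs).

    Hypothetical helper for illustration; assumes all arguments are literals.
    """
    node = ast.parse(source.strip(), mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("expected a single function call")
    name = ast.unparse(node.func)  # requires Python 3.9+
    args = [ast.literal_eval(arg) for arg in node.args]
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, args, kwargs


def calls_match(expected: str, actual: str) -> bool:
    """Compare two calls structurally, ignoring whitespace, quote style,
    and keyword-argument order."""
    try:
        return parse_call(expected) == parse_call(actual)
    except (SyntaxError, ValueError):
        # Unparseable or non-literal output counts as a mismatch.
        return False
```

With this sketch, `get_weather(city="Paris", unit="c")` and `get_weather( unit='c',city = 'Paris' )` compare equal, while a plain string comparison would flag them as different.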

Supports two backends: OpenRouter for cloud models (GPT-5.2, Claude, Qwen 3.5, Mistral, etc.) and Ollama for local models with no API key needed.

It runs each test for best-of-N trials, giving you a more reliable score alongside raw accuracy.
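One way the two numbers could be computed is sketched below in Python. This is an assumption about the scoring, not FC-Eval's confirmed logic: it treats a test case as passed under best-of-N if any of its N trials passes, and raw accuracy as the fraction of all individual trials that pass. The function name `summarize` and the input shape are hypothetical.

```python
def summarize(trials: dict[str, list[bool]]) -> tuple[float, float]:
    """Return (raw_accuracy, best_of_n_score).

    `trials` maps a test-case id to its pass/fail results, one per trial.
    Assumed scoring: a case counts under best-of-N if any trial passed.
    """
    total_trials = sum(len(results) for results in trials.values())
    passed_trials = sum(sum(results) for results in trials.values())
    raw_accuracy = passed_trials / total_trials
    best_of_n = sum(any(results) for results in trials.values()) / len(trials)
    return raw_accuracy, best_of_n
```

A gap between the two numbers is a useful signal: a model that scores high on best-of-N but low on raw accuracy can produce the right call but does not do so consistently.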

Results export to JSON, TXT, CSV, or Markdown.

Quick start commands:

Via OpenRouter: `fc-eval --provider openrouter --models openai/gpt-5.2 anthropic/claude-sonnet-4.6`

Via Ollama: `fc-eval --provider ollama --models llama3.2`

GitHub repo: https://github.com/gauravvij/function-calling-cli

Happy to answer questions, especially around the test case design or validation logic.
