FC-Eval
Function Calling Evaluation Tool for LLMs
Supports OpenRouter (Cloud) and Ollama (Local) backends
Overview
FC-Eval is a comprehensive CLI tool for evaluating Large Language Models' function-calling capabilities. Inspired by the Berkeley Function Calling Leaderboard (BFCL) v4 methodology, it provides rigorous testing across 30 unique test cases covering single-turn, multi-turn, and agentic scenarios.
Key Features:
- 🌐 Dual Backend Support: Evaluate models via OpenRouter (cloud) or Ollama (local)
- 📊 30 Unique Test Cases: Comprehensive coverage across all function-calling scenarios
- 🔄 Best of N Trials: Configurable trial count with reliability metrics
- ⚡ Parallel Execution: Multi-threaded evaluation for faster results
- 📈 Comprehensive Reporting: JSON and TXT reports with detailed metrics
- 🎯 AST-Based Validation: Accurate function call matching using abstract syntax trees
Installation
Prerequisites
- Python 3.10 or higher
- For Ollama testing: Linux/macOS/Windows with WSL
Step 1: Clone the Repository
git clone https://github.com/gauravvij/function-calling-cli.git
cd function-calling-cli
Step 2: Install Python Dependencies
# Create a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
# Install the package
pip install -e .
Quick Start
FC-Eval can be run in two ways:
- Using the installed CLI (fc-eval) - Supports both OpenRouter and Ollama
- Using the standalone script (evaluate_fc.py) - OpenRouter only
Option A: OpenRouter (Cloud) - Easiest
- Get an API key at https://openrouter.ai/keys
- Set your API key:
  export OPENROUTER_API_KEY="your-api-key-here"
- Run evaluation using fc-eval:
  fc-eval --provider openrouter --models qwen/qwen3.5-9b
  Or using the standalone script:
  python evaluate_fc.py --models qwen/qwen3.5-9b
Option B: Ollama (Local) - Requires Setup
- Install Ollama (see Ollama Setup section)
- Create the optimized model:
  ollama create qwen3.5:9b-fc -f qwen3.5-9b-fc.modelfile
- Run evaluation:
  fc-eval --provider ollama --models qwen3.5:9b-fc
Ollama Setup
Installing Ollama
Ollama provides a simple installation script for Linux/macOS:
# Install Ollama (official one-liner)
curl -fsSL https://ollama.com/install.sh | sh
This will:
- Download and install the Ollama binary
- Set up the Ollama service
- Start the Ollama server automatically
Verifying Installation
# Check Ollama is installed
ollama --version
# Verify server is running
curl http://localhost:11434/api/tags
Creating the Custom Modelfile
The project includes an optimized Modelfile (qwen3.5-9b-fc.modelfile) that addresses the temperature and system prompt issues identified in our analysis:
FROM qwen3.5:9b
# System prompt optimized for function calling
SYSTEM You are a helpful AI assistant with access to tools/functions. When you need to perform an action, use the available tools by making function calls. Always respond with the correct function call format when a tool is needed.
# Critical parameters for function calling accuracy
PARAMETER temperature 0.0
PARAMETER top_p 0.9
PARAMETER top_k 10
PARAMETER num_ctx 8192
PARAMETER num_predict 4096
Key Configuration Changes:
| Parameter | Default | Optimized | Impact |
|---|---|---|---|
| temperature | 1.0 | 0.0 | Eliminates randomness for deterministic function calls |
| top_p | 0.95 | 0.9 | Slightly more focused sampling |
| top_k | 20 | 10 | Reduces token selection variety |
| num_ctx | 2048 | 8192 | Larger context window |
| num_predict | -1 | 4096 | Maximum response length |
Building the Optimized Model
# Create the custom model from the Modelfile
ollama create qwen3.5:9b-fc -f qwen3.5-9b-fc.modelfile
# Verify the model was created
ollama list
# Inspect model parameters
ollama show qwen3.5:9b-fc
Pulling the Base Model (if needed)
If you don't have the base model:
# Pull the base Qwen 3.5 9B model
ollama pull qwen3.5:9b
# Then create the custom version
ollama create qwen3.5:9b-fc -f qwen3.5-9b-fc.modelfile
Usage
API Key Setup
OpenRouter API Key
FC-Eval requires an OpenRouter API key for cloud-based evaluation.
Option 1: Environment Variable (Recommended)
export OPENROUTER_API_KEY="your-api-key-here"
Add this to your ~/.bashrc or ~/.zshrc for persistence.
Option 2: Command Line Argument
fc-eval --provider openrouter --api-key "your-api-key-here"
Option 3: .env File
Create a .env file in your working directory:
OPENROUTER_API_KEY=your-api-key-here
Get your API key at: https://openrouter.ai/keys
Ollama (Local)
No API key required for Ollama. Ensure the server is running:
# Check if Ollama is running
curl http://localhost:11434/api/tags
Basic Usage
Evaluate with OpenRouter (Cloud)
# Evaluate default models via OpenRouter
fc-eval --provider openrouter
# Evaluate specific models
fc-eval --provider openrouter --models qwen/qwen3.5-9b qwen/qwen3.5-27b
# Run with parallel execution
fc-eval --provider openrouter --mode parallel --max-workers 10
Evaluate with Ollama (Local)
# Evaluate local Ollama models
fc-eval --provider ollama
# Evaluate specific local model
fc-eval --provider ollama --models qwen3.5:9b-fc
# Run with sequential mode (recommended for local testing)
fc-eval --provider ollama --mode sequential
Parallel vs Sequential Execution
Parallel Execution (recommended for cloud):
fc-eval --provider openrouter --mode parallel --max-workers 10
Sequential Execution (recommended for local/debugging):
fc-eval --provider ollama --mode sequential
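The difference between the two modes can be pictured with a small sketch. It assumes a thread pool for parallel mode, in line with the "multi-threaded" description in the feature list; `run_tests` and `fake_run` are hypothetical names, and the real runner may be structured differently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def run_tests(
    tests: Iterable[str],
    run_one: Callable[[str], bool],
    mode: str = "sequential",
    max_workers: int = 10,
) -> dict[str, bool]:
    """Run every test ID through run_one, either one at a time or in a thread pool."""
    test_ids = list(tests)
    if mode == "parallel":
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # Executor.map returns results in input order.
            results = list(pool.map(run_one, test_ids))
    elif mode == "sequential":
        results = [run_one(t) for t in test_ids]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return dict(zip(test_ids, results))


def fake_run(test_id: str) -> bool:
    """Stand-in for a real model call; IDs ending in '_fail' fail."""
    return not test_id.endswith("_fail")


demo = ["simple_01", "parallel_02_fail", "multi_turn_03"]
print(run_tests(demo, fake_run, mode="parallel", max_workers=2))
```

Threads suit cloud evaluation because each test mostly waits on network I/O; a single local Ollama instance gains little from concurrent requests, which is one reason sequential mode is the safer default there.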
Custom Models
Evaluate specific models:
# OpenRouter models
fc-eval --provider openrouter --models openai/gpt-4o anthropic/claude-3.5-sonnet
# Ollama models
fc-eval --provider ollama --models llama3.2 mistral
Multiple Trials (Best of N)
Run multiple trials per test for reliability metrics (default: 3):
fc-eval --provider openrouter --trials 5
A test passes if at least one trial succeeds (Best of N logic). Reliability is reported as the percentage of trials that passed.
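The two numbers can be computed from a list of per-trial outcomes. This is a minimal sketch of the rule as described above; `TrialSummary` and `summarize_trials` are hypothetical names, not FC-Eval's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrialSummary:
    passed: bool        # Best of N: at least one trial succeeded
    reliability: float  # percentage of trials that succeeded


def summarize_trials(outcomes: list[bool]) -> TrialSummary:
    """Apply Best of N and compute reliability for one test case."""
    if not outcomes:
        raise ValueError("at least one trial is required")
    successes = sum(outcomes)
    reliability = round(100.0 * successes / len(outcomes), 1)
    return TrialSummary(passed=successes > 0, reliability=reliability)


print(summarize_trials([True, False, True]))
```

With two successes in three trials the test passes and reliability is 66.7%. A test can therefore pass while still being flaky, which is why both figures are reported.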
Category Filtering
Run only specific test categories:
# Single-turn tests only
fc-eval --provider openrouter --category single_turn
# Multi-turn tests only
fc-eval --provider openrouter --category multi_turn
# Agentic tests only
fc-eval --provider openrouter --category agentic
Custom Output Directory
Save reports to a custom directory:
fc-eval --provider openrouter --output-dir ./my_results
Features
- Dual Backend Support: Test models via OpenRouter (cloud) or Ollama (local)
- 30 Unique Test Cases: Comprehensive coverage across single-turn, multi-turn, and agentic scenarios
- Best of N Trials: Configurable trial count with reliability metrics
- Parallel Execution: Multi-threaded evaluation for faster results
- Comprehensive Reporting: JSON and TXT reports with detailed metrics
- AST-Based Validation: Accurate function call matching using abstract syntax trees
- Category Breakdown: Detailed analysis by test category and subcategory
- Latency Tracking: Performance metrics for each model
Methodology
Test Categories
- Single-Turn (16 tests)
  - Simple function calls
  - Multiple function selection
  - Parallel function calling
  - Parallel multiple functions
  - Relevance detection
- Multi-Turn (8 tests)
  - Base multi-turn conversations
  - Missing parameter handling
  - Missing function scenarios
  - Long context management
- Agentic (6 tests)
  - Web search simulation
  - Memory/state management
  - Format sensitivity
Evaluation Logic
- Best of N: A test passes if at least one of N trials succeeds
- Reliability: Percentage of trials that passed (e.g., 2/3 trials = 66.7% reliability)
- AST Matching: Function calls validated using abstract syntax tree comparison
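To illustrate the idea behind AST matching, the sketch below parses a call written in Python syntax with the standard `ast` module and compares function name and keyword arguments structurally, so argument order and quoting style do not matter. It is an illustration of the general technique only: FC-Eval's actual matcher may differ, and `parse_call` and `calls_match` are hypothetical helpers that handle keyword arguments with literal values.

```python
import ast


def parse_call(source: str) -> tuple[str, dict[str, object]]:
    """Parse 'name(key=value, ...)' into (name, kwargs)."""
    node = ast.parse(source.strip(), mode="eval").body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError(f"not a simple function call: {source!r}")
    if node.args:
        raise ValueError("positional arguments are not supported in this sketch")
    kwargs: dict[str, object] = {}
    for kw in node.keywords:
        if kw.arg is None:
            raise ValueError("**kwargs unpacking is not supported in this sketch")
        # literal_eval accepts an AST node and rejects non-literal expressions.
        kwargs[kw.arg] = ast.literal_eval(kw.value)
    return node.func.id, kwargs


def calls_match(predicted: str, expected: str) -> bool:
    """True if both strings parse to the same function name and arguments."""
    try:
        return parse_call(predicted) == parse_call(expected)
    except (SyntaxError, ValueError):
        return False


print(calls_match(
    'get_weather(city="Paris", unit="celsius")',
    "get_weather(unit='celsius', city='Paris')",
))
```

Comparing parsed structures rather than raw strings avoids false failures from whitespace, quote style, or argument order. Note that plain Python equality treats `1` and `1.0` as equal, so a stricter checker would also compare value types.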
Troubleshooting
Ollama Connection Issues
Problem: Connection refused error when using Ollama provider
Solution:
# Check if Ollama is running
curl http://localhost:11434/api/tags
# If not running, start the server
ollama serve
Model Not Found (Ollama)
Problem: model not found error
Solution:
# List available models
ollama list
# Pull the required model
ollama pull qwen3.5:9b
# Create custom model with Modelfile
ollama create qwen3.5:9b-fc -f qwen3.5-9b-fc.modelfile
OpenRouter API Errors
Problem: 401 Unauthorized or 429 Rate Limited
Solution:
# Verify API key is set
echo $OPENROUTER_API_KEY
# Set API key
export OPENROUTER_API_KEY="your-key-here"
# For rate limits, use sequential mode with fewer trials
fc-eval --provider openrouter --mode sequential --trials 1
Low Accuracy on Local Models
Problem: Local Ollama models show significantly lower accuracy than OpenRouter
Explanation: This is expected due to:
- Quantization: Ollama uses Q4_K_M (4-bit) quantization by default
- System Prompts: OpenRouter may apply additional optimizations
- API Optimizations: Cloud providers may use response format enforcement
Recommendation: Use the custom Modelfile (qwen3.5-9b-fc.modelfile) for best local results, but expect a gap of roughly 60 percentage points versus OpenRouter.
Performance Comparison: OpenRouter vs Ollama
Based on our analysis with Qwen 3.5 9B:
| Metric | OpenRouter (Cloud) | Ollama (Local) | Difference |
|---|---|---|---|
| Accuracy | 83.3% | 22.2% | -61.1 pp |
| Temperature | 0.0 (default) | 1.0 (default) | Critical |
| Avg Latency | ~1600ms | ~8900ms | 5.5x slower |
| Quantization | Unknown (likely F16) | Q4_K_M (4-bit) | Precision loss |
Recommendation: Use OpenRouter for production function-calling tasks requiring high accuracy. Use Ollama for local development, privacy-sensitive applications, or offline scenarios with acceptable accuracy trade-offs.
Files Reference
| File | Description |
|---|---|
| evaluate_fc.py | Main evaluation script |
| qwen3.5-9b-fc.modelfile | Optimized Ollama Modelfile for function calling |
| FUNCTION_CALLING_ACCURACY_ANALYSIS.md | Detailed discrepancy analysis report |
| results/ | Directory containing evaluation reports |
License
MIT License - see LICENSE file for details.
Built with ❤️ by NEO
NEO - A fully autonomous AI Engineer