GitHub - gauravvij/function-calling-cli: A CLI to evaluate multiple LLMs for 30 function calling tests

FC-Eval

Function Calling Evaluation Tool for LLMs
_{Supports OpenRouter (Cloud) and Ollama (Local) backends}

Installation • Quick Start • Usage • Ollama Setup • Methodology

Overview

FC-Eval is a comprehensive CLI tool for evaluating Large Language Models' function-calling capabilities. Inspired by the Berkeley Function Calling Leaderboard (BFCL) v4 methodology, it provides rigorous testing across 30 unique test cases covering single-turn, multi-turn, and agentic scenarios.

Key Features:

🌐 Dual Backend Support: Evaluate models via OpenRouter (cloud) or Ollama (local)
📊 30 Unique Test Cases: Comprehensive coverage across all function-calling scenarios
🔄 Best of N Trials: Configurable trial count with reliability metrics
⚡ Parallel Execution: Multi-threaded evaluation for faster results
📈 Comprehensive Reporting: JSON and TXT reports with detailed metrics
🎯 AST-Based Validation: Accurate function call matching using abstract syntax trees

Installation

Prerequisites

Python 3.10 or higher
For Ollama testing: Linux/macOS/Windows with WSL

Step 1: Clone the Repository

git clone https://github.com/gauravvij/function-calling-cli.git
cd function-calling-cli

Step 2: Install Python Dependencies

# Create a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install the package
pip install -e .

Quick Start

FC-Eval can be run in two ways:

Using the installed CLI (fc-eval) - Supports both OpenRouter and Ollama
Using the standalone script (evaluate_fc.py) - OpenRouter only

Option A: OpenRouter (Cloud) - Easiest

Get an API key at https://openrouter.ai/keys

Set your API key:

export OPENROUTER_API_KEY="your-api-key-here"

Run evaluation using fc-eval:

fc-eval --provider openrouter --models qwen/qwen3.5-9b

Or using the standalone script:

python evaluate_fc.py --models qwen/qwen3.5-9b

Option B: Ollama (Local) - Requires Setup

Install Ollama (see Ollama Setup section)

Create the optimized model:

ollama create qwen3.5:9b-fc -f qwen3.5-9b-fc.modelfile

Run evaluation:

fc-eval --provider ollama --models qwen3.5:9b-fc

Ollama Setup

Installing Ollama

Ollama provides a simple installation script for Linux/macOS:

# Install Ollama (official one-liner)
curl -fsSL https://ollama.com/install.sh | sh

This will:

Download and install the Ollama binary
Set up the Ollama service
Start the Ollama server automatically

Verifying Installation

# Check Ollama is installed
ollama --version

# Verify server is running
curl http://localhost:11434/api/tags

Creating the Custom Modelfile

The project includes an optimized Modelfile (qwen3.5-9b-fc.modelfile) that addresses the temperature and system prompt issues identified in our analysis:

FROM qwen3.5:9b

# System prompt optimized for function calling
SYSTEM You are a helpful AI assistant with access to tools/functions. When you need to perform an action, use the available tools by making function calls. Always respond with the correct function call format when a tool is needed.

# Critical parameters for function calling accuracy
PARAMETER temperature 0.0
PARAMETER top_p 0.9
PARAMETER top_k 10
PARAMETER num_ctx 8192
PARAMETER num_predict 4096

Key Configuration Changes:

Parameter	Default	Optimized	Impact
`temperature`	1.0	0.0	Eliminates randomness for deterministic function calls
`top_p`	0.95	0.9	Slightly more focused sampling
`top_k`	20	10	Reduces token selection variety
`num_ctx`	2048	8192	Larger context window
`num_predict`	-1	4096	Maximum response length

Building the Optimized Model

# Create the custom model from the Modelfile
ollama create qwen3.5:9b-fc -f qwen3.5-9b-fc.modelfile

# Verify the model was created
ollama list

# Inspect model parameters
ollama show qwen3.5:9b-fc

Pulling the Base Model (if needed)

If you don't have the base model:

# Pull the base Qwen 3.5 9B model
ollama pull qwen3.5:9b

# Then create the custom version
ollama create qwen3.5:9b-fc -f qwen3.5-9b-fc.modelfile

Usage

API Key Setup

OpenRouter API Key

FC-Eval requires an OpenRouter API key for cloud-based evaluation.

Option 1: Environment Variable (Recommended)

export OPENROUTER_API_KEY="your-api-key-here"

Add this to your ~/.bashrc or ~/.zshrc for persistence.

Option 2: Command Line Argument

fc-eval --provider openrouter --api-key "your-api-key-here"

Option 3: .env File

Create a .env file in your working directory:

OPENROUTER_API_KEY=your-api-key-here

Get your API key at: https://openrouter.ai/keys

Ollama (Local)

No API key required for Ollama. Ensure the server is running:

# Check if Ollama is running
curl http://localhost:11434/api/tags

Basic Usage

Evaluate with OpenRouter (Cloud)

# Evaluate default models via OpenRouter
fc-eval --provider openrouter

# Evaluate specific models
fc-eval --provider openrouter --models qwen/qwen3.5-9b qwen/qwen3.5-27b

# Run with parallel execution
fc-eval --provider openrouter --mode parallel --max-workers 10

Evaluate with Ollama (Local)

# Evaluate local Ollama models
fc-eval --provider ollama

# Evaluate specific local model
fc-eval --provider ollama --models qwen3.5:9b-fc

# Run with sequential mode (recommended for local testing)
fc-eval --provider ollama --mode sequential

Parallel vs Sequential Execution

Parallel Execution (recommended for cloud):

fc-eval --provider openrouter --mode parallel --max-workers 10

Sequential Execution (recommended for local/debugging):

fc-eval --provider ollama --mode sequential

Custom Models

Evaluate specific models:

# OpenRouter models
fc-eval --provider openrouter --models openai/gpt-4o anthropic/claude-3.5-sonnet

# Ollama models
fc-eval --provider ollama --models llama3.2 mistral

Multiple Trials (Best of N)

Run multiple trials per test for reliability metrics (default: 3):

fc-eval --provider openrouter --trials 5

A test passes if at least one trial succeeds (Best of N logic). Reliability is reported as the percentage of trials that passed.

Category Filtering

Run only specific test categories:

# Single-turn tests only
fc-eval --provider openrouter --category single_turn

# Multi-turn tests only
fc-eval --provider openrouter --category multi_turn

# Agentic tests only
fc-eval --provider openrouter --category agentic

Custom Output Directory

Save reports to a custom directory:

fc-eval --provider openrouter --output-dir ./my_results

Features

Dual Backend Support: Test models via OpenRouter (cloud) or Ollama (local)
30 Unique Test Cases: Comprehensive coverage across single-turn, multi-turn, and agentic scenarios
Best of N Trials: Configurable trial count with reliability metrics
Parallel Execution: Multi-threaded evaluation for faster results
Comprehensive Reporting: JSON and TXT reports with detailed metrics
AST-Based Validation: Accurate function call matching using abstract syntax trees
Category Breakdown: Detailed analysis by test category and subcategory
Latency Tracking: Performance metrics for each model

Methodology

Test Categories

Single-Turn (16 tests)
- Simple function calls
- Multiple function selection
- Parallel function calling
- Parallel multiple functions
- Relevance detection
Multi-Turn (8 tests)
- Base multi-turn conversations
- Missing parameter handling
- Missing function scenarios
- Long context management
Agentic (6 tests)
- Web search simulation
- Memory/state management
- Format sensitivity

Evaluation Logic

Best of N: A test passes if at least one of N trials succeeds
Reliability: Percentage of trials that passed (e.g., 2/3 trials = 66.7% reliability)
AST Matching: Function calls validated using abstract syntax tree comparison

Troubleshooting

Ollama Connection Issues

Problem: Connection refused error when using Ollama provider

Solution:

# Check if Ollama is running
curl http://localhost:11434/api/tags

# If not running, start the server
ollama serve

Model Not Found (Ollama)

Problem: model not found error

Solution:

# List available models
ollama list

# Pull the required model
ollama pull qwen3.5:9b

# Create custom model with Modelfile
ollama create qwen3.5:9b-fc -f qwen3.5-9b-fc.modelfile

OpenRouter API Errors

Problem: 401 Unauthorized or 429 Rate Limited

Solution:

# Verify API key is set
echo $OPENROUTER_API_KEY

# Set API key
export OPENROUTER_API_KEY="your-key-here"

# For rate limits, use sequential mode with fewer workers
fc-eval --provider openrouter --mode sequential --trials 1

Low Accuracy on Local Models

Problem: Local Ollama models show significantly lower accuracy than OpenRouter

Explanation: This is expected due to:

Quantization: Ollama uses Q4_K_M (4-bit) quantization by default
System Prompts: OpenRouter may apply additional optimizations
API Optimizations: Cloud providers may use response format enforcement

Recommendation: Use the custom Modelfile (qwen3.5-9b-fc.modelfile) for best local results, but expect ~60 percentage point gap vs OpenRouter.

Performance Comparison: OpenRouter vs Ollama

Based on our analysis with Qwen 3.5 9B:

Metric	OpenRouter (Cloud)	Ollama (Local)	Difference
Accuracy	83.3%	22.2%	-61.1 pp
Temperature	0.0 (default)	1.0 (default)	Critical
Avg Latency	~1600ms	~8900ms	5.5x slower
Quantization	Unknown (likely F16)	Q4_K_M (4-bit)	Precision loss

Recommendation: Use OpenRouter for production function-calling tasks requiring high accuracy. Use Ollama for local development, privacy-sensitive applications, or offline scenarios with acceptable accuracy trade-offs.

Files Reference

File	Description
`evaluate_fc.py`	Main evaluation script
`qwen3.5-9b-fc.modelfile`	Optimized Ollama Modelfile for function calling
`FUNCTION_CALLING_ACCURACY_ANALYSIS.md`	Detailed discrepancy analysis report
`results/`	Directory containing evaluation reports

License

MIT License - see LICENSE file for details.

_{Built with ❤️ by NEO}

NEO - A fully autonomous AI Engineer