Benchmark LLM vision + tool-use capabilities on Neuralink's cursor control task.
## Overview
At Neuralink, a game called Webgrid tests how precisely users can control a cursor. This benchmark evaluates LLMs on the same task: the model sees a screenshot of a grid with one blue target cell and uses tools (`screen`, `mouse_move`, `mouse_click`) to navigate the cursor to the target and click.
## Example Replay

gemini-3-flash-preview on a 30×30 grid: 4 correct clicks, 3 misclicks, 0.16 BPS (1 NTPM) in a 70-second run.

## Human Baseline
For comparison: Neuralink's eighth clinical trial participant achieved 10.39 BPS controlling his computer with his brain; the highest mouse-based score reported is 17.1 BPS on a 35×35 grid, set by a Neuralink employee.
## Metrics
The goal is to click targets on the grid as quickly as possible while minimizing misclicks. Score is measured in bits per second (BPS), derived from net correct clicks (NTPM) and grid size:

- **NTPM**: net correct clicks = correct − incorrect
- **BPS**: `max((NTPM / 60) * log2(N - 1), 0)`, where N is the number of grid cells (e.g., 900 for a 30×30 grid)

Verified against the Neuralink Webgrid frontend source:

```js
// f = NTPM, t = grid side length (so t * t = total cells)
function E(f, t) { return Math.max(Math.log2(t * t - 1) * f / 60, 0) }
```
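The same formula translates directly to Python; a minimal sketch:

```python
import math

def bps(ntpm: int, grid_side: int) -> float:
    """Bits per second: max(log2(cells - 1) * ntpm / 60, 0).

    Mirrors the Webgrid frontend formula above; `ntpm` is net
    correct clicks (correct - incorrect) and `grid_side` is the
    grid edge length (30 for a 30x30 grid).
    """
    cells = grid_side * grid_side
    return max(math.log2(cells - 1) * ntpm / 60, 0)

print(round(bps(5, 30), 2))  # → 0.82 (matches the 5-NTPM rounds below)
print(round(bps(1, 30), 2))  # → 0.16
```

Note that a negative net score clamps to 0 rather than going below zero.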
## Benchmark Results

Results from 10 rounds on the browser-based eval (`make play`, 30×30 grid, 991px canvas, 70s, fullscreen):
| Model | Modality | Grid | Canvas | Round | NTPM | BPS |
|---|---|---|---|---|---|---|
| claude-4.6-opus (computer use) | Browser click | 30×30 | 991px | 1 | 5 | 0.82 |
| claude-4.6-opus (computer use) | Browser click | 30×30 | 991px | 2 | 5 | 0.82 |
| claude-4.6-opus (computer use) | Browser click | 30×30 | 991px | 3 | 5 | 0.82 |
| claude-4.6-opus (computer use) | Browser click | 30×30 | 991px | 4 | 7 | 1.14 |
| claude-4.6-opus (computer use) | Browser click | 30×30 | 991px | 5 | 7 | 1.14 |
| claude-4.6-opus (computer use) | Browser click | 30×30 | 991px | 6 | 5 | 0.82 |
| claude-4.6-opus (computer use) | Browser click | 30×30 | 991px | 7 | 2 | 0.33 |
| claude-4.6-opus (computer use) | Browser click | 30×30 | 991px | 8 | 6 | 0.98 |
| claude-4.6-opus (computer use) | Browser click | 30×30 | 991px | 9 | 3 | 0.49 |
| claude-4.6-opus (computer use) | Browser click | 30×30 | 991px | 10 | 4 | 0.65 |
| **Avg** | | | | | 4.9 | 0.80 |
Comparison with other players:
| Player | Method | Grid | Best BPS | Avg BPS |
|---|---|---|---|---|
| Bliss Chapman | Mouse | 35×35 | 17.10 | — |
| Neuralink P8 | N1 Brain Implant | 30×30 | 10.39 | — |
| claude-4.6-opus | Computer use (browser click) | 30×30 | 1.14 | 0.80 |
| gemini-3-flash-preview | API tool pipeline | 30×30 | 0.16 | ~0.16 |
## Quick Start

### Installation

```sh
git clone git@github.com:ofou/webgrid_eval.git
cd webgrid_eval
make install-dev
```

### Play the game (default eval mode)

```sh
make play
# Open http://localhost:8000 in your browser (F11 for fullscreen)
```

### Run API-based evaluation (requires an LLM API key)

```sh
# 1. Start the API server
make dev

# 2. In another terminal, run the evaluation
make eval ARGS="configs/openrouter.yaml"
```
## Usage

### Browser Game (default eval)

```sh
# Start the game (30×30 grid, 991px canvas, Neuralink-identical UI)
make play
# Open http://localhost:8000 → F11 for fullscreen → click blue cells
```

Results are logged to `results/web_games.json`.
### Configure Models (API eval)

Create a YAML configuration file (see `configs/` for examples):

```yaml
# configs/my_models.yaml
base_url: https://openrouter.ai/api/v1
grid_size: 64      # 8×8 grid (64 cells)
canvas_size: 256   # screenshot size in pixels
max_seconds: 70    # evaluation duration per model
models:
  - google/gemini-3-flash-preview
  - qwen/qwen3-vl-235b-a22b-instruct
```

Available configs:

- `configs/openrouter.yaml` - OpenRouter API (many models)
- `configs/google.yaml` - Google AI API (Gemini models)
- `configs/local.yaml` - Local LLM server (e.g., LM Studio, Ollama)
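As a quick sanity check on `grid_size` choices: a correct click on an N-cell grid is worth `log2(N - 1)` bits under the BPS formula in the Metrics section, so smaller grids cap the achievable score per click. A short illustration:

```python
import math

# Per-click information content for the grid sizes used in this README:
# 64 cells (the 8x8 example config) vs. 900 cells (the default 30x30 grid).
for cells in (64, 900):
    print(f"{cells} cells -> {math.log2(cells - 1):.2f} bits per correct click")
```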
### Run Evaluation

```sh
# Run with a config file
make eval ARGS="configs/openrouter.yaml"

# With custom duration (seconds)
make eval ARGS="configs/openrouter.yaml --seconds 120"

# Cap images per API request (for models with limits)
make eval ARGS="configs/openrouter.yaml --max-images 8"
```
### API Endpoints

When the server is running (`make dev`):

- `GET /health` - Health check
- `POST /api/session/start` - Run a single-model evaluation
- `POST /api/eval/run` - Run a batch evaluation (multiple models)
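A minimal client sketch, assuming the server runs on `localhost:8000`; the JSON body field for `/api/session/start` is an illustrative assumption, not a documented schema:

```python
import json
import urllib.request

BASE = "http://localhost:8000"

def build_request(method, path, body=None):
    """Build (but do not send) a request against the eval server."""
    data = json.dumps(body).encode() if body is not None else None
    return urllib.request.Request(
        f"{BASE}{path}", data=data, method=method,
        headers={"Content-Type": "application/json"},
    )

health = build_request("GET", "/health")
# "model" is a hypothetical body field, shown only for illustration.
start = build_request("POST", "/api/session/start",
                      {"model": "google/gemini-3-flash-preview"})

# Send with urllib.request.urlopen(health) once the server is up (make dev).
print(health.full_url, start.get_method())
```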
### Generate Replay GIFs

```sh
# Generate GIFs for all evaluation results
make gif

# Or for a specific evaluation folder
make gif ARGS="eval/model-name"
```
## Tools

The LLM agent has access to three tools:

| Tool | Description |
|---|---|
| `screen` | Returns the current HUD + screenshot (like looking at your monitor) |
| `mouse_move` | Move the cursor by (dx, dy) pixels; positive dx = right, positive dy = down |
| `mouse_click` | Click at the current cursor position |
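The tool interface above can be exercised against a toy environment. The sketch below uses the same three tool names, but the environment itself is a simplified stand-in, not the benchmark's actual implementation:

```python
import random

class ToyWebgrid:
    """Minimal stand-in for the eval environment: a square canvas
    divided into grid cells, with one target cell at a time."""

    def __init__(self, grid_side=30, canvas=991):
        self.grid_side, self.canvas = grid_side, canvas
        self.cell = canvas / grid_side
        self.x = self.y = canvas / 2          # cursor starts at center
        self.correct = self.incorrect = 0
        self._new_target()

    def _new_target(self):
        self.target = (random.randrange(self.grid_side),
                       random.randrange(self.grid_side))

    def screen(self):
        # The real tool returns HUD + screenshot; here we report raw state.
        return {"cursor": (self.x, self.y), "target_cell": self.target}

    def mouse_move(self, dx, dy):
        # Positive dx moves right, positive dy moves down; clamp to canvas.
        self.x = min(max(self.x + dx, 0), self.canvas - 1)
        self.y = min(max(self.y + dy, 0), self.canvas - 1)

    def mouse_click(self):
        col, row = int(self.x // self.cell), int(self.y // self.cell)
        if (col, row) == self.target:
            self.correct += 1
            self._new_target()
        else:
            self.incorrect += 1

env = ToyWebgrid()
tc, tr = env.screen()["target_cell"]
# A perfect "agent": move to the target cell's center, then click.
env.mouse_move((tc + 0.5) * env.cell - env.x, (tr + 0.5) * env.cell - env.y)
env.mouse_click()
print(env.correct, env.incorrect)  # → 1 0
```

An LLM agent differs only in that it must infer the target cell from the screenshot rather than reading it from state, which is exactly what this benchmark measures.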
## Citation

If you use this software in your research, please cite:

```bibtex
@software{olivares2026webgrid,
  author = {Olivares Urrutia, Omar},
  title = {{Webgrid Eval: Benchmark for LLM Vision and Tool-Use Capabilities}},
  year = {2026},
  month = feb,
  url = {https://github.com/ofou/webgrid_eval},
}
```
## Acknowledgments

- Inspired by Neuralink's Webgrid

## Contributing

Contributions are welcome!
