Introducing WP-Bench: A WordPress AI Benchmark


How well do language models actually understand WordPress? To answer this, we’re introducing WP-Bench – the official WordPress AI benchmark.

WP-Bench evaluates how well AI models understand WordPress development, from core APIs and coding standards to plugin architecture and security best practices.

Why WP-Bench Matters

WordPress powers over 40% of the web, yet AI models are typically evaluated on general programming tasks. WP-Bench fills this gap by measuring WordPress-specific capabilities.

Understanding today’s models. Whether you’re building AI-powered plugins or using coding assistants, knowing which models excel at WordPress helps you make better tooling decisions.

Shaping tomorrow’s models. We want WP-Bench to become a standard benchmark that AI labs use when developing new models. When providers like OpenAI, Anthropic, and Google run pre-release evaluations, we want WordPress performance on their radar – not as an afterthought. This creates incentive to optimize for the millions of developers and site owners who depend on WordPress.

Building an open source leaderboard. We’re working toward a public leaderboard tracking model performance on WordPress tasks. This will provide transparent results for the community, inform how the WordPress project engages with AI providers, and help developers choose the right tools for their projects.

How It Works

WP-Bench measures AI capabilities across two dimensions:

  • Knowledge – Multiple-choice questions testing WordPress concepts, APIs, hooks, security patterns, and coding standards, with an emphasis on modern additions like the Abilities API and Interactivity API.
  • Execution – Code generation tasks graded by a real WordPress runtime with static analysis and runtime assertions.
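As an illustration, a knowledge-style item could look something like the sketch below. The field names and format here are assumptions for illustration only, not WP-Bench’s actual test-case schema:

```yaml
# Hypothetical test-case format -- field names are illustrative, not
# the real WP-Bench schema.
id: hooks-priority-001
type: knowledge
question: >
  In add_action( 'init', 'my_callback', 5 ), what does the third
  argument control?
choices:
  - The number of arguments passed to the callback
  - The priority in which the callback runs
  - The number of times the callback may fire
  - The WordPress version the hook requires
answer: 1  # zero-indexed: priority determines callback ordering
```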

The benchmark uses WordPress itself as the grader, running generated code in a sandboxed environment. This ensures we measure both theoretical understanding and practical ability to produce working, standards-compliant code.
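To make the execution dimension concrete, here is a minimal sketch of how a grader might combine static analysis with a runtime assertion. All function names and checks are illustrative assumptions, not WP-Bench’s actual grading code:

```python
# Hypothetical sketch of an execution-style grader: a static pass over
# the generated PHP source plus a comparison against expected runtime
# output. Names and heuristics are illustrative only.
import re

def static_check(php_source: str) -> list[str]:
    """Flag a couple of classic WordPress security issues."""
    issues = []
    # Echoing a superglobal without esc_html()/esc_attr() escaping.
    if re.search(r"echo\s+\$_(GET|POST|REQUEST)", php_source):
        issues.append("unescaped superglobal output")
    # Raw SQL via $wpdb->query() with no $wpdb->prepare() in sight.
    if "$wpdb->query(" in php_source and "$wpdb->prepare(" not in php_source:
        issues.append("unprepared SQL query")
    return issues

def grade(php_source: str, runtime_output: str, expected: str) -> dict:
    """Pass only if static analysis is clean AND runtime output matches."""
    issues = static_check(php_source)
    return {"passed": not issues and runtime_output == expected,
            "issues": issues}

print(grade('echo $_GET["name"];', "", "Hello"))
```

In a real harness the runtime output would come from executing the code inside the sandboxed WordPress instance rather than being passed in directly.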

Current State & Known Limitations

WP-Bench is an early release, and we’re being transparent about where it needs work:

  • Dataset size – The current test suite is relatively small. We need more test cases across WordPress APIs and patterns to make this a comprehensive benchmark.
  • Version coverage – The benchmark currently skews toward WordPress 6.9 features like the Abilities API and Interactivity API. This is partly intentional (newer APIs are where models genuinely struggle) but also creates a bias since these features post-date most models’ training data. We need more coverage of established WordPress patterns to create a balanced evaluation.
  • Benchmark saturation – Early testing showed models scoring very high on older WordPress concepts, which means those questions aren’t providing strong signal. The challenge is finding problems that are genuinely difficult, not just new.

These limitations are exactly why we’re releasing now rather than waiting. We know that the WordPress community is uniquely positioned to help build a robust, representative benchmark.

Quick Start

# Install
python3 -m venv .venv && source .venv/bin/activate
pip install -e ./python

# Start the WordPress runtime
cd runtime && npm install && npm start

# Run the benchmark
cd .. && wp-bench run --config wp-bench.example.yaml

Configure your model provider API keys in a .env file, and results are written to output/results.json. The harness supports running multiple models in a single pass for easy comparison.
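A multi-model run config might look something like the sketch below. The field names are assumptions for illustration; consult wp-bench.example.yaml for the actual schema:

```yaml
# Illustrative config sketch -- field names are hypothetical, not the
# real wp-bench.example.yaml schema. Provider API keys live in .env.
models:
  - provider: openai
    name: gpt-4o
  - provider: anthropic
    name: claude-sonnet-4
output: output/results.json
```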

Supporting the AI Building Blocks

WP-Bench complements the other AI Building Blocks for WordPress by measuring how well AI models work with WordPress. As we build out the Abilities API, MCP Adapter, and other infrastructure, a standardized benchmark helps ensure these tools integrate with the best available models.

Get Involved

WP-Bench needs your help. The benchmark is only as good as its test cases, and the WordPress community has decades of collective knowledge about what makes WordPress development challenging.

Ways to contribute:

  • Add test cases – Know a tricky WordPress pattern that trips up developers? It probably trips up AI too. We need coverage across more APIs, hooks, and real-world scenarios.
  • Run benchmarks – Test models you’re using and share your findings.
  • Improve grading logic – Help make the evaluation more rigorous.
  • Submit results – Contribute to the public leaderboard.

If you work at an AI lab, we’d love to collaborate on integrating WP-Bench into your evaluation pipeline.


Our goal is for WP-Bench to become the standard evaluation AI providers use when releasing new models – creating a virtuous cycle where WordPress performance improves with each generation. Join us in #core-ai to discuss, share results, and help shape the future of AI in WordPress.

Props to @jason_the_adams for leading development on WP-Bench.

#ai-building-blocks, #core-ai