Table of Contents
- Why Cobalt
- Quickstart
- Core Concepts
- AI-First
- CI/CD
- Integrations
- Configuration
- CLI
- Roadmap
- Contributing
- License
Why Cobalt
Cobalt is a TypeScript testing framework built for AI agents and LLM-powered applications. Define datasets, run your agent, and evaluate outputs with LLM judges, custom functions, or pre-built evaluators — all from the command line. Results are tracked in SQLite with built-in comparison tools, cost estimation, and CI/CD quality gates. Cobalt ships with an MCP server so AI coding assistants can run experiments and improve your agents directly.
Quickstart
```bash
npm install @basalt-ai/cobalt
npx cobalt init
npx cobalt run
```
That's it — cobalt init creates an example experiment and runs it out of the box.
Now write your own. Create experiments/my-agent.cobalt.ts:
```ts
import { experiment, Dataset, Evaluator } from '@basalt-ai/cobalt'

// Define your test data
const dataset = new Dataset({
  items: [
    { input: 'What is 2+2?', expectedOutput: '4' },
    { input: 'Capital of France?', expectedOutput: 'Paris' },
  ],
})

const evaluators = [
  new Evaluator({
    name: 'Correctness',
    type: 'llm-judge',
    prompt: 'Is the output correct?\nExpected: {{expectedOutput}}\nActual: {{output}}',
  }),
]

// Run your agent and evaluate
experiment('qa-agent', dataset, async ({ item }) => {
  const result = await myAgent(item.input)
  return { output: result }
}, { evaluators })
```
```bash
npx cobalt run --file experiments/my-agent.cobalt.ts
```
Core Concepts
Dataset
Your test data. Load from JSON, JSONL, CSV, inline objects, or pull directly from platforms like Langfuse, LangSmith, Braintrust, and Basalt. Datasets are immutable and chainable — transform them with filter(), map(), sample(), and slice().
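For example, chaining transforms might look like the sketch below; the constructor shape comes from the Quickstart, while the filter callback signature is an assumption:

```ts
import { Dataset } from '@basalt-ai/cobalt'

const base = new Dataset({
  items: [
    { input: 'What is 2+2?', expectedOutput: '4' },
    { input: 'Capital of France?', expectedOutput: 'Paris' },
    { input: 'Summarize this article.', expectedOutput: 'A summary.' },
  ],
})

// Each transform returns a new Dataset; `base` is never mutated.
// The callback shape for filter() is assumed, not confirmed API.
const questions = base
  .filter((item) => item.input.endsWith('?'))
  .sample(2) // random subset of 2 items
```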
Evaluator
Scores your agent's output. Four built-in types: LLM judge (boolean pass/fail or 0-1 scale), custom functions (write your own logic), semantic similarity (cosine/dot product), and Autoevals (11 battle-tested evaluators from Braintrust). Extend with plugins for domain-specific evaluators.
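A custom-function evaluator might look like the minimal sketch below; it assumes a type: 'custom' option with an evaluate callback, which are guesses at the real option names:

```ts
import { Evaluator } from '@basalt-ai/cobalt'

// Hypothetical shape: `type: 'custom'` and the `evaluate` callback are
// assumptions, not confirmed API. Scores 1 for an exact match, else 0.
const exactMatch = new Evaluator({
  name: 'ExactMatch',
  type: 'custom',
  evaluate: ({ output, item }) => (output === item.expectedOutput ? 1 : 0),
})
```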
Experiment
The core loop. An experiment runs your agent against every item in a dataset, evaluates each output, and produces a structured report with per-evaluator statistics (avg, min, max, p50, p95, p99). Supports parallel execution, multiple runs with aggregation, timeouts, and CI thresholds.
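Continuing the Quickstart example, the options object might combine those knobs as in this sketch; concurrency and timeout reuse names from the configuration table below, while runs and thresholds are assumed option names:

```ts
import { experiment } from '@basalt-ai/cobalt'

// `dataset`, `evaluators`, and `myAgent` come from the Quickstart example.
experiment('qa-agent', dataset, async ({ item }) => {
  return { output: await myAgent(item.input) }
}, {
  evaluators,
  concurrency: 5,  // parallel execution (name from the configuration table)
  timeout: 30_000, // per-item timeout in ms
  runs: 3,         // multiple runs with aggregated stats (assumed name)
  thresholds: { Correctness: 0.9 }, // CI quality gate (assumed shape)
})
```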
AI-First
Cobalt is built for AI-assisted development. Connect the MCP server, and your AI coding assistant can run experiments, analyze failures, and iterate on your agent — all from a single conversation.
Get started in 30 seconds:
- Add the Cobalt MCP server to your assistant config
- Ask it to run your experiments
- Let it analyze failures and suggest improvements
Example Prompts
"Compare gpt 5.1 and 5.2 on my agent and tell me wich one is the best"
"Run my QA experiment and tell me which test cases are failing"
"Generate a Cobalt experiment for my agent at
src/agents/summarizer.ts"
"Compare my last two runs and check for regressions"
"My agent is scoring 60% on correctness. Analyze the failures and suggest code fixes"
MCP Server
The built-in MCP server gives Claude Code (and other MCP clients) direct access to your experiments:
```json
{
  "mcpServers": {
    "cobalt": {
      "command": "npx",
      "args": ["cobalt", "mcp"]
    }
  }
}
```

| Tools | Resources | Prompts |
|---|---|---|
| cobalt_run — Run experiments | cobalt://config — Current config | improve-agent — Analyze failures |
| cobalt_results — View results | cobalt://experiments — List experiments | generate-tests — Add test cases |
| cobalt_compare — Diff two runs | cobalt://latest-results — Latest results | regression-check — Detect regressions |
| cobalt_generate — Generate experiments | | |
Skills
cobalt init generates a .cobalt/SKILLS.md file and integrates with your AI instruction files (CLAUDE.md, AGENTS.md, .github/copilot-instructions.md) so your assistant knows how to use Cobalt from day one. After upgrading the SDK, run cobalt update to regenerate the skills file and check for updates.
CI/CD
Cobalt is built to run in your CI pipeline. Define quality thresholds for your agents, and Cobalt will enforce them on every commit — ensuring your AI systems stay reliable over time, not just at launch.
GitHub Action
The easiest way to integrate Cobalt into your CI. Runs experiments, posts rich PR comments with score tables, auto-compares against the base branch, and optionally generates AI-powered analysis.
```yaml
- uses: basalt-ai/cobalt@v1
  with:
    api_key: ${{ secrets.OPENAI_API_KEY }}
```
CLI
For any CI provider, use the CLI directly with --ci to enforce quality thresholds:
```bash
npx cobalt run --ci
# Exit code 1 if any threshold is violated
```

Define thresholds per evaluator, latency, cost, or overall score — Cobalt catches regressions before they reach production.
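The thresholds config option (see the table below) is the natural place to declare these gates; its exact schema isn't shown in this README, so every key in the following cobalt.config.ts sketch is an assumption:

```ts
// cobalt.config.ts
import { defineConfig } from '@basalt-ai/cobalt'

export default defineConfig({
  // Hypothetical schema: key names and units are assumptions, not
  // confirmed API. See the Full Configuration Reference for the real shape.
  thresholds: {
    Correctness: 0.9, // minimum average score for the Correctness evaluator
    latency: 2_000,   // maximum per-item latency in ms
    cost: 0.05,       // maximum cost per run in USD
  },
})
```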
Integrations
Load datasets from your existing evaluation platforms:
| Platform | Loader | Docs |
|---|---|---|
| Langfuse | Dataset.fromLangfuse('dataset-name') | Setup → |
| LangSmith | Dataset.fromLangsmith('dataset-name') | Setup → |
| Braintrust | Dataset.fromBraintrust('project', 'dataset') | Setup → |
| Basalt | Dataset.fromBasalt('dataset-id') | Setup → |
File formats: JSON, JSONL, CSV, HTTP/HTTPS remote URLs.
LLM providers: OpenAI and Anthropic (auto-detected from model name).
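Loading from a file or remote URL might look like the sketch below; Dataset.fromFile is a hypothetical name, since the README lists the supported formats but not the factory methods:

```ts
import { Dataset } from '@basalt-ai/cobalt'

// Hypothetical loader: `Dataset.fromFile` is an assumed name used only to
// illustrate file- and URL-based loading. Check the docs for the real API.
const local = Dataset.fromFile('./data/qa.jsonl')
const remote = Dataset.fromFile('https://example.com/datasets/qa.csv')
```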
Configuration
```ts
// cobalt.config.ts
import { defineConfig } from '@basalt-ai/cobalt'

export default defineConfig({
  testDir: './experiments',
  judge: { model: 'gpt-5-mini', provider: 'openai' },
  concurrency: 5,
  timeout: 30_000,
  cache: { enabled: true, ttl: '7d' },
})
```
| Option | Default | Description |
|---|---|---|
| testDir | './experiments' | Experiment file directory |
| judge.model | 'gpt-5-mini' | Default LLM judge model |
| concurrency | 5 | Max parallel executions |
| timeout | 30000 | Per-item timeout (ms) |
| reporters | ['cli', 'json'] | Output reporters |
| cache.ttl | '7d' | LLM response cache TTL |
| plugins | [] | Custom evaluator plugins |
| thresholds | -- | CI quality gates |
Full Configuration Reference →
CLI
```bash
cobalt run <file|dir>        # Run experiments
cobalt init                  # Initialize project
cobalt update                # Update skills file & check for SDK updates
cobalt history               # View past runs
cobalt compare <id1> <id2>   # Compare two runs
cobalt serve                 # Start dashboard
cobalt clean                 # Clean cache/results
cobalt mcp                   # Start MCP server
```
Roadmap
Cobalt is open source and community-driven. The roadmap is shaped by what you need — tell us what matters to you.
| Status | Feature |
|---|---|
| ✅ | Core experiment runner, evaluators, datasets, CLI |
| ✅ | MCP server for AI-assisted testing |
| ✅ | CI mode with quality thresholds |
| ✅ | Plugin system & Autoevals integration |
| 🚧 | Vibe code your test reports - Vibe-coded dashboard UI you can customize to your taste |
| ✅ | GitHub Action - First-class CI integration with PR comments |
| 🚧 | Tracing - Full agent tracing to give evaluations more context |
| 🔮 | Python version - Bring Cobalt to the Python ecosystem |
| 🔮 | VS Code extension - Run experiments from your editor |
| 🔮 | More integrations - Integrations with frameworks like Mastra or LangChain |
| 🔮 | Multi-platform export - Push results to BigQuery, Snowflake or other tools |
Contributing
We welcome contributions! See our Contributing Guide for development setup, code standards, and PR process.
- Report bugs: Open an issue
- Suggest features: GitHub Issues
License
MIT — see LICENSE for details.