basalt-ai/cobalt

Why Cobalt

Cobalt is a TypeScript testing framework built for AI agents and LLM-powered applications. Define datasets, run your agent, and evaluate outputs with LLM judges, custom functions, or pre-built evaluators — all from the command line. Results are tracked in SQLite with built-in comparison tools, cost estimation, and CI/CD quality gates. Cobalt ships with an MCP server so AI coding assistants can run experiments and improve your agents directly.

Quickstart

npm install @basalt-ai/cobalt
npx cobalt init
npx cobalt run

That's it: cobalt init scaffolds an example experiment, and cobalt run executes it out of the box.

Now write your own. Create experiments/my-agent.cobalt.ts:

import { experiment, Dataset, Evaluator } from '@basalt-ai/cobalt'

// Define your test data
const dataset = new Dataset({
  items: [
    { input: 'What is 2+2?', expectedOutput: '4' },
    { input: 'Capital of France?', expectedOutput: 'Paris' },
  ],
})

// Define how outputs are scored
const evaluators = [
  new Evaluator({
    name: 'Correctness',
    type: 'llm-judge',
    prompt: 'Is the output correct?\nExpected: {{expectedOutput}}\nActual: {{output}}',
  }),
]

// Run your agent and evaluate
experiment('qa-agent', dataset, async ({ item }) => {
  const result = await myAgent(item.input)
  return { output: result }
}, { evaluators })

Then run it:

npx cobalt run --file experiments/my-agent.cobalt.ts

Core Concepts

Dataset

Your test data. Load from JSON, JSONL, CSV, inline objects, or pull directly from platforms like Langfuse, LangSmith, Braintrust, and Basalt. Datasets are immutable and chainable — transform them with filter(), map(), sample(), and slice().
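
A minimal sketch of chaining transforms on an inline dataset (the filter callback signature is an assumption based on the item shape from the quickstart):

import { Dataset } from '@basalt-ai/cobalt'

const full = new Dataset({
  items: [
    { input: 'What is 2+2?', expectedOutput: '4' },
    { input: 'Capital of France?', expectedOutput: 'Paris' },
    { input: 'What is 10*10?', expectedOutput: '100' },
  ],
})

// Each transform returns a new Dataset; full is left untouched.
const subset = full
  .filter((item) => /\d/.test(item.input)) // keep only the math items (assumed callback signature)
  .sample(2)                               // assumed: random subset of 2 items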

Read the Dataset docs →

Evaluator

Scores your agent's output. Four built-in types: LLM judge (boolean pass/fail or 0-1 scale), custom functions (write your own logic), semantic similarity (cosine/dot product), and Autoevals (11 battle-tested evaluators from Braintrust). Extend with plugins for domain-specific evaluators.
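
As a sketch, a custom-function evaluator could look like the following (the type value and callback signature are assumptions, not the documented API; see the Evaluator docs for the real shape):

import { Evaluator } from '@basalt-ai/cobalt'

// Hypothetical custom evaluator: scores 1 when the output matches exactly.
const exactMatch = new Evaluator({
  name: 'Exact match',
  type: 'custom',                      // assumed type name
  evaluate: ({ output, item }) => ({   // assumed callback signature
    score: output === item.expectedOutput ? 1 : 0,
  }),
})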

Read the Evaluator docs →

Experiment

The core loop. An experiment runs your agent against every item in a dataset, evaluates each output, and produces a structured report with per-evaluator statistics (avg, min, max, p50, p95, p99). Supports parallel execution, multiple runs with aggregation, timeouts, and CI thresholds.
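
A sketch of an experiment with run options (only the call signature and evaluators come from the quickstart; the remaining option names are assumptions for illustration):

experiment('qa-agent', dataset, async ({ item }) => {
  return { output: await myAgent(item.input) }
}, {
  evaluators,       // from the quickstart
  // Assumed option names, shown only to illustrate the capabilities above:
  concurrency: 5,   // run dataset items in parallel
  runs: 3,          // repeat the experiment and aggregate scores
  timeout: 30_000,  // per-item timeout in ms
})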

Read the Experiment docs →

AI-First

Cobalt is built for AI-assisted development. Connect the MCP server, and your AI coding assistant can run experiments, analyze failures, and iterate on your agent — all from a single conversation.

Get started in 30 seconds:

  1. Add the Cobalt MCP server to your assistant config
  2. Ask it to run your experiments
  3. Let it analyze failures and suggest improvements

Example Prompts

"Compare gpt 5.1 and 5.2 on my agent and tell me wich one is the best"

"Run my QA experiment and tell me which test cases are failing"

"Generate a Cobalt experiment for my agent at src/agents/summarizer.ts"

"Compare my last two runs and check for regressions"

"My agent is scoring 60% on correctness. Analyze the failures and suggest code fixes"

MCP Server

The built-in MCP server gives Claude Code (and other MCP clients) direct access to your experiments:

{
  "mcpServers": {
    "cobalt": {
      "command": "npx",
      "args": ["cobalt", "mcp"]
    }
  }
}

Tools:
cobalt_run — Run experiments
cobalt_results — View results
cobalt_compare — Diff two runs
cobalt_generate — Generate experiments

Resources:
cobalt://config — Current config
cobalt://experiments — List experiments
cobalt://latest-results — Latest results

Prompts:
improve-agent — Analyze failures
generate-tests — Add test cases
regression-check — Detect regressions

Read the MCP docs →

Skills

cobalt init generates a .cobalt/SKILLS.md file and integrates with your AI instruction files (CLAUDE.md, AGENTS.md, .github/copilot-instructions.md) so your assistant knows how to use Cobalt from day one. After upgrading the SDK, run cobalt update to regenerate the skills file and check for updates.

Read the Skills docs →

CI/CD

Cobalt is built to run in your CI pipeline. Define quality thresholds for your agents, and Cobalt will enforce them on every commit — ensuring your AI systems stay reliable over time, not just at launch.

GitHub Action

The easiest way to integrate Cobalt into your CI. Runs experiments, posts rich PR comments with score tables, auto-compares against the base branch, and optionally generates AI-powered analysis.

- uses: basalt-ai/cobalt@v1
  with:
    api_key: ${{ secrets.OPENAI_API_KEY }}

Read the GitHub Action docs →

CLI

For any CI provider, use the CLI directly with --ci to enforce quality thresholds:

npx cobalt run --ci
# Exit code 1 if any threshold is violated

Define thresholds per evaluator, latency, cost, or overall score — Cobalt catches regressions before they reach production.
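
A sketch of what that could look like in cobalt.config.ts (the thresholds key appears in the configuration reference below, but its exact shape here is an assumption):

// cobalt.config.ts
import { defineConfig } from '@basalt-ai/cobalt'

export default defineConfig({
  // The shape of thresholds is an assumption for illustration:
  thresholds: {
    Correctness: 0.8,   // fail CI if the Correctness evaluator averages below 0.8
    latency: 2_000,     // assumed key: fail if average latency exceeds 2s
    cost: 0.05,         // assumed key: fail if average cost per item exceeds $0.05
  },
})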

Read the CI/CD docs →

Integrations

Load datasets from your existing evaluation platforms:

Platform Loader Docs
Langfuse Dataset.fromLangfuse('dataset-name') Setup →
LangSmith Dataset.fromLangsmith('dataset-name') Setup →
Braintrust Dataset.fromBraintrust('project', 'dataset') Setup →
Basalt Dataset.fromBasalt('dataset-id') Setup →
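
For example, a sketch that swaps the quickstart's inline data for a Langfuse dataset (treating the loader as async is an assumption; myAgent and evaluators are the quickstart's):

import { experiment, Dataset } from '@basalt-ai/cobalt'

// Loader call from the table above; await is an assumption about its return type.
const dataset = await Dataset.fromLangfuse('dataset-name')

experiment('qa-agent-langfuse', dataset, async ({ item }) => {
  return { output: await myAgent(item.input) }
}, { evaluators })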

File formats: JSON, JSONL, CSV, HTTP/HTTPS remote URLs.

LLM providers: OpenAI and Anthropic (auto-detected from model name).

Configuration

// cobalt.config.ts
import { defineConfig } from '@basalt-ai/cobalt'

export default defineConfig({
  testDir: './experiments',
  judge: { model: 'gpt-5-mini', provider: 'openai' },
  concurrency: 5,
  timeout: 30_000,
  cache: { enabled: true, ttl: '7d' },
})
Option Default Description
testDir './experiments' Experiment file directory
judge.model 'gpt-5-mini' Default LLM judge model
concurrency 5 Max parallel executions
timeout 30000 Per-item timeout (ms)
reporters ['cli', 'json'] Output reporters
cache.ttl '7d' LLM response cache TTL
plugins [] Custom evaluator plugins
thresholds -- CI quality gates

Full Configuration Reference →

CLI

cobalt run <file|dir>          # Run experiments
cobalt init                    # Initialize project
cobalt update                  # Update skills file & check for SDK updates
cobalt history                 # View past runs
cobalt compare <id1> <id2>     # Compare two runs
cobalt serve                   # Start dashboard
cobalt clean                   # Clean cache/results
cobalt mcp                     # Start MCP server

Roadmap

Cobalt is open source and community-driven. The roadmap is shaped by what you need — tell us what matters to you.

Status Feature
✅ Core experiment runner, evaluators, datasets, CLI
✅ MCP server for AI-assisted testing
✅ CI mode with quality thresholds
✅ Plugin system & Autoevals integration
🚧 Vibe-code your test reports - A vibe-coded dashboard UI you can shape to your liking
✅ GitHub Action - First-class CI integration with PR comments
🚧 Tracing - Full tracing of the agent to have more context for the evaluation
🔮 Python version - Bring Cobalt to the Python ecosystem
🔮 VS Code extension - Run experiments from your editor
🔮 More integrations - Integrations with frameworks like Mastra or LangChain
🔮 Multi-platform export - Push results to BigQuery, Snowflake or other tools

Contributing

We welcome contributions! See our Contributing Guide for development setup, code standards, and PR process.

License

MIT — see LICENSE for details.