A chaos engineering tool that uses agents to break your infrastructure on purpose, then cleans up after itself.
You tell it what to target (a database, a k8s cluster, some servers), pick the skills you want to run, and it handles discovery, fault injection, and rollback. You can also point an LLM at your infra and let it decide what to break.
Databases (PostgreSQL, MySQL, CockroachDB, YugabyteDB, MongoDB) — Connects to your DB, discovers the schema (or collections for MongoDB), and hammers it with inserts, updates, heavy reads, or config changes. Rolls back everything when done.
Kubernetes — Finds workloads in your cluster and starts killing pods, cordoning nodes, dropping network policies, or deploying resource hogs. Cleans up on exit.
Servers — SSHes into hosts, discovers what's running (services, ports, filesystems), and goes after them: fills disks, stops services, changes permissions, spikes CPU/memory. Restores original state after.
How it works
- Discover — Agent connects to the target and figures out what's there (tables, pods, services, filesystems, etc.)
- Plan — The orchestrator (or an LLM) picks skills and sets parameters
- Execute — Skills run and each one saves what it needs for rollback
- Observe — Events get emitted in real time
- Rollback — When the duration expires (or something fails), everything reverts in LIFO order
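The phases above can be modeled in a few lines of Python. This is an illustrative sketch only, not the tool's implementation: `RollbackLog`, `Skill`, and `run_experiment` are hypothetical names, and the "faults" are just writes to an in-memory dict.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical model of the discover/plan/execute/observe/rollback loop.
# None of these names come from the chaos tool itself.


@dataclass
class RollbackLog:
    """Stack of undo actions; the last one recorded is reverted first."""
    _undo: List[Callable[[], None]] = field(default_factory=list)

    def record(self, undo: Callable[[], None]) -> None:
        self._undo.append(undo)

    def rollback(self) -> None:
        while self._undo:
            self._undo.pop()()  # pop from the end: LIFO order


@dataclass
class Skill:
    name: str
    key: str
    value: str

    def execute(self, state: Dict[str, str], log: RollbackLog) -> None:
        original = state[self.key]  # save what rollback will need
        log.record(lambda: state.__setitem__(self.key, original))
        state[self.key] = self.value  # inject the "fault"


def run_experiment(state: Dict[str, str], skills: List[Skill],
                   events: List[str]) -> None:
    log = RollbackLog()
    events.append(f"discover: {sorted(state)}")  # discover what exists
    try:
        for skill in skills:  # the plan is given; execute it in order
            skill.execute(state, log)
            events.append(f"executed: {skill.name}")  # observe
    finally:
        log.rollback()  # runs on normal completion and on failure
        events.append("rollback: done")
```

LIFO order matters when two skills touch the same resource: the second skill recorded the first skill's value as its "original", so undoing them oldest-first would leave the resource in the first skill's faulted state.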
Install
Quick install (latest release)
curl -fsSL https://raw.githubusercontent.com/system32-ai/chaos-agents/master/install.sh | bash
You can also set a specific version or install directory:
# specific version
curl -fsSL https://raw.githubusercontent.com/system32-ai/chaos-agents/master/install.sh | VERSION=v0.1.0 bash
# custom install location
curl -fsSL https://raw.githubusercontent.com/system32-ai/chaos-agents/master/install.sh | INSTALL_DIR=~/.local/bin bash
Usage
Interactive TUI
Run chaos with no arguments to launch the interactive terminal UI:
chaos
The TUI walks you through a guided wizard:
- Select provider — Anthropic, OpenAI, or Ollama (auto-detects API keys from env)
- Configure provider — API key, model, max turns
- Select target — Database, Kubernetes, or Server
- Configure target — Connection URL, namespace, SSH hosts, etc.
- Enter prompt — Describe what chaos to run and set a duration
- Review & confirm — Check settings and press Enter to start
Once running, the dashboard shows four live panels:
| Panel | What it shows |
|---|---|
| Chat | LLM conversation, tool calls, and experiment lifecycle events |
| Resources | Discovered targets (tables, pods, services) |
| Skills | Execution progress for each skill |
| Rollback | Rollback step status |
Keyboard shortcuts during execution:
| Key | Action |
|---|---|
| Tab | Switch between panels |
| Up/Down | Scroll the Chat panel |
| Ctrl+C | Cancel experiment (stay in TUI) |
| Ctrl+W | Cancel experiment and quit |
| q | Quit (after experiment finishes) |
Agent mode (CLI)
Plan and execute from the command line:
export OPENAI_API_KEY="sk.."
chaos agent "Test cockroachdb resilience at postgres://root@localhost:26257/mydb"
List skills
chaos list-skills
chaos list-skills --target database
chaos list-skills --target kubernetes
chaos list-skills --target server
SKILL                         TARGET      DESCRIPTION
----------------------------------------------------------------------
db.insert_load                database    Bulk INSERT random rows into target tables
db.update_load                database    Randomly UPDATE existing rows in target tables
db.select_load                database    Generate heavy SELECT query load against target tables
db.config_change              database    ALTER database configuration parameters with rollback
db.table_lock                 database    Acquire table-level locks to simulate lock contention
db.row_lock                   database    Acquire row-level locks (SELECT FOR UPDATE) to simulate row contention
mongo.insert_load             database    Bulk INSERT random documents into MongoDB collections
mongo.update_load             database    Randomly UPDATE existing documents in MongoDB collections
mongo.find_load               database    Generate heavy read (find) query load against MongoDB collections
mongo.index_drop              database    Drop secondary indexes from MongoDB collections
mongo.profiling_change        database    Change MongoDB profiling level to add overhead
mongo.connection_pool_stress  database    Open many MongoDB connections to exhaust limits
crdb.zone_config_change       database    Change CockroachDB zone config (replication, GC TTL)
ysql.follower_reads           database    Toggle YugabyteDB follower reads for eventual consistency
k8s.pod_kill                  kubernetes  Delete random pods matching label selector
k8s.node_drain                kubernetes  Cordon a node (mark unschedulable), rollback uncordons it
k8s.network_chaos             kubernetes  Apply deny-all NetworkPolicy to isolate pods
k8s.resource_stress           kubernetes  Deploy a stress-ng pod to consume cluster resources
server.disk_fill              server      Fill disk space with a large file, rollback removes it
server.permission_change      server      Change file permissions to disrupt services, rollback restores them
server.service_stop           server      Stop random running services, rollback restarts them
server.cpu_stress             server      Run stress-ng to load CPU, rollback kills the process
server.memory_stress          server      Run stress-ng to consume memory, rollback kills the process
Run experiments
chaos run config/example-db.yaml
chaos run config/example-k8s.yaml
chaos run config/example-server.yaml
# dry-run — validates and discovers but doesn't execute anything
chaos run config/example-db.yaml --dry-run
Validate config
chaos validate config/example-db.yaml
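As a rough idea of what a config check can catch before anything runs, here is a minimal sketch. It is not the logic behind chaos validate: `validate_experiment` and `SKILL_TARGETS` are hypothetical names, and the only rule taken from this README is the skill-prefix-to-target mapping shown in the list-skills output.

```python
from typing import Dict, List

# Prefix -> target mapping as it appears in the list-skills output above.
SKILL_TARGETS: Dict[str, str] = {
    "db": "database",
    "mongo": "database",
    "crdb": "database",
    "ysql": "database",
    "k8s": "kubernetes",
    "server": "server",
}


def validate_experiment(exp: dict) -> List[str]:
    """Return a list of problems; an empty list means the sketch found none."""
    errors: List[str] = []
    for key in ("name", "target", "skills"):
        if key not in exp:
            errors.append(f"missing field: {key}")

    target = exp.get("target")
    if target is not None and target not in set(SKILL_TARGETS.values()):
        errors.append(f"unknown target: {target}")

    for skill in exp.get("skills", []):
        name = skill.get("skill_name", "")
        prefix = name.split(".", 1)[0]
        expected = SKILL_TARGETS.get(prefix)
        if expected is None:
            errors.append(f"unknown skill: {name}")
        elif expected != target:
            errors.append(f"{name} targets {expected}, not {target}")
    return errors
```

For example, an experiment with target: database that lists k8s.pod_kill would be reported as a mismatch rather than failing halfway through a run.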
LLM planning
Let an LLM look at your setup and decide what chaos to run. The provider is auto-detected from your API key environment variables:
# Anthropic: auto-detected from ANTHROPIC_API_KEY
export ANTHROPIC_API_KEY="sk-ant-..."
chaos plan "Test our PostgreSQL database resilience under heavy write load"
# OpenAI: auto-detected from OPENAI_API_KEY
export OPENAI_API_KEY="sk-..."
chaos plan "Kill random pods in the staging namespace"
# Ollama (local): used as fallback when no API key is set
chaos plan "Stress test the web servers" --model llama3.1
# Explicit provider override
chaos plan "Break the database" --provider openai
# With MCP servers for extra context
chaos plan "Run chaos on the entire staging environment" --config config/example-llm.yaml
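The detection rules described above (API key variables first, Ollama as the fallback, --provider as an override) could look roughly like this. `detect_provider` is a hypothetical helper, and checking Anthropic before OpenAI when both keys are set is an assumption; the README does not say which one wins.

```python
import os
from typing import Mapping, Optional


def detect_provider(env: Mapping[str, str],
                    override: Optional[str] = None) -> str:
    """Pick an LLM provider name from an explicit override or the environment."""
    if override:
        return override  # corresponds to passing --provider
    # ASSUMPTION: Anthropic takes precedence if both keys are present.
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    if env.get("OPENAI_API_KEY"):
        return "openai"
    return "ollama"  # local fallback when no API key is set


if __name__ == "__main__":
    print(detect_provider(os.environ))
```

Passing the environment in as a mapping keeps the function easy to test without touching real credentials.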
Agent mode (CLI, detailed)
Plan and execute in one step. The LLM generates the experiments, and you review and approve them before they run:
# Plan and run interactively
chaos agent "Test our PostgreSQL database resilience under heavy write load"
# Target CockroachDB or YugabyteDB: auto-detected from prompt keywords
chaos agent "Test cockroachdb resilience at postgres://root@localhost:26257/mydb"
# MongoDB: auto-detected from mongodb:// URL
chaos agent "Load test mongodb://localhost:27017 collections"
# Preview the generated config without executing
chaos agent "Kill random pods in staging" --dry-run
# Auto-approve (skip confirmation)
chaos agent "Stress test the web servers" -y
# Save the generated config to a file and run
chaos agent "Fill disk on 10.0.1.50" --save plan.yaml
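Because CockroachDB and YugabyteDB share the postgres:// URL scheme, the URL alone cannot identify them, which is presumably why the prompt keywords are consulted. A minimal sketch of that idea follows; `detect_db_type` is a hypothetical function and the exact keywords are assumptions, while the returned strings are the db_type values used in the configuration examples in this README.

```python
from typing import Optional


def detect_db_type(prompt: str) -> Optional[str]:
    """Guess a db_type value from a free-text prompt (illustrative only)."""
    text = prompt.lower()
    if "mongodb://" in text:
        return "mongo_d_b"
    # Keyword checks come before the generic postgres check, because
    # CockroachDB and YugabyteDB prompts also contain postgres:// URLs.
    if "cockroach" in text:
        return "cockroach_db"
    if "yugabyte" in text:
        return "yugabyte_db"
    if "postgres" in text:
        return "postgres"
    return None  # not recognizably a database prompt
```

The ordering is the design point: testing for "postgres" first would classify the CockroachDB prompt above as plain PostgreSQL.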
Daemon mode
Run experiments on a cron schedule:
chaos daemon config/daemon.yaml
# with a PID file
chaos daemon config/daemon.yaml --pid-file /var/run/chaos.pid
Configuration
Database experiment
experiments:
  - name: "postgres-load-test"
    target: database
    target_config:
      connection_url: "postgres://user:pass@localhost:5432/mydb"
      db_type: postgres
    skills:
      - skill_name: "db.insert_load"
        params:
          rows_per_table: 10000
          tables: ["users", "orders"]
      - skill_name: "db.config_change"
        params:
          changes:
            - param: "work_mem"
              value: "4MB"
    duration: "5m"
    parallel: false
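The duration field in these configs is a short string such as "5m". A sketch of how such a value can be turned into seconds is shown below; `parse_duration` is hypothetical, and the accepted units (s, m, h) are an assumption, since the examples in this README only use minutes.

```python
import re

# ASSUMPTION: only whole numbers followed by s, m, or h are accepted.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}
_DURATION = re.compile(r"(\d+)([smh])")


def parse_duration(text: str) -> int:
    """Convert a string like '5m' into a number of seconds."""
    match = _DURATION.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]
```

Under these assumptions, "5m" is the window after which rollback would begin, 300 seconds from the start of the experiment.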
CockroachDB experiment
CockroachDB and YugabyteDB are PostgreSQL wire-compatible, so they use postgres:// connection URLs. The SQL skills (db.insert_load, db.select_load, db.update_load) work as-is. The db.config_change skill uses CockroachDB's SET CLUSTER SETTING syntax automatically.
experiments:
  - name: "cockroachdb-resilience"
    target: database
    target_config:
      connection_url: "postgres://root@localhost:26257/mydb"
      db_type: cockroach_db
    skills:
      - skill_name: "db.insert_load"
        params:
          rows_per_table: 5000
      - skill_name: "crdb.zone_config_change"
        params:
          target: "DATABASE mydb"
          changes:
            - param: "num_replicas"
              value: "1"
            - param: "gc.ttlseconds"
              value: "600"
    duration: "5m"
YugabyteDB experiment
experiments:
  - name: "yugabyte-consistency-test"
    target: database
    target_config:
      connection_url: "postgres://yugabyte@localhost:5433/mydb"
      db_type: yugabyte_db
    skills:
      - skill_name: "db.insert_load"
        params:
          rows_per_table: 5000
      - skill_name: "ysql.follower_reads"
        params:
          enable: true
          staleness: "60000ms"
    duration: "5m"
MongoDB experiment
experiments:
  - name: "mongodb-load-test"
    target: database
    target_config:
      connection_url: "mongodb://localhost:27017"
      db_type: mongo_d_b
      databases: ["myapp"]
    skills:
      - skill_name: "mongo.insert_load"
        params:
          database: "myapp"
          docs_per_collection: 5000
      - skill_name: "mongo.update_load"
        params:
          database: "myapp"
          docs: 200
      - skill_name: "mongo.find_load"
        params:
          database: "myapp"
          query_count: 1000
      - skill_name: "mongo.index_drop"
        params:
          database: "myapp"
          max_per_collection: 2
      - skill_name: "mongo.profiling_change"
        params:
          database: "myapp"
          level: 2
    duration: "5m"
Kubernetes experiment
experiments:
  - name: "k8s-pod-chaos"
    target: kubernetes
    target_config:
      namespace: "staging"
      label_selector: "app=web"
    skills:
      - skill_name: "k8s.pod_kill"
        params:
          namespace: "staging"
          label_selector: "app=web"
          count: 2
      - skill_name: "k8s.network_chaos"
        params:
          namespace: "staging"
          pod_selector:
            app: "web"
    duration: "5m"
Server experiment
The server agent auto-discovers running services and picks targets based on what it finds:
experiments:
  - name: "server-chaos"
    target: server
    target_config:
      hosts:
        - host: "10.0.1.50"
          port: 22
          username: "chaos-agent"
          auth:
            type: key
            private_key_path: "~/.ssh/id_ed25519"
      discovery:
        enabled: true
        exclude_services: ["docker", "containerd"]
    skills:
      - skill_name: "server.service_stop"
        params:
          max_services: 2
      - skill_name: "server.disk_fill"
        params:
          size: "5GB"
          target_mount: "/tmp"
    duration: "10m"
    resource_filters:
      - "nginx.*"
      - "postgres.*"
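One plausible reading of how exclude_services, resource_filters, and max_services combine is sketched below. The semantics are assumptions: the README does not say whether resource_filters are include patterns or how they are matched, and server.service_stop picks services at random, whereas this sketch takes the first matches so the result is deterministic. `select_services` is a hypothetical function.

```python
import re
from typing import Iterable, List


def select_services(discovered: Iterable[str], exclude: Iterable[str],
                    filters: Iterable[str], max_services: int) -> List[str]:
    """Narrow discovered service names down to at most max_services targets."""
    excluded = set(exclude)
    # ASSUMPTION: filters are regexes that must match the whole name,
    # and an empty filter list means "no restriction".
    patterns = [re.compile(p) for p in filters]
    candidates = [
        name for name in discovered
        if name not in excluded
        and (not patterns or any(p.fullmatch(name) for p in patterns))
    ]
    return candidates[:max_services]
```

With the config above and a host running docker, nginx.service, postgresql.service, and sshd.service, this reading would leave only the nginx and PostgreSQL units as candidates.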
Daemon config
settings:
  max_concurrent: 2
experiments:
  - experiment:
      name: "nightly-db-chaos"
      target: database
      target_config:
        connection_url: "postgres://chaos:pw@db:5432/staging"
        db_type: postgres
      skills:
        - skill_name: "db.insert_load"
          params:
            rows_per_table: 5000
      duration: "15m"
    schedule: "0 0 2 * * *"
    enabled: true
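The schedule string above has six fields rather than the five of classic cron. A common convention for six-field expressions is a leading seconds field, and the sketch below labels the fields on that assumption; the README does not confirm the field order, and `describe_schedule` is a hypothetical helper.

```python
from typing import Dict

# ASSUMPTION: six-field cron with seconds first.
FIELDS = ("second", "minute", "hour", "day_of_month", "month", "day_of_week")


def describe_schedule(expr: str) -> Dict[str, str]:
    """Split a six-field cron expression into named fields."""
    parts = expr.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))
```

Read this way, "0 0 2 * * *" fires at 02:00:00 every day, which fits the experiment name nightly-db-chaos.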
LLM + MCP config
llm:
  provider: anthropic
  api_key: "${ANTHROPIC_API_KEY}"
  model: "claude-sonnet-4-5-20250929"
  max_tokens: 4096
mcp_servers:
  - name: "prometheus-mcp"
    transport:
      type: stdio
      command: "npx"
      args: ["-y", "@modelcontextprotocol/server-prometheus"]
    env:
      PROMETHEUS_URL: "http://prometheus:9090"
max_turns: 10
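The api_key value uses a ${NAME} placeholder so the secret stays out of the file. A minimal sketch of that kind of substitution is shown here; `expand_env` is hypothetical, and failing loudly on an unset variable is a design choice of this sketch, not documented behavior of the tool.

```python
import re
from typing import Mapping

# Matches ${NAME} where NAME is a conventional environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value: str, env: Mapping[str, str]) -> str:
    """Replace every ${NAME} in value with env[NAME]."""
    def replace(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in env:
            # Fail early rather than send an empty API key to a provider.
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(replace, value)
```

Calling expand_env with os.environ as the second argument would resolve the placeholder against the current process environment.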
Rollback
Every skill saves the original state before doing anything. Rollback happens in LIFO order — last thing changed gets reverted first.
| Skill | What it does | Rollback |
|---|---|---|
| db.insert_load | INSERT rows | DELETE by stored IDs |
| db.update_load | UPDATE rows | Restore original values |
| db.select_load | Heavy SELECT queries | No-op (read-only) |
| db.config_change | ALTER SYSTEM SET / SET CLUSTER SETTING | Restore original value |
| db.table_lock | Acquire table-level locks | Release locks on transaction end |
| db.row_lock | SELECT FOR UPDATE on rows | Release locks on transaction end |
| mongo.insert_load | INSERT documents | DELETE by stored ObjectIds |
| mongo.update_load | UPDATE documents | Replace with original documents |
| mongo.find_load | Heavy find/aggregate queries | No-op (read-only) |
| mongo.index_drop | Drop secondary indexes | Recreate indexes with original key/options |
| mongo.profiling_change | Set profiling level to 2 (all ops) | Restore original profiling level |
| mongo.connection_pool_stress | Open many connections | Connections drain on process exit |
| crdb.zone_config_change | ALTER zone config (replication, GC) | Re-apply original zone config |
| ysql.follower_reads | Enable follower reads + staleness | Restore original follower read settings |
| k8s.pod_kill | Delete pod | Verify replacement pod is running |
| k8s.node_drain | Cordon node | Uncordon node |
| k8s.network_chaos | Create deny-all NetworkPolicy | Delete the policy |
| k8s.resource_stress | Deploy stress-ng pod | Delete the pod |
| server.disk_fill | Allocate large file | Remove the file |
| server.permission_change | chmod to 000 | Restore original permissions |
| server.service_stop | systemctl stop | systemctl start |
| server.cpu_stress | Run stress-ng CPU | Kill the process |
| server.memory_stress | Run stress-ng memory | Kill the process |
If the process crashes mid-experiment, the rollback log is serializable so it can be replayed on restart.
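To make the crash-recovery idea concrete: if each rollback step is stored as plain data rather than as an in-memory callback, a fresh process can read the log and undo the steps newest-first. The sketch below uses a hypothetical JSON-lines format; `record_step` and `replay` are illustrative names, and the tool's actual log format is not described in this README.

```python
import json
from typing import Dict, List

# Hypothetical log format: one JSON object per step, written before the
# change is applied so that a crash never leaves an unrecorded change.


def record_step(log: List[str], skill: str, key: str, original: str) -> None:
    """Append a serialized rollback step to the log."""
    log.append(json.dumps({"skill": skill, "key": key, "original": original}))


def replay(log: List[str], state: Dict[str, str]) -> List[str]:
    """Revert steps newest-first and return the names of the skills undone."""
    undone: List[str] = []
    for line in reversed(log):  # LIFO: last recorded, first reverted
        step = json.loads(line)
        state[step["key"]] = step["original"]
        undone.append(step["skill"])
    log.clear()  # nothing left to replay after a successful rollback
    return undone
```

In this sketch the list of strings stands in for a file on disk; a real implementation would also need to flush each entry durably before applying the fault.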
Roadmap
- Adaptive chaos — agents that learn from past runs and escalate intensity on their own
- Multi-target experiments — coordinated chaos across DB + k8s + server in one go
- Observability integrations — Prometheus, Grafana, Datadog, PagerDuty
- Steady-state assertions — define what "healthy" looks like and let the agent check
- Cloud targets — AWS, GCP, Azure fault injection (Lambda throttling, S3 latency, IAM revocation)
- Distributed agent mesh — agents across regions for cascading failure scenarios
Community
Join us on Discord for questions, feedback, and discussion.
License
MIT