AIOps Platform for Intelligent Kubernetes Remediation
Kubernaut closes the loop from Kubernetes alert to automated remediation. When something goes wrong in your cluster, Kubernaut detects the signal and hands it to an LLM-powered agent. The agent investigates the root cause using native Go bindings (client-go) against the Kubernetes API, log, and Prometheus endpoints, selects a remediation workflow, and executes the fix. When it can't fix the problem, it escalates to a human with a full root cause analysis (RCA).
Full Documentation · Demo Scenarios · Latest Release (v1.4.0)
Why
Kubernetes operators spend hours manually triaging alerts, diagnosing root causes from scattered logs and metrics, and executing remediation steps from runbooks that drift out of date. The response depends on tribal knowledge and human availability, and it often happens at 3am.
Rule-based remediation tools help with known, deterministic problems — "if X, do Y." But when the same symptom has multiple root causes, or the right fix depends on context the rule can't see, they fall short.
Kubernaut bridges that gap. It uses an LLM agent that investigates the actual root cause through native Go bindings against the Kubernetes API and observability stack, selects the right remediation from a workflow catalog, executes it, and verifies the fix worked — escalating to humans only when it should. Rule-based tools are thermostats. Kubernaut is a diagnostician that also adjusts the thermostat.
Why Kubernaut? — full comparison with rule-based tools
What It Does
- Detects — Ingests Prometheus AlertManager alerts and Kubernetes Events, validates resource scope, and deduplicates by fingerprint
- Investigates — Performs live root cause analysis using Kubernetes inspection tools, configurable observability toolsets (Prometheus, etc.), and remediation history
- Remediates — Selects and executes a workflow from a searchable catalog via Tekton Pipelines, Kubernetes Jobs, or Ansible (AWX/AAP), with optional human approval gates
- Closes the loop — Notifies the team (Slack, console), evaluates whether the fix worked via health checks, alert resolution, and spec hash drift detection, and feeds effectiveness scores back into future investigations
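Two of the steps above, deduplication by fingerprint and spec hash drift detection, both come down to hashing a canonical form of some data and comparing hashes. The sketch below illustrates that idea using only the Go standard library. It is not Kubernaut's implementation: the `Signal` type, its fields, and the `Fingerprint`, `Deduplicator`, `SpecHash`, and `Drifted` names are all hypothetical, and it assumes "drift" means the resource spec changed after the remediation was applied.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// Signal is a hypothetical, simplified alert (not Kubernaut's actual type).
type Signal struct {
	AlertName, Namespace, Kind, Name string
	Labels                           map[string]string
}

// Fingerprint hashes the identifying fields plus the labels in sorted key
// order, so Go's random map iteration order cannot change the result.
func Fingerprint(s Signal) string {
	keys := make([]string, 0, len(s.Labels))
	for k := range s.Labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	for _, f := range []string{s.AlertName, s.Namespace, s.Kind, s.Name} {
		b.WriteString(f)
		b.WriteByte(0) // separator so "ab"+"c" differs from "a"+"bc"
	}
	for _, k := range keys {
		b.WriteString(k + "=" + s.Labels[k])
		b.WriteByte(0)
	}
	sum := sha256.Sum256([]byte(b.String()))
	return hex.EncodeToString(sum[:])
}

// Deduplicator counts how often each fingerprint has been observed.
// It is not safe for concurrent use; a real service would add locking.
type Deduplicator struct{ seen map[string]int }

func NewDeduplicator() *Deduplicator {
	return &Deduplicator{seen: make(map[string]int)}
}

// Observe reports whether the signal is new and how many times it was seen.
func (d *Deduplicator) Observe(s Signal) (isNew bool, count int) {
	fp := Fingerprint(s)
	d.seen[fp]++
	return d.seen[fp] == 1, d.seen[fp]
}

// SpecHash hashes the JSON encoding of a spec. encoding/json writes map
// keys in sorted order, so equal maps always produce the same hash.
func SpecHash(spec map[string]any) (string, error) {
	raw, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

// Drifted reports whether the current spec no longer matches the hash
// recorded when the remediation was applied.
func Drifted(recorded string, current map[string]any) (bool, error) {
	h, err := SpecHash(current)
	if err != nil {
		return false, err
	}
	return h != recorded, nil
}

func main() {
	d := NewDeduplicator()
	alert := Signal{AlertName: "KubePodCrashLooping", Namespace: "shop",
		Kind: "Pod", Name: "cart-0", Labels: map[string]string{"severity": "critical"}}
	first, _ := d.Observe(alert)
	again, count := d.Observe(alert)
	fmt.Println("new:", first, "duplicate is new:", again, "count:", count)

	recorded, _ := SpecHash(map[string]any{"replicas": 3, "image": "cart:1.2.1"})
	drifted, _ := Drifted(recorded, map[string]any{"replicas": 3, "image": "cart:1.2.0"})
	fmt.Println("drifted:", drifted)
}
```

Hashing a canonical encoding keeps the comparison cheap and independent of field order, at the cost of treating any change, however small, as a different signal or as drift.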
Roadmap
v1.5 — Agentic Integration (current)
- MCP Interactive Mode — Human-in-the-loop investigation via MCP tools with lease-based session management, SSE streaming, dynamic takeover, and user impersonation (#703, #823)
- API Frontend service — Unified external protocol layer (MCP + A2A) with OIDC authentication, natural language signal intake, and the Kubernaut Console web UI (kubernaut-apifrontend)
Track progress on the v1.5 milestone.
Full roadmap — Agentic Integration (A2A), Collective Intelligence, Fleet Operations (ACM/AAP), and Operational Expansion (cost, security, non-K8s). For past releases, see the CHANGELOG.
Coming in v1.5 — Kubernaut Console: investigate, chat, and remediate from a single UI
Installation
See the Installation Guide for prerequisites, configuration, and deployment instructions.
Documentation
| Resource | Link |
|---|---|
| User & Operator Guide | jordigilh.github.io/kubernaut-docs |
| Architecture Overview | Architecture |
| Developer Guide | docs/DEVELOPER_GUIDE.md |
| Must-Gather Diagnostics | cmd/must-gather/README.md |
Related Repositories
| Repository | Description |
|---|---|
| kubernaut-docs | Documentation website (MkDocs Material) |
| kubernaut-apifrontend | API Frontend — MCP/A2A gateway with OIDC authentication |
| kubernaut-demo-scenarios | Demo scenarios, scripts, and recordings |
Development
    make build-all          # Build all services
    make test-tier-unit     # Run unit tests
    make test-all-gateway   # Run all test tiers for a service
We use Ginkgo/Gomega BDD for testing and follow a TDD workflow. See the Developer Guide for environment setup, build targets, and test commands.
Contributing
See CONTRIBUTING.md for guidelines. In short: create a feature branch, implement with tests, update docs, and open a PR.
License
Apache License 2.0 — see LICENSE.
Issues: GitHub Issues · Discussions: GitHub Discussions
Kubernaut — From alert to remediation, intelligently.
