GitHub - jordigilh/kubernaut: Open-source AIOps platform that closes the loop from Kubernetes alert to automated remediation. An LLM agent investigates live via native client-go bindings, logs, and Prometheus, selects a fix from a workflow catalog, and executes it — or escalates with a full RCA. Approval gates, OPA policies, and audit trails keep humans in control.


AIOps Platform for Intelligent Kubernetes Remediation


Kubernaut closes the loop from Kubernetes alert to automated remediation. When something goes wrong in your cluster, Kubernaut detects the signal and sends it to an LLM-powered agent. The agent investigates the root cause using native client-go bindings against the Kubernetes API, along with log and Prometheus endpoints, then selects a remediation workflow and executes the fix, or escalates to a human with a full RCA when it can't.

CrashLoopBackOff demo — from alert to automated fix in under 5 minutes

Full Documentation  ·  Demo Scenarios  ·  Latest Release (v1.4.0)


Why

Kubernetes operators spend hours manually triaging alerts, diagnosing root causes from scattered logs and metrics, and executing remediation steps from runbooks that drift out of date. The response depends on tribal knowledge and human availability, and it often happens at 3 a.m.

Rule-based remediation tools help with known, deterministic problems — "if X, do Y." But when the same symptom has multiple root causes, or the right fix depends on context the rule can't see, they fall short.

Kubernaut bridges that gap. It uses an LLM agent that investigates the actual root cause through native Go bindings against the Kubernetes API and observability stack, selects the right remediation from a workflow catalog, executes it, and verifies the fix worked — escalating to humans only when it should. Rule-based tools are thermostats. Kubernaut is a diagnostician that also adjusts the thermostat.

Why Kubernaut? — full comparison with rule-based tools


What It Does

  • Detects — Ingests Prometheus AlertManager alerts and Kubernetes Events, validates resource scope, and deduplicates by fingerprint
  • Investigates — Performs live root cause analysis using Kubernetes inspection tools, configurable observability toolsets (Prometheus, etc.), and remediation history
  • Remediates — Selects and executes a workflow from a searchable catalog via Tekton Pipelines, Kubernetes Jobs, or Ansible (AWX/AAP), with optional human approval gates
  • Closes the loop — Notifies the team (Slack, console), evaluates whether the fix worked via health checks, alert resolution, and spec hash drift detection, and feeds effectiveness scores back into future investigations
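
Two of the steps above, deduplication by fingerprint and spec hash drift detection, can be illustrated with a small Go sketch. This is not Kubernaut's implementation: the `Signal` type, the fields that feed the fingerprint, and the `Deduper`, `SpecHash`, and `Drifted` helpers are all hypothetical names chosen for illustration, and the real service may hash different inputs or keep its state elsewhere.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// Signal is a hypothetical, simplified alert; it is not Kubernaut's real type.
type Signal struct {
	AlertName string
	Namespace string
	Resource  string // e.g. "deployment/checkout"
	Labels    map[string]string
}

// Fingerprint hashes the identifying fields plus the labels in sorted key
// order, so the same alert on the same resource always yields the same value.
func Fingerprint(s Signal) string {
	keys := make([]string, 0, len(s.Labels))
	for k := range s.Labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	b.WriteString(s.AlertName + "\x00" + s.Namespace + "\x00" + s.Resource)
	for _, k := range keys {
		b.WriteString("\x00" + k + "=" + s.Labels[k])
	}
	sum := sha256.Sum256([]byte(b.String()))
	return hex.EncodeToString(sum[:])
}

// Deduper remembers fingerprints it has already admitted (in memory only).
type Deduper struct {
	seen map[string]struct{}
}

func NewDeduper() *Deduper { return &Deduper{seen: make(map[string]struct{})} }

// Admit returns true the first time a fingerprint appears and false afterwards.
func (d *Deduper) Admit(s Signal) bool {
	fp := Fingerprint(s)
	if _, dup := d.seen[fp]; dup {
		return false
	}
	d.seen[fp] = struct{}{}
	return true
}

// SpecHash hashes the JSON encoding of a spec. encoding/json emits map keys
// in sorted order, so equal specs produce equal hashes.
func SpecHash(spec map[string]any) (string, error) {
	raw, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

// Drifted reports whether the spec changed between two observations.
func Drifted(before, after map[string]any) (bool, error) {
	a, err := SpecHash(before)
	if err != nil {
		return false, err
	}
	b, err := SpecHash(after)
	if err != nil {
		return false, err
	}
	return a != b, nil
}

func main() {
	d := NewDeduper()
	alert := Signal{AlertName: "KubePodCrashLooping", Namespace: "shop", Resource: "deployment/checkout"}
	fmt.Println(d.Admit(alert)) // true: first occurrence
	fmt.Println(d.Admit(alert)) // false: duplicate fingerprint

	before := map[string]any{"replicas": 2, "image": "checkout:1.0"}
	after := map[string]any{"replicas": 2, "image": "checkout:1.1"}
	changed, err := Drifted(before, after)
	fmt.Println(changed, err) // true <nil>: the image field changed
}
```

Hashing a canonical encoding keeps both checks cheap: a duplicate alert or an unchanged spec is a single string comparison, without a field-by-field diff.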


Architecture

Kubernaut Layered Architecture


Roadmap

v1.5 — Agentic Integration (current)

  • MCP Interactive Mode — Human-in-the-loop investigation via MCP tools with lease-based session management, SSE streaming, dynamic takeover, and user impersonation (#703, #823)
  • API Frontend service — Unified external protocol layer (MCP + A2A) with OIDC authentication, natural language signal intake, and the Kubernaut Console web UI (kubernaut-apifrontend)

Track progress on the v1.5 milestone.

Full roadmap — Agentic Integration (A2A), Collective Intelligence, Fleet Operations (ACM/AAP), and Operational Expansion (cost, security, non-K8s). For past releases, see the CHANGELOG.

Kubernaut Console — interactive investigation and remediation
Coming in v1.5 — Kubernaut Console: investigate, chat, and remediate from a single UI


Installation

See the Installation Guide for prerequisites, configuration, and deployment instructions.


Documentation

| Resource | Link |
| --- | --- |
| User & Operator Guide | jordigilh.github.io/kubernaut-docs |
| Architecture Overview | Architecture |
| Developer Guide | docs/DEVELOPER_GUIDE.md |
| Must-Gather Diagnostics | cmd/must-gather/README.md |

Related Repositories

| Repository | Description |
| --- | --- |
| kubernaut-docs | Documentation website (MkDocs Material) |
| kubernaut-apifrontend | API Frontend: MCP/A2A gateway with OIDC authentication |
| kubernaut-demo-scenarios | Demo scenarios, scripts, and recordings |

Development

make build-all          # Build all services
make test-tier-unit     # Run unit tests
make test-all-gateway   # Run all test tiers for one service (here, gateway)

We use Ginkgo/Gomega BDD for testing and follow a TDD workflow. See the Developer Guide for environment setup, build targets, and test commands.


Contributing

See CONTRIBUTING.md for guidelines. In short: create a feature branch, implement with tests, update docs, and open a PR.


License

Apache License 2.0 — see LICENSE.


Issues: GitHub Issues · Discussions: GitHub Discussions

Kubernaut — From alert to remediation, intelligently.