GitHub - jordigilh/kubernaut: Open-source AIOps platform that closes the loop from Kubernetes alert to automated remediation. An LLM agent investigates live via native client-go bindings, logs, and Prometheus, selects a fix from a workflow catalog, and executes it — or escalates with a full RCA. Approval gates, OPA policies, and audit trails keep humans in control.

AIOps Platform for Intelligent Kubernetes Remediation

Kubernaut closes the loop from Kubernetes alert to automated remediation. When something goes wrong in your cluster, Kubernaut detects the signal and hands it to an LLM-powered agent. The agent investigates the root cause through native client-go bindings against the Kubernetes API, logs, and Prometheus endpoints, selects a remediation workflow, and executes the fix. When it can't fix the problem, it escalates to a human with a full RCA.

Full Documentation · Demo Scenarios · Latest Release (v1.4.0)

Why

Kubernetes operators spend hours manually triaging alerts, diagnosing root causes from scattered logs and metrics, and executing remediation steps from runbooks that drift out of date. The response depends on tribal knowledge and human availability, and it often happens at 3am.

Rule-based remediation tools help with known, deterministic problems — "if X, do Y." But when the same symptom has multiple root causes, or the right fix depends on context the rule can't see, they fall short.

Kubernaut bridges that gap. It uses an LLM agent that investigates the actual root cause through native Go bindings against the Kubernetes API and observability stack, selects the right remediation from a workflow catalog, executes it, and verifies the fix worked — escalating to humans only when it should. Rule-based tools are thermostats. Kubernaut is a diagnostician that also adjusts the thermostat.
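The contrast between a fixed rule and a root-cause-driven choice can be illustrated with a toy example. This is a minimal sketch, not Kubernaut's code: the `Finding` type, the `catalog` map, and every workflow name in it are hypothetical, and a real investigation is performed by the LLM agent rather than a lookup table.

```go
package main

import "fmt"

// Finding is a hypothetical summary of what an investigation turned up.
type Finding struct {
	Symptom   string // what the alert reported
	RootCause string // what the investigation concluded; "" if unknown
}

// ruleBased maps a symptom straight to an action: "if X, do Y".
// It cannot tell two root causes of the same symptom apart.
func ruleBased(f Finding) string {
	if f.Symptom == "CrashLoopBackOff" {
		return "restart-pod"
	}
	return "escalate"
}

// catalog is an illustrative workflow catalog keyed by root cause.
// The entries are invented for this example.
var catalog = map[string]string{
	"OOMKilled":    "increase-memory-limit",
	"BadConfigMap": "rollback-config",
}

// contextual picks a workflow by root cause and escalates when the
// catalog has no match.
func contextual(f Finding) string {
	if wf, ok := catalog[f.RootCause]; ok {
		return wf
	}
	return "escalate"
}

func main() {
	findings := []Finding{
		{Symptom: "CrashLoopBackOff", RootCause: "OOMKilled"},
		{Symptom: "CrashLoopBackOff", RootCause: "BadConfigMap"},
		{Symptom: "CrashLoopBackOff", RootCause: ""},
	}
	for _, f := range findings {
		fmt.Printf("cause=%q rule=%s contextual=%s\n",
			f.RootCause, ruleBased(f), contextual(f))
	}
}
```

The rule returns the same action for all three findings, while the root-cause lookup returns a different workflow for each known cause and escalates for the unknown one.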

Why Kubernaut? — full comparison with rule-based tools

What It Does

Detects — Ingests Prometheus AlertManager alerts and Kubernetes Events, validates resource scope, and deduplicates by fingerprint
Investigates — Performs live root cause analysis using Kubernetes inspection tools, configurable observability toolsets (Prometheus, etc.), and remediation history
Remediates — Selects and executes a workflow from a searchable catalog via Tekton Pipelines, Kubernetes Jobs, or Ansible (AWX/AAP), with optional human approval gates
Closes the loop — Notifies the team (Slack, console), evaluates whether the fix worked via health checks, alert resolution, and spec hash drift detection, and feeds effectiveness scores back into future investigations
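Deduplicating by fingerprint, as in the Detects step, can be sketched as follows. This is an assumption-laden illustration rather than Kubernaut's implementation: the `Signal` type, the choice of SHA-256 over sorted labels, and the in-memory `Deduper` are all hypothetical, and the project may compute and store fingerprints differently.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// Signal is a hypothetical, simplified alert; the real types may differ.
type Signal struct {
	Name   string
	Labels map[string]string
}

// Fingerprint hashes the alert name and its labels in sorted key order,
// so signals with identical labels always yield the same fingerprint
// regardless of map iteration order.
func Fingerprint(s Signal) string {
	keys := make([]string, 0, len(s.Labels))
	for k := range s.Labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	b.WriteString(s.Name)
	for _, k := range keys {
		b.WriteByte(0) // separator keeps "a"+"bc" distinct from "ab"+"c"
		b.WriteString(k)
		b.WriteByte(0)
		b.WriteString(s.Labels[k])
	}
	sum := sha256.Sum256([]byte(b.String()))
	return hex.EncodeToString(sum[:])
}

// Deduper remembers the fingerprints it has already admitted.
type Deduper struct {
	seen map[string]struct{}
}

// NewDeduper returns an empty Deduper.
func NewDeduper() *Deduper {
	return &Deduper{seen: make(map[string]struct{})}
}

// Admit reports whether the signal is new; duplicates return false.
func (d *Deduper) Admit(s Signal) bool {
	fp := Fingerprint(s)
	if _, dup := d.seen[fp]; dup {
		return false
	}
	d.seen[fp] = struct{}{}
	return true
}

func main() {
	d := NewDeduper()
	a := Signal{Name: "KubePodCrashLooping",
		Labels: map[string]string{"namespace": "shop", "pod": "cart-0"}}
	b := Signal{Name: "KubePodCrashLooping",
		Labels: map[string]string{"pod": "cart-0", "namespace": "shop"}}
	fmt.Println(d.Admit(a)) // first sighting is admitted
	fmt.Println(d.Admit(b)) // same labels in another order is a duplicate
}
```

A production deduplicator would also need expiry and shared state across replicas; both are left out here to keep the sketch short.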

Architecture

Roadmap

v1.5 — Agentic Integration (current)

MCP Interactive Mode — Human-in-the-loop investigation via MCP tools with lease-based session management, SSE streaming, dynamic takeover, and user impersonation (#703, #823)
API Frontend service — Unified external protocol layer (MCP + A2A) with OIDC authentication, natural language signal intake, and the Kubernaut Console web UI (kubernaut-apifrontend)

Track progress on the v1.5 milestone.

Full roadmap — Agentic Integration (A2A), Collective Intelligence, Fleet Operations (ACM/AAP), and Operational Expansion (cost, security, non-K8s). For past releases, see the CHANGELOG.

Coming in v1.5 — Kubernaut Console: investigate, chat, and remediate from a single UI

Installation

See the Installation Guide for prerequisites, configuration, and deployment instructions.

Documentation

Resource	Link
User & Operator Guide	jordigilh.github.io/kubernaut-docs
Architecture Overview	Architecture
Developer Guide	docs/DEVELOPER_GUIDE.md
Must-Gather Diagnostics	cmd/must-gather/README.md

Related Repositories

Repository	Description
kubernaut-docs	Documentation website (MkDocs Material)
kubernaut-apifrontend	API Frontend — MCP/A2A gateway with OIDC authentication
kubernaut-demo-scenarios	Demo scenarios, scripts, and recordings

Development

make build-all          # Build all services
make test-tier-unit     # Run unit tests
make test-all-gateway   # Run all test tiers for one service (here, gateway)

We use Ginkgo/Gomega BDD for testing and follow a TDD workflow. See the Developer Guide for environment setup, build targets, and test commands.

Contributing

See CONTRIBUTING.md for guidelines. In short: create a feature branch, implement with tests, update docs, and open a PR.

License

Apache License 2.0 — see LICENSE.

Issues: GitHub Issues · Discussions: GitHub Discussions

Kubernaut — From alert to remediation, intelligently.