GitHub - astraverse-io/KubeAstra: AI-powered Kubernetes troubleshooting via chat or MCP

CI License: Apache 2.0 Python 3.11+ Next.js 16 MCP compatible PRs welcome

Your clusters are talking. This assistant helps you listen.

An AI-powered Kubernetes troubleshooting assistant that lets teams investigate, diagnose, and resolve cluster issues through natural language, either via a chat-based web UI or directly inside your IDE (Cursor / Claude Desktop / VS Code via MCP).

It combines live kubectl access with pluggable LLM providers (Gemini, Ollama/local, more coming) for root-cause analysis that turns cryptic Kubernetes failures into clear answers and actionable fix commands.

See it in action

▶ Watch the 90-second demo: Kubeastra walks through 7 real Kubernetes failures (CrashLoopBackOff, OOMKilled, ImagePullBackOff, stuck PVC, unschedulable pod, namespace-wide health, runbook generation).

Want to reproduce it locally? make demo spins up a kind cluster pre-seeded with six broken workloads. See demo/README.md.


Why this exists

Every DevOps engineer has been here: a pod is crashlooping at 2 AM, and you're mentally chaining together kubectl get, kubectl describe, kubectl logs, cross-referencing events, checking resource limits, and Googling error messages, all while half asleep.

This tool handles that investigation loop for you:

  • Ask in plain English: "Why is payment-service crashing in production?"
  • Get root-cause analysis: not just logs, but AI-synthesized explanations of what's wrong and why
  • Receive fix commands: ready to run, with safety confirmations for write operations
  • Generate runbooks, so your team doesn't debug the same issue twice
  • Stay on your own infra: run entirely locally with Ollama, so no data leaves your cluster

Key Features

🔗 Connect Any Cluster in Seconds

Four ways to connect; pick what fits your setup:

| Mode | How it works | Best for |
| --- | --- | --- |
| Auto-detect | Reads your local ~/.kube/config and lists available contexts | Local dev, minikube, kind, Docker Desktop |
| Kubeconfig upload | Paste or upload a kubeconfig file, pick a context | Remote clusters, CI-generated configs |
| SSH | Enter host/user/password; kubectl runs on the remote node over SSH | Air-gapped clusters, bare-metal kubeadm |
| In-cluster | Mounts the ServiceAccount token automatically | When deployed inside the cluster via Helm |

Switch between clusters without restarting. Each session tracks its own connection.

πŸ” 34 Built-in Kubernetes Tools

Live cluster tools (28): pod/deployment/service inspection, event streams, multi-namespace discovery, rollout status, kubeconfig context switching, log retrieval with previous-container support, resource-graph topology, deployment-level investigation, namespace-wide health analysis, and safe write operations (delete, scale, restart, patch; all gated behind confirm=true).

AI analysis tools (6): error analysis with RAG-backed similarity search, curated fix playbooks for 11 error categories, AI-generated runbooks, cluster health reports, and post-incident summarization.
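
The confirm=true gate on write operations can be pictured with a minimal sketch like the one below. It is not the project's code: `ToolResult`, the `recovery_enabled` flag (standing in for ENABLE_RECOVERY_OPERATIONS), and the injectable `runner` are hypothetical names chosen for illustration. The idea is that a write tool only ever returns a preview unless the caller opts in explicitly.

```python
import shlex
from dataclasses import dataclass


@dataclass
class ToolResult:
    executed: bool   # True only when the command was handed to the runner
    command: str     # the kubectl command line, quoted for display
    message: str


def delete_pod(name: str, namespace: str, *, confirm: bool = False,
               recovery_enabled: bool = False, runner=None) -> ToolResult:
    """Hypothetical write tool: previews by default, executes only with confirm=True."""
    argv = ["kubectl", "delete", "pod", name, "-n", namespace]
    command = shlex.join(argv)
    if not recovery_enabled:
        return ToolResult(False, command, "recovery operations are disabled")
    if not confirm:
        return ToolResult(False, command, "preview only; re-run with confirm=True")
    # `runner` stands in for whatever actually executes kubectl
    # (for example a wrapper around subprocess.run).
    output = runner(argv) if runner is not None else ""
    return ToolResult(True, command, output)


preview = delete_pod("web-0", "prod", recovery_enabled=True)
print(preview.executed, preview.command)
```

Keeping the preview and the execution on the same code path means the command shown to the user is the same one that would run after confirmation.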

🤖 Agentic ReAct Investigation

Unlike single-shot "ask → answer" tools, Kubeastra runs a multi-step ReAct loop, reasoning through complex failures autonomously:

You: Why is checkout-service down?

Agent reasoning:
  ✓ find_workload - searching across all namespaces
  ✓ investigate_pod - found CrashLoopBackOff in checkout-svc-7d4f9b
  ✓ get_pods - checking Redis dependency → ConnectionRefused
  ✓ describe_pod - Redis pod Pending: unbound PVC

Root cause: PersistentVolumeClaim redis-data is unbound,
preventing Redis from starting, which cascades to checkout-service.

Each reasoning step is visible in real time via the Investigation Trail, so there is no black box. The agent answers listing questions in a single step and complex debugging questions in 2-3 steps, with a 90-second wall-clock safety timeout.
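
As a rough illustration of the loop described above, here is a minimal sketch of a ReAct-style driver with an iteration cap and a wall-clock deadline. It is an assumption-laden toy, not the project's react.py: the `decide` callable stands in for the LLM call, the decision dictionaries are a made-up shape, and the defaults of 6 iterations and 90 seconds are taken from the limits quoted in this README.

```python
import time
from typing import Callable

# Hypothetical decision shape: {"tool": str, "args": dict} or {"answer": str}.
Decision = dict


def react_loop(question: str,
               decide: Callable[[str, list], Decision],
               tools: dict[str, Callable[..., str]],
               max_iterations: int = 6,
               timeout_seconds: float = 90.0) -> dict:
    """Run think -> act -> observe until the model answers or a limit is hit."""
    deadline = time.monotonic() + timeout_seconds
    trail = []  # visible record of every tool call and its observation
    for _ in range(max_iterations):
        if time.monotonic() >= deadline:
            return {"answer": None, "trail": trail, "stopped": "timeout"}
        decision = decide(question, trail)
        if "answer" in decision:
            return {"answer": decision["answer"], "trail": trail, "stopped": "answered"}
        name = decision["tool"]
        tool = tools.get(name)
        if tool is None:
            observation = f"unknown tool: {name}"
        else:
            observation = tool(**decision.get("args", {}))
        trail.append({"tool": name, "observation": observation})
    return {"answer": None, "trail": trail, "stopped": "max_iterations"}


def scripted_decide(question: str, trail: list) -> Decision:
    """Stand-in for the LLM: call one tool, then answer from its observation."""
    if not trail:
        return {"tool": "get_pods", "args": {"namespace": "shop"}}
    return {"answer": f"saw: {trail[-1]['observation']}"}


result = react_loop(
    "why is checkout down?",
    scripted_decide,
    {"get_pods": lambda namespace: f"redis-0 Pending in {namespace}"},
)
print(result["stopped"], result["answer"])
```

Note that this sketch only checks the deadline between steps; a tool call that hangs would need its own timeout, which is presumably what a per-command setting such as KUBECTL_TIMEOUT_SECONDS is for.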

🔧 One-Click Fix Execution

When the AI identifies a fix, you get a Review & Execute button with the exact commands:

  • Only write operations are suggested (delete pod, rollout restart, scale, patch), never diagnostic commands you've already seen
  • Slide-to-confirm safety gate before any command runs
  • Button disappears after execution, so there are no accidental re-runs
  • When no safe automated fix exists (e.g., "update your Helm values"), the card shows Manual Steps Required with numbered instructions instead

👥 Collaborative Sessions

  • Shareable URLs: click Share to copy a session link (/chat/:sessionId). Anyone with the URL sees the full investigation history, including the root-cause card, fix commands, and evidence.
  • Investigation timeline: every ReAct step (tool call, thought, observation) renders as a real-time timeline, not simulated placeholders.
  • Session not found: invalid or expired shared links show a clear message instead of a blank page.
  • One-click post-mortems: generate a structured post-mortem (summary, timeline, root cause, impact, resolution, action items) from any investigation session via the API.

πŸ—ΊοΈ Visual Debugging Canvas

The resource graph is an interactive investigation surface, not just a topology diagram:

  • Health-aware nodes: pods, services, deployments, and ingresses are colored by health status, with a pulsing red glow for degraded resources
  • Click-to-inspect: click any node to see full metadata in a detail panel
  • Hover tooltips: quick metadata preview (phase, restarts, IP, ports, replicas)
  • Edge labels: relationships visible at a glance ("routes →", "selects →", "manages →")
  • MiniMap + zoom/pan: navigate large cluster topologies with ease

💬 Two Ways to Use It

| Web UI | IDE / MCP Integration |
| --- | --- |
| Chat-based Next.js interface for team-wide troubleshooting | Direct integration into Cursor, Claude Desktop, or any MCP client |
| Connect any cluster (auto-detect, kubeconfig upload, SSH) | Debug without leaving your editor |
| Shareable session URLs with persistent chat history (SQLite) | 34 tools available via stdio or HTTP MCP transport |
| Visual resource graph with click-to-inspect | Same ReAct agent powers both surfaces |

🔌 Pluggable LLM Providers

Pick your LLM: Google Gemini (default, free tier available) or Ollama (fully local, so your cluster data never leaves your network). OpenAI and Anthropic Claude adapters are coming next.

πŸ›‘οΈ Safety First

  • Read-only by default: all kubectl commands are validated before execution
  • Explicit confirmation required for write operations (delete, scale, restart, patch) via slide-to-confirm
  • Full audit logging of every command executed
  • RBAC-aware: respects your existing Kubernetes permissions
  • Input validation: namespace/name/label-selector safety checks prevent injection
  • Session isolation: temp kubeconfig files are scoped per session with 0600 permissions, sanitized session IDs prevent path traversal, and cryptographic session tokens prevent URL guessing
  • Command allowlist: the execute endpoint only accepts specific kubectl write prefixes; everything else is rejected
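
To make the input-validation and allowlist ideas concrete, here is a small sketch under stated assumptions. The prefix list, the set of rejected metacharacters, and the function names are all hypothetical; the project's real validators may be stricter or shaped differently. The namespace pattern follows the Kubernetes rule that namespace names are RFC 1123 labels of at most 63 characters.

```python
import re
import shlex

# Assumed allowlist; the project's real set of write prefixes may differ.
ALLOWED_WRITE_PREFIXES = (
    ("kubectl", "delete", "pod"),
    ("kubectl", "rollout", "restart"),
    ("kubectl", "scale"),
    ("kubectl", "patch"),
)

# RFC 1123 label: lowercase alphanumerics and '-', at most 63 characters,
# starting and ending with an alphanumeric.
DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?")

# Characters that only make sense to a shell; rejected outright in this sketch.
SHELL_METACHARS = set(";|&$`<>\n")


def is_valid_namespace(value: str) -> bool:
    return DNS_LABEL.fullmatch(value) is not None


def parse_write_command(command: str) -> list[str]:
    """Return argv if the command starts with an allowed prefix, else raise ValueError."""
    if any(ch in SHELL_METACHARS for ch in command):
        raise ValueError("shell metacharacters are not allowed")
    argv = shlex.split(command)
    if not any(tuple(argv[:len(prefix)]) == prefix for prefix in ALLOWED_WRITE_PREFIXES):
        raise ValueError("command is not on the write allowlist")
    return argv


print(parse_write_command("kubectl delete pod web-0 -n prod"))
```

The returned argv list is meant to be executed without a shell (for example via subprocess.run with a list argument), so quoting tricks in the original string cannot smuggle in a second command.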

🚀 Deploy Anywhere

  • Local dev: docker-compose one-liner
  • Kind demo cluster: make demo spins up a broken cluster so you can see the tool work in 60 seconds
  • Production Helm chart: deploy into the same clusters it monitors
  • SSH multi-cluster: query any remote kubeadm cluster without copying kubeconfigs

Quick Start

Option 1: Try the demo (60 seconds, no cluster needed)

Prerequisites: Docker Desktop, kind, kubectl

git clone https://github.com/kubeastra/KubeAstra.git
cd KubeAstra
make demo

Spins up a local kind cluster with pre-broken workloads (CrashLoop, OOM, ImagePull, stuck PVC) and launches the web UI.

Open http://localhost:3300 and ask "what's broken in the demo namespace?".

The demo generates its own kubeconfig automatically; it does not touch your host's current kubectl context. See demo/README.md for full prerequisites and troubleshooting.

Option 2: Run locally against your own cluster

Prerequisites: a running Kubernetes cluster with kubectl access, and a Google Gemini API key (free tier) or Ollama running locally.

# 1. Configure the backend
cp ui/backend/.env.example ui/backend/.env
#    → set GEMINI_API_KEY (or LLM_PROVIDER=ollama) in .env

# 2. Start via docker-compose (kubeconfig mounted read-only)
cd ui
docker compose up --build

# 3. Open http://localhost:3300

Option 3: Use via MCP (Cursor / Claude Desktop)

cd mcp
./setup.sh        # creates venv, installs deps, writes MCP config entry

Edit mcp/.env:

GEMINI_API_KEY=your-key-here          # or LLM_PROVIDER=ollama
ALLOWED_NAMESPACES=prod,staging,default

Restart your IDE; all 34 tools then appear as MCP tools.

Option 4: Deploy to Kubernetes via Helm

helm upgrade --install kubeastra helm/kubeastra \
  --namespace kubeastra --create-namespace \
  --set backend.image.repository=ghcr.io/your-org/kubeastra-backend \
  --set frontend.image.repository=ghcr.io/your-org/kubeastra-frontend \
  --set secrets.geminiApiKey="YOUR_KEY" \
  --set secrets.kubeconfig="$(cat ~/.kube/config | base64 | tr -d '\n')"

How It Works

  1. Connect your cluster: auto-detect your local kubeconfig, upload one, or enter SSH credentials. The connection is scoped to your session.
  2. Ask a question: "Why are pods in checkout-service not starting?"
  3. ReAct investigation: the LLM reasons step by step (picks a tool → executes it → observes the result → decides the next action). This continues autonomously (up to 6 iterations, 90-second timeout) until it has enough context to answer. It falls back to a keyword router if no LLM key is set.
  4. Auto-discovery: if you don't specify a namespace, find_workload searches across all namespaces. Large clusters (thousands of pods) are handled efficiently via text-format parsing instead of JSON.
  5. Live investigation: executes read-only kubectl commands against your cluster. Cluster connection flags are injected per session automatically.
  6. AI synthesis: returns a severity-rated root-cause card with metrics, evidence, and either one-click fix commands or manual steps.
  7. Persistence: every message, tool call, and result is saved to SQLite so you can pick up where you left off.
  8. Share and collaborate: share the session URL with your team. They see the full investigation, including the root-cause card, fix commands, and evidence.
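
The text-format parsing mentioned in step 4 could look roughly like the sketch below. It assumes the input is the plain-text output of `kubectl get pods -A --no-headers` (columns NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE); the function names are hypothetical and this is not the project's find_workload implementation.

```python
def parse_pods_table(text: str) -> list[dict]:
    """Parse `kubectl get pods -A --no-headers` style output into dicts."""
    pods = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        namespace, name, ready, status, restarts = fields[:5]
        pods.append({
            "namespace": namespace,
            "name": name,
            "ready": ready,
            "status": status,
            # RESTARTS can look like "12 (2m ago)"; the count is the first token.
            "restarts": int(restarts) if restarts.isdigit() else 0,
        })
    return pods


def match_workloads(pods: list[dict], needle: str) -> list[dict]:
    """Return pods whose name contains the search term, in any namespace."""
    return [pod for pod in pods if needle in pod["name"]]


sample = (
    "shop   checkout-svc-7d4f9b   0/1   CrashLoopBackOff   12 (2m ago)   3h\n"
    "shop   redis-0               0/1   Pending            0             3h\n"
    "infra  mongodb-arbiter-0     0/1   CrashLoopBackOff   360 (1m ago)  9d\n"
)
for pod in match_workloads(parse_pods_table(sample), "checkout"):
    print(pod["namespace"], pod["name"], pod["status"], pod["restarts"])
```

A plausible reason to prefer this over JSON output is size: the text table carries only a handful of columns per pod, whereas the JSON form includes every pod's full spec and status.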

Example Interactions

Quick listing, answered in one tool call (~5 seconds):

You: what pods are in the jenkins namespace?
Astra: Here are the pods in the jenkins namespace.
       ┌─────────────────────────┬──────────┬───────┬──────────┐
       │ Name                    │ Status   │ Ready │ Restarts │
       ├─────────────────────────┼──────────┼───────┼──────────┤
       │ jenkins-0               │ Running  │ 2/2   │ 1        │
       │ avatar-agent-1s7k7      │ Pending  │ 0/0   │ 0        │
       └─────────────────────────┴──────────┴───────┴──────────┘

Deep investigation, multi-step ReAct (~15 seconds):

You: why is mongo arbiter pod in crashloop?

Investigation Trail: 3/3 tools
  ✓ kubectl → pod status retrieved
  ✓ events  → events scanned
  ✓ ai      → analysis complete

┌─ CrashLoopBackOff ──────────────────────── CRITICAL ─┐
│ mongodb-arbiter-0 · infrastructure                   │
│                                                      │
│ The MongoDB arbiter pod is failing to start because  │
│ the designated primary host (mongodb-0) is not       │
│ available. The arbiter's setup process times out.    │
│                                                      │
│ ┌──────────┐ ┌──────────────────┐ ┌───────┐          │
│ │   360    │ │ CrashLoopBackOff │ │ False │          │
│ │ RESTARTS │ │      STATUS      │ │ READY │          │
│ └──────────┘ └──────────────────┘ └───────┘          │
│                                                      │
│ [Review & Execute Fix]                               │
└──────────────────────────────────────────────────────┘

Fix command: kubectl delete pod mongodb-arbiter-0 -n infrastructure

Configuration

All settings are read from environment variables (or .env):

| Variable | Default | Description |
| --- | --- | --- |
| LLM_PROVIDER | gemini | gemini or ollama |
| GEMINI_API_KEY | (none) | Required when LLM_PROVIDER=gemini. A free-tier key is available. |
| GEMINI_MODEL | gemini-2.5-flash | Gemini model to use |
| OLLAMA_BASE_URL | http://localhost:11434 | Ollama server URL |
| OLLAMA_MODEL | llama3.1 | Ollama model name (must be pulled first) |
| ALLOWED_NAMESPACES | * | Comma-separated list, or * for all |
| KUBECTL_TIMEOUT_SECONDS | 15 | Per-command timeout |
| MAX_LOG_TAIL_LINES | 200 | Max log lines per request |
| ENABLE_RECOVERY_OPERATIONS | false | Enables delete_pod, rollout_restart, scale_deployment, apply_patch |
| WEAVIATE_URL | http://localhost:8080 | Optional RAG vector DB |
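
For readers wiring these variables into their own tooling, the sketch below shows one way such settings could be loaded with the defaults listed above. It is an illustration only, not the contents of config/settings.py: the `Settings` class, `load_settings`, and the set of strings treated as true are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


def _as_bool(raw: str) -> bool:
    # Assumed truthy spellings; the project may accept a different set.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    llm_provider: str
    gemini_api_key: Optional[str]
    allowed_namespaces: Tuple[str, ...]  # ("*",) means every namespace
    kubectl_timeout_seconds: int
    max_log_tail_lines: int
    enable_recovery_operations: bool

    def namespace_allowed(self, namespace: str) -> bool:
        return "*" in self.allowed_namespaces or namespace in self.allowed_namespaces


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    raw_namespaces = env.get("ALLOWED_NAMESPACES", "*")
    return Settings(
        llm_provider=env.get("LLM_PROVIDER", "gemini"),
        gemini_api_key=env.get("GEMINI_API_KEY") or None,
        allowed_namespaces=tuple(n.strip() for n in raw_namespaces.split(",") if n.strip()),
        kubectl_timeout_seconds=int(env.get("KUBECTL_TIMEOUT_SECONDS", "15")),
        max_log_tail_lines=int(env.get("MAX_LOG_TAIL_LINES", "200")),
        enable_recovery_operations=_as_bool(env.get("ENABLE_RECOVERY_OPERATIONS", "false")),
    )


settings = load_settings({"ALLOWED_NAMESPACES": "prod, staging,default"})
print(settings.allowed_namespaces, settings.namespace_allowed("kube-system"))
```

Passing the environment in as a mapping keeps the loader easy to exercise without touching the real process environment.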

Repository Layout

kubeastra/
├── ui/
│   ├── frontend/                # Next.js chat UI
│   │   ├── app/chat/            # Chat page + /chat/:sessionId share routes
│   │   │   ├── page-client.tsx  # Main chat component (ReAct rendering, result cards)
│   │   │   └── [sessionId]/     # Shareable session page
│   │   └── components/          # ClusterConnect, ResourceGraph, RootCauseCard, etc.
│   ├── backend/                 # FastAPI app + SQLite persistence
│   │   ├── routers/
│   │   │   ├── chat.py          # Chat flow, tool dispatch, fix execution
│   │   │   ├── cluster.py       # Cluster connection management (4 modes)
│   │   │   └── sessions.py      # History, SSH targets, post-mortem API
│   │   ├── react.py             # ReAct loop orchestrator
│   │   └── db.py                # SQLite with cluster_connections table
│   └── docker-compose.yml
├── mcp/
│   ├── mcp_server/              # MCP server (stdio + HTTP transports)
│   ├── k8s/                     # kubectl wrappers, SSH runner, validators
│   ├── ai_tools/                # Error analysis, fix playbooks, runbooks
│   ├── services/                # LLM providers, Weaviate, embeddings
│   └── config/settings.py
├── helm/kubeastra/              # Helm chart
├── demo/                        # Kind + broken workloads for `make demo`
└── docs/                        # Public documentation

Roadmap

  • Gemini + Ollama (local) LLM support
  • Demo mode with kind cluster
  • Approval flow for write operations
  • Deployment-level investigation (investigate_workload)
  • Namespace-wide health analysis (analyze_namespace)
  • Agentic ReAct investigation loop (multi-step tool calling)
  • Shareable session URLs + investigation timeline
  • Auto-generated post-mortems from investigation sessions
  • Visual debugging canvas (interactive resource graph with health glow, click-to-inspect, tooltips, MiniMap)
  • Multi-modal cluster connection (auto-detect, kubeconfig upload, SSH, in-cluster)
  • One-click fix execution with safety guards and slide-to-confirm
  • Manual steps fallback when no automated fix is available
  • Large cluster support (text-format parsing for all-namespaces queries)
  • Session security hardening (path traversal prevention, cryptographic session IDs, command allowlists)
  • Team playbook engine: investigation templates that codify debugging patterns
  • Alert-driven auto-investigation (PagerDuty / OpsGenie / Alertmanager webhooks)
  • OpenAI + Anthropic Claude adapters
  • Prometheus / Loki / Tempo observability integrations
  • "What changed?" view: recent deployments, ConfigMap/Secret mutations
  • Real-time collaborative sessions (WebSocket sync + presence indicators)
  • Slack bot integration (alert → investigation → findings in channel)
  • CNCF Sandbox submission

Contributing

Contributions are welcome, especially the items at the top of the roadmap. See CONTRIBUTING.md for local setup, project layout, and how to add a new tool, and CODE_OF_CONDUCT.md for community guidelines.

Looking for a starter task? Check the good first issue label.


License

Apache 2.0; see LICENSE for details.