# nvsx · NVSentinel operator control plane
Setup. Runbooks. Auto-trigger. Three modes, one tool.
NVSentinel handles GPU fault detection and the cordon/drain/reboot chain automatically. What it doesn't do is the operator wrapper around that chain — paging on-call, updating Jira, posting to Slack, recording MTTR. That's where your existing runbooks live today.
nvsx is the thin control plane that plugs your runbooks into NVSentinel:
| Mode | What it does | Command |
|---|---|---|
| Setup | Verify cluster, detect NVSentinel, scaffold starter runbooks | `nvsx setup` |
| Run | Execute a runbook manually (TTY → cinematic, pipe → plain) | `nvsx run <runbook>` or `nvsx` (shell) |
| Serve | Auto-trigger runbooks from webhooks or by polling NVSentinel | `nvsx serve` with `--mode webhook` or `--mode poll` |
Plus utilities: `list`, `show`, `doctor`, `init`, `convert`.
## Install
### Clone and run (recommended for operators)
```bash
git clone https://github.com/nav-beep/nvsx
cd nvsx
./nvsx --help   # first run auto-creates a local .venv/
```
### Via pip
```bash
pip install nvsx
nvsx --help
```
With the `convert` extra (LLM-powered runbook conversion via Claude):

```bash
pip install 'nvsx[convert]'
```

## Quickstart
```bash
./nvsx setup    # one-time wizard: check cluster, scaffold starter runbooks
./nvsx doctor   # verify readiness
./nvsx list     # see installed runbooks
./nvsx          # drop into the operator shell
```
## The three modes
### 1. `nvsx setup` — first-run wizard
Run once when you install nvsx against a new cluster. Walks through:
- kubectl context check
- NVSentinel install verification (fault-quarantine, gpu-health-monitor, mongodb)
- GPU node discovery
- Scaffold starter runbooks for common faults (XID 79, ECC, NVLink, thermal, driver hang, InfoROM)
Nothing is written or applied without explicit confirmation.
### 2. `nvsx run` — manual execution
```bash
# Shell (default when you just type `nvsx`)
nvsx
nvsx> list
nvsx> use gpu-off-bus
nvsx (gpu-off-bus)> run
nvsx (gpu-off-bus)> status
nvsx (gpu-off-bus)> quit

# Direct CLI
nvsx run gpu-off-bus
nvsx run gpu-off-bus --target-node gke-t4-pool-abc
nvsx run gpu-off-bus --dry-run
nvsx run gpu-off-bus --plain    # force plain output
```
Rendering auto-adapts: cinematic flight-deck on a TTY, plain JSONL stdout when piped (CI-friendly). Same engine either way.
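In CI that means you can capture and query a run without special casing — a minimal sketch, assuming `status` and `stage` fields in the JSONL event schema (the exact field names aren't documented here, so treat them as placeholders):

```bash
# CI has no TTY, so output is already plain JSONL; archive it and scan it.
# The "status"/"stage" field names are assumptions about the event schema.
./nvsx run gpu-off-bus | tee run.jsonl
jq -r 'select(.status? == "failed") | .stage?' run.jsonl
```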
### 3. `nvsx serve` — auto-trigger daemon
Two modes, selected by `--mode`:
#### `webhook` (default) — HTTP endpoint
Any incident system POSTs a JSON payload; nvsx fires the matching runbook.
```bash
nvsx serve --mode webhook --host 0.0.0.0 --port 8080
```
```bash
curl -X POST http://localhost:8080/webhook \
  -H 'Content-Type: application/json' \
  -d '{
    "runbook": "gpu-off-bus",
    "target_node": "gke-ml-pool-t4-abc1",
    "source": "pagerduty"
  }'
# → {"status":"fired","runbook":"gpu-off-bus"}
Wire PagerDuty, Opsgenie, AlertManager, or a custom script to POST to this endpoint.
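If your alert source doesn't emit this payload natively, a thin shim is enough. A sketch, assuming an Alertmanager-style payload saved to `alerts.json` with the node name in a `node` label — the label name and the fixed runbook routing are both assumptions you'd adapt:

```bash
# Translate an Alertmanager-style payload into the nvsx webhook payload above.
node=$(jq -r '.alerts[0].labels.node' alerts.json)
curl -s -X POST http://localhost:8080/webhook \
  -H 'Content-Type: application/json' \
  -d "{\"runbook\":\"gpu-off-bus\",\"target_node\":\"${node}\",\"source\":\"alertmanager\"}"
```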
#### `poll` — watches NVSentinel's MongoDB
```bash
nvsx serve --mode poll --poll-interval 10
```
Queries the HealthEvents collection every N seconds for new fatal events, matches them to runbooks (by tag or by detect-stage condition type), and fires the runbook on a background thread.
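To eyeball what the poller sees, you can query the collection directly — a sketch assuming a typical in-cluster install with a `mongodb-0` pod and a `nvsentinel` database in the `nvsentinel` namespace (all three names are assumptions; only the `HealthEvents` collection name comes from NVSentinel):

```bash
# Count HealthEvents documents to confirm the poller has data to read.
kubectl exec -n nvsentinel mongodb-0 -- \
  mongosh --quiet --eval 'db.getSiblingDB("nvsentinel").HealthEvents.countDocuments({})'
```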
Both modes run indefinitely until Ctrl-C.
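If you run the daemon on a bastion host rather than in-cluster, the simplest way to keep it alive past logout (use a real init system for anything production-grade) is:

```bash
# Quick-and-dirty daemonization; logs go to serve.log.
nohup ./nvsx serve --mode poll --poll-interval 10 >> serve.log 2>&1 &
```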
## Utility commands
| Command | Purpose |
|---|---|
| `nvsx list` | Table of installed runbooks (name, nickname, title, tags) |
| `nvsx show <runbook>` | Pretty-print a runbook's stages, watch clauses, hooks |
| `nvsx doctor` | Cluster + NVSentinel readiness check |
| `nvsx init <slug>` | Scaffold a new runbook: YAML + hook directory + 3 hook stubs |
| `nvsx convert <file.md>` | Claude-powered conversion of a Markdown runbook → nvsx YAML + hooks |
## What is a runbook?
A YAML file in `runbooks/` that describes which NVSentinel events to observe and what operator hooks to fire at each stage.
```yaml
apiVersion: nvsx/v1
kind: Runbook
metadata:
  name: gpu-off-bus
  nickname: rogue-moose
  title: "GPU fell off bus → self-heal"
  tags: [infra, nvsentinel, xid, remediation]
  estimatedDuration: 90s
prerequisites:
  - name: nvsentinel-control-plane
    check: "kubectl get pods -n nvsentinel ..."
    expect: "Running"
stages:
  - id: preflight
    hook: hooks/gpu-off-bus/preflight.sh
  - id: detect
    watch:
      - kind: node-condition
        type: GpuPcieWatch
        status: "True"
  - id: remediate
    watch:
      - kind: crd
        group: janitor.dgxc.nvidia.com
        resource: rebootnodes
    hook: hooks/gpu-off-bus/on-remediate.sh
  - id: recover
    watch:
      - kind: node-condition
        type: GpuPcieWatch
        status: "False"
    hook: hooks/gpu-off-bus/on-recover.sh

# preflight · detect · quarantine · drain · remediate · recover · postmortem
# are the canonical stages. Use all of them; some may be no-ops in your case.
```
Canonical stages — every runbook uses these IDs for cross-runbook consistency:
| Stage | Purpose |
|---|---|
| `preflight` | Validate prerequisites; fire "incident acknowledged" hook |
| `detect` | Wait for an NVSentinel condition / event |
| `quarantine` | Observe cordon + taint |
| `drain` | Observe pod eviction |
| `remediate` | Observe NVSentinel's remediation CRD creation |
| `recover` | Observe return to healthy state |
| `postmortem` | Collect artifacts (always runs, even on failure) |
Plus two optional stages for drill/test runbooks:
| Stage | Purpose |
|---|---|
| `baseline` | Pre-fault observation (useful for drills; skipped in production) |
| `inject` | Fault-injection action (drill only — never in production runbooks) |
Full schema: `runbooks/README.md`.
## Hook scripts hold YOUR operational bash
Every runbook can fire hook scripts at stage boundaries:
```bash
#!/usr/bin/env bash
# runbooks/hooks/gpu-off-bus/on-remediate.sh
# Fires when NVSentinel creates the RebootNode CRD.
set -euo pipefail

curl -s -X POST "$SLACK_WEBHOOK_URL" \
  -H 'Content-type: application/json' \
  -d "{\"text\":\":warning: GPU fault on ${NVSX_TARGET_NODE}\"}"

# Open a Jira ticket, page oncall, emit a metric — whatever your team does.
```
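If your team lives in Jira instead, the same pattern applies — a hedged sketch where `JIRA_URL`, `JIRA_USER`, `JIRA_TOKEN`, and the `OPS` project key are placeholders you'd supply; the endpoint itself is the standard Jira REST issue-create call:

```bash
#!/usr/bin/env bash
# Hypothetical on-remediate hook variant: open a Jira incident instead of
# (or alongside) the Slack post above.
set -euo pipefail

curl -s -u "${JIRA_USER}:${JIRA_TOKEN}" -X POST "${JIRA_URL}/rest/api/2/issue" \
  -H 'Content-Type: application/json' \
  -d "{\"fields\":{\"project\":{\"key\":\"OPS\"},\"issuetype\":{\"name\":\"Incident\"},\"summary\":\"GPU fault on ${NVSX_TARGET_NODE} (runbook: ${NVSX_RUNBOOK})\"}}"
```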
Env vars available to hooks:
| Var | Description |
|---|---|
| `NVSX_STAGE` | Stage id (`preflight`, `detect`, ...) |
| `NVSX_RUNBOOK` | Runbook name |
| `NVSX_TARGET_NODE` | The node the runbook is acting on (auto-detected or passed via `--target-node`) |
| `NVSX_ELAPSED_MS` | Milliseconds since stage began |
| `NVSX_PROJECT_ROOT` | Absolute path to the nvsx project root |
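Together these make simple MTTR accounting nearly free — a minimal sketch of a postmortem hook, where the `artifacts/mttr.log` path is an assumption of this example, not an nvsx convention:

```bash
#!/usr/bin/env bash
# Hypothetical postmortem hook: append one line per run for MTTR tracking.
set -euo pipefail

mkdir -p "${NVSX_PROJECT_ROOT}/artifacts"
echo "$(date -u +%FT%TZ) runbook=${NVSX_RUNBOOK} node=${NVSX_TARGET_NODE} stage=${NVSX_STAGE} elapsed_ms=${NVSX_ELAPSED_MS}" \
  >> "${NVSX_PROJECT_ROOT}/artifacts/mttr.log"
```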
## Converting a legacy runbook
If you have a Markdown / Confluence runbook that covers manual cordon/drain/reboot + Slack/PagerDuty/Jira steps:
```bash
export ANTHROPIC_API_KEY=sk-ant-...
nvsx convert path/to/your-runbook.md
```

Claude splits it into two buckets:
- Core remediation (kubectl cordon/drain/reboot) → deleted. NVSentinel owns these now. They become passive `watch:` clauses.
- Operational wrapping (Slack / PagerDuty / Jira / MTTR) → preserved verbatim in hook scripts.
Output: a new `runbooks/<slug>.yaml` + three populated hook scripts, validated against the nvsx schema before writing.

Sample input: `examples/sample-runbook.md`.
## Architecture
```
┌────────────────────────────┐
│ YOU (operator) │
│ nvsx / nvsx run / … │
└─────────────┬──────────────┘
│
┌─────────────────────▼──────────────────────┐
│ nvsx core │
│ ─ runner.py stage loop + hook lifecycle │
│ ─ watcher.py kubectl/mongo pollers │
│ ─ render.py Cinematic / Plain output │
│ ─ schema.py YAML validation │
└─────────────────────┬──────────────────────┘
│
┌───────────────┴────────────────┐
│ │
┌──────────▼──────────┐ ┌────────────▼────────────┐
│ nvsx serve │ │ nvsx run / shell │
│ webhook/poll daemon│ │ manual execution │
└──────────┬──────────┘ └────────────┬────────────┘
│ │
┌──────────────▼────────────────────────────────▼────────────┐
│ YOUR CLUSTER (NVSentinel runs here) │
│ │
│ gpu-health-monitor ─► MongoDB HealthEvents │
│ syslog-health-monitor │
│ fault-quarantine ─► cordons nodes │
│ node-drainer ─► evicts pods │
│ fault-remediation ─► RebootNode / GPUReset CRDs │
│ janitor ─► executes CRDs │
└────────────────────────────────────────────────────────────┘
```
Key invariant: nvsx never writes to the cluster except through a runbook's declared `action.script`. Cordoning, taints, and CRDs are NVSentinel's job — nvsx observes and orchestrates.
## Repo layout
```
nvsx/
├── nvsx # bash launcher (clone-and-run)
├── pyproject.toml # also installable via pip
├── src/nvsx/ # Python package
│ ├── cli.py # Typer dispatcher
│ ├── runner.py # stage loop + hook lifecycle
│ ├── watcher.py # kubectl/mongo pollers (8 kinds)
│ ├── render.py # Cinematic + Plain renderers
│ ├── schema.py # pydantic Runbook model
│ ├── doctor.py # readiness checks
│ ├── repl.py # `nvsx` / `nvsx shell`
│ ├── server.py # `nvsx serve` (webhook + poll)
│ ├── setup.py # `nvsx setup`
│ ├── scaffolder.py # `nvsx init`
│ ├── converter.py # `nvsx convert` (Claude)
│ ├── aliases.py # deterministic node aliases
│ └── presets.py # colors / icons
├── runbooks/
│ ├── gpu-off-bus.yaml # flagship · #rogue-moose
│ ├── thermal-throttle.yaml # stub · #sleepy-panda
│ ├── README.md # schema + how-to-write
│ └── hooks/<runbook>/ # per-runbook hook scripts
├── scripts/ # helpers runbooks call
│ ├── collect-metrics.sh # postmortem artifact collector
│ └── port-forward-all.sh # grafana/prometheus UI helper
└── examples/
    └── sample-runbook.md # input for `nvsx convert`
```
## Node aliases
`gke-nav-gpu-cluster-t4-pool-8ksx` is no one's idea of a memorable name. nvsx deterministically maps every node to an adjective-animal alias — `brave-gazelle`, `rogue-moose`, `sleepy-panda` — using SHA-256 + a 4096-combination word list. The same node always gets the same alias, and the alias is shown next to the raw name everywhere.
"NVSentinel just cordoned brave-gazelle" is a lot easier to track over Zoom than reading out an 11-character GKE suffix.
## Requirements
- Python 3.11+
- `kubectl` on `$PATH`, with a context pointing at a cluster running NVSentinel
- Optional: `uv` (the launcher prefers it over `pip` if present, for faster setup)
- Optional: `ANTHROPIC_API_KEY` (for `nvsx convert`)
## Explicitly out of scope
- Policy management — no editing NVSentinel CEL rules or Helm values. That's GitOps territory.
- Replacing NVSentinel components — no fault detection, no cordon logic, no remediation. NVSentinel always wins.
- Web dashboard — Grafana exists. The TUI is the UI.
- Multi-cluster / remote execution — assumes current `kubectl` context, single cluster.
- Helm-style YAML templating — only fixed narration vars (`{{elapsed}}`, `{{targetNode}}`, etc.). For config, pass env vars to hook scripts (see the sketch below).
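Concretely, configuration flows through the environment rather than templating — e.g. the Slack hook earlier reads `$SLACK_WEBHOOK_URL` at run time:

```bash
# Pass hook configuration as environment variables at invocation time.
SLACK_WEBHOOK_URL='https://hooks.slack.com/services/...' ./nvsx run gpu-off-bus
```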
## Contributing
Issues and PRs welcome — especially new runbooks. The easiest way to contribute one:
```bash
nvsx init my-fault-scenario
# fill in the YAML + hooks
git add runbooks/my-fault-scenario.yaml runbooks/hooks/my-fault-scenario/
git commit -m "runbook: my-fault-scenario"
```
Keep runbooks under ~150 lines. Put logic in hooks, not YAML.
## License
MIT. See LICENSE.
MIT is a permissive open-source license: anyone can use, modify, distribute, or sell nvsx — including in commercial or closed-source products — as long as they include the copyright notice and license text. No warranty, no restrictions beyond attribution. It's the standard "use this freely, don't sue us if it breaks" license.
## Related
- NVIDIA NVSentinel — the fault detection + remediation system nvsx sits on top of
- NVIDIA DCGM — the GPU telemetry backbone NVSentinel reads
Built on `rich`, `typer`, `pydantic`, and `anthropic-sdk-python`.