NVSentinel: Nvidia's open-source GPU resilience system for Kubernetes

github.com

3 points by mchmarny a month ago · 1 comment

mchmarnyOP a month ago

Keeping a GPU cluster healthy at scale isn't just a "nice to have"; it's the difference between seamless training and a nightmare of idle nodes. That's why we built NVSentinel, our open-source system designed to detect, classify, and auto-remediate hardware and software faults across Kubernetes nodes and NVSwitches.
