Silent Data Corruptions: The Boogeyman of LLM Training

adept.ai

31 points by jmintz 2 years ago · 5 comments

auraham 2 years ago

Interesting post. It would be much better if the author included a few code snippets to show how to identify the failing GPU during training.

ejro 2 years ago

Interesting. This is probably a universal problem for large model training, but it is not being discussed enough.

adeptlo 2 years ago

Super interesting problem that's affecting more people than they probably realize.

osavant 2 years ago

Super interesting, thanks for putting this together.

ibeitia 2 years ago

Fascinating read!
