Why don’t PCs use error correcting RAM? “Because Intel,” says Linus


This Monday, Linux kernel creator Linus Torvalds went on a frustrated rant about the lack of Error Correcting Code (ECC) RAM in consumer PCs and laptops.

… the misguided and arse-backwards policy of “consumers don’t need ECC”, [made] the market for ECC memory go away.

The arguments against ECC were always complete and utter garbage. Now even the memory manufacturers are starting to do ECC internally because they finally owned up to the fact that they absolutely have to.

If you’re not familiar with ECC RAM, it’s probably because you don’t build or spec dedicated servers using server-grade CPUs and motherboards—which, unfortunately, is about the only place you actually find ECC. In a nutshell, ECC RAM includes a tiny amount of extra memory used for detection and correction of errors.

Memory errors and probability

In most modern implementations, this means for every 64-bit word stored in RAM, there are eight checking bits. A single bit error—a 0 flipped to 1, or a 1 flipped to 0—can be both detected and corrected automatically. Two bits flipped in the same word can be detected but not corrected. Three or more bits flipped in the same word will probably be detected, but detection is not guaranteed.
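To make the correct-one, detect-two behavior concrete, here is a minimal Python sketch of an extended Hamming (72,64) code: 64 data bits, seven Hamming parity bits, and one overall parity bit. This is an illustration under stated assumptions, not a description of any particular memory controller; real ECC hardware often uses differently constructed codes, and the names `encode`, `decode`, and `syndrome` are hypothetical.

```python
from typing import List, Optional, Tuple

CODE_LEN = 72                          # 64 data bits + 8 check bits
PARITY_POS = [1, 2, 4, 8, 16, 32, 64]  # Hamming parity bits sit at power-of-two slots
DATA_POS = [p for p in range(1, CODE_LEN) if p not in PARITY_POS]  # 64 data slots
# Slot 0 holds the overall parity bit (the eighth check bit).


def syndrome(code: List[int]) -> int:
    """XOR the positions of all set bits in slots 1..71; 0 for a valid codeword."""
    s = 0
    for pos in range(1, CODE_LEN):
        if code[pos]:
            s ^= pos
    return s


def encode(word: int) -> List[int]:
    """Spread a 64-bit word over the data slots and fill in the 8 check bits."""
    code = [0] * CODE_LEN
    for i, pos in enumerate(DATA_POS):
        code[pos] = (word >> i) & 1
    s = syndrome(code)                 # which parity bits must be 1 to zero the syndrome
    for p in PARITY_POS:
        if s & p:
            code[p] = 1
    code[0] = sum(code) % 2            # make the total number of 1s even
    return code


def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, word); word is None when the error cannot be corrected."""
    code = list(code)
    s = syndrome(code)
    odd = sum(code) % 2                # 1 if overall parity is violated
    if odd:
        # Odd number of flips: treat it as a single flip at slot s.
        if s >= CODE_LEN:
            return "uncorrectable", None   # syndrome points outside the codeword
        code[s] ^= 1                       # s == 0 means the overall parity bit flipped
        status = "corrected"
    elif s:
        # Even number of flips with a nonzero syndrome: a double-bit error.
        return "uncorrectable", None
    else:
        status = "ok"
    word = 0
    for i, pos in enumerate(DATA_POS):
        word |= code[pos] << i
    return status, word
```

Flipping any one of the 72 bits in `encode(word)` yields `("corrected", word)` from `decode`, and flipping any two yields `"uncorrectable"`. With three flips, this sketch either reports the word as uncorrectable or silently "corrects" the wrong bit, which is why detection beyond two flipped bits is not guaranteed.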

Bit flips can happen for many reasons, beginning with cosmic-ray impact or simple hardware failure. A large-scale study of Google servers found that roughly 32 percent of all servers (and 8 percent of all DIMMs) in Google’s fleet experience at least one memory error per year. But the vast majority of these are single-bit errors—and since Google is using server CPUs and ECC RAM, this means the machines in question keep right on trucking.

In consumer machines, even these single-bit errors—which are over 40 times more likely to occur than multiple-bit errors, according to Google’s data—go undetected and can introduce instability into systems and corruption into data.
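For a rough feel of what those figures imply, the back-of-the-envelope sketch below turns them into per-machine odds. It rests on two loud assumptions that the Google data does not support directly: that the 8 percent per-DIMM annual rate carries over to consumer memory, and that errors on different DIMMs are independent. The function names are hypothetical.

```python
def p_any_error(p_per_dimm: float, dimms: int) -> float:
    """Chance that at least one of `dimms` modules sees an error in a year.

    Assumes each DIMM fails independently with the same probability.
    """
    return 1.0 - (1.0 - p_per_dimm) ** dimms


def single_bit_share(ratio: float) -> float:
    """Fraction of all errors that are single-bit, given a single:multi ratio."""
    return ratio / (ratio + 1.0)


if __name__ == "__main__":
    # 0.08 is the per-DIMM annual figure quoted above; the DIMM counts are illustrative.
    for n in (1, 2, 4):
        print(f"{n} DIMM(s): {p_any_error(0.08, n):.1%} chance of a memory error per year")
    # A 40:1 ratio is the lower bound quoted above for single-bit vs. multiple-bit errors.
    print(f"single-bit share of errors: {single_bit_share(40):.1%}")
```

Under those assumptions, the overwhelming majority of errors a consumer machine encounters would be exactly the single-bit kind that ECC corrects transparently.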

Bit flips aren’t always accidental

Not every RAM error is the result of a hardware failure or unintentional EMF problem. In recent years, researchers have developed increasingly practical physics-based side-channel attacks. These attacks use controlled, rapid bit flips in areas of RAM accessible to one application to deduce or modify data in adjacent areas of RAM that the application shouldn’t be able to reach.