Update 7/7/2023 8:15am PT: Intel has developed a firmware fix for the issue and has resumed shipments of the impacted Xeon models.
Original Article:

Intel's oft-delayed Sapphire Rapids processors are built on two types of underlying designs: the XCC package, which combines four compute tiles (dies) into a single chip, and the MCC package, which uses a single monolithic die. As shown in the slides above, the MCC design is used for chips with up to 32 cores, which are the source of high-volume sales for Intel, while the XCC variants are used for the halo chips with between 36 and 60 cores.
"Intel has faced another crop of design issues related to Sapphire Rapids MCC, the highest volume version of Sapphire Rapids. The 2-socket and 4-socket SKUs have paused shipments due to a timing issue since mid-June," Patel said.
Intel hasn't confirmed that the issue is confined to dual- and quad-socket SKUs, instead classifying it as limited to a 'subset' of the SKUs, and hasn't stated when the pause in shipments began. Intel also hasn't confirmed Patel's assertion that the bug is timing-related, or given us any clarification on the nature of the issue.
A timing issue could stem from any number of sources, ranging from the UPI interconnect to instruction timing, so the true nature of the bug remains nebulous for now. We do know that Intel can correct the issue with a firmware fix, which apparently is still in validation, so the issue will not require a redesign or a new revision/stepping. Additionally, since new firmware is an adequate fix, Intel might not be required to replace any processors already in the field, although the update could pose a validation headache for its customers.
Intel has earned plenty of criticism not only for its missteps on process node tech for the oft-delayed Sapphire Rapids, but also for the issues in its design and validation methodology that led to further delays and numerous new steppings (a stepping is a typically minor redesign that requires a new version of the silicon to correct an issue). Sapphire Rapids has been dogged by rumors that design and verification missteps led to 12 steppings for some configurations, an unusually large number given that most chips see three steppings at most. Naturally, the extra steppings led to severe production delays and missed launch dates.
The company has since communicated that it plans to take a different approach to its design, simulation, and validation flow that will correct those issues. Intel says those adjustments will kick in fully in the next generation of Emerald Rapids Xeon processors.
Intel says this new Sapphire Rapids bug wasn't encountered while "running commercially available software" (perhaps it surfaced in a hyperscaler's custom application), and it obviously wasn't caught during validation. This type of situation isn't entirely unheard of: nearly all complex chips ship with errata and bugs, both known and not yet discovered, and these are addressed with firmware, driver, and software workarounds that reduce or eliminate their impact. That's the very nature of modern semiconductor design and production.
For example, Intel's Skylake generation of processors shipped with 53 known errata, and six months later, Intel listed another 40 errata. Another example is the recent discovery that AMD's EPYC Rome chips crash after 1,044 days of uptime. Some bugs are simply left unfixed, as they aren't deemed critical enough to fix, or they are fixed with a combination of firmware and software. The most critical bugs sometimes require a new stepping to correct, which is the worst-case scenario. Luckily for Intel, that doesn't seem to be the case here.
However, while bugs aren't uncommon, it is uncommon for them to lead to a halt in shipments, implying that this is more than a garden-variety erratum. Intel hasn't clarified when it plans to resume shipments of its Sapphire Rapids MCC chips, but we'll update our coverage as we learn more.