News Intel's Sapphire Rapids Had 500 Bugs, Launch Window Moves Further

Given that Sapphire Rapids has Golden Cove cores, just like Alder Lake, if there are indeed 500 (known) bugs in the Sapphire Rapids silicon, are any of those bugs present in Alder Lake's version of Golden Cove?

Is there any way to test whether the bugs in Sapphire Rapids are also present in Alder Lake and the upcoming Raptor Lake?

Sapphire's volume ramp has been anything but Rapid.
 
Well, it sounds catastrophic, but when you're talking about the server space, where systems are measured in millions of dollars, squashing as many bugs as possible pre-release is a far better option than releasing a product and then possibly having to sacrifice performance to mitigate them.
 
Given that Sapphire Rapids has Golden Cove cores, just like Alder Lake, if there are indeed 500 (known) bugs in the Sapphire Rapids silicon, are any of those bugs present in Alder Lake's version of Golden Cove?

Is there any way to test whether the bugs in Sapphire Rapids are also present in Alder Lake and the upcoming Raptor Lake?

Sapphire's volume ramp has been anything but Rapid.
The bugs may be in features specific to data centers: Optane DIMMs, ECC, and some accelerators that make good selling points. Consumers don't care about those.
 
Well, it sounds catastrophic, but when you're talking about the server space, where systems are measured in millions of dollars, squashing as many bugs as possible pre-release is a far better option than releasing a product and then possibly having to sacrifice performance to mitigate them.
So is delaying your product for years. Intel lost so many customers as they waited endlessly for something new.
 
Given that Sapphire Rapids has Golden Cove cores, just like Alder Lake, if there are indeed 500 (known) bugs in the Sapphire Rapids silicon, are any of those bugs present in Alder Lake's version of Golden Cove?

Is there any way to test whether the bugs in Sapphire Rapids are also present in Alder Lake and the upcoming Raptor Lake?

Sapphire's volume ramp has been anything but Rapid.

Definitely. Bugs exist in all CPUs, whether past, present, or future. No CPU is perfect. Bugs exist in GPUs too.

You just need to download the vendors' technical documents to see what the bugs are. Both Intel and AMD have such documents available on their websites. They are commonly referred to as errata.

Some bugs can be corrected in later 'steppings', some via microcode (delivered through a BIOS update). Some cannot be corrected, so programmers must avoid them. Sometimes a workaround is also given in these documents.
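For anyone who wants to match their own chip against an errata sheet, here is a minimal sketch, assuming a Linux system where `/proc/cpuinfo` exposes the usual `cpu family`, `model`, `stepping` and `microcode` fields. The helper name `cpu_signature` is made up for this post. It only reports which stepping and microcode revision you are running; it cannot detect any bug by itself, and you still have to look those values up in the vendor's documents.

```python
from pathlib import Path


def cpu_signature(cpuinfo_text: str) -> dict:
    """Return identification fields for the first logical CPU listed in
    /proc/cpuinfo-style text (hypothetical helper, Linux field names assumed)."""
    wanted = {"vendor_id", "cpu family", "model", "model name",
              "stepping", "microcode"}
    sig = {}
    for line in cpuinfo_text.splitlines():
        if not line.strip():
            if sig:
                break  # a blank line ends the first processor block
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        if key in wanted:
            sig[key] = value.strip()
    return sig


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    if path.exists():
        # Compare these values with the stepping tables in the errata document.
        print(cpu_signature(path.read_text()))
    else:
        print("/proc/cpuinfo not found (not a Linux system?)")
```

The stepping and microcode values are what the errata tables are usually keyed on, so that is the part worth checking first.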
 
So is delaying your product for years. Intel lost so many customers as they waited endlessly for something new.

Not really. Most of these customers are corporates instead of end-users. They don't wait. They simply buy what's available in the market. Only very niche products like supercomputers will need to wait because they are designed around the processor.

Also, Sapphire Rapids chips are just "optimized" processors that integrate many features into a single processor. Current CPU/GPU combos can also perform the same tasks, just not as efficiently.
 
For a project of this size, 500 'bugs' is no surprise. There is a good chance that 80% of the bugs are performance or manufacturing related. Most of the bugs are probably related to new logic and instructions. Of course, even a single bug in the wrong place can be a huge disaster.
 
This comes at a horrible time for Intel, because their current product (Ice Lake SP) was barely competitive at launch (which was years late) and has no chance against AMD's Genoa.

Given that Sapphire Rapids has Golden Cove cores, just like Alder Lake, if there are indeed 500 (known) bugs in the Sapphire Rapids silicon, are any of those bugs present in Alder Lake's version of Golden Cove?
Precisely because Golden Cove is already on the market, I think we can say most of the show-stopper bugs aren't in the cores, themselves. And the ones in the cores would be related to features disabled in consumer versions, anyhow (like AVX-512, perhaps).

Sapphire's volume ramp has been anything but Rapid.
🤣
I like it! Intel really opened themselves up to that one!

The bugs may be in features specific to data centers: Optane DIMMs, ECC, and some accelerators that make good selling points. Consumers don't care about those.
Agreed... except not really about ECC, because Alder Lake supports ECC when used in a motherboard with a W680 chipset. That said, servers typically feature advanced ECC modes you don't get in entry-level workstations.

As for datacenter features, the article correctly notes these CPUs introduce AMX, DSA, and CXL. These are each big, new, and complex features. To pick on AMX, Intel's software support for it was so late that I could imagine they were delayed in getting their hardware validation tests in place for the initial steppings of the CPU.

CXL is probably not easy to test, due to the lack of CXL devices on the market.

Not really. Most of these customers are corporates instead of end-users. They don't wait. They simply buy what's available in the market.
I don't know about that. "corporates" is a very broad category. Many business customers will have some leeway in deciding when to decommission and replace old machines.

Also, Sapphire Rapids chips are just "optimized" processors that integrate many features into a single processor. Current CPU/GPU combos can also perform the same tasks, just not as efficiently.
Uh, that's probably over-simplifying it. They actually lack some features of the consumer CPUs, such as the iGPU, media block, display controller, GNA, and E-cores. On the other hand, they have AVX-512 (i.e. supported, tested, and working - Intel said AVX-512 in Alder Lake hadn't been tested, even though it seemed to work with some BIOSes) and certain RAS (Reliability, Availability, and Serviceability) features, a different interconnect for enabling cores to communicate with each other, a different last-level cache, an inter-processor communications link, plus the DSA (Data Streaming Accelerator) and AMX (Advanced Matrix eXtensions) that I already mentioned. While there's some functional overlap between AMX and GPUs, the way AMX works is very different. DSA is a programmable engine for moving and manipulating data streams, which does in hardware what a consumer CPU would have to do in software.

In short, these are a different animal. They're not simply a gluing together of CPUs and GPUs, but rather a complex system of richly-featured cores and special-function units that need to interact in complex ways to satisfy all the needs and demands of modern server users. And that might be their undoing - trying to build one CPU architecture that can be everything to everyone.
 
Given that Sapphire Rapids has Golden Cove cores, just like Alder Lake, if there are indeed 500 (known) bugs in the Sapphire Rapids silicon, are any of those bugs present in Alder Lake's version of Golden Cove?

Is there any way to test whether the bugs in Sapphire Rapids are also present in Alder Lake and the upcoming Raptor Lake?

Sapphire's volume ramp has been anything but Rapid.
The Golden Cove cores have been well tested in an isolated environment like an Alder Lake chip. Pull up a die shot of Alder Lake and compare it with that Sapphire Rapids shot. Do you see any differences?
Something I noticed is that Sapphire Rapids looks more complicated, with a lot more components. There are likely bugs in some of those other components, and quite possibly security issues and bugs in the many communication paths between them.
As for testing, I don't think you can isolate the GC cores in SPR the same way they are already isolated in ADL. Even if Intel allowed reviewers to, doing so would likely be beyond the abilities of their testing equipment, and it certainly would not be worth the time and cost of such an effort.
 
Bear in mind Sapphire Rapids is more than just a single die SKU. Each of those steppings may be for a different die, as Intel have used that scheme in the past.
For example, if there are 3 dies within the Sapphire Rapids product lineup (the common LCC/MCC/HCC split), that could be 4 steppings per die. Or if the 4 dies of the composite chip package are not rotationally symmetric (the HBM layout is not, so rotational die symmetry would require use of a large substrate rather than compact EMIB bridges), it would only be 3 steppings for each of the 4 dies.
The steppings themselves - A0, A1, B0, C0, C1, C2, D0, E0, E2, E3, E4 and E5 - indicate this may be the case: following Intel's previous convention, each prefix letter is a major die layout change, and the suffix numeral is the revision of that layout. 4 dies, 4 major layouts, one having no revisions, one having two, one having three, and one having 5 revisions. If the dies are heterogeneous in capability (e.g. if you place the majority of support components on one die, so you can have a 1-, 2-, 3-, or 4-die SKU without duplicating those components and wasting die area), then it would be expected that the 'main' die has the most steppings and the most extensive testing and revision, and the other dies may not even have been taped out until the first die design was well into testing.
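To make the letter/numeral reading concrete, here is a small sketch that groups the steppings listed above by prefix letter. The function name `group_steppings` is just for illustration, and the letter = layout, numeral = revision interpretation is the convention assumed in this post, not something Intel has confirmed for Sapphire Rapids.

```python
from collections import defaultdict

# Stepping names exactly as listed above.
STEPPINGS = ["A0", "A1", "B0", "C0", "C1", "C2",
             "D0", "E0", "E2", "E3", "E4", "E5"]


def group_steppings(steppings):
    """Group stepping names by prefix letter (assumed: major layout) and
    collect the numeric suffixes (assumed: revision of that layout)."""
    groups = defaultdict(list)
    for name in steppings:
        groups[name[0]].append(int(name[1:]))
    return {letter: sorted(revs) for letter, revs in groups.items()}


if __name__ == "__main__":
    for letter, revs in group_steppings(STEPPINGS).items():
        print(f"{letter}: {len(revs)} stepping(s), revisions {revs}")
```

Tallying the list as posted gives five prefix letters (A through E) across 12 steppings, so the mapping onto four dies is not exact and should be read as a rough guess.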
 
All SoC validation is hard, but to repeatedly and wildly miss your own deadlines by at least 8 months is especially embarrassing for a large semiconductor firm. How badly did you not understand your own product, your own management, and your own capacities?

Alder Lake's validation was tricky, Intel stated, due to its large I/O changes & COVID-19 pandemic restrictions, but it was still on-time: Intel of all companies knew the significantly heightened & wider validation needed to ship Sapphire Rapids in volume. But Intel has re-adjusted and re-adjusted and re-adjusted timelines because Intel can't make the "integrated electronics" bits of Sapphire Rapids work like Intel claimed it could.

Notably, Gelsinger worked at Intel when Intel was at its peak arrogance and also when Intel rushed into destructive fire sales to "save the company" after AMD's groundbreaking K8 / Athlon64 CPUs. I wonder what he learned?

ARM & AMD & Apple (I jokingly call them the ARMADA) needed a decade to do it, but their relative flexibility to manufacture with TSMC, Samsung, and / or GloFo + their ability to understand the Arm market have paid off massively in perf/W, perf/$, release cadence, and key customer targets. With Arm offering 128 "medium" cores today, with AMD offering 64-core Zen3-powered Milan today, and with Apple offering M1 under $999 today, is Intel going to get ready or does it have more "decade-out slidedecks" to pitch to tech media?

Intel always has a comeback narrative and sometimes it works (Conroe, Alder Lake), but it seems that, more often than not, the comeback narrative was pure rubbish ("10nm", Itanium, XScale, Arc so far). I guess money talks & we'll see how many customers continue to "ride or die" with Intel.
 
Better really late rather than a bug-fest. Chipzillah can't take yet another hit to HEDT ___ no wonder AMD has been slow-walking TR Pro ...

"Bear in mind Sapphire Rapids is more than just a single die SKU ... "
___ EMIB and Foveros (rolling eyes)

On what is essentially the 10th anniversary of the "Sea Micro Freedom Fabric" deal ---- turns out, Rory Read was a genius, and really kicked Intel in the Tookus . . .
 
If the dies are heterogeneous in capability
Haven't we seen photos of the HBM variant with its lid off? I seem to recall that had just the compute tiles + HBM tiles. And I wonder if those compute tiles are the same as their non-HBM CPUs use.

You do raise a good point about different core counts, but since SPR is tile-based (with 4 tiles in the max config), I'd guess there are probably just one or two tile sizes. That would be a change for Intel, in that previously they could pipeline work on the different die sizes and now they'd have to try and throw all their resources at designing just one or two tiles.
 
How badly did you not understand your own product, your own management, and your own capacities?
My guess is that it has something to do with the combinatorial complexity of different feature interactions, with the added wrinkle that they must now be more sensitive to security exploits than in the past.

Let's take the example of DSA (Data Stream Accelerator). It needs to interact with the mesh interconnect, cache hierarchy, PCIe, CXL, DDR5, HBM, and then go over UPI to access the same sorts of resources attached to another physical CPU. That multiplies out the test cases you have to run. And when doing this in simulation, the actual execution resources to perform those tests could also be a limiting factor.
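To put a rough number on that multiplication, here is a toy sketch. It is purely illustrative and not Intel's actual validation plan: the endpoint list is taken from the paragraph above, and the assumption that every source/destination pairing on either socket needs its own test case is mine.

```python
from itertools import product

# Resources a DSA transfer might touch, per the paragraph above.
ENDPOINTS = ["cache", "PCIe", "CXL", "DDR5", "HBM"]
# Each resource can sit on the local socket or on another one reached over UPI.
SOCKETS = ["local", "remote (via UPI)"]


def transfer_cases(endpoints, sockets):
    """Enumerate every (source, destination) pairing, where a location is a
    (socket, endpoint) tuple. Toy model: one test case per pairing."""
    locations = list(product(sockets, endpoints))
    return list(product(locations, repeat=2))


if __name__ == "__main__":
    cases = transfer_cases(ENDPOINTS, SOCKETS)
    print(len(cases))  # (5 endpoints * 2 sockets) squared = 100
```

That is already 100 pairings before you add transfer sizes, alignment, error injection, or other agents hitting the same resources at the same time, each of which multiplies the count again.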

Similarly, AMX has a large number of potential interactions as well, since it abstracts tiles of data in memory. And AMX also interacts with the OS kernel, both in that one thread is restricted to using it at a time (per-core, I assume) and there have been bugs involving its role in context-switching. It's more like an adjunct accelerator than something like AVX, which means there's additional state management required of software.

it was still on-time:
Um, according to which roadmap? And also we only know about final delivery. It could be that validation internally ran behind, and that backed up SPR validation efforts.

Intel also stated that they had not validated Alder Lake's AVX-512 implementation, and therefore couldn't assure that it was bug-free.

Intel can't make the "integrated electronics" bits of Sapphire Rapids work like Intel claimed it could.
Cute with the bold, there. Is that how they got the name?

Anyway, I'm reminded of TSX in Haswell, which I think took them until about Skylake to finally get right. Then, it turned out to have some security bugs and they quietly dropped it around the time of Coffee Lake-R.

There are probably a half-dozen other hardware features Intel flubbed within the past decade, but they were non-essential and mostly server-oriented security things that I didn't bother to remember. Every now and then, you'll see support for one of these get dropped from GCC, because it was either broken or just didn't work adequately.