News Intel 5th Gen Xeon 'Emerald Rapids' pushes up to 64 cores, 320MB L3 cache — new CPUs claim up to 1.4X higher performance than Sapphire Rapids


bit_user

Titan
Ambassador
Somehow I don't think 1.4x over its old model is gonna beat EPYC, as IIRC wasn't EPYC nearly 2x as fast?
It's not a big enough increase to overcome 96-core EPYC on general-purpose cloud workloads. However, it's probably enough to score wins in more niches, particularly in cases where the accelerators come into play.

The big question I have is whether Intel is going to be as parsimonious with the accelerators as they were in the previous gen.

I recall how they restricted AVX-512 to single-FMA for lower-end Xeons, in Skylake SP + Cascade Lake, but then in Ice Lake, they were at such an overall deficit against EPYC that they just let everyone (Xeons, I mean) have 2 FMAs per core.

So, I could see Intel further trying to overcome its core-count & memory channel deficit by enabling more accelerators in more SKUs. Or will they still try to overplay this hand? They can always tighten the screws again, in Granite Rapids.
 
  • Like
Reactions: KyaraM

bit_user

Titan
Ambassador
The chipmaker's performance figures seem credible since Emerald Rapids wields faster Raptor Cove cores, whereas Sapphire Rapids uses Golden Cove cores.
The P-cores in Alder Lake vs. Raptor Lake only changed in the amount of L2 cache. The server cores usually have more L2 cache than the client versions, anyhow. So, I also have to question how meaningful that distinction is, in this context.

substantial performance gains in AI thanks to the company's built-in AI accelerators and Intel Advanced Matrix Extensions (Intel AMX). Intel's projecting uplifts between 1.3X and 2.4X, depending on the workload.
I'm going to speculate the upper end of that improvement is thanks mainly to the increased L3 capacity.

Downsizing helps improve latency.
Not sure about that. I think it's mainly a cost-saving exercise. The main way you reduce latency, at that scale, is by improving the interconnect topology or optimizing other aspects of how it works.

Intel hasn't confirmed the core count on each Emerald Rapids die, but the current speculation is that each die houses 33 cores with one disabled core.
I read that each die has 35 cores and at least 3 are disabled.

Emerald Rapids is drop-in compatible with Intel's current Eagle Stream platform with the LGA4677 socket, thus improving the total cost of ownership (TCO) and minimizing the server downtime for upgrades.
As with the consumer CPUs, it's standard practice for Intel to retain the same server CPU socket across 2 generations. I'm sure it's mainly for the benefit of their partners and ecosystem, rather than to enable drop-in CPU upgrades. Probably 99%+ of their Xeon customers don't do CPU swaps, but rather just wait until the end of the normal upgrade cycle and do a wholesale system replacement.
 
Last edited:

DavidC1

Distinguished
May 18, 2006
517
86
19,060
I'm going to speculate the upper end of that improvement is thanks mainly to the increased L3 capacity.

Not sure about that. I think it's mainly a cost-saving exercise. The main way you reduce latency, at that scale, is by improving the interconnect topology or optimizing other aspects of how it works.
No, it's precisely due to performance. They are going from 1510mm2 using 4 dies to 1493mm2 using 2 dies, so the latter requires quite a few more wafers, since larger dies yield worse. If it were about cost, they wouldn't have done so.

The number of EMIB chiplets goes down significantly, and fewer hops are needed, since traffic only has to traverse between two tiles rather than four.
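For anyone wondering why two bigger dies eat more wafers than four smaller ones at nearly the same total area, here's a back-of-the-envelope sketch using a simple exponential defect-density model. The defect density is a made-up illustrative value, not anything Intel has published; the per-die areas are just the quoted totals divided by die count.

```python
import math

def die_yield(area_mm2, d0_per_cm2=0.1):
    """Simple exponential (Poisson) yield model: Y = exp(-D0 * A).
    d0_per_cm2 is an assumed, purely illustrative defect density."""
    return math.exp(-d0_per_cm2 * area_mm2 / 100.0)  # convert mm^2 -> cm^2

spr_die = 1510 / 4   # Sapphire Rapids: 4 dies, ~378 mm^2 each
emr_die = 1493 / 2   # Emerald Rapids: 2 dies, ~747 mm^2 each

for name, area in (("SPR die", spr_die), ("EMR die", emr_die)):
    print(f"{name}: {area:.0f} mm^2 -> modeled yield ~{die_yield(area):.0%}")
# With the same assumed defect density, the larger EMR die yields noticeably
# worse, so more wafer starts are needed per good package -- which is the
# cost trade-off being described above.
```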
 
  • Like
Reactions: rtoaht

bit_user

Titan
Ambassador
They are going from 1510mm2 using 4 dies to 1493mm2 using 2 dies, so the latter requires quite a few more wafers, since larger dies yield worse. If it were about cost, they wouldn't have done so.
I assume yield is the reason they have 3 spare cores per die (assuming that info I found is correct).

The number of EMIB chiplets goes down significantly, and fewer hops are needed, since traffic only has to traverse between two tiles rather than four.
Okay, then what's the latency impact of an EMIB hop?
 

DavidC1

Distinguished
May 18, 2006
517
86
19,060
I assume yield is the reason they have 3 spare cores per die (assuming that info I found is correct).


Okay, then what's the latency impact of an EMIB hop?
Go read Semianalysis's article on EMR and come back.

They aren't saving anything here, and there would be no reason to, as they are barely making a profit. Aiming for lower material cost over performance is a losing battle, since increased performance can improve positioning and thus profits.

Hence why things like the 3x increase in L3 cache capacity happened.

Considering the small core-count increase and basically the same core, up to 40% is very respectable and comes from lower-level changes.
 
  • Like
Reactions: KyaraM
D

Deleted member 2731765

Guest
I read that each die has 35 cores and at least 3 are disabled.

Nope, that data which some other tech sites have outlined seems inaccurate. I presume there are two 33-core dies in the Emerald Rapids lineup. So it means they have disabled only 1 core per die, to give a 64-core top-end SKU.

 
  • Like
Reactions: KyaraM and bit_user

bit_user

Titan
Ambassador
Go read Semianalysis's article on EMR and come back.
Will do, though you are allowed to post (relevant) links...

They aren't saving anything here, and there would be no reason to, as they are barely making a profit. Aiming for lower material cost over performance is a losing battle, since increased performance can improve positioning and thus profits.
The way I see 2 dies vs. 4 affecting performance is actually by reducing power consumption. Making the interconnect more energy-efficient, by eliminating cross-die links, gives you more power budget for doing useful computation.

Let's not forget the original point of contention, which was that smaller dies = lower latency. If you have evidence to support that, I'd like to see it.

Hence why things like the 3x increase in L3 cache capacity happened.

Considering the small core-count increase and basically the same core, up to 40% is very respectable and comes from lower-level changes.
The benefits of more L3 cache have been extensively demonstrated by AMD. Couple that with better energy-efficiency from Intel 7+ manufacturing node and fewer cross-die links + a couple more cores, and the performance figures sound plausible to me.
 
Last edited:
  • Like
Reactions: KyaraM

George³

Prominent
Oct 1, 2022
228
124
760
I'm having trouble calculating the aggregate CPU-to-RAM communication speed for Emerald Rapids. What is the total width of the bus, 256 or 512 bits?
 

DavidC1

Distinguished
May 18, 2006
517
86
19,060
I'm having trouble calculating the aggregate CPU-to-RAM communication speed for Emerald Rapids. What is the total width of the bus, 256 or 512 bits?
It's 8 channels, so 512 bits.

DDR5 has two internal 32-bit channels, but in practice it doesn't matter, since DIMMs are always 64 bits wide. Emerald Rapids is also a drop-in socket replacement for Sapphire Rapids, which is also 8-channel.

DDR5 is confusing a lot of people. Channels mean 64-bit.
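To put a number on the aggregate figure, here's a quick back-of-the-envelope calculation. DDR5-5600 is assumed as the transfer rate below; substitute whatever speed the memory actually runs at.

```python
channels = 8                # per Emerald Rapids socket
bits_per_channel = 64       # each channel = 2 x 32-bit sub-channels
transfer_rate = 5600        # MT/s, assuming DDR5-5600; use your actual speed

total_bus_bits = channels * bits_per_channel
peak_gb_s = (total_bus_bits / 8) * transfer_rate / 1000

print(f"Total bus width: {total_bus_bits} bits")
print(f"Peak theoretical bandwidth: {peak_gb_s:.1f} GB/s per socket")
# -> 512 bits wide, ~358.4 GB/s theoretical peak (real-world efficiency is lower)
```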

@bit_user Actually I never said it reduced latency due to being a smaller die. It's the reduced travel and fewer hops from having only two dies that improve latency.

If Intel wanted to focus on reducing cost, they wouldn't have increased L3 cache so drastically.
 
Last edited:
  • Like
Reactions: KyaraM and George³

George³

Prominent
Oct 1, 2022
228
124
760
It's 8 channels, so 512 bits.

DDR5 has two internal 32-bit channels, but in practice it doesn't matter, since DIMMs are always 64 bits wide. Emerald Rapids is also a drop-in socket replacement for Sapphire Rapids, which is also 8-channel.

DDR5 is confusing a lot of people. Channels mean 64-bit.

@bit_user Actually I never said it reduced latency due to being a smaller die. It's the reduced travel and fewer hops from having only two dies that improve latency.

If Intel wanted to focus on reducing cost, they wouldn't have increased L3 cache so drastically.
You're right. I had read publications from a year ago; the newer information and slides are presented much more understandably. It will have 16 DIMMs per socket.
 

bit_user

Titan
Ambassador
DDR5 has two internal 32-bit channels, but in practice it doesn't matter, since DIMMs are always 64 bits wide. Emerald Rapids is also a drop-in socket replacement for Sapphire Rapids, which is also 8-channel.

DDR5 is confusing a lot of people. Channels mean 64-bit.
Uh, no. As you said, the channels are actually 32-bit. When we talk about DIMMs, we talk in increments of 64-bit. The proper way to refer to a DDR5 DIMM is a "channel pair", but that's a mouthful.

@bit_user Actually I never said it reduced latency due to being a smaller die.
The article did and that's what I took issue with. You replied to my comment about that, at least partially contradicting me.

It's the reduced travel and fewer hops from having only two dies that improve latency.
Okay, so you are saying it reduces latency. Let's have some numbers, then.

If Intel wanted to focus on reducing cost, they wouldn't have increased L3 cache so drastically.
As you said, L3 needed to be increased to be more competitive on performance.

That's a different question than the number of dies or their size. I think they went with 2 dies vs. 10 for performance reasons, and because the yields on Intel 7 are good enough that it was practical. However, as a compromise, they went with smaller dies and more spare cores.
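As a rough illustration of why spare cores buy so much effective yield: a quick binomial sketch. The per-core "good" probability is purely an assumption, and the 33-core and 35-core die configurations are the ones floated earlier in the thread.

```python
from math import comb

def p_usable(total_cores, needed, p_good=0.98):
    """Probability that at least `needed` of `total_cores` cores are
    defect-free, given an assumed, illustrative per-core good probability."""
    return sum(comb(total_cores, k) * p_good**k * (1 - p_good)**(total_cores - k)
               for k in range(needed, total_cores + 1))

# Both layouts need 32 working cores per die for a 64-core, 2-die package.
print(f"33-core die (1 spare):  {p_usable(33, 32):.1%} of dies usable")
print(f"35-core die (3 spares): {p_usable(35, 32):.1%} of dies usable")
# A few extra spares sharply reduce how many large dies get scrapped over a
# single bad core -- the yield argument for building in redundancy.
```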
 
  • Like
Reactions: KyaraM
The P-cores in Alder Lake vs. Raptor Lake only changed in the amount of L2 cache. The server cores usually have more L2 cache than the client versions, anyhow. So, I also have to question how meaningful that distinction is, in this context.
I'm mostly curious if it's just process improvements or something else within Raptor Cove allowing the higher clocks to be reached. If you compare 60c/52c SPR to the EMR leaks, there's a 100MHz base clock improvement (and who knows what boost will look like). The L2 cache configuration was a possible explanation for part of the differences between ADL/RPL, but it should be the same configuration in this case.
 
  • Like
Reactions: KyaraM and bit_user

DavidC1

Distinguished
May 18, 2006
517
86
19,060
I'm mostly curious if it's just process improvements or something else within Raptor Cove allowing the higher clocks to be reached. If you compare 60c/52c SPR to the EMR leaks, there's a 100MHz base clock improvement (and who knows what boost will look like). The L2 cache configuration was a possible explanation for part of the differences between ADL/RPL, but it should be the same configuration in this case.
Improved Intel 7 process with greater than 50mV reduction at the same frequency, 200MHz increase at the same voltage, and 600MHz increase in clocks at peak

Sometimes called Intel 7 Ultra internally.

Rest are client chip changes so not worth mentioning since it's not applicable to server.
Uh, no. As you said, the channels are actually 32-bit. When we talk about DIMMs, we talk in increments of 64-bit. The proper way to refer to a DDR5 DIMM is a "channel pair", but that's a mouthful.
It doesn't matter, man. Intel's slides all show it as 8 channels. Dual channel is basically accepted pseudo-marketing terminology that refers to each channel as being a DIMM. They don't need to say "Oh, it's 16 micro-channels" since you still install 8 DIMMs anyway.

What's the benefit of 16 channels when the bit-width is the same as 8 channels? It's a pointless exercise that no one cares about. You can't get a DDR5 variant with 1x64 anyway, and it's the same overall 64-bit width as DDR4, 3, 2, 1, and SDRAM.

Doesn't matter in performance, doesn't matter in routing, and doesn't matter in installation.
 
  • Like
Reactions: bit_user

bit_user

Titan
Ambassador
Improved Intel 7 process with greater than 50mV reduction at the same frequency, 200MHz increase at the same voltage, and 600MHz increase in clocks at peak

Sometimes called Intel 7 Ultra internally.

Rest are client chip changes so not worth mentioning since it's not applicable to server.
Excellent info. Thank you!

What's the benefit of 16 channels when the bit-width is the same as 8 channels?
More parallelism. There are 2 truly independent channels in a DDR5 DIMM.

You can't get a DDR5 variant with 1x64 anyway,
I think I once saw some credible claim about it being configurable... too foggy on the details. Don't have a source to cite.

Doesn't matter in performance, doesn't matter in routing, and doesn't matter in installation.
It was obviously done for reasons. If the two channels of a DDR5 DIMM are clocked independently, then it could actually benefit routing.

However, I think the main reason was to allow more pipelining without increasing burst length.
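A quick way to see the burst-length point, assuming the standard DDR4 BL8 and DDR5 BL16 burst lengths:

```python
# Bytes delivered per burst = channel width (bytes) x burst length
ddr4_burst_bytes = (64 // 8) * 8    # one 64-bit channel, BL8  -> 64 bytes
ddr5_burst_bytes = (32 // 8) * 16   # one 32-bit sub-channel, BL16 -> 64 bytes

print(ddr4_burst_bytes, ddr5_burst_bytes)  # 64 64 -- both match a CPU cache line
# Halving the channel width while doubling the burst length keeps the 64-byte
# payload, and the two sub-channels can service independent requests in
# parallel -- the extra pipelining being discussed above.
```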
 
Excellent info. Thank you!


More parallelism. There are 2 truly independent channels in a DDR5 DIMM.


I think I once saw some credible claim about it being configurable... too foggy on the details. Don't have a source to cite.


It was obviously done for reasons. If the two channels of a DDR5 DIMM are clocked independently, then it could actually benefit routing.

However, I think the main reason was to allow more pipelining without increasing burst length.
It is my understanding that DDR5 was split into two "micro-channels", unlike previous generations, because of signal integrity issues with trying to push a 64 bit channel to much higher transfer rates than DDR4 could do. Routing in the PCB was related to this as the difficulty in making all 64 signal lines close enough in length and impedance is the source of the signal integrity issue. Newer generations of DDR might even go with smaller "micro-channels", though that's just speculation on my part. I don't have a link to support my claims as it has been years since I read about the subject. I will say that this is largely why trying to fit multiple DIMMs per channel has hugely negative consequences for usable frequency with many DDR5 systems and even sometimes with DDR4 where it was almost never a problem with DDR3, DDR2, and DDR.
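To make the signal-integrity pressure concrete, here's a rough skew-budget sketch. The ~170 ps/inch propagation delay and the 10% of a unit interval allotted to trace mismatch are assumed ballpark values, not figures from any DDR spec.

```python
PS_PER_INCH = 170          # assumed rough propagation delay on FR-4
MISMATCH_FRACTION = 0.10   # assume ~10% of a unit interval can go to skew

for name, mtps in (("DDR3-1600", 1600), ("DDR4-3200", 3200), ("DDR5-6400", 6400)):
    unit_interval_ps = 1e6 / mtps                 # one bit time, in picoseconds
    budget_ps = MISMATCH_FRACTION * unit_interval_ps
    budget_mils = budget_ps / PS_PER_INCH * 1000  # mils of allowable trace mismatch
    print(f"{name}: UI {unit_interval_ps:6.1f} ps -> "
          f"~{budget_ps:4.1f} ps / ~{budget_mils:3.0f} mil length-match budget")
# The allowable mismatch shrinks in proportion to the data rate, which is the
# routing pressure described above -- and matching 32 lines per sub-channel is
# an easier job than matching 64 per channel.
```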

This is not a new nor DDR specific problem and is the reason why PCIe connections consist of multiple independent serial connections for each slot instead of a variable width parallel bus like PCI was. It is also largely why serial connections like SATA and USB replaced parallel connections like IDE and parallel ports.
 

bit_user

Titan
Ambassador
I will say that this is largely why trying to fit multiple DIMMs per channel has hugely negative consequences for usable frequency with many DDR5 systems and even sometimes with DDR4 where it was almost never a problem with DDR3, DDR2, and DDR.
Not the number of channels, but I assume you mean trying to drive them at high frequencies?

This is not a new nor DDR specific problem and is the reason why PCIe connections consist of multiple independent serial connections for each slot instead of a variable width parallel bus like PCI was.
I'm not sure about that, since PCIe was created back in the early 2000's and I doubt they really expected to keep it going this long. I think the "lanes" thing was more about flexible bandwidth allocation and creating a simpler switching model. The other thing about PCIe is that it's not a bus, but rather point-to-point. This simplifies things, on an electrical level.

It is also largely why serial connections like SATA and USB replaced parallel connections like IDE and parallel ports.
You're confusing Mbps and Gbps. That's a pretty big difference! Parallel ports only ran at up to 2 Mbps. USB replaced parallel ports simply because it had more bandwidth, smaller connectors, and a lot more features. Parallel ports could do long cables just fine - a lot better than USB, in fact! According to this, parallel cables could easily reach 25 feet.


USB 1.1 ran at either 12 Mbps or 1.5 Mbps and cables were limited to 5m or 3m, depending on the speed. So, in terms of cabling, USB was not an improvement over parallel ports! However, USB had features like hubs and hand-shaking, that made it much more consumer-friendly.

I have to really question your claims about SATA, as well. Before SATA, we had UltraSCSI-320, which (like IDE) had 16 data pins and supported maximum cable lengths of up to 12m! That's a heck of a lot longer than SATA can reach, and suggests that clock-skew wasn't in fact a limiting factor of such parallel interfaces, within a PC. I think the main benefits of SATA being serial were reductions in the cost and bulk of the connectors and cables.
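For reference, here are the peak rates of the interfaces being compared in this sub-thread, as commonly quoted (real-world throughput is lower; the parallel-port figure is the one cited above):

```python
# Approximate peak rates of the interfaces mentioned above, in MB/s.
interfaces = {
    "Parallel port (~2 Mbps)":       2 / 8,
    "USB 1.1 full speed (12 Mbps)":  12 / 8,
    "USB 2.0 high speed (480 Mbps)": 480 / 8,
    "PATA UDMA/133":                 133,
    "Ultra320 SCSI":                 320,
    "SATA 6Gb/s (after 8b/10b)":     600,
}
for name, mb_s in interfaces.items():
    print(f"{name:<32} ~{mb_s:8.2f} MB/s")
```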

Be careful about taking one fact and trying to over-apply it. In general, you should try to fact-check anything you're claiming but not sure about.
 
Last edited:
Not the number of channels, but I assume you mean trying to drive them at high frequencies?
I was referring to how many dual-channel motherboards can support four DIMMs instead of just two by putting two DIMMs in each channel. This stresses the memory controller more than one DIMM per channel as two modules need to pull twice as much current (assuming identical DIMMs) and the lines going to both DIMM slots in one channel are even more difficult to make matched lengths and impedances than the individual lines going to a single DIMM slot.

AMD's DDR5 memory controller often can't operate at the normal frequencies of DDR5, let alone higher frequencies with overclocking, when you have two DIMMs per channel with unregistered, high-capacity (say, 32GB or 48GB) DIMMs such as in a consumer dual-channel AM5 board. Intel's memory controllers are better, but still struggle somewhat with two DIMMs per channel with high-capacity DIMMs. If DDR5 DIMMs were using 64 bit channels instead of two 32 bit channels, the routing side of this issue would be even worse as they would require more precise routing to maintain signal integrity, perhaps more precise routing than was even possible to make at the time, or at least than was possible without a very expensive board.

Prior generations of DDR, due simply to lower clock frequencies, were able to achieve much closer to full performance with two DIMMs per channel relative to one DIMM per channel than DDR5 can. With DDR3, the difference was negligible for people who weren't trying to get high memory overclocks.

I'm not sure about that, since PCIe was created back in the early 2000's and I doubt they really expected to keep it going this long. I think the "lanes" thing was more about flexible bandwidth allocation and creating a simpler switching model. The other thing about PCIe is that it's not a bus, but rather point-to-point. This simplifies things, on an electrical level.
PCIe went with the independent serial lanes instead of a variable width channel specifically for signal integrity. They could have made PCIe a parallel point-to-point connection capable of being 1 bit wide, 4 bits wide, 8 bits wide, or 16 bits wide if they wanted too, but they went with serial links that don't need to be perfectly synchronized instead for better signal integrity. It does make the routing technically easier because the lanes don't need to be perfectly matched, but that's the main contributing factor for signal integrity so it goes hand-in-hand with it. The reason why a parallel interface needs more precise routing than a serial interface is specifically for it to have usable signal integrity.

I don't know if they knew PCIe would stick around in one form or another for the next couple decades with plans for more, but seeing as they normally have plans going on for PCIe versions two or three versions ahead of what is currently in use, they probably did have some idea that they were basically going to push it until they couldn't find a way to make the next version any faster than the previous. They may not have known how long they could push it, but trying to find out was the plan all along.

PCIe being point-to-point wasn't entirely new either. Some boards had multiple PCI buses, sometimes individual PCI ports even had their own bus, which is essentially point-to-point then. This was done to allow a device to have an entire PCI bus all to itself while still being able to use other PCI cards in other ports.

You're confusing Mbps and Gbps. That's a pretty big difference! Parallel ports only ran at up to 2 Mbps. USB replaced parallel ports simply because it had more bandwidth, smaller connectors, and a lot more features. Parallel ports could do long cables just fine - a lot better than USB, in fact! According to this, parallel cables could easily reach 25 feet.

USB 1.1 ran at either 12 Mbps or 1.5 Mbps and cables were limited to 5m or 3m, depending on the speed. So, in terms of cabling, USB was not an improvement over parallel ports! However, USB had features like hubs and hand-shaking, that made it much more consumer-friendly.

I have to really question your claims about SATA, as well. Before SATA, we had UltraSCSI-320, which (like IDE) had 16 data pins and supported maximum cable lengths of up to 12m! That's a heck of a lot longer than SATA can reach, and suggests that clock-skew wasn't in fact a limiting factor of such parallel interfaces, within a PC. I think the main benefits of SATA being serial were reductions in the cost and bulk of the connectors and cables.

Be careful about taking one fact and trying to over-apply it. In general, you should try to fact-check anything you're claiming but not sure about.
In order to achieve those higher speeds, they had to use a narrower interface, ideally one as narrow as possible, which is serial for USB. Parallel ports being as slow as they were was why they could use such long cables. Later on, USB cables were made which could be even longer (I own 25 foot USB cables). Smaller connectors were welcomed side-effects of the change making them possible. DVI and Displayport are both newer than VGA and feature larger connectors than VGA, granted smaller ones were eventually made for Displayport, so the size of the connector was really just a function of what they thought they could do with it at the time. The added features of USB were simply to make it universal so it could replace both parallel ports and serial ports while adding in new technologies too. The new features were certainly part of why it was commercially successful, but not why it was technologically necessary to push higher speeds. They were then able to push it to 480Mb/s, 240 times faster than a parallel port, with minimal changes as the serial interface was physically capable of it where the parallel port and cable was not.

IDE, in order to reach its highest speed, required doubling the necessary amount of wires from what it could originally work with to provide grounds between each data line specifically for signal integrity issues with pushing the parallel interface to higher speeds. They thought it would be completely impractical to try pushing IDE farther and had to come up with something narrower to continue pushing speeds higher. They had a test run with the first generation of SATA being only a little faster but were able to double it twice after that with minimal changes.

They could have gone farther, but decided not to (except for the SAS folks who went one time farther than SATA) due to latency limitations of the SATA interface that were being pushed by SSDs. Like with USB, the smaller plug and cable were beneficial to commercial success but were only possible because of the serial interface in use and it was in use because it was the easiest way to cheaply push higher speeds with the technology we had developed by then. Again, the existence of DVI replacing VGA shows that a smaller connector was not needed to achieve commercial success for a new interface to replace an older one.

UltraSCSI was a complicated and expensive interface. It had complicated and expensive hardware. IDE and subsequently SATA needed to perform well with much cheaper hardware.

Parallel interfaces as a whole were a product of the technology at the time. Transistors could only switch efficiently so quickly back then, so the only way to move data around faster was to make busses wider. As transistors improved, it became possible to make them switch much faster. Then we ran into problems where those old parallel interfaces couldn't keep up with higher clock frequencies without much more expensive hardware. As our ability to push higher frequencies improved, we needed to make interfaces serial again just to function properly at higher frequencies. Memory modules are literally the only removable device left that still uses a recently developed parallel interface (DDR5).

They're the most complicated thing to route on a mainboard, have a very short run without using cables, and we still ran into significant problems with it to the point that AMD gave up on trying to push two DIMMs per channel for Threadripper and that's with registered memory to address the current issue I mentioned earlier. Memory modules still use so much bandwidth that a single serial connection is impractical, though it was attempted with DDR2-era FB-DIMMs. Since they are right on the board, the strict routing needs of parallel connections aren't entirely impractical, but with DDR5 we are clearly running into trouble with them again. Since it is a physics issue, improving technology can only do so much for it and we may see even narrower "micro-channels" in future versions of DDR. As further evidence, look at server boards. DDR3 server boards, for example, often offered three registered DIMMs per channel with quad-channel boards often having twelve memory slots. I haven't seen any DDR5 boards do that.
 

bit_user

Titan
Ambassador
The issue with your previous post isn't that you didn't type enough words. It's that you engaged in freewheeling speculation without bothering to check or cite any sources. If your goal is to impress people, you'll do better with one well-sourced claim than such pages of half-truths and backward-reasoning.

I was referring to how many dual-channel motherboards can support four DIMMs instead of just two by putting two DIMMs in each channel. This stresses the memory controller more than one DIMM per channel as two modules need to pull twice as much current (assuming identical DIMMs) and the lines going to both DIMM slots in one channel are even more difficult to make matched lengths and impedances than the individual lines going to a single DIMM slot.
Not sure sub-channels factor into the whole 2DPC issue. Seems a bit of a reach, to me.

Impedance-matching seems hard, when you have multiple devices per channel. DDR5 also features more training than DDR4 had, and it's not clear to me how much complexity it adds to actually use the trained values when you have what is essentially a bus with multiple devices.

AMD's DDR5 memory controller often can't operate at the normal frequencies of DDR5, let alone higher frequencies with overclocking, when you have two DIMMs per channel
In case you haven't heard, AMD released new firmware that dramatically improved Ryzen 7000's memory speed and compatibility, even on 4-DIMM configurations.


Prior generations of DDR, due simply to lower clock frequencies, were able to achieve much closer to full performance with two DIMMs per channel relative to one DIMM per channel than DDR5 can. With DDR3, the difference was negligible for people who weren't trying to get high memory overclocks.
My Sandy Bridge-era Xeon was only specified to run at DDR3-1333, if using 2 DIMMs per channel (unbuffered). At 1 DIMM per channel, it could do DDR3-1600. That would be as big of a drop as going from DDR5-5600 to DDR5-4550 (if such a speed existed). I didn't consider it negligible, at the time.

There was actually a PDF put out by Intel that detailed the speeds different memory configurations would run at. Another mitigating factor was using 1.35v instead of 1.5v memory, which I found out the hard way. That's what led to me discovering the document.

PCIe went with the independent serial lanes instead of a variable width channel specifically for signal integrity. They could have made PCIe a parallel point-to-point connection capable of being 1 bit wide, 4 bits wide, 8 bits wide, or 16 bits wide if they wanted too,
You're completely ignoring bifurcation as a rationale for going with the multiple lanes approach. That gets way easier, if you have self-contained lanes.

I don't know which were the dominant factors in their design decisions, but I think neither do you. You might be right, but it only matters if we know you're right, and that's something you only achieve by posting good (i.e. authoritative) sources.

PCIe being point-to-point wasn't entirely new either. Some boards had multiple PCI buses, sometimes individual PCI ports even had their own bus, which is essentially point-to-point then.
Implementations being point-to-point isn't the same as having a standard that's fundamentally point-to-point. After waxing about multiple DIMMs per channel, I'm surprised you didn't immediately appreciate this.

In order to achieve those higher speeds, they had to use a narrower interface, ideally one as narrow as possible, which is serial for USB. Parallel ports being as slow as they were was why they could use such long cables.
You're getting history all muddled up. USB started out life at a max of 12 Mbps. The faster speeds only came along later. Serial wasn't needed to push past parallel ports' meager speed of 2 Mbps - USB opted to go with a serial bus for the other reasons mentioned.

They thought it would be completely impractical to try pushing IDE farther
Again, SCSI shows that's not true. UltraSCSI-320 is about 2.4x as fast as the fastest PATA spec.

A key point you're glossing over with PATA is that it was a bus standard, not unlike the original PCI. You could have two devices per cable, which eventually should introduce issues like we saw with multiple DIMMs per channel.

However, it's not at all clear that being a parallel interconnect was a real limitation, at those speeds. It used a ribbon cable, with all the conductors being of the same length. However, it had the burden of legacy support, which might've been the biggest constraint holding it back.

They could have gone farther, but decided not to (except for the SAS folks who went one time farther than SATA) due to latency limitations of the SATA interface
SATA stopped at 6 Gbps because NVMe was just better in so many ways, rapidly maturing, and had already captured the high-end.

Like with USB, the smaller plug and cable were beneficial to commercial success but were only possible because of the serial interface in use
You have it backwards. The cost-savings and bulk reduction of SATA vs. PATA drove the move from parallel to serial - not the other way around.

Again, the existence of DVI replacing VGA shows that a smaller connector was not needed to achieve commercial success for a new interface to replace an older one.
You're forgetting who bears the cost of PATA being such a clunky standard. It's disproportionately the OEMs. In contrast, it's overwhelmingly end-users dealing with video cables. OEMs drive the agenda, when it comes to PC standards.

UltraSCSI was a complicated and expensive interface. It had complicated and expensive hardware.
This was true back in the 1980's and early 1990's. Have you ever heard of ATAPI? It was used by many CD-ROM drives and was merely SCSI commands tunneled over IDE. Every controller supporting UDMA mode supported it.

Parallel interfaces as a whole were a product of the technology at the time. They could only make transistors switch efficiently so quickly at the time,
This is rubbish. If you go back to the origins of the parallel port cable, the reason they adopted it was to avoid having to add circuits to serialize and deserialize data over a serial link. In other words, the proliferation of parallel interfaces was much more about cost-savings than anything else.

Those were the pre-VLSI days. The main thing that changed is the cost-reductions and commodification of ICs needed to implement serial links.

DDR3 server boards, for example, often offered three registered DIMMs per channel with quad-channel boards often having twelve memory slots. I haven't seen any DDR5 boards do that.
DDR3-era CPUs had only up to 4 memory channels. DDR5-era server CPUs have either 8 or 12. With AMD, you get the same 12 DIMMs as those old 3DPC motherboards, but with 3x the bandwidth that 4-channel/3DPC would provide. This lets you scale capacity along with bandwidth, making multiple DIMMs per channel less important than it used to be.

Furthermore, have you ever seen a 2-CPU server board with 24 DIMM slots?? It's hard to see where you'd fit another 24! Here's one with the maximal 48-DIMM configuration - it's huge and it has basically no room for anything else!


Furthermore, in the modern era, die-stacking is reportedly set to usher in DDR5 DIMM capacities as large as 1 TB! At that point, you're spending so much $ on DRAM per system that it's probably not a big deal to buy some more CPUs, to scale up further.


At today's prices 24 TB of 32 GB DDR5-4800 RDIMMs would cost a whopping $115,200. Obviously, GB/$ will improve by the time we have 1 TB DIMMs, but recent trends in semiconductor pricing suggest the improvement won't be much. Maybe a factor of 2 or 3? But, probably not the order of magnitude we're used to seeing.
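The arithmetic behind that figure, assuming roughly $150 per 32 GB DDR5-4800 RDIMM (the per-stick price implied by the quoted total):

```python
capacity_tb = 24
dimm_gb = 32
price_per_dimm = 150    # assumed street price implied by the total above

dimm_count = capacity_tb * 1024 // dimm_gb
total = dimm_count * price_per_dimm
print(f"{dimm_count} DIMMs x ${price_per_dimm} = ${total:,}")
# -> 768 DIMMs x $150 = $115,200
```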
 
Last edited:
  • Like
Reactions: Order 66
Not sure sub-channels factor into the whole 2DPC issue. Seems a bit of a reach, to me.

Impedance-matching seems hard, when you have multiple devices per channel. DDR5 also features more training than DDR4 had, and it's not clear to me how much complexity it adds to actually use the trained values when you have what is essentially a bus with multiple devices.
Two 32-bit sub-channels instead of one 64 bit channel means you have half as many lines to impedance match together as tightly. Lines in one channel don't need to be as tightly matched to the lines in the other channel as they do to the other lines in their own channel. That's an advantage for signal routing.
In case you haven't heard, AMD released new firmware that dramatically improved Ryzen 7000's memory speed and compatibility, even on 4-DIMM configurations.
I haven't found anything showing the firmware in use with two DIMMs per channel in that link nor searching online for it. I'd imagine that the firmware helps even in that situation, but how much?
My Sandy Bridge-era Xeon was only specified to run at DDR3-1333, if using 2 DIMMs per channel (unbuffered). At 1 DIMM per channel, it could do DDR3-1600. That would be as big of a drop as going from DDR5-5600 to DDR5-4550 (if such a speed existed). I didn't consider it negligible, at the time.

There was actually a PDF put out by Intel that detailed the speeds different memory configurations would run at. Another mitigating factor was using 1.35v instead of 1.5v memory, which I found out the hard way. That's what led to me discovering the document.
That's roughly a 17% drop in frequency. Looking at the memory compatibility lists of motherboards, I've seen that the DDR5 boards often drop significantly more by percentage. DDR3 boards could typically run at least 80% of full speed with two DIMMs per channel, yet many Intel DDR5 boards are closer to 30% losses according to their supported memory pages and AMD fared even worse. I read through a dozen or so lists.
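A quick comparison of the derating percentages being discussed. The DDR3 case is the one cited above; the DDR5 pairs are only illustrative examples of the sort of 2DPC downclocking seen on memory-QVL pages, not specs for any particular board.

```python
def loss(rated_mtps, two_dpc_mtps):
    # Fractional speed loss when dropping from the rated 1DPC speed to a 2DPC speed
    return (rated_mtps - two_dpc_mtps) / rated_mtps

cases = [
    ("DDR3-1600 -> 1333 at 2DPC", 1600, 1333),
    ("DDR5-5600 -> 4000 at 2DPC (example)", 5600, 4000),
    ("DDR5-6000 -> 3600 at 2DPC (example)", 6000, 3600),
]
for name, rated, actual in cases:
    print(f"{name}: {loss(rated, actual):.0%} loss")
# ~17% for the DDR3 case vs. roughly 29-40% in the DDR5 examples.
```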

1.5V was the normal voltage for DDR3, 1.35V was for lower power memory (there was even lower voltage DDR3 available, I saw 1.25V DDR3 in some computers). Yes, that definitely made a big impact on performance as lowering the voltage to the memory will result in lower signal integrity. Running DDR3 memory at 1.35V instead of 1.5V will lower the maximum performance you can get in stable operation. For the record, I ran an AMD Phenom II 1090T (was overclocked from 3.2GHz to 4.0GHz) in an ECS A890GXM-A board with four 4GB DDR3-1333 DIMMs (overclocked to DDR3-1600) for many years. I still have it, though the board stopped working after a very close lightning strike damaged a bunch of stuff in my neighborhood several years ago. I'm still using the memory from that build in another project right now with two ASRock N68C-GS boards.
You're completely ignoring bifurcation as a rationale for going with the multiple lanes approach. That gets way easier, if you have self-contained lanes.

I don't know which were the dominant factors in their design decisions, but I think neither do you. You might be right, but it only matters if we know you're right, and that's something you only achieve by posting good (i.e. authoritative) sources.
Bifurcation makes some (mostly niche) things slightly more convenient and broadens the range of configurations you can use without having to pay for PCIe switch chips, but it's hardly a significant advantage for the vast majority of people. Doing away with x16 slots completely and giving only x8, x4, and x1 connections which couldn't be split or ganged wouldn't have made much difference for most people. Video cards almost always deliver nearly all of their performance even in an x4 slot of their PCIe version or an x8 of the previous version.

SSDs can be made that use up essentially however much PCIe bandwidth you can give them, but few people have so many of them that they need to make use of some of the lanes from the x16 slot of a consumer board. Servers don't really care as they just used x8 PCIe SSDs (or PCIe x8 cards in general) or PCIe switch chips before bifurcation was a common feature and they'd still use them if x4/x4/x4/x4 bifurcation hadn't become a thing, albeit bifurcation makes it cheaper. Frankly, whether you use 24 x4 SSDs or 12 x8 SSDs (as one random example) doesn't really matter and if PCIe connections were primarily a bunch of x8 connections, you could use x8 SSDs without needing PCIe switch chips nor bifurcation.

Bifurcation does have a slight advantage for mixing cards and slots with non-matching widths when its supported by the motherboard, but it is hardly ever important. Even then, while 32 bit PCI slots were the only ones common to consumer motherboards, PCI supported mixing 32 bit and 64 bit cards in the opposite size slot and there wasn't any awful gate-keeping with it like with PCIe bifurcation needing BIOS support that motherboard manufacturers often refuse to implement.
Implementations being point-to-point isn't the same as having a standard that's fundamentally point-to-point. After waxing about multiple DIMMs per channel, I'm surprised you didn't immediately appreciate this.
Whether or not the interface supports multiple devices per bus doesn't really matter when you're only using one device per bus. AGP was based on and nearly identical to PCI. Simply choosing to dedicate a single bus for the one device, despite the underlying technology having been built for supporting multiple devices per bus, allowed them to clock it higher than PCI could go. AGP was still a parallel interface, and while maintaining signal integrity between the signal lines in one slot was easier than doing it with two or more slots on one bus, it would still be an issue for pushing the clock even higher, which PCIe addressed by making it no longer necessary to keep the signal lines synchronized.

Unless you're arguing that DDR memory in one DIMM per channel operation could perform better than it does now if the interface didn't support multiple DIMMs per channel, I don't see your surprise.
You're getting history all muddled up. USB started out life at a max of 12 Mbps. The faster speeds only came along later. Serial wasn't needed to push past parallel ports' meager speed of 2 Mbps - USB opted to go with a serial bus for the other reasons mentioned.


Again, SCSI shows that's not true. UltraSCSI-320 is about 2.4x as fast as the fastest PATA spec.

A key point you're glossing over with PATA is that it was a bus standard, not unlike the original PCI. You could have two devices per cable, which eventually should introduce issues like we saw with multiple DIMMs per channel.

However, it's not at all clear that being a parallel interconnect was a real limitation, at those speeds. It used a ribbon cable, with all the conductors being of the same length. However, it had the burden of legacy support, which might've been the biggest constraint holding it back.
Serial wasn't needed to push beyond 2Mb/s, sure, I'll agree to that. However, cheap parallel connections simply couldn't go as far as USB went on a long cable, and they planned from the beginning of USB to go far beyond 12Mb/s. SCSI and its derivatives are complicated and expensive. They use vastly more, and more powerful, transistors to do the things they do, and they consume more power in doing so. Even SAS, which stuck with the point-to-point functionality of SATA, uses much more complicated and expensive hardware.

PATA had such bad signal integrity issues that it needed 80 wire cables (mostly extra grounds between the signal lines) instead of the original 40 wire cables to get acceptable signal integrity with its faster versions. It was clearly hitting a wall where significant changes were necessary to continue improving speeds. They could have used more expensive hardware like ultra SCSI to keep pushing higher, but consumers don't tend to like spending hundreds of dollars extra for that sort of thing. Yes, PATA having two devices per cable was related and part of the problem, but it was less significant as one hard drive at the time wasn't fast enough to saturate the interface in the last couple versions. Legacy support isn't much of a burden for PCIe nor USB, so I don't think it was a significant issue for PATA either.
SATA stopped at 6 Gbps because NVMe was just better in so many ways, rapidly maturing, and had already captured the high-end.
The main advantage was less hardware of any kind being between the CPU and the storage, which reduced latency and allowed bandwidth improvements to be made simply by keeping up with new PCIe versions. That goes back to my point about why PATA went away to SATA instead of being replaced with SCSI: NVMe is not only lower latency, but since it uses less hardware, it is also simpler and cheaper to implement.
You have it backwards. The cost-savings and bulk reduction of SATA vs. PATA drove the move from parallel to serial - not the other way around.


You're forgetting who bears the cost of PATA being such a clunky standard. It's disproportionately the OEMs. In contrast, it's overwhelmingly end-users dealing with video cables. OEMs drive the agenda, when it comes to PC standards.
"Like with USB, the smaller plug and cable were beneficial to commercial success but were only possible because of the serial interface in use" is not backwards to "The cost-savings and bulk reduction of SATA vs. PATA drove the move from parallel to serial," it merely adds context to it. Regardless, I doubt OEMs were overly concerned with paying a few dollars for one cable. Would they prefer to spend less? Always, but they tend to move slow and had no qualms continuing to ship PATA devices even in computers that had SATA ports. I saw many OEM computers that shipped with PATA devices instead of SATA even though they supported SATA. I saw some computers that shipped with a mix of PATA and SATA devices like a SATA hard drive and PATA optical drive. Sure, they're just going through whatever stock they had or could get a good deal on at the time, but the cable was a much smaller part of that cost than the cost of the device.
This was true back in the 1980's and early 1990's. Have you ever heard of ATAPI? It was used by many CD-ROM drives and was merely SCSI commands tunneled over IDE. Every controller supporting UDMA mode supported it.
That's completely irrelevant. I said SCSI hardware was expensive and it was. That you could use the same commands on a different, cheaper interface didn't mean you could make that interface do the things SCSI could do, like connecting many devices with a single parallel bus at relatively high speed. That required hundreds of dollars of hardware. USB also supports SCSI commands for external storage, which are better suited to getting good performance out of storage than USB's own much more general-purpose commands, but that doesn't change what the interface is physically capable of. SAS interface hardware is also more complicated and expensive than SATA interface hardware despite their apparent similarities. The commands being used have little to do with that as it's a hardware thing, not software.
This is rubbish. If you go back to the origins of the parallel port cable, the reason they adopted it was to avoid having to add circuits to serialize and deserialize data over a serial link. In other words, the proliferation of parallel interfaces was much more about cost-savings than anything else.

Those were the pre-VLSI days. The main thing that changed is the cost-reductions and commodification of ICs needed to implement serial links.

The parallel port started off without a buffer, communicating directly, as you mention. Reaching its higher speeds required a buffer, and once you have a buffer, there is no distinction between data entering it serially or in parallel. Whether you feed the buffer in parallel on one clock, or serially on a clock multiplied by the parallel width, is indistinguishable at the buffer's output to the device. You can just as easily input data serially and still output it in parallel; the output device doesn't see any of that. There were no cost-savings to be had by sticking with the parallel port other than not having to redesign the printer to move away from it after the introduction of the buffer; they just stuck with the interface for a long while afterwards because they already had one that worked fine and it was easier than changing to a different one. Again, the cost of the many-conductor cable wasn't enough to make them change, even though the buffer made it no longer advantageous. That a huge number of devices simply didn't need much bandwidth at the time also made it easy to not care about changing.
DDR3-era CPUs had only up to 4 memory channels. DDR5-era server CPUs have either 8 or 12. With AMD, you get the same 12 DIMMs as those old 3DPC motherboards, but with 3x the bandwidth that 4-channel/3DPC would provide. This lets you scale capacity along with bandwidth, making multiple DIMMs per channel less important than it used to be.

Furthermore, have you ever seen a 2-CPU server board with 24 DIMM slots?? It's hard to see where you'd fit another 24! Here's one with the maximal 48-DIMM configuration - it's huge and it has basically no room for anything else!

Furthermore, in the modern era, die-stacking is reportedly set to usher in DDR5 DIMM capacities as large as 1 TB! At that point, you're spending so much $ on DRAM per system that it's probably not a big deal to buy some more CPUs, to scale up further.

At today's prices 24 TB of 32 GB DDR5-4800 RDIMMs would cost a whopping $115,200. Obviously, GB/$ will improve by the time we have 1 TB DIMMs, but recent trends in semiconductor pricing suggest the improvement won't be much. Maybe a factor of 2 or 3? But, probably not the order of magnitude we're used to seeing.
I have seen 2-CPU server boards with 24 DIMMs. There's loads of old ones cheap on newegg and ebay. Besides, I wasn't just referring to AMD's 12-channel CPUs in two-socket boards. Intel's boards are still 8-channel. I haven't seen any 3 DIMM per channel boards for them. More DIMM slots means you can use less dense memory, which is cheaper than using the highest density memory. If you need a lot of memory, more slots may let you use cheaper memory, resulting in significant cost savings. I realize that there is less need for three DIMMs per channel now than there was before, though that's largely because memory bandwidth per DIMM hasn't increased as much for ECC memory from DDR4 to DDR5 as it used to, so you need 12 channels just to get good enough memory bandwidth to supply those ridiculous core count CPUs with enough data to not be mostly sitting idle. However, some workloads are not as bandwidth-intensive while still needing large capacity, and if it were feasible, three DIMMs per channel could have offered cost benefits, especially with CPUs that only have 8 or 6 channels anyway. If you don't need a second CPU per board for anything other than getting more DIMM slots, and with how many cores these new CPUs have that's not a stretch, then more DIMMs per channel could have saved you the cost of that second CPU.
 

bit_user

Titan
Ambassador
Two 32-bit sub-channels instead of one 64 bit channel means you have half as many lines to impedance match together as tightly. Lines in one channel don't need to be as tightly matched to the lines in the other channel as they do to the other lines in their own channel. That's an advantage for signal routing.
The thing is, unless you're a board designer (and it's clear that you're not), you don't know if something is a dominant issue or just also true. I'm getting a lot of that from you, where you read about some effect and assume it explains some particular design decision. It's basically the same kind of thinking that conspiracy nuts use to establish causal relationships anywhere they perceive an alignment of interests between two entities or persons.

In engineering, there's usually one dominant factor which drives a decision, even when it seems like there could be more. In order to truly know the rationale behind certain design decisions, you need to be intimately familiar with the details (or take it on the authority of someone who is).

Servers don't really care as they just used x8 PCIe SSDs (or PCIe x8 cards in general) or PCIe switch chips before bifurcation was a common feature and they'd still use them if x4/x4/x4/x4 bifurcation hadn't become a thing, albeit bifurcation makes it cheaper. Frankly, whether you use 24 x4 SSDs or 12 x8 SSDs (as one random example) doesn't really matter and if PCIe connections were primarily a bunch of x8 connections, you could use x8 SSDs without needing PCIe switch chips nor bifurcation.
This is wildly out of date, if it were ever true. AFAICT, virtually all current datacenter SSDs use NVMe x4 connections. There might still be some x8 drives kicking around, but the industry has standardized on x4 and the advent of PCIe 5.0 and CXL (not to mention PCIe 6.0 and CXL 3.0, waiting in the wings) would seem to eliminate any foreseeable need for them to go wider.
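To put numbers behind x4 being enough, here's the usable bandwidth of a x4 link by PCIe generation, after 128b/130b line coding but before protocol overhead:

```python
LANES = 4
ENCODING = 128 / 130   # 128b/130b line coding used by PCIe 3.0 through 5.0
GENS = {"PCIe 3.0": 8, "PCIe 4.0": 16, "PCIe 5.0": 32}   # GT/s per lane

for gen, gt_s in GENS.items():
    gb_s = gt_s * ENCODING / 8 * LANES
    print(f"{gen} x{LANES}: ~{gb_s:.1f} GB/s")
# ~3.9 / ~7.9 / ~15.8 GB/s -- protocol overhead takes a bit more, but a
# single x4 PCIe 5.0 link already outpaces most datacenter flash.
```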

PATA had such bad signal integrity issues that it needed 80 wire cables (mostly extra grounds between the signal lines) instead of the original 40 wire cables to get acceptable signal integrity with its faster versions.
Because they had to maintain backward compatibility, they couldn't do obvious things like differential signalling. That doesn't mean PATA hit a fundamental wall, just that maintaining support for IDE and multiple devices on a bus required they introduce those extra grounds. If you were starting from scratch, there's a lot more you could do.

You're clearly starting with an explanation and claiming it applies everywhere it seems like it could. That's not engineering. Engineers start with data and work forwards to the answer.

Yes, PATA having two devices per cable was related and part of the problem, but it was less significant
Less significant for signal integrity? How do you know?

Legacy support isn't much of a burden for PCIe nor USB, so I don't think it was a significant issue for PATA either.
PATA supports IDE, which is much older than either of those technologies.

That goes back to my point about why PATA went away to SATA instead of being replaced with SCSI: NVMe is not only lower latency, but since it uses less hardware, it is also simpler and cheaper to implement.
So, B replaced A, instead of being replaced by C, because D is better and cheaper? Do you read this stuff, before you post it?

I doubt OEMs were overly concerned with paying a few dollars for one cable.
With margins as thin as theirs, a couple $ per unit is huge! This tells me you've never worked anywhere in the hardware business. BoM costs are one of their main concerns. That said, your cost estimate is probably off by an order of magnitude.

I saw many OEM computers that shipped with PATA devices instead of SATA even though they supported SATA.
Early SATA drives were premium. Surely, it was the OEMs' ability to get cheaper PATA drives which drove any decision to keep using them.

That's completely irrelevant. I said SCSI hardware was expensive and it was.
But it didn't stay expensive because of the interface. It stayed expensive because it had been relegated to a high-end niche market of workstation and server hardware that was expensive for other reasons - not the interface.

I have seen 2-CPU server boards with 24 DIMMs.
You completely missed the point. It wasn't about 24-slot boards, but about 48-slot boards. I was pointing out how little room is left after just 24 slots. Once you go to 48 slots, you're left with a huge board that can fit nothing else.

Intel's boards are still 8-channel. I haven't seen any 3 DIMM per channel boards for them.
Their server CPUs don't support 3 DPC, and that's probably quite alright, since they're typically used in 2 CPU configurations and there doesn't seem to be much demand for 48-slot boards.

More DIMM slots means you can use less dense memory which is cheaper than using the highest density memory.
You're thinking of consumer RAM, here. Servers use registered memory, meaning they can handle more ranks.

Anyway most of the server market is datacenters, and they care a lot about density (i.e. fitting the most resources in the least rack space). I think they would go with higher-density RAM no matter what.

If you need a lot of memory, more slots may let you use cheaper memory resulting in significant cost savings.
Cheaper memory generally runs slower, which is another reason why you don't want it in 100+ core systems.

memory bandwidth per DIMM hasn't increased as much for ECC memory DDR4 to DDR5 as it used to,
ECC has nothing to do with it. And ECC RDIMMs are available at the maximum speed that the CPUs, themselves, support. These days, the ECC UDIMM market is suffering, mainly just because servers and now workstations only support RDIMMs. That leaves a small market for ECC UDIMMs, so DIMM manufacturers don't give it much love.
 