News If you think PCIe 5.0 runs hot, wait till you see PCIe 6.0's new thermal throttling technique

Conor Stewart

Commendable
Oct 31, 2022
52
46
1,560
I can't see how reducing the link speed or number of lanes is the best way to deal with thermals. Surely it would be better to limit it in other ways, like power and temperature limits in the cards themselves (isn't that already a thing?), or communication between the CPU and PCIe device so that, for instance, the GPU can tell the CPU it can't keep up, is overheating, and needs to slow down (I would be surprised if this isn't already a thing).

Adding a connection bottleneck rather than just slowing the device down seems like a very odd and nonsensical way of doing things. It would only make sense if there was no way of slowing the device down, in which case it just needs to be designed better.

If you are going to use a PCIe 6.0 x16 connection just for it to reduce to PCIe 6.0 x8 or x4 under load, then you might as well just use PCIe 4 or 5. The same goes if it reduces the clock speed of the PCIe bus. Sure, you would lose out on burst performance, but the sustained performance would be the same.
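
A rough back-of-the-envelope sketch of that point in Python (the per-lane transfer rates are from the PCIe specs; the encoding-efficiency factors are simplified assumptions that ignore FLIT framing and FEC overhead):

```python
# Approximate one-direction bandwidth for a PCIe link.
GT_PER_LANE = {1: 2.5, 2: 5.0, 3: 8.0, 4: 16.0, 5: 32.0, 6: 64.0}

def usable_gb_per_s(gen: int, lanes: int) -> float:
    if gen <= 2:
        efficiency = 8 / 10      # 8b/10b encoding
    elif gen <= 5:
        efficiency = 128 / 130   # 128b/130b encoding
    else:
        efficiency = 1.0         # FLIT mode; FEC/CRC overhead ignored here
    return GT_PER_LANE[gen] * lanes * efficiency / 8  # GT/s -> GB/s

for gen, lanes in [(6, 16), (6, 8), (6, 4), (5, 16), (4, 16)]:
    print(f"PCIe {gen}.0 x{lanes}: ~{usable_gb_per_s(gen, lanes):.1f} GB/s")
```

By this rough math, a PCIe 6.0 link throttled down to x4 sustains roughly what a PCIe 4.0 x16 link does, which is exactly the equivalence being drawn here.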

Also, can't you just design the products better? If you know your product can only handle a PCIe 6.0 x4 connection under load, then why give it a 16-lane connection? The performance gained for the tiny amount of time it can use the 16 lanes (before it heats up) will be minimal and of no use to many applications, especially the likes of datacentres where you want to use the devices as much and as constantly as possible.

Also, what benefit do you get from reducing the lanes of a device in use? It's not like you can then use those lanes elsewhere; all lanes are still taken up by that one device.
 

bit_user

Titan
Ambassador
The article said:
As the PCIe bus gets faster, it becomes more demanding of signal integrity and less tolerant of signal loss, which is often combated by improving encoding or increasing clocks and power, with the latter two creating extra heat.
Even improvements in the encoding can burn more power. I'm sure PCIe's PAM4 does, not to mention its FEC computation.
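
For a sense of what PAM4 buys, and why it costs power, here is a minimal sketch assuming the conventional Gray-coded level map (the actual PCIe 6.0 signaling details are more involved than this):

```python
# PAM4 packs 2 bits into each symbol using 4 voltage levels, versus
# NRZ's 1 bit across 2 levels. Gray coding makes adjacent levels
# differ by only one bit, limiting the damage from level-slicing errors.
GRAY_PAM4 = {(0, 0): 0, (0, 1): 1, (1, 1): 2, (1, 0): 3}

def pam4_encode(bits: list[int]) -> list[int]:
    assert len(bits) % 2 == 0
    return [GRAY_PAM4[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]

print(pam4_encode([1, 0, 0, 1, 1, 1, 0, 0]))  # 4 symbols carry 8 bits
```

Double the bits per symbol at the same symbol rate is how 64 GT/s happens, but the four closely spaced voltage levels shrink the eye height, which is what drives the need for FEC and for hungrier receivers.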
 

bit_user

Titan
Ambassador
What is old is new again. Computers are going to require air conditioning to use them.
They already do. PCIe 6.0 is used in servers, which run in air conditioned machine rooms and datacenters.

If the new standards can't run as rated, then why bother upgrading to them when you can use the older standard and spend less money?
PCIe 6.0 was created and is being adopted to address market needs. The market wants faster speeds more than energy efficiency.

Also, can't you just design the products better?
Haven't you ever heard of a fan failure, or maybe a heatsink that's not properly mounted? Just like how CPUs will avoid burning themselves out in these situations, PCIe controllers also need to avoid frying chips (on some very expensive boards, I might add), in the event of a cooling failure.
 

Nicholas Steel

Distinguished
Sep 12, 2015
30
7
18,535
What is old is new again. Computers are going to require air conditioning to use them. LOL.
If the new standards can't run as rated, then why bother upgrading to them when you can use the older standard and spend less money?
Think of it like a CPU with Boost capability, or in other words a PCI-E 5.0 device that can temporarily boost up to double the speed.
 
  • Like
Reactions: greenreaper

Conor Stewart

Commendable
Oct 31, 2022
52
46
1,560
They already do. PCIe 6.0 is used in servers, which run in air conditioned machine rooms and datacenters.


PCIe 6.0 was created and is being adopted to address market needs. The market wants faster speeds more than energy efficiency.


Haven't you ever heard of a fan failure, or maybe a heatsink that's not properly mounted? Just like how CPUs will avoid burning themselves out in these situations, PCIe controllers also need to avoid frying chips (on some very expensive boards, I might add), in the event of a cooling failure.
How is reducing the number of PCIe lanes the solution to cooling failures? The chips need to avoid frying themselves by slowing down or switching off if they get too hot. Why should controlling the device's thermals be the job of the PCIe controller?

This also seems to be talking about normal use, not failures, so that is a totally different argument anyway.
 
  • Like
Reactions: Avro Arrow

bit_user

Titan
Ambassador
How is reducing the number of PCIe lanes the solution to cooling failures? The chips need to avoid frying themselves by slowing down or switching off if they get too hot.
Deactivating some of the lanes is a mechanism that can be used to throttle activity, thus reducing power consumption & heat dissipation.
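
A minimal sketch of that idea, as invented user-space pseudologic in Python (the thresholds and the read_temp_celsius() helper are made up for illustration; the actual patchset implements this inside the kernel, not in user space):

```python
import time

def read_temp_celsius() -> float:
    """Platform-specific sensor read; stubbed out in this sketch."""
    raise NotImplementedError

def pick_link_width(temp: float) -> int:
    """Map a temperature to a target lane count, like cooling-device states."""
    if temp < 70:
        return 16  # full width
    if temp < 80:
        return 8   # mild throttling
    if temp < 90:
        return 4   # heavy throttling
    return 1       # last resort before an emergency shutdown

def throttle_loop() -> None:
    while True:
        target = pick_link_width(read_temp_celsius())
        # A real implementation would retrain the link to `target` lanes
        # here; fewer active lanes means fewer SerDes running, so less
        # power and less heat.
        time.sleep(1.0)
```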

Why should controlling the device's thermals be the job of the PCIe controller?
The patch isn't very specific about what is being throttled, but there could be bridge chips or even the PCIe controller built into the CPU that needs to throttle.

This also seems to be talking about normal use, not failures, so that is a totally different argument anyway.
Nothing I've seen in the patchset indicates one way or another.

There can be multiple reasons why something overheats. Maybe the cooling is adequate for typical use, but the system gets subjected to a workload that hammers PCIe unusually hard and for an extended duration. I've even opened up a server and removed some of the airflow shrouds, while it was running, in order to try and narrow down the source of a noise it was making. Whatever the cause, the hardware needs the ability to throttle enough that it doesn't burn out.
 

razor512

Distinguished
Jun 16, 2007
2,159
87
19,890
I wonder, can motherboard makers get rid of the requirement for motherboard tray standoffs, and then make the back of the motherboard have a plate that doubles as a heatsink and heat spreader?
 
Yeah, I won't be buying one of these (or a PCIe5 NVMe, for that matter) because clearly we don't have the technology yet to handle speeds like this safely. At the moment, the real-world performance differences between PCIe3 and PCIe5 are almost imperceptible. The only exception is copying files from one drive to another, and even that is limited by the write speed of the receiving drive.

When it comes to loading programs, we're talking a difference of <2 seconds, which is no real difference at all, and certainly not worth the tradeoffs of the (sometimes dangerous) levels of heat that these "high-speed" NVMe drives can generate. If an NVMe needs to be throttled to reduce heat, then you're probably better off just buying a PCIe3 or PCIe4 NVMe, because a PCIe6 drive that gets throttled will probably give you a similar experience to a PCIe4 NVMe anyway.

I would rather have a PCIe3 NVMe and lose a tiny bit of performance in exchange for superior thermals and reliability.
 

bit_user

Titan
Ambassador
I wonder, can motherboard makers get rid of the requirement for motherboard tray standoffs, and then make the back of the motherboard have a plate that doubles as a heatsink and heat spreader?
Assuming we're talking about servers, wouldn't it be easier for them just to attach heatsinks to some of the hotspots and direct some airflow under there?
 
If it carries on like this, without major improvements in efficiency, PCs are going to need an external refrigeration system, to chill the intake air...🥶
Asetek might have to design an AIO for SSDs. :eek:
Just stick some heatsinks on the traces on the mobo... problem solved!
Or, even better, we could have water channels that run through the motherboard to cool everything, even the CPU! ;)(y):ROFLMAO:
Think of it like a CPU with Boost capability, or in other words a PCI-E 5.0 device that can temporarily boost up to double the speed.
I see what you're saying, but I'd rather wait until we have the proper tech to do it without worrying about heat. Our rigs are already packed with things like 300W video cards and (for some people) 150+W CPUs. We're going to have to move A LOT of air through the cases. I would rather not have SSDs that create enough heat that they can throttle another part of my PC.
 
  • Like
Reactions: PEnns and Order 66

ravewulf

Distinguished
Oct 20, 2008
985
44
19,010
I can't see how reducing the link speed or number of lanes is the best way to deal with thermals. Surely it would be better to limit it in other ways, like power and temperature limits in the cards themselves (isn't that already a thing?), or communication between the CPU and PCIe device so that, for instance, the GPU can tell the CPU it can't keep up, is overheating, and needs to slow down (I would be surprised if this isn't already a thing).

Adding a connection bottleneck rather than just slowing the device down seems like a very odd and nonsensical way of doing things. It would only make sense if there was no way of slowing the device down, in which case it just needs to be designed better.

If you are going to use a PCIe 6.0 x16 connection just for it to reduce to PCIe 6.0 x8 or x4 under load, then you might as well just use PCIe 4 or 5. The same goes if it reduces the clock speed of the PCIe bus. Sure, you would lose out on burst performance, but the sustained performance would be the same.

Also, can't you just design the products better? If you know your product can only handle a PCIe 6.0 x4 connection under load, then why give it a 16-lane connection? The performance gained for the tiny amount of time it can use the 16 lanes (before it heats up) will be minimal and of no use to many applications, especially the likes of datacentres where you want to use the devices as much and as constantly as possible.

Also, what benefit do you get from reducing the lanes of a device in use? It's not like you can then use those lanes elsewhere; all lanes are still taken up by that one device.

It's not about how much power the cards are using; it's about how much power the PCIe lanes themselves are using within the motherboard. It takes power to move data between the CPU/chipset and the cards. Operating systems already reduce the link speed down to PCIe 1.1 when idle to save power; this patch adds thermals as another trigger, for when the lanes/redrivers/etc. on the motherboard get too hot.
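
You can watch the link-state side of this on Linux: each PCIe device exposes its negotiated speed and width in sysfs (those attribute names are real kernel attributes; the device address below is just a placeholder):

```python
from pathlib import Path

def link_state(bdf: str = "0000:01:00.0") -> tuple[str, str]:
    # Placeholder BDF; pick a real device from /sys/bus/pci/devices.
    dev = Path("/sys/bus/pci/devices") / bdf
    speed = (dev / "current_link_speed").read_text().strip()
    width = (dev / "current_link_width").read_text().strip()
    return speed, width

speed, width = link_state()
print(f"current link: {speed}, x{width}")
# Poll this while a GPU idles: many systems drop to "2.5 GT/s PCIe"
# (Gen 1 speed) and renegotiate upward when load returns.
```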
 

35below0

Respectable
Jan 3, 2024
1,726
744
2,090
I would rather have a PCIe3 NVMe and lose a tiny bit of performance in exchange for superior thermals and reliability.
The performance loss really is tiny. Gen 3 and Gen 4 are good enough for most. Maybe in the next 3-4 years this changes a little bit, but I bet that if it costs a ton of money and requires complicated cooling, it will be something like a 4090 today: maybe nice to have, but I ain't paying for that.

(Bad analogy, unfortunately, because the 4090 justifies its price with a huge jump in performance, while PCIe 5.0 NVMe drives are only slightly faster.)
 
  • Like
Reactions: Avro Arrow

MobileJAD

Prominent
May 15, 2023
24
13
515
If it carries on like this, without major improvements in efficiency, PCs are going to need an external refrigeration system, to chill the intake air...🥶
Don't server farms in corporate server rooms already do this? That's kinda the target customer for the changes discussed in the article, anyway. I don't really see the thermal problems being discussed as something the average PC gamer will have to worry about.
Although it is concerning that the PCIe evolution between generations we have been enjoying so far won't be able to continue; imagine not being able to run your fancy new $2,000 high-end GPU at x16 in the future. Although, truth be told, GPUs don't seem to be able to keep up with the ever-evolving PCIe speeds, as I don't seem to recall hearing about any PCIe 5.0 GPUs...
Maybe PC gamers at home won't have to worry about PCIe 6.0 thermal issues?
 
  • Like
Reactions: bit_user
Also, can't you just design the products better? If you know your product can only handle a PCIe 6.0 x4 connection under load, then why give it a 16-lane connection?
Within the limits of an M.2 SSD? They have little wiggle room.

& they are companies whose #1 rule is...profit...so they push the newest product.

About the only way they'd get around the thermal limit (w/o lowering something spec-wise) is redesigning MBs so they could fit active coolers on them (and currently they can't, due to the layout of the MB and GPUs being so close to all the slots, blocking any actual active cooling you could do that would have a meaningful impact).


The chips need to avoid frying themselves by slowing down or switching off if they get too hot.
That's not the answer in every situation.
If your boot OS drive got too hot because you installed/copied a file and it shut off...that's gonna be an issue, right?

And they already do slow down to prevent cooking themselves, as they have thermal limits at which they throttle.
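
On Linux you can check that yourself, assuming the kernel's NVMe hwmon support is enabled: drives register a temperature sensor you can poll while hammering them:

```python
from pathlib import Path

# Scan hwmon for NVMe controllers and print their composite temperature.
for hwmon in Path("/sys/class/hwmon").glob("hwmon*"):
    if (hwmon / "name").read_text().strip() == "nvme":
        millidegrees = int((hwmon / "temp1_input").read_text())
        print(f"{hwmon.name}: {millidegrees / 1000:.1f} °C")
```
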
Can motherboard makers get rid of the requirement for motherboard tray standoffs, and then make the back of the motherboard have a plate that doubles as a heatsink and heat spreader?
Yes, but that would require cases to support it...and also, with how they are moving some connections to the back of the board, those would interfere as well, and it would increase production costs.

And then you run the risk of it not being good enough. Heatsinks/spreaders can only move so much heat in a given time (similar to why an IHS can't keep CPUs cold: even the best cooling gets overcome by the amount of heat generated in a specific area).
we could have water channels that run through the motherboard to cool everything, even the CPU!
I recall reading an article in the past where they were testing out actual water channels inside CPUs to try and keep 'em cooler...no idea what happened to that idea (it might still be being worked on, or dropped).



Maybe PC gamers at home won't have to worry about PCIe 6.0 thermal issues?
They will.

PCIe 5.0 already has thermal issues (which is why you see so much active cooling for 'em on the market, as well as why they all come w/ some form of heatsink; that wasn't the case with 3.0, and only tail-end Gen 4 drives could hit speeds that might need 'em).

GPUs don't seem to be able to keep up with the ever-evolving PCIe speeds, as I don't seem to recall hearing about any PCIe 5.0 GPU
Because it's costly to do so & GPU makers have no reason to do so for consumer GPUs (they will slowly increase over time).

Now, server side? Nvidia's H800 & H100 are Gen 5 GPUs (primarily AI-focused, which can hit the need for Gen 5).
 

Notton

Commendable
Dec 29, 2023
903
804
1,260
I find it amusing that people think they're going to use this data center stuff in their gaming desktop PC.

Y'all are aware that desktop GPUs cannot saturate a PCIe 5.0 x16 link, right?
In fact, an RTX 4090 only loses some 2% performance when used on a PCIe 4.0 x8 link.
 
  • Like
Reactions: bit_user

bit_user

Titan
Ambassador
It's not about how much power the cards are using; it's about how much power the PCIe lanes themselves are using within the motherboard. It takes power to move data between the CPU/chipset and the cards.
That power is burned at the PCIe controllers in the endpoints and any intervening switches. Perhaps also PCIe retimers, but I'm not sure they're concerned about those.

While I suppose the copper traces in the PCB might heat up a little bit, I'm pretty sure that's not what they're worried about.
 
  • Like
Reactions: slightnitpick

jlake3

Distinguished
Jul 9, 2014
148
211
18,960
I find it amusing that people think they're going to use this data center stuff in their gaming desktop PC.

Y'all are aware that desktop GPUs cannot saturate a PCIe 5.0 x16 link, right?
In fact, an RTX 4090 only loses some 2% performance when used on a PCIe 4.0 x8 link.
I mean... PCIe standards tend to roll out there first because they need the performance and can bear the cost, but so far they've always made their way down to consumer-grade PCs.

Early PCIe 4.0 chipsets needed active cooling, which got solved, but PCIe 4.0 SSDs got a reputation for running warm, PCIe 5.0 drives have been worse, and there hasn't been a lot of progress on that one. I don't think most gamers are worried they're gonna have PCIe thermal throttling in 2024 on existing hardware, but there's a worry (some valid and some misplaced) that if things keep going the way they seem to be going, then PCIe implementations are going to be an increasing problem.
 
  • Like
Reactions: Avro Arrow