Skylake Bugs Aren't Odd, They're Prime

Status
Not open for further replies.

ynhockey

Distinguished
May 4, 2012
15
0
18,510
I have also had random freezes running Skylake (6700K). I'm not sure if they're related to this bug; a BIOS update considerably reduced the number of freezes, but they still happen from time to time. Hoping for a quick fix through ASUS (motherboard)!
 

George Phillips

Reputable
Jun 17, 2015
614
0
5,360
It turns out the first and current stepping of Skylake CPUs has some major bugs relating to calculating numbers correctly. That's the most basic function of a processor. Skip it and wait for the next stepping, after the market has bought up all the current Skylake processors.

It looks like my PC with a Haswell Xeon of the latest stepping and ECC RAM will be serving much longer!
 

The bug has been fixed by a BIOS microcode update, like countless other bugs in past generations of Intel chips. There won't be a recall since all you need to do to fix it is update your BIOS when your motherboard manufacturer releases an updated BIOS that includes the fix.

The FDIV recall happened because entries were missing from the Pentium's division lookup table, and there was no possible microcode fix for that. The only possible workaround was having the OS intercept the FDIV op and execute it in software, which caused an unacceptable performance loss from the overhead of running an exception handler on every FDIV.
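
For anyone who never saw the FDIV bug in action, here is a minimal sketch of the check that circulated at the time: divide a known-bad operand pair, multiply back, and look at what is left over. The operand pair is the commonly cited one; the function name and the 1.0 threshold are my own choices for illustration, not anything Intel published.

```python
# Classic Pentium FDIV sanity check: x - (x / y) * y should be ~0.
# The operand pair below is the widely cited one that hit the faulty
# lookup-table entries on affected Pentiums.
X = 4195835.0
Y = 3145727.0


def fdiv_residual(x: float, y: float) -> float:
    """Divide, multiply back, and return what is left over."""
    return x - (x / y) * y


residual = fdiv_residual(X, Y)

# A correct divider leaves at most rounding noise here. Affected Pentiums
# were widely reported to leave a residual of 256 for this pair.
print("residual:", residual)
print("divider looks", "OK" if abs(residual) < 1.0 else "BROKEN")
```

On any modern CPU this should report a residual at or near zero. A software workaround had to replace the hardware divide with something like this on every single FDIV, which is where the performance cost came from.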

If it's a hardware bug, you can't fix it. You can just reprogram the chip to avoid the issue at a performance cost.

It's a flaw with the design, not the programming. You can't "fix" it. You can just make a workaround.
 
AVX2 at a lower clock is still faster than non-AVX2 at a higher clock. It's called a trade-off. The desktop Haswell chips increase the voltage when AVX2 gets used; the Xeon Haswells reduce the clock speed and keep the voltage the same.
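
A back-of-the-envelope sketch of that trade-off, assuming theoretical per-core double-precision peaks for Haswell (16 FLOPs/cycle with AVX2 FMA, 4 FLOPs/cycle with plain SSE2). The two clock speeds are hypothetical numbers picked only to illustrate the point, and real workloads will land well below these peaks.

```python
def peak_gflops(clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak per core: clock (GHz) times FLOPs issued per cycle."""
    return clock_ghz * flops_per_cycle


# Assumed per-core double-precision peaks for Haswell:
AVX2_FMA_FLOPS = 16  # 2 FMA units x 4 doubles x 2 ops (multiply + add)
SSE2_FLOPS = 4       # 1 add + 1 multiply per cycle x 2 doubles

# Hypothetical clocks: AVX2 code running throttled, SSE2 code at full turbo.
avx2_peak = peak_gflops(3.6, AVX2_FMA_FLOPS)
sse2_peak = peak_gflops(4.4, SSE2_FLOPS)

print(f"AVX2 at 3.6 GHz: {avx2_peak:.1f} GFLOPS per core")
print(f"SSE2 at 4.4 GHz: {sse2_peak:.1f} GFLOPS per core")
print(f"AVX2 advantage: {avx2_peak / sse2_peak:.2f}x")
```

Even with a sizeable clock drop, the wider vectors win on paper by a large margin, which is the trade-off being described.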

Not on the 4790K: mine goes straight to 100C before it thermally throttles. That doesn't seem like a great way of reducing the clock, you know, making it so hot that the chip death protection cuts in.

Kewlx25 said that Xeons reduce their clock frequency, not the desktop chips. Kewlx25 specifically stated that the desktop Haswell chips increase their voltage rather than decreasing clocks, which is why your i7-4790K has heat trouble. I agree that this was a bad call by Intel and that all of the Haswell chips should have done what the Xeons do and simply reduce clock speed as needed to keep power consumption and heat in check.
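
A rough sketch of why a voltage bump hurts more than a clock drop helps, using the textbook approximation that dynamic power scales with V^2 * f. The 10% figures are hypothetical, and this ignores leakage and the extra switching activity of the AVX2 units themselves.

```python
def relative_dynamic_power(v_scale: float, f_scale: float) -> float:
    """Textbook approximation: P_dyn ~ C * V^2 * f, relative to baseline."""
    return v_scale ** 2 * f_scale


# Hypothetical desktop behaviour: +10% voltage, clock unchanged.
voltage_bump = relative_dynamic_power(1.10, 1.00)

# Hypothetical Xeon behaviour: voltage unchanged, -10% clock.
clock_drop = relative_dynamic_power(1.00, 0.90)

print(f"+10% voltage, same clock: {voltage_bump:.2f}x baseline power")
print(f"same voltage, -10% clock: {clock_drop:.2f}x baseline power")
```

Under these assumptions the voltage bump costs about 21% more power, while the clock drop saves about 10%, before the heavier AVX2 workload is even counted.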
 

George Phillips

Reputable
Jun 17, 2015
614
0
5,360


I agree that the hardware can't actually be changed. The only workaround is to modify the BIOS and microcode to avoid the issues, which usually comes at a performance cost.
 

aldaia

Distinguished
Oct 22, 2010
535
23
18,995

It is true that past processors have well over a hundred documented errors. However, most of those errors remain unsolved, and just a few of them get a workaround (which is not exactly the same as a fix).
See for instance the list of errors for 4th generation Intel processors:
http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/4th-gen-core-family-desktop-specification-update.pdf
There are 155 errors listed. Under the column "Status" I've been unable to find a single error whose status is different than "No Fix". If you then go to the details for each error, some of them have a workaround, which in many cases implies that your software must do something or avoid some situation. Example: "Workaround: Software should ensure that the VEX.L bit is set to 0 for all scalar instructions".
However, most of the errors are listed as "Workaround: None identified"
 


There are plenty of errata in Haswell and Haswell-E, so it's not like those are without issues either. However, a bug in something as basic as prime calculation, one of the most basic functions of processors, seems like a gigantic oversight.
 

fireinthesky7

Reputable
Jan 12, 2016
2
0
4,510
How about the "feature" in my Haswell 4790K which turns it into a miniature supernova when AVX2 instructions are used in Prime95? ;)

One wonders why they bother including these features if the chip can't handle them at stock speeds.

This is apparently a problem with the newest build of Prime95 and not any specific processor. It's apparently limited Prime95's use as a stress tester because the temperatures CPUs have been hitting while running it are totally unrealistic for any other scenario.
 

blppt

Distinguished
Jun 6, 2008
576
92
19,060
"This is apparently a problem with the newest build of Prime95 and not any specific processor."

It's with any of Intel's offerings that have AVX2 onboard, which, if memory serves, is Haswell and beyond. If I remember correctly, versions of P95 prior to 28.x did not fully utilize FMA3 / AVX2, and that is the reason the current releases can create a small blast furnace in your Haswell case.

Still, the mere fact that Intel would allow the chip in any situation to nearly double its design TDP (I've read north of 150W when running the latest Prime95 on my 4790K), thereby overwhelming its recommended cooler (or, heaven forbid, the boxed one) and going into self-preservation mode in a hurry, seems like a rather poor oversight on their part.

Also of note is that apparently the area of the Haswell cores that generates this massive heat does not make great contact with the heat spreader, meaning that better cooling can only do so much when P95 (latest) kicks in.

Still, I'd hate to think what my 9590 would draw if it had AVX2 extensions. I think it would melt into the earth within 10 seconds of the P95 run start. :)
 

fireinthesky7

Reputable
Jan 12, 2016
2
0
4,510


On the flip side, you wouldn't have to leave your desk to cook.
 

SuperVeloce

Distinguished
Aug 20, 2011
154
0
18,690
"This is apparently a problem with the newest build of Prime95 and not any specific processor."

It's with any of Intel's offerings that have AVX2 onboard, which, if memory serves, is Haswell and beyond. If I remember correctly, versions of P95 prior to 28.x did not fully utilize FMA3 / AVX2, and that is the reason the current releases can create a small blast furnace in your Haswell case.

Still, the mere fact that Intel would allow the chip in any situation to nearly double its design TDP (I've read north of 150W when running the latest Prime95 on my 4790K), thereby overwhelming its recommended cooler (or, heaven forbid, the boxed one) and going into self-preservation mode in a hurry, seems like a rather poor oversight on their part.

Also of note is that apparently the area of the Haswell cores that generates this massive heat does not make great contact with the heat spreader, meaning that better cooling can only do so much when P95 (latest) kicks in.

Still, I'd hate to think what my 9590 would draw if it had AVX2 extensions. I think it would melt into the earth within 10 seconds of the P95 run start. :)

Some motherboards override the stock voltage settings and feed crazy voltages into the CPU. That is basically cheating, because they override the parameters Intel set. So the statement that desktop CPUs increase the voltage while Xeons decrease the frequency is not true, or rather, it's the motherboard's fault. Desktop chips should decrease the frequency when they hit the stock TDP.
Sometimes the problem lies with load-line calibration, which is just too aggressive on some motherboards/settings. My H87 motherboard (paired with an i7-4790, 3.6-4 GHz) never has the VRMs push the CPU more than 0.01V over standard under AVX2 loads. But if I set load-line calibration to max, I can actually hit the TDP wall (84W) with small FFTs in Prime95, and it underclocks to 3700 MHz (it sits happily at 3800 MHz on all other settings and loads). Rather than increase the TDP, I just undervolted the CPU.

Your CPU should not go over its rated TDP (88W for your CPU) if you never enabled that in the BIOS (except for the stock 8 seconds of higher power allowed for bursts), even with a Z motherboard and a K CPU. But it happens, mostly on Z motherboards, most likely because the 4790K throttles like crazy under heavy loads if it is held to the 88W TDP, and motherboard makers want you to feel that your system is working at its full potential.

I think your CPU should not need 150W+ for a 4.2 GHz four-core turbo at stock voltages unless your motherboard overrides the voltage bins (and 4.4 GHz on all cores is basically an overclock, only enabled if the motherboard maker chooses so). Actually, under AVX2 you should expect no more than 4 GHz on all cores without some undervolt, just as Intel promises (base clock). While they do specify a lower "AVX base" frequency for E5 CPUs, they never said that for E3 Xeons as far as I know. But if you manage to exceed the TDP for extended periods of time (with an AVX2 workload) even at 4 GHz, your CPU should and would underclock below base frequency when everything else is at Intel stock settings.
But you can set your TDP (and the short-term higher TDP) for all CPUs (even locked ones) in the BIOS anyway.
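
To make the "TDP plus a short burst allowance" behaviour concrete, here is a toy model, not Intel's actual algorithm. It assumes a sustained limit (Intel's term is PL1, 88W here as quoted above), a short-term limit (PL2, set to a hypothetical 110W), and a moving-average window of 8 seconds as mentioned above. The function name and the averaging scheme are my own simplifications.

```python
def simulate_power_limits(demand_w, pl1=88.0, pl2=110.0, tau_s=8.0, dt_s=1.0):
    """Toy model: allow up to pl2 while the moving average stays under pl1.

    demand_w is the power the workload would draw if unconstrained,
    one sample per dt_s seconds. Returns the power actually granted.
    """
    average = 0.0            # exponentially weighted average of granted power
    alpha = dt_s / tau_s     # weight of each new sample in the average
    granted = []
    for wanted in demand_w:
        cap = pl2 if average < pl1 else pl1
        power = min(wanted, cap)
        average += alpha * (power - average)
        granted.append(power)
    return granted


# Hypothetical Prime95-style load that wants 150W for a full minute.
trace = simulate_power_limits([150.0] * 60)

print("first second:", trace[0], "W")
print("last second: ", trace[-1], "W")
print("seconds spent above PL1:", sum(1 for p in trace if p > 88.0))
```

In this sketch the chip bursts to the short-term limit for a few seconds and then settles at the sustained limit, which in practice means dropping clocks. A board that ships with both limits raised far above stock never reaches that point, so the only brake left is thermal throttling.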
 

blppt

Distinguished
Jun 6, 2008
576
92
19,060
"Your CPU should not go over your rated TDP (88W for your cpu) if you never enabled that in bios (except for the stock 8 seconds of allowed higher tdp for bursts), even with Z mobo and K cpu. But it happens, mostly on Z motherboards... most likely because 4790k throttles like crazy under heavy loads if you dont want to exceed the 88w TDP and mobo makers wants you to feel that your system is working at its full potential."

Seems about right. It doesn't seem to be the voltage bump and resulting high power consumption that causes my 4790K to throttle (i.e., trying not to exceed the TDP of 84W). It's only when the temperature reaches the "chip protect" threshold (somewhere north of 100C) that the clock throttling sets in (internal CPU control). Up until that point, even with 150+W pumping through the CPU, the motherboard attempts to remain at the stock max frequency.

Still, that kind of wattage has to be very dangerous for this particular cpu which was engineered for 84W, so I would think that the mobo manufacturers (if they could) would put their own settings in the BIOS to downclock to a safe clock when the AVX2 usage parameters get intensive. My guess is that it is either something that they cannot control, or something Intel will not let them control, because ASUS, Gigabyte, etc., have no problem dumping every enthusiast setting under the sun into the BIOSes of their high-end mobos, and as far as I know, even the ROG Z97/Z170 boards do not have a specific setting for AVX2 loads. Sure, you could accomplish some of this yourself with various LLC and power limit settings (like you said), but you would think that by default, the motherboard maker would have "Auto" err on the side of caution. And those LLC/power limit settings would affect everything, not just the rare AVX2-heavy workload.
 

SuperVeloce

Distinguished
Aug 20, 2011
154
0
18,690
"Still, that kind of wattage has to be very dangerous for this particular cpu which was engineered for 84W, so I would think that the mobo manufacturers (if they could) would put their own settings in the BIOS to downclock to a safe clock when the AVX2 usage parameters get intensive"
No no, they do exactly what you think they cannot do. They remove all safeguards, because they expect you to overclock it to the moon.

"And those LLC/Power limit settings would affect everything, not just for the rare AVX2 heavy workload."
Well yeah, that's the point. If any other calculation could draw that kind of power, the temperatures would be the same as with AVX2. Hence the power limit settings are too aggressive and just unnecessary at stock, and "AVX2 only" settings are not needed if the thing is controlled how it's meant to be. If AVX2 needs 200W and all other calculations need 120W at a given frequency, then this setting is way too aggressive. It basically disables underclocking for AVX2 loads.
 

blppt

Distinguished
Jun 6, 2008
576
92
19,060
"No no, they do exactly what you think they cannot do. They remove all safeguards, because they expect you to overclock it to the moon."

No, that's not what I meant. What I mean is that the mobo manufacturers cannot disable the Haswell+ voltage bump under an AVX2 load. It happens no matter what. If you fiddle with BIOS voltage limits, those apply to every situation, not just to the AVX2 voltage bump. So yes, you can say "do not exceed this TDP no matter what", but that applies to everything you run on the chip, not just the AVX2 monster P95 load and subsequent voltage bump.

What I was saying is that the mobo manufacturers are apparently unable to disable the processor's internal directive to bump the voltage when an AVX2 load is detected. Because if it were possible to control this specific state from a mobo bios, you can bet ASUS or GB or MSI would have added such a voltage mod (maybe even adjustable levels) in their enthusiast mobo bioses.
 

SuperVeloce

Distinguished
Aug 20, 2011
154
0
18,690
I understand what you mean, but as I already said, on my system (H87 + 4790) with AVX2 loads (and any other synthetic "burn-in" calculations) running, my voltages stay where they normally are (I get a bump of 0.01V or 0.02V at most).
 


What about the servers? Skylake isn't used there much if at all right now as far as I'm aware, unless someone goes out and uses a Xeon E3 V5 which really isn't ideal for most server usages. Low end servers can get away with much weaker CPUs and higher end servers are often best with a Xeon E5 or E7 even when you aren't going multi-processor because they're generally more optimized for the workload and much more power efficient. Xeon E5 and Xeon E7 are still limited to versions of Haswell.

Also, I'd be surprised if most servers had to spend much time calculating prime numbers. Sure, the supercomputers that do nothing but prime numbers could be useless with Skylake, but again, those are best with Haswell anyway for core count, cache, and efficiency advantages. Skylake systems are unlikely to be used in a situation where this bug is a problem outside of stress testing and there are other stress tests around if necessary.
 

Epsilon_0EVP

Honorable
Jun 27, 2012
1,350
1
11,960


Intel's strategy of introducing new architectures in lower end hardware and slowly trickling up to server components makes sense with this in mind. I wonder if the Skylake based Xeons will feature a revised architecture to deal with issues like this.
 


That it does. High performance and high efficiency Skylake isn't due until 2017 and by then, this will probably be fixed without BIOS workarounds.
 