Does hyper-threading / multi-threading increase efficiency (Performance/Watt)?

Mellelmejor

Reputable
Jan 29, 2016
I'm looking forward to buying a new laptop, but I have this question that I wasn't able to find the answer to.

I would like to know if hyperthreading increases the efficiency, performance per watt, of a processor.

I am aware that the chips selected for i7 processors are usually better than the ones used for i5s.


So my question would be: if you turn off hyperthreading on a processor, how much does this affect the power consumption in relation to the performance you get? Having an answer for both AMD and Intel solutions would be nice.
 
Hyperthreading uses residual cycles of a main core to get more work done.
If you shut off hyperthreading for a multithreaded app, the power use may go down, but your job will take longer, essentially nullifying any benefit.

Not worth worrying about.

AMD does things differently, but the bottom line is the same: don't bother with such micromanaging.
 

R_1

Expert
Ambassador
no
if anything, hyperthreading/SMT is powering more of the core, doing more work.
turning off that part of the die will lower power use, much in the same way that an Intel chip with the iGPU powered down draws less wattage at the CPU than when the iGPU is powered.
 

InvalidError

Titan
Moderator

HyperThreading is nothing more than Intel's marketing name for SMT. SMT in AMD's Ryzen works exactly the same way: the instruction decoder fetches instructions from up to two instruction streams and the scheduler tries to reorder the mixed decoded instruction streams to fill as many execution units as possible on each clock cycle.

As for OP's question about efficiency, SMT will give you as much as 40% extra performance for 5-10% more power so disabling SMT in workloads that can benefit from it can cost you as much as ~35% in efficiency.
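For a rough sense of where that efficiency figure comes from, here is a minimal sketch in Python. It assumes performance per watt is simply relative throughput divided by relative power, and plugs in the 40% and 5-10% figures quoted above; the function name is made up for illustration.

```python
def perf_per_watt_gain(perf_gain, power_increase):
    """Relative change in performance per watt when SMT is enabled.

    Both arguments are fractions (0.40 means +40%). Assumes efficiency is
    simply throughput divided by power, which is a simplification.
    """
    return (1.0 + perf_gain) / (1.0 + power_increase) - 1.0


# Best case quoted in the thread: +40% performance for +5% to +10% power.
for power_increase in (0.05, 0.10):
    gain = perf_per_watt_gain(0.40, power_increase)
    print(f"+40% perf, +{power_increase:.0%} power -> {gain:+.1%} perf/W")
```

Under those assumptions the gain works out to roughly 27-33%, which is in the same ballpark as the figure quoted above.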
 

The catch is that there are very few tasks which benefit that much from hyperthreading/SMT. When a compute task is CPU-bound, it's usually CPU-bound because it's trying to do one particular thing, and (multithreaded) it's already using all of the execution units on the CPU capable of doing that one thing. For HT to be effective, the task has to require lots of different things to be done simultaneously (e.g. arithmetic operations + character operations), so that the HT thread can successfully run an instruction on a part of the CPU which the main (physical) thread is not using.

Very few real-world tasks are this eclectic in their workload. The common ones are video encoding (done on the CPU, not GPU), encryption/decryption, and data compression. HT can give you up to 40% extra performance with those (i.e. a 4-core yields the performance of a 5.6 core). But for the vast majority of other tasks, HT will only get you about 5%-10% more performance, and in some specialized tasks can actually result in reduced performance (a time-critical thread gets assigned to a virtual core instead of a physical core, resulting in other threads being idled as they wait for that critical thread to complete).

So if you're not doing the big three tasks I mentioned or playing one of the few games which benefit from HT, I wouldn't make HT a priority. I actually turn HT off most of the time on my i7 laptop because I noticed it runs about 5-10 °C cooler and I get about 15-20 min more battery life with it off, with hardly any noticeable difference in performance. Obviously I turn it back on if I'm going to do anything CPU-intensive like encode videos. But most of the time I just run with it off. Other than a brief stutter now and then when I'm opening a new browser tab or switching programs, I don't notice a difference.

So I guess my question to OP would be: are they asking about performance/Watt under load while doing specialized tasks (like video encodes), or performance/Watt during general everyday use with the laptop? HT is important for the former, not so much (and may actually make things worse, though you can disable it) for the latter.
 

InvalidError

Titan
Moderator

That's far from true: even if all threads are executing the exact same type of workload, using SMT means that the scheduler now has twice as many possible eligible instructions to choose from at any given time to cover potential dead-time and wasted work. Unless both threads happen to be waiting on a conditional branch, memory fetch or other similar operation at the same time, the other thread can be used to do useful work instead of stalling the whole core or committing resources to deep speculative execution work that may get discarded. In other words, even if you had a perfect overlap in instruction mix, you'd still get some gain from reduction of wasted work and outright stalls.

Also, ensuring that as many execution ports as possible are likely to have eligible instructions in the re-order buffer is the reason why Intel's execution ports are partitioned to cover seemingly random subsets of the instruction set instead of uniform subsets replicated across multiple ports and those subsets get tweaked between architecture refinements. Based on how modern Intel CPUs average about four instructions per cycle despite the architecture having 7-8 ports, we can deduce that on average, 60-70% of execution ports are used by a single-threaded task on any given clock tick, which means the other 30-40% are going to waste. That's why SMT yields a ~30% performance boost in heavily threaded workloads. (Ex.: Handbrake gains ~25%.)

BTW, none of the threads on an SMT CPU are any more physical or logical than the others. The only difference between the two (or however many there are) is the context tag the cores attach to instructions and data to keep tabs on what belongs to which thread.
 

That's not the average; 4 IPC is about the maximum you can get for actual useful software (some stress test loops could probably max out all the IPC).
Look at games: games use less IPC and you can get up to 100% gain from HTT, more if the CPU is getting hammered without HTT.

As for the OP's question, for normal everyday use HTT equals more cores, and more cores means that the CPU can do the same amount of work at lower clocks, which means better efficiency.
 

InvalidError

Titan
Moderator

Hm, no. Most games are IPC whores, which is why Intel's Core 2 and beyond CPUs have been destroying all of AMD's pre-Ryzen CPUs in most games even when AMD is given a massive overclock advantage against Intel CPUs with fewer cores running at stock clocks.

You will never see gains from SMT beyond 40% or so in AMD and Intel CPUs designed with deep out-of-order speculative execution because the whole point of deep out-of-order speculative execution is to keep execution units busy even through code with horrible instruction-level parallelism to get the highest single-threaded IPC practically possible.

If by "100% gain" you mean you are seeing 100% usage on "CPU0" and 100% usage on "CPU1" which happens to be the second thread on core #0, that isn't 100% more performance. Usage in Task Manager is calculated based on how much scheduled time the thread has had on a CPU, not how much work was actually performed. You may get 100% more total CPU time, but actual work done (ex.: Handbrake) over a given time has only increased by ~25% because the cores only had some amount over 25% of their resources to spare with HT/SMT off.
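To make the difference between scheduled CPU time and work done concrete, here is a small Python sketch. It assumes a single fully loaded core and takes the ~25% Handbrake gain mentioned above as the SMT throughput gain; the function and key names are hypothetical.

```python
def smt_accounting(throughput_gain):
    """Compare scheduled CPU time with work done on one core, SMT off vs on.

    Assumes a fully loaded core: one busy logical CPU with SMT off, two busy
    logical CPUs with SMT on. throughput_gain is a fraction (0.25 = +25%).
    """
    # Busy logical CPUs, the way a usage graph would count them.
    cpu_time_off, cpu_time_on = 1.0, 2.0
    # Work actually completed per unit of wall-clock time.
    work_off, work_on = 1.0, 1.0 + throughput_gain
    return {
        "cpu_time_ratio": cpu_time_on / cpu_time_off,
        "work_ratio": work_on / work_off,
        "work_per_busy_logical_cpu": work_on / cpu_time_on,
    }


print(smt_accounting(0.25))
```

With these inputs the reported CPU time doubles while the work done rises by only a quarter, so each "100% busy" logical CPU is doing well under a full core's worth of work.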
 

Karadjgne

Titan
Ambassador
Depends on the app vs. the CPU. If you use a program that benefits from multiple threads in terms of time taken to complete, like WinZip for instance, then a CPU drawing 15 W or so and finishing in 10 minutes is going to use far less battery than the same 15 W CPU with HT disabled taking 15 minutes due to lack of threads. That's going to be more evident on lower-thread-count CPUs.
Games don't work that way though, as time taken isn't a factor: the dude running through a dungeon is the same no matter what, he won't run in slow motion simply because the threads are packed. So then it's up to the power demands set by the user; higher settings demand more from the CPU/GPU.

On the other hand, some apps are single thread strong, only using 1-2 threads at most, so SMT/HT is a moot point for the most part.

It's mostly a matter of perspective. Running something that sucks up ½ your battery in 20 minutes is still more efficient than that same thing sucking up ¾ of the battery in an hour. First, the second one took way longer, and time is money, and second, at the end of it all, the first still has ½ battery left while the second is down to ¼. But most people only see ½ battery in 20 minutes vs. ¾ in an hour, and conclude the second is better because the battery will last longer.
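The two ways of looking at it can be put side by side in a few lines of Python. This is only a sketch using the battery fractions from the example above; the function name and dictionary keys are invented for illustration.

```python
def battery_report(fraction_used, minutes):
    """Summarise one run of the same job: battery spent, battery left, drain rate."""
    return {
        "battery_used": fraction_used,
        "battery_left": 1.0 - fraction_used,
        "drain_per_hour": fraction_used / (minutes / 60.0),
    }


fast = battery_report(0.50, 20)  # half the battery in 20 minutes
slow = battery_report(0.75, 60)  # three quarters of the battery in an hour

for name, run in (("fast", fast), ("slow", slow)):
    print(
        f"{name}: used {run['battery_used']:.2f}, "
        f"left {run['battery_left']:.2f}, "
        f"drain {run['drain_per_hour']:.2f} battery/hour"
    )
```

The slow run has the lower drain rate, which is what people notice, but the fast run spends less of the battery on the job as a whole.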
 


Most games don't use that many cores, which is why many people think like that, but looking at benchmarks of well-threaded games with a 2/2, a 2/4 and a 4/4 CPU you will see the 2/4 being far ahead of the 2/2 and very close to the 4/4 one.
Here, look at GTA V for example: an 80% gain and only 10% slower than the 4/4 CPU, while less threaded games show almost no difference.
https://www.hardwaresecrets.com/pentium-g4560-cpu-by-intel-review/7/
 

All the games are coded for mobile-tier Jaguars running at 1-1.5 GHz; I doubt there is any deep out-of-order speculative execution at all.
 
Six years ago, Hyperthreading was of dubious value for gaming....

Now, especially if you only have 2 or 4 cores...it's almost required....

The 8600K, having 6c/6t, still does darn well without it, so we certainly cannot say it's a 'must have' if you have an adequate core count.
 

InvalidError

Titan
Moderator

The way practically all PC games and ports are HEAVILY dependent on IPC and clock frequency strongly suggests otherwise. Most PC games and ports would be considered unplayable by a large chunk of PC gamers on an AMD FX-8320 underclocked to 2.5GHz despite being over 50% faster than the Jaguar CPU.

What is good enough for consoles where software can be optimized for a very limited number of possible platform and software configurations with limited user customization doesn't translate to good enough on PC with practically infinite hardware and software combinations, along with the added in-game customization required to accommodate a decent subset of that infinite gamut.
 


I'm not sure what you're trying to prove.

Speculative, symmetric, out-of-order, massively superscalar execution with beefy vector support is what gives Intel's microprocessors their bite. x86 instructions are the same regardless of whether they are executed in order or not, and they are the same whether they are executed speculatively or not.
 

InvalidError

Titan
Moderator

GTA V is a notoriously bad port. The anomalously large performance gain is simply from the game being broken on 2C2T CPUs, not from the CPU being 80% faster.

One frequently used low-latency thread synchronization technique is busy-waiting: repeatedly polling the value of a memory address instead of using mutexes or semaphores which incur the expense and latency of context switches. If you have a game engine designed on the premise of having at least four hardware threads to run on but run the code on a 2C2T CPU, every cycle wasted by one thread busy-waiting on an address is one less clock cycle available for the thread it is waiting on while that other thread is itself waiting to get its next CPU time slot. Tons of CPU cycles get wasted in multi-threaded deadlocks, waiting for OS scheduler preemption to break the deadlock by scheduling the process everything else is waiting on.
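As a toy illustration of the two waiting styles described above, here is a sketch in Python. It is not how a game engine does it (and Python's global interpreter lock changes the performance picture), so treat it purely as a picture of the pattern; `wait_for_signal` and its parameters are hypothetical names.

```python
import threading
import time


def wait_for_signal(busy_wait, delay_s=0.02):
    """Make one thread wait for another and count how often it polled."""
    ready = threading.Event()
    stats = {"polls": 0}

    def waiter():
        if busy_wait:
            # Busy-waiting: keep checking the flag, burning CPU time that the
            # thread being waited on might have needed.
            while not ready.is_set():
                stats["polls"] += 1
        else:
            # Blocking wait: this thread sleeps until the flag is set.
            ready.wait()

    thread = threading.Thread(target=waiter)
    thread.start()
    time.sleep(delay_s)  # stand-in for the work the other thread is doing
    ready.set()
    thread.join()
    return stats["polls"]


print("polls while busy-waiting:", wait_for_signal(busy_wait=True))
print("polls while blocking:    ", wait_for_signal(busy_wait=False))
```

The busy-waiting thread reacts quickly but spends the whole wait executing instructions; the blocking one costs nothing while it waits, at the price of a wake-up through the scheduler.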
 

Well, obviously you will need a game where a dual core is incapable of reaching high GPU usage; if it reaches 60% GPU usage there is no way to show any difference higher than 40%.


Yeah, consoles can't afford to do that, they do the exact opposite: every thread keeps doing what it has to do as fast as possible and they display their results whenever they can. This is especially obvious in GTA V, where on a dual core, if you drive fast, the game keeps running the main thread as fast as possible, which results in the other threads, namely the threads responsible for rendering buildings and roads and stuff, not being able to keep up, so large parts of the city just won't show up.
Agents of Mayhem had this as well, which resulted in an incredibly bad launch for the game.
 

InvalidError

Titan
Moderator

That's not how SMT/HT works. A thread that is scheduled on a CPU 100% of the time will still show as "100% CPU usage" in Task Manager and Process Explorer even if only one of the scheduler's issue ports gets used through that whole time and 80% of the physical resources are actually unused.

As for the thread synchronization stuff, try writing multi-threaded code involving multiple inter-related functions without some form of thread synchronization methods to prevent your threads from corrupting each other's data while attempting to pass data along. The only thing you'll succeed at is creating an undebuggable nightmare for yourself.
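For anyone wondering what "some form of thread synchronization" looks like when passing data along, here is a minimal Python sketch using the standard library's thread-safe `queue.Queue`. The pipeline itself (`run_pipeline`, the doubling step, the sentinel) is made up for illustration.

```python
import queue
import threading


def run_pipeline(items):
    """Pass items from a producer thread to a consumer thread through a queue."""
    handoff = queue.Queue()   # does the locking internally
    done = object()           # sentinel telling the consumer to stop
    results = []

    def producer():
        for item in items:
            handoff.put(item)
        handoff.put(done)

    def consumer():
        while True:
            item = handoff.get()  # blocks until something is available
            if item is done:
                break
            results.append(item * 2)  # stand-in for real processing

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


print(run_pipeline([1, 2, 3]))
```

The queue is the only thing both threads touch, so neither can see the other's data half-written; remove it and share a plain list with unsynchronised indexing, and the result depends on timing.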
 


GPU, I said: if the GPU is, for example, 60% used without HTT, then HTT can't give you anything more than a 40% benefit because the GPU just won't allow for it.
Also, how come you understand that different software uses different amounts of IPC but you believe that games are the ones that use the most? Software like x264 has the whole of the data to be worked on ready beforehand and can distribute it to as many threads as it wants to and run its codec as fast as possible on each of them, since all of the data is already there. Games, on the other hand, only have a vague idea of what's going on (they know the loaded level) and have to respond to user input, so they have to do things without prior notice.

Also PCM (Processor Counter Monitor) gives you info on how much IPC your cores chew through so you get an idea on how much IPC something actually uses.
Intel site for info
Github for downloading

x264 ~2.26 IPC
Deus Ex: Mankind Divided ~0.73 IPC
Older, lighter game that gets more FPS ~1.0 IPC, still less than half that of x264
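For reference, IPC is just instructions retired divided by core clock cycles over the same interval. The Python sketch below shows that calculation and how the figures quoted above compare to x264; the raw counter values are hypothetical, picked only to reproduce 2.26, and nothing here uses PCM's actual interface.

```python
def ipc(instructions_retired, core_cycles):
    """Instructions per cycle from two hardware counter readings."""
    return instructions_retired / core_cycles


# Hypothetical counter readings that would give the x264 figure.
print(f"example: {ipc(2_260_000, 1_000_000):.2f} IPC")

# IPC figures as quoted in the post.
quoted_ipc = {
    "x264": 2.26,
    "Deus Ex: Mankind Divided": 0.73,
    "older, lighter game": 1.0,
}

relative_to_x264 = {
    name: value / quoted_ipc["x264"] for name, value in quoted_ipc.items()
}
for name, ratio in relative_to_x264.items():
    print(f"{name}: {ratio:.0%} of x264's IPC")
```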
 

InvalidError

Titan
Moderator

lol. GPUs run hundreds of concurrent threads to hide individual thread stalls from threads waiting for data. The performance hit from disabling SMT on a GPU would be far worse than 40% as SMT is how GPUs are managing to achieve high aggregate performance despite lacking deep out-of-order execution. (GPUs can't afford it because it costs too much power and die area. It makes no sense to waste resources on single-threaded IPC on architectures intended for embarrassingly parallel workloads.)


Saying that games are MOST SENSITIVE to IPC is a completely different thing from "making the most use" of it. Software doesn't "make use of IPC", it just gets whatever IPC the CPU is able to provide to it. When I say that games are particularly sensitive to it, it is in the sense that once any critical thread maxes out a core's throughput, it becomes a bottleneck that caps off the frame rate and is immediately noticeable. Since most threads in a game engine work on completely different things, it is that much more difficult to keep them in sync and the only way to prevent any one of them from becoming a bottleneck is to make each thread/core as fast as possible. For embarrassingly parallel tasks like video processing, it doesn't matter anywhere near as much since software can easily compensate for slower cores by using more of them. That's why most modern video editing suites support GPU acceleration. Individual shader threads are much weaker than CPU threads but there are hundreds of times as many of them.
 


Intel seems to disagree on that one, as the pics with PCM show games using much, much less IPC than x264.



Yes, and it not only caps off the frame rate, it caps off the execution of any other thread, since they need to stay in sync with that one critical thread, which is why it cuts off the frame rate.
This is another point that gives HTT ample resources to work with: usually one thread will max out one core while the others will be running at much lower utilization.
 

Mellelmejor

Reputable
Jan 29, 2016
So I guess the answer may be: if the workload I'm throwing at my processor is using more than 4 cores, then HT may help improve efficiency, whereas if it isn't, HT will decrease efficiency by "being more powerful" needlessly?
 

Karadjgne

Titan
Ambassador
No. Not really. The answer is that if you run a light load, fewer than 4 threads, then HT is wasted and could possibly be less efficient. However, if you throw a heavy load at the CPU, even though the power draw might be higher, the time saved over the length of the job can bring the total energy used below what the same job costs on 4 threads.
Eg. if a core uses 20 W and an HT core uses 30 W: a large render takes the 4-core 2 hours, = 40 Wh. With HT it finishes 1.5x faster, so 1 hr 20 min, = 40 Wh. With those made-up numbers it comes out even and you still get the result sooner; any smaller power penalty and HT comes out ahead.
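The arithmetic behind that example is energy = power × time. Here is a minimal Python sketch of it, assuming the 20 W and 30 W figures are the draw for the whole job (the post doesn't say whether they are per core) and using made-up function names.

```python
def energy_wh(power_w, hours):
    """Energy used by a job, in watt-hours."""
    return power_w * hours


def break_even_speedup(power_without_ht_w, power_with_ht_w):
    """Speedup at which HT uses exactly as much energy as running without it."""
    return power_with_ht_w / power_without_ht_w


no_ht_hours = 2.0
ht_hours = no_ht_hours / 1.5  # 1.5x faster, i.e. 1 h 20 min

print(f"without HT: {energy_wh(20.0, no_ht_hours):.1f} Wh")
print(f"with HT:    {energy_wh(30.0, ht_hours):.1f} Wh")
print(f"break-even speedup: {break_even_speedup(20.0, 30.0):.2f}x")
```

Whenever the speedup from HT exceeds the ratio of the two power draws, the job costs less energy with HT on, even though the instantaneous draw is higher.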
 
Solution