Intel Comet Lake-S Arrives: More Cores, Higher Boosts and Power Draw, but Better Pricing

So you do understand the issue: while most things an average person will run on a PC are 3 instructions wide, or maybe 4 or 5, Skylake is already at 6, so further widening the core will not give them any additional speed.
If Intel couldn't improve its scheduler to make use of extra instruction issue ports, it wouldn't bother with making the back-end any wider. With Ice Lake, Intel claims 18% higher IPC on average and benchmarks appear to support that. Sunny Cove's scheduler can look up to 320 instructions ahead (vs 224 for Skylake) to find the best combination of instructions to fill as many of its 10 execution ports as possible.

Two extra execution ports is a 25% increase over Skylake and Intel claims 18% increased IPC. For that to happen, the extra ports have to be used or otherwise contribute to relieving bottlenecks on other ports about 50+% of the time.

Intel is not adding execution ports and shuffling instructions between them just for fun. It is adding and shuffling them because the scheduler is being bottlenecked by running out of available execution ports capable of handling some subset of instructions significantly more often than it is running out of said instructions that would otherwise be ready for execution.
 
If Intel couldn't improve its scheduler to make use of extra instruction issue ports, it wouldn't bother with making the back-end any wider. With Ice Lake, Intel claims 18% higher IPC on average and benchmarks appear to support that.
Based on the "industry standard" pool of benchmarks that are made to extract as much parallelization as possible from modern CPUs?
I'm not arguing that this won't be faster and able to do more throughput but haven't I made my point clear that I'm talking about software that can't do that?
 
I'm not arguing that this won't be faster and able to do more throughput but haven't I made my point clear that I'm talking about software that can't do that?
It isn't up to software to do anything about IPC; the scheduler will attempt to get all of the instruction-level parallelism it can out of the code, no matter how poorly written and optimized it might be. All that model-driven optimization does is help the scheduler find more optimal instruction mixes to execute more often.

What isn't clear to me is your grasp of out-of-order and speculative execution. The CPU does not need your blessing to keep itself as busy as it possibly can. You'd practically have to intentionally design code to undermine the scheduler's efforts to prevent IPC from going up on CPUs with more execution resources and the more advanced scheduler needed to keep those busy.
 
Instruction-level parallelism typically increases with the complexity and amount of code. As a very rough example, you can use mathematical equations with varying degrees of complexity.

Example 1:
y = x^2 + bx + c

This equation has two multiplications that can be executed in parallel. After that, it has two more additions that have to run sequentially.
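A rough sketch of that breakdown in C (the t1/t2/t3 temporaries are just there to make the dependencies visible):

double poly1(double x, double b, double c)
{
    double t1 = x * x;    /* cycle 1: x^2, independent of t2 */
    double t2 = b * x;    /* cycle 1: bx, can issue alongside t1 */
    double t3 = t1 + t2;  /* cycle 2: needs both products */
    return t3 + c;        /* cycle 3: needs t3 */
}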

Example 2:
y = x^4 + 2x^3 + 6x^2 + 3x + 5

This equation starts with no less than 4 multiplications that may be executed simultaneously, one for each term containing x. Then we get 3 more instructions, then 2, concluding with one final instruction. For a CPU, there is another interesting thing going on here: powers higher than 2 need multiple multiply instructions. With that knowledge we can say that the instruction-level parallelism each cycle works out to roughly [5-4-2-1], assuming each instruction takes exactly one cycle. You can see that this far more complex equation only took one more cycle to complete, but also that it was able to reduce execution time by 20% by performing more than 4 instructions at the same time.
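To make that concrete, here is one possible way (my own sketch, with x*x recomputed per chain instead of reused, just to keep the groups independent) to arrive at that [5-4-2-1] grouping in C:

double poly2(double x)
{
    /* cycle 1: five independent multiplies */
    double p2 = x * x;    /* x^2 for the 6x^2 term */
    double q  = x * x;    /* x^2 feeding the x^4 chain */
    double r  = x * x;    /* x^2 feeding the x^4 chain */
    double c3 = 2.0 * x;  /* 2x, on the way to 2x^3 */
    double c1 = 3.0 * x;  /* the 3x term */

    /* cycle 2: four instructions, each only needing cycle-1 results */
    double x4 = q * r;    /* x^4 */
    double x3 = c3 * p2;  /* 2x^3 */
    double x2 = 6.0 * p2; /* 6x^2 */
    double s0 = c1 + 5.0; /* 3x + 5 */

    /* cycle 3: two adds */
    double s1 = x4 + x3;
    double s2 = x2 + s0;

    /* cycle 4: the final add */
    return s1 + s2;
}

A compiler would normally reuse x^2 instead of computing it three times, which is roughly consistent with the 10-or-so assembly instructions mentioned below; the per-cycle picture stays about the same either way.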

The latter equation would only be around 10 lines in assembly language. Considering how many lines of code there are in most applications, it is clear that it becomes very easy to achieve a high level of instruction-level parallelism when the resources are available. The depth of the instruction queue compensates for more 'linear' code. Speculative execution allows the core to work ahead of time, for example when it is evaluating a conditional branch.
 
You can see that this far more complex equation only took one more cycle to complete, but also that it was able to reduce execution time by 20% by performing more than 4 instructions at the same time.
While making a parallelism example out of a single equation is fine, people mustn't forget that with OoO, ILP/IPC is not limited to closely related instructions. The CPU can be working on multiple different pieces of code several instructions ahead (up to 320 for Ice Lake) so long as dependencies are met. Also, with speculative execution and branch prediction, it can get concurrent head-starts on multiple loop iterations and recursions too.

Writing useful code that can only be executed in strict serial order is practically impossible on modern IPC-centric CPUs.
 
It isn't up to software to do anything about IPC; the scheduler will attempt to get all of the instruction-level parallelism it can out of the code, no matter how poorly written and optimized it might be. All that model-driven optimization does is help the scheduler find more optimal instruction mixes to execute more often.
By ‘software’ he clearly means the algorithm(s) that the software uses. Let’s take a matrix multiplication as an example. Sure, with parallel processing you can theoretically do all the needed multiplications in a single cycle, but then you will need another cycle to add up the products. Computationally there is nothing more you can do, as you reduced the process to just two cycles and the second cycle is dependent on the first. In such a case, in its entirety, the algorithm can only be sped up by higher frequency.
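As a small illustration of that two-step dependency, here is a made-up 2x2 case in C, where the picture of one round of multiplies followed by one round of adds holds literally (larger matrices need a tree of adds per element):

void matmul2x2(const double a[2][2], const double b[2][2], double c[2][2])
{
    /* all eight products are independent of one another */
    double p0 = a[0][0] * b[0][0], p1 = a[0][1] * b[1][0];
    double p2 = a[0][0] * b[0][1], p3 = a[0][1] * b[1][1];
    double p4 = a[1][0] * b[0][0], p5 = a[1][1] * b[1][0];
    double p6 = a[1][0] * b[0][1], p7 = a[1][1] * b[1][1];

    /* each add has to wait for its two products */
    c[0][0] = p0 + p1;
    c[0][1] = p2 + p3;
    c[1][0] = p4 + p5;
    c[1][1] = p6 + p7;
}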
 
While making a parallelism example out of a single equation is fine, people mustn't forget that with OoO, ILP/IPC is not limited to closely related instructions. The CPU can be working on multiple different pieces of code several instructions ahead (up to 320 for Ice Lake) so long as dependencies are met. Also, with speculative execution and branch prediction, it can get concurrent head-starts on multiple loop iterations and recursions too.
Indeed, that's why I said it's a very rough example. I also touched on why that is in the last paragraph.
 
By ‘software’ he clearly means the algorithm(s) that the software uses. Let’s take a matrix multiplication as an example. Sure, with parallel processing you can theoretically do all the needed multiplications in a single cycle, but then you will need another cycle to add up the products. Computationally there is nothing more you can do, as you reduced the process to just two cycles and the second cycle is dependent on the first. In such a case, in its entirety, the algorithm can only be sped up by higher frequency.
Yes, but obviously a large matrix needs more cycles on a CPU that has a limited number of execution resources. As shown in my example, you would always need at least 4 cycles to evaluate that equation, regardless of the resources you have, but if there is only one ALU, it will always take 12 cycles.

An absolute worst case scenario for ILP would be a number that needs to be incremented every cycle. It's why brute-forcing encryption is so inefficient. However, because you know what numbers the algorithm will generate, you could make multiple threads that start at an offset from each other, so that each available logical processor can process a smaller range of numbers. For the lowest possible throughput, I'd be thinking about generating a series of random numbers where each one is the seed for the next. There is some ILP there, but it's impossible to use multiple threads for such a workload.
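A rough sketch of those two cases in C (the LCG constants are just commonly used ones, and try_key stands in for whatever hypothetical test a brute-force would run on each candidate):

#include <stdint.h>

/* Worst case: each value is the seed for the next, so iteration i+1
   cannot start until iteration i has finished - no ILP across
   iterations and no sensible way to split the chain across threads. */
uint64_t chained_prng(uint64_t seed, long n)
{
    for (long i = 0; i < n; i++)
        seed = seed * 6364136223846793005ULL + 1442695040888963407ULL;
    return seed;
}

/* Brute-force counting: the values are known in advance, so each
   thread can simply be handed its own [start, end) slice. */
void count_slice(uint64_t start, uint64_t end)
{
    for (uint64_t key = start; key < end; key++) {
        /* try_key(key);  <- hypothetical per-candidate test */
        (void)key;
    }
}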
 
Computationally there is nothing more you can do, as you reduced the process to just two cycles and the second cycle is dependent on the first. In such a case, in its entirety, the algorithm can only be sped up by higher frequency.
That would only be true if your algorithm consisted exclusively of instructions that depend on the previous instruction's result. However, no useful algorithm exists in that sort of vacuum.

What does a typical algorithm contain? Memory address calculations for loading things from memory, some number of computations on inputs, possibly some more address calculations for store operations, some number of operations on the loop control and a conditional branch. Many of those have independent data dependencies, can happen concurrently with other stuff within the loop and also concurrently with speculatively executed iterations of that loop. Your computation does multiple sequential operations on EAX? That does not stop the CPU from having a head-start on calculating the write address for the eventual result (unless said address depends on EAX), on decrementing the loop counter, and on speculatively getting started on the next loop iteration, where it gets a whole new set of things to work on. While the retired program counter (how far the CPU is in terms of fully completed execution) may still technically be in iteration 0 of your algorithm, speculative OoO execution may have instructions from 10 iterations ahead already in-flight.
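A made-up loop along those lines (names are purely illustrative):

/* acc forms a serial dependency chain within one iteration, but the
   address arithmetic for in[i]/out[i], the i++ update and the loop
   compare/branch are independent of it, and branch prediction lets the
   core speculatively start iteration i+1 while iteration i is in flight. */
void scale_bias(const int *in, int *out, int n, int scale, int bias)
{
    for (int i = 0; i < n; i++) {
        int acc = in[i];    /* load: address depends only on i */
        acc = acc * scale;  /* depends on the load result */
        acc = acc + bias;   /* depends on the multiply */
        out[i] = acc;       /* store: address depends on i, not on acc */
    }
}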

Speculative out-of-order CPUs aren't limited to executing code in human-readable order. They execute instructions in whatever order dependencies get resolved in, looking as far ahead as their window allows to keep execution units as busy as they possibly can.
 
That would only be true if your algorithm consisted exclusively of instructions that depend on the previous instruction's result. However, no useful algorithm exists in that sort of vacuum.
Do you think that every piece of software, or at least most or a decent amount of it, has more ILP in it than what Coffee Lake already has the capacity to extract? All the typical benchmarks do, but what about normal software?
You can't extract more ILP out of software than what it has, no matter how much better the CPU is at extracting it; having 5 people sit down on 6 chairs will be just as fast as them sitting down on 8 chairs, it won't be any faster.
Going down in frequency in this case will slow that software down; we don't want to cut people in half or bring in people from a different group to increase ILP, we want the people to sit down faster.
 
Do you think that every piece of software, or at least most or a decent amount of it, has more ILP in it than what Coffee Lake already has the capacity to extract?
Yes, otherwise Intel wouldn't have bothered making the architecture wider in the first place. Making the architecture wider makes the scheduler more complex, uses more die space, uses more power, reduces attainable clock frequencies if the process does not improve enough to compensate, etc. For the extra complexity to be worth bothering with, it has to at least offset the costs.

As for your restaurant example, you have 320 customers waiting at the door. Chances are quite good you'll find people willing and able to sit on your three spare chairs on a regular basis even if those chairs only get to pick from the daily specials or appetizers.
 
Yes, otherwise Intel wouldn't have bothered making the architecture wider in the first place. Making the architecture wider makes the scheduler more complex, uses more die space, uses more power, reduces attainable clock frequencies if the process does not improve enough to compensate, etc. For the extra complexity to be worth bothering with, it has to at least offset the costs.

As for your restaurant example, you have 320 customers waiting at the door. Chances are quite good you'll find people willing and able to sit on your three spare chairs on a regular basis even if those chairs only get to pick from the daily specials or appetizers.
How did I ever suggest that it's not worth it or not doing anything?
Yes, with a lot of stuff running on your system it does improve throughput; I think I already agreed on this earlier in this topic.
I'm just aware that there are also programs and games out there that will see zero benefit from it.
The day that only 6 people show up, the other 314 of your seats will be completely useless, no matter if your place is full every other time.
 
The day that only 6 people show up, the other 314 of your seats will be completely useless, no matter if your place is full every other time.
Where the heck did you get those 314 seats from? There are only 10; the other 310 people are in the queue outside waiting for a seat.

Speculative OoO execution makes it nearly impossible for the CPU not to have multiple things to do most of the time no matter how poorly written and sequential your code might be.
 
When will we actually see 10th gen desktop CPUs in the stores, with Mobos that support them?

Newegg has the motherboards for sale, but with a stated release date of 5/20/2020.

They also have two of the new 10th-gen chips listed, the i9-10900K at $529 and the i5-10400 at $195. Both list a release date of 5/20/2020.