Skylake: Intel's Core i7-6700K And i5-6600K


The instruction scheduler needs to tie into:
1- the L1 data and instruction caches
2- the instruction decoder
3- the reservation unit
4- the instruction trace cache
5- the reorder queue
6- the retirement units
7- likely more

Since the scheduler ties into every critical stage of the execution pipeline, it is physically impossible to separate it from the CPU. Putting it off-chip would add so much latency that it would kill CPU performance outright. In a CPU operating at 4GHz, the scheduler has to reassess scheduling every clock cycle, i.e. every 250 picoseconds, far too fast for any sort of off-chip communication.

Even if you had a scheduler with infinite instruction look-ahead and infinite chip resources, the effective look-ahead would still be limited by data dependencies, conditional and indirect branches. This is a fundamental limitation of sequential software and there is very little left that hardware can do about it that has not already been done.
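To illustrate (a toy C fragment of my own, not from the original post): no amount of look-ahead lets the second line start before the first finishes, and the branch cannot be resolved until 'x' exists.

C:
/* Toy example of a data dependency plus a data-dependent branch. */
int f(int a, int b, int c)
{
    int x = a * b;   /* must complete first...                      */
    int y = x + c;   /* ...because this instruction depends on 'x'  */
    if (x > 100)     /* branch direction unknown until 'x' is known */
        return y * 2;
    return y;
}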
 
... Forgive me if this idea is just plain stupid (I'm not an expert on this), but what if Intel created almost like a co-processor in place of the instruction scheduler...

It's not that simple :)

Intel probably spends more money optimizing the 'front-end' of their CPUs than AMD makes in a year.

There is good and bad in this. Intel expends tremendous engineering and design efforts in node-shrinks, and is the 'undisputed champ' in big x86-64 cores. AMD has been 'financially-forced' to optimize on the same node for the last 3 years, and has turned their focus more toward 'mobile' and embedded units.

'Big Desktop' is not dying but has certainly plateaued. Much of Intel's energy and focus has changed, too. They are boosting their integrated graphics and improving efficiency during their node-shrinks.

AMD is doing the same thing in their much more limited fashion (using 28nm as an example). Their transistor density at 28nm SHP is not substantially different than Intel's transistor density at 14nm.

Node-shrinks are not a magic panacea. Logic gets closer together -- it gets hotter and less efficient ... so Intel works hard to increase their efficiency and gain a slight boost in performance.

WHAT NODE-SHRINKS REALLY DO: You get more chips off a wafer :) hence, higher profit.

The lower-watt smack-down between the 14nm Skylake chips and the 28nm SHP AMD Carrizo APU chips will be interesting this Fall. No doubt, Intel will maintain their CPU dominance, but the APUs (and their Puma+ cousins from TSMC) likely have a few tricks up their sleeves.

Native HEVC and 'Tonga' compression logic on those fully-enabled GCN cores for a start.

 
From what I've read about K10, its execution units were always underutilised - a case of far more hardware than was needed. This may have been what led AMD to believe it could cut down on hardware but make more efficient use of what was left. With Bulldozer, that would only become true once each core got its own decoder with Steamroller, that is.

K10's 128-bit registers led to an improvement in FP code over K8 which had to split such instructions in two, which is why I would personally expect Zen to improve over Steamroller (no idea about Excavator, but with dedicated units versus FlexFPU, surely it would?) in the same manner due to wider registers again. Additionally, as Zen is an SMT design, AMD will have beefed up the fetch hardware over previous designs to take advantage of it. Having SMT is better than not having it nowadays and does make a good difference in productivity workloads.

There's also the change to inclusive caches to consider. They've known for years why Bulldozer's cache design was so bad, and took baby steps to mitigate it thereafter within the confines of the design, so it stands to reason that they've improved in that area as well.

The point I'm trying to make here is that there's only so much you can do to fix a design before you throw it out of the window and start afresh. Bulldozer was supposed to be that fresh start, perhaps Zen will be after all. I wouldn't be too surprised if it was partly based on the Cat core design, either, albeit far more advanced and clocked higher to boot.
 
It's either going to have to be a design paradigm change, like RISC, and/or material science, I think. We're going to have to have something radical on the CPU front, like full-on gallium arsenide; none of this 'High-K', strained silicon junk, that's just minor improvements. They can only keep bolting on these architectural changes so long before we need material changes. I think Skylake is proving that. The IPC jumps and power savings curves are really flattening out.

I will say the new chipset seems awesome though, waiting to get a hold of that, and when DDR4 matures, I think Skylake will look a little better in the OC dept.



 
Games and rendering could use that type of power now. Look at it this way: let's say you have an environment that you are running around in. Now make that environment twice as long and twice as wide without changing height. You've already increased the area by 4x; now double the height and it's 8x. I for one would like to see larger and/or more detailed game environments for every game type. Flying around in BF x with more airspace would be nice. I have enough RAM, but I gotta crunch all those extra numbers then.



 
Yes, and what type of programs are intrinsically serial ... Games. While the AI can run around all it wants, realistically all this has to wait on the player input (unless it's aimbotting 😉 )



 
Remember the time when you got a 100% performance increase within the same technological platform? When going from a 486 DX25 to a DX50 gave you practically twice the performance? 7zip benchmark: 50 MHz = 16 MIPS, 25 MHz (extrapolated from 33 MHz results) = 8 MIPS! All this within the same year? That year was 1989!

My now decidedly old and venerable Sandy Bridge 2500K @ 4.7GHz has been chugging along quite nicely since it came out in Q1 2011, over 4 and a half years ago now! And it can *STILL* keep up just fine with the latest CPUs! I'm only within 2-3 FPS and a few seconds from them in nearly every benchmark! At this pace, I don't think I will seriously consider an upgrade for at least a few more years!

My rule of thumb to upgrade is to at least double the performance in every department (CPU/GPU raw power, memory bandwidth, bus bandwidth like DMI, SATA, USB, etc...). That's gonna take a while... Skylake just doubled DMI 2.0 bandwidth from 2 to 4GB/s with DMI 3.0, and USB went from 5 to 10Gb/s with the USB 3.1 spec, but that's about it as far as doubling is concerned! No doubling in PCIe 3.0, no doubling in CPU computational power and no doubling in RAM bandwidth.

Sources:
https://www.youtube.com/watch?v=oCGxaNdcbmI
https://www.youtube.com/watch?v=OkWYE4lwqdc
http://www.intel.com/pressroom/kits/quickreffam.htm#i486
 
Aspiring Techie, I thought the same. But latency, as InvalidError said, and other things I don't understand yet.
I think a solution might be in software: a compiler that can somehow optimize software so it's more parallelized, multithreaded, even though it's not coded that way. I read an article years ago about something like this, I think.

Awesome discussion guys. Almost nerdgasming.
"Invalid Error"<QUOTE>"Even if you had a scheduler with infinite instruction look-ahead and infinite chip resources, the effective look-ahead would still be limited by data dependencies, conditional and indirect branches. This is a fundamental limitation of sequential software and there is very little left that hardware can do about it that has not already been done. "</QUOTE>

This, I wish weren't true. Data dependencies I can see - data not copied to RAM yet (can a CPU request that, or is it up to the program, I think). Conditional and indirect branches - I don't fully understand why you can't crunch through them, keeping the CPU units fed. Like, say, compressing a 10GB file: the buffer size would just have to be big enough for the instruction scheduler to keep ahead? Will research
 

Getting memory content copied to cache ahead of time is called prefetching and Intel implemented this many CPU generations ago but prefetching is not 100% accurate and memory bandwidth is not always sufficient to keep up with the rate at which the CPU may need to access new memory locations. If the memory access pattern is non-linear, such as when dealing with trees or sparse arrays, then prefetching becomes mostly useless since the CPU has no clue which address is going to be accessed next.
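A rough sketch of the difference (my own example, not from the post): the array walk below has a perfectly predictable address stream that hardware prefetchers can run ahead of, while the linked-list walk only reveals the next address once the current node has arrived from memory.

C:
#include <stddef.h>

struct node { struct node *next; long value; };

/* Linear walk: sequential addresses, so the prefetchers can pull in
   cache lines well ahead of the loads. */
long sum_array(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Pointer chase: the next address is only known after the current node
   loads, so prefetching has almost nothing to work with. */
long sum_list(const struct node *p)
{
    long s = 0;
    for (; p != NULL; p = p->next)
        s += p->value;
    return s;
}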

As for the conditional or indirect jump, consider this:
switch (a) {
case 1: { /* do stuff */ } break;
case 2: { /* do something else */ } break;
case 3: { /* ... */ } break;
...
default: { /* whatever */ }
}

The CPU has no way to tell which direction code execution will take until 'a' is known, especially if 'a' follows no predetermined pattern, meaning that branch prediction will likely be wrong more than 50% of the time.

As far as compilers generating threaded code go, many have tried but none have succeeded in a convincing and reasonably convenient manner yet. Inferring thread-level parallelism out of linear code is extremely difficult. The most successful compiler optimization trick on modern CPUs is loop unrolling where the compiler generates static code for multiple repetitions of a loop with separately calculated iteration variables so the CPU has a few mostly independent copies of the loop to work on and that many fewer loop condition checks to worry about.
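Roughly what that looks like in C (a hand-written sketch of the transformation the compiler performs; assumes n is a multiple of 4 to keep it short):

C:
/* Plain loop: one condition check per element and a single dependency
   chain through 's'. */
long sum_plain(const long *a, long n)
{
    long s = 0;
    for (long i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by 4 with separate accumulators: a quarter of the loop checks,
   and four mostly independent chains for the out-of-order core to overlap.
   (Assumes n is a multiple of 4.) */
long sum_unrolled(const long *a, long n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (long i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}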

 


I think what you meant was:

do
{
    dwPtrToAddr = ReadProcessMemory...

    if (dwPtrToAddr != NULL)
    {
        switch (*(int*)dwPtrToAddr)
        {
            etc
        }
    }
} while (dwPtrToAddr == NULL);

Otherwise a switch statement doesn't use any more resources or memory; anything under a "case:" isn't performed unless the condition is true.

push ax
mov ax, [eax+4]
cmp ax, 0
je case1
...
call dword ptr [default]

etc, only one register is pushed onto the stack regardless.

Otherwise, it doesn't really make any sense. My mouse doesn't know which way I'm going to move until I move it. Obviously. There's no magic science to fix that; computers won't predict the future. It's more optimized to wait for the value of 'a' and then read from memory, because it's fewer operations; that's actually what you want.
 
What the heck, Intel? So, you provide great integrated graphics in Broadwell, then nerf it for Skylake? I guess you had to find a way to help sell your 'paper launch' of Broadwell. I really hope Zen makes you guys wake up; although it more than likely won't.

You guys crack me up with this Zen talk.

You do know it's comparable to Sandy, right?
 

I guess this means that we have gotten about as good with silicon as we can get. Shame. I guess that the future is in titanium trisulfide, which is a good alternative to graphene. This should enable us to hit much higher clock rates and such in the future. But that will eventually plateau too.

 

The point was not the resources used, it was that the branch cannot be executed until the value of 'a' is known, wherever 'a' is being produced or read from. If the computation of 'a' is tied up in data dependencies, then the branch is also tied up due to its data dependency on 'a' and execution stalls at the switch's branches.

As for what I had in mind when I used a switch as an indirect branch example, that would be because one possible compiler output for switch statements is a jump table where 'a' is used as the index, which would look something like this in ASM:
Code:
mov eax, a  ; wherever 'a' came from, 'a' has already been range-checked
shl eax, 2
mov eax, jumptable[eax]  ; use 'a' as the offset for the jump table
jmp eax
jumptable:
.dword case0
.dword case1
.dword case2
...
case0:
  ; do stuff
  jmp endcase
...
endcase:
Far more efficient than a cascade of cmp/jne for switches. Since the value of 'a' is used to access the jump table, the jump cannot proceed until 'a' is known. (Technically, the CPU can speculate on the branch but if multiple values of 'a' are similarly likely, the CPU's predictions will be wrong most of the time, resulting in speculative execution causing more wasted work than anything useful.)
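The C analogue of that jump table would be an array of function pointers indexed by 'a' (my own paraphrase, with made-up case handlers): the table load and the indirect call simply cannot start until 'a' has a value.

C:
#include <stdio.h>

/* Hypothetical case handlers, standing in for the case0/case1/case2
   labels in the assembly above. */
static void case0(void) { puts("case 0"); }
static void case1(void) { puts("case 1"); }
static void case2(void) { puts("case 2"); }

static void (*const jumptable[])(void) = { case0, case1, case2 };

/* 'a' is assumed to be range-checked already, as in the ASM version. */
void dispatch(int a)
{
    jumptable[a]();   /* indirect call: unresolvable until 'a' is known */
}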
 


So, you're suggesting AMD will nerf their iGPUs as well? That would be a serious mistake on their part.
 
It's a good start for Skylake. I am sure once they get better wafers you will see much better CPUs coming out, but for now my i7-4790K was a good purchase.
 


PCIe 3.0's bandwidth actually did double compared to PCIe 2.0, thanks to the faster 8GT/s signaling rate and the new 128b/130b encoding scheme vs PCIe 2.0's 8b/10b. PCIe 2.0 has a 20% loss just on the encoding side of it, while PCIe 3.0 only has roughly a 1.6% loss with its scheme.
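Back-of-the-envelope per-lane numbers (my own arithmetic from the published transfer rates, just to show where the doubling comes from):

C:
#include <stdio.h>

int main(void)
{
    /* Raw rate x encoding efficiency, per lane, per direction. */
    double pcie2 = 5.0e9 * (8.0 / 10.0);     /* 5 GT/s, 8b/10b    */
    double pcie3 = 8.0e9 * (128.0 / 130.0);  /* 8 GT/s, 128b/130b */

    printf("PCIe 2.0: %.2f Gb/s (~%.0f MB/s) per lane\n", pcie2 / 1e9, pcie2 / 8e6);
    printf("PCIe 3.0: %.2f Gb/s (~%.0f MB/s) per lane\n", pcie3 / 1e9, pcie3 / 8e6);
    return 0;
}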

That said, the biggest difference is that that was the early years of the x86 CPU. Hell, you listed 25MHz; now they have CPUs running 200x faster at stock than that.
 

Also, the 486 and older chips were strict in-order architectures that required multiple cycles to complete most instructions and could only work on a single instruction at a time while modern Intel CPUs can have up to 192 out-of-order instructions in-flight at various stages of completion and retire multiple instructions per clock.

There is a lot more to modern CPU performance than the ~1000X clock frequency increase since the first IBM PC's 4.77MHz 8088/8086.
 
This is a disappointment. I currently have a 3570K, which I purchased back in, what, May 2012 (39 months ago).
I mainly render 3D graphics as a HOBBY, making NO MONEY, and sometimes game. Skylake is only offering 2X the performance at best in rendering, and that's the i7 8-thread variant... Very, very poor. I thought CPU performance doubled every 18 months.
We should be looking at quadruple the performance for the same cost and power budget.
I am so waiting for AMD Zen. And I hope to god they lose those stupid useless on-board GPUs and replace them with cores.
Every gamer has a discrete graphics card.
Even in render engines the on-board GPU is useless, offering roughly the same GFLOPS as the CPU but with fewer features, and again, totally pointless when you have an HD7970 or something in your system.
 

That hasn't been the case in nearly 10 years. Between Nehalem (first-gen Core iX) and Skylake, we have about double the performance per core over a span of five years. The last major performance leap for both AMD and Intel was when they integrated the memory controller in their respective CPUs, slashing about half of the memory access latency off.
 
CPU releases this year seem to follow suit with the GPU releases for the most part, excluding some higher end GPU's of course. If you have decent hardware from 2012 or newer, upgrading doesn't seem to be worth it yet. Myself, I'm waiting until next year for both CPU's and GPU's. I think those will truly be interesting and may warrant a full system rebuild for me!
 
There seem to be Intel PR folk on here, or damage control. The fact is, CPU performance is supposed to double every 18 months. That is of course price/performance, so don't try to twist what I said by talking about single-core speed.

The iGPU is utterly irrelevant, and using it in DX12 to give a boost to the discrete GPU is a total pipe dream. I have render engines that try to use both for OpenCL simultaneously, and all it does is choke the HD7970 or whatever powerful GPU you have.
The other fact is that Intel holds a monopoly and is of course a for-profit corporation, and that means that Xeons with the real power and no useless iGPU taking up die space will always be way out of the budget of the millions of normal people who like to render things like game assets while making virtually no profit. Yeah, there was the 5960X, but guess what, there's also a 17-core Xeon; I'm sure high-end gamers would leap at the chance to buy a 16-core 5960X with no iGPU.
And yes, people really, really do need that kind of power, especially for rendering; it's still painfully slow on a Core i5 or i7.

I think we are either seeing Intel holding back because of their monopoly on the market, or the end of silicon, and we are waiting for a paradigm shift to one of the many new technologies, such as molecular 3D processing, quantum computing, graphene, optical light-speed processors or something else.

Another thing: Intel is going to see poor sales of Skylake, especially with such a small performance bump.
 


Do you have a source for your claim that manufacturers have some obligation to improve performance (or price/performance, even though it's not the same thing) every 18 months? I don't think the industry is quite that heavily regulated.
 

Look at AMD. They aren't doing any better. If Intel was really holding back, AMD should have no trouble catching up. Performance gains are slowing down because AMD and Intel are hitting fundamental limits and AMD is hit harder because they are still stuck on 28nm.
 