Skylake: Intel's Core i7-6700K And i5-6600K

Remember the time when you got a 100% performance increase within the same technological platform? When going from a 486 DX25 to a DX50 gave you practically twice the performance? 7zip benchmark: 50 MHz = 16 MIPS, 25 MHz (extrapolated from the 33 MHz results) = 8 MIPS! All that within a couple of years of the 486's 1989 debut!

My now decidedly old and venerable Sandy Bridge 2500K @ 4.7GHz has been chugging along quite nicely since it came out in Q1 2011, over four and a half years ago now! And it can *STILL* keep up just fine with the latest CPUs! I'm within 2-3 FPS and a few seconds of them in nearly every benchmark! At this pace, I don't think I will seriously consider an upgrade for at least a few more years!

My rule of thumb for upgrading is to at least double the performance in every department (CPU/GPU raw power, memory bandwidth, bus bandwidth like DMI, SATA, USB, etc.). That's gonna take a while... Skylake just about doubled DMI bandwidth from 2 to ~4 GB/s going from DMI 2.0 to DMI 3.0, and USB went from 5 to 10 Gb/s with the USB 3.1 spec, but that's about it as far as doubling is concerned! No doubling in PCIe 3.0, no doubling in CPU computational power and no doubling in RAM bandwidth.

Sources:
https://www.youtube.com/watch?v=oCGxaNdcbmI
https://www.youtube.com/watch?v=OkWYE4lwqdc
http://www.intel.com/pressroom/kits/quickreffam.htm#i486
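
For anyone wondering where the DMI numbers above come from: DMI 2.0 and 3.0 are essentially four-lane PCIe 2.0 and 3.0 links, so the usual lane-rate-times-encoding arithmetic applies. A quick sketch of that math in C (just the arithmetic, nothing Skylake-specific):

#include <stdio.h>

/* Per-direction link bandwidth = lanes * transfer rate * encoding efficiency.
   DMI 2.0 behaves like a PCIe 2.0 x4 link, DMI 3.0 like a PCIe 3.0 x4 link. */
static double link_gbytes(double lanes, double gt_per_s, double efficiency) {
    return lanes * gt_per_s * efficiency / 8.0;   /* bits -> bytes */
}

int main(void) {
    printf("DMI 2.0: %.2f GB/s\n", link_gbytes(4, 5.0, 8.0 / 10.0));    /* 2.00 */
    printf("DMI 3.0: %.2f GB/s\n", link_gbytes(4, 8.0, 128.0 / 130.0)); /* 3.94 */
    return 0;
}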

I am on an i5 2500K and I see no reason to upgrade yet. When I was buying this setup, due to a catastrophic failure of my i7 920, I was told by a high-end engineer that I would keep this CPU for years. He knew I am also a gamer, so I was surprised he said it. Seems he was right.
 
It will reach a point where Moore's Law, which has held quite steadily for a long time, will eventually break down. Eventually we will have to increase the size of the die to add more transistors. Transistors are made of atoms, each atom can only occupy so much space, and you need multiple atoms across a transistor for it to be stable. One atom of width would not do.
 


My comment about regulation was somewhat tongue-in-cheek. :) I knew the comment I quoted was referencing a common misconception about Moore's Law (that it has anything to do with performance).
 

Hiding so as to not create a huge reply pyramid.

Remember SPARC and POWER are RISC ISAs while x86 is CISC. Performance-wise there isn't any difference anymore, but there is still a very important architectural difference which influences CPU designs. RISC instructions are defined as single-cycle executions in machine code, while CISC instructions can require many cycles each. Now, a single compiled x86 opcode can do the same thing as several compiled RISC opcodes, so it's not quite a one-to-one comparison, but the RISC ISAs don't need a complicated instruction decoder and can get by with a much simpler instruction scheduler precisely because they don't need to do that translation / decoding. CISC CPUs are internally RISC CPUs with a front-end decoder that's deeply tied to the instruction scheduler. RISC CPUs still have the scheduler but don't need the decoder, and thus don't need as much silicon associated with it.

That is why they all seem simpler; it's not because they are inferior but because they just don't need to be more complex. Also, because they arrive in binary as single-cycle instructions, they are easier to cache and track. Of course nothing is free: you lose the ability to radically redesign your uArch because it would break your ISA. Intel and AMD have radically redesigned the insides of their chips on multiple occasions, but because there is a front-end decoder that does the translation work, all legacy code executes just fine and nobody needs to rebuild compilers or operating systems.

As for which chips have the most power, both the UltraSPARC and POWER chips will easily beat the Intel ones on a socket-vs-socket basis. They are both optimized for extremely high throughput on ridiculously large working sets. The UltraSPARC (POWER is a different beast altogether) core is designed to predict when a thread will stall waiting on I/O and execute another thread during that period. It's about juggling many concurrent threads on each core, like what you would experience in a webhosting server, a large database server, modeling work, or other "enterprisey" workloads. You most certainly wouldn't want to run a single-threaded game on it. The downside is they are also stupidly more expensive. With virtual datacenters and converged fabric now standard, it's almost always more cost effective to purchase multiple cheap x86 servers than a few UltraSPARC / POWER servers, unless someone has a very specific requirement. Powerful x86 servers can be had for 10~15K USD each while a decent UltraSPARC system is going to run 40~60K USD. And that's before the warranty and support contract.

You were right, it's the T4/T5 that had out-of-order execution; it helped in single-threaded tasks, which were their previous weakness, but didn't do much for wide multi-threaded tasks.

(Image: Oracle SPARC T4 block diagram)


Some more detailed information

http://www.oracle.com/technetwork/server-storage/sun-sparc-enterprise/documentation/o13-024-sparc-t5-architecture-1920540.pdf

A guide on how to do parallel programming on SPARC; most of it applies to other ISAs as well.

http://www.oracle.com/technetwork/server-storage/sun-sparc-enterprise/documentation/o13-060-t5-multicore-using-threads-1999179.pdf
 

The distinction between RISC and CISC is largely moot these days since x86 CPUs put their instruction decoders at the top of their execution pipeline to turn x86 code into RISC-like micro-ops and it is those micro-ops that end up in the reorder buffer. From this point of view, the execution pipeline of modern x86 CPUs can be considered closer to pure RISC than many RISC chips.

Now, as you said, a pure RISC CPU would need more instructions to perform the same task as an x86 CPU and if a RISC CPU tried to achieve the same average throughput per clock as Skylake, I bet the RISC scheduler would need to be quite beefy to dredge up the extra instruction or two per clock needed to keep up the pace.

Intel makes tiny x86 schedulers too: look at Atom and Xeon Phi. Scheduler complexity is directly related to the amount of ILP you want to extract from the instruction stream. When the CPU lacks execution units to achieve high ILP (Atom) or does not care about ILP because it is optimized for TLP (Phi), scheduling becomes much simpler.

BTW, in modern pipelined RISC CPUs, NOP is probably the only true single-cycle instruction. Everything else typically requires at least three ticks: one tick to move data from the register file to the execution unit input, one cycle through the ALU and one more cycle to store the result back into a register and/or present it at the input of another execution unit if a dependent instruction is scheduled there next.
 

This doesn't make any sense. Because RISC CPUs do not need to convert macro-ops into micro-ops, with very few exceptions, they don't need nearly as complex a front-end decoder. The decoder on a RISC CPU is primarily responsible for determining valid instructions, doing assignments, looking up references to the register windows and so forth; it's nowhere near as beefy as what Intel bolts onto their CPUs. This in turn means the scheduler can be less complex: it doesn't need to deal with an external x86 macro-op being five to seven micro-ops and track them as such. There are pros and cons to this approach, with the final performance outcome being the same. Neither is inherently better than the other, but because of the different design philosophies you need to optimize processor resources differently. CISC designs require very strong front-end decoders and schedulers in order to achieve high throughput, while RISC designs require very powerful I/O subsystems to do the same thing. The link I provided from Oracle explains what each component is doing.

Virtually all RISC instructions take one clock cycle to execute. This is because they do not use complex referencing or, with very few exceptions, allow compound instructions. Each ISA strays from this a little bit, with SPARC being the most militant in its adherence. Once you start adding complex pipelining this number becomes a bit fuzzy: the RISC instruction still takes exactly one cycle to execute, there is just a lot of preparatory work done ahead of time to maximize the number of executions that can happen while avoiding stalls. CISC, on the other hand, can take many cycles, depending on the instruction, division being the worst. The difference is that a single CISC instruction may have to be represented by three or more RISC instructions, so as I said it's not a one-to-one comparison. The one-cycle execution time is important to designing the predictor and scheduler; fixed execution times allow for simpler timing than variable execution times.

Take a simple MULT instruction that multiplies two numbers and stores the result in the location of the first variable.

In x86 pseudo-code, it would look like this:

MULT Variable_A Variable_B

That's a single instruction in machine language.

In RISC pseudo-code, it would look like this:

LOAD  Register_A, Variable_A
LOAD  Register_B, Variable_B
PROD  Register_A, Register_B
STORE Variable_A, Register_A

And depending on the ISA there might be a requirement to first look up the memory location of the variables instead of referencing them directly. And remember this is machine language, so four to five separate instructions going into the instruction queue.

The top machine instruction would first need to be converted into something that looks much like the bottom, but done entirely inside the CPU; then each of those micro-ops would be ordered, scheduled, executed, put back together and retired, then another macro-op fetched (in reality several are fetched simultaneously). The bottom set would just be scheduled, possibly reordered, and executed individually, with no need to be reassembled since they are already in the native language and part of a stream of other instructions. The single CISC instruction takes multiple cycles to execute because it's complex; each of those RISC instructions takes one cycle to execute. You get about the same total execution time but the scheduling is done very differently.
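
To make the memory-operand difference concrete, here is a small sketch in C with GCC extended inline assembly, assuming an x86-64 GCC/Clang toolchain (the variable names are made up for illustration): the single imul reads one operand straight from memory, which is exactly what a strict load/store ISA would split into separate load, multiply and store instructions.

#include <stdint.h>
#include <stdio.h>

int main(void) {
    int64_t variable_a = 6, variable_b = 7;

    /* x86-64 two-operand multiply: the source operand is a memory reference,
       so this one instruction implies a load plus a multiply. On a load/store
       ISA the compiler would have to emit LOAD, LOAD, MUL, STORE instead. */
    __asm__ ("imulq %[b], %[a]"
             : [a] "+r" (variable_a)
             : [b] "m"  (variable_b));

    printf("%lld\n", (long long)variable_a);   /* prints 42 */
    return 0;
}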

Two different ways of accomplishing the same task, each with different pros and cons. RISC is heavily limited by memory I/O due to the raw number of instructions flying across (instruction traffic, not data); CISC is limited by prediction and caching. If you make a mistake predicting a RISC instruction it's not as big a performance penalty as mispredicting a CISC instruction. RISC relies a lot more on the compiler not screwing up and producing spaghetti code. This is why Intel invested so much engineering effort into their magic branch predictor and caching technology. They have, by far, the absolute best caching and prediction logic on the planet. It's responsible for their performance lead in single-threaded computing.
 


I used to do some ASM coding on x86 back in the day. I even did some on RISC chips, but it's been 20 years. What you described as the CISC instruction is just what a high-level language makes it look like, and that is the case for either type of chip. What you showed for the RISC is how both looked when you wrote ASM for either type of CPU. The difference is, the CISC CPU just has more instructions, some of which could do more specialized things, but you still loaded values into registers, called an instruction, then stored them when you were done manipulating them. Some CISC instructions were limited to specialized registers as well.

I'm pretty sure what InvalidError is talking about is simply that those more specialized instructions that the RISC CPU doesn't have are run as a list of RISC-like operations. Basically the core of the CPU and the instructions it handles are pretty much the same; only the CISC chip has a larger list of instructions, some of which are higher level, and those higher-level instructions are just performed with a series of simple core instructions, similar to the ones a RISC CPU has. Whereas many years ago those specialized instructions had special units to handle them, now they simply get converted into simpler instructions.
 

Show me even one modern RISC architecture running at more than 1GHz that achieves single-cycle all-inclusive instruction execution from register read to result commit.

I will spare you the trouble: you won't find any, because all modern high-speed CPUs are pipelined and pipelining typically introduces at least three cycles of latency even for the most basic ALU operations. It takes at least one cycle to select registers and route their values to the ALUs' inputs, at least one cycle for the ALUs to produce the result, and at least one cycle to store the result back into the register file. So we are talking about a minimum execution latency of three cycles, or two if we assume the scheduler does a perfect job scheduling instructions so intermediate results are always consumed on the same clock cycle they are produced, avoiding the extra latency of writing results back to the register file before they can be used again.

It seems like you may have confused the ability of pipelined ALUs to produce one result per cycle with an execution time of one cycle, and confused moving data that is already in the register file from the register file to the ALU with moving data from wherever into the register file.
 
I used to do some ASM coding on x86 back in the day. I even did some on RISC chips, but it's been 20 years. What you described as the CISC instruction is just what a high-level language makes it look like, and that is the case for either type of chip. What you showed for the RISC is how both looked when you wrote ASM for either type of CPU. The difference is, the CISC CPU just has more instructions, some of which could do more specialized things, but you still loaded values into registers, called an instruction, then stored them when you were done manipulating them. Some CISC instructions were limited to specialized registers as well.

I'm pretty sure what InvalidError is talking about is simply that those more specialized instructions that the RISC CPU doesn't have are run as a list of RISC-like operations. Basically the core of the CPU and the instructions it handles are pretty much the same; only the CISC chip has a larger list of instructions, some of which are higher level, and those higher-level instructions are just performed with a series of simple core instructions, similar to the ones a RISC CPU has. Whereas many years ago those specialized instructions had special units to handle them, now they simply get converted into simpler instructions.

Umm... that is not a high-level language, that's the x86 ASM code for multiply, opcode F6 or F7.

http://x86.renejeschke.de/html/file_module_x86_id_210.html

Here is the rest of the x86 ISA

http://x86.renejeschke.de/

RISC ISAs can have more instructions than CISC ISAs; the "complex" part isn't a reference to the number of instructions but to whether there are complex memory access modes, i.e. whether an execution instruction can reference a memory address. In CISC the instruction can reference a memory address instead of a register, and sometimes that memory address can itself reference another memory address that holds the data. This complexity results in variable execution time because it's difficult to predict at compile time how many steps the CPU will have to take to get the data into the registers for execution. RISC, on the other hand, requires the compiled code to explicitly load and store data to and from registers prior to executing instructions on that data, so the CPU doesn't have to interpret the instructions.

ASM for the 80386 was my very first programming language; it was hard, but I learned to see computing from the CPU's point of view. The real difference between RISC and CISC these days is where the code optimization takes place. CISC (x86 really) does the code optimization inside the CPU, while RISC (ARM / SPARC / PPC / MIPS / IA64) explicitly expects the compiler to have done it. The effect of this difference is in how the hardware engineers design the scheduling of instructions. In an x86 CPU the front-end decoder is tightly integrated with the scheduler and predictor in an attempt to maximize instruction throughput, while in the RISC examples there is less need for that since the code is already broken down into load / store statements and the CPU doesn't need to translate memory addresses into data. Since RISC instructions are fixed size, memory access is always aligned on boundaries and most instructions have single-cycle execution times, they are easy to schedule, and even Intel uses a RISC-like uArch on the inside of their CPUs. Unfortunately the compiled machine language being fed into those same CPUs is still x86 and thus still requires the instructions to be decoded, turned into multiple micro-ops, scheduled and then retired with the results an x86 CPU would produce, so there is a need for a complex high-performance decoder / scheduler. CISC needs to be translated to be pipelined while RISC can be pipelined natively.

Basically the modern x86 CPU is a giant 80486 emulator with 64-bit instructions tacked on. The SIMD SSE FPU is treated like an entirely separate vector co-processor with its own register sets and language.
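
A toy way to see the fixed-vs-variable-length point above in C (the byte lengths below are made up for illustration, not real x86 encodings): with a fixed four-byte encoding the start of any instruction is simple arithmetic, while with variable-length instructions you have to walk every earlier length just to find the boundary, which is part of why the x86 decoder is such a serious piece of hardware.

#include <stdio.h>
#include <stddef.h>

/* Fixed-size instructions: the start of instruction n is just 4 * n,
   so many instructions can be fetched and decoded in parallel. */
static size_t fixed_offset(size_t n) { return 4 * n; }

/* Variable-length instructions (x86 is 1-15 bytes): finding the start of
   instruction n has a serial dependency on every prior instruction's length. */
static size_t variable_offset(const unsigned char *len, size_t n) {
    size_t off = 0;
    for (size_t i = 0; i < n; i++)
        off += len[i];
    return off;
}

int main(void) {
    unsigned char lengths[] = { 1, 3, 7, 2, 5, 1, 4, 6 };  /* hypothetical */
    printf("fixed-length ISA, instruction 5 starts at byte %zu\n",
           fixed_offset(5));
    printf("variable-length ISA, instruction 5 starts at byte %zu\n",
           variable_offset(lengths, 5));
    return 0;
}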
 


*Slams head on desk*

I'm beginning to think I might be talking over people's heads.

Every modern CPU takes exactly one cycle to execute the following

MOV r1, 06 (move the value of the integer 06 into register 1)

or this

ADD r1, r2 (add the contents of register 2 to the contents of register 1 and store the result in register 1)

This is binary math; it happens in a single cycle. It's been happening in a single cycle since the 70's. Doing binary math in single cycles just requires you to have the transistors set up to do it. ASICs demonstrate how you can do this. Virtually any operation can be done this way, though the more complex it gets, the more transistors you need to dedicate to that single instruction type.

Everything you just stated was the loads and stores PRIOR to that add instruction or after it. In the RISC world those are different, separate instructions: LOAD the data into the registers (1 cycle), EXECUTE the instruction (1 cycle), STORE the result somewhere useful (1 cycle). It's called a LOAD / STORE architecture for a reason; loads and stores aren't combined with the execution instruction.

Furthermore, pipelining adds latency but not execution time. If the instruction takes one cycle to execute, then when it gets its turn it will still take one cycle to execute, even if it's been sitting in a pipeline for six cycles. The difference is that for those six cycles the instruction was sitting in the pipeline, you were executing other instructions. It's actually not very efficient to pipeline variable-execution-time instructions; it's so inefficient that instead of trying, Intel decided to build dedicated hardware that converts its variable-length instructions into fixed-length instructions and then pipelines those. And thus their decoder was built. The importance of all this is to understand that having fixed-length, fixed-execution-time instructions in binary machine language means the CPU manufacturer doesn't need to build complex translation hardware into silicon to turn variable-length, variable-execution-time instructions into fixed-length, fixed-execution-time ones.

I don't know how else to describe this level of micro architecture. It feels like I'm describing the color of grass and people are instead arguing about tree leaves.
 


I think you are just misunderstanding what I have written, or meant to write. If you read the information on the instruction you linked here: http://x86.renejeschke.de/html/file_module_x86_id_210.html you'll see they still load those values into a register, multiply them, and send the result to a register. The actual code may have changed a bit since I coded, or I just didn't recognize it. Either way, that instruction still did all the register stuff like RISC would.

Complex instruction set computing (CISC /ˈsɪsk/) is a processor design where single instructions can execute several low-level operations (such as a load from memory, an arithmetic operation, and a memory store) or are capable of multi-step operations or addressing modes within single instructions.

This is what I'm talking about. A CISC design offers complex instructions, or what I described as higher-level instructions: they can perform a series of operations in a single instruction.

What makes something CISC or RISC is not set in stone. The rules are more of a concept, and the same chips sometimes get labeled both ways depending on who is talking about it.

The point being, the instructions offered don't necessarily mirror how they are actually performed internally. Just because higher-level instructions are available doesn't mean that internally they aren't broken down into their simplest instructions, similar to a RISC processor.
 
Either things have radically changed since I was taught and coded in ASM, or you are misunderstanding what CISC and RISC are. If you read the information on the instruction you linked here: http://x86.renejeschke.de/html/file_module_x86_id_210.h... you'll see they still load those values into a register, multiply them, and send the result to a register.

The CPU is capable of executing those instructions with a memory address being fed into it instead of a hard register; it's part of the ISA. Standard practice is to use MOVs like LOADs / STOREs and to try to do all executions directly on registers, but that is practice, not something the CPU requires you to do. And because the ISA says it must be able to do that, the hardware logic must therefore be capable of doing it.

What makes something CISC or RISC is not set in stone. The rules are more of a concept, and the same chips sometimes get labeled both ways depending on who is talking about it.

In actuality there is no such thing as "CISC". In the early CPU days all instructions were coded directly into transistors. As CPUs became more complex, supporting all those complex instructions in silicon was getting extremely expensive. A design philosophy came about as a solution: instead of trying to cram every instruction imaginable into hardware silicon, reduce the amount of silicon required by standardizing instructions and using multiple simple instructions instead of one complex one. Reduced Instruction Set Computing (RISC) doesn't mean the ISA has fewer opcodes, but rather that there is less silicon required for executing those same opcodes. Instead of trying to include the data access with the instruction, we treat them separately and thus simplify the processor design.

RISC is a design philosophy, a set of rules that should be followed whenever possible:
Instructions should be kept simple, with no more than one memory access per instruction.
Instructions should execute within a single cycle whenever possible.
Memory access should be simple and uniform, with no complex memory segmentation addressing.
A load / store architecture should be used.
Focus should be on execution via strings of smaller instructions rather than a few big instructions.

There are more, but no CPU design is entirely 100% "RISC"; the closest is the SPARC line and even they break the rules on a few things. The idea of "CISC" came about as a way to describe CPU designs that didn't follow the RISC principles and thus were not-RISC. There were several back in the '80s, but only x86 still survives, and only in an ephemeral sense since the CPU itself is emulating the x86 ISA in hardware.

This kind of explains the philosophy vs. the features of "RISC":

http://www.inf.fu-berlin.de/lehre/WS94/RA/RISC-9.html

The reason for this whole discussion is that because Intel's CPUs need to emulate the x86 ISA, which didn't follow the RISC design principles, there is a prerequisite for them to have an advanced, high-performance emulator / translator bolted onto the CPU rather than just letting it accept code in its own native language. CPUs whose ISAs were developed to follow those RISC principles don't need that front-end emulator, but must instead deal with a higher number of incoming instructions. This gets into different methods of doing scheduling, prediction and caching.
 

Data movement between the register files and ALUs (what I was writing about) has absolutely nothing to do with load/store (memory) instructions which are used to move data between memory and registers.
 
not grabbing the entire quote stream, just...

"..*Slams head on desk*

I'm beginning to think I might be talking over peoples heads.

Every modern CPU takes exactly one cycle to execute the following

MOV r1, 06 (move the value of integer 06 to the register 1)

or this

ADD r1, r2 (add the contents of register 2 to the contents of register 1 and store the result in register1).."

Actually these types of instructions often take 0 cycles; they are typically overlapped with the execution of something that takes a while. It's really hard to look at a pipelined processor and assign cycles to each instruction.

I'm not sure if I get the point you are making, but "...Every modern CPU takes exactly one cycle..." is a really hard claim to make.

Aside,
Consider these two instructions, one after the other
MOV r1, 06
ADD r1, r2

Then consider the following two, which look similar, but the value of r1 is not changing immediately before an instruction that uses it.
MOV r3,06
ADD r1,r2

The second pair of instructions is apt to be faster since the add is not dependent on the results of the MOV.

aside: the hardware engineers cheat. At any point in time there are apt to be several r1 values in flight, some of them set by branches that are not actually taken. The processor only has to be correct when something happens that lets you examine the results. That's why, if you trace them, certain instructions will never get an external interrupt (timer pop): the CPU is choosing not to let you see that in that code sequence it always executes the later instruction before a prior instruction, so it cannot be interrupted between the two.
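
Here is a toy model in C of the point being argued, under an assumed (made up) result latency of three cycles: a stream of independent instructions still retires about one per cycle, while a fully dependent chain like the first MOV/ADD pair above pays the full latency for every instruction. It is only a sketch of the scheduling arithmetic, not a model of any real core.

#include <stdio.h>

/* Toy model: one instruction can enter the execute pipeline per cycle
   (throughput = 1 per cycle), but its result only becomes usable LATENCY
   cycles after it enters (register read -> ALU -> write-back/bypass).
   LATENCY = 3 is an assumed number for illustration, not any real CPU's. */
enum { LATENCY = 3, N = 10 };

int main(void) {
    /* Independent instructions: each can enter one cycle after the previous,
       so entry cycles are 0,1,2,... and the last result appears at N-1+LATENCY. */
    int last_ready_independent = (N - 1) + LATENCY;

    /* Fully dependent chain: each instruction must wait for the previous result. */
    int enter = 0, ready = 0;
    for (int i = 0; i < N; i++) {
        enter = ready;              /* can't start before its input exists */
        ready = enter + LATENCY;    /* result usable LATENCY cycles later  */
    }

    printf("%d independent instructions: last result at cycle %d (~1 per cycle)\n",
           N, last_ready_independent);
    printf("%d dependent instructions:   last result at cycle %d (~%d per instr)\n",
           N, ready, LATENCY);
    return 0;
}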


 
Actually these types of instructions often take 0 cycles; they are typically overlapped with the execution of something that takes a while. It's really hard to look at a pipelined processor and assign cycles to each instruction.

They must take some time to execute: a signal must be passed through a set of transistors which read the input bits and output a result of those bits; this is binary math.

Every instruction has an execution time, otherwise prediction, pipelining and caching would be impossible. Variable-execution-time instructions, meaning any instruction whose execution time can't be known until right before it's executed, are extremely difficult to pipeline because you can't get a valid prediction on when it will be scheduled and when the execution resources will be free for another instruction. Fixed-execution-time instructions, which always take an exact amount of time to execute, typically one cycle, are much easier to schedule around. From the CPU's point of view everything is measured in cycles, and when it does look-ahead and prefetch, it is counting cycles and scheduling based on that future time, again measured in cycles. Native x86 has many variable-execution-time instructions, leftovers from its 1980's 8088 / 8086 / 80286 roots. Because those instructions are still valid, the CPU must process them correctly and assume they can happen at any time. This makes pipelining x86 natively virtually impossible without a huge, expensive die. So instead we emulate the x86 externally and translate it into single-cycle RISC-like micro-ops.

This is how a CPU sees the world: loading an instruction stream, it's at instruction #48. The CPU sees it has six instructions in the pipeline, and since those are all one-cycle instructions, it knows it has six cycles before instruction #48 is scheduled to be executed. And because instruction #48 is itself a single-cycle instruction, it knows it will be seven cycles before instruction #48 has finished executing. Using these precise measurements it can then schedule to have all the data required by all those instructions ready inside the L1D, or even already loaded into a register file and ready for rename. Now in actuality it won't be exactly seven cycles, the exact number depends on many things, but the point is that because all the instructions have fixed execution times, the scheduler, predictor and cache subsystem can all do their jobs. With variable-execution-time instructions this becomes impossible because the CPU can't see more than one or two instructions out and thus can't pipeline and schedule very efficiently. Intel x86 is an ISA with many variable-execution instructions.

And yes, if you have multiple processing resources then your pipelined system can get really fancy and do concurrent execution. And again, that's not really possible unless those instructions are all uniform in length and execution time. Small bites are far easier and safer to work with than large ones.

Data movement between the register files and ALUs (what I was writing about) has absolutely nothing to do with load/store (memory) instructions which are used to move data between memory and registers.

That happens before the instruction hits the execution unit and is part of the pipeline; it has absolutely nothing to do with execution time. I think you're really not getting the difference between execution time and execution latency. One is set by the ISA and the other varies based on what the system is doing at any moment in time. The latency doesn't matter from an engineering point of view, it's just a number in a queue and doesn't affect the overall uArch. The execution time, on the other hand, is a key parameter that determines how effective your scheduling and pipelining will be. It's almost always one cycle in RISC designs because RISC instructions are limited to one memory operation per instruction, so all data will already be present before the instruction executes in hardware transistors. Complex instructions that require multiple memory operations will have execution times longer than one cycle, and the exact time won't be known until immediately before execution, which invalidates any prefetching or scheduling you had done beforehand.
 

What is the effective execution rate of:
INC EAX
INC EAX
INC EAX
INC EAX
...
?

Two cycles per instruction due to the extra cycle it takes to route the result of one ALU to the input of the ALU handling the next INC. It is not the time spent in the ALU that matters, it is the total time between the instruction being issued (presented at the ALU) and the result being usable as an input for other instructions.
 


One cycle exactly per instruction. Each one just adds one to the previous value, so you don't even need to move data out of the register into another one.

And now you're just ignoring what I write and inventing your own descriptions. You can't redefine terms to make your invalid argument correct. Execution time is the time it takes to execute a specific instruction in transistors; it's a well-known term used for designing integrated logic. You keep trying to make it about instruction latency and then hand-wave to cover the fact you're using one term in place of another. ALUs are not magic boxes with little leprechauns inside, they are just a bunch of dedicated transistor logic. Every instruction type they handle has a separate circuit dedicated to that instruction type: ADD gets one circuit, SUB gets another, MUL gets its own, and the logic compares each get their own. You load your data signals on the input pins while also setting the control pins to select the instruction type, a cycle happens and *BAM*, the transistors do their work and the result is loaded onto the output pins. This is why different instruction types are often assigned to different ALUs; the ALUs themselves are not interchangeable. Two of the same instruction type will be assigned to the same ALU, especially if they are concurrent like your example.

(Image: Haswell block diagram)


"Sending" and instruction to an ALU isn't a data copy, it's just loading the voltage on the appropriate lines / pins during the next clock cycle. Data copies and moves aren't even handled by the ALU but instead by the MMU which, like the ALU, is a bunch of dedicated transistor logic for those specific instructions. Pipelines have absolutely zero to do with instruction execution times, they are just ways to organize and optimized the process of instructions in order to minimize stalls. With a 1 stage, 4 stage or 20 stage pipeline, an ADD, MOV, MUL or INC instruction will have the exact same execution time regardless. The latencies might vary, buy the execution time is static because otherwise it's impossible to schedule.

I keep telling you this is important because if an instruction has the possibility of a variable execution time, meaning it could take multiple transactions on those dedicated logic circuits, it becomes nearly impossible to pipeline without creating dedicated logic circuits for every permutation of that instruction.

Anyhow, this is just you trying to distract and obfuscate. You said no instruction in a modern CPU has a one-cycle execution time; this is patently false. It's a key concept in all new CPU designs, and there is a plethora of technical documentation supporting and defining it. Its definition is standardized and not open to reinterpretation by random people on internet forums. This standard definition is critically important when building the instruction schedulers that are the glue holding all our shiny CPUs together.
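
Rather than argue definitions, one could just measure the INC chain. Below is a rough C sketch for an x86-64 box with GCC, using the timestamp counter; the iteration count is arbitrary, loop overhead is ignored, and TSC ticks are not core cycles once turbo kicks in, so treat whatever number it prints as a ballpark figure rather than proof for either side.

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc() */

#define ITERS 100000000ULL

int main(void) {
    uint64_t x = 0;

    uint64_t start = __rdtsc();
    for (uint64_t i = 0; i < ITERS; i++) {
        /* four dependent increments: each one needs the previous result */
        __asm__ volatile ("inc %0\n\tinc %0\n\tinc %0\n\tinc %0" : "+r" (x));
    }
    uint64_t end = __rdtsc();

    printf("dependent chain: %.2f TSC ticks per inc (x=%llu)\n",
           (double)(end - start) / (ITERS * 4), (unsigned long long)x);
    return 0;
}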
 

And if you write hard-realtime code, you have to assume the worst possible latency penalty on top of the execution time because you cannot guarantee that the operation result will be available any sooner than that.

And all instructions need to retire / commit their result to wherever they need to go.
 
Anyway, aside from the machine code talk, as a consumer I am seriously considering purchasing the 6700K. I find Skylake to be a solid performance increase. You'll notice, for instance, that in Blender it improved a lot over the 4790K. It seems to be generally a 10%-15% improvement, and I think people undervalue those percentages. Yes, in the past it used to be more, but this is the present, and for an already-so-great 4790K CPU, the 6700K is even better, so why complain?

It also gets the benefits of DDR4 and Z170.
 


That's irrelevant to the discussion at hand.

This is purely about the differences between RISC and not-RISC (x86) and how those differences dictate how important the front-end decoder and scheduler are. You brought up how Intel's x86 front-end scheduler is so much beefier than the ones on the RISC big-iron chips. I said that because of the nature of RISC there is less need for such robust scheduling logic. One of those benefits is that most RISC instructions are fixed length with one-cycle execution times and thus are easier for the scheduler to handle and predict, whereas because native x86 is variable length with variable execution times, Intel's scheduler needs to do CPU jujitsu to get them pipelined. You got offended that there might possibly be a benefit of RISC over not-RISC (x86) and thus started a debate where you attempted to "win" by changing the definitions of the words used while giving ever shorter responses.

There is a good graphic and explanation of the classic RISC architecture used as a reference below. No actual CPU uses this; instead it's the basis for all modern RISC chips.

https://en.wikipedia.org/wiki/Classic_RISC_pipeline

The Execute stage is where the actual computation occurs. Typically this stage consists of an Arithmetic and Logic Unit, and also a bit shifter. It may also include a multiple cycle multiplier and divider.

The Arithmetic and Logic Unit is responsible for performing boolean operations (and, or, not, nand, nor, xor, xnor) and also for performing integer addition and subtraction. Besides the result, the ALU typically provides status bits such as whether or not the result was 0, or if an overflow occurred.

The bit shifter is responsible for shift and rotations.

Instructions on these simple RISC machines can be divided into three latency classes according to the type of the operation:

Register-Register Operation (Single-cycle latency): Add, subtract, compare, and logical operations. During the execute stage, the two arguments were fed to a simple ALU, which generated the result by the end of the execute stage.

Memory Reference (Two-cycle latency). All loads from memory. During the execute stage, the ALU added the two arguments (a register and a constant offset) to produce a virtual address by the end of the cycle.

Multi-cycle Instructions (Many cycle latency). Integer multiply and divide and all floating-point operations. During the execute stage, the operands to these operations were fed to the multi-cycle multiply/divide unit. The rest of the pipeline was free to continue execution while the multiply/divide unit did its work. To avoid complicating the writeback stage and issue logic, multicycle instructions wrote their results to a separate set of registers.

Here is an older MIPS pipeline

http://www.sdsc.edu/~allans/cs141/L2.ISA.pdf

It outlines the various stages; execution is a separate stage from decode and fetch. The decode and fetch stages would have already loaded the data into registers and onto the appropriate input lines / pins of the CPU resource being used (ALU / MMU / FPU / etc.). There is no hidden pause in the middle of the execute stage where those pins are loaded. The reading of the results is done in a subsequent stage; it's not part of the execution stage. The ALU doesn't actually operate on registers; instead the contents of the registers are read and the appropriate lines have a voltage put on them during the previous stage.

http://www.agner.org/optimize/microarchitecture.pdf

Page 120 starts the section on Sandy Bridge and the newer Intel CPU designs; page 127 has the table of instruction latencies at the execution stage, which are their execution times. An ADD uOP (RISC-like instruction) takes one clock cycle to execute, a MOV uOP (load / store) takes one cycle, a Boolean logic / compare uOP takes one clock cycle, a MUL uOP takes three or five cycles depending on how it's generated, a shift takes one cycle, a jump takes one cycle, and a floating-point division takes 10~22 cycles. Haswell is on page 135 and similarly has instruction latencies at the execution stage listed. Lots of one-cycle instructions there.

Those are all micro-ops, which are the RISC-like language that the front-end decoder converts the x86 machine code into. One x86 instruction can generate several uOPs depending on how it's coded, and it's that variance that requires such a beefy scheduler with a ninja-like decoder. They will actually convert from x86 to uOPs and then attempt to fuse or reorder uOPs ahead of time in order to reduce the total latency. Better yet, there is even a zero-"execution time" instruction: an instruction that copies data from one register to another is handled at the rename stage and never even hits the execution stage, though the rename stage still takes one cycle.
 
Darnit...how am I sposed to mod you two with all you're fancy citified book learned programmin' talk. That there fancy stuff ain't ebun english !!



 
IMO people are complaining because they use their PC to surf the web and play some games. Most people don't use a lot of the apps that are benefiting, therefore it's underwhelming. As far as DDR4 goes, the price is still at a premium and its performance isn't any better than DDR3 right now, or it's worse. I agree that if and when speeds and latencies improve as DDR4 matures, it'll be more worth it.



 
DDR4 is not worse. Divide the CAS latency by the frequency and you will find it is smaller than that of DDR3 RAM with its lower frequency. Anyone paying $250+ for the 6600K or $350 for the 6700K, well, I don't see why they can't squeeze in a 750 Ti and buy a Haswell i5-4460.

Edit: Unless DDR4 has a higher response time, I don't see how it's worse, even if small improvement.
 
Edit: Unless DDR4 has a higher response time, I don't see how it's worse, even if small improvement.

It actually has the same real latency as DDR3 and, before that, DDR2. There is a ~7 ns practical limit on getting data back from a DRAM chip. Higher quality silicon can go a bit lower, lower quality silicon is higher, but that is the floor due to the signal path from the CPU's memory controller, across the motherboard, through the DRAM chips until it hits the last chip and terminates. The only way to go under that would be to physically move the DRAM chips closer to the memory controller and have fewer DRAM chips in series.


DDR2-800 CL4 is 10ns
DDR2-800 CL5 is 12.5ns

DDR3-1600 CL6 is 7.5ns (expensive stuff)
DDR3-1600 CL9 is 11.25ns (common stuff)

DDR4-3200 CL16 is 10ns and so forth.

So while memory bandwidth is definitely going up, command latency has been stagnant. We're at the mercy of physics here.
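
The conversion used for those figures is just CL times the memory clock period; since the memory clock runs at half the quoted transfer rate, a quick C sketch of it looks like this:

#include <stdio.h>

/* First-word (CAS) latency in nanoseconds: the memory clock runs at half the
   data rate, so one clock period is 2000 / data_rate_MTps nanoseconds and the
   CAS delay is CL of those periods. */
static double cas_ns(double data_rate_mtps, double cl) {
    return cl * 2000.0 / data_rate_mtps;
}

int main(void) {
    printf("DDR2-800  CL4  = %5.2f ns\n", cas_ns( 800,  4));   /* 10.00 */
    printf("DDR3-1600 CL9  = %5.2f ns\n", cas_ns(1600,  9));   /* 11.25 */
    printf("DDR4-3200 CL16 = %5.2f ns\n", cas_ns(3200, 16));   /* 10.00 */
    return 0;
}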
 


We should still have room as far as your argument goes because:

Light travels ~29.979 cm in one nanosecond, meaning that, technically, a light-foot is ~1.0167 nanoseconds.

https://en.wikipedia.org/wiki/Nanosecond

and since about 30 centimeters is roughly a foot, there should be some room left as far as the trace length goes.

I can do the math if you disagree, just being lazy atm.
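
Since the math got left as an exercise, here is a rough version of it in C. The 10 cm trace length and the FR-4 dielectric constant of about 4 (which slows the signal to roughly half of c) are assumptions for illustration, so this is only an order-of-magnitude check against the ~7 ns figure quoted above.

#include <stdio.h>
#include <math.h>

int main(void) {
    const double c_cm_per_ns = 29.979;   /* speed of light in vacuum          */
    const double dk = 4.0;               /* assumed FR-4 dielectric constant  */
    double v = c_cm_per_ns / sqrt(dk);   /* ~15 cm/ns on the board            */

    double trace_cm = 10.0;              /* assumed CPU-to-DIMM trace length  */
    double round_trip_ns = 2.0 * trace_cm / v;

    printf("signal velocity on the board: ~%.1f cm/ns\n", v);
    printf("%.0f cm trace, round trip:    ~%.2f ns\n", trace_cm, round_trip_ns);
    return 0;
}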
 