Skylake: Intel's Core i7-6700K And i5-6600K


Once you have trace lengths longer than 1/10th the wavelength though, transmission line effects start having a major impact on signal integrity, especially so on multi-drop busses. This is why nearly all high-speed interfaces have gone point-to-point serial.

Also, due to the dielectric constant of PCBs and insulation in wires, the speed of electrical signals is typically closer to 200e6 m/s. At 3.2GT/s, that makes each bit only 6.25cm long on the bus. With DRAM data traces being ~10cm long, this means that by the time a set of bytes reaches the CPU, the memory chips are already starting to drive the next ones on the line. When you reach speeds where you have multiple bits in-flight on the wire at the same time, it becomes extremely difficult to maintain signal integrity on parallel multi-drop busses, and this is why having multiple DIMMs per channel becomes increasingly touchy as clock frequencies go up.
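A quick back-of-the-envelope check of those numbers, as a sketch assuming the ~200e6 m/s propagation speed and ~10 cm trace length mentioned above (illustrative values, not a board-level model):

```python
# Rough signal-integrity arithmetic for a parallel DRAM bus.
# Assumes ~200e6 m/s propagation in the PCB dielectric and a ~10 cm trace,
# as discussed above; values are illustrative only.

PROPAGATION_SPEED = 200e6   # m/s, typical for signals in PCB dielectric
TRACE_LENGTH = 0.10         # m, ballpark DRAM data trace
TRANSFER_RATE = 3.2e9       # transfers per second (3.2 GT/s)

bit_length = PROPAGATION_SPEED / TRANSFER_RATE          # physical length of one bit on the wire
bits_in_flight = TRACE_LENGTH / bit_length              # bits occupying the trace at once
wavelength = PROPAGATION_SPEED / (TRANSFER_RATE / 2)    # fundamental of an alternating bit pattern

print(f"Bit length on the bus: {bit_length * 100:.2f} cm")           # ~6.25 cm
print(f"Bits in flight on a 10 cm trace: {bits_in_flight:.1f}")      # ~1.6
print(f"Trace / wavelength ratio: {TRACE_LENGTH / wavelength:.2f}")  # well past the ~1/10 rule of thumb
```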
 
Yeah, I get all that, I was just going on the theoretical hard limit, not the practical ones. Such as on a 5GHz processor you need a 5.9cm trace maximum to make a round trip in 1 nanosecond. But considering latency is 6, 7, 9, I think we could shave it down to 4 while still being practical. Just my 2 cents.



 


LOL, we've been hit by an overloaded definition. Please accept that 'execution time', for example as used in 'the execution time of this program was 2.1 CPU seconds', does *NOT* mean the time it takes to execute a specific instruction in transistors with an infinite L1 cache. It refers to the wall-clock time the CPU was considered busy by the operating system (or whatever tool was measuring performance), and includes all of the pipe stalls, cache misses, etc. that contribute to the wall-clock measured execution time.

Rereading your posts they make sense with your definition. Pretty funny.
 


Execution time is an actual term, it's not a definition up for debate. It's the length of time a specific instruction takes to execute in silicon, not how long it takes to set up the instruction. I linked documentation that describes the various execution latencies during the execution stage, which is the same thing as execution time. And I've said, several times, that this number is important because you design your scheduler around it. The scheduler doesn't care how long, in real time, it takes an instruction to pass down the entire pipeline; it only cares how many cycles each stage takes, in order to optimize instruction flow.
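To put the "design the scheduler around it" point in concrete terms, here is a minimal toy sketch; the instruction names and latency numbers are made up for illustration, not taken from any real core:

```python
# Minimal illustration of why a scheduler cares about execution latency
# rather than full pipeline length: it only needs to know, in cycles,
# when each result becomes available to dependent instructions.
# The latency table and instruction stream are hypothetical.

EXEC_LATENCY = {"add": 1, "mul": 3, "load": 4}  # cycles spent in the execute stage

# (opcode, destination register, source registers)
program = [
    ("load", "r1", []),
    ("add",  "r2", ["r1"]),   # depends on the load
    ("mul",  "r3", ["r2"]),   # depends on the add
]

ready_at = {}   # register -> cycle its value becomes available
cycle = 0
for op, dst, srcs in program:
    # An instruction can issue only once all of its source operands are ready.
    issue = max([cycle] + [ready_at[s] for s in srcs])
    ready_at[dst] = issue + EXEC_LATENCY[op]
    print(f"{op:4s} -> {dst}: issues at cycle {issue}, result ready at cycle {ready_at[dst]}")
    cycle = issue + 1   # toy assumption: one issue per cycle
```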

We should still have room as far as your argument goes because:

Light travels ~29.979 cm in one nanosecond, meaning that, technically, a light-foot is ~1.0167 nanoseconds.[5]

https://en.wikipedia.org/wiki/Nanosecond

and since ~29 centimeters is about a foot, there should be some room left as far as the trace length goes.

I can do the math if you disagree, just being lazy atm.
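For the record, the quick math looks roughly like this (a sketch assuming vacuum-speed propagation; real traces are slower, as noted earlier in the thread):

```python
# How far a signal can travel per clock period, assuming (optimistically)
# propagation at the speed of light in vacuum. Real PCB traces run closer
# to ~2/3 c, so these are upper bounds.

C = 29.979          # cm per nanosecond (speed of light)
FREQ_GHZ = 5.0      # example clock from the post above, 5 GHz

period_ns = 1.0 / FREQ_GHZ
per_cycle_cm = C * period_ns          # one-way distance per clock period
round_trip_cm = per_cycle_cm / 2.0    # max trace length for a same-cycle round trip

print(f"Clock period at {FREQ_GHZ} GHz: {period_ns:.2f} ns")
print(f"Light travels {C:.1f} cm in 1 ns, {per_cycle_cm:.1f} cm per clock period")
print(f"Max trace for a same-cycle round trip: {round_trip_cm:.1f} cm")
```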

If there were a fiber optic wire connecting the memory controller to the very last chip on the line, then yes, but because the signal needs to pass through several gates and be analyzed on its path, then no. Also the path isn't perfect with minimal resistance and perfect uniformity; there are solder joints, DIMM contacts and some slight EMI from other buses along the way. All those little things add up to a practical limit of about ~7-7.5ns for command latency. Now if you had DRAM chips soldered onto the same PCB as the CPU and sitting right next to it, then you could go a bit lower with command latency. If you avoided the solder entirely, used a serial bus and put the memory onto the same package as the CPU, then you could make real headway.

Ultimately it's not that big a deal; CPU prefetching and caching hide most of the latency from the user. It's only when there is a mispredict and the data isn't inside any cache that command latency really becomes an issue.
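As a rough sanity check of where that ~7 ns goes, here's a quick sketch using the ballpark figures from earlier in the thread (~200e6 m/s propagation, ~10 cm trace); it suggests the raw flight time on the wire is only a small fraction of the command latency, with the rest spent in the DRAM array and interface logic:

```python
# Compare raw trace propagation delay against practical DRAM command latency,
# using the ballpark figures quoted earlier in the thread.

PROPAGATION_SPEED = 200e6      # m/s in PCB dielectric
TRACE_LENGTH = 0.10            # m, ~10 cm DRAM data trace
COMMAND_LATENCY_NS = 7.0       # ~7-7.5 ns practical command latency

flight_time_ns = TRACE_LENGTH / PROPAGATION_SPEED * 1e9
print(f"One-way flight time on the trace: {flight_time_ns:.2f} ns")            # ~0.5 ns
print(f"Share of command latency: {flight_time_ns / COMMAND_LATENCY_NS:.0%}")  # ~7%
```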
 

The scheduler also takes into account the delays of routing results from the execution unit to the register file or to another execution unit for instructions that depend on previous results, but because register write-back and bypass delays are the same regardless of which execution unit the result comes out of, they are omitted.

In old architectures, instruction execution time included everything from instruction fetch to result write-back into registers, since every instruction had to go through every stage to complete its execution. Today, most of that is bypassed by the reorder buffer, but instructions still have an execution stage (1-5 cycles depending on the instruction, and execution units dealing with multi-cycle operations are pipelined to accept new operands on every clock cycle for most instructions) and a retire stage which adds one more clock tick to complete the instruction's execution and make the result available to other instructions waiting on it.

An instruction is not truly complete until its result becomes available to other instructions waiting on it.
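To illustrate the latency-versus-throughput point for a pipelined multi-cycle unit, a tiny sketch (the 3-cycle latency here is a made-up example, not a specific core's number):

```python
# A pipelined execution unit: multi-cycle latency per operation, but a new
# operation is accepted every cycle, so throughput is one result per cycle
# once the pipe is full. The 3-cycle latency is hypothetical.

LATENCY = 3  # cycles from issue to result for this hypothetical unit

operations = ["mul0", "mul1", "mul2", "mul3"]
for issue_cycle, name in enumerate(operations):
    result_cycle = issue_cycle + LATENCY
    print(f"{name}: issued cycle {issue_cycle}, result available cycle {result_cycle}")

# Results complete on consecutive cycles (3, 4, 5, 6) even though each
# individual operation still takes 3 cycles end to end.
```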
 
The scheduler also takes into account the delays of routing results from the execution unit to the register file or to another execution unit for instructions that depend on previous results, but because register write-back and bypass delays are the same regardless of which execution unit the result comes out of, they are omitted.

This happens before the execution window; I've already demonstrated that.

An instruction is not truly complete until its result becomes available to other instructions waiting on it.

And a pig isn't truly a pig without a cute squiggly tail. No True Scotsman fallacy in full swing.

Most RISC instructions have a one cycle execution time, regardless of how you want to redefine the terminology. This is on purpose in order to facilitate easier pipelining.

You can't argue that 1 + 1 = 3 because the universal law of inequality adds an invisible one via super-positional plasma inductance along the warp core.

I have provided detailed material that explains this, outlines the various stages, explains how the instructions are executed and so on / so forth.
 
I've decided I'll not consider a new hardware platform in the remaining financial year.

Frankly, all those X99 and Z97 custom-built machines have more than enough to meet the computational demands at least through the end of the next financial year (2016). By that time I'll phase out the existing Windows OS machines (20% share) -- no more Windows on the company networks.

Last but not least, I'm somewhat disappointed to learn of the higher TDP of the 6700K. Devil's Canyon isn't lagging that much, so I won't lose any sleep over it.
 


Peace.

If you talk 'bus driver' to an EE one definition comes to mind. If you are standing near a greyhound bus another is in play.

Declaring "..A bus driver is a buffer designed to drive a bus. .." and is not up to debate sort of depends on the audience.

 
I never said we wouldn't have to radically change how we do stuff in order to obtain those goals. I do have a rudimentary grasp of the things that are impeding such progress and perfect signal propagation: things such as the speed of electrons not equaling the speed of light in a solid, inductance, etc. I really don't want to obtain my masters in circuit design though, and am just stating that I think lower times are doable.

*edit* tsnor reminded me, I do have a background in the EE side of things rather than the programming side; I went for my AASET in '93 or so. So I would be looking at this from the physics side, palladin.

The thing is, we aren't going to get them, because that would require things that the industry isn't going to do until there is no other choice. Things like throwing compatibility out the window, radical redesigns such as main memory on the die, and standardized hardware aren't happening, because the people who could do this aren't done fleecing us with tiny baby steps, and it's hard and expensive. The powers that be have no interest in making the best they can for the sake of doing it.

Pretty soon the rest of the improvements are going to have to come from coders knowing the hardware and throwing this object-oriented visual crap out the window.




 

"Object Oriented crap" is not going away any time soon since dealing with the complexity of modern OSes and APIs by managing everything manually would be borderline insane.

What you are going to see is more libraries and toolkits that re-implement many common algorithms in more threading-friendly manners or with application-specific twists to accommodate specific use-cases. Few programmers want to waste their time re-inventing the wheel unless they have very specific reasons to.
 
That's true... it's just that to get better performance it's going to take better coding soon; we're going almost full circle, which is pretty ironic. It's still object-oriented crap to me. I understand it's necessary because the firmware and OSes are overly complicated now and no team could write anything in a timely manner without it.

That's why after that the next step is standardized cookie-cutter hardware for each usage type; we're hitting a wall as far as raw horsepower for home use, and the next jump will be stuff like DX12 and stripping legacy code. That's only going to work so well as long as we need HALs. Once we squeeze the code as much as we easily can, we are going to need to dump hardware abstraction and go to a standardized hardware set for any real jumps in performance. Optimization doesn't work very well when we have gobs of plug-in components.

Look at how relatively awesome the C-64 was at 1 MHz compared to a 5150 or 8086 at the time. It destroyed them as far as speed because it was a standard platform and programmers knew how to get the max out of it. When a 1 MHz system can beat a system almost 5x faster and with 10x the memory, that makes the case for standardized platforms IMO.


 
The difference between a C-64 and a PC is that the C-64 was one set of hardware to program for. It is much like an XB1/PS4 vs a PC. Same base for the hardware, but the PC has so many variables that it is hard to optimize and take advantage of all the speed from the get-go.

That does not mean that the XB1/PS4 is faster than a PC, as we have seen a top end PC with a highly optimized game can make the XB1/PS4 look like last gen. It just means that when you can code for a single set of hardware and software, you can make things work better. With many sets it takes time.
 

Do you really want to get intimate with the GPU architecture down to clock-accurate compiler models for each and every GPU model? When you get to that point, you need to create optimized code for each architecture you want to support, and this becomes a nightmare of its own. It is extremely unlikely that AMD, Intel, Nvidia and whoever else might make a GPU to pair with an x86 CPU will have exactly the same architecture for a given instruction set, API or whatever, and they will all have hardware scaled from basic to enthusiast, so you need an abstraction layer to avoid having to support every possible case individually while still allowing progress and scaling under the hood.

If you look at what has happened on the CPU side, the whole x86 front-end on modern x86 CPUs is basically a hardware abstraction layer for the RISC core that actually runs instructions, with the reorder buffer and issue queue taking care of most deep architectural details so programmers can focus on higher-level code.
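A very rough sketch of the abstraction-layer idea: application code targets one interface, and vendor-specific backends hide the architectural differences underneath. The class and method names below are purely illustrative:

```python
# Toy illustration of a hardware abstraction layer: the application codes
# against one interface, and vendor-specific backends translate calls onto
# their own architectures. Names and methods are purely illustrative.

from abc import ABC, abstractmethod

class GPUBackend(ABC):
    @abstractmethod
    def draw_triangles(self, vertex_count: int) -> None:
        ...

class VendorABackend(GPUBackend):
    def draw_triangles(self, vertex_count: int) -> None:
        # Would translate the call into vendor A's command format.
        print(f"Vendor A: encoding {vertex_count} vertices into its command stream")

class VendorBBackend(GPUBackend):
    def draw_triangles(self, vertex_count: int) -> None:
        # Would translate the same call onto vendor B's very different hardware.
        print(f"Vendor B: batching {vertex_count} vertices for its shader cores")

def render_frame(gpu: GPUBackend) -> None:
    # Application/engine code: written once, runs on any backend.
    gpu.draw_triangles(3_000)

for backend in (VendorABackend(), VendorBBackend()):
    render_frame(backend)
```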
 

What if more paths are made? I don't know how many paths are in each memory channel, but what would happen if we just increased the amount? The speed of each bit would slow, but the overall number of bits delivered would be higher. For the sake of argument, let's say that there are 2 paths from the CPU to the RAM. If the number were increased to 10, then the data throughput would be 5x that of the original model. I don't know how many bits are needed to execute a command, but a wider path would deliver them much quicker since they are delivered simultaneously.

One problem with that might be the same as with the PATA interface: the bits were in danger of not being delivered at the same time because of little variables, so one might get delivered too soon, and another too late, causing loads of problems. I think that a good solution to that would be to send 6 bits, shut down the paths for 2 cycles, send 6 bits, shut down for 2 cycles, etc. That will ensure that the data won't get mixed up. To compensate for the lost cycles, lots of paths need to be added. Tripling or quadrupling the paths would easily make up for the lost cycles. That's just my 2 cents.
(please go easy on me. This isn't my area of expertise, so forgive me if I overlooked the obvious or said something dumb)
 


Why would it be linear? Bandwidth =/= throughput.
 

I was assuming that all else (clock rates, latencies, etc.) would be equal. If one path transmits 1 bit per nanosecond, then 10 paths would transmit 10 bits per nanosecond.
 


We already do that; DDR DRAM access isn't a serialized interface. Each memory channel is 64 bits wide, and systems have anywhere from two to four memory channels per socket. Those parallel lines are actually one of the limiting factors behind command latency: you need to wait for the signal on the slowest lane to terminate before sending another one, and since not all lanes are exactly the same length, there will always be synchronization issues. This is why we've been going to bundled serial interfaces for everything outside of a socket; it generally lets us pump faster by treating each line as a separate interface rather than treating them all as one big one.

Also latency =/= bandwidth; those are two separate things, though often intertwined. The best way to describe it is a lord and his servants. A lord has ten servants and wishes to take a hot bath. He orders one servant to go get his bath ready. Latency is how long it takes that servant to run down to the servants' quarters and order the rest to start preparing the bath water. Bandwidth is how fast buckets can be brought to the tub. No matter how many servants (bandwidth) the lord has, his individual commands cannot go faster than the time it takes one servant to relay them to the others (latency). So going wider only increases bandwidth; it does nothing for latency. Going faster can increase both, but then physics weighs in: servants can only be made to run so fast, while you can always hire more of them.
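A quick numerical way to see the same point: total transfer time is roughly a fixed latency plus size divided by bandwidth, so going wider only shrinks the second term. A small sketch with illustrative numbers:

```python
# Transfer time ~= fixed command latency + data size / bandwidth.
# Widening the bus scales bandwidth but leaves the latency term untouched.
# The numbers below are illustrative only.

LATENCY_NS = 7.0          # ns, command latency (the servant relaying the order)
RATE_PER_LANE = 3.2e9     # transfers per second per data lane

def transfer_time_ns(bits: int, lanes: int) -> float:
    bandwidth_bits_per_ns = lanes * RATE_PER_LANE / 1e9
    return LATENCY_NS + bits / bandwidth_bits_per_ns

for lanes in (64, 128, 256):
    t = transfer_time_ns(bits=512, lanes=lanes)   # a 64-byte cache line
    print(f"{lanes:3d} lanes: {t:.2f} ns for a 64-byte line")

# Doubling the width keeps shaving the transfer term, but the total time
# never drops below the 7 ns latency floor.
```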
 
I am sorry, but what "enthusiast" purchases a CPU to use for integrated graphics on a desktop? There aren't any worth a dime, and honestly, little reason to do so anyway. I even gave AMD another shot with one of their A10 APUs, purchased a separate discrete card compatible for CrossFire with the APU, and I was still extremely disappointed. You can run games on low settings, or older, less graphics-intensive games on high, at an acceptable framerate. For my kids' games and such (ages under 8), the performance and graphics aren't bad at all for the total price you pay. It just isn't for me when I want to have my "me" time with something that requires a little more beef.

You can get a good enough graphics card for dirt cheap (under $150). You can get one that can run most if not all current-gen games on max settings for under $200. Yes, this might be pushing it, but you CAN find the right deal. I did not spend much more total on my Intel machine than I did on this AMD APU rig, but the performance in gaming and more has proven the extra buck to be more than worth it.

You should not be looking at a desktop CPU for its graphics capabilities, but more for its processing power and performance. If your needs are different, and you do not have any use for powerful graphics, then have at it. I would still go to Intel for such uses, but that can be otherwise debatable when you start comparing actual use and cost.
 
What the heck, Intel? So, you provide great integrated graphics in Broadwell, then nerf it for Skylake? I guess you had to find a way to help sell your 'paper launch' of Broadwell. I really hope Zen makes you guys wake up, although it more than likely won't.

...Great integrated graphics? Is that what you call it? It is still garbage for anything more than basic 3D use. Most people aren't looking at an Intel CPU for graphics processing. I hope this "Zen" you speak of from AMD can actually provide viable graphics and processing, because their A-series APUs aren't worth a dime. I would actually like to see a good APU that doesn't need discrete graphics.
 
I don't necessarily want it because:

The best platform wouldn't be the one people pick.

I am not talking about having gobs of architectures, I mean ONE set of hardware for home use. If everyone had the exact same internals, how easy would it be to optimize, or to fix a bug and have it fixed for all users? Yes, there are minuses to the idea, but we'll reach a point where that is what will be necessary to progress any further (tangible gains, not 1 percent junk). Home users simply can't keep adding more hardware forever; it'll get cost prohibitive, as well as electrically and thermally impractical.

Don't know when it'll happen, but we'll be at that point someday, or we just stop getting faster, more powerful hardware and make do.





 

Different people have different requirements, budgets and applications. If you wanted to force a single hardware platform on everyone, then you would either need to force a horribly over-powered PC on people who only use their PC for email, or a grossly under-powered PC on enthusiast gamers and everyone in-between. Different architectures fare better at different workloads, so you often have reasons to consider using different hardware.

Hardware is still nowhere near cheap and powerful enough to enable a one-size-fits-all approach.
 


If you want everything standardized and the same, go join the Apple cult. They will give you what they think you need and you will have no options.
 


There is a reason why Apple (their PCs) has never become more than a niche product: there is no one set of hardware and software that works for everyone. Some people just need a tablet in terms of power while others need the top of the line. Apple doesn't do this as well as a PC.

And yes, it would be easy. It just won't ever happen.
 
I've had Apple, thank you. It's not some kind of mindless cult that people can't leave.



 