juanrga :
I have answered his question. I have confirmed that Intel has access to patents "outside the X86 realm" as part of a general cross-licensing agreement with AMD. "Cross-licensing" means that Intel can access AMD patents and AMD can access Intel patents. Intel and Nvidia signed a similar agreement.
Again, I direct you to the word 'blanket' and point at your evasive wording. Blanket means over the entire die, and the answer to that is no. There have been numerous lawsuits between AMD and Intel that show this, from the original MMX naming dispute to modern QuickPath tech being Intel-exclusive despite likely being very easy to implement on Radeon IGPs. Modern x86 processors are mostly 'uncore', and AFAIK AMD and Intel have little oversight into each other's patents where the uncore is concerned. Sure, AMD just did the GPU deal, but that's unlikely to mean Intel could just take what it wants from Radeon and add it to Iris. 'Outside the x86 realm' means outside the actual major x86_64 cross-licensing agreement AMD made with Intel that allowed them to implement EM64T, or vice versa with MMX etc. The answer to that, AFAIK, is no, there is no all-encompassing agreement.
Access means nothing: they are patents and can be accessed by anyone, including you and me. Can you be more specific?
juanrga :
The reason why AMD CPUs are more optimized for throughput than latency doesn't have anything to do with "philosophy". Designing a latency-optimized muarch is very hard due to the nonlinearities involved in the process, and AMD has to push in the opposite direction with Zen, due to lack of resources, as they did with Bulldozer. That is why the MC on RyZen is optimized for BW not latency; that is why the Zen core is wider, with smaller IPC than Intel but higher SMT yields; that is why AMD brings moar cores (e.g. 32C vs 28C), and so on.
True, but so obviously subjectively written. Zen has higher yields because yields are determined per die, and there are 4+ dies in TR's MCM. One bad part on an i9 and it's the bin for the whole chip; one bad part on a TR and you're only binning a quarter of the chip, so binning for a 32-core is about as difficult as for an 8-core. To counter this a bit, Intel has had much more time and more revisions on the node. Look at the RRP of TR4 and tell me that wasn't a great idea to reduce BOM, then realise this strategy is also why 'moar cores' is better for them too: you can turn more of them off between SKUs and use more half-CCX units, etc. TR's per-core IPC is within 10% of the i9's too.
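To put rough numbers on the binning point, here's a toy Poisson yield sketch (probability of a zero-defect die is exp(-D·A)); the defect density and die areas are made-up round figures, not GloFo or Intel 14nm data:

```python
# Toy Poisson yield model: probability a die has zero defects is exp(-D * A).
# Defect density and die areas are hypothetical round numbers, not real 14nm data.
import math

DEFECTS_PER_MM2 = 0.002      # assumed defect density
SMALL_DIE_MM2   = 200.0      # one 8C-class die in a 4-die MCM (assumed)
BIG_DIE_MM2     = 4 * 200.0  # a monolithic die of the same total area (assumed)

def yield_zero_defect(area_mm2, d=DEFECTS_PER_MM2):
    """Probability that a die of the given area has no defects."""
    return math.exp(-d * area_mm2)

y_big   = yield_zero_defect(BIG_DIE_MM2)    # whole chip lives or dies at once
y_small = yield_zero_defect(SMALL_DIE_MM2)  # one quarter of the MCM
y_four  = y_small ** 4                      # naive: four perfect dies needed

print(f"monolithic die:          {y_big:.1%}")
print(f"one small die:           {y_small:.1%}")
print(f"four perfect small dies: {y_four:.1%}")
```

In this model four perfect small dies are about as likely as one perfect monolithic die, but the small dies can be tested and binned before packaging, so a defect costs one ~200 mm² die instead of the whole ~800 mm² chip: that's the "only binning a quarter" effect.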
Intel's 'very hard' 'optimised muarch' answer was apparently only so difficult that it took them four months from the TR announcement to pull an entire lineup out of the hat. You have to admit it's either a hack (rushed), or Intel was holding back on us and probably still is (they certainly had plenty of time to introduce their own massively multicore HEDTs before this).
As the owner of a SkyX platform, that's bad news in either direction.
juanrga :
Stop pretending that AMD is inventing the wheel. All companies have designed or are designing fast interconnects with tons of bandwidth for either homogeneous or heterogeneous chips. I can easily recall now designs from IBM, Nvidia, or Fujitsu.
EPYC/TR is not RyZen. As mentioned a couple of times, TR has a power throttling mechanism, inherited from SP3, that reduces core clocks below the base clocks when memory/IF are overclocked. The lack of performance that you mention is due to the cores losing performance because of this automatic underclocking feature.
I'm not pretending AMD is reinventing the wheel; I'm saying AMD is listening to its inventors faster than Intel is. The i9 comes with a faster, lower-latency core-to-core interconnect and a faster core itself, but fewer PCIe lanes and little in the way of new external (homogeneous or heterogeneous) pipework otherwise, UPI notwithstanding. That was its response to AMD's IF, and before that they had a ring interconnect, which was already a fatter, lower-latency pipe that simply couldn't scale to the core counts needed, because having the fastest pipe is not always the same as having the lowest latency. The laser focus on latency lost to parallelism and throughput. Hold that thought.
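Rough sketch of why a ring stops scaling where a mesh keeps going: brute-force average hop counts on idealised topologies (this is just the geometry, not the actual Skylake-SP mesh or the IF fabric):

```python
# Average hop counts between cores on a bidirectional ring vs a k x k mesh.
# Topologies are idealized; real fabrics have different link counts and latencies.
import itertools

def avg_ring_hops(n):
    """Average shortest-path hops between distinct nodes on a bidirectional ring."""
    pairs = list(itertools.combinations(range(n), 2))
    return sum(min(abs(a - b), n - abs(a - b)) for a, b in pairs) / len(pairs)

def avg_mesh_hops(k):
    """Average Manhattan-distance hops between distinct nodes on a k x k mesh."""
    nodes = list(itertools.product(range(k), range(k)))
    pairs = list(itertools.combinations(nodes, 2))
    return sum(abs(ax - bx) + abs(ay - by) for (ax, ay), (bx, by) in pairs) / len(pairs)

for n in (4, 16, 36):
    k = int(n ** 0.5)
    print(f"{n:2d} cores: ring avg {avg_ring_hops(n):.2f} hops, "
          f"{k}x{k} mesh avg {avg_mesh_hops(k):.2f} hops")
```

The ring's average hop count grows linearly with core count while the mesh grows roughly with the square root, which is why the lowest-latency pipe at 4-10 cores isn't the lowest-latency choice at 28+.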
AMD made a universally addressable, SDN-style mesh interconnect that replaces QPI/UPI/HyperTransport as well as the core-to-core interconnect. That's repurposing the fat multi-socket pipe that goes to waste in the i9 (but is used for the IGP in i7 and below) for HBM and GPGPU via NUMA. In addition, every CCX core is connected to the IMC via L2 and then to IF, and IF in turn connects to the RAM and bridges dozens of links as a result, all outside the cores. This gives the TR chip the capability to take in and push out far, far more data without saturation than the i9, even if core-to-core isn't as good across the MCM package, which is only relevant for processes that have over 16 threads' worth of CPU demand yet are small enough to fit in L2+L3 without fetching from RAM. TR is less dependent on L3 because it can reach RAM nearly as fast, many times over, without having to copy into L3 first.
In short, isolate the effect of this and it looks like this: I might lose a 20-core, CPU-only render race with you, but if we both stack up on non-HBM GPUs and run the same render, I will see a greater improvement than you thanks to PCIe lanes. If we both stack up on HBM GPUs, I will see a massive improvement thanks to deduplication of VRAM data and the fact that there is now a NUMA bus that can address that VRAM at super-low latency compared to previous/current tech.
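Back-of-envelope on the PCIe part of that claim; the ~0.985 GB/s usable per PCIe 3.0 lane is the theoretical figure and the 8 GB working set is a made-up example, not a benchmark:

```python
# Time to push a working set to each GPU over PCIe 3.0 links of different widths.
# Scene size and the per-lane figure are assumptions for illustration only.
PCIE3_GBS_PER_LANE = 0.985   # approx usable bandwidth per PCIe 3.0 lane, GB/s

def upload_seconds(scene_gb, lanes_per_gpu):
    """Seconds for one GPU to receive the scene over its own link."""
    return scene_gb / (lanes_per_gpu * PCIE3_GBS_PER_LANE)

scene_gb = 8.0  # assumed working set per GPU
print("x16 link per GPU:", round(upload_seconds(scene_gb, 16), 2), "s")
print("x8  link per GPU:", round(upload_seconds(scene_gb, 8), 2), "s")
print("x4  link per GPU:", round(upload_seconds(scene_gb, 4), 2), "s")
```

With 64 lanes every GPU can keep an x16 link; with 28 lanes a four-GPU setup drops to x8/x4 mixes, so every refill of GPU memory takes two to four times longer.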
juanrga :
Latency is a key concept here. As stated in the AMD HSA specification, CPUs are a special kind of LCU: Latency Compute Unit.
Latency is the reason why AMD designed a new wider core, Zen, instead [of, I assume] reusing the narrow Jaguar cores for instance.
Latency is not IPC. Jaguar was technologically EOL long ago and was replaced because it was not competitive; there were dozens of reasons. You talk as if wider cores are not a throughput-prioritising decision: they are. In terms of die space, reducing the physical size of a wide pipe to fit its predecessor (or making space for it elsewhere) and making a narrow pipe run faster are both very time-consuming, as is parallelising with the intent to multiply throughput without regard to latency. As I said before, latency improves throughput but not always the reverse, so this is a non-argument.
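A toy model of that point, with made-up instruction counts and widths (not Zen or Skylake figures): widening the core speeds up the independent work but does nothing for the dependent, latency-bound chain:

```python
# Toy issue-width model: dependent instructions serialize at one per cycle,
# independent ones can fill every issue slot. All numbers are illustrative.
def run_time_ns(instructions, issue_width, dependent_fraction, cycle_ns):
    dep = instructions * dependent_fraction      # one per cycle, latency-bound
    indep = instructions - dep                   # fills all issue slots
    cycles = dep + indep / issue_width
    return cycles * cycle_ns

N = 1_000_000
print("2-wide core:", run_time_ns(N, 2, 0.3, 0.25), "ns")
print("6-wide core:", run_time_ns(N, 6, 0.3, 0.25), "ns")
```

Going wider helps the independent 70% a lot, but the dependent 30% only gets faster with shorter latencies or higher clocks, which is exactly the throughput-versus-latency split.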
juanrga :
Latency is the reason why the L2 cache in Zen is private instead of shared.
L2 is private in Zen because it's directly connected to the IMC. Fat L2 has always been the philosophy behind AMD's core cache distribution, whereas Intel preferred a faster, larger L3 than AMD and a weaker L2. Intel switched to AMD's point of view just now because it makes sense when your ring is now a grid with an exponentially larger number of connections to bits of your SoC, and your L3 needs access to all of them.
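The trade-off can be sketched with a basic average-memory-access-time calculation; all hit rates and latencies below are assumed round numbers, not measured Zen or Skylake values:

```python
# AMAT = L2_hit * L2_latency + L2_miss * (L3_hit * L3_latency + L3_miss * DRAM_latency)
# All figures below are assumed for illustration.
def amat_ns(l2_ns, l2_hit, l3_ns, l3_hit, dram_ns):
    beyond_l2 = l3_hit * l3_ns + (1 - l3_hit) * dram_ns
    return l2_hit * l2_ns + (1 - l2_hit) * beyond_l2

# Fat, private L2 doing most of the work (assumed figures)
print("fat private L2  :", amat_ns(l2_ns=4, l2_hit=0.90, l3_ns=14, l3_hit=0.50, dram_ns=80), "ns")
# Leaner L2 leaning on a big shared L3 (assumed figures)
print("lean L2, fat L3 :", amat_ns(l2_ns=3, l2_hit=0.80, l3_ns=12, l3_hit=0.70, dram_ns=80), "ns")
```

Which balance wins depends on workload locality and on how expensive the shared L3 becomes to reach once it has to sit on a big mesh rather than a short ring.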
juanrga :
Latency is the reason why AMD targets higher clocks with turbo mode. It is also the reason why EPYC with higher-clocked RAM runs better due to reduction of both memory and IF latencies.
Latency is the reason why AMD locked the IF clock to the RAM clock. Otherwise, engineers would have introduced intermediate buffers between clock domains, and those buffers would increase latency.
This fails to acknowledge that it's only because AMD has a much, much higher-throughput IF solution consisting of multiple IMCs. Latency took a back seat, compared with Intel's decision to hang everything off a single IMC outside the core. Again, and I stress that I said this before, latency is important but not all-important. Certain bits are limited by other means. More specifically, having one part that is granularly lower latency than another does not necessarily make for a lower-latency total system, but having a high-throughput part will ensure it is *possible* to reach low latencies by virtue of optimal conditions (see the crude queueing sketch below). Often the latter will get you to much lower latencies than the former, else we would not have multicore computers in the first place. The whole framework for threading, out-of-order execution, etc. adds significant internal latency to any computer compared to a single in-order design, but that's nothing compared to the advantages presented by the extra cores.
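The crude queueing sketch: total latency is roughly unloaded latency divided by (1 − utilisation), so the higher-throughput link wins under load even if its unloaded latency is worse. Both links below are hypothetical, not IF or UPI measurements:

```python
# Crude queueing approximation: loaded latency ~ unloaded latency / (1 - utilisation).
# Bandwidth and latency figures are hypothetical, not fabric measurements.
def loaded_latency_ns(unloaded_ns, link_gbs, offered_gbs):
    util = offered_gbs / link_gbs
    assert util < 1.0, "link saturated"
    return unloaded_ns / (1.0 - util)

OFFERED_GBS = 40.0  # traffic the cores are trying to push (assumed)

# Lower unloaded latency, but little headroom
print("narrow  50 GB/s link:", round(loaded_latency_ns(60, 50.0, OFFERED_GBS), 1), "ns")
# Worse unloaded latency, but lots of headroom
print("wide   150 GB/s link:", round(loaded_latency_ns(80, 150.0, OFFERED_GBS), 1), "ns")
```

The narrow link is faster when idle and far slower once the cores actually load it, which is the "throughput makes low latency possible" point in one line of arithmetic.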
juanrga :
You mention "prediction and caches are the future, and all they do is improve throughput". This is wrong at two levels. First, they aren't the future, they are the present. Second they improve latency so much as throughput. Caches have much lower access times than main memory, which reduces latency when accessing data. Prediction algorithms ensure with some given probability that correct data/instructions are ready to enter the pipeline at the correct instant. When this doesn't happen, i.e. when the prediction fails, the cost is paid in the extra cycles needed to search the correct data/instructions, move the information to the pipeline and flush the incorrect data/istructions from the pipeline.
If you mean they are the present in current releases, sure they are, but that doesn't mean the major performance gains in parts over the next five years won't come down to improvements in prediction and caches.
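For instance, under the classic penalty model, even modest gains in prediction accuracy and cache hit rate move effective CPI a lot; the rates and penalties below are illustrative, not measured for any real core:

```python
# Classic penalty model:
# CPI = base + branch_freq * mispredict_rate * flush_penalty
#            + mem_op_freq * miss_rate * miss_penalty
# All rates and penalties are illustrative assumptions.
def effective_cpi(base, br_freq, mispredict, flush_cycles, mem_freq, miss, miss_cycles):
    return base + br_freq * mispredict * flush_cycles + mem_freq * miss * miss_cycles

today    = effective_cpi(0.5, 0.20, 0.05, 16, 0.30, 0.05, 200)
improved = effective_cpi(0.5, 0.20, 0.02, 16, 0.30, 0.02, 200)  # better predictor + caches
print(f"today:    {today:.2f} CPI")
print(f"improved: {improved:.2f} CPI ({today / improved:.2f}x faster at the same clock)")
```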
juanrga :
The same happens with consoles. Latency is king. Contrary to a widespread myth, GDDR5 doesn't have significantly worse latencies than DDR3. Moreover, the PS4 has cores clocked low (1.6GHz), which helps to reduce the latency gap between execution and the memory pool.
Adding a GDDR5 memory subsystem optimized for throughput doesn't affect Jaguar cores @1.6GHz the same way it affects Skylake cores @4.5GHz. However, even with low core clocks, several game developers have had latency problems with the PS4. Precisely the first thing that Sony did on the new Pro was to increase the clocks on cores and memory, reducing latencies by up to 30%...
This is all well-known.
Point number one is answered by point number two. PS4 GDDR5 CAS timing is in the high double to low triple digits, and I don't know the 'popular belief' stories, but I know my GPU (and probably yours) has much faster GDDR at much worse clock latencies. I am aware of GDDR's intrinsic strengths and that, dollar for dollar, you will find lower timings as a result of the duplex design, but even with a 30% improvement they lag well behind similarly clocked DDR3/4 (the cycles-to-nanoseconds conversion is sketched below). I am also aware that the Xbox has DDR3 RAM and a significant throughput deficit, yet a latency advantage further exacerbated by its super-low-latency ESRAM... and has uglier games and far more performance complaints as a result. The example I gave was commentary on that example, not on GDDR technology itself, but if you want commentary there, I'd say that large-batch memory transfer optimisation is a core reason why it works better in GPUs, and once L2+L3 caches are big enough I'm sure it will be the same story for CPUs. Both AMD and Intel are moving toward parallelism and throughput over latency below L2, and I believe AMD is in the lead there. Everything below L2 is just a cache.
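The conversion I'm leaning on is simply CAS cycles divided by the command clock; the CL/clock pairs below are hypothetical datasheet-style numbers, only to show that comparing CL figures across very different clocks means nothing without doing this:

```python
# Absolute CAS latency in nanoseconds = CL cycles / command clock (MHz) * 1000.
# The CL/clock pairs are hypothetical, shown only to illustrate the conversion.
def cas_ns(cl_cycles, command_clock_mhz):
    return cl_cycles / command_clock_mhz * 1000.0

print("CL11 @  800 MHz (DDR3-1600-class):", round(cas_ns(11, 800), 2), "ns")
print("CL16 @ 1600 MHz (DDR4-3200-class):", round(cas_ns(16, 1600), 2), "ns")
# Plug in whatever CL and command clock a given GDDR5 datasheet quotes
# and compare in nanoseconds, not in raw cycle counts.
```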
I expect BIOS patches and microcode will remove the limitations on clocking IF. Let's remember that TR shares more with EPYC than with Ryzen, and if Ryzen is anything to go by, AMD is actively working on optimising IF on a daily basis... and they were still late to market with 3200 RAM support because of IF.