juanrga :
I have answered his question. I have confirmed that Intel has access to patents "outside the X86 realm" as part of a general cross-licensing agreement with AMD. "Cross-licensing" means that Intel can access AMD patents and AMD can access Intel patents. Intel and Nvidia signed a similar agreement.
Again, I direct you to the word 'blanket' and point at your evasive wording. Blanket means over the entire die, and the answer to that is no. There have been numerous lawsuits between AMD and Intel that show this, from the original MMX naming dispute to modern QuickPath tech being Intel-exclusive despite likely being very easy to implement on Radeon IGPs. Modern x86 processors are mostly 'uncore', and AFAIK AMD and Intel have little oversight into each other's patents where the uncore is concerned. Sure, AMD just did the GPU deal, but that's unlikely to mean Intel could just take what it wants from Radeon and add it to Iris. 'Outside the x86 realm' means outside the actual major x86_64 cross-licensing agreement AMD made with Intel that allowed them to implement EM64T, or vice versa with MMX etc. The answer to that, AFAIK, is no, there is no all-encompassing agreement.
Access means nothing: they are patents and can be accessed by anyone, including you and me. Can you be more specific?
juanrga :
The reason why AMD CPUs are more optimized for throughput than latency doesn't have anything to do with "philosophy". Designing a latency-optimized muarch is very hard due to the nonlinearities involved in the process, and AMD has to push in the opposite direction with Zen, due to lack of resources, as they did with Bulldozer. That is why the MC on RyZen is optimized for BW not latency; that is why the Zen core is wider, with smaller IPC than Intel but higher SMT yields; that is why AMD brings moar cores (e.g. 32C vs 28C), and so on.
True, but so obviously subjectively written. Zen has higher yields because yields are determined per die, and there are 4+ dies in TR's MCM. One bad part on an i9 and it's the bin for the whole chip; one bad part on a TR and you're only binning a quarter of the chip, so binning for a 32-core is about as difficult as for an 8-core. To counter this a bit, Intel has had much more time and more revisions on the node. Look at the RRP of TR4 and tell me that wasn't a great idea to reduce BOM, then realise this strategy is also why 'moar cores' is better for them too: you can turn more of them off between SKUs and use more half-CCX units, etc. TR's per-core IPC is within 10% of the i9's too.
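To put rough numbers on the binning point, here's a toy Poisson yield sketch (probability of a zero-defect die is exp(-D·A)); the defect density and die areas are made-up round figures, not GloFo or Intel 14nm data:

```python
# Toy Poisson yield model: probability a die has zero defects is exp(-D * A).
# Defect density and die areas are hypothetical round numbers, not real 14nm data.
import math

DEFECTS_PER_MM2 = 0.002      # assumed defect density
SMALL_DIE_MM2   = 200.0      # one 8C-class die in a 4-die MCM (assumed)
BIG_DIE_MM2     = 4 * 200.0  # a monolithic die of the same total area (assumed)

def yield_zero_defect(area_mm2, d=DEFECTS_PER_MM2):
    """Probability that a die of the given area has no defects."""
    return math.exp(-d * area_mm2)

y_big   = yield_zero_defect(BIG_DIE_MM2)    # whole chip lives or dies at once
y_small = yield_zero_defect(SMALL_DIE_MM2)  # one quarter of the MCM
y_four  = y_small ** 4                      # naive: four perfect dies needed

print(f"monolithic die:          {y_big:.1%}")
print(f"one small die:           {y_small:.1%}")
print(f"four perfect small dies: {y_four:.1%}")
```

In this model four perfect small dies are about as likely as one perfect monolithic die, but the small dies can be tested and binned before packaging, so a defect costs one ~200 mm² die instead of the whole ~800 mm² chip: that's the "only binning a quarter" effect.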
Intel's 'very hard' 'optimised muarch' answer was apparently only so difficult that it took them four months from the TR announcement to pull an entire lineup out of the hat. You have to admit it's either a hack (rushed), or Intel was holding back on us and probably still is (they certainly had plenty of time to introduce their own massively multicore HEDTs before this).
As the owner of a SkyX platform, that's bad news in either direction.
juanrga :
Stop pretending that AMD is inventing the wheel. All companies have designed or are designing fast interconnects with tons of bandwidth for either homogeneous or heterogeneous chips. I can easily recall now designs from IBM, Nvidia, or Fujitsu.
EPYC/TR is not RyZen. As mentioned a couple of times, TR has a power throttling mechanism, inherited from SP3, that reduces core clocks below the base clocks when memory/IF are overclocked. The lack of performance that you mention is due to the cores losing performance because of this automatic underclocking feature.
I'm not pretending AMD is reinventing the wheel; I'm saying AMD is listening to its inventors faster than Intel is. The i9 comes with a faster, lower-latency core-to-core interconnect and a faster core itself, but fewer PCIe lanes and little in the way of new external (homogeneous or heterogeneous) pipework otherwise, UPI notwithstanding. That was its response to AMD's IF, and before that they had a ring interconnect, which was already a fatter, lower-latency pipe that simply couldn't scale to the core counts needed, because having the fastest pipe is not always the same as having the lowest latency. The laser focus on latency lost to parallelism and throughput. Hold that thought.
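Rough sketch of why a ring stops scaling where a mesh keeps going: brute-force average hop counts on idealised topologies (this is just the geometry, not the actual Skylake-SP mesh or the IF fabric):

```python
# Average hop counts between cores on a bidirectional ring vs a k x k mesh.
# Topologies are idealized; real fabrics have different link counts and latencies.
import itertools

def avg_ring_hops(n):
    """Average shortest-path hops between distinct nodes on a bidirectional ring."""
    pairs = list(itertools.combinations(range(n), 2))
    return sum(min(abs(a - b), n - abs(a - b)) for a, b in pairs) / len(pairs)

def avg_mesh_hops(k):
    """Average Manhattan-distance hops between distinct nodes on a k x k mesh."""
    nodes = list(itertools.product(range(k), range(k)))
    pairs = list(itertools.combinations(nodes, 2))
    return sum(abs(ax - bx) + abs(ay - by) for (ax, ay), (bx, by) in pairs) / len(pairs)

for n in (4, 16, 36):
    k = int(n ** 0.5)
    print(f"{n:2d} cores: ring avg {avg_ring_hops(n):.2f} hops, "
          f"{k}x{k} mesh avg {avg_mesh_hops(k):.2f} hops")
```

The ring's average hop count grows linearly with core count while the mesh grows roughly with the square root, which is why the lowest-latency pipe at 4-10 cores isn't the lowest-latency choice at 28+.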
AMD made a universally addressable, SDN-style mesh interconnect that replaces QPI/UPI/HyperTransport as well as the core-to-core interconnect. That's repurposing the fat multi-socket pipe that goes to waste in the i9 (but is used for the IGP in i7 and below) for HBM and GPGPU via NUMA. In addition, every CCX core is connected to the IMC via L2 and then to IF, and IF in turn connects to the RAM and bridges dozens of links as a result, all outside the cores. This gives the TR chip the capability to take in and push out far, far more data without saturation than the i9, even if core-to-core isn't as good across the MCM package, which is only relevant for processes that have over 16 threads' worth of CPU demand yet are small enough to fit in L2+L3 without fetching from RAM. TR is less dependent on L3 because it can reach RAM nearly as fast, many times over, without having to copy into L3 first.
In short, isolate the effect of this and it looks like this: I might lose a 20-core, CPU-only render race with you, but if we both stack up on non-HBM GPUs and run the same render, I will see a greater improvement than you thanks to PCIe lanes. If we both stack up on HBM GPUs, I will see a massive improvement thanks to deduplication of VRAM data and the fact that there is now a NUMA bus that can address that VRAM at super-low latency compared to previous/current tech.
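Back-of-envelope on the PCIe part of that claim; the ~0.985 GB/s usable per PCIe 3.0 lane is the theoretical figure and the 8 GB working set is a made-up example, not a benchmark:

```python
# Time to push a working set to each GPU over PCIe 3.0 links of different widths.
# Scene size and the per-lane figure are assumptions for illustration only.
PCIE3_GBS_PER_LANE = 0.985   # approx usable bandwidth per PCIe 3.0 lane, GB/s

def upload_seconds(scene_gb, lanes_per_gpu):
    """Seconds for one GPU to receive the scene over its own link."""
    return scene_gb / (lanes_per_gpu * PCIE3_GBS_PER_LANE)

scene_gb = 8.0  # assumed working set per GPU
print("x16 link per GPU:", round(upload_seconds(scene_gb, 16), 2), "s")
print("x8  link per GPU:", round(upload_seconds(scene_gb, 8), 2), "s")
print("x4  link per GPU:", round(upload_seconds(scene_gb, 4), 2), "s")
```

With 64 lanes every GPU can keep an x16 link; with 28 lanes a four-GPU setup drops to x8/x4 mixes, so every refill of GPU memory takes two to four times longer.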
juanrga :
Latency is a key concept here. As stated in the AMD HSA specification, CPUs are a special kind of LCU: Latency Compute Unit.
Latency is the reason why AMD designed a new wider core, Zen, instead [of, I assume] reusing the narrow Jaguar cores for instance.
Latency is not IPC. Jaguar was technologically EOL long ago and was replaced because it was not competitive; there were dozens of reasons. You talk as if wider cores are not a throughput-prioritising decision: they are. In terms of die space, reducing the physical size of a wide pipe to fit its predecessor (or making space for it elsewhere) and making a narrow pipe run faster are both very time-consuming, as is parallelising with the intent to multiply throughput without regard to latency. As I said before, latency improves throughput but not always the reverse, so this is a non-argument.
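A toy model of that point, with made-up instruction counts and widths (not Zen or Skylake figures): widening the core speeds up the independent work but does nothing for the dependent, latency-bound chain:

```python
# Toy issue-width model: dependent instructions serialize at one per cycle,
# independent ones can fill every issue slot. All numbers are illustrative.
def run_time_ns(instructions, issue_width, dependent_fraction, cycle_ns):
    dep = instructions * dependent_fraction      # one per cycle, latency-bound
    indep = instructions - dep                   # fills all issue slots
    cycles = dep + indep / issue_width
    return cycles * cycle_ns

N = 1_000_000
print("2-wide core:", run_time_ns(N, 2, 0.3, 0.25), "ns")
print("6-wide core:", run_time_ns(N, 6, 0.3, 0.25), "ns")
```

Going wider helps the independent 70% a lot, but the dependent 30% only gets faster with shorter latencies or higher clocks, which is exactly the throughput-versus-latency split.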
juanrga :
Latency is the reason why the L2 cache in Zen is private instead of shared.
L2 is private in Zen because it's directly connected to the IMC. Fat L2 has always been the philosophy behind AMD's core cache distribution, whereas Intel preferred a faster, larger L3 than AMD and a weaker L2. Intel switched to AMD's point of view just now because it makes sense when your ring is now a grid with an exponentially larger number of connections to bits of your SoC, and your L3 needs access to all of them.
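The trade-off can be sketched with a basic average-memory-access-time calculation; all hit rates and latencies below are assumed round numbers, not measured Zen or Skylake values:

```python
# AMAT = L2_hit * L2_latency + L2_miss * (L3_hit * L3_latency + L3_miss * DRAM_latency)
# All figures below are assumed for illustration.
def amat_ns(l2_ns, l2_hit, l3_ns, l3_hit, dram_ns):
    beyond_l2 = l3_hit * l3_ns + (1 - l3_hit) * dram_ns
    return l2_hit * l2_ns + (1 - l2_hit) * beyond_l2

# Fat, private L2 doing most of the work (assumed figures)
print("fat private L2  :", amat_ns(l2_ns=4, l2_hit=0.90, l3_ns=14, l3_hit=0.50, dram_ns=80), "ns")
# Leaner L2 leaning on a big shared L3 (assumed figures)
print("lean L2, fat L3 :", amat_ns(l2_ns=3, l2_hit=0.80, l3_ns=12, l3_hit=0.70, dram_ns=80), "ns")
```

Which balance wins depends on workload locality and on how expensive the shared L3 becomes to reach once it has to sit on a big mesh rather than a short ring.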
juanrga :
Latency is the reason why AMD targets higher clocks with turbo mode. It is also the reason why EPYC with higher-clocked RAM runs better due to reduction of both memory and IF latencies.
Latency is the reason why AMD locked the IF clock to the RAM clock. Otherwise, engineers would have introduced intermediate buffers between clock domains, and those buffers would increase latency.
This fails to acknowledge that it's only because AMD has a much, much higher-throughput IF solution consisting of multiple IMCs. Latency took a back seat, compared with Intel's decision to hang everything off a single IMC outside the core. Again, and I stress that I said this before, latency is important but not all-important. Certain bits are limited by other means. More specifically, having one part that is granularly lower latency than another does not necessarily make for a lower-latency total system, but having a high-throughput part will ensure it is *possible* to reach low latencies by virtue of optimal conditions (see the crude queueing sketch below). Often the latter will get you to much lower latencies than the former, else we would not have multicore computers in the first place. The whole framework for threading, out-of-order execution, etc. adds significant internal latency to any computer compared to a single in-order design, but that's nothing compared to the advantages presented by the extra cores.
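The crude queueing sketch: total latency is roughly unloaded latency divided by (1 − utilisation), so the higher-throughput link wins under load even if its unloaded latency is worse. Both links below are hypothetical, not IF or UPI measurements:

```python
# Crude queueing approximation: loaded latency ~ unloaded latency / (1 - utilisation).
# Bandwidth and latency figures are hypothetical, not fabric measurements.
def loaded_latency_ns(unloaded_ns, link_gbs, offered_gbs):
    util = offered_gbs / link_gbs
    assert util < 1.0, "link saturated"
    return unloaded_ns / (1.0 - util)

OFFERED_GBS = 40.0  # traffic the cores are trying to push (assumed)

# Lower unloaded latency, but little headroom
print("narrow  50 GB/s link:", round(loaded_latency_ns(60, 50.0, OFFERED_GBS), 1), "ns")
# Worse unloaded latency, but lots of headroom
print("wide   150 GB/s link:", round(loaded_latency_ns(80, 150.0, OFFERED_GBS), 1), "ns")
```

The narrow link is faster when idle and far slower once the cores actually load it, which is the "throughput makes low latency possible" point in one line of arithmetic.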
juanrga :
You mention "prediction and caches are the future, and all they do is improve throughput". This is wrong at two levels. First, they aren't the future, they are the present. Second they improve latency so much as throughput. Caches have much lower access times than main memory, which reduces latency when accessing data. Prediction algorithms ensure with some given probability that correct data/instructions are ready to enter the pipeline at the correct instant. When this doesn't happen, i.e. when the prediction fails, the cost is paid in the extra cycles needed to search the correct data/instructions, move the information to the pipeline and flush the incorrect data/istructions from the pipeline.
If you mean they are the present in current releases, sure they are, but that doesn't mean the major performance gains in parts over the next five years won't come down to improvements in prediction and caches.
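For instance, under the classic penalty model, even modest gains in prediction accuracy and cache hit rate move effective CPI a lot; the rates and penalties below are illustrative, not measured for any real core:

```python
# Classic penalty model:
# CPI = base + branch_freq * mispredict_rate * flush_penalty
#            + mem_op_freq * miss_rate * miss_penalty
# All rates and penalties are illustrative assumptions.
def effective_cpi(base, br_freq, mispredict, flush_cycles, mem_freq, miss, miss_cycles):
    return base + br_freq * mispredict * flush_cycles + mem_freq * miss * miss_cycles

today    = effective_cpi(0.5, 0.20, 0.05, 16, 0.30, 0.05, 200)
improved = effective_cpi(0.5, 0.20, 0.02, 16, 0.30, 0.02, 200)  # better predictor + caches
print(f"today:    {today:.2f} CPI")
print(f"improved: {improved:.2f} CPI ({today / improved:.2f}x faster at the same clock)")
```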
juanrga :
The same happens with consoles. Latency is king. Contrary to a widespread myth, GDDR5 doesn't have significantly worse latencies than DDR3. Moreover, the PS4 has cores clocked low (1.6GHz), which helps to reduce the latency gap between execution and the memory pool.
Adding a GDDR5 memory subsystem optimized for throughput doesn't affect Jaguar cores @1.6GHz the same way it affects Skylake cores @4.5GHz. However, even with low core clocks, several game developers have had latency problems with the PS4. Precisely the first thing that Sony did on the new Pro was to increase the clocks on cores and memory, reducing latencies by up to 30%...
This is all well-known.
Point number one is answered by point number two. PS4 GDDR5 CAS timing is in the high double to low triple digits, and I don't know the 'popular belief' stories, but I know my GPU (and probably yours) has much faster GDDR at much worse clock latencies. I am aware of GDDR's intrinsic strengths and that, dollar for dollar, you will find lower timings as a result of the duplex design, but even with a 30% improvement they lag well behind similarly clocked DDR3/4 (the cycles-to-nanoseconds conversion is sketched below). I am also aware that the Xbox has DDR3 RAM and a significant throughput deficit, yet a latency advantage further exacerbated by its super-low-latency ESRAM... and has uglier games and far more performance complaints as a result. The example I gave was commentary on that example, not on GDDR technology itself, but if you want commentary there, I'd say that large-batch memory transfer optimisation is a core reason why it works better in GPUs, and once L2+L3 caches are big enough I'm sure it will be the same story for CPUs. Both AMD and Intel are moving toward parallelism and throughput over latency below L2, and I believe AMD is in the lead there. Everything below L2 is just a cache.
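The conversion I'm leaning on is simply CAS cycles divided by the command clock; the CL/clock pairs below are hypothetical datasheet-style numbers, only to show that comparing CL figures across very different clocks means nothing without doing this:

```python
# Absolute CAS latency in nanoseconds = CL cycles / command clock (MHz) * 1000.
# The CL/clock pairs are hypothetical, shown only to illustrate the conversion.
def cas_ns(cl_cycles, command_clock_mhz):
    return cl_cycles / command_clock_mhz * 1000.0

print("CL11 @  800 MHz (DDR3-1600-class):", round(cas_ns(11, 800), 2), "ns")
print("CL16 @ 1600 MHz (DDR4-3200-class):", round(cas_ns(16, 1600), 2), "ns")
# Plug in whatever CL and command clock a given GDDR5 datasheet quotes
# and compare in nanoseconds, not in raw cycle counts.
```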
I expect BIOS patches and microcode will remove the limitations on clocking IF. Let's remember that TR shares more with EPYC than with Ryzen, and if Ryzen is anything to go by, AMD is actively working on optimising IF on a daily basis... and they were still late to market with 3200 RAM support because of IF.