AMD's Future Chips & SoC's: News, Info & Rumours.

Status
Not open for further replies.
It does sound interesting, AMD making a 6-core CCX, but I wonder how they'll do L3 with such an arrangement. Currently, the 4-core CCX has L3 right in the middle, which keeps the latency to it low with a simple design, so a 6-core CCX would negate some of that simple approach. Would they make L2 bigger and put the L3 elsewhere?

It's an interesting prediction, nonetheless. I'm just wondering how the barriers to keeping the design simple would be dodged.

Cheers!
 


That would make sense, yes. I would imagine I.F. does not impose any sort of design limitation on how to "orientate" the cores and their intra-core communication paths.

Still, I'd love to see a "flower" arrangement, rofl. A 2x3 block arrangement looks boring, haha.

Cheers!
 


...because they can. and it is amd's big market differentiation thing.
the drag of non-multithreaded coding is noted.

it would be good if they could focus on ipc and clocks. but their scalable core thing will be there.8 cores can be 16 cores.

but yeah. if they can get 8 mega cores at 7nm with amazing ipc and clocks. the bulk of the market will be much more interested in that in mainstream pc. --- sure as * INTEL will be doing this with their ice lake 8 core. amd needs to compete.

the TR core is the HEDT and server thing....starship etc.
 
There is no "TR core". There is only Zen core, which is used in Zeppelin die, which is used in RyZen, ThreadRipper, and EPYC. This same Zen core will be also used in the RR die for the future APUs for mobile and desktop.

If moar cores were the only important thing, then AMD would not have wasted time and money on developing Zen. AMD would simply have ported Excavator or Jaguar to 14nm and shipped, for instance, a 32-core Jaguar CPU for desktop and mobile. Instead AMD developed Zen, focusing on increasing both IPC and single-thread performance.

16-core is useless for mainstream users. In fact it is useless for most enthusiast users, as reviews have shown very little desktop software can scale above 10-cores.

Even 8-core is too much for mainstream users and AMD has developed a 4-core die for mainstream users. This die will be used in Raven Ridge chips.

The transition from 14LPP to 7LP will bring about twice the density. Ideally AMD would use all this extra space to improve the IPC of Zen2 so that everyone would benefit from it, but engineers cannot do that. Instead they will use most of that extra space to add moar cores to the die. The main goal is server, which can use many cores. As mentioned above I expect a 6-core CCX. Thus the 32-core Naples will be replaced by 48-core Starship. The transition from a 4-core APU to a 6-core APU will be very interesting for mainstream users, but 12 cores on the desktop will be useless for most users.
 


of course they can do that. more transistors at 7nm, per core performance can easily be increased. and yes i'm aware of the scalability of zen cores. and yes by your own words higher core count is useless for mainstream desktop.
i think you're assuming that AMD will not compete in the desktop market at all, assuming zero per core performance increase and just touting the INTEL fanbois mantra of AMD's 'moar cores'

if its a 6 core ccx, they can still have whatever core count they want, as now, by disabling cores. 6, 8, 12, 16, 18...whatever.
 
If AMD uses two 6-core CCX complexes they could make a 12-core part, which they could possibly sell as 12, 10, 8, 6, and 4 cores in basically the same way they do now. Core counts are moving up for both Intel and AMD because, given the costs involved in the shrinking process, this is an easy way to increase overall performance in many different metrics while claiming 50% more performance without increasing IPC, like Intel just did with 8th gen. Now, we would assume AMD will continue to work on a few things:

1. Infinity fabric speed and stability
2. Reducing internal latency
3. Increase single core IPC
4. Reducing power consumption

I think it's worth noting the performance of the i5 8400 shows us high clock speeds are not needed to achieve high FPS.
 
Well, in my opinion, increasing the core count is not a viable option for AMD. Right now they really focus on decreasing latency and increasing both IPC and frequency. With this they will have more effective cores, which will help them increase the efficiency of single-core computing and also multi-core scores, where they are already well ahead of Intel.
Increasing the core count, on the other hand, will only make them better in an area where they already have the lead. And this area has a low market share. If AMD can increase their IPC, they can sell more product in the "casual user" market.
 
Interesting read.

Even though TR is behind Skylake-X, you would think (according to reviews) the gap was going to be bigger. AMD is behind in IPC and process, so for them to actually consume less power at the same clocks while maintaining a decent 2nd place is no small feat in my book. If they really close the process gap, they'll just need to work on their IPC. Intel must be really throwing money at their manufacturing fabs now.

Plus, this quote: "However, the kicker is, you would also be paying 78% more for the 7960X, meaning you would be getting 30% more performance for 80% more cost". Caveat being, if you have a power plant as a PSU, you can still OC the 7960X higher than TR and get more performance out of it.
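
To put that quote in perf-per-dollar terms, here is a minimal sketch that uses only the two rounded figures from the quote (30% more performance, 80% more cost). The function name is made up for illustration, and overclocking headroom and platform costs are ignored.

```python
def relative_perf_per_dollar(perf_gain: float, cost_gain: float) -> float:
    """Perf-per-dollar of the pricier chip relative to the cheaper one.

    Both arguments are fractional increases (0.30 means +30%).
    """
    return (1.0 + perf_gain) / (1.0 + cost_gain)


# Rounded figures taken from the quote above: +30% performance, +80% cost.
ratio = relative_perf_per_dollar(0.30, 0.80)
print(f"7960X perf per dollar relative to 1950X: {ratio:.2f}")
```

A ratio below 1 means the more expensive chip delivers less performance per dollar, even though it is faster in absolute terms.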

Cheers!
 
Let me clarify my point, because it has been misunderstood.

I expect AMD to develop Zen2 around a 6-core CCX module. So I expect AMD to release a 12-core die for CPU and 6-core die for APUs; two CCX in the first die and one CCX in the second die. I expect Zen2 ThreadRipper CPUs to be up to 24-core (two dies of 12-core each). I expect Starship CPUs to be up to 48-core (four dies of 12-core each).

Different models will be obtained by disabling cores. For instance I expect the cheapest Zen2 CPU to be a 6-core CPU (3+3).
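
A rough sketch of which core counts such a lineup could offer. It assumes cores are disabled symmetrically (the same number of active cores in every CCX, as in the 3+3 example); that symmetry is an assumption here, not something confirmed for Zen2, and the function name is hypothetical.

```python
def symmetric_configs(cores_per_ccx: int, ccx_per_die: int, dies: int) -> list[int]:
    """Total core counts reachable by enabling the same number of cores in every CCX."""
    total_ccx = ccx_per_die * dies
    return [active * total_ccx for active in range(1, cores_per_ccx + 1)]


# Speculated Zen2 layout from the post: 6-core CCX, two CCX per die.
desktop = symmetric_configs(6, 2, 1)       # one die
threadripper = symmetric_configs(6, 2, 2)  # two dies
starship = symmetric_configs(6, 2, 4)      # four dies
print(desktop, threadripper, starship)
```

The 3+3 part mentioned above is the entry with three active cores per CCX on a single die.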

I expect Zen2 to bring about 7% higher IPC than Zen. What I said is that AMD engineers cannot use all the extra space given by the 7LP node to increase the IPC of each core. They cannot by virtue of the IPC wall. That is the reason why AMD engineers will use only a small amount of that space to increase the IPC of Zen2 cores, whereas most of the extra space will be used to add more cores to the die. As stated above I expect the 4-core CCX to be replaced by a 6-core CCX. I.e. I am expecting 50% moar cores.

What I said is that whereas 6-core Zen2 APU will be relevant for 99% of users, those future 12-core CPUs will be useless for most users, because most desktop code doesn't scale to many cores. One can simply compare reviews of 12-core ThreadRipper vs 8-core RyZen and see that overall ThreadRipper isn't 50% faster.
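
One way to see why 50% more cores rarely means 50% more speed is Amdahl's law. The post does not name it, so take this as an illustrative sketch: the parallel fractions below are made-up inputs, and clock differences between the chips are ignored.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup over one core when only parallel_fraction of the work scales with cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def gain_12_over_8(parallel_fraction: float) -> float:
    """How much faster 12 cores are than 8 cores for the same workload."""
    return amdahl_speedup(parallel_fraction, 12) / amdahl_speedup(parallel_fraction, 8)


# Illustrative parallel fractions, not measurements.
for p in (0.5, 0.9, 1.0):
    print(f"parallel fraction {p:.0%}: 12 cores are {gain_12_over_8(p) - 1:.0%} faster than 8")
```

Only a perfectly parallel workload gets the full 50%; anything with a serial portion gets less.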

AMD will not be doing 12-core Zen2 dies because they benefit most desktop users. AMD will be doing 12-core Zen2 dies because of (i) the IPC wall and (ii) the fact that those dies will be used in servers, where workloads scale to many cores.
 


The IPC gap is higher than 30%. Since the relation between power and IPC is nonlinear, the Zen cores would be expected to consume much, much less power than they actually do. The difference in process node plays a fundamental role, but I don't think it can fully explain the difference in efficiency between both chips. That is, I think that a Zen core on the same process node as the Intel core would still be less efficient.

Also performance is not a linear function of costs. 30% more IPC costs around 70% more in design complexity.
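
The 30% / 70% pairing is close to what Pollack's rule of thumb gives (performance grows roughly with the square root of complexity). The post does not name that rule, so its use here is an assumption, and the function name is hypothetical.

```python
def complexity_for_perf(perf_ratio: float) -> float:
    """Relative complexity needed for a relative performance target.

    Assumes performance ~ sqrt(complexity), i.e. Pollack's rule of thumb.
    """
    return perf_ratio ** 2


extra_complexity = complexity_for_perf(1.30) - 1.0
print(f"+30% performance -> about +{extra_complexity:.0%} complexity")
```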
 


In integer intensive tasks, TR is actually better than Skylake-X (according to the link). So it's only for FPU type of workloads that Intel has a clear cut advantage with a huge gap. Hence why in most "mixed load" applications the gap is not so pronounced and TR comes as close as it does in both performance and perf/watt.

Don't make it look like it's clear cut, because this is a deep rabbit hole.

Cheers!

EDIT: This went under the radar for me at least, but it is interesting none-the-less. Maybe this is another of the reasons Intel is trying to push 10nm parts now and see how it goes: https://www.anandtech.com/show/11946/samsungs-8lpp-process-technology-qualified-ready-for-production
 


The worst case is a tie, 128 vs 128, on the AIDA integer test. SPECint scores (not shown in this review), Dhrystone scores (not shown in this review) and AVX2/AVX512 integer workloads (not shown in this review) demonstrate Skylake-X is also much faster in integer code. For instance in Dhrystone the 16C Skylake does about 662 whereas the 16C ThreadRipper does less than 500.


Moreover Intel wins in memory latency tests.
 
Let's not overhype the comparison. This review is an outlier favoring Intel compared to many other reviews. Gamers Nexus, who tested the 7960X against the 1950X, showed huge power consumption differences at the EPS (upwards of 50% more when overclocked) and used a delidded 7960X.
Let's look at the test set up first:
Cooler: Enermax TR Liquitec 360 and Corsair H115i AIO Water Cooler
The 7960X used the 360mm cooling solution!
The Intel 7960X is clocked up to 4.0GHz using multiplier only overclocking. Memory speed is increased to 3200MHz per XMP, using the same exact kit that was used for the Threadripper testing. The Threadripper system also had its infinity fabric increased in speed by 50%, so I went ahead and overclocked the mesh speed from 2.4GHz to 3.2GHz, which is a 33% increase. An input voltage of 2.1v was used, and LLC was turned on. A fan was blowing right over the VRMs, and the radiator fans were always blowing full speed.
He states the Infinity Fabric was overclocked; this is a by-product of the uArch (which was deemed by most a weakness and is now viewed as an unfair advantage) by virtue of simply putting in a stick of RAM and having it work at its rated speed. BIOS settings had to be manipulated to overclock the mesh! Intel's VRMs had to have a fan blowing right over them, and the radiator fans blowing at full speed, just to get this thing to run at stock speeds without thermal throttling. This has been demonstrated in multiple reviews!

https://www.tweaktown.com/articles/8379/amd-threadripper-vs-intel-core-i9-cpus-clock/index2.html

Now let's look at the benchmarks, and what all went into those overall numbers even after seeing the test setup heavily favors Intel with solutions that most users will not see out of the box!
8379_04_amd-threadripper-vs-intel-core-i9-clock.png

@4GHz Single thread performance difference is 12 points or 7% in favor of Intel, and multithread of 75 points or 2% in favor of Intel. Let's not forget the quote that comes later:
The price gap remains, so the Intel costs roughly 80% more.
The cooling involved a 360mm radiator and a fan blowing on the VRMs, and the relatively small performance gains are not worth an 80% higher cost. But that's just the first test! Let's look at the second test!
8379_05_amd-threadripper-vs-intel-core-i9-clock.png

@4GHz there is a 1.5 point or 3% gain over the 1950X
8379_06_amd-threadripper-vs-intel-core-i9-clock.png

Here we are with one of the tests that are factored into overall performance and that was used to come up with the 24% productivity and 31% overall! Really?
8379_07_amd-threadripper-vs-intel-core-i9-clock.png

@4Ghz the 1950X is 8% faster at copy and 13% faster at write while being 15% slower at read. Still not seeing that 24% better in productivity!
We see that clock for clock AMD's Threadripper equals Intel's Skylake-X in 64-bit integer IOPS, but in SP FLOPs it is half as fast, but we saw the same type of activity at stock. Memory bandwidth is an interesting thing, AMD's reads and writes are faster, but Intel's copy is faster. Intel's memory latency is superior, but that is because Intel doesn't have an Infinity Fabric
8379_09_amd-threadripper-vs-intel-core-i9-clock.png

@4GHz AMD is 3% faster than Intel.
8379_10_amd-threadripper-vs-intel-core-i9-clock.png

And here we go at 720p transcoding we see Intel gain 18% over AMD. They throw in a benchmark that greatly favors Intel to help Intel get that 24% better productivity. I wonder how many people buying $1,000 plus CPU's will be doing 720p transcoding? They don't even mention 720p in the commentary.
AMD had the lead in Handbrake 4K encoding at stock and maintains it in our clock for clock tests, but the margin is smaller. Intel had the lead in Handbrake transcoding, and the margin is still maintained when they are equalized.
8379_11_amd-threadripper-vs-intel-core-i9-clock.png

@4GHz we see Intel take the lead in overall performance by 17%, still not seeing that 24% productivity performance increase here.
8379_12_amd-threadripper-vs-intel-core-i9-clock.png

@4GHz Intel takes away a good win here performing 25% better. And that concludes all the "productivity performance" benchmarks. So, the majority of that 24% increase over AMD comes from synthetic benchmarks affected by memory latency.
In ScienceMark Intel's offering is faster. In SuperPI we see Intel's offering is significantly faster, and that isn't just because SuperPI is a single core benchmark, it also greatly relies on memory latency (where Intel shines).
And I still don't see where they get that 24% productivity performance increase! They must really give a lot of weight to Aida64 FPU Test for single-precision FLOP's where it out performed AMD by 50%!
Read more: https://www.tweaktown.com/articles/8379/amd-threadripper-vs-intel-core-i9-cpus-clock/index4.html
Now let's take a look at the same type of difference at the 4GHz clock for clock. The price gap remains, so the Intel costs roughly 80% more. We see Intel offer roughly 31% overall, 36% gaming, and 24% productivity performance increases over AMD. Compared to stock, Intel's margins increase 2% overall in overall performance and 8% in gaming performance. However, Intel's margin in productivity decreases from 28% to 24%, meaning Intel's margin over AMD at 16 cores has decreased 4% in productivity applications. However, while Intel's power consumption was 12% higher at stock, it's now 16% higher overclocked (includes idle and load). Putting the Intel 7960X and AMD 1950X head to head, clock for clock, reveals three major trends. Intel's performance gains over AMD's decrease in productivity applications, but increase by a larger margin in gaming applications, resulting in a slight increase in overall performance. At the same time, Intel's power consumption increases 4% while overall performance margins only increase 2%-3%. There are other factors to take into account, and this article isn't about telling you which CPU to buy, rather it's looking at how AMD's microarchitecture is doing against Intel's in the high-end desktop segment, so far, quite good.

This is another heavily Intel-biased review, using better cooling, BIOS manipulation, and cherry-picked benchmarks to favor Intel. The review remains an outlier.
 


Care to share links for them?
 


Since CPUs are instances of Latency Compute Units (as AMD mentions in the HSA specification), the more relevant benchmarks for a CPU are latency-sensitive benches. Throughput-sensitive benches such as rendering/encoding are better run on GPUs, FPGAs, vector units, and other Throughput Compute Units.



I am pretty sure they are giving the same weight to all the benches.



https://www.realworldtech.com/forum/?threadid=169894&curpostid=170012

https://www.overclockersclub.com/reviews/intel_core_i9_7980xe__core_i9_7960x/6.htm

http://www.sisoftware.eu/2017/06/23/intel-core-i9-skl-x-review-and-benchmarks-cpu-avx512-is-here/
 


First link: https://www.realworldtech.com/forum/?threadid=169894&curpostid=170038

That is the *old* aftermath of the problem found after AnandTech showed some erratic behaviour with GCC and AVX512 vs AVX2 stuff. Was there a problem with the compilers in the end, or not? But even with that, it's hardly evidence of what you're saying, and as Mr. Torvalds says, the numbers touted by Mr. Kanter are from Intel itself. Salt in big quantities is needed, especially when their own scores are all over the place (quoted in the same thread).

Second link: https://www.overclockersclub.com/reviews/intel_core_i9_7980xe__core_i9_7960x/3.htm

Testing Setup: Intel Socket 2066 18 & 16 Core

Processors: Intel Core i9 7980XE, Intel Core i9 7960X
CPU Cooling: Liquid cooling = EK Block and 360mm Radiator, D5 pump
Motherboard: MSI X299 Xpower AC
Memory: G.Skill Ripjaws V 3600MHz 32GB
Video Card: NVIDIA GTX 1080 8GB Founders Edition
Power Supply: Corsair RM1000x
Hard Drive: Corsair Force GT 240GB SATA 3
Optical Drive: Lite-On Blu-ray
Case: Corsair 780T
OS: Windows 10 Professional 64-bit

Testing Setup: AMD AM4 Ryzen 7

Processors: AMD Ryzen R7 1800X, R7 1700X, R7 1700
CPU Cooling: Corsair H110i
Motherboard: Gigabyte AX370-Gaming 5 Aorus
Memory: Corsair Vengeance 3000MHz 16GB
Video Card: NVIDIA GTX 1080 8GB Founders Edition
Power Supply: Corsair RM1000x
Hard Drive: Corsair Force GT 240GB SATA 3
Optical Drive: Lite-On Blu-ray
Case: Corsair 780T
OS: Windows 10 Professional 64-bit

They don't even test TR, so that can't be used as proof either, since it's missing TR. Unless you have a link with TR in it? Also, notice the different RAM and cooling used?

Third link has an interesting quote: "For heavy vectorised SIMD code – as long as it’s updated to AVX512 – there is no other choice."

It's interesting to see that a SIMD unit can be decomposed to emulate integer operations and actually be fast, since the old ALUs should be purpose specific. Going by that quote, AVX512 is the place to be only when you have specific operations that require 64/128-bit vectorized integer operations.

I'll be also providing this Venn diagram, since I found it nice to understand where Intel is standing with the "AVX512" marketing keyword:

https://twitter.com/InstLatX64/status/918796987352408064

Cheers!
 


What does a compiler regression have to do with hardware performance? If a new version of a compiler has some problem and reduces the performance on the same hardware, then one avoids that specific compiler version, or the flags generating the problem, until it is solved.

The scores given by Kanter are official SPEC submission rather than in house non-validated benches. The scores for the Xeon system are from Huawei, not from Intel. The scores for EPYC are from AMD itself. And they show that Skylake core is much faster than Zen core for integer stuff.

The overclockersclub link has Dhrystone benches for the i9-7960X and the i9-7900X. The SiSoftware link has Dhrystone benches for the i9-7900X and the 1950X. As mentioned in a former post:

For instance in Dhrystone the 16C Skylake does about 662 whereas the 16C ThreadRipper does less than 500.

So this is a second bench where Skylake-X core is much faster in integer code than Zen core.

SIMD units in Intel cores don't "emulate integer operations". There are vector ALUs that execute integer instructions. This is an old diagram for Haswell, relating ports and execution units

Microarchitecture_Haswell_IDF.png


AVX512 is something more than a "marketing keyword". AVX512 is a standard ISA in HPC and it starts to become a standard in servers as well

https://www.hpcwire.com/2017/06/29/reinders-avx-512-may-hidden-gem-intel-xeon-scalable-processors/

https://cloudplatform.googleblog.com/2017/02/Google-Cloud-Platform-is-the-first-cloud-provider-to-offer-Intel-Skylake.html
 


...and you know this how? a crystal ball? how can you possibly know about an IPC wall?

A lot of things that give you more power just require more transistors to build them. Wider buses scale the transistor count up in almost all processor components. High speed caches add transistors according to cache size. If you lengthen a pipeline you need to add stages and more complex control units. If you add execution units to help mitigate a bottleneck in the pipeline, each of those requires more transistors, and then the controls to keep the execution units allocated adds still more transistors.

The thing is, in an electronic circuit, everything happens in parallel. In the software world, the default is for things to be sequential, and software designers go to great pains to get parallelism built into the software so that it can take advantage of the parallel nature of hardware. Parallelism just means more stuff happening at the same time, so roughly equates to speed; the more things that can be done in parallel, the faster you can get things done. The only real parallelism is what you get when you have more transistors on the job.

First instructions are not necessarily "executed sequentially" even on a non-VLIW ISA, execution only needs to appear sequential. An in-order superscalar implementation can execute more than one instruction in parallel with another. To do this effectively the hardware for decoding instructions must be increased (widened), hardware must be added to ensure data independence of instructions to be executed in parallel, the execution resources must be increased, and the number of register file ports is generally increased. All of these add transistors.

An out-of-order implementation, which allows later instructions to be executed before earlier ones as long as there are no data dependencies, uses additional hardware to handle scheduling of instructions as soon as data becomes available and adds rename registers and hardware for mapping, allocating, and freeing them (more transistors) to avoid write-after-read and write-after-write hazards. Out-of-order execution allows the processor to avoid stalling.
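
The renaming idea can be sketched in a few lines. This is a simplified model, not how any particular core does it: instructions are hypothetical (dest, src1, src2) tuples, every write gets a fresh physical register, and registers are never freed.

```python
from itertools import count


def rename(instructions, num_arch_regs=4):
    """Map architectural registers to physical ones, one fresh register per write.

    Giving each write its own physical register removes write-after-read and
    write-after-write hazards; only true read-after-write dependencies remain.
    """
    fresh = count(num_arch_regs)                    # next unused physical register
    mapping = {r: r for r in range(num_arch_regs)}  # architectural -> current physical
    renamed = []
    for dest, src1, src2 in instructions:
        p1, p2 = mapping[src1], mapping[src2]       # read sources through the current map
        pd = next(fresh)                            # allocate a new destination register
        mapping[dest] = pd
        renamed.append((pd, p1, p2))
    return renamed


program = [
    (1, 2, 3),  # r1 = r2 op r3
    (2, 0, 0),  # r2 = r0 op r0  (write-after-read on r2 with the first instruction)
    (1, 0, 3),  # r1 = r0 op r3  (write-after-write on r1 with the first instruction)
]
renamed_program = rename(program)
print(renamed_program)
```

After renaming, the three instructions write three different physical registers and read only original ones, so they could execute in any order.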

The reordering of loads and stores in an out-of-order processor requires ensuring that stores earlier in program order will forward results to later loads of the same address. This implies address comparison logic as well as storage for the addresses (and size) of stores (and storage for the data) until the store has been committed to memory (the cache). (For an ISA with a less weak memory consistency model, it is also necessary to check that loads are ordered properly with respect to stores from other processors--more transistors.)
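
A toy model of store-to-load forwarding, assuming whole-word accesses only (real hardware also has to compare access sizes and handle partial overlaps). The class and method names are hypothetical.

```python
class StoreBuffer:
    """Holds stores that have executed but are not yet committed to memory."""

    def __init__(self, memory):
        self.memory = memory  # committed state, standing in for the cache
        self.pending = []     # (address, value) pairs in program order

    def store(self, address, value):
        self.pending.append((address, value))

    def load(self, address):
        # Address comparison: the youngest earlier store to the same address wins.
        for pending_address, value in reversed(self.pending):
            if pending_address == address:
                return value
        return self.memory.get(address, 0)

    def commit_oldest(self):
        # Retire the oldest store into memory.
        if self.pending:
            address, value = self.pending.pop(0)
            self.memory[address] = value


memory = {0x10: 1}
buffer = StoreBuffer(memory)
buffer.store(0x10, 5)
forwarded = buffer.load(0x10)  # served from the buffer, memory still holds the old value
print(forwarded, memory[0x10])
```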

Pipelining adds some additional control and buffering overhead and prevents the reuse of logic for different parts of instruction handling, but allows the different parts of handling an instruction to overlap in time for different instructions.

Pipelining and superscalar execution increase the impact of control hazards (i.e., conditional branches and jumps). Pipelining (and also out-of-order execution) can delay the availability of the target of even unconditional jumps, so adding hardware to predict targets (and direction for conditional branches) allows fetching of instructions to continue without waiting for the execution portion of the processor to make the necessary data available. More accurate predictors tend to require more transistors.
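
As a concrete example of a direction predictor, here is the textbook 2-bit saturating counter. It is a classroom scheme chosen for illustration, not what any shipping Intel or AMD core uses; the loop pattern is a made-up input.

```python
class TwoBitPredictor:
    """Single 2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self):
        self.state = 2  # start at "weakly taken"

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def accuracy(outcomes) -> float:
    """Fraction of branch outcomes predicted correctly."""
    predictor = TwoBitPredictor()
    hits = 0
    for taken in outcomes:
        hits += predictor.predict() == taken
        predictor.update(taken)
    return hits / len(outcomes)


# A loop branch: taken nine times, then falls through, repeated ten times.
loop_branch = ([True] * 9 + [False]) * 10
print(accuracy(loop_branch))
```

The counter only mispredicts the loop exit each time; an alternating taken / not-taken pattern would defeat it, which is why more accurate predictors need more state, and so more transistors.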

For an out-of-order processor, it can be desirable to allow a load from memory to execute before the addresses of all preceding stores have been computed, so some hardware to handle such speculation is required, potentially including a predictor.

Caches can reduce the latency and increase the bandwidth of memory accesses, but add transistors to store the data and to store tags (and compare tags with the requested address). Additional hardware is also needed to implement the replacement policy. Hardware prefetching will add more transistors.
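
The tag storage, tag comparison and replacement policy can be modelled in a few lines. This is a sketch assuming a tiny 4-set, 2-way cache with LRU replacement; those parameters and the address trace are arbitrary, and no data is stored, only tags.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Tag-only cache model that reports hits and misses."""

    def __init__(self, num_sets=4, ways=2, line_bytes=64):
        self.num_sets, self.ways, self.line_bytes = num_sets, ways, line_bytes
        # One OrderedDict per set: keys are tags, ordered least to most recently used.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, address: int) -> bool:
        """Return True on a hit; on a miss, fill the line (evicting the LRU tag if full)."""
        line = address // self.line_bytes
        index, tag = line % self.num_sets, line // self.num_sets
        lines = self.sets[index]
        if tag in lines:               # tag comparison
            lines.move_to_end(tag)     # mark as most recently used
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)  # replacement policy: evict least recently used
        lines[tag] = None
        return False


cache = SetAssociativeCache()
trace = [0, 64, 0, 256, 512, 0]
results = [cache.access(address) for address in trace]
print(results)
```

In this trace, addresses 0, 256 and 512 all map to the same set, so the third distinct line evicts address 0 and the final access misses again.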

Implementing functionality in hardware rather than software can increase performance (while requiring more transistors). E.g., TLB management, complex operations like multiplication or floating point operations, specialized operations like count leading zeros. (Adding instructions also increases the complexity of instruction decode and typically the complexity of execution as well--e.g., to control which parts of the execution hardware will be used.)

SIMD/vector operations increase the amount of work performed per instruction but require more data storage (wider registers) and typically use more execution resources.

(Speculative multithreading could also allow multiple processors to execute a single threaded program faster. Obviously adding processors to a chip will increase the transistor count.)

Having more transistors available can also allow computer architects to provide an ISA with more registers visible to software, potentially reducing the frequency of memory accesses which tend to be slower than register accesses and involve some degree of indirection (e.g., adding an offset to the stack pointer) which increases latency.

Integration--which increases the number of transistors on a chip but not in the system--reduces communication latency and increases bandwidth, obviously allowing an increase in performance. (There is also a reduction in power consumption which may be translated into increased performance.)

Even at the level of instruction execution, adding transistors can increase performance. E.g., a carry select adder adds upper bits twice in parallel with different assumptions of the carry-in from the lower bits, selecting the correct sum of upper bits when the carry out from the lower bits is available, obviously requiring more transistors than a simple ripple carry adder but reducing the delay in producing the full sum. Similarly a multiplier with a single row of carry-save adders uses fewer transistors (but is slower) than a Dadda (or Wallace) tree multiplier and cannot be pipelined (so would have to be replicated to allow another multiply to begin execution while an earlier multiply was in progress).
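
The carry-select trick can be checked with a bit-level sketch. Python runs everything sequentially, so this only shows the logic: in hardware the two upper-half additions happen at the same time as the lower half. Helper names are hypothetical and the split is a simple half/half.

```python
def ripple_carry_add(a_bits, b_bits, carry_in=0):
    """Add two little-endian bit lists one full adder at a time; return (sum_bits, carry_out)."""
    out, carry = [], carry_in
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return out, carry


def carry_select_add(a_bits, b_bits):
    """Add the upper half twice (carry-in 0 and 1), then select once the lower carry is known."""
    half = len(a_bits) // 2
    low, carry_low = ripple_carry_add(a_bits[:half], b_bits[:half])
    high0, carry_out0 = ripple_carry_add(a_bits[half:], b_bits[half:], 0)
    high1, carry_out1 = ripple_carry_add(a_bits[half:], b_bits[half:], 1)
    high, carry_out = (high1, carry_out1) if carry_low else (high0, carry_out0)
    return low + high, carry_out


def to_bits(value, width):
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))


sum_bits, carry = carry_select_add(to_bits(200, 8), to_bits(100, 8))
total = from_bits(sum_bits) + (carry << 8)
print(total)
```

The duplicated upper-half adder is exactly the "more transistors for less delay" trade described above.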

Microprocessors have advanced significantly in recent years; things like longer pipelines, predictive branching and on-chip cache have all added to the complexities associated with a processor.

Sure, the basics of CPU processing (fetch, decode, ALU, write) are still the same, but to speed things up, longer pipelines are used. Longer pipelines increase performance for continuous code execution, but also incur bigger penalties when the code branches, which damages performance. The remedy: predictive branching. Predictive branching is a trade secret; Intel does not normally disclose its full workings, and simply uses it to keep performance as high as possible on their CPUs.

Cache memory is much faster than RAM, but what to move from RAM into cache and from cache back to RAM? That is, again, proprietary stuff, and it again takes transistors to implement.

So the extra transistors go into things like the longer pipeline, predictive branch algorithms, cache memory, and memory algorithms.

This is without mentioning multi core processors, and shared memory/resource access controllers.
------
steady increase in single threaded performance has been achieved consistently year on year.

matching specific code to the cpu architecture is a factor.
 


The existence of an IPC wall, sometimes named the "ILP wall", was identified by researchers in the late 80s. It is the reason why HP and Intel tried to replace the nonscalable x86 ISA with a new scalable ISA based on a VLIW paradigm. It was known then that x86 was a dead end: the IPC couldn't increase forever. Elementary computer science books discuss this wall.

The HP/Intel attempt was a fiasco because their approach required a too smart compiler that didn't exist. There are some attempts to try that route again. For instance the Mill is an experimental CPU based in an evolution of the VLIW concept.

http://jakob.engbloms.se/archives/2004


EDIT since you edited your post:

The thing is, in an electronic circuit, everything happens in parallel. In the software world, the default is for things to be sequential, and software designers go to great pains to get parallelism built into the software so that it can take advantage of the parallel nature of hardware. Parallelism just means more stuff happening at the same time, so roughly equates to speed; the more things that can be done in parallel, the faster you can get things done. The only real parallelism is what you get when you have more transistors on the job.

Lots of stuff in a circuit happens in a sequential fashion. The problem is not in the software; the problem is that not everything can be parallelized. There are problems that are sequential. If your problem cannot be parallelized then the software has to be sequential.

But this is unrelated to my former point about the ILP wall. Multicores were born precisely because of this wall. Since it was impossible to use all the transistors provided by the newest process nodes to increase the performance of a single core, those extra transistors were used to replicate the cores. That is how the first dual core was born, and then quad-core, and then six-core and octo-core, up to the modern CPUs with 32 cores or more.

steady increase in single threaded performance has been achieved consistently year on year.

If you only pay attention to a small temporal window then the increase in serial performance looks linear or close to linear. When one increases the temporal window one can check the serial performance is approaching the wall.

CPU-Scaling.jpg
 
ive seen that graph. but it does not equate to real single threaded performance improvement year on year.

intel has a longer pipeline...
 


It shows the well-known "ILP wall" and "frequency wall", and since single thread performance is the product of both, single thread performance is approaching a wall.
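
The product relationship is easy to play with. In this sketch the 7% IPC figure is the Zen2 expectation stated earlier in the thread; the 5% clock gain is a purely hypothetical input, and the function name is made up.

```python
def single_thread_gain(ipc_gain: float, freq_gain: float) -> float:
    """Fractional single-thread gain when performance = IPC x frequency."""
    return (1.0 + ipc_gain) * (1.0 + freq_gain) - 1.0


# 7% IPC (expectation from the thread) combined with a hypothetical 5% clock bump.
gain = single_thread_gain(0.07, 0.05)
print(f"combined single-thread gain: {gain:.1%}")
```

If both factors flatten out, so does their product, which is the point being made about the wall.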
 