AMD's Future Chips & SoC's: News, Info & Rumours.

gamerk316 · May 16, 2019

Yuka said:
https://cpu.userbenchmark.com/Compa...2-3734-N-vs-Intel-Core-i7-8700K/m697865vs3937

I'm going to compare based on the best case, and using the average clock speed recorded for that bench (3.35 for AMD, 5.3 for Intel). So consider these best case numbers.

Single Core: Perf = Clock * IPC

AMD:
107 = 3.35 * IPC
IPC = 31.94

Intel:
137 = 5.3 * IPC
IPC = 25.85

Relative Difference: AMD's per-clock score is ~6.1 points higher, which works out to ~23.6% faster.

Multi Core: Perf = Clock * IPC * Number_Cores

AMD:
1853 = 3.35 * 12(?) * IPC
IPC = 46.09

Intel:
1068 = 5.3 * 6 * IPC
IPC = 33.58

Relative Difference: AMD's per-clock score is ~12.51 points higher, or ~37% faster [if we assume a 16 core part, the gap shrinks to 0.99 points (~3%) in AMD's favor. For an 8 core part, the per-clock score comes out over 69, more than double Intel's, so I doubt that's the number of cores we're seeing...]

Again, solving for relative IPC, and ignoring all HTT effects. Regardless, if the recorded clocks we're seeing are correct, then AMD has actually passed Intel in single-core IPC.
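The arithmetic above can be re-run as a quick script. This is only a sketch of the same back-of-the-envelope model (score = clock * IPC * cores); the function names are made up for illustration, and the relative difference is taken as a ratio of the two per-clock figures rather than a subtraction, which puts the single-core gap at roughly 24%.

```python
def per_clock_score(score, clock_ghz, cores=1):
    """Benchmark points per GHz per core, from score = clock * IPC * cores."""
    return score / (clock_ghz * cores)


def relative_gain_percent(a, b):
    """How much larger a is than b, in percent."""
    return (a / b - 1.0) * 100.0


# Single core, using the scores and average clocks quoted in the post.
amd_single = per_clock_score(107, 3.35)
intel_single = per_clock_score(137, 5.3)
single_gain = relative_gain_percent(amd_single, intel_single)

# Multi core: the AMD core count is a guess, so try 8, 12 and 16.
intel_multi = per_clock_score(1068, 5.3, cores=6)
multi_gain_by_cores = {
    cores: relative_gain_percent(per_clock_score(1853, 3.35, cores), intel_multi)
    for cores in (8, 12, 16)
}

print(f"single core: AMD {amd_single:.2f} vs Intel {intel_single:.2f} "
      f"({single_gain:+.1f}%)")
for cores, gain in multi_gain_by_cores.items():
    print(f"multi core, assuming {cores} AMD cores: {gain:+.1f}%")
```

Like the original numbers, this ignores HTT/SMT and assumes the benchmark score scales linearly with clock and core count.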

InvalidError · May 16, 2019

gamerk316 said:
And as I've long noted, your maximum parallelism is limited by Amdahl's Law. There's only so much you can do by adding more threads, and there's a point where the necessary thread scheduling/memory management starts to become an issue for non-trivial tasks.

Are you implying that scientists who write code that scales to 200 000+ cores are doing trivial tasks?

Some stuff scales better than others.

nuke20ar · May 16, 2019

gamerk316 said:
Again, solving for relative IPC, and ignoring all HTT effects. Regardless, if the recorded clocks we're seeing are correct, then AMD has actually passed Intel in single-core IPC.

I feel like passing Intel's IPC is such a pipe-dream that it's hard to believe it might happen. Especially given all the nay-saying we hear about IF being slower than Ring Bus, etc.

-Fran- · May 16, 2019

gamerk316 said:
The only real difference today as far as software design goes is the primary rendering thread is no longer the only thread that can do all the rendering; you can offload that to multiple different threads now as of DX12/Vulkan. (Note: DX11 allowed this under very constrained circumstances).

And as I've long noted, your maximum parallelism is limited by Amdahl's Law. There's only so much you can do by adding more threads, and there's a point where the necessary thread scheduling/memory management starts to become an issue for non-trivial tasks.

I doubt we'll get smooth 4k, simply because the GPUs won't be able to push it reliably. 4k/60 is probably out of reach for anything that's computationally intensive. 4k/30 is doable.

Also remember that software on consoles can make a lot of assumptions and optimizations that PCs can't, due to having one HW spec. That's why no one really expected the PS4 Pro/XB1 X, but those were necessary due to their CPUs being cripples (as I continue to note: The PS3 has a more powerful CPU than either current gen console). This generation was really hamstrung by using early gen APUs as their main processors.

https://github.com/megai2/d912pxy

That is for Guild Wars 2, a game that is DirectX 9 and from 2012. That plugin swaps the DX9 calls for a DX12 path. Check the initial comparison using the same exact game engine by just swapping the rendering path* of the same code.

You can make games be fluid. There's a BIG reason Microsoft bent over moving DX12 to Win7 for... Blizzard was it?

Cheers!

gamerk316 · May 17, 2019

Yuka said:
https://github.com/megai2/d912pxy

That is for Guild Wars 2, a game that is DirectX 9 and from 2012. That plugin swaps the DX9 calls for a DX12 path. Check the initial comparison using the same exact game engine by just swapping the rendering path* of the same code.

You can make games be fluid. There's a BIG reason Microsoft bent over moving DX12 to Win7 for... Blizzard was it?

Cheers!

Which is kinda expected? With DX9 most tasks are stuck with two threads doing >90% of the work. And if the CPU core they run on can't keep up, then you start to lose performance. Make the rendering thread smaller/distributed, and half that problem goes away.

(Also, do consider the possibility that the DX9 code path in modern GPU drivers is less than optimal compared to DX11/DX12. It's really just there for legacy applications at this point).

gamerk316 · May 17, 2019

nuke20ar said:
I feel like passing Intel's IPC is such a pipe-dream that it's hard to believe it might happen. Especially given all the nay-saying we hear about IF being slower than Ring Bus, etc.

IF is slower for low core counts. Once you start moving past four cores, however, a Ring Bus becomes a liability as the latency is dependent on how far away you are from the core you are attempting to communicate with. IF, by contrast, is going to offer about the same performance regardless of how many cores you have, even if its "best case" is a little bit slower than a Ring Bus on a Quad Core CPU.

InvalidError said:
Are you implying that scientists who write code that scales to 200 000+ cores are doing trivial tasks?

Some stuff scales better than others.

Simple example: Calculating Primes. From my perspective, calculating if a number is a Prime is "trivial". It's the attempting every single number in sequence to determine Primes that's expensive (it's a classic P/NP problem). That type of work scales; just throw a new thread on a core and move on. By contrast, game engines can't really be coded this way; things have to be done in sequence or based on external events, and everything needs to stay in sync (synchronization is the bane of threading). Sure, you can thread specific subsystems, but are still bound by the non-parallel and synchronized parts.
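For reference, the "bound by the non-parallel parts" limit is what Amdahl's Law puts a number on. A minimal sketch follows; the function name and the sample fractions are just illustrative, not measurements of any real engine.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when only parallel_fraction of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Illustrative fractions only: how quickly extra cores stop paying off.
for fraction in (0.50, 0.90, 0.99):
    row = ", ".join(
        f"{workers} cores: {amdahl_speedup(fraction, workers):.2f}x"
        for workers in (2, 8, 64)
    )
    print(f"{fraction:.0%} parallel -> {row}")
```

With 90% of the work parallel, the speedup can never reach 10x no matter how many cores are added, while an embarrassingly parallel job (fraction close to 1) keeps scaling.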

-Fran- · May 17, 2019

gamerk316 said:
Simple example: Calculating Primes. From my perspective, calculating if a number is a Prime is "trivial". It's the attempting every single number in sequence to determine Primes that's expensive (it's a classic P/NP problem). That type of work scales; just throw a new thread on a core and move on. By contrast, game engines can't really be coded this way; things have to be done in sequence or based on external events, and everything needs to stay in sync (synchronization is the bane of threading). Sure, you can thread specific subsystems, but are still bound by the non-parallel and synchronized parts.

Wait, that is completely wrong.

A problem being Non-Polynomial in cost has nothing to do with underlying implementation to solve said problem. Your logical solving cost being non-polynomial doesn't mean you can't solve the problem in parallel. Apples and Oranges there.

In fact, it is because of NP problems that massively parallel systems exist and heuristics are widely used.

Unless I'm massively misunderstanding the "NP" concept, I do not see it as something that affects how you can make software parallel. Sorting is a good example of it.

As for the Ryzen 3K IPC you calculated, I believe the scores are not linear, so the comparison cannot be made like that.

Cheers!

InvalidError · May 17, 2019

gamerk316 said:
IF is slower for low core counts. Once you start moving past four cores, however, a Ring Bus becomes a liability as the latency is dependent on how far away you are from the core you are attempting to communicate with. IF, by contrast, is going to offer about the same performance regardless of how many cores you have, even if its "best case" is a little bit slower than a Ring Bus on a Quad Core CPU.

The problem with IF is scaling: if you want minimal blocking from any N cores/CCX to any of the other N-1 cores/CCX, your switch fabric and associated power draw grows at a rate of N^2 instead of N for ringbus. If you want to keep the fabric's size manageable, you have to partition it into multiple tiers (CCX, chiplet, possibly separate edge and core tiers within an EPYC IO die) and you are back to having latency concerns, albeit with a different distribution. In both cases, software striving for the maximum total throughput needs to be designed to keep core proximity in mind when multiple threads have to either share data through caches or at least avoid stomping all over each other.
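A rough way to see the N^2 versus N point, under two simplifying assumptions that are mine and not a description of AMD's or Intel's actual designs: the non-blocking fabric is modeled as a plain N x N crossbar with one crosspoint per source/destination pair, and the ring is bidirectional with one link per stop. The function names are hypothetical.

```python
def crossbar_crosspoints(n):
    """Non-blocking N x N crossbar: one crosspoint per (source, destination)."""
    return n * n


def ring_links(n):
    """A ring needs one link per stop."""
    return n


def ring_average_hops(n):
    """Average shortest-path hops between distinct stops on a bidirectional
    ring with n >= 2 stops."""
    total = sum(min(d, n - d) for d in range(1, n))
    return total / (n - 1)


for n in (4, 8, 16, 32):
    print(f"N={n:2d}: crossbar crosspoints={crossbar_crosspoints(n):4d}, "
          f"ring links={ring_links(n):2d}, "
          f"ring avg hops={ring_average_hops(n):.2f}")
```

The crossbar cost grows with the square of N while its hop count stays flat; the ring stays cheap but its average hop count (a stand-in for latency) keeps growing, which is the trade-off both posts are describing.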

jdwii · May 17, 2019

Isn't Zen 2 using 8 cores per CCX now?

InvalidError · May 18, 2019

jdwii said:
Isn't zen 2 using 8 cores per ccx now?

We don't know yet, could be two CCX per chiplet.

jdwii · May 20, 2019

All I know is I use the 2700X as my daily driver and it's a beast, nothing like the trash AMD made with the FX series (not the good FX, the new Bulldozer crap).

Can't wait to see AMD close the gap when it comes to IPC and single-threaded performance, which is the only thing Intel has left.

Personally I expect AMD to finally beat Intel in IPC with Zen 2, but I don't think they will pass 4.5GHz on all cores.

I could be wrong; I have NO info besides the leaks we've all seen, but given the design goals I truly do expect Zen 2 to beat a 9900K clock for clock, core for core.

digitalgriffin · May 20, 2019

InvalidError said:
It is more about total throughput per watt than just core count. Having twice as many cores for the same performance and power budget is pointless since you now have twice as many threads to manage, more than twice as much inter-process communication overhead to optimize and it is this overhead that ultimately dictates how many threads the algorithm can hypothetically scale to. (I say hypothetical because the practical limit where incremental cost far exceeds incremental performance will come long before that.)

I agree. It all boils down to how many joules of energy it takes to complete task X. Some might balk at Rome's 1.7GHz base/2.4GHz boost as pathetically slow. Rome isn't designed to be a speed demon but an efficiency demon. Clocking the chip down leads to outsized gains in efficiency, because voltage can come down along with frequency. This leads to a lower total energy (joules) for a given task. And this is one part of the total cost of ownership that server people write to bean counters about.
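A toy model of the joules-per-task argument, using the textbook approximation that dynamic power is proportional to C * V^2 * f. Everything numeric here is a made-up placeholder (the voltage curve, the capacitance constant, the cycle count), and static/leakage power is ignored, so treat it as a sketch of the trend rather than data for any real chip.

```python
def task_energy_joules(cycles, freq_ghz, base_volts=0.4, volts_per_ghz=0.3,
                       capacitance=1.0e-9):
    """Energy to finish a fixed number of cycles at a given clock.

    Assumes (hypothetically) that the required voltage rises linearly with
    frequency and that dynamic power = capacitance * V^2 * f.
    """
    freq_hz = freq_ghz * 1e9
    volts = base_volts + volts_per_ghz * freq_ghz
    power_watts = capacitance * volts ** 2 * freq_hz
    seconds = cycles / freq_hz
    return power_watts * seconds


work = 1e12  # same task (cycle count) at every clock
for ghz in (1.7, 2.4, 4.0):
    print(f"{ghz} GHz: {task_energy_joules(work, ghz):.0f} J")
```

In this model the frequency cancels out of power * time, so the energy per task depends only on V^2: the slower clock takes longer but still finishes the job on fewer joules. Leakage power, which this sketch leaves out, pushes the other way and is why the real optimum is not "as slow as possible".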

Now as to thread efficiency and overhead, I will agree that setting up a thread and managing concurrent data race conditions are a concern to overall efficiency. But any half-decent programmer knows you set threads up once, put them in a sleep state and wait for a signal to start. Memory should be atomically isolated as well where possible. This is where things like chess moves/AI moves, NPC moves, game of life moves, cell analysis breakdown (weather models), asymmetric draw calls (multi-thread DX), compression, music, image & movie encoding and non-sequential encryption do well.

Create thread.
Sleep state
Task comes in and appropriate vars set for thread
Signal thread to start
Complete task
Signal
Sleep
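The steps above map fairly directly onto a queue plus an event. Below is a minimal single-worker sketch using Python's standard `queue` and `threading` modules; the `Worker` class and its method names are hypothetical, and CPython's GIL means this shows the set-up-once/sleep/signal pattern rather than real CPU parallelism.

```python
import queue
import threading


class Worker:
    """One long-lived thread that sleeps until a task is handed to it."""

    def __init__(self):
        self._tasks = queue.Queue()
        # Create thread (once, up front).
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            # Sleep state: get() blocks without spinning until work arrives.
            item = self._tasks.get()
            if item is None:  # shutdown sentinel
                return
            func, args, done, box = item
            try:
                box.append(func(*args))  # complete task
            except Exception as exc:  # keep the worker alive on task errors
                box.append(exc)
            finally:
                done.set()  # signal completion, then loop back to sleep

    def submit(self, func, *args):
        """Set the task's variables and signal the thread to start."""
        done = threading.Event()
        box = []  # private result slot, so tasks share no mutable state
        self._tasks.put((func, args, done, box))
        return done, box

    def shutdown(self):
        self._tasks.put(None)
        self._thread.join()


worker = Worker()
done, box = worker.submit(sum, [1, 2, 3])
done.wait()
print(box[0])
worker.shutdown()
```

Each task gets its own result slot, which is one way to keep memory isolated between tasks. The standard library's `concurrent.futures.ThreadPoolExecutor` packages the same idea for a pool of workers.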

Someone really needs to write a Knuth-like book for threading. Most programmers coming out of school these days aren't taught about efficiency or why some algorithms are more efficient than others at certain tasks.

Managed memory state apps do cut down on the most common bugs, but god are they inefficient.

digitalgriffin · May 20, 2019

gamerk316 said:
The only real difference today as far as software design goes is the primary rendering thread is no longer the only thread that can do all the rendering; you can offload that to multiple different threads now as of DX12/Vulkan. (Note: DX11 allowed this under very constrained circumstances).

And as I've long noted, your maximum parallelism is limited by Amdahl's Law. There's only so much you can do by adding more threads, and there's a point where the necessary thread scheduling/memory management starts to become an issue for non-trivial tasks.

I doubt we'll get smooth 4k, simply because the GPUs won't be able to push it reliably. 4k/60 is probably out of reach for anything that's computationally intensive. 4k/30 is doable.

Also remember that software on consoles can make a lot of assumptions and optimizations that PCs can't, due to having one HW spec. That's why no one really expected the PS4 Pro/XB1 X, but those were necessary due to their CPUs being cripples (as I continue to note: The PS3 has a more powerful CPU than either current gen console). This generation was really hamstrung by using early gen APUs as their main processors.

While this is somewhat true, it doesn't paint the entire picture. There are some tasks that do scale extremely well with a large number of parallel tasks, or we wouldn't have the monsters that Los Alamos and the like are building. It all depends on how atomic your task is. Some tasks are extremely atomic with large data sets. (SETI@home and Folding@home are two more examples.)

Now as to next gen consoles there will be several tricks played to scale to 4k. Some will be hardware. Some will be software.

The biggest crusher of APU performance is memory bandwidth. We see the benefits of fast on-chip memory (eDRAM on Iris Pro, eSRAM on the original Xbox One) in speed improvements. Iris Pro actually did quite well over standard Intel graphics despite being a crap architecture compared to AMD APUs.

Heat generation is really a concern also. Therefore a low-energy-cost memory solution is needed. So I'm guessing there will be either a 2 or 4GB HBM module on the chip package. This is especially true if reports of Navi heat issues are true.

There are also several types of memory used. AMD has been playing with the idea of heterogeneous memory for years now: different memory types all laid out like it's one big memory pool.

Why would you want to do this? It's simple. Switching memory models and copying memory from one buffer to another wastes efficiency. So what do you do? You allow all computing resources to access that memory like a big pool. Each memory type has its benefits and drawbacks. System memory is cheap, slow and high density. GDDR memory is moderately fast, generates heat, and is less dense. Local eDRAM or SRAM memories are fast but eat up a ton of space.

Then there is cache and cache coherency, which the CCXs deal with.

So basically AMD's future APUs are going to have both HBM (on package, maybe stacked) and SRAM (on the IO die). The SRAM on the IO die is just big enough for draw calls. What's a draw call? A draw call is a set of directions set up by the CPU side to tell the GPU what to draw and how (e.g. draw a race car). In terms of memory they are relatively small. But a past bottleneck was the transfer of a standard memory draw call buffer over the PCIe bus to the GPU, which must copy and process it. Using a shared memory approach eliminates a lot of that overhead.

Another bottleneck of the past was that draw calls were pretty much linear in nature. Render orc 1, then render orc 2. Then render Azeroth player 1. These draw calls are then composited into a final scene and the page flip called. AOS and the Vulkan/Mantle APIs showed us what we can accomplish with mGPU and asymmetric multithreaded draw calls. They no longer have to be that linear in nature. The problem was no programming house really wanted to deal with the overhead of mGPU. I think I figured out AMD's secret sauce here, and specific deep mGPU programming might no longer be necessary.

For years people said chiplet rendering just wasn't possible because SLI/CrossFire proved it was inefficient. Every time I heard this counterargument I wanted to slap someone silly. It's not CrossFire/SLI. With some clever tile-based rendering and parallel draw calls, chiplets do make sense. I worked through the functional blocks and pseudocode with possible penalties for memory thrash/conflict. AMD's Infinity Fabric, unified memory, and caching address a lot of the memory thrash issue. Clever algorithms address the rest.

On the software side of things there will be tricks thrown in similar to the Xbox One X. Different objects will be rendered at different resolutions and then scaled. This is just one of a few tricks that make 4K/60 possible.

Anyway, I'm ranting. So I'll just shut up.

jdwii · May 24, 2019

So no one saw the amazing scores here for AMD's Zen 2 16C/32T?

Beating Intel's IPC by quite a margin. For some darn reason I feel like this is real, not fake; I think AMD aimed for IPC over frequency. Increasing frequency while continuing to shrink the fabrication process is going to be difficult.

https://www.forbes.com/sites/antony...-is-cinebench-score-really-true/#51439c116341

-Fran- · May 24, 2019

Cinebench scores alone don't tell the whole story, so context and perspective on them, no matter how positive and good they are, are of utmost importance.

While I agree with AMD putting ThreadRipper in a weird place, I've always thought of it as a weird and interesting experiment and not a long-term profitable product from AMD.

In fact, it would make a bazillion times more sense (now that they've committed to chiplets) to scale up Ryzen than to make ThreadRipper an even more niche product. Make an X595 platform with 4 channels and more PCIe lanes for the market and just make the comms chiplet better, keeping everything else equal. Could work? Thoughts?

Cheers!

InvalidError · May 24, 2019

Yuka said:
In fact, it would make a bazillion times more sense (now that they've committed to chiplets) to scale up Ryzen than to make ThreadRipper an even more niche product. Make an X595 platform with 4 channels and more PCIe lanes for the market and just make the comms chiplet better, keeping everything else equal. Could work? Thoughts?

The AM4 socket does not have enough pins to do that, so you'd still need a new CPU socket and imposing a socket large enough to support quad-channel memory with a heap of PCIe lanes would significantly increase motherboard cost for the rest of people who don't want to pay extra for such a CPU. Not to mention the PR hell Intel went through when it tried to sell desktop chips on HEDT platforms, people didn't take kindly to half of their motherboard's features being unavailable due to lack of CPU PCIe lanes.

The socket that doesn't make sense is TR4: same pin count as EPYC's SP3. May as well scrap TR4 in favor of adopting the SP3 pinout, effectively making ThreadRipper into a single-socket quad-channel EPYC and making the two lines otherwise interchangeable.

-Fran- · May 24, 2019

InvalidError said:
The AM4 socket does not have enough pins to do that, so you'd still need a new CPU socket and imposing a socket large enough to support quad-channel memory with a heap of PCIe lanes would significantly increase motherboard cost for the rest of people who don't want to pay extra for such a CPU. Not to mention the PR hell Intel went through when it tried to sell desktop chips on HEDT platforms, people didn't take kindly to half of their motherboard's features being unavailable due to lack of CPU PCIe lanes.

The socket that doesn't make sense is TR4: same pin count as EPYC's SP3. May as well scrap TR4 in favor of adopting the SP3 pinout, effectively making ThreadRipper into a single-socket quad-channel EPYC and making the two lines otherwise interchangeable.

Although it's a great thing to point out, I'm sure the PIN layout of AM4 can accommodate 2 more channels if they mux them.

If I'm wrong on my assumption though, you're absolutely on point. If they need to keep TR4 for Quad Channel, then scaling up Ryzen is dumb.

Cheers!

InvalidError · May 24, 2019

Yuka said:
Although it's a great thing to point out, I'm sure the PIN layout of AM4 can accommodate 2 more channels if they mux them.

Mux them with what? The main point of adding memory channels is to more easily accommodate the extra memory bandwidth demand of extra cores so they waste less time waiting for memory. If you 'mux' extra DIMMs over the existing two channels, similar to putting an FBDIMM-style buffer chip between the CPU and each set of slots, you still only have dual-channel bandwidth. I doubt the 1331-pin AM4 socket has ~250 power/ground pins to spare for re-purposing as two extra memory channels, and even if it could spare those pins, their location would be problematic as it would force the memory traces to cut across core power delivery, which is very bad for signal integrity and power delivery.

Retrofitting extra memory channels into an existing socket may be possible but not practical. With AM5 most likely bringing DDR5 to the mainstream in about two years, mainstream won't need more than dual-channel for a while longer. After that, we'll probably see 4-16GB on-package HBM or similar and off-package memory bandwidth won't matter much anymore.

-Fran- · May 24, 2019

InvalidError said:
Mux them with what? The main point of adding memory channels is to more easily accommodate the extra memory bandwidth demand of extra cores so they waste less time waiting for memory. If you 'mux' extra DIMMs over the existing two channels, similar to putting an FBDIMM-style buffer chip between the CPU and each set of slots, you still only have dual-channel bandwidth. I doubt the 1331-pin AM4 socket has ~250 power/ground pins to spare for re-purposing as two extra memory channels, and even if it could spare those pins, their location would be problematic as it would force the memory traces to cut across core power delivery, which is very bad for signal integrity and power delivery.

Retrofitting extra memory channels into an existing socket may be possible but not practical. With AM5 most likely bringing DDR5 to the mainstream in about two years, mainstream won't need more than dual-channel for a while longer. After that, we'll probably see 4-16GB on-package HBM or similar and off-package memory bandwidth won't matter much anymore.

I've been looking around for a proper map of the PIN layout and the closest I could find doesn't quite match AM4... Too bad I can't confirm or deny it, but I do have the impression AM4 could handle Quad Channel. I mean, I do get the trend of going "up" in PIN count for sockets, but you can put 4 channels in less than 1300 PINs. It all depends on what the others are used for.

Cheers!

InvalidError · May 24, 2019

Yuka said:
I mean, I do get the trend of going "up" in PIN count for sockets, but you can put 4 channels in less than 1300 PINs. It all depends on what the others are used for.

Look at any modern slot/socket interface, you will see that half if not more of the pins are either power or ground. The reason is simple: to maintain signal integrity and power stability, the reference levels (power and ground) have to have very low impedance across the system, and the only way to reduce said impedance across a connector or socket is to increase the number of power and ground pins. As the CPU and IO operating frequencies increase, power/ground pin counts have to increase to keep socket impedance in check; that's why pin counts are going up all the time even though external IOs and TDP remain (otherwise) mostly unchanged.

-Fran- · May 24, 2019

InvalidError said:
Look at any modern slot/socket interface, you will see that half if not more of the pins are either power or ground. The reason is simple: to maintain signal integrity and power stability, the reference levels (power and ground) have to have very low impedance across the system, and the only way to reduce said impedance across a connector or socket is to increase the number of power and ground pins. As the CPU and IO operating frequencies increase, power/ground pin counts have to increase to keep socket impedance in check; that's why pin counts are going up all the time even though external IOs and TDP remain (otherwise) mostly unchanged.

Not wrong, but that doesn't go against what I said. It all depends on the availability of the PINs. The IMC would definitely need changing and would consume more power, but I wonder if current power signaling would work anyway? I mean, they made it work with the current chiplet design and pre-chiplets (you could argue Ryzen 2K and 1K are not chiplets?) with different power requirements per chiplet (I'd assume), so there is a level of re-usability that shouldn't be underestimated.

Cheers!

InvalidError · May 24, 2019

Yuka said:
It all depends on the availability of the PINs. The IMC would definitely need changing and would consume more power, but I wonder if current power signaling would work anyway? I mean, they made it work with the current chiplet design and pre-chiplets (you could argue Ryzen 2K and 1K are not chiplets?) with different power requirements per chiplet (I'd assume), so there is a level of re-usability that shouldn't be underestimated.

Chiplets don't change how memory, PCIe and power pass through the socket. Re-purposing ~200 Vcore power pins to implement two extra memory channels on the other hand completely screws things up. Why only Vcore? Because you are still going to need to interleave ground pins with memory data/control for signal integrity reasons. If you scrap nearly half of your Vcore pins, then you lose half of your power delivery. Goodbye core count, clock frequency and stability.

Go read "High-Speed Digital Design: A Handbook of Black Magic" and you might understand why high-speed chip packages, slots and sockets have a ridiculous number of ground pins.

-Fran- · May 24, 2019

InvalidError said:
Chiplets don't change how memory, PCIe and power pass through the socket. Re-purposing ~200 Vcore power pins to implement two extra memory channels on the other hand completely screws things up. Why only Vcore? Because you are still going to need to interleave ground pins with memory data/control for signal integrity reasons. If you scrap nearly half of your Vcore pins, then you lose half of your power delivery. Goodbye core count, clock frequency and stability.

Go read "High-Speed Digital Design: A Handbook of Black Magic" and you might understand why high-speed chip packages, slots and sockets have a ridiculous number of ground pins.

I know why, that's why I'm curious about what I said. Is it another reason why AMD hasn't disclosed any socket PIN diagrams?

Also, power delivery is more or less isolated to sub-elements of the CPU; the IMC has its own power delivery mechanism, so that would need changing (which it has all the time, BTW). I was thinking more about that: since the IMC is in the separate IO chiplet, it is completely separated from the CPU chiplets, no? All the more reason my idea may fly 😀

I mean, they can slap HBM on package using the same PIN count, why not add 2 more channels for memory? xD

Cheers!

InvalidError · May 24, 2019

Yuka said:
I mean, they can slap HBM on package using the same PIN count, why not add 2 more channels for memory? xD

Adding HBM on package has no effect on the number of signals that have to pass through the socket. At most, it may add a few layers to the package substrate to route the extra signals between the CPU and HBM. Adding an external memory interface on the other hand requires those ~100 extra signals per channel, one extra Vmemio pin for every 3-4 memory controller pins and re-arranging ground pins to provide the bunch of extra grounds those memory pins will also require.

You can mess around with what is on-package as much as you want as long as it doesn't change what passes through the socket beyond what the socket was designed and pinout-planned for.
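Taking the figures in this post at face value (~100 signals per channel, one Vmemio pin for every 3-4 memory pins), the pin budget for two extra channels can be tallied as below. The 3.5 midpoint and the function name are assumptions for illustration, and the extra ground pins are deliberately left out because the post gives no ratio for them.

```python
import math


def extra_channel_pins(channels, signals_per_channel=100,
                       signals_per_vmemio=3.5):
    """Rough socket pin count for added memory channels, before grounds."""
    signals = channels * signals_per_channel
    vmemio = math.ceil(signals / signals_per_vmemio)
    return signals, vmemio, signals + vmemio


signals, vmemio, total = extra_channel_pins(2)
print(f"signals={signals}, Vmemio={vmemio}, total before grounds={total}")
```

That lands in the same neighborhood as the ~250 pins mentioned earlier in the thread, with the additional ground pins still to be found on top of it.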

rigg42 · May 24, 2019

Would a shrunken I/O die, a 7nm 8-core chiplet, and 16GB of HBM2 actually be feasible on an AM4 package? That would seemingly be a monstrous gaming CPU. I'm sure this could be pulled off with 3D stacking down the road on a new socket. Probably along with a pretty decent iGPU.
