
News: Pics of Apple M1 Max Hint at Incoming Chiplet Designs

Is it really safe to resurrect a name from the era that also gave us Performa and Centris? :)

Although "Quadra" would be better than "M1 Max Pro X Ultimate" or some eye-roller like that.
 
Why stop at 4?

While perhaps it's not worth doing right now, such a strategy could allow an arbitrary number of SoCs to be tiled together, forming a 4x or 8x or even 16x uber-SoC. And depending on how forward-looking and versatile the design is, not every "sub" package has to be fully populated with every conceivable type of core. Perhaps there are designs that have mostly RAM, for example, or mostly GPUs, or mostly whatever.
This design principle might be further extended so that Apple could offer customizable options for the uber users that mirror everything the current Mac Pro ordering form offers, hardware-wise, plus even more (Machine-Learning specialization, anyone?). Perhaps the M1 series of Mac Pros won't offer it, but the M2 or M3 or M4... will.
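Purely to illustrate that tiling idea (a minimal sketch; the tile types, core counts, and configuration below are hypothetical, not anything Apple has announced):

```python
# Hypothetical sketch of a tiled "uber-SoC" described as a list of
# heterogeneous dies, where not every tile carries every kind of core.
# All tile types and counts here are invented for illustration.
from dataclasses import dataclass

@dataclass
class Tile:
    p_cores: int = 0      # performance CPU cores
    e_cores: int = 0      # efficiency CPU cores
    gpu_cores: int = 0    # GPU cores
    ram_gb: int = 0       # on-package DRAM

def package_totals(tiles):
    """Sum capability across all tiles in the package."""
    return Tile(
        p_cores=sum(t.p_cores for t in tiles),
        e_cores=sum(t.e_cores for t in tiles),
        gpu_cores=sum(t.gpu_cores for t in tiles),
        ram_gb=sum(t.ram_gb for t in tiles),
    )

# A made-up 4-tile configuration: two compute tiles, one GPU-heavy tile,
# and one tile that is mostly RAM.
compute = Tile(p_cores=8, e_cores=2, gpu_cores=32, ram_gb=32)
gpu_heavy = Tile(gpu_cores=64, ram_gb=16)
ram_only = Tile(ram_gb=128)

print(package_totals([compute, compute, gpu_heavy, ram_only]))
```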
 
Why stop at 4?

While perhaps it's not worth doing right now, such a strategy could allow an arbitrary number of SoCs to be tiled together, forming a 4x or 8x or even 16x uber-SoC. And depending on how forward-looking and versatile the design is, not every "sub" package has to be fully populated with every conceivable type of core. Perhaps there are designs that have mostly RAM, for example, or mostly GPUs, or mostly whatever.
This design principle might be further extended so that Apple could offer customizable options for the uber users that mirror everything the current Mac Pro ordering form offers, hardware-wise, plus even more (Machine-Learning specialization, anyone?). Perhaps the M1 series of Mac Pros won't offer it, but the M2 or M3 or M4... will.
The problem with doing this sort of thing is that you run into two issues:
  • Unequal latency if data needs to be shared across blocks
  • Infrastructure limits
The first issue becomes a problem once you start looking at more than 4 addressable units. Connecting all of the nodes directly to each other becomes impractical very fast, so instead you have to look at other topology options like a ring, a mesh, or some variation or combination of the two. Since latency is now variable, it can rob whatever performance you were hoping to achieve with multiple chiplets.

The second issue is a problem because the more elements you put on the package, the larger the I/O die has to be to support all of them. And then there's the question of what communication topology the I/O die itself uses.

I recommend reading this: https://www.anandtech.com/show/16930/does-an-amd-chiplet-have-a-core-count-limit
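As a rough illustration of why full connectivity stops scaling (idealized link and hop counts only, not tied to any real interconnect):

```python
# Idealized link counts and worst-case hop counts for three topologies.
# This is a back-of-the-envelope illustration, not a model of any real
# chiplet interconnect.
import math

def full_mesh_links(n):
    return n * (n - 1) // 2           # every node wired to every other node

def ring_links(n):
    return n                          # each node wired to its two neighbors

def mesh_2d_links(n):
    side = math.isqrt(n)              # assumes n is a perfect square
    return 2 * side * (side - 1)      # horizontal + vertical links

def worst_case_hops(n):
    side = math.isqrt(n)
    return {"fully connected": 1,     # always one hop away
            "ring": n // 2,           # halfway around the ring
            "2D mesh": 2 * (side - 1)}  # corner to opposite corner

for n in (4, 16, 64):
    print(f"{n} nodes: full={full_mesh_links(n)} links, "
          f"ring={ring_links(n)} links, mesh={mesh_2d_links(n)} links, "
          f"worst-case hops={worst_case_hops(n)}")
```

The link count for full connectivity grows roughly with the square of the node count (4 nodes need 6 links, 16 need 120), while ring and mesh keep the wiring manageable at the cost of variable hop counts, which is exactly the latency problem described above.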
 
Is it really safe to resurrect a name from the era that also gave us Performa and Centris? :)

Although "Quadra" would be better than "M1 Max Pro X Ultimate" or some eye-roller like that.

I had a Centris 650. It was awesome. Computers were insanely expensive then. They make Apple's current prices look downright thrifty. It was a "cheap" $600 motherboard swap from a IIvx.

Yes, there was a time when Apple made upgradeable computers and even let you purchase motherboard upgrades.

I also had two Performas. The Performa 6218CD and Performa 6400/180.

Their naming scheme is getting a bit complicated though.
 
This would honestly make a lot of sense. The only two non-Apple Silicon devices at this point are the iMac Pro and Mac Pro. Those two machines can use Intel's multi-socket Xeon systems (depending on the core count), so this would essentially be that replacement. The single-socket replacement likely gets the Duo, and the dual-socket would get the Ultra. The M1 Pro/Max IPC is the same as the M1's, so it would seem to follow what Apple did with the MBP, in that they just scale out the CPU cores and GPU processing units and call it done for gen 1 of Apple Silicon.
 
128 GPU units? Could this be the death of dedicated GPU cards?

In theory, such an SoC would outperform an RTX 3080!
Perhaps it could manage that level of performance, assuming performance were to scale well across multiple chips. But that would also result in a relatively large amount of heat output and power draw within the package, which would likely require lower clocks to keep manageable. A chip like that might not even be practical for laptops.
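As a back-of-the-envelope way to see that clock/power trade-off (the unit counts, clocks, and voltages below are invented for illustration, not figures for any real chip):

```python
# Dynamic power scales roughly with units * frequency * voltage^2.
# All numbers below are made up purely for illustration.
def relative_power(units, freq_ghz, volts):
    return units * freq_ghz * volts ** 2

def relative_throughput(units, freq_ghz):
    return units * freq_ghz

base = (32, 1.3, 0.9)          # hypothetical baseline GPU config
doubled = (64, 1.3, 0.9)       # 2x the units at the same clock/voltage
downclocked = (64, 1.0, 0.8)   # 2x the units, lower clock and voltage

for name, cfg in [("2x units, same clock", doubled),
                  ("2x units, downclocked", downclocked)]:
    power = relative_power(*cfg) / relative_power(*base)
    perf = relative_throughput(*cfg[:2]) / relative_throughput(*base[:2])
    print(f"{name}: ~{power:.2f}x power, ~{perf:.2f}x throughput")
```

Doubling the units at the same clock roughly doubles package power, while dropping clock and voltage claws much of that back (about 1.2x the power for about 1.5x the throughput in this made-up example), which is why a huge tiled GPU would likely have to run at lower clocks.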

And it probably wouldn't replace dedicated GPUs, considering you would need a use case for the 40 CPU cores that would come along with it. And of course, this is Apple, so expect the price to performance ratio to be terrible. And at least as far as gaming is concerned, you're probably not going to find many demanding games ported to the processor's ARM architecture for a while, and AAA game ports in general have been few and far between on Apple's platform.

And APUs with integrated higher-end GPUs already exist, like the AMD processors found in the PS5 and Series X. The Series X chip is essentially an 8-core Ryzen Zen2 CPU combined with graphics hardware that can perform nearly on-par with a desktop Radeon 6700 XT, or roughly around the level of a 3060 Ti / 2080 SUPER when compared against Nvidia's desktop cards, at least for rasterized rendering. And that's an inexpensive chip that manages to be part of a device that retails for $500 in its entirety (albeit with some subsidization to manage that price point). So, theoretically, something along the lines of that could also be said to make dedicated cards "obsolete", and no doubt AMD could make a version with 6900 XT / RTX 3080-level hardware, if they felt there was a market for it.
 
128 GPU units? Could this be the death of dedicated GPU cards?

In theory, such an SoC would outperform an RTX 3080!
If we're talking about hooking up a bunch of GPUs together, like some sort of SLI equivalent, unfortunately it doesn't really work that way. The biggest challenge with multi-GPU setups is how to distribute the workload in a way that minimizes latency. One might think it should be easy if CPUs can do it, but the workloads we tend to throw at CPUs are not time-sensitive. That is, it doesn't matter if the CPU takes 1 minute or 1 second to do the job. We'd prefer faster, but it's not that important. GPUs, on the other hand, are given workloads with strict deadlines. We demand at most 16ms per frame, and preferably a consistent time between frames. Anything that adds latency, like having to hop nodes, starts to put a strain on this.
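To put rough numbers on that 16ms budget (the hop counts and per-hop latency below are invented for illustration, not measurements of any real interconnect):

```python
# How quickly cross-die hops eat into a 60 FPS frame budget.
# Hop counts and per-hop latency are illustrative placeholders only.
FRAME_BUDGET_MS = 1000 / 60          # ~16.7 ms per frame at 60 FPS
HOP_LATENCY_US = 0.5                 # made-up cost of one cross-die hop

def budget_left_ms(hops_per_frame):
    overhead_ms = hops_per_frame * HOP_LATENCY_US / 1000
    return FRAME_BUDGET_MS - overhead_ms

for hops in (1_000, 10_000, 50_000):
    print(f"{hops:>6} hops/frame: {budget_left_ms(hops):6.2f} ms left for rendering")
    # A negative result means the frame deadline is already blown
    # before any actual rendering work is counted.
```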
 
If we're talking about hooking up a bunch of GPUs together, like some sort of SLI equivalent, unfortunately it doesn't really work that way. The biggest challenge with multi-GPU setups is how to distribute the workload in a way that minimizes latency. One might think it should be easy if CPUs can do it, but the workloads we tend to throw at CPUs are not time-sensitive. That is, it doesn't matter if the CPU takes 1 minute or 1 second to do the job. We'd prefer faster, but it's not that important. GPUs, on the other hand, are given workloads with strict deadlines. We demand at most 16ms per frame, and preferably a consistent time between frames. Anything that adds latency, like having to hop nodes, starts to put a strain on this.

This is not hooking GPUs together... nor is it SLI-like.
 
Connecting multiple M1 chips over an interconnect to "combine" their power is very much like any multi video card setup. Just because they're in the same package substrate doesn't mean they're suddenly a single cohesive unit.

No, it is not the same. SLI uses separate cards with a huge distance between the two GPUs/GDDR, much, much lower bandwidth between them... and huge latency. This is also visible in the performance difference between 2 CPUs on a dual-socket board vs. the same number of total cores in one chip.
 
128 GPU units? Could this be the death of dedicated GPU cards?

In theory, such an SoC would outperform an RTX 3080!

Not necessarily. The issue you run into with shared CPU and GPU packages is contention for the shared resources. This isn't a problem on a 128 GB machine, but on a 16 GB machine, 128 GPU units paired with 8 CPU units + 4 E units becomes a problem fast. This type of design has actually been used for a very long time on different systems over the last 30 years. The issue you get into is the cost of having enough memory at a fast enough speed to satisfy both under load. This isn't really a problem for Apple, since their buyers pay a premium, so they will get a configuration balanced for that, but in mid- and low-tier systems it's a big issue and it ultimately becomes cheaper to slap a discrete GPU in and call it a day.
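A rough way to picture that contention (all bandwidth figures below are illustrative placeholders, not real specs of any machine):

```python
# Shared-memory contention sketch: CPU and GPU pull from the same bus,
# so it's their combined demand that has to fit the available bandwidth.
# All GB/s numbers are invented for illustration.
def bus_utilization(cpu_demand_gbs, gpu_demand_gbs, bus_bandwidth_gbs):
    return (cpu_demand_gbs + gpu_demand_gbs) / bus_bandwidth_gbs

# A big GPU block on a modest memory system oversubscribes the bus...
print("modest memory system:", bus_utilization(40, 350, 200))  # > 1.0: contention
# ...while a wide (and expensive) memory system keeps both sides fed.
print("wide memory system:  ", bus_utilization(40, 350, 800))  # < 1.0: headroom
```

That extra-wide memory system is exactly the cost a premium configuration can absorb and a mid- or low-tier system can't.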
 
No, it is not the same. SLI uses separate cards with a huge distance between the two GPUs/GDDR, much, much lower bandwidth between them... and huge latency. This is also visible in the performance difference between 2 CPUs on a dual-socket board vs. the same number of total cores in one chip.
Except the video cards don't share data other than frame buffer data. Each video card has a copy of the contents in order to render its frames. And even first generation EPYC/Threadripper processors acted like separately socketed CPUs, despite being on the same package.

In any case, the problem with multi-video-card rendering isn't so much performance, it's pacing. If you have 4 GPUs working on one frame, they all have to sync up once they're done to present it to you. If one GPU finishes its portion faster, it can't just run ahead to the next frame, because it doesn't make sense to show the future when the present isn't even completed. This isn't a problem for non-real-time applications because, well, we don't care about what's going on until the job is done.
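A toy illustration of that pacing problem (the frame times below are invented; the point is only that presentation is gated by the slowest GPU):

```python
# Toy pacing model: with 4 GPUs splitting each frame, the frame can only
# be presented once the slowest GPU finishes, so one slow GPU shows up
# as a visible hitch even if the average looks fine.
# All per-GPU frame times are invented for illustration.
frame_times_ms = [
    [8.0, 8.1, 7.9, 8.2],     # frame 1: each GPU's share of the work
    [8.0, 8.0, 14.5, 8.1],    # frame 2: one GPU hit a slow path
    [8.1, 7.9, 8.0, 8.0],     # frame 3
    [8.2, 8.0, 8.1, 7.9],     # frame 4
]

present_intervals = [max(per_gpu) for per_gpu in frame_times_ms]
print("present intervals (ms):", present_intervals)
print("average interval (ms): ", sum(present_intervals) / len(present_intervals))
# The average (~9.8 ms) still looks comfortably under 16 ms, but the
# 14.5 ms spike on frame 2 is what the user actually perceives as stutter.
```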