News: Mac Pro with Apple Silicon Finally Here to Complete Transition Away From Intel

bit_user

Polypheme
Ambassador
LMFAO... a SOC as a "workstation".
Yeah, I can see the limited memory and core count pushing some users onto Xeon W or Threadripper Pro. The lack of dGPU support might also be an issue for some.

For those with less extreme needs, it'll be fine.

"I'll buy almost anything if it's shiny and made by apple"
The real knee-slapper is usually how much the accessories cost.


On the plus side, this is a rare bit of good news:

"With everything said and done, the (top spec) 2023 M2 Ultra Mac Pro comes in at $12,199 before tax and without a display."


For Apple, that's surprisingly reasonable.
 
  • Like
Reactions: TCA_ChinChin

bit_user

Polypheme
Ambassador
Actually, I was expecting something like a 128-core chip at least and 2 TB of RAM.
It would've been interesting, for sure. However, I think the writing has been on the wall, for a while, that it would simply follow the mold of the M1 Ultra.

In the next gen, I'd guess they might scale up to 4x Max chips and offer CXL support as a way to scale up memory capacity well into the TB range.

The problem I think they face is that the Mac Pro market just isn't big enough to justify a purpose-built CPU for it. So, if they don't buy one from somebody like Ampere, then it's going to be some form of scaling up of their laptop SoCs.

Edit: I guess another possibility is that they actually decide to make a server CPU, which they could deploy in their own cloud infrastructure.
 
Last edited:
  • Like
Reactions: TCA_ChinChin
Apple is not some physics-defying deity on high that can will thermodynamics to cease to function by sheer style. There is only so much thermal energy that can be transferred through a limited physical area, and modern GPUs already push hundreds of watts of TDP. Not to mention that unified memory architectures absolutely suck for GPUs, as they have very different memory requirements than general-purpose CPUs. CAD/CAM and other GPU-heavy processing tasks are all done better with a dGPU.
 
  • Like
Reactions: gschoen

bit_user

Polypheme
Ambassador
There is only so much thermal energy that can be transferred through a limited physical area, and modern GPUs already push hundreds of watts of TDP.
It's never going to rival something like an RTX 4090. You're right that thermal density is a limiting factor. Die size is another one, although Apple's M-series Ultras stand alone as the only true multi-die, mass-market GPUs.

Not to mention that unified memory architectures absolutely suck for GPUs, as they have very different memory requirements than general-purpose CPUs.
Consoles have shown how much potential the APU system architecture provides. The M1 Ultra provides 800 GB/s of memory bandwidth, which compares rather well with the ~1 TB/s of top-end gaming GPUs. I'm still waiting to find out what memory speed the M2 Ultra uses.
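For reference, here's the back-of-envelope math behind those two figures. It's only a sketch; the LPDDR5-6400 and 384-bit GDDR6X configurations are my assumptions rather than anything stated in this thread:

```python
# Back-of-envelope peak bandwidth = (bus width in bytes) x (transfer rate).
def peak_bandwidth_gb_s(bus_width_bits: int, transfer_rate_mt_s: float) -> float:
    return (bus_width_bits / 8) * transfer_rate_mt_s * 1e6 / 1e9

# Assumed configurations (mine, not confirmed in this thread):
print(peak_bandwidth_gb_s(1024, 6400))   # 1024-bit LPDDR5-6400 -> ~819 GB/s (the quoted ~800 GB/s)
print(peak_bandwidth_gb_s(384, 21000))   # 384-bit GDDR6X @ 21 GT/s -> ~1008 GB/s (~1 TB/s)
```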

CAD/CAM and other GPU-heavy processing tasks are all done better with a dGPU.
Truth be told, I genuinely wonder how intensive CAD is on modern GPUs.

I find it a little hard to believe that CAD really needs to display more than several million polys per model. Decent tessellation tech has been the norm for a while, now. And any half-decent scene graph should provide adequate hidden surface removal to avoid much depth complexity.

I'd wager that 3D content creation and high-end video production are much heavier GPU users, these days.
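To put a rough number on the CAD point above (purely illustrative figures of mine, not measurements): even a fairly heavy CAD model would be a small slice of what a modern GPU can rasterize.

```python
# Illustrative assumption: a 5-million-triangle model redrawn at 60 fps, versus a
# GPU whose triangle throughput is on the order of billions per second.
model_triangles = 5_000_000
fps = 60
required = model_triangles * fps             # 300 million triangles/s
assumed_gpu_throughput = 5_000_000_000       # order-of-magnitude figure, not a spec
print(f"CAD redraw needs ~{required / assumed_gpu_throughput:.0%} of assumed GPU throughput")
```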
 
For PCIe, it must be using a PLX switch, because I think the M2 Max (i.e. the basis for the Ultra) doesn't have nearly enough lanes to support 5 slots.
You've got to be right, because there's a slide missing here that specifies the lane counts, which STH has. There are four slots running x8, two running x16, a PCIe 3.0 x4 I/O card, and they're saying eight Thunderbolt 4 ports. That's a lot more I/O than we've seen on any of the M chips to this point, so it has to be coming from somewhere.

STH article: https://www.servethehome.com/apple-...-pro-mac-studio-and-a-face-m2-experience-arm/
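As a quick sanity check on that lane math, here's a hypothetical tally; the slot counts come from the post above, while the SoC's native lane count is just a placeholder guess on my part:

```python
# Slot counts from the STH summary quoted above; the native lane count is a placeholder.
slot_lanes = 4 * 8 + 2 * 16       # four x8 slots + two x16 slots = 64 Gen4 lanes
io_card_lanes = 4                 # Apple I/O card on PCIe 3.0 x4
total_slot_lanes = slot_lanes + io_card_lanes   # 68 lanes wired to slots
assumed_native_lanes = 16         # placeholder guess for what the SoC exposes directly
print(f"{total_slot_lanes} slot lanes vs ~{assumed_native_lanes} assumed native lanes "
      "-> something like a PCIe switch has to fan that out")
```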
 
  • Like
Reactions: bit_user
For some workstation applications, that's not nearly enough.

For PCIe, it must be using a PLX switch, because I think the M2 Max (i.e. the basis for the Ultra) doesn't have nearly enough lanes to support 5 slots.
Errr, yet it can, out of the box, support 22 simultaneous streams of uncompressed 8K video that would normally require rendering on separate machines. Out of the box, it can support six 6K@60Hz and three 8K@60Hz displays.

You would need to be pretty special for that not to be sufficient....
 
Last edited:
Apple is not some physics-defying deity on high that can will thermodynamics to cease to function by sheer style. There is only so much thermal energy that can be transferred through a limited physical area, and modern GPUs already push hundreds of watts of TDP. Not to mention that unified memory architectures absolutely suck for GPUs, as they have very different memory requirements than general-purpose CPUs. CAD/CAM and other GPU-heavy processing tasks are all done better with a dGPU.
With respect, you don't know what you are talking about. Apple's SoCs have been pulled apart by Anandtech and others, who have commented, with testing to back it up, that Apple has definitely moved the dial on heat/performance and is second to none at combining high efficiency with high power.

In addition, the SoC's deep and wide interconnect clearly optimises latency across CPU, GPU, and memory in a way that's simply impossible with the traditional N/S design in legacy systems.

Credit where it is due - this is a landmark machine in bang per buck...
 

bit_user

Polypheme
Ambassador
Errr, yet it can, out of the box, support 22 simultaneous streams of uncompressed 8K video that would normally require rendering on separate machines. Out of the box, it can support six 6K@60Hz and three 8K@60Hz displays.

You would need to be pretty special for that not to be sufficient....
Please note: I said for some workstation applications.

The video example you cite no doubt utilizes special-purpose hardware, rather than general-purpose compute. Apple's CPU cores are good, but they're not so good that they can outperform a 56-core Xeon W or a 64-core Threadripper Pro on scalable, general-purpose compute tasks.

And then there's the issue of memory capacity, where some scalable workloads require a good deal more than 192 GB.
 

bit_user

Polypheme
Ambassador
With respect, you don't know what you are talking about. Apple's SoCs have been pulled apart by Anandtech and others, who have commented, with testing to back it up, that Apple has definitely moved the dial on heat/performance and is second to none at combining high efficiency with high power.
That comment was about the GPU portion of the SoC, where Apple doesn't have such a great efficiency lead over current industry leaders.

In addition, the SoC's deep and wide interconnect clearly optimises latency across CPU, GPU, and memory in a way that's simply impossible with the traditional N/S design in legacy systems.
At the launch of the PS4, we were warned that certain games might run slower on PCs, due to the inherent latency advantage of CPU/GPU communication in console APUs. To my knowledge, there was never an example of this. I think the issue is more theoretical than practical, because GPU APIs are designed to hide latency as much as possible, and developers quickly learn how to use them to such ends. If you suddenly cut that latency, you don't get a massive benefit, because the software is already designed not to be sensitive to it.
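To illustrate the latency-hiding point, here's a toy pipelining sketch in plain Python (no real GPU API involved): the CPU keeps a couple of frames queued, so per-frame submission latency overlaps with the "GPU" work instead of adding to the frame time.

```python
import queue
import threading
import time

commands: "queue.Queue[int]" = queue.Queue(maxsize=2)   # two frames in flight

def fake_gpu() -> None:
    while True:
        commands.get()
        time.sleep(0.016)          # pretend each frame costs 16 ms of GPU work
        commands.task_done()

threading.Thread(target=fake_gpu, daemon=True).start()

start = time.time()
for frame in range(60):
    time.sleep(0.004)              # CPU-side recording plus submission latency
    commands.put(frame)            # asynchronous submit; the CPU moves straight on
commands.join()
print(f"60 frames in {time.time() - start:.2f}s; the 4 ms/frame submit cost is hidden")
```

The same frames-in-flight idea is why shaving CPU-to-GPU latency rarely shows up as extra FPS.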

Credit where it is due - this is a landmark machine in bang per buck...
I don't really see anything here that we haven't already seen with the M1 Ultra (which was impressive, I'll grant you). They just added a few more CPU and GPU cores, which is nice. If you want to get any more praise out of me, let's see some benchmarks on real-world workloads.

Even then, I still say it falls short of a true workstation, for the reasons I mentioned.
 
Right right, Apple, through the power of belief and exclusive marketing, totally shoved 500~600 watts of power through a single SoC and cooled it with a magical cooler that was both silent and aesthetically pleasing. This is the exact same nonsense that was spouted by the congregation members back during the PPC era. It's a decent CPU by ARM standards and a terribad GPU, and UMA is a very bad thing, because linear scalar processing (CPUs) and parallel vector processing (GPUs) have very different performance profiles.

Linear processing favors fast command turnaround (what we call latency) over raw bandwidth. The bandwidth doesn't really matter when the main loop can't continue until it gets an answer for a value in memory; Intel's and AMD's branch prediction technology does amazing things to try to compensate for such latency. Massively parallel vector processing, on the other hand, really doesn't care about latency, because there is always something else to do while waiting, and it needs massive bandwidth to keep data flowing to those parallel processors. This is why separate discrete architectures always do better than shared ones: the system architecture can be designed to meet the specific needs of each component instead of having to compromise one for the other.
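A back-of-envelope way to see that difference, using numbers I'm assuming purely for illustration (~100 ns DRAM latency, 8-byte loads, ~800 GB/s of streaming bandwidth):

```python
# Assumed numbers: ~100 ns DRAM access latency, 8-byte loads, ~800 GB/s streaming bandwidth.
dram_latency_s = 100e-9
bytes_per_load = 8

# A dependent chain (each address comes from the previous load) can't overlap accesses,
# so its useful throughput collapses to one load per round trip:
chained_gb_s = bytes_per_load / dram_latency_s / 1e9
print(f"latency-bound pointer chase: ~{chained_gb_s:.2f} GB/s of useful data")   # ~0.08 GB/s

# Thousands of GPU threads always have other work in flight, so they can consume
# whatever streaming bandwidth the memory system offers:
print("bandwidth-bound streaming:   ~800 GB/s (limited only by the memory system)")
```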

On chip design, massive chips are very cost-inefficient to produce. Silicon wafers are circular, meaning there are always partially printed dies that get cut off after production. The larger the die being produced, the more wafer area is wasted; multiple smaller dies, on the other hand, waste less silicon and result in lower production costs. This is why SoC designs only make sense in smaller, lower-power devices: an SoC combines the functions of several other chips into one, which can result in higher production costs unless it's kept small.
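The standard die-per-wafer approximation makes that scaling concrete; the die sizes below are illustrative picks of mine, not Apple's actual figures:

```python
import math

def gross_dies_per_wafer(wafer_diameter_mm: float, die_area_mm2: float) -> int:
    """Common approximation: wafer area / die area, minus a term for edge losses."""
    d = wafer_diameter_mm
    return int(math.pi * (d / 2) ** 2 / die_area_mm2
               - math.pi * d / math.sqrt(2 * die_area_mm2))

print(gross_dies_per_wafer(300, 100))   # ~640 candidate dies at ~100 mm^2
print(gross_dies_per_wafer(300, 500))   # ~111 candidates at ~500 mm^2, before defects
```

And since defects land roughly uniformly across the wafer, a bigger die also throws away more good silicon per defect, which is exactly the cost pressure described above.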

Apple isn't doing anything special. The only reason the M1 looked good was that they were first on TSMC's 5nm process node and were being compared to Intel chips on a 10 to 12nm node, with AMD on a 7nm node. Now everyone else is getting onto that same 5nm node, and the only way to make headlines is to make a monstrous die at a ridiculous price and hope enough of the congregation buys the product. They got off Intel because they didn't want people booting into Windows, running apples-to-apples benchmarks, and realizing how terrible the value of Apple PCs is. By going to a custom ARM ISA, the faithful can claim some mystical benefit while their Personal Computer is not a "PC".

Apple Faithful in a nutshell

 
  • Like
Reactions: gschoen

bit_user

Polypheme
Ambassador
Apple isn't doing anything special. The only reason the M1 looked good was that they were first on TSMC's 5nm process node and were being compared to Intel chips on a 10 to 12nm node, with AMD on a 7nm node. Now everyone else is getting onto that same 5nm node, and the only way to make headlines is to make a monstrous die at a ridiculous price and hope enough of the congregation buys the product.
Unlike Intel's and AMD's, Apple's cores weren't designed for outright performance. They prioritized power efficiency. So, while Zen 4 and Raptor Lake are leading on most single-threaded performance tests, Apple's M-series are still way ahead on perf/W.

It's not a simple story of process node, either. It's a story of 1.5 decades of careful tuning and microarchitecture development, with a keen focus on maximizing perf/W.

They got off Intel because they didn't want people booting into Windows, running apples-to-apples benchmarks, and realizing how terrible the value of Apple PCs is.
LOL, no.

Apple took control of its own destiny for the iPhone and iPad. Its CPUs eventually became so good that it only made sense to use them across the entire ecosystem and decouple from Intel, its middling performance improvements (i.e., between Sandy Bridge and Alder Lake), and its manufacturing troubles (i.e., the migrations to 14 nm and then 10 nm). This also gives Apple better cross-device software compatibility.

By going to a custom ARM ISA
The ISA isn't custom, but the microarchitecture is.

ARM always had an advantage over x86 on perf/W. That's the main reason Intel utterly failed to penetrate the phone market.

That's also how Amazon's 64-core Graviton 3 uses only 100 W, even with 8-channel DDR5 and PCIe 5.0.

 
Last edited:
Yes yes, we know you think you can levitate and use psychic powers thanks to the power of the fruit logo. Totally don't do any sort of Personal Computing on your not-a-PC Personal Computer with said fruit logo.
 
Last edited:
Right right, Apple, through the power of belief and exclusive marketing, totally shoved 500~600 watts of power through a single SoC and cooled it with a magical cooler that was both silent and aesthetically pleasing. This is the exact same nonsense that was spouted by the congregation members back during the PPC era. It's a decent CPU by ARM standards and a terribad GPU, and UMA is a very bad thing, because linear scalar processing (CPUs) and parallel vector processing (GPUs) have very different performance profiles.
You have no idea what you are talking about. Unlikely, I know, but in the event that you actually want to LEARN something, read the Anandtech analysis of the M1 uArch and performance. THEY know what they are talking about. Then you will see the M-series as an excellent extrapolation of the ARM architecture to produce a high-performance, highly efficient general-purpose chip on a good process node. And because of the design, and the process node, they CAN customise silicon for common pathways. Overall, there is still nothing comparable to the M1 for a mobile power user.

As for the "Pro"? There are no "general purpose" Pro's out there. Pro's do I or 2 things. Every. Day. As the success of the Studio attests, this workstation hits the spot for a great many of them.
 
At the launch of the PS4, we were warned that certain games might run slower on PCs, due to the inherent latency advantage of CPU/GPU communication in console APUs. To my knowledge, there was never an example of this. I think the issue is more theoretical than practical, because GPU APIs are designed to hide latency as much as possible, and developers quickly learn how to use them to such ends. If you suddenly cut that latency, you don't get a massive benefit, because the software is already designed not to be sensitive to it.
Pretty sure you're right regarding practical examples of this theory, and a lot of folks seemed to be pushing it at the time. One thing those pushing this theory seemed to ignore is that the CPU -> memory latency in the consoles is significantly higher than CPU -> main memory latency in a PC. While this isn't a huge advantage for the PC, it is a weakness in the console design due to using GDDR instead of standard DDR.
I don't really see anything here that we haven't already seen with the M1 Ultra (which was impressive, I'll grant you). They just added a few more CPU and GPU cores, which is nice. If you want to get any more praise out of me, let's see some benchmarks on real-world workloads.

Even then, I still say it falls short of a true workstation, for the reasons I mentioned.
So much this. The design of the M2 makes me wonder if Apple has any actual plans to design a true workstation chip. I'm looking forward to seeing someone tear down the Pro so we can see what they did to get all the extra connectivity. I can't imagine they'll be able to do multiple CPUs economically without a proper design that decouples the GPU from the CPU, or at least uses a smaller GPU. That being said, I'm not totally sure Apple isn't just ceding the greater workstation market to Intel and AMD while carving out its own niche within it.
 

bit_user

Polypheme
Ambassador
the CPU -> memory latency in the consoles is significantly higher than CPU -> main memory latency in a PC. While this isn't a huge advantage for the PC, it is a weakness in the console design due to using GDDR instead of standard DDR.
The original version of the Xbox One had separate main memory and a small pool of fast graphics memory (eSRAM). I think main memory was regular DDR3. Then, I guess MS saw that Sony was doing fine with a unified pool of GDDR5 and followed their lead with the One X refresh.

I always wanted to see some benchmarks comparing the real-world impact of GDDR5 vs. a comparable PC using DDR3 or DDR4, but it never happened. I was super hopeful when Anandtech apparently got their hands on a Chinese console that could also run desktop MS Windows 10, but they never published PC application benchmarks on it!
: (

So much this. The design of the M2 makes me wonder if Apple has any actual plans to design a true workstation chip.
I mentioned it before - I've recently heard some rumors Apple is working on a server CPU. It probably makes sense for running their own cloud services (I wouldn't necessarily have thought so, but if Amazon, MS, and Google are making their own server CPUs, maybe?).

I can't imagine they will be able to do multiple CPUs economically without a proper design potentially decoupling the GPU and CPU or at least having a smaller GPU.
My guess is that a server CPU would ditch the GPU cores. I think they'll probably keep in-package memory, but also add either external DDR5 or CXL lanes for CXL.mem.

For the next Mac Pro's GPU, maybe they'll just take an Ultra with too many faulty CPU cores, and put it on a PCIe card? That's what I guessed they would do when the M1 Ultra first launched. Either that, or swallow their pride and go back to using AMD dGPUs.
 
The original version of the Xbox One had separate main memory and a small pool of fast graphics memory (eSRAM). I think main memory was regular DDR3. Then, I guess MS saw that Sony was doing fine with a unified pool of GDDR5 and followed their lead with the One X refresh.

I always wanted to see some benchmarks comparing the real-world impact of GDDR5 vs. a comparable PC using DDR3 or DDR4, but it never happened. I was super hopeful when Anandtech apparently got their hands on a Chinese console that could also run desktop MS Windows 10, but they never published PC application benchmarks on it!
: (
Yeah, nobody ended up doing in-depth benchmarks, because the software support was not very good and then the team disbanded. I think LTT's coverage was the most in-depth anything got, and since this was before the push for more thorough testing and accuracy, it's lacking. I think all of the boards running the Xbox One chips for desktop were using just DDR3.

There was also this, but the GPU portion is disabled in these chips. It does show that the latency is massive and the only place the VRAM was worthwhile is where the extra memory bandwidth can be put to use by the CPU:
I mentioned it before - I've recently heard some rumors Apple is working on a server CPU. It probably makes sense for running their own cloud services (I wouldn't necessarily have thought so, but if Amazon, MS, and Google are making their own server CPUs, maybe?).
These rumors seem to pop up every so often, but we all should remember they lost a bunch of CPU engineers because they didn't want to make a server CPU. Several of them ended up over at Nuvia, and one of the M series core architects is over at, I think, one of the RISC-V companies. They basically lost all of the core engineering team behind the M series.

Right now Apple uses AWS/Google and maybe MS for all their cloud services. Sure, they have enough money to open their own data centers, or could probably have their own hardware installed in the others, but I'm just not sure this is really on the table.
My guess is that a server CPU would ditch the GPU cores. I think they'll probably keep in-package memory, but also add either external DDR5 or CXL lanes for CXL.mem.

For the next Mac Pro's GPU, maybe they'll just take an Ultra with too many faulty CPU cores, and put it on a PCIe card? That's what I guessed they would do when the M1 Ultra first launched. Either that, or swallow their pride and go back to using AMD dGPUs.
They could also do something along the lines of SPR + HBM, where the on-package memory is still there but they also have DIMM slots wired up. Their memory controller technology is certainly up for it, since the current biggest one is, what, 1024 bits, so the equivalent of 16 channels.

It would be really interesting to see how the Apple GPUs work if you had multiple on PCIe cards and I could see that being a route they could take.
 
  • Like
Reactions: bit_user

bit_user

Polypheme
Ambassador
There was also this, but the GPU portion is disabled in these chips. It does show that the latency is massive and the only place the VRAM was worthwhile is where the extra memory bandwidth can be put to use by the CPU:
Thanks! Yeah, it's disappointing that the SoC doesn't seem architected to provide much bandwidth to the CPU core complexes. So, it paid the price of the high latency without benefiting from the additional bandwidth.

Still, its multithreaded performance is nearly as good as the top-end 8-core Zen+ CPU's, and its single-threaded perf is better than the 1800X Zen 1 CPU's. So, that's not too bad for an 8-core Zen 2, especially considering the price. It's just a shame the GPU portion was disabled.

These rumors seem to pop up every so often, but we all should remember they lost a bunch of CPU engineers because they didn't want to make a server CPU.
Yeah, but that was already 4 years ago! I looked it up:

"NUVIA was founded in early 2019 with the goal of reimagining silicon design to deliver industry-leading performance and energy efficiency for the data center."

Source: https://www.anandtech.com/show/15115/nuvia-breaks-cover-new-startup-to-take-on-datacenter-cpu-market

I've been at companies that would make 2-3 revisions to their product strategy in 4 years!

They basically lost all of the core engineering team behind the M series.
No way. The design teams are pretty big, actually. I doubt they lost more than 1/3rd. Probably disproportionately more of the senior members, but still...

Right now Apple uses AWS/Google and maybe MS for all their cloud services. Sure, they have enough money to open their own data centers, or could probably have their own hardware installed in the others, but I'm just not sure this is really on the table.
They could partially migrate services to their own HW: anything where you'd want the exact same code to migrate seamlessly between end-user machines/devices and the cloud. For those cases, you'd need the same ISA extensions in both places, and Apple is pretty far ahead with its adoption of optional ARMv8-A ISA extensions. And their AMX is a completely separate hardware block that's not even mapped into the ISA. Then, there's their Neural Engine.

The other thing that having their own server CPUs would do is to give them more negotiating power over Amazon/Google.

They could also do something along the lines of SPR + HBM, where the on-package memory is still there but they also have DIMM slots wired up.
Exactly. I really think there's a good case for them to just use CXL for the off-package memory, so you can have like 128 lanes and they can be flexibly re-tasked between storage, peripherals, accelerators, and memory.
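As a rough sense of scale (assuming PCIe 5.0-class signaling for those CXL lanes, which is my assumption):

```python
# PCIe 5.0 signaling: 32 GT/s per lane, 128b/130b encoding -> ~3.94 GB/s per lane, per direction.
lanes = 128
gb_s_per_lane = 32 * 128 / 130 / 8
print(f"~{lanes * gb_s_per_lane:.0f} GB/s per direction across {lanes} lanes")   # ~504 GB/s
```

Far below the on-package pool per lane, but plenty for a capacity tier.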

Their memory controller technology is certainly up for it, since the current biggest one is, what, 1024 bits, so the equivalent of 16 channels.
That's for 2 dies, though. So, it's equivalent to the 512-bit memory controllers we see in a lot of server CPUs today. Genoa is the biggest, at 768-bit.
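Spelled out, the channel math being referenced (assuming 64 bits per DDR channel):

```python
def equivalent_channels(bus_width_bits: int, channel_width_bits: int = 64) -> int:
    return bus_width_bits // channel_width_bits

print(equivalent_channels(1024))   # Ultra-style 1024-bit LPDDR bus -> 16 channel-equivalents
print(equivalent_channels(512))    # per Max die, and typical of big server CPUs -> 8
print(equivalent_channels(768))    # Genoa's 12-channel DDR5 -> 12
```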
 
  • Like
Reactions: thestryker