Apple is doing fine with LPDDR5X. Nvidia claims Grace would use the same power with either that or HBM (I forget if it was 2e or 3), but that LPDDR5X is currently a lot cheaper.
But for ultimate Mobile Performance, you'd want HBM3 for the GPU and LPDDR6 for the CPU side.
Combine them with the DX12 update that lets the CPU fully access the GPU's memory, and you'd see some real performance improvements.
It adds nonzero cost, so there needs to be enough cases that benefit from it. I'm going to speculate that most of those cases which gained from SMT2 would show little or no further benefit from SMT4, and we'd probably see even more regression on the cases which don't benefit.
When Intel first launched HT on the P4, it claimed HT added less than 5% to core complexity. Adding more threads when all of the support framework is already there probably wouldn't add much additional complexity besides an extra bit to things that track which thread something belongs to. The difference in responsiveness on my P4 between HT on and off was pretty noticeable, well worth leaving it on even if it may make some stuff slightly slower.
As for how many things that benefit from SMT2 would benefit from SMT4 too, that would depend on how much slack there is left in execution resources with SMT2. In sufficiently heavily threaded applications, you may gain more performance from SMT4 decreasing the pressure on speculative execution and other potentially expensive "gotta go fast" tick burn, allowing things to run more efficiently if not faster.
Of course, code written as a single thread that must run as fast as possible would suffer from any increased degree of resource competition as usual.
If you follow the link in that article to MS' website, it sounds like its main benefit is for unified memory iGPUs. With a dGPU, reading across PCIe is fairly slow (much worse than CXL).
Given the date of the source, this is probably referring to PCIe 2.0:
Eh, pic doesn't show but it claims read latency of 1600 to 1900 ns. Glancing at the paper, I can't confirm if the "No DMA" case is for a single, round-trip PCIe transaction but it can't be for a whole lot. So, take it with a grain of salt. Regardless, you don't want to make a habit of doing PIO reads out of dGPU memory.
When Intel first launched HT on the P4, it claimed HT added less than 5% to core complexity. Adding more threads when all of the support framework is already there probably wouldn't add much additional complexity besides an extra bit to things that track which thread something belongs to.
So, the P4 famously didn't implement HT very well. And they also really didn't worry about side-channel attacks, at all. Since @hotaru.hino and I already touched on the implications, I won't repeat those here.
However, given that a core already implements SMT2, I think the overhead of going to SMT4 would be pretty small, if we're just talking about the bare minimum to do it securely (i.e. no cache increases).
BTW, do we know if recent Intel or AMD cores instantiate a separate decoder per HT?
In sufficiently heavily threaded applications, you may gain more performance from SMT4 decreasing the pressure on speculative execution and other potentially expensive "gotta go fast" tick burn, allowing things to run more efficiently if not faster.
More threads -> more pressure on caches. If that results in your cache hit-rate dropping markedly, then the few pipeline bubbles that extra threads could fill might not be worth it. I think that's what happened with a lot of the SPEC2017 fp workloads.
As far as I can tell from the Golden Cove architecture slides, there is only one decoder per core and it can decode up to 32B worth of instructions for a maximum of six operations at a time. There isn't much of a point in having dedicated decoders for each thread when execution usually spends most of its time in loops hundreds of instructions long at most and there is a 4k uOPs replay cache to bypass instruction decoding during that time.
But for ultimate Mobile Performance, you'd want HBM3 for the GPU and LPDDR6 for the CPU side.
Combine them with the DX12 update that lets the CPU fully access the GPU's memory, and you'd see some real performance improvements.
The most efficient use of resources would be to use HBM3 for everything; then you avoid wasting time, power and space duplicating assets in system memory and VRAM, because they are one and the same, just like on consoles.
As far as I can tell from the Golden Cove architecture slides, there is only one decoder per core and it can decode up to 32B worth of instructions for a maximum of six operations at a time.
Usually, there are restrictions, such as a prior generation (I forget which) being able to decode 1 "complex" instruction + 4 simple ones, per cycle. I think Intel hasn't disclosed what restrictions apply to Golden Cove's decoder.
There isn't much of a point in having dedicated decoders for each thread when execution usually spends most of its time in loops hundreds of instructions long at most and there is a 4k uOPs replay cache to bypass instruction decoding during that time.
Tremont introduced the concept of a split 3+3 decoder, where it seems like each half can decode a different instruction stream. Since Tremont is single-threaded, I think the only way you get multiple concurrent instruction fetches is by speculative execution. Also, Tremont has no uOP cache, so it's more dependent on its decoder than Golden Cove.
Latency would be better for normal CPU (non-GPU) stuff on LPDDR6.
So there's that consideration.
And given that it's monolithic, it shouldn't be much of a penalty for a CPU to bounce into the GPU to Read/Manipulate data, even w/o the extra copy function that usually happens on traditional modular setups.
If you follow the link in that article to MS' website, it sounds like its main benefit is for unified memory iGPUs. With a dGPU, reading across PCIe is fairly slow (much worse than CXL).
Given the date of the source, this is probably referring to PCIe 2.0:
Eh, pic doesn't show but it claims read latency of 1600 to 1900 ns. Glancing at the paper, I can't confirm if the "No DMA" case is for a single, round-trip PCIe transaction but it can't be for a whole lot. So, take it with a grain of salt. Regardless, you don't want to make a habit of doing PIO reads out of dGPU memory.
But that's when your GPU is connected via a PCIe bus.
What happens if your GPU is on the same Ultra Low Latency Infinity Fabric bus as your CPU in a monolithic APU?
That should change the latency equation by quite a bit.
And since it's an APU, the latency to access the GPU's memory controller shouldn't be that big of a hit compared to a traditional dGPU model. That was the whole point of the APU.
The most efficient use of resources would be to use HBM3 for everything; then you avoid wasting time, power and space duplicating assets in system memory and VRAM, because they are one and the same, just like on consoles.
I concur on the "Zero-Copy" model, but that's the entire point of the DX12 update.
Microsoft has announced a new DirectX12 GPU optimization feature in conjunction with Resizable-BAR, called GPU Upload Heaps, that allows the CPU to have direct, simultaneous access to GPU memory. This can increase performance in DX12 titles and decrease system RAM utilization since the feature circumvents the need to copy data from the CPU to the GPU. The new feature is available now in the Agility SDK.
We don't know the actual implications of this feature, but the performance advantages could be significant. Graphics card memory sizes and video game VRAM consumption are getting larger and larger every year. As a result, the CPU needs to move more and more data between itself and the GPU.
With this feature, a game's RAM and CPU utilization could decrease noticeably due to a reduction in data transfers alone. This is because the CPU no longer needs to keep copies of data on both system RAM and GPU VRAM to interact with it. Another bonus is that GPU video memory is very fast these days, so there should be no latency penalties for leaving data on the GPU alone. In fact, there will probably be a latency improvement with CPU access times on high-end GPUs with high-speed video memory.
This update should really benefit APUs dramatically, since the VRAM memory controller is nearby for the CPU to communicate through. Just like a console.
That's why I want to see a "Big Ass" APU with 64 CUs of RDNA3.
It would be amazing to see that crazy performance in a portable, laptop-like body.
Latency would be better for normal CPU (non-GPU) stuff on LPDDR6.
I concur on the "Zero-Copy" model, but that's the entire point of the DX12 update.
This update should really benefit APUs dramatically, since the VRAM memory controller is nearby for the CPU to communicate through. Just like a console.
While HBM may have slightly worse worst-case latency than DDR5, most of that latency gets hidden by increased concurrency and the ability to simultaneously issue CAS and RAS commands; latency is more uniform, and the total interface time to complete a workload is lower most of the time. Intel wouldn't be packing 64GB of the stuff in its shiny new high-end server CPUs if it didn't provide consistently improved performance.
The DX12 update doesn't benefit unified memory APUs at all since the IGP and CPU are already sharing the exact same physical address space. All of the D3D heap memory, regardless of which D3D memory pool you get it from (0/system, 1/GPU), was already directly accessible to the IGP as-is, going back to chipset IGPs ~25 years ago.
Like a MacBook Pro? The M2 Max has 400 GB/s of memory bandwidth and a 38-core GPU. According to this, it performs similarly to an Nvidia RTX 4070 laptop dGPU:
Not only that, but it amazes me to see how many people simply take the synthetic, single-threaded latency benchmarks as the final word on memory latency. That's a best case metric!
When you're dealing with heavily-multithreaded workloads, the queues will fill up, resulting in actual latencies probably several times the best-case latency measured by synthetic tests. That's when bandwidth starts to count for a lot more than best-case latency, since keeping those queues at low occupancy will more than make up for the higher intrinsic latency of HBM.
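A back-of-the-envelope bandwidth-delay product illustrates the point; the bandwidth and loaded-latency figures below are assumptions for illustration, not measurements of any particular part:

```cpp
// Bandwidth-delay product sketch: how much data must be in flight to keep an
// HBM-class interface busy at an assumed ~400 GB/s and ~120 ns loaded latency.
#include <cstdio>

int main() {
    constexpr double bandwidth_GBps  = 400.0;  // assumed HBM-class bandwidth
    constexpr double latency_ns      = 120.0;  // assumed loaded (queued) latency
    // GB/s * ns conveniently cancels to bytes (1e9 * 1e-9 = 1).
    constexpr double bytes_in_flight = bandwidth_GBps * latency_ns;  // 48,000 B
    constexpr double lines_in_flight = bytes_in_flight / 64.0;       // ~750 lines
    std::printf("~%.0f bytes (~%.0f cache lines) outstanding to hide latency\n",
                bytes_in_flight, lines_in_flight);
    return 0;
}
```

Once the queues are that deep, the interface that drains them fastest wins, not the one with the prettiest idle-latency number.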
For now, it's just the "HPC optimized" variants that will be getting HBM. They suffered a haircut in their max turbo clock speed, which is cut to 3.5 GHz, whereas the fastest of their MCC cousins has a max turbo of 4.2 GHz and their XCC siblings go up to 4.0 GHz.
If you look at Intel's AMX, or know anything about deep learning, it's heavily-dependent on memory bandwidth. So, I think that was a significant motivating factor. I'm eager to see some benchmarks of Xeon Max on other workloads, but I think it's not slated to launch until like Q3.
By my reading, it's optimal for APUs. Without the update, DX12 would force you to keep 2 separate copies of an asset (i.e. if you wanted the CPU to have access to it), even though they happen to be in the same physical memory.
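For anyone curious what the new path looks like in code, here's a minimal sketch, assuming the preview Agility SDK headers that expose D3D12_HEAP_TYPE_GPU_UPLOAD and the OPTIONS16 feature check; don't take the exact flags/states as gospel:

```cpp
// Minimal sketch (not production code): allocate a buffer in VRAM that the CPU
// can Map() and write directly, instead of keeping a second copy in system RAM.
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

bool TryCreateGpuUploadBuffer(ID3D12Device* device, UINT64 sizeBytes,
                              ComPtr<ID3D12Resource>& outBuffer)
{
    // 1. Check that the driver/OS (and Resizable BAR) actually support it.
    D3D12_FEATURE_DATA_D3D12_OPTIONS16 options16 = {};
    if (FAILED(device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS16,
                                           &options16, sizeof(options16))) ||
        !options16.GPUUploadHeapSupported)
        return false;

    // 2. Place the buffer in GPU memory, but leave it CPU-visible: no staging copy.
    D3D12_HEAP_PROPERTIES heapProps = {};
    heapProps.Type = D3D12_HEAP_TYPE_GPU_UPLOAD;

    D3D12_RESOURCE_DESC desc = {};
    desc.Dimension        = D3D12_RESOURCE_DIMENSION_BUFFER;
    desc.Width            = sizeBytes;
    desc.Height           = 1;
    desc.DepthOrArraySize = 1;
    desc.MipLevels        = 1;
    desc.SampleDesc.Count = 1;
    desc.Layout           = D3D12_TEXTURE_LAYOUT_ROW_MAJOR;

    if (FAILED(device->CreateCommittedResource(&heapProps, D3D12_HEAP_FLAG_NONE,
                                               &desc, D3D12_RESOURCE_STATE_COMMON,
                                               nullptr, IID_PPV_ARGS(&outBuffer))))
        return false;

    // 3. The CPU can now Map() this resource and write into GPU memory while the
    //    GPU reads the same allocation -- the "zero-copy" case discussed above.
    return true;
}
```

On an APU, that buffer and "system RAM" are the same physical pool anyway, which is why the feature looks tailor-made for unified memory.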
All of the D3D heap memory, regardless of which D3D memory pool you get it from (0/system, 1/GPU), was already directly accessible to the IGP as-is, going back to chipset IGPs ~25 years ago.
I think you're not accounting for the fact that iGPUs have their own address space. Driver, kernel, and graphics runtime support is needed to map the GPU's address space into userspace, so your app can look at that memory.
I think you're not accounting for the fact that iGPUs have their own address space. Driver, kernel, and graphics runtime support is needed to map the GPU's address space into userspace, so your app can look at that memory.
Most of the time, though, it is only the (I)GPU that needs to look into system memory, and taking care of those mappings is part of the D3D heap allocation and resource creation process. For an IGP, where the mappings are in system memory regardless, software can see whatever the IGP is doing by simply looking at its D3D heap, regardless of the DX12 update and BAR status.
Most of the time, though, it is only the (I)GPU that needs to look into system memory, and taking care of those mappings is part of the D3D heap allocation and resource creation process. For an IGP, where the mappings are in system memory regardless, software can see whatever the IGP is doing by simply looking at its D3D heap, regardless of the DX12 update and BAR status.
I think you're still making some bad assumptions, rather than speaking from actual knowledge.
Memory used by the iGPU must be subject to a different cache policy, for the CPU, if the GPU isn't fully cache-coherent. That means blocking it off via MTRRs (Memory Type Range Registers), which are a limited resource. This prevents the iGPU from being able to map and access whatever userspace data structures you might happen to want to pass to it, because you'd quickly run out of configurable memory ranges to make uncacheable. Furthermore, the OS needs to exclude those pages from other uses or swapping (assuming the GPU can't generate page faults, as was the case with older iGPUs that use physical addresses).
Another concern you seem to be completely ignoring is security. Multiple applications need to be able to share the GPU, and not only must they be prevented from seeing each other's data in GPU memory, but their GPU code also mustn't be able to see into the userspace of any process other than the one that launched it.
So, it's not as simple as an app snooping the heap for the D3D memory it allocated, and then following those pointers. ...if that's what you were imagining.
I wonder if any of the console makers have some sort of leverage to hold AMD back from making a big APU?
Because they're the ones that might feel "Threatened" if AMD made something that nice.
Personally, I think they're being paranoid about PC Gaming threatening Console gaming.
But we all know that the Console Player Base generally wants a "Simpler" experience and isn't willing to deal with the extra steps that we PC gamers are willing to deal with.
No, I highly doubt it. Not formal leverage, anyway.
Informally, their biggest threat is to switch to another SoC maker. But... I mean, Intel hasn't exactly made the greatest showing with its dGPUs, and switching to Nvidia would mean switching the CPU to ARM. Between them and Sony, I'd say Microsoft is the more likely of the two to go for it, since they're trying to push Windows on ARM and that would probably drag some games onto the platform.
I'm not sure how interested Nvidia would be in doing a custom SoC, either. It's doing so well in AI, and while it might be willing to sell Orin Nano chips to Nintendo, doing a custom SoC for a big console might be a distraction for their engineering department that's not worth the relatively small profits it'd bring them.
I think they feel much more threatened by each other, to be honest. A PC, even one with a comparable APU, would always be more expensive than Playstation and XBox due to the way MS and Sony have cost-optimized them and sell the hardware almost at-cost.
No, I highly doubt it. Not formal leverage, anyway.
Informally, their biggest threat is to switch to another SoC maker. But... I mean, Intel hasn't exactly made the greatest showing with its dGPUs, and switching to Nvidia would mean switching the CPU to ARM. Between them and Sony, I'd say Microsoft is the more likely of the two to go for it, since they're trying to push Windows on ARM and that would probably drag some games onto the platform.
I wouldn't count on Intel's Graphics Division for quite some time.
nVIDIA barely got its SoC division to make any sales.
Its biggest customer is Nintendo and its Switch.
MS' ARM push has been a joke.
Nobody buys Windows for "ARM".
The install base for "Windows on ARM" sucks compared to the MASSIVE x86 library that has been built up over the years and the support that x86 has received.
I'm not sure how interested Nvidia would be in doing a custom SoC, either. It's doing so well in AI, and while it might be willing to sell Orin Nano chips to Nintendo, doing a custom SoC for a big console might be a distraction for their engineering department that's not worth the relatively small profits it'd bring them.
Historically, nVIDIA has burned bridges with many of its so-called partners.
After Bump-Gate, Apple REFUSED to ever work with nVIDIA again, because nVIDIA shifted all the blame for the bad solder bumps onto Apple.
When MS wanted a die shrink from nVIDIA for the Xbox, nVIDIA laughed and said "pay up" or "go away".
Sony's relation with nVIDIA for the PS3 wasn't great either.
Many vendors have been One & Done with nVIDIA.
Nintendo is on their first experience with nVIDIA.
We'll see how long they stick with nVIDIA.
Nintendo is FAMOUS for being cheapskates.
That's something nVIDIA isn't happy about: they wanted Nintendo to spend more money per SoC for the original Nintendo Switch. Nintendo said no, we want the hardware as cheap as possible so that hardware sales are profitable on their own, instead of depending on software sales to pay for the hardware.
Nintendo got their profit margins, but nVIDIA was very upset at how meager its own were.
nVIDIA is trying to steer Nintendo toward using a higher-end, updated Orin automotive SoC. We'll see which model Nintendo will land on. I'm betting on it being the bottom-of-the-barrel SKU, given how cheap Nintendo has historically been.
I think they feel much more threatened by each other, to be honest. A PC, even one with a comparable APU, would always be more expensive than Playstation and XBox due to the way MS and Sony have cost-optimized them and sell the hardware almost at-cost.
That's the threat they know & understand. They are always worried about each other every generation.
It's PC gaming that's the perpetual threat in the shadows. The PC gaming user base has been growing over time, to the point where we're "Undeniable" as a platform in our own right.
Each console maker prioritizes its proprietary, closed console platform.
Historically, they haven't been very receptive to PC gaming, despite the fact that PC gaming is HUGE.
It's only very recently, with the massive PC install base, that they realized PC gaming can't be ignored.
That's why we're seeing more PC ports of console games: they realize that we aren't a threat to their existing install base, and that the console audience and the PC gaming audience don't really overlap.
I'm not going to deep-dive into this as I do not have any particular interest in DX development and we are already 100 miles off-topic. Just saying there should be several opportunities for shortcuts.
BTW, MTRRs have been mostly superseded by the Page Attribute Table (PAT), going all the way back to the P3.
A few people have re-reviewed the A750 or A770 with the April driver update, and it looks like the sore spots are clearing up nicely. If Intel wanted to get some of that Sony/Microsoft console SoC action, I'm sure they could work it out. Much easier to do so when there are only one or two standardized platform configurations for everyone from SoC designer to end-users to worry about.
I wonder if any of the console makers have some sort of leverage to hold AMD back from making a big APU?
Because they're the ones that might feel "Threatened" if AMD made something that nice.
Personally, I think they're being paranoid about PC Gaming threatening Console gaming.
I think AMD would be far more concerned about cannibalizing its lower-end dGPU sales by being too generous with its IGPs. For a given amount of graphics performance, AMD may not be able to extract as much of a premium out of its large-IGP APUs as it gets from AIBs for dGPUs.
Annihilating a substantial chunk of the AIBs' business may not go well either.
The key thing is what restrictions you're willing to accept, including how many hoops you have to jump through, to do it. The issues broadly break down into the following categories:
Security
Cache-coherence
Interactions with the kernel's VM subsystem
With newer iGPUs, they might've sufficiently addressed the cache-coherence problem, although this comes at some cost. GPUs have an incredibly weak memory model, and for good reasons.
Security and VM interactions are probably also addressed in recent iGPU generations, by simply having the GPU go through the same MMU as the CPU cores. This also lets you use memory pages to extend the CPU's security model to code executing on the GPU.
It's because these are fairly recent developments that Microsoft is only adding this now.
A few people have re-reviewed the A750 or A770 with the April driver update, and it looks like the sore spots are clearing up nicely. If Intel wanted to get some of that Sony/Microsoft console SoC action, I'm sure they could work it out. Much easier to do so when there are only one or two standardized platform configurations for everyone from SoC designer to end-users to worry about.
If you're building a console, you want the most performance per mm^2 of silicon. Intel still has a long way to go, on that front, and I'm sure that's not simply a matter of "drivers".
If you're building a console, you want the most performance per mm^2 of silicon. Intel still has a long way to go, on that front, and I'm sure that's not simply a matter of "drivers".
For mm^2-for-mm^2 performance, the A750 is doing fine: roughly twice the raw performance of an RX 6600 in ~70% more space, which includes an x16 PCIe interface and a 256-bit memory controller. Intel is just struggling to consistently wring out its potential. Having a performance spread that ranges from getting beaten senseless by the much slower 3050 to comfortably beating the RTX 3060, as it ought to by raw numbers, is plain silly. I doubt this would be an issue on consoles, where developers have only one architecture to worry about in any given build.
I meant on the same process node, which they're not.
The A770 is 406 mm^2 on N6, which translates to between 461 mm^2 and 488 mm^2 of equivalent N7 die size. The RX 6650 XT die is 237 mm^2. So, that's between 95% and 106% more die space.
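Here's the arithmetic behind those N7-equivalent numbers; the N6-to-N7 density scaling factors are my assumption (roughly 14-20% better logic density on N6), not official TSMC figures:

```cpp
// N7-equivalent die-size estimate for the A770 (ACM-G10, TSMC N6) vs. Navi 23.
#include <cstdio>

int main() {
    constexpr double a770_n6_mm2 = 406.0;         // A770 die, N6
    constexpr double navi23_mm2  = 237.0;         // RX 6650 XT die, N7
    constexpr double low  = a770_n6_mm2 * 1.136;  // ~461 mm^2 if N6 is ~13.6% denser
    constexpr double high = a770_n6_mm2 * 1.202;  // ~488 mm^2 if N6 is ~20.2% denser
    std::printf("%.0f-%.0f mm^2 N7-equivalent -> %.0f%%-%.0f%% more than Navi 23\n",
                low, high, (low / navi23_mm2 - 1.0) * 100.0,
                (high / navi23_mm2 - 1.0) * 100.0);
    return 0;
}
```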
Regarding performance, the latest reviews I could find were from March 7th. The RX 6650 XT is 5.2% faster at 1080p and 2.3% slower at 1440p.
So, if we call performance about equivalent, then Intel is off by about 2x in performance per transistor. You think either MS or Sony is going to take that kind of risk, while the other one sticks with AMD? I don't.
Not to mention they also have to provide consumers with a compelling reason to upgrade from the previous generation.
Finally, both MS and Sony have invested a lot in optimizing their GPU stuff for AMD. They'd have to start mostly from scratch, if switching to Intel.
So, if we call performance about equivalent, then Intel is off by about 2x in performance per transistor. You think either MS or Sony is going to take that kind of risk, while the other one sticks with AMD? I don't.
I'm going by FP32 numbers - what the A750/770 should be capable of if Intel got its drivers fully sorted out, not benchmarks since many of those change by 5-2000% every month.
I'm going by FP32 numbers - what the A750/770 should be capable of if Intel got its drivers fully sorted out, not benchmarks since many of those change by 5-2000% every month.
I think that's a mistake. We've seen many examples where FLOPS don't translate into real-world performance, and there's some good evidence it's not just "drivers" holding them back.
For instance, the RX 7900 XTX has 2.41x as much theoretical fp32 compute as the RX 6950 XT (non-boost; with boost clocks, the difference is even greater), but delivers nowhere near that multiple of performance.
It's not just AMD, either. Nvidia's RTX 3090 Ti has 2.85x as much theoretical fp32 compute as the RTX 2080 Ti (non-boost; comparing boost clocks, it's 2.97x as much), but delivers a mere 1.3x to 1.4x the performance.
So, you really can't treat theoretical fp32 numbers as a particularly useful performance proxy.
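For what it's worth, this is all those theoretical fp32 figures actually capture: 2 ops per clock per shader (FMA) times shader count times clock. The shader counts and boost clocks below are spec-sheet values, so treat the decimals as approximate:

```cpp
// Peak fp32 throughput = 2 (FMA) x shader count x clock -- nothing in that
// formula accounts for whether the rest of the chip can feed the ALUs.
#include <cstdio>

int main() {
    constexpr double tflops_3090ti = 2.0 * 10752.0 * 1.860e9 / 1e12; // ~40.0 TFLOPS
    constexpr double tflops_2080ti = 2.0 * 4352.0  * 1.545e9 / 1e12; // ~13.4 TFLOPS
    std::printf("RTX 3090 Ti / RTX 2080 Ti peak fp32 ratio: %.2fx\n",
                tflops_3090ti / tflops_2080ti);                      // ~2.97x
    // Measured gaming performance gap is only ~1.3-1.4x, per the comparison above.
    return 0;
}
```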
MS' ARM push has been a joke.
Nobody buys Windows for "ARM".
The install base for "Windows on ARM" sucks compared to the MASSIVE x86 library that has been built up over the years and the support that x86 has received.
I think it depends a lot on ARM being able to deliver a decent Windows experience on lower-cost hardware than AMD or Intel. It's similar to why ARM made such inroads into the Chromebook market.
Qualcomm has different ideas. They have some fantasy that wealthy executives will "simply demand" a Qualcomm-powered $2k+ ultrabook, for its light weight, 5G connectivity, and long battery life. So far, that doesn't seem to be playing out very well.
And the trouble is that Qualcomm had some sort of exclusivity agreement with Microsoft, so nobody else is in the mix. I think you need at least MediaTek, but probably also Samsung or somebody else making those SoCs.
I think AMD would be far more concerned about cannibalizing its lower-end dGPU sales by being too generous with its IGPs. For a given amount of graphics performance, AMD may not be able to extract as much of a premium out of its large-IGP APUs as it gets from AIBs for dGPUs.
Annihilating a substantial chunk of the AIBs' business may not go well either.
I'm going by FP32 numbers - what the A750/770 should be capable of if Intel got its drivers fully sorted out, not benchmarks since many of those change by 5-2000% every month.