News AMD ROCm Comes To Windows On Consumer GPUs

Too little, too late? I've been waiting for ROCm on Windows since launch - it's been a mess. I've been wanting to play around with it first on my RX470 and now on my RX6800XT.

Intel's OneAPI seems to be the best way forward right now as far as open solutions go.
 
Positive news! I hope the eventual goal is to support compute on all recent models, including RDNA1 and Vega iGPUs.

They need to take a page from Nvidia's playbook and support compute up-and-down the entire product line. It has to work out-of-the-box, with the same ease of installation as graphics drivers. Only then can app developers realistically support compute on AMD GPUs.
 
AMD introduced Radeon Open Compute Ecosystem (ROCm) in 2016 as an open-source alternative to Nvidia's CUDA platform.
CUDA is primarily an API. AMD has a clone called HIP (Heterogeneous-computing Interface for Portability), which runs atop multiple different hardware/software platforms, including Nvidia GPUs, and there's allegedly even a port to Intel's oneAPI. HIP only supports AMD GPUs atop ROCm, which is why ROCm support for consumer GPUs is important.

AMD's wish is that people would use HIP, instead of CUDA. Then, apps would seamlessly run on both Nvidia and AMD's GPUs. There are other GPU Compute APIs, such as OpenCL and WebGPU, although they lack some of CUDA's advanced features and ecosystem.
 
I'd rather nobody use any API that is "Vendor Specific".

That's why OpenCL & DirectCompute exist.

If you're targeting Windows, you have DirectCompute.

If you're targeting Open Source and portability, OpenCL.

No matter how nice each vendor specific / proprietary API is, not having to be locked into a hardware vendor would be better IMO.
 
With DirectCompute, you're just trading GPU-specific for platform-specific. That's not much progress, IMO.

Also, from what I can tell, DirectCompute is merely using compute shaders within Direct3D. It doesn't appear to be its own API. I'd speculate they're not much better or more capable than OpenGL compute shaders.

You didn't mention Vulkan Compute, which is another whole can of worms. At least it's portable and probably more capable than compute shaders in either Direct3D or OpenGL. What it's not is suitable for scientific-grade or probably even financial-grade accuracy, like OpenCL.
 
There really isn't one open-source, platform-agnostic GPGPU compute API that meets all those requirements, is there?
 
OpenCL has the precision and the potential, but big players like Nvidia and AMD no longer see it as central to their success in GPU Compute, the way they see D3D and Vulkan as essential to success in the gaming market. Intel is probably the biggest holdout in the OpenCL market. It forms the foundation of their oneAPI.

One upside I see from the Chinese GPU market is that it will probably coalesce around OpenCL. We could suddenly see it reinvigorated. Or, maybe they'll turn their focus towards beefing up Vulkan Compute.

Oh, and WebGPU is another standard to keep an eye on. It's the web community's latest attempt at a GPU API for both graphics and compute workloads. Web no longer means slow: WebAssembly avoids the performance penalties associated with high-level languages like JavaScript. And you can even run WebAssembly apps outside of a browser.
 
standards.png

Why does this feel like another XKCD moment?
 
#927 !

It's not, though. It just feels like it. Neither CUDA nor HIP is a standard. Nor is Direct3D.

WebGPU is a standard, but then it's meant to succeed WebGL and WebCL, probably not unlike how Vulkan succeeded OpenGL (and WebCL never really caught on). I've never used WebGL, but it sounds very closely tied to OpenGL ES, and that's basically dead.

I suppose WebAssembly is a bit like Java bytecode. I don't honestly know enough about either one to meaningfully compare them. Java really seems to have fallen out of favor, and hopefully WebAssembly has avoided the underlying reasons for that.
 
It's never too late for competition.
I'm sure some asses have been getting kicked at AMD for missing out on so much of the AI boom. I hope this is finally resulting in them getting the needed resources for their compute stack.

Incidentally, I just looked at the installation guide for the latest ROCm (Linux) release, and it says nothing about the two consumer RDNA2 GPUs mentioned in this article.

So, I wonder if they'll be supported in the next Linux release, or if that's a Windows-only thing. Also, why just those two models? Is it because that's all they had the time to qualify? Or maybe those are the models they're most desperate to sell? Or, maybe it was just done for some big customer who bought a lot of them?
 
Too little, too late? I've been waiting for ROCm on Windows since launch - it's been a mess. I've been wanting to play around with it first on my RX470 and now on my RX6800XT.

Intel's OneAPI seems to be the best way forward right now as far as open solutions go.
To me the original HSA seemed the far more attractive way to go which culminated in the (alleged?) ability to change control flow from the CPU part of an APU or SoC to GPU or xPU at the level of a function call, offering a really fine grained near zero overhead control transfer, instead of heavy APIs with bounce buffers or perhaps even cache flushes.

I even bought an AMD Kaveri APU to play with that and have been waiting for meaningful software support ever since. The Kaveri recently got retired even as a backup server (literally only ever switched on to do a weekly backup..), because in its anticipation of a changed computing paradigm, its designers had skimped on all those caches and translation buffers that make today's CISC CPUs fast, giving more than half of its die area to a theoretically HSA capable iGPU that was as massive as it was starved for RAM bandwidth and any software to exploit its HSA capabilities.

In other words: it was a dog on every front, except electricity, drinking nearly 100 Watt for a job that an i5-6267 Iris 550 equipped SoC from Intel could do at exactly the same speed for 1/3 of the power budget (I benchmarked them extensively as I had both for years and it's quite fascinating how closely they were always matched in performance, when they were so different in progeny).

The biggest issue was total lack of software support, no compiler to generate a single code image from a single source file, no optimizer able to decide which part should be done by whom.

In a way HSA is finding its way back via all those ML-inference driven ISA extensions, which might have instruction or assembly language mnemonics but no high-level language equivalent to express them, but what's mostly changed is that their development is driven at the behest of cloud giants, who control their entire IT ecosystem, from code down to proprietary xPUs and even ISAs: they are used to having to write their own code, invent their own programming languages or chips.

But those giants tend to be rather averse to sharing any of their IP and general purpose computing only ever happened because developers needed to share their code to offset the cost of its development.

So while the giants increasingly become the only ones to shoulder the cost of evolution, chances are that the fruits of their labor only become openly available when they publish to harm a competitor's potential USP.
 
To me the original HSA seemed the far more attractive way to go which culminated in the (alleged?) ability to change control flow from the CPU part of an APU or SoC to GPU or xPU at the level of a function call, offering a really fine grained near zero overhead control transfer, instead of heavy APIs with bounce buffers or perhaps even cache flushes.
I don't understand how that would work. You don't do such things for multi-threaded programming on a CPU. Yeah, you could theoretically trigger an interrupt on the GPU and interrupt whatever it's doing to have it switch to running the new function you're calling, but what if it was already running a function called by another CPU thread?

Just like we do for multi-threaded programming on CPUs, I think a GPU API will always be buffer-based. The key thing is that the GPU be cache-coherent, and that's what saves you having to do cache flushes. It lets you communicate through L3, just like you would between CPU threads.

I even bought an AMD Kaveri APU to play with that
I had the same idea, though I was more interested in OpenCL than HSA. Fortunately, I never got around to ordering it, though I think I got as far as spec'ing out all the parts.

drinking nearly 100 Watt for a job that an i5-6267 Iris 550 equipped SoC from Intel could do at exactly the same speed for 1/3 of the power budget (I benchmarked them extensively as I had both for years and it's quite fascinating how closely they were always matched in performance,
100 W at the wall, or CPU self-reported? I'm impressed it was able to match a Skylake, at all. The fact that you compared it against a mobile CPU meant both that the Skylake had an efficiency advantage (due to lower clocks), but also a decent performance penalty.

In a way HSA is finding its way back via all those ML-inference driven ISA extensions,
Not really. CPU cores are never as efficient as GPU cores, for a whole host of reasons.

AMX is an interesting case, but that's really just a special-purpose matrix-multiply engine that Intel bolted on. Oh, and the contents of its tile registers can't be accessed directly by the CPU - they go to/from memory. So, even with AMX, you're still communicating data via memory, even if the actual commands take the form of CPU instructions.
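
To make the "through memory" point a bit more concrete, here is a toy model in Python. It is not real AMX code and uses no intrinsics: the function names, the flat `memory` list and the 2x2 tile size are all assumptions made for illustration, and the multiply is a plain integer matrix product rather than the packed dot products the real instructions perform. The only thing it tries to show is that operands enter tiles from a memory address and the result is visible to ordinary code only after a store back to memory.

```python
def tile_load(memory, base, rows, cols, stride):
    """Copy a rows x cols tile out of flat `memory`, starting at index `base`."""
    return [memory[base + r * stride: base + r * stride + cols] for r in range(rows)]


def tile_store(memory, base, tile, stride):
    """Write a tile back into flat `memory`, one row every `stride` elements."""
    for r, row in enumerate(tile):
        start = base + r * stride
        memory[start: start + len(row)] = row


def tile_matmul_accumulate(acc, a, b):
    """acc += a @ b, computed only on tile-local data."""
    for i in range(len(a)):
        for j in range(len(b[0])):
            acc[i][j] += sum(a[i][k] * b[k][j] for k in range(len(b)))
    return acc


def demo_tiles():
    # Hypothetical layout: A (2x2) at index 0, B (2x2) at 4, C (2x2) at 8.
    memory = [1, 2, 3, 4, 5, 6, 7, 8, 0, 0, 0, 0]
    a = tile_load(memory, 0, 2, 2, 2)
    b = tile_load(memory, 4, 2, 2, 2)
    c = tile_load(memory, 8, 2, 2, 2)
    tile_matmul_accumulate(c, a, b)
    # Ordinary code only sees the result once it is back in memory.
    tile_store(memory, 8, c, 2)
    return memory[8:12]
```

In a real pipeline that "memory" would usually still be sitting in cache, which is why going through a memory address is not the same thing as going out to DRAM.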
 
I don't understand how that would work. You don't do such things for multi-threaded programming on a CPU. Yeah, you could theoretically trigger an interrupt on the GPU and interrupt whatever it's doing to have it switch to running the new function you're calling, but what if it was already running a function called by another CPU thread?
I'm fuzzy on the details, most likely because it never became a practical reality.

But my understanding was that quite literally it worked via an alternate interpretation of all instruction encodings and the APU would flip between an x86 and a "GPGPU" ISA on calls, jumps or returns.

Since all caches, page tables and even memory consistency semantics were shared between the two within the very same chip, the overhead of switching was absolutely negligible, nothing nearly as expensive as an interrupt, user/kernel mode or process switch, which is what I believe you have to do for ARM64/32/Thumb ISA or on x86/AMD64 transitions.

And by that it became quite obvious that it wouldn't do much good with discrete GPUs, even if both were to come from AMD, as they rarely shared the same memory space and even with an Infinity Fabric HBM/GDDRx/DRAM bandwidth and latency gaps would quickly destroy the value of such a super-tight HSA.

However, it might have been usable on consoles with AMD SoCs, so who knows?

Of course that would have also meant quite a bit of OS support, because the GPU side most likely had a pretty large register file, but then we've had process switch nightmares on x86 for a long time, starting with the venerable 8087.

Just like we do for multi-threaded programming on CPUs, I think a GPU API will always be buffer-based. The key thing is that the GPU be cache-coherent, and that's what saves you having to do cache flushes. It lets you communicate through L3, just like you would between CPU threads.

Buffer based, yes, but generally without having to copy and flush, instead doing user-to-remote-user-space rUDMA and message-passing control transfers across large Infiniband or CXL fabrics. Perhaps even a bit of NVLink, but that doesn't scale to the tens of thousands of systems now used for GPT-4 and beyond.

With such a base, the type of ISA switch HSA was supposed to do could still be done and as long as the architecture is GPGPU first and CPU second, it could be as efficient as it can be. There is this trend somewhat inspired by RISC-V, that general purpose ISAs are really just there for some legacy bootstrapping and general admin stuff, which is why Nvidia's Grace and Hopper are designed to really mesh the memory spaces, and why Tenstorrent and others make the CPU little more than a service processor.
I had the same idea, though I was more interested in OpenCL than HSA. Fortunately, I never got around to ordering it, though I think I got as far as spec'ing out all the parts.


100 W at the wall, or CPU self-reported? I'm impressed it was able to match a Skylake, at all. The fact that you compared it against a mobile CPU meant both that the Skylake had an efficiency advantage (due to lower clocks), but also a decent performance penalty.
Mostly self reported but also checked at the wall. And yes, the 22nm Intel Skylake process using 2 cores and SMT at 3.1 GHz was able to match almost exactly 4 AMD Integer cores (with 2 shared FPUs) at 4 GHz on 28nm GF.

But more importantly the 48 EU Iris 550 with 64 MB of eDRAM matched the performance of the A10-7850K Kaveri's 512-"core" iGPU to within a hair (I had given it the single-rank DDR3-2400 it needed on each of the two channels to get peak performance), while burning much less energy (Intel iGPUs always tend to get precedence, but they never get a lot of juice).

And I think the i5 actually had something like DDR3-1600...

It was really uncanny, how closely they matched in nearly every benchmark out there and please believe me that I tried a lot! That and a Phenom II x6 threw me off AMD after a long love affair that had started with a 100MHz Am486 DX4-100, I think, and lasted many generations (I came back for Zen3).

That chip was a design made just for Apple, most likely horrendously expensive to make as the iGPU was bigger than the CPU cores and mounting the eDRAM probably wasn't cheap, either.

But Apple lost its faith in Intel at this very specific point, evidently because there were daily errata and they expected better yet. So when they were eventually left over they were silently sold off in bulk and cheap to empty the warehouses. I got mine in a cheap Medion notebook and it's still running strong.
Not really. CPU cores are never as efficient as GPU cores, for a whole host of reasons.

AMX is an interesting case, but that's really just a special-purpose matrix-multiply engine that Intel bolted on. Oh, and the contents of its tile registers can't be accessed directly by the CPU - they go to/from memory. So, even with AMX, you're still communicating data via memory, even if the actual commands take the form of CPU instructions.
There are *so many* really interesting memory mapped architectures out there, both historically and even relatively recent, which act like memory to ordinary CPUs whilst they compute internally. The first I remember were Weitek 1167 math co-processors, which were actually designed ISA independent (and supposed to also work with SPARC CPUs, I believe). You'd fill the register file writing to one 64K memory segment (typically on an 80386) and then give it instructions by writing bits to another segment. You could then get results by reading from the register file segment.

A few years ago Micron had an architecture where you'd code non-deterministic finite state machines in a similar manner and send them data to process by writing to what seemed like RAM, collecting the computing results elsewhere. They called it the Automata Memory Processor (https://www.researchgate.net/publication/310821021_An_overview_of_micron's_automata_processor) and hoped to overcome the memory barrier that way.

There is also a French company called UPMEM, which has been in startup for what seems like decades now, never quite getting off the ground.
 
my understanding was that quite literally it worked via an alternate interpretation of all instruction encodings and the APU would flip between an x86 and a "GPGPU" ISA on calls, jumps or returns.
No, absolutely not. There's no way a GPU was executing x86. That'd be insane, on so many levels.

HSA defined a portable assembly language which is compiled to native GPU code at runtime. Calling that function should then package up your arguments, ship them over to the GPU block, and tell the GPU to invoke its version of the function.

Once complete, it packages up the results and ships them back, likely with an interrupt to let you know it's done. If they're smart, they won't send an interrupt unless you requested it.
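
As a rough illustration of that dispatch flow, here is a minimal Python sketch. It is not the HSA runtime API: the packet layout, `run_device`, `dispatch` and the `saxpy` kernel are all made up for the example, and a plain thread stands in for the GPU block. It only shows the shape of the thing: arguments get packaged into a packet, the packet goes onto a queue, the device runs its own version of the function, and a completion signal fires only if the caller asked for one.

```python
import queue
import threading


def run_device(work_queue, kernels):
    """Stand-in for the GPU block: pull packets, run the named kernel, publish the result."""
    while True:
        packet = work_queue.get()
        if packet is None:  # sentinel: shut the "device" down
            break
        kernel = kernels[packet["kernel"]]
        packet["result"] = kernel(*packet["args"])
        if packet["done"] is not None:  # only signal if the caller requested it
            packet["done"].set()


def dispatch(work_queue, kernel_name, args, want_signal=True):
    """Package the arguments into a packet and ship it to the device queue."""
    packet = {
        "kernel": kernel_name,
        "args": args,
        "result": None,
        "done": threading.Event() if want_signal else None,
    }
    work_queue.put(packet)
    return packet


def saxpy(a, xs, ys):
    """Hypothetical kernel: a * x + y, element by element."""
    return [a * x + y for x, y in zip(xs, ys)]


def demo_dispatch():
    work_queue = queue.Queue()
    device = threading.Thread(target=run_device, args=(work_queue, {"saxpy": saxpy}))
    device.start()
    packet = dispatch(work_queue, "saxpy", (2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
    packet["done"].wait()  # block until the device signals completion
    work_queue.put(None)
    device.join()
    return packet["result"]
```

Every call here pays for a queue operation and a wake-up, which is the per-call overhead being argued about in this thread.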

the overhead of switching was absolutely negligible, nothing nearly as expensive as an interrupt,
Interrupts are traditionally pretty cheap. According to this, they're only a couple microseconds:

If you're making enough calls for that overhead to be significant, then you're probably doing it wrong.

it became quite obvious that it wouldn't do much good with discrete GPUs, even if both were to come from AMD, as they rarely shared the same memory space and even with an Infinity Fabric HBM/GDDRx/DRAM bandwidth and latency gaps would quickly destroy the value of such a super-tight HSA.
I think the reason HSA fizzled is just for lack of interest, within the industry.

Buffer based, yes, but generally without having to copy and flush
Probably because of HSA, AMD GPUs have supported cache-coherency with the host CPU, since Vega, if not before.

I think Intel supported it since Broadwell, which was the first to support OpenCL's Shared Virtual Memory.

yes, the 22nm Intel Skylake process using 2 cores and SMT at 3.1 GHz was able to match almost exactly 4 AMD Integer cores (with 2 shared FPUs) at 4 GHz on 28nm GF.
All Skylakes were 14 nm. It's Haswell (2013) that was 22 nm.

That and a Phenom II x6 threw me off AMD after a long love affair that had started with a 100MHz Am486 DX4-100, I think, and lasted many generations (I came back for Zen3).
I had a Phenom II x2 that served me well. Never had any issues with it, actually. It was a Rev. B, however. In fact, I recently bought a Zen 3 to replace it.

There are *so many* really interesting memory mapped architectures out there,
No, read the ISA for it. It's only 8 instructions. The way tiles are read & written is by loading or saving them from/to a memory address. That's an example of communicating through memory, not memory-mapped hardware.
 
No, absolutely not. There's no way a GPU was executing x86. That'd be insane, on so many levels.

HSA defined a portable assembly language which is compiled to native GPU code at runtime. Calling that function should then package up your arguments, ship them over to the GPU block, and tell the GPU to invoke its version of the function.

Once complete, it packages up the results and ships them back, likely with an interrupt to let you know it's done. If they're smart, they won't send an interrupt unless you requested it.
Here is the report I still had in my mind from Anandtech: https://www.anandtech.com/show/7677/amd-kaveri-review-a8-7600-a10-7850k/6

And no, it wasn't an ISA switch; instead, both iGPU and iCPU would work concurrently within the very same shared physical memory space, using the same (virtual memory) pointers in user space, and would have to use semaphores to avoid stepping on each other's toes.
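
A minimal sketch of that sharing model in Python, with two ordinary threads standing in for the iCPU and the iGPU (an assumption for illustration, since nothing here touches a real GPU). Both workers hold a reference to the very same buffer, nothing is copied, and a binary semaphore keeps their in-place updates from interleaving. The names `demo_shared_buffer` and `worker` are made up for the example.

```python
import threading


def demo_shared_buffer(rounds=1000):
    buffer = [0] * 8                  # one allocation, visible to both workers
    guard = threading.Semaphore(1)    # binary semaphore guarding the buffer

    def worker(step):
        for _ in range(rounds):
            with guard:               # acquire before, release after each update pass
                for i in range(len(buffer)):
                    buffer[i] += step

    cpu = threading.Thread(target=worker, args=(1,))
    gpu = threading.Thread(target=worker, args=(2,))
    cpu.start()
    gpu.start()
    cpu.join()
    gpu.join()
    return buffer
```

With the default of 1000 rounds every element ends up at 3000 regardless of how the two workers interleave, because each update pass runs entirely under the semaphore.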

It went beyond what everybody else could do, even their Llano, Trinity and Richland APUs, and I have no idea if that functionality was ever fully debugged or is still available in current APUs.

Pretty sure it would be a security nightmare today, because it's so difficult to get these things right the first time around...
Interrupts are traditionally pretty cheap. According to this, they're only a couple microseconds:
Yes, yes, WoSch: he officially supervised my diploma thesis at TU-Berlin/GMD-First around 1989 (I ported X11R4 to AX on a TMS32016 and a Motorola 68030).

But seriously, interrupts are so expensive that there are all types of offload logic in current network ASICs to avoid using them. For starters, they typically cost a kernel/userspace transition, and after Spectre that means flushing buffers or using special "stumbles" for a similar effect on CPU-internal look-aside structures.

The cost and latency of interrupts are so severe that Ethernet will go a long way to avoid them, and the latency costs of Ethernet are so severe that HPC centers will use Infiniband or better to get around them and avoid the overhead of a mode switch, even page table flipping.

GPU register files measure in Terabit bandwidth and even kick ancient must-haves like common subexpression elimination, because it's often faster to recompute some piece of data on every GPU core than fetching it from a (RAM) based variable, even when you don't have to bother about coherency.
If you're making enough calls for that overhead to be significant, then you're probably doing it wrong.
With Bulldozer the idea was that AMD could let go of their 8087/MMX type FPUs, because they had many more GPU cores capable of way bigger floating point compute. But that only works when applications are converted to do all of their floating point work via HSA, and there the tight integration really allowed you to do that at the level of a single loop: you weren't restricted to an entire ML kernel, which is the level of granularity you have with CUDA these days.

That bit them hard in the arse as nobody paid for Excel to change (they did some basic work for Marco Börries' StarOffice spreadsheet).
I think the reason HSA fizzled is just for lack of interest, within the industry.
Software ecosystems, how they come about and why they flounder, are incredibly interesting to study, but unfortunately that doesn't mean anyone gets better at creating successful ones 🙂
Probably because of HSA, AMD GPUs have supported cache-coherency with the host CPU, since Vega, if not before.

I think Intel supported it since Broadwell, which was the first to support OpenCL's Shared Virtual Memory.


All Skylakes were 14 nm. It's Haswell (2013) that was 22 nm.
You caught me there, I *was* going to put 14nm but fell for not checking my sources first (stupid me!)
I had a Phenom II x2 that served me well. Never had any issues with it, actually. It was a Rev. B, however. In fact, I recently bought a Zen 3 to replace it.


No, read the ISA for it. It's only 8 instructions. The way tiles are read & written is by loading or saving them from/to a memory address. That's an example of communicating through memory, not memory-mapped hardware.
The Micron AP processor was meant to achieve an order of improvement in how much computing you could do given the memory barrier of general purpose compute. The idea was to insert a small bit of logic into what was essentially RAM chips, and to avoid spending all of the DRAM chip's budget in transferring gigabytes of data, when only relatively light processing was required on it.

Today I'd say that kind of job has really gone into the DPUs, which might use P4 or a similar data-flow language to do logic before it even reaches a general purpose CPU destination (might kill it before, if it finds it holds no value).

But it was generally a rather fascinating architecture and I even downloaded the SDK and an emulator at one point. Some Micron guys were flown into Grenoble for a demonstration with Bull and I happened to be there, too, if I remember correctly.
 
The Micron AP processor was meant to achieve an order of improvement in how much computing you could do given the memory barrier of general purpose compute. The idea was to insert a small bit of logic into what was essentially RAM chips, and to avoid spending all of the DRAM chip's budget in transferring gigabytes of data, when only relatively light processing was required on it.
I was very careful to say "memory address", because AMX obviously goes through the same cache hierarchy as the rest of the CPU. So, you're not necessarily bottlenecked by DRAM, since the data could still be in cache.

The way AMX is probably used, in practice, is to process a chunk of data where the output is still small enough to remain in L2 cache, then they use either AMX or AVX-512 to do the next processing stage.
 
Please zoom out to 10,000 ft above ground ;-)

The main issue behind Micron's attempt with the Automata Processor was to push computing functionality into the memory itself.

Since the 1950s, IT architects have been worried that the speed of RAM was limiting the ability to perform computing. It was made worse by the von Neumann or Princeton architecture, which had code share the same RAM as data, which was an obvious performance disadvantage compared to the Harvard proposal of separating both. But the pure Harvard architectures were quickly becoming impractical, because having code and data in the same memory space made things so much easier, including replacing operators with software systems and using compilers instead of manual configuration.

But it became famously known as the "von Neumann bottleneck" in the 1960's (which I consider extremely unfair and if he had still been around by then, he would have been the first to ensure this would never happen).

CISC was first used to lessen the pressure on ferrite-core RAM, but finding ever denser ways of encoding more powerful operations had its limits and was paid for in CPU complexity. Lots of caches have allowed returning to a partial Harvard (RISC) architecture, which separated code from data at least within them. But ultimately, when you have TBytes of data in RAM, what you can do with that is mostly bottlenecked by how much you can shovel between CPUs and RAM: and that doesn't easily expand by orders of magnitude, unless you turn the computing paradigm on its head.

And that is what the Automata Processor was all about: pushing much more of the computing into the DRAM itself (to MANAGERS, doing the rote jobs solvable via NFA), up to the point where the CPU would then as a DIRECTOR take the intermediate knowledge thus gathered to act on it accordingly.
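
For anyone who hasn't met NFA-style processing, here is a small software sketch of the kind of rote job meant here. It is plain Python, not Micron's SDK or programming model: `run_nfa`, the transition table and the example pattern are assumptions made for illustration. The set of active states advances once per input symbol, the start state is re-armed on every symbol so a match can begin anywhere in the stream, and only the match positions are reported back, which is the small amount of "knowledge" the director would then act on.

```python
def run_nfa(transitions, start_states, accept_states, stream):
    """Feed `stream` one symbol at a time; return the indices where a match ends."""
    active = set()
    matches = []
    for index, symbol in enumerate(stream):
        # Start states are re-armed on every symbol, so a match can begin anywhere.
        candidates = active | set(start_states)
        active = set()
        for state in candidates:
            active |= transitions.get((state, symbol), set())
        if active & set(accept_states):
            matches.append(index)
    return matches


# Hypothetical pattern: an "a", then any number of "b", then a "c".
PATTERN = {
    ("s0", "a"): {"s1"},
    ("s1", "b"): {"s1"},
    ("s1", "c"): {"s2"},
}


def demo_nfa():
    return run_nfa(PATTERN, {"s0"}, {"s2"}, "xabbcacab")
```

On a CPU this loop touches every byte of the stream. The pitch for doing it inside the memory was that the bytes would not have to cross the memory bus at all, only the match reports would.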

The idea was to vastly improve the amount of logic processing for every Joule of energy expended.

And the major solution to that has become another: process all data in real-time, as it is being generated, e.g. via DPUs or other intermediate processing elements, which avoid storing [raw] data, but only pass along perhaps half-baked knowledge, as long as it still has potential value (instead of just storing it first and then processing it afterwards).

Of course you still need to store the knowledge that you can't just recompute from the raw data. And you still have to consolidate that knowledge with insights you gain from transient new data. And one might still argue that a large part of that processing could be done more energy efficient with an automata processor than a general purpose CPU.

But it would incur the cost of managing the AP paradigm, which is a significant cost. And currently it seems like that work is more being shifted into the network data plane where DPUs (using NFA like processing to implement a data flow calculus) are doing it.

There have also been similar initiatives that have tried to use the bottom layer in HBM (or HMC) stacks, which was typically only using a much older process technology, but it still hurt companies to put nothing but passageways into the ground floor of a high-rise building and effectively only use the surface area of the elevators (through-silicon vias), while the floor space without vias (where the DRAM cells were located on the higher floors using the denser processes) would be void of any functionality.

They figured that even when you could only put simple logic there, that logic would at least be able to execute without tying down the host-side (building forecourt) memory bus, only the internal via based bus, which typically had a 4:1 or better oversubscription.

IBM had tried something like that ~15 years ago with FPGAs and I remember Hynix was practically begging engineers to come forward and tell them what to implement into the ground floors for the HBM stacks: they'd even sponsor putting any good idea in there for free, because that bottom floor was already fully paid, whether it contained logic or not.

I guess ML simply took up most available mental bandwidth, but most ML specific architecture enhancements also try to reduce the cost of moving the data by reversing the direction and move the compute closer to the data.
 
Please zoom out to 10,000 ft above ground ;-)
Doesn't really have anything to do with AMX, but still an interesting post. Thanks for taking the time to type it out.

Lots of caches have allowed returning to a partial Harvard (RISC) architecture, which separated code from data at least within them.
It took me a while to notice that, but separate L1 caches for code & data really are a partial form of Harvard Architecture.