News AMD unwraps Ryzen AI 300 series ‘Strix Point’ processors — 50 TOPS of AI performance, Zen 5c density cores come to Ryzen 9 for the first time

Oh look, it turns out all the rumors were true
300, so it's 100 more than Intel's 200
New naming scheme, but mixed into the older naming scheme
yay🙄
 
We don't have AI; it's the biggest scam there is.
We have algorithms that learn what we teach them, then can either repeat or formulate responses based on what we teach them.

It mimics patterns.

True AI would be able to make decisions on what it's being taught. AI would be able to decide for itself if what it is being taught is right or wrong, correct or incorrect.

Right now you can "teach" AI whatever you want because it's not intelligent, it's just a pattern recognizer and repeater.
 
True AI would be able to make decisions on what it's being taught. AI would be able to decide for itself if what it is being taught is right or wrong, correct or incorrect.
Don't be such a party pooper... Google's AI is telling you to put glue on your pizza, so you better put it on your pizza!
 
We don't have AI; it's the biggest scam there is.
We have algorithms that learn what we teach them, then can either repeat or formulate responses based on what we teach them.

It mimics patterns.

True AI would be able to make decisions on what it's being taught. AI would be able to decide for itself if what it is being taught is right or wrong, correct or incorrect.

Right now you can "teach" AI whatever you want because it's not intelligent, it's just a pattern recognizer and repeater.
Alright.

So how are we able to make decisions on what we were taught without past experience, self-taught or transferred?

People forget how much of human intelligence isn't actually there when we are born.
 
We don't have AI; it's the biggest scam there is.

Next thing you're gonna tell us is that AR & VR Aren't REAL !! 😱 🤯

Would calling it something like an "Algorithmic Inferencing" machine help you with the PR marketing bits they use for selling it to the plebes? 🤨 😜

PS, wait 'til ya' find out there's no actual Thunder in Thunderbolt hardware. 🤪
 
AMD unwrapped its new Ryzen AI 300 series, codenamed Strix Point, today at Computex... ..AMD’s new XDNA 2 engine that enables running AI workloads locally.

While it's a nice bump and promising (especially the option for 3 discrete workloads across CPU/NPU + iGPU + dGPU), it would've been far more impressive if the true 9 series HX platforms had been showcased with proper memory sizes and a better GPU option (with more VRAM). Especially since 12/24 still likely loses in most work/workstation scenarios to the 7945HX's 16/32, which also didn't get a refresh when the 8040 series came out.

However, it should give us an idea of how Strix Halo could do with twice the memory bandwidth and 2.5x the CUs, plus an added 32MB local MALL cache (might help make up for PCIe 4 vs 5) as a starting point... all before putting a capable GPU in there.

321 total TOPs for an R9 AI 370 HX + nV 4070 is OK , but as they say... twice (or thrice) as much for twice the price is Real Nice. 😎

Entertaining that ASUS early launch material highlights "45 TOPS NPU" for its TUF A16 platform instead of 50....

ibb (dot) co/7NmV53V

Did that evolve since they put that slide together or just a Qualcomm type Typo? 🧐
 
The article said:
Block FP16, a new data format that provides the full accuracy of FP16 with many of the same compute and memory characteristics of INT8. AMD says Block FP16 is plug-and-play; it doesn’t require quantizing, tuning, or retraining the existing models.
My guess is that it's basically like texture compression. I've heard of mobile NPUs using compression on the weights, but not really seen AMD or Nvidia talk about it (other than handling a limited degree of sparseness). It always seemed to me like a natural thing for GPUs to do, given their texture units are already in the datapath and already have hardware support for texture compression. ...except this isn't a GPU!

The article said:
It isn’t yet clear if Block FP16 is certified as an IEEE standard.
No. ...I'm 99.9% sure it's not.

But, that's irrelevant if it's basically invisible to software, as they claim. The main reason for IEEE standards is to have some consistency between hardware implementations, so that software doesn't have to introduce a ton of special cases, one for each hardware implementation.
 
My guess is that it's basically like texture compression. I've heard of mobile NPUs using compression on the weights, but not really seen AMD or Nvidia talk about it (other than handling a limited degree of sparseness). It always seemed to me like a natural thing for GPUs to do, given their texture units are already in the datapath and already have hardware support for texture compression. ...except this isn't a GPU!


No. ...I'm 99.9% sure it's not.

But, that's irrelevant if it's basically invisible to software, as they claim. The main reason for IEEE standards is to have some consistency between hardware implementations, so that software doesn't have to introduce a ton of special cases, for each hardware implementation.

Yeah, could be, but I'm doubtful because it seems like a stumbling block scaling size & speed long term.

I got the feeling it was more like how BFloat16 (the other B[F]16 🤓 ) deals with FP32 for rounding up/down, so it just becomes an 'efficient'/elegant math work around, that can be used in any scenario, rather than adding silicon for de/compression, but that's just my take/guess on it.
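For anyone following along, here is a rough Python sketch of the BFloat16 rounding being referred to, assuming the usual description of bfloat16 as the upper 16 bits of an FP32 value. The function names are made up for illustration, NaN handling is left out, and real hardware may pick a different rounding mode.

```python
import struct

def float_to_bf16_bits(x: float, round_nearest_even: bool = True) -> int:
    """Return a 16-bit bfloat16 pattern taken from the top half of x's FP32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if round_nearest_even:
        # Adding 0x7FFF plus the lowest kept bit rounds ties to even.
        # NaN inputs are not handled in this sketch.
        bits += 0x7FFF + ((bits >> 16) & 1)
    return (bits >> 16) & 0xFFFF

def bf16_bits_to_float(b: int) -> float:
    """Widen a bfloat16 pattern back to a float by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]

x = 1.005859375  # 1 + 2**-8 + 2**-9, exactly representable in FP32
print(hex(float_to_bf16_bits(x)))                             # 0x3f81 (rounded up)
print(hex(float_to_bf16_bits(x, round_nearest_even=False)))   # 0x3f80 (truncated)
print(bf16_bits_to_float(0x3F81))                             # 1.0078125
```

The only difference between the two results is how the discarded 16 bits are treated, which is why no de/compression hardware is needed for this particular format.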

It's something I wish they spent more time on, but I suspect we'll get that in the deep dives in the coming day/weeks before launch.

My initial read on it was to react "OH, this is how AMD is going to react to not having intel's AVX FP16 support" which of course is helpful for Ai workloads.

Who knows it may be the best of both worlds giving them speed down low for consumer Ai applications/platforms, which are definitely more about doing 95% of the job in 1/3 the time rather than speed up the full-fat FP16/32 by 20-50% (usually closer to 10% because at that level you are still heavily memory bound/restricted [more so if you're compression dependent IMO ]). 🤔🤷🏻‍♂️

To me, BFP16 is like adding a small turbo to a small engine in a light sports car (Lotus 7) .... boom, huge impact, while FP16 is like adding another turbo to a Chiron... OK, improvement, but not dramatic, and not where the majority of the market will be for Copilot+ level Ai PCs for the next coupla years.

The more precise faster seems like still the domain of commercial applications.... (where dedicated racks of precision are still telling people to eat rocks 🥸 🤣 ).

Now if only Block FP16 were as exciting as the French movie District BF13. 🤡

(* after reading your other post in the other thread, I think I added too many unnecessary analogies and aphorisms, when you likely would've been fine with just half the words in the second paragraph. THG has definitely added more depth in the forums since I left over a decade ago. Wish there were more deep-dive folks then. Now... to try and curb the dad humour.... must resist. 🥸 😉 )
 
"AMD hasn’t yet shared the full TOPS rating for its chips with the CPU and GPU added in."

why not?
It has already been posted in a few places based on typical configuration. But one stumbling block for AMD is likely that a large part of that estimate relies on the performance of the 4070 (or whatever GPU, but for now RTX 40series), and it also represents the largest number in the PR TOPs discussion, so not something you would want or need to promote in an event/launch about your hardware (not green's), especially since these mid-level laptops get destroyed by a single higher end RTX or RX card from long ago in a graphics card far far away back.

This is about efficiency, not max numbers, and saying the R9 AI 370 HX platform + 4070 gets about 300-333 total TOPS (number floating around being 321) would just invite the "Well, that's not even equal to..." comparison, kinda missing the point. Similar to people moaning about laptop Ryzen 9 / GF 4090 vs Desktop R9 / RTX 4090 etc.
 
Yeah, could be, but I'm doubtful because it seems like a stumbling block scaling size & speed long term.

I got the feeling it was more like how BFloat16 (the other B[F]16 🤓 ) deals with FP32 for rounding up/down, so it just becomes an 'efficient'/elegant math work around, that can be used in any scenario, rather than adding silicon for de/compression, but that's just my take/guess on it.
According to this, it's basically what I thought:

That Wikipedia entry currently isn't terribly clear, but I think it's a lot like the texture compression formats I've read about. The idea is to determine the scaling factor based on the block, and then have each element of the block encode its relative magnitude.
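A minimal sketch of that idea in Python, assuming an 8-bit signed mantissa per element and a single power-of-two scale per block. Those widths and the helper names `bfp_encode` / `bfp_decode` are my own choices for illustration, not AMD's published layout.

```python
import math

def bfp_encode(block, mantissa_bits=8):
    """Quantize a block of floats to one shared exponent plus signed integer mantissas."""
    max_mag = max(abs(v) for v in block)
    if max_mag == 0.0:
        return 0, [0] * len(block)
    # frexp gives max_mag = m * 2**e with 0.5 <= m < 1, so e bounds every element.
    _, shared_exp = math.frexp(max_mag)
    # Size of one mantissa step; one of the mantissa bits is spent on the sign.
    scale = math.ldexp(1.0, shared_exp - (mantissa_bits - 1))
    limit = (1 << (mantissa_bits - 1)) - 1
    mantissas = [max(-limit, min(limit, round(v / scale))) for v in block]
    return shared_exp, mantissas

def bfp_decode(shared_exp, mantissas, mantissa_bits=8):
    """Rebuild approximate floats from the shared exponent and per-element mantissas."""
    scale = math.ldexp(1.0, shared_exp - (mantissa_bits - 1))
    return [m * scale for m in mantissas]

print(bfp_encode([1.0, 0.5, -0.25, 0.0]))        # (1, [64, 32, -16, 0])
print(bfp_decode(*bfp_encode([100.0, 0.001])))   # [100.0, 0.0]
```

The second call shows the catch: a small value that shares a block with a much larger one gets rounded away, so how well this works depends on the values in a block having similar magnitudes.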

OH, this is how AMD is going to react to not having intel's AVX FP16 support" which of course is helpful for Ai workloads.
No, you've got it backwards. BF16 is more favorable for AI. Intel went back and added support for the IEEE 754-2008 16-bit format because it's more useful for non-AI signal processing. They claim it's useful in 5G basestations.
 
According to this, it's basically what I thought:

That Wikipedia entry currently isn't terribly clear, but I think it's a lot like the texture compression formats I've read about. The idea is to determine the scaling factor based on the block, and then have each element of the block encode its relative magnitude.

Maybe... if yesterday's WiKi update is correct, but it doesn't look right as it references MX micro-scaling which is more involved with FP8/Int8 to Sub-8 , although even FP32 to sub-8 is possible with optimal conditions and added complexity, but with increased loss potential. It just doesn't sound like FP16 v Int8 balancing implied by AMD. 🤔
If so yes it would be similar to texture compression, but IMO similarly a short term solution until you could move beyond those baseline resource limits.

While the goal is similar to BFloat 16 in what it's trying to accomplish (smaller/quicker) the cost in transistors and precision loss/errors (or serious specific optimization) seems to be a bad trade-off, unless you so value RAM/VRAM over precision. Which might very well be the case on shared memory and especially large res outputs that can cripple a 16-24GB dGPU... but again that didn't seem like the example shown. Dunno. 🤷🏻‍♂️



No, you've got it backwards. BF16 is more favorable for AI. Intel went back and added support for the IEEE 754-2008 16-bit format because it's more useful for non-AI signal processing. They claim it's useful in 5G basestations.
We shall see, it was more about streamlining the architecture and using the transistor budget vs bloat , it's the other side of the coin of the discussion above generalized support vs focused workloads, tradeoffs for both approaches.

Again, deeper insight into what AMD's Block FP16 as an example will provide insight into their thinking of leveraging existing vs using die space (especially if a bit of free headroom from process shift) to add specialization.

While others are annoyed at a lack of pricing, I'm annoyed by the lack of depth. Oh well, will have to wait either way... 😔
 
We shall see, it was more about streamlining the architecture and using the transistor budget vs bloat
Nope. BF16 is cheaper to implement, since floating point multipliers are dominated by the mantissa product, which scales as a square of the number of bits in the mantissa. This was cast as one of the selling points for BF16 over conventional FP16.
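To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes multiplier cost grows with the square of the significand width (explicit mantissa bits plus the implicit leading 1) and ignores exponent logic, rounding, and everything else in a real FPU, so treat it as a first-order estimate only. The names are made up for illustration.

```python
# Explicit mantissa bits per format: bfloat16, IEEE binary16, IEEE binary32.
MANTISSA_BITS = {"bf16": 7, "fp16": 10, "fp32": 23}

def significand_bits(fmt: str) -> int:
    """Explicit mantissa bits plus the implicit leading 1."""
    return MANTISSA_BITS[fmt] + 1

def relative_multiplier_cost(fmt: str, baseline: str = "bf16") -> float:
    """First-order cost of the mantissa product relative to the baseline format."""
    n = significand_bits(fmt)
    b = significand_bits(baseline)
    return (n * n) / (b * b)

print(relative_multiplier_cost("fp16"))  # 1.890625
print(relative_multiplier_cost("fp32"))  # 9.0
```

Under that assumption, an FP16 multiplier array comes out a bit under twice the size of a BF16 one.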

it's the other side of the coin of the discussion above generalized support vs focused workloads, tradeoffs for both approaches.
FP16 was designed to better address a broad range of applications. BF16 isn't useful for a whole lot beyond AI.

GPUs long had some nominal support for FP16, and Intel even had support for fp32 <-> fp16 conversion as far back as Ivy Bridge. I think only mobile GPUs actually devoted much silicon to optimizing fp16, however. It's been a formal datatype in OpenGL Shader Language since like 4.0, or so. I think interest in using it for graphics was the specific motivation for IEEE 754 to incorporate it in the 2008 revision.

While others are annoyed at a lack of pricing, I'm annoyed by the lack of depth. Oh well, will have to wait either way... 😔
You could try searching the patent database or following more breadcrumbs from the wikipedia page pointing to research people have done on this subject. I'd guess there are probably some recent academic papers on applying block floating point to AI workloads. I just don't have enough of a stake in the matter to bother chasing it down.
 
Nope. BF16 is cheaper to implement, since floating point multipliers are dominated by the mantissa product, which scales as a square of the number of bits in the mantissa. This was cast as one of the selling points for BF16 over conventional FP16.

Again to be clear that would be BFloat 16, not the linked BFP16 above.

But I think you're also confusing the argument for the CPU with the one for the more focused NPU. That logic is fine for the single-focus NPU cluster but not the CPU, where this myriad of extensions becomes an issue. And yes, CVT16 / F16C is the old-school AMD extension rolled into AVX.

I wasn't using it as an either/or trade-off of BFloat16 vs FP16 + complex; I was talking about removing the extraneous, while leaving what is more in line with the current & future role of the CPU, especially if it is the one acting as the gatekeeper of the other potentially shared resources/workloads across CPU/NPU/iGPU to be managed in differently optimized software environments.

It can even get additionally greasy if it were even just disparate NPU to NPU, since there isn't just one standard for BFloat16 rounding (nV vs ARM vs Google vs IEEE, etc). Sure, it's an unlikely scenario, but we will have an nV GPU sharing space with an AMD CPU+NPU and possibly another BF16 format. It shouldn't cause any problems, but if nV is promising CopilotRT* on their GPUs, does that simplify things or make it worse for developers? 🤔

That extraneous bloat grows with the core count, which makes sense as to why Intel attempted to decouple them, but they have obviously been caught wrong-footed in the transition. However, long term it's still the right strategy IMO.

Realistically I don't ever expect the CPU to take over for NPUs or GPUs, so the transistor budget priorities aren't the same even if they share the same die (less so as discrete chiplets). Guaranteed that the architecture now is the least efficient/effective this will be, and experience will lead to architecture optimizations that are further targeted, not broadened.

You could try searching the patent database or following more breadcrumbs from the wikipedia page pointing to research people have done on this subject. I'd guess there are probably some recent academic papers on applying block floating point to AI workloads. I just don't have enough of a stake in the matter to bother chasing it down.

Nah, I'm fine, I've got more than enough, and would just ask colleagues who deal with this daily far more than I. So again we shall see once
 
We don't have AI; it's the biggest scam there is.
I tend to agree.
We have algorithms that learn what we teach them, then can either repeat or formulate responses based on what we teach them.
That's how we work, too, most of the time. On the other end we tend to believe we're gods, which isn't always healthy.
It mimics patterns.

True AI would be able to make decisions on what it's being taught. AI would be able to decide for itself if what it is being taught is right or wrong, correct or incorrect.
Evidently you define "true AI" in a somewhat personal manner.

True and false, right or wrong are no absolutes, but mostly the result of social code evolution or darwinism.
With biological intelligence one would expect that the classification of true, false, right and wrong correlate at least with some humans, typically those around us.

With "true AI", that human perspective could very easily be lost and I'd rather take the [greed inspired] scam than what you seem to aim for.
Right now you can "teach" AI whatever you want because it's not intelligent, it's just a pattern recognizer and repeater.
And perhaps that's how far we should really take this.
 
My guess is that it's basically like texture compression. I've heard of mobile NPUs using compression on the weights, but not really seen AMD or Nvidia talk about it (other than handling a limited degree of sparseness). It always seemed to me like a natural thing for GPUs to do, given their texture units are already in the datapath and already have hardware support for texture compression. ...except this isn't a GPU!
In terms of space savings, yes. In terms of operation, obviously not. Not sure if texture compression could just operate on stores or if they would be read-only, too.

And that's perhaps the one thing to keep in mind: model weights are read-only during inference and it's the job of the compiler/model transpiler to find, translate and bundle the weights which can be put into blocks.
No. ...I'm 99.9% sure it's not.
It should be transparent, because by the time the data winds up in the registers, it will have a normal BF16/FP16 or whatever representation. So the in-memory representation may never be standardized, but for all the operational parts it should be indistinguishable.
But, that's irrelevant if it's basically invisible to software, as they claim. The main reason for IEEE standards is to have some consistency between hardware implementations, so that software doesn't have to introduce a ton of special cases, one for each hardware implementation.
Here I'd say "invisible" may be only half true in the sense that a model loader/transpiler might transform BF16/FP16 weights into the block format without the upper layers having to care about that. And indeed the MAC code would just load and compute on data that is transformed to full 16 bit data on the fly by loads.

IEEE doesn't (yet?) support each and every split between mantissa and exponent bits and their interpretations, but here that's hidden by load logic that simply copies the block common bits and merges them with the individual weight bits as data gets loaded.
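To illustrate the split between in-memory layout and what the compute units see, here is a toy Python sketch. It assumes one signed exponent byte shared by the block, followed by one signed 8-bit mantissa per weight; that layout, the block size of 8 in the footprint calculation, and all function names are guesses for illustration and not a confirmed description of AMD's format.

```python
import math
import struct

def pack_block(shared_exp, mantissas):
    """Store one signed exponent byte followed by one signed byte per weight."""
    return struct.pack(f"<b{len(mantissas)}b", shared_exp, *mantissas)

def load_block(packed):
    """Expand a packed block into ordinary floats, as a load unit might do on the fly."""
    shared_exp, *mantissas = struct.unpack(f"<b{len(packed) - 1}b", packed)
    scale = math.ldexp(1.0, shared_exp - 7)  # 7 magnitude bits in an 8-bit signed mantissa
    return [m * scale for m in mantissas]

def bits_per_element(block_size, mantissa_bits=8, exponent_bits=8):
    """Average storage per weight once the shared exponent is amortized over the block."""
    return (block_size * mantissa_bits + exponent_bits) / block_size

packed = pack_block(1, [64, 32, -16, 0])
print(len(packed))           # 5 bytes for 4 weights
print(load_block(packed))    # [1.0, 0.5, -0.25, 0.0]
print(bits_per_element(8))   # 9.0
```

Everything downstream of `load_block` only ever sees plain floats, which is the sense in which the packed format could stay invisible to the MAC code.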

Disclaimer: all of the above is my own deductions and could be complete rubbish.
 
Here's a die annotation.

Four "Zen 5" cores, each with a 1 MB dedicated L2 cache, share a 16 MB L3 cache. The eight "Zen 5c" cores share a smaller 8 MB L3 cache, in what could be a separate CCX (they also have 1MB L2 cache).

 
Strix Point will have 2 CCXs, with the majority of the L3 cache shared by the Zen 5 big cores.
 