News New RISC-V microprocessor can run CPU, GPU, and NPU workloads simultaneously

ezst036

Honorable
Oct 5, 2018
576
494
11,920
Hopefully the designs come together with the ultimate socket and the ultimate motherboard; otherwise, it is the ultimate thing we cannot use.

Or the ultimate UEFI/BIOS update: could the chip potentially use socket AM5/LGA1700?
 
  • Like
Reactions: artk2219

Findecanor

Distinguished
Apr 7, 2015
259
180
18,860
Whatever it is, at least it is both buzzword- and superlative-compliant ...

The description reminded me of Tachyum's promises of a "universal processor", which we also know very little about. People have been comparing RISC-V's vector unit with GPUs' compute units for a while now, though.

I could find only one of the "14" patents: an application for a Dynamic processing memory core on a single memory chip.

In descriptions of the company, it is supposedly in the "embedded", "low-power" sector.
Is their target to be GPUs in smartphones?
 

ekio

Reputable
Mar 24, 2021
88
113
4,710
Wow, a core that can handle CPU/GPU/NPU tasks...
If this is efficient, it means that most of the transistors can be put to use, instead of just a small part of the die.
But I bet this kind of tech will be killed by a bigger company that doesn't want any threat to its old business model.
 

Conor Stewart

Prominent
Oct 31, 2022
24
12
515
Whilst it claims to be a RISC-V processor and have a "RISC-V ISA compliant CPU", will the chip as a whole actually fully comply with the RISC-V ISA and its extensions? I would be very surprised if they are able to create this and have the whole thing fully compliant. They only say it is "RISC-V based".

as its design combines the capabilities of a CPU and GPU into a single-core architecture. This isn’t like the typical designs from Intel and AMD where there are separate CPU cores and GPU cores.
The chip uses the open-source RISC-V ISA for CPU and GPU
There seems to be a lot of contradiction or confusion. Why do they need to specify that they use RISC-V for the CPU and GPU if the CPU and GPU are actually combined into a single core that uses a "single-core architecture"?

The CPU/GPU cores can be meshed together into a multi-core design
Note the plural here in "cores" and that they say "CPU/GPU cores". If it were one single core, wouldn't they just say their hybrid core can be meshed into multi-core designs?

RISC-V microprocessing chip architecture that combines a RISC-V CPU core with vector capabilities and GPU acceleration into a single chip.
Again, this seems to imply that the CPU cores and the GPU acceleration are separate. Interestingly, they call them CPU and GPU cores, but when referring to what they have made, they only call it a "hybrid chip"; why not call them hybrid cores? This quote also says it combines a CPU core with GPU acceleration into a single chip, which again seems to imply that the CPU and GPU are actually separate.

Also very important is what it says in the image:
Proprietary HAL for low level access
If it is fully RISC-V based and compliant then why do they need a proprietary HAL?

Also in the image:
RISC-V ISA compliant CPU
They also say elsewhere that the CPU and GPU are RISC-V ISA, so is the GPU not compliant?

Too much of what they say could be seen as vague or slightly self-contradictory, so something doesn't seem quite right. They really should clarify exactly what they mean.

It seems to me they may be using separate CPU and GPU cores but claiming it is a single core architecture because they are both RISC-V based.
 

bit_user

Polypheme
Ambassador
A completely open-source 'ultimate' chip?
@JarredWaltonGPU can these guys ever learn? "Open source" would be if they put the hardware design up on a public github repo for anyone to download, layout, and send to a fab to have it manufactured. That's not at all what they did, any more than it's what Intel or AMD does.

As the new design is based on RISC-V, anyone can utilize the architecture without having to pay instruction-set royalty fees — unlike x86 and ARM.
This part is better, but "architecture" is an overloaded term. It could refer to:
  1. System architecture - the part of the machine specification that tells the operating system how to boot the CPU, configure the page translation tables, how interrupts are handled, etc.
  2. SoC architecture - the high-level design of a SoC, including its interconnect network, I/O, cache hierarchy, and core configuration.
  3. Instruction Set Architecture (ISA) - specifies the instruction op codes, their semantics & behavior, the register files, instruction scheduling constraints, etc. Basically, the details central to writing programs for the cores that implement it.
  4. Micro-architecture - the detailed, internal design of the cores, including things like the way they're pipelined, their branch-predictor, the micro-ops the front-end decodes to, details of the micro-op cache, if they have one, their physical register file, if it's distinct from the logical register file specified in the ISA, etc. Basically, all the internal details that software doesn't usually need to worry about.

So, what's open about RISC-V is the ISA (#3) and I believe at least some core parts of the system architecture (#1).
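
To make the difference between #3 and #4 concrete, here's a minimal C sketch (assuming a RISC-V compiler target; the function name is mine, not from any spec). Everything visible below is ISA-level; how a given core actually executes the add is micro-architecture and doesn't appear in the code at all.

```c
#include <stdint.h>

/* The "add rd, rs1, rs2" instruction is defined by the RISC-V ISA (#3):
 * its opcode, register operands, and semantics. Whether the core that runs
 * it is in-order or out-of-order, how it renames registers, etc. is
 * micro-architecture (#4) and is invisible from here. */
static inline uint64_t add_isa_level(uint64_t a, uint64_t b)
{
    uint64_t result;
    __asm__("add %0, %1, %2" : "=r"(result) : "r"(a), "r"(b));
    return result;
}

int main(void)
{
    return (int)add_isa_level(2, 3);   /* exits with status 5 */
}
```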
 

bit_user

Polypheme
Ambassador
As for the claim of this being a ground-breaking, first-of-its-kind approach ...eh, sort of.

Think Silicon has a line of NEOX iGPUs, targeted at embedded markets, which is RISC-V based:

I think their claims are slightly more modest, in that they don't claim to be the only kind of core you need - just that you could run generic RISC-V code on the iGPU, if you wanted to.

More recently, Imagination has announced something similar:

I think the jack-of-all-trades approach makes the most sense for small, low-cost embedded cores, where you really don't want to waste die space on separate sets of CPU and GPU (and AI) cores. That's where I think it makes the most sense to prioritize flexibility and area efficiency over performance or energy-efficiency.
 
@JarredWaltonGPU can these guys ever learn? "Open source" would be if they put the hardware design up on a public github repo for anyone to download, layout, and send to a fab to have it manufactured. That's not at all what they did, any more than it's what Intel or AMD does.


Tweaked the tagline, though it's worth noting that JPR says the company will open-source the material. It hasn't happened yet, AFAIK, but it could potentially happen. Maybe. Unless the company pulls an OpenAI and shuts down all the "open" aspects. LOL
 

Pierce2623

Upstanding
Dec 3, 2023
152
132
260
Tweaked the tagline, though it's worth noting that JPR says the company will open-source the material. It hasn't happened yet, AFAIK, but it could potentially happen. Maybe. Unless the company pulls an OpenAI and shuts down all the "open" aspects. LOL
It was kinda mind-blowing how quickly OpenAI went from a non-profit with open-source models and a heavy emphasis on “safe” AI to a for-profit behemoth.
 

bit_user

Polypheme
Ambassador
TL;DR: maybe they're contributing some new RISC-V instruction opcodes to the standard, but it would be awesome & very journalistic of you guys to shoot X-Silicon a quick email and ask exactly what's meant by this "open source" talk.

Tweaked the tagline, though it's worth noting that JPR says the company will open-source the material. It hasn't happened yet, AFAIK, but it could potentially happen.
Thanks for pointing that out. I've now read the JPR report and that's definitely a muddled mess.

Unfortunately, I can't find on the newswires any press release that JPR might be reporting on. Furthermore, a visit to the News tab of x-silicon's site just pulls up an even older article by JPR.

So, given that all we have to go on is the JPR article, let's spread out the entrails and try to see if they're trying to convey any coherent message, or if we can at least find where JPR got confused.

"X-Silicon Inc. (XSi) revealed its open-standard, low-power C-GPU architecture, combining GPU acceleration with a RISC-V vector CPU core and tightly coupled memory for a low-power, single-processor solution. It is an open-sourcing of its unified RISC-V vector CPU-with-GPU ISA and offers register-level hardware access via a hardware abstraction layer (HAL)."
First, nothing I can find about X-Silicon says they're trying to establish their architecture as an open-standard. I guess the first sentence could allude to the set of standard RISC-V instructions they implemented, but it's a confusing use of "architecture".

Then, it goes on to talk about open-sourcing their ISA, which is basically nonsense. I guess they could mean they implemented new RISC-V instructions that they're trying to get included in the ISA.


"For over 20 years, the industry has been seeking an open-standard GPU ..."
Here, I suppose he's talking about each GPU being a proprietary implementation, at the hardware level. Of course, there are open-standard APIs for programming them, like OpenGL, OpenCL, and Vulkan. However, with no standard for the actual hardware, you're reliant on the manufacturer to provide support for those or other APIs.
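
To illustrate the distinction, here's a minimal, hedged C sketch using OpenCL, one of those open-standard APIs: the same portable calls enumerate whatever vendor's hardware happens to be present, while the implementation underneath each platform remains proprietary (this assumes an OpenCL SDK and ICD loader are installed).

```c
#include <CL/cl.h>
#include <stdio.h>

int main(void)
{
    cl_uint count = 0;
    clGetPlatformIDs(0, NULL, &count);          /* how many vendor platforms? */

    cl_platform_id platforms[8];
    if (count > 8) count = 8;
    clGetPlatformIDs(count, platforms, NULL);

    for (cl_uint i = 0; i < count; i++) {
        char name[256] = {0};
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, sizeof(name), name, NULL);
        /* e.g. "NVIDIA CUDA", "AMD Accelerated Parallel Processing", ... */
        printf("Platform %u: %s\n", i, name);
    }
    return 0;
}
```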


"X-Silicon Inc (XSi), a San Diego based start-up founded in March 2022, unveiled its latest innovation: the open-standard, low-power C-GPU architecture, merging GPU acceleration into a RISC-V vector CPU core with tightly coupled memory, offering a low-power, single-processor solution. XSi’s approach introduces open-sourcing of its unified RISC-V vector CPU-with-GPU ISA and provides register-level hardware access through a hardware abstraction layer (HAL). That, says the company, empowers OEMs and content providers to tailor drivers and applications with unusual customization, diverging from the closed solutions of competitors."
Again, talking about open-sourcing the ISA. The following sentence talks about the implications for software that's running on the chip. If they were really open-sourcing the hardware, he would be spinning a different value proposition.


"XSi’s open-standard, low-power C-GPU architecture and NanoTile platform suggest a paradigm shift in GPU technology. With its support of open standards, customizable hardware access, and approach to dynamic content rendering, XSi thinks it will set a new standard for GPU architecture, empowering developers and OEMs to unlock unprecedented levels of performance and efficiency in graphics rendering and AI/ML-compute applications."
Again, it's not clear if the "open-standard" architecture merely refers to the use of RISC-V, or if there are new instructions they've implemented that they're trying to contribute back to RISC-V, where others can also implement them.


"The company reports the RISC-V ecosystem is reacting positively to the launch of a new compute-graphics company that is fully committed to furthering the open-standard ecosystem."
More general excitement about RISC-V.


The company plans to make its software development kits available to a select set of early development partners later this year.
Ooooh... yeesh. That's not very open, eh? If the programming model of the hardware were an open standard, they'd be talking about when they're going to publish it. If they were opening things up even more, they'd be talking about something like a GitHub repo they're going to publish or open up.

Maybe. Unless the company pulls an OpenAI and shuts down all the "open" aspects. LOL
Best-case scenario, they've created some new RISC-V instructions they're now trying to standardize. I think that's about the extent of their openness.

If you search the entire article for the word "source" or "sourcing", the only hits are basically 2 repetitions of this construct:

"open-sourcing of its unified RISC-V vector CPU-with-GPU ISA"
In both cases, talking about the ISA. That could mean what I said above, but it's quite a stretch to read anything more dramatic into it.

I also did a little bit of web-searching, to see if there was anything special about this company, or if they had publicly said anything else about open source. I found neither.
 

bit_user

Polypheme
Ambassador
There is also another one by Think Silicon, targeted at low-end devices, without vector support.
Uh, according to this, it sure does have vector support!

(Image: NEOX architecture diagram)


It doesn't even make sense to have a GPU architecture that lacks any kind of vector support. Perhaps whatever you read about its vector support was complaining that they didn't implement the standard RISC-V vector instructions? Those were only standardized somewhat recently - maybe not in time for the first NEOX?

P.S. I just noticed your profile pic is someone standing in front of a Think Silicon logo. If that's really you, would you care to tell us more?
 

bit_user

Polypheme
Ambassador
Wow, a core that can handle CPU/GPU/NPU tasks...
If this is efficient,
No. GPUs tend to be simple, in-order cores. If you build a GPU with complex, out-of-order cores, in order to make it better at general-purpose computation, then it will be less compute-dense at GPU (and AI) tasks. There's a fundamental tension between the two kinds of microarchitectures, which is why this path isn't well-trodden.
 

bit_user

Polypheme
Ambassador
If it is fully RISC-V based and compliant then why do they need a proprietary HAL?
Any SoC has a lot of configuration and control registers for its various fixed-function blocks. So, when they say:

"XSi’s approach introduces open-sourcing of its unified RISC-V vector CPU-with-GPU ISA and provides register-level hardware access through a hardware abstraction layer (HAL)."

...probably what they're talking about is access to those specialized hardware registers that fall outside the realm of what the RISC-V standard covers.

That does not mean it's necessarily not fully RISC-V compliant. For instance, both AMD and Intel's CPUs are fully x86-64 compliant, but they each have specialized registers that govern various aspects of the system.
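
For a sense of what "register-level hardware access through a HAL" typically looks like, here's a hypothetical C sketch; the XSI_* names, base address, and offset are invented for illustration and are not from any X-Silicon documentation. The point is that these device registers live entirely outside what the RISC-V ISA spec covers, just like vendor-specific registers on x86.

```c
#include <stdint.h>

#define XSI_GPU_MMIO_BASE  0x40000000u   /* assumed MMIO window (hypothetical) */
#define XSI_REG_CLK_CTRL   0x0010u       /* assumed clock-control register (hypothetical) */

/* Thin HAL wrappers around memory-mapped register access. */
static inline void hal_write_reg(uint32_t offset, uint32_t value)
{
    volatile uint32_t *reg =
        (volatile uint32_t *)(uintptr_t)(XSI_GPU_MMIO_BASE + offset);
    *reg = value;
}

static inline uint32_t hal_read_reg(uint32_t offset)
{
    volatile uint32_t *reg =
        (volatile uint32_t *)(uintptr_t)(XSI_GPU_MMIO_BASE + offset);
    return *reg;
}

/* Example use: tweak a device clock without touching anything ISA-defined. */
void example_set_clock(void)
{
    uint32_t ctrl = hal_read_reg(XSI_REG_CLK_CTRL);
    hal_write_reg(XSI_REG_CLK_CTRL, ctrl | 1u);   /* set a hypothetical enable bit */
}
```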

According to this post (and I've seen similar claims, elsewhere):

"much of the AMDGPU driver code base is so large because of auto-generated header files for GPU registers, etc. In fact, 1.79 million lines as of Linux 5.9 for AMDGPU is simply header files that are predominantly auto-generated. It's 366k lines of the 2.71 million lines of code that is actual C code."

Source: https://news.ycombinator.com/item?id=24748719
 

dimar

Distinguished
Mar 30, 2009
1,044
64
19,360
I wonder if it would be feasible, or even possible, for motherboard companies to come up with a universal, CPU-brand-independent, and highly upgradable motherboard that would support any type of known single or multi-CPU configuration, including old ones, for experimentation and learning purposes. This way you could pop in Intel, AMD, ARM, and whatever universal CPUs. I guess new standards would have to be invented. How cool would that be?
 
I wonder if it would be feasible, or even possible, for motherboard companies to come up with a universal, CPU-brand-independent, and highly upgradable motherboard that would support any type of known single or multi-CPU configuration, including old ones, for experimentation and learning purposes. This way you could pop in Intel, AMD, ARM, and whatever universal CPUs. I guess new standards would have to be invented. How cool would that be?
The last name for that was Socket 7. Then Intel copyrighted its CPUs' pinout.
Before that, you actually had systems where the mainboard was a "simple" series of connectors, and you would have the CPU on a daughterboard, storage on another, etc.
And you could have a Motorola 68k and a PowerPC in the same system, or a PowerPC and an x86; it then depended on how you could access the "child" CPU from the "main" one.
 
No. GPUs tend to be simple, in-order cores. If you build a GPU with complex, out-of-order cores, in order to make it better at general-purpose computation, then it will be less compute-dense at GPU (and AI) tasks. There's a fundamental tension between the two kinds of microarchitectures, which is why this path isn't well-trodden.
So this is what I've wondered about. Like, at a low level, there are execution resources for INT, FP, vector workloads, plus branching, etc. If XSi (or anyone else) could build separate front-ends for GPU/CPU/NPU, and separate back-ends for the same, but then share all of the execution resources in the middle... maybe it works well? I mean, there are loads of questions with this sort of stuff, but certainly there's nothing that proves the way we've been doing things for decades is actually the best approach for this new paradigm.

As for the JPR stuff, I did remove "open-source" from the tagline and reference JPR as using that term. It's probably the same misunderstanding with their author as with ours, and there's no open-sourcing actually planned. Because I'm sure the end goal for XSi is to sell its custom silicon designs. Companies wouldn't have to license the instruction set, but my take is that they're definitely going to pay to use XSi designs.

The fact that JPR is one of the main sources of information here isn't super helpful either. JPR is an analyst firm, and at times analysts are paid to write white papers and "financial prospectus" papers for a company. This feels like it might be in that vein. I think I mentioned to someone when looking at this yesterday that it felt more like XSi casting out a line to see if it could hook any angel investors.

So, color me very skeptical, because we hear lots of companies making grandiose claims about building a better processor. Rarely do those claims pan out into physical hardware that actually achieves those claims.
 
  • Like
Reactions: bit_user

bit_user

Polypheme
Ambassador
So this is what I've wondered about. Like, at a low level, there are execution resources for INT, FP, vector workloads, plus branching, etc.
Modern, general-purpose CPU cores have poor compute-density, because they devote disproportionate resources for superscalar and out-of-order execution. All of that specialized hardware for detecting & tracking dependencies between instructions, then dispatching and executing non-dependent instructions concurrently, not to mention branch prediction... takes a lot of die space and a lot of power, without doing any of the real computation that the program requested.

To put it another way, I'm sure that if you plotted CPU IPC vs. transistor count, you'd get a logarithmic curve. Each order of magnitude more transistors unlocks an incremental amount of IPC. GPUs, on the other hand, scale performance far more linearly per transistor, because they can improve performance pretty much by just adding more simple cores, rather than having to be super-creative about how to extract more performance from a small number of cores the way CPUs do. For GPUs, this isn't only about the amount of parallelism in the workloads, but also software-visible differences like having a weakly-ordered memory model and direct-mapped local SRAM, which have a significant impact on how they're programmed.
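
Stated loosely as formulas (this is just a restatement of the hand-wavy claim above, not measured data):

```latex
\[
  \underbrace{\mathrm{IPC}_{\mathrm{CPU}} \;\propto\; \log(\text{transistors})}_{\text{diminishing returns per core}}
  \qquad \text{vs.} \qquad
  \underbrace{\mathrm{Throughput}_{\mathrm{GPU}} \;\propto\; N_{\text{cores}} \;\propto\; \text{transistors}}_{\text{near-linear scaling by adding simple cores}}
\]
```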

maybe it works well? I mean, there are loads of questions with this sort of stuff, but certainly there's nothing that proves the way we've been doing things for decades is actually the best approach for this new paradigm.
Trust that modern CPU designers have done a very good job of building general-purpose cores and that GPU architects have done a very good job of building fast GPUs. To try and fuse the two approaches is certainly going to make each less optimal for its intended use. Also, eliminating the programmability differences I mentioned would have costs.

As for the JPR stuff, I did remove "open-source" from the tagline and reference JPR as using that term. It's probably the same misunderstanding with their author as with ours, and there's no open-sourcing actually planned.
Thank you. I would be very interested in knowing if they did add any new instructions (e.g. for texture lookups, etc.) that they're contributing to the RISC-V standard. I'm not sure where to find a list of proposed ISA extensions, but I'll do some poking around. In the meantime, maybe you could shoot them an email, if you get bored.
 
Modern, general-purpose CPU cores have poor compute-density, because they devote disproportionate resources for superscalar and out-of-order execution. All of that specialized hardware for detecting & tracking dependencies between instructions, then dispatching and executing non-dependent instructions concurrently, not to mention branch prediction... takes a lot of die space and a lot of power, without doing any of the real computation that the program requested.
That's my point, though: You could build a bunch of execution pipelines that can do FP8, FP16, FP32, FP64, and all the INT variants as well. These would be separate from all the fetch, decode, scheduling, etc. Obviously that stuff still needs to happen, but perhaps it could be decoupled from the execution resources somewhat, and in an ideal case maybe most of the execution gets focused on GPU/AI workloads while the CPU doesn't need as much — but what it does need gets a higher priority.

Anyway, I know Nvidia and AMD overlap the use of their execution resources with various workloads. It's not like the AI Accelerator that AMD added in RDNA 3 is a whole bunch of extra stuff on top of what was already there — it's just a different way of accessing the same resources for large matrix or vector operations. Nvidia also shared some resources between Tensor cores and other units in the past (registers mostly, IIRC?), but I don't know what they're currently doing.

Making all of this work, and work well, is obviously not going to be simple. Maybe it's not even possible. The "one core to rule them all" idea sounds like a pipe dream, honestly. But I'm not 100% convinced something couldn't be done that would prove interesting and potentially useful. We shall see.

Trust that modern CPU designers have done a very good job of building general-purpose cores and that GPU architects have done a very good job of building fast GPUs. To try and fuse the two approaches is certainly going to make each less optimal for its intended use. Also, eliminating the programmability differences I mentioned would have costs.
Oh, I definitely trust them more than XSi, but I also know that the whole space of chip design is, like mathematics, infinite. No one can ever know and understand it all, and we keep discovering new stuff. So, I'm skeptical at best, but I'm saying there's a (very minute) chance XSi can do something different that works well.

I think quantum computing revolutionizing things in my lifetime is more likely than the existing way of doing CPUs, GPUs, etc. being supplanted. But either or both of those could happen. 🙃
 
  • Like
Reactions: bit_user

bit_user

Polypheme
Ambassador
That's my point, though: You could build a bunch of execution pipelines that can do FP8, FP16, FP32, FP64, and all the INT variants as well. These would be separate from all the fetch, decode, scheduling, etc. Obviously that stuff still needs to happen, but perhaps it could be decoupled from the execution resources somewhat, and in an ideal case maybe most of the execution gets focused on GPU/AI workloads while the CPU doesn't need as much — but what it does need gets a higher priority.
I think I follow what you're saying, if you mean the chip just has a big pool of compute pipelines that could receive work from either a small number of complex front-ends or a larger number of simple ones. I like to think about that sort of thing, but I guess the fatal flaw is that the more you ship data around the chip, the more time & energy it takes. Since latency and power are limiting factors in the performance of modern chips, designers have an incentive to try and keep cores as compact as possible.

A long time ago, when I first started programming a VLIW machine, I did some reading on the subject and came across an idea called a transport-triggered architecture (TTA). Basically, the author cited data movement as one of the bottlenecks in scaling up VLIW even further (I think VLIW is more susceptible to this than normal OoO, because VLIW needs precise & fairly deterministic timing guarantees). So, the idea was to model the CPU core as a network, where the register file and computation units were asynchronous nodes. At least, that's what I think I took away from it.
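
As a toy illustration of the transport-triggered idea (the port names and single-ALU setup are mine, not any real TTA ISA): the "program" consists of nothing but moves between ports, and writing a functional unit's trigger port is what causes it to compute.

```c
#include <stdint.h>
#include <stdio.h>

/* Ports on the "network": registers plus the ALU's operand/trigger/result ports. */
enum port { R0, R1, R2, ALU_OP_A, ALU_TRIG_ADD, ALU_RESULT, NUM_PORTS };

static int64_t bus[NUM_PORTS];

/* One TTA "instruction" is just a transport: copy src port to dst port.
 * Writing the ALU's trigger port is what fires the addition. */
static void move(enum port dst, enum port src)
{
    bus[dst] = bus[src];
    if (dst == ALU_TRIG_ADD)
        bus[ALU_RESULT] = bus[ALU_OP_A] + bus[ALU_TRIG_ADD];
}

int main(void)
{
    bus[R0] = 2;
    bus[R1] = 3;

    /* Equivalent of a conventional "add r2, r0, r1", expressed as transports. */
    move(ALU_OP_A, R0);
    move(ALU_TRIG_ADD, R1);
    move(R2, ALU_RESULT);

    printf("r2 = %lld\n", (long long)bus[R2]);   /* prints "r2 = 5" */
    return 0;
}
```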

I wonder if there's an OoO core that ever picked up that idea and went with it. I expect not, because this was right around the time SSE came along, and I think that sort of decoupling becomes a lot more expensive with wide vectors. Having a smaller number of wide vector engines that are closely-coupled is probably a more area-efficient way of maximizing performance, anyhow.

I know Nvidia and AMD overlap the use of their execution resources with various workloads. It's not like the AI Accelerator that AMD added in RDNA 3 is a whole bunch of extra stuff on top of what was already there — it's just a different way of accessing the same resources for large matrix or vector operations. Nvidia also shared some resources between Tensor cores and other units in the past (registers mostly, IIRC?), but I don't know what they're currently doing.
Yes, RDNA3 uses WMMA (Wave Matrix Multiply+Accumulate) instructions issued from the same wavefronts that run its shader pipelines, using the same registers. This approach seems to work well enough that I'm a little puzzled how/why XDNA can supposedly achieve such an efficiency advantage over RDNA3. For a small block on a SoC, I think it makes a lot of sense to follow the approach of using the same resources for GPU and AI, but I must be missing something, since all of the phone SoCs and now both Intel and AMD have chosen the path of building dedicated NPUs.

BTW, I find it interesting that both Intel and AMD are using IP from their acquisitions in their NPUs. In Intel's case, it was Movidius; in AMD's, Xilinx. Both NPUs are basically arrays of little vector DSP cores and local SRAMs, which really don't strike me as being that different from GPUs.

Oh, I definitely trust them more than XSi,
Above, I mentioned there's a narrow market niche I see where the fused CPU/GPU/NPU approach makes sense. The following conditions must be true:
  • Your need for single-threaded general-purpose compute is relatively modest.
  • You definitely need an iGPU.
  • You care about AI also, but maybe as a future capability and therefore aren't sure it's worth a dedicated accelerator.
  • Your application is more cost-sensitive than power-limited.
  • Optionally: you want to customize the graphics or AI pipeline using a (mostly) standard development & debugging toolchain.

This limits it to things like kiosks and devices or appliances with a built-in display. Possibly set-top boxes, but that might be a stretch. But, with so many appliance-type devices just relying on a phone app for their GUI, I don't really know how big that market is.