News: Newly Revealed RISC-V Vector Unit Could Be Used for AI, HPC, GPU Applications

The VU is made up of several 'vector cores', which are comparable to GPU cores from AMD, Intel, and Nvidia
Using a general-purpose ISA in a GPU only makes sense for specialized applications. If your goal is to build a highly scalable accelerator that can effectively compete with purpose-built GPUs, then you will suffer the same fate as Xeon Phi. A general-purpose ISA drags in too much overhead that purebred GPUs don't have to deal with.

Moreover, if they're trying to efficiently tackle AI workloads, then they'll need matrix-multiply hardware. Vector-level acceleration is no longer enough.
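
To put rough numbers on that (my own back-of-the-envelope illustration, not something from the article): matrix multiply reuses every operand many times, so its arithmetic intensity grows with problem size, while a plain vector operation touches each element once and stays bandwidth-bound no matter how wide the vector unit is.

```python
# Illustrative arithmetic-intensity comparison (FLOPs per byte moved),
# assuming FP16 operands (2 bytes/element). Numbers are my own sketch.

def matmul_intensity(n, bytes_per_elem=2):
    flops = 2 * n ** 3                   # n^3 multiply-adds
    data = 3 * n ** 2 * bytes_per_elem   # read A and B, write C once
    return flops / data

def vector_axpy_intensity(n, bytes_per_elem=2):
    flops = 2 * n                        # one multiply-add per element
    data = 3 * n * bytes_per_elem        # read x and y, write y
    return flops / data

print(f"1024x1024 matmul: {matmul_intensity(1024):.0f} FLOP/byte")      # ~341
print(f"vector AXPY:      {vector_axpy_intensity(1024):.2f} FLOP/byte")  # ~0.33
```

A matrix engine with local operand reuse is what lets you cash in that high intensity; a vector unit on its own mostly ends up waiting on memory.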

only Tenstorrent is developing high-performance RISC-V IP that can be used to build processors and AI accelerators.
Ah, but they didn't get rid of their Tensix cores. Those are the main workhorses of Tenstorrent's accelerators. From the linked article:

"In addition to a variety of RISC-V general-purpose cores, Tenstorrent has its proprietary Tensix cores tailored for neural network inference and training. Each Tensix core comprises of five RISC cores, an array math unit for tensor operations, a SIMD unit for vector operations, 1MB or 2MB of SRAM, and fixed function hardware for accelerating network packet operations and compression/decompression."

In addition to the matrix/tensor unit, the local SRAM is also key. That's something that doesn't fit well into general-purpose CPUs. They can have cache, but cache adds latency and burns more energy per access than directly addressed SRAM.
 
So I assume this Gazzillion technology should remove the latency issues that can occur when using CXL to access far-away memory at the high rates it was designed to deliver.

According to the company, the Gazzillion technology was specifically designed for Recommendation Systems that are a key part of Data Centre Machine Learning.

So by supporting over a hundred outstanding misses per core, an SoC can be designed that delivers highly sparse data to the compute engines without a large silicon investment. The core can also be configured from 2-way up to 4-way to help accelerate the not-so-parallel portions of Recommendation Systems.

Also, with its full MMU support, Atrevido should be Linux-ready, while supporting cache-coherent multi-processing environments from two up to hundreds of cores.
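
For what it's worth, here's a quick Little's-Law sanity check (my own illustrative bandwidth and latency numbers, not Semidynamics') of why a core needs on the order of a hundred outstanding misses to keep sparse data streaming:

```python
# Little's Law: outstanding requests ~= bandwidth * latency / bytes per request.
# The bandwidth and latency figures below are assumptions for illustration only.

def outstanding_misses(bandwidth_gb_s, latency_ns, line_bytes=64):
    bytes_in_flight = bandwidth_gb_s * latency_ns   # GB/s * ns = bytes
    return bytes_in_flight / line_bytes

# One core streaming at ~50 GB/s against ~150 ns of DRAM latency:
print(outstanding_misses(50, 150))   # ~117 cache-line misses in flight

# The same stream against a ~400 ns CXL round trip needs far more:
print(outstanding_misses(50, 400))   # ~312 misses in flight
```

So "over a hundred misses per core" is roughly what it takes to stay busy against ordinary DRAM latency; add a CXL hop and even 128 starts to look thin.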
 
So I assume this Gazzillion technology should remove the latency issues that can occur when using CXL to access far-away memory at the high rates it was designed to deliver.
They said:
"to fetch all this data from memory, we have our Gazzillion technology that can handle up to 128 simultaneous requests for data and track them back to the correct place in whatever order they are returned."​

So, two questions I have:
  1. Is this specific to the core or the memory controller?
  2. How does it compare with the OoO load queues in the latest x86 and ARM CPUs?

I'm guessing they mean the core. I'll just pick one point of comparison, but data on Zen 4 and recent ARM cores probably shouldn't be too hard to find:

"Intel claims the load queue has 240 entries in SPR. I assume the same applies to Golden Cove. We measured 192 entries."

Source: https://chipsandcheese.com/2023/01/...egister-file-checking-with-official-spr-data/

They have more to say about it, if you're interested. However, it tells us that if there's anything special about "Gazzillion technology", it's probably that the miss-tracking capacity is large relative to the size and complexity of their core, not large in an absolute sense.

And no, I don't think it's nearly enough to hide CXL latency. GPUs use massive multithreading (SMT-style) for latency hiding, which I think is the only approach with a chance of hiding that much latency. For good throughput, most of your data shouldn't be on the remote side of a CXL link anyway: a quick glance at GPU memory bandwidth tells you CXL doesn't have nearly enough to act as a substitute. Gaming GPUs have up to 1 TB/s; datacenter GPUs have more like 3.2 TB/s. A 16-lane CXL link only gives you 64 GB/s - off by more than an order of magnitude.
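
Just to make that gap concrete, here's the arithmetic on the numbers above (nothing beyond the figures already quoted):

```python
# Bandwidth gap between local GPU memory and a single 16-lane CXL link (GB/s),
# using the figures quoted above.

cxl_x16 = 64           # one x16 CXL link, per direction
gaming_gpu = 1000      # ~1 TB/s of GDDR bandwidth
datacenter_gpu = 3200  # ~3.2 TB/s of HBM bandwidth

print(f"gaming GPU vs CXL:     {gaming_gpu / cxl_x16:.0f}x")      # ~16x
print(f"datacenter GPU vs CXL: {datacenter_gpu / cxl_x16:.0f}x")  # 50x
```

Even with the latency perfectly hidden, only a small slice of the working set could live behind that link before throughput collapses.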

According to the company, the Gazzillion technology was specifically designed for Recommendation Systems that are a key part of Data Centre Machine Learning.
That's called marketing.
; )
 
Canonical was smart developing Ubuntu for RISC-V.
Canonical is a parasite. They mostly just ride the coattails of Debian and don't contribute nearly as many upstream patches or projects as Red Hat or SUSE.

Besides kernel contributions, most of the userspace Linux ecosystem (GNOME, systemd, PipeWire, NetworkManager, etc.) is being developed by Red Hat. I don't love everything they're doing, but there are too few others.

IIRC, the only notable thing Canonical did recently was their own container packaging format (Snap), which is incompatible with other efforts (Flatpak, AppImage).
 
So, two questions I have:
  1. Is this specific to the core or the memory controller?

I guess the core, or maybe even both! OK, let me go through that other link you gave as well.

Assuming Gazzillion technology can handle the highly sparse data, long latencies, and high-bandwidth CXL memory systems typical of current machine-learning applications, then it presumably adds tiny buffers in key places to ensure that cache lines keep flowing and the core runs at full speed rather than waiting for data.

Btw, Atrevido supports a 64-bit native data path and 48-bit physical addresses.
That's called marketing.
; )

Of course! Or more like a promotion. 😃