News DOE Enters Partnership to Use World's Largest Chips With 1.2 Trillion Transistors on 46,225 Square Millimeters of Silicon

bit_user

Polypheme
Ambassador
The DOE's buy-in on the project is incredibly important for Cerebras, as it signifies that the chips are ready for actual use in production systems.
Um, I think the DoE invests in a lot of experimental tech. I wouldn't assume it necessarily means the tech is yet ready for end users.

Also, as we've seen time and again, trends in the supercomputing space often filter down to more mainstream usages, meaning further development could find Cerebras' WSE in more typical server implementations in the future.
Also, we've seen plenty of supercomputing tech that didn't filter down, like clustering, Infiniband, silicon-germanium semiconductors, and other stuff that I honestly don't know much about, because it hasn't filtered down. In fact, the story of the past few decades has been largely about the way that so much tech has filtered up from desktop PCs into HPC.

That's not to say nothing filtered down - it's gone both ways. But the supercomputing industry used to be built exclusively from exotic, custom tech and has been transformed by the use of PCs, GPUs, and a lot of other commodity technology (SSDs, PCIe, etc.). Interestingly, it seems to be headed back in the direction of specialization, as it reaches scales and levels of workload-customization (such as AI) that make no sense for desktop PCs. I'd say this accelerator is a good example of that trend.

In particular, the problem with wafer-scale is that it will always be extremely expensive, because silicon costs a certain amount per unit of area. The better your fault-tolerance, the less sensitive you are to yield, but die area still costs a lot of money, as does the exotic packaging.

Cerebras tells us that it can simply use multiple chips in tandem to tackle larger workloads because, unlike GPUs, which simply mirror the memory across units (data parallel) when used in pairs (think SLI), the WSE runs in model parallel mode, which means it can utilize twice the memory capacity when deployed in pairs, thus scaling linearly.
This is silly. Of course you can scale models on GPUs in exactly the same way they're talking about.
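To be concrete about what "model parallel" means here, a minimal sketch (my own illustration, not anything from the article or Cerebras): split a network across two ordinary GPUs so each device holds only its share of the parameters, and only the activations cross the link. Total usable memory then scales with the number of devices, which is the property being claimed as unique to the WSE. Device names and layer sizes below are arbitrary; assumes PyTorch and two CUDA devices.

```python
# Minimal sketch (illustrative only): model parallelism on two ordinary GPUs.
# Each device holds only its own slice of the parameters, so total usable
# memory grows with the number of devices - the same scaling property the
# quoted passage attributes to paired WSEs.
import torch
import torch.nn as nn

class TwoGpuMLP(nn.Module):
    def __init__(self):
        super().__init__()
        # First half of the network lives on GPU 0, second half on GPU 1.
        self.stage0 = nn.Sequential(nn.Linear(4096, 8192), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Sequential(nn.Linear(8192, 4096), nn.ReLU()).to("cuda:1")

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))
        # Only the activations cross the interconnect, not the weights.
        x = self.stage1(x.to("cuda:1"))
        return x

model = TwoGpuMLP()
out = model(torch.randn(64, 4096))
print(out.shape, out.device)  # torch.Size([64, 4096]) cuda:1
```

Data parallelism, by contrast, is the configuration where every GPU holds a full copy of the weights - and that's the only configuration the quoted comparison actually holds against.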

Cool tech - and fun to read about, no doubt - but this is exactly the sort of exotic tech that will remain the preserve of extreme high-end, high-budget computing installations.
 
Last edited:

bit_user

Polypheme
Ambassador
Now imagine a wafer-scale GPU!
Apart from the cost issues I mentioned above, graphics has different data access patterns than AI. That's a big part of their pitch.

Graphics needs fast random-access and is somewhat difficult to partition (unless you simply replicate the data, which makes the architecture less efficient in terms of power, performance, and cost).
 

InvalidError

Titan
Moderator
Apart from the cost issues I mentioned above, graphics has different data access patterns than AI. That's a big part of their pitch.
It is only "part of the pitch" because Cerebras uses SRAM, which works particularly well for AI because each node in the neural network has a finite amount of data to keep track of. For a wafer-scale GPU, which requires far more data accessibility and bandwidth multiplication, it would make sense to stack it with wafer-scale HBM.

Graphics needs fast random-access and is somewhat difficult to partition (unless you simply replicate the data, which makes the architecture less efficient in terms of power, performance, and cost).
Having the option of duplicating data between channels is exactly why GPUs have multiple channels - nothing new there: you sacrifice spatial efficiency for increased bandwidth and concurrency via duplication. It is also part of the reason why a given game uses an increasingly large amount of VRAM the more VRAM you have: more free space to duplicate stuff in, so you may as well use it to help balance load across memory channels and reduce average memory controller queue depth.
 

bit_user

Polypheme
Ambassador
For a wafer-scale GPU, which requires far more data accessibility and bandwidth multiplication, it would make sense to stack it with wafer-scale HBM.
Whether it's SRAM or HBM, you're still talking about a distributed-memory GPU. If we've seen an example of that, especially with a mesh interconnect, I must've missed it. Feel free to enlighten me.

The reason GPUs look the way they do (i.e. having cache hierarchies and big crossbars or otherwise massive internal buses) is that global memory is accessed pretty randomly.

Having the option of duplicating data between channels is exactly why GPUs have multiple channels - nothing new there: you sacrifice spatial efficiency for increased bandwidth and concurrency via duplication. It is also part of the reason why a given game uses an increasingly large amount of VRAM the more VRAM you have: more free space to duplicate stuff in, so you may as well use it to help balance load across memory channels and reduce average memory controller queue depth.
That's an interesting theory. I've not encountered any support for that, in OpenGL, but I'm not familiar with Direct3D or Vulkan. So, if you have some good evidence of this, I'd be genuinely curious to see it.

Not that I can't believe it, but I've never actually heard of that practice. Furthermore, GPU memory topologies are something I find rather intriguing, because they're critical to efficiency and scalability. So, I've paid some attention to what has been disclosed about different GPUs - and it's not been much. That's not to say big game developers don't get a lot more info under NDA, but it's definitely not a detail that the GPU designers are publicizing in a way that would be required for most software to exploit.

AMD is pretty open about the details of their GPUs, not least because their Linux driver stack is almost entirely open source. Here's the RDNA architecture whitepaper - the most they say about the GDDR6 memory topology is that the memory controllers each have their own L2 slices and are 64-bit. The L2 cache lines are 128 bytes, but it's not clear whether or how the GDDR6 banks are interleaved - knowledge that would be critical for duplicating & load-balancing resources as you suggest.

https://www.amd.com/system/files/documents/rdna-whitepaper.pdf

Moreover, here's a D3D (DX12) memory allocator library they made, and nowhere do I see anything about replication or duplication:

https://github.com/GPUOpen-LibrariesAndSDKs/D3D12MemoryAllocator/blob/master/src/D3D12MemAlloc.h

Here's a whitepaper which discusses RDNA performance optimizations in depth, with quite a bit of time spent on memory. Except, it's all focused on LDS (Local Data Share - on chip memory local to each Workgroup Processor) and caches.

https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Architecture_public.pdf

Also, note that they do make mention of a "Scalable Data Fabric", in both GCN (Vega) and RDNA (RX 5700). That contradicts the idea of distributed memory, as then you'd more likely be talking about a scalable compute fabric, with somewhat localized memory.
 
Last edited:

InvalidError

Titan
Moderator
That's an interesting theory. I've not encountered any support for that, in OpenGL, but I'm not familiar with Direct3D or Vulkan.
What is an interesting theory? Memory (read) bandwidth amplification? I hope not, since there are countless examples of that, starting with the venerable book (dead-tree edition): the more copies you have, the more people can read it concurrently, regardless of distance from the original source.

You won't find support for this in any API; it is a driver optimization. Drivers know how much free VRAM there is and where it is, since they are in charge of managing VRAM and scheduling shaders. When a driver detects that a memory channel consistently has a longer queue than the others, it can duplicate some high-read-traffic assets into spare VRAM on whichever channels have the shortest queues, and point some proportion of shaders at those duplicates to alleviate the bottleneck. (Only reads can easily be amplified, which is fine, since assets are generally static and drivers can manage coherency whenever an API call wants to modify them.) If software requests more VRAM, the driver can pick which duplicates to scrap to most effectively service the new allocation request and the existing workloads.

There is no reason for the API to be aware of how drivers manage "unused VRAM", especially when you consider that drivers have to manage VRAM on a system-wide basis. I'd expect letting applications force copies would range from sub-optimal to highly detrimental in most cases, since forced copies would leave the GPU with less malleable VRAM to work with and increase its reliance on system RAM (much worse than sub-optimal memory queue depths) for overflows.

In an age where AMD, Nvidia and Intel are leapfrogging each other just about every meaningful generation with fancier texture compression techniques to get more mileage out of available memory bandwidth, using spare VRAM for bandwidth amplification is a wild strawberry - an ankle-high-hanging fruit.

This is similar to Windows using spare RAM for file system caching: software is generally unaware that the file system cache exists and the file system cache still counts as "free" as far as usual free memory reporting is concerned. The memory is being used but there is no reason for you to know about it since it can be freed whenever needed.
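Just to make the idea concrete - and purely as an illustration of the load-balancing concept being described, not as evidence that any real driver does this - here's a toy Python simulation. The asset names, traffic skew, and channel count are all made up; the only point is that duplicating a couple of hot, read-only assets across channels lets each read be steered to whichever copy sits on the least-loaded channel, which flattens the worst-case queue depth.

```python
# Toy simulation of the idea described above (illustrative only; whether real
# drivers actually do this is exactly what's being debated in this thread).
# Model: N memory channels, each asset pinned to one channel unless duplicated.
import random
from collections import Counter

random.seed(0)
CHANNELS = 8
ASSETS = {f"asset{i}": [i % CHANNELS] for i in range(32)}   # home channel per asset
# Skewed read traffic: a few assets are much hotter than the rest.
reads = [f"asset{random.choice([0, 0, 0, 1, 1] + list(range(32)))}"
         for _ in range(100_000)]

def max_queue_depth(placement):
    depths = Counter()
    for asset in reads:
        # Each read is serviced by whichever copy sits on the least-loaded channel.
        ch = min(placement[asset], key=lambda c: depths[c])
        depths[ch] += 1
    return max(depths.values())

print("no duplication       :", max_queue_depth(ASSETS))

# Duplicate the two hottest assets onto every channel, as the post describes.
dup = {k: (list(range(CHANNELS)) if k in ("asset0", "asset1") else v)
       for k, v in ASSETS.items()}
print("hot assets duplicated:", max_queue_depth(dup))
```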
 

bit_user

Polypheme
Ambassador
What is an interesting theory?
I don't like to repeat myself, but I'll do so with emphasis: "if you have some good evidence of this, I'd be genuinely curious to see it."

There are GPU performance analyzer tools that I'd imagine should be able to reveal this, if it's actually happening. So no "driver conspiracy theories" or flights of fancy, please.

I sometimes run into a D3D driver engineer for one of the Big Three GPU makers. I'll ask him about it and see if he'll tell me anything.
 

bit_user

Polypheme
Ambassador
He should be able to at least tell you that user-land texture pixels are often represented in VRAM more than once for performance enhancement purposes.
Okay, so I take it you've got nothing more than pure speculation. Thanks for being so forthright; such an exemplar.

The main reason I participate in these forums is to exchange knowledge: I learn things, and I try to help educate others by sharing what I learn. Sometimes things get petty, but that's my aspiration. Perhaps that sheds some light on why I'm so bothered when people dissemble, spread misinformation, or misrepresent speculation as established fact - it directly undermines both of those goals. I hope you can appreciate that perspective.

Anyway, I don't see him regularly, don't know him terribly well, and I don't know how much he'll tell me... but I'll ask.
 
Last edited:

InvalidError

Titan
Moderator
Okay, so I take it you've got nothing more than pure speculation.
It is the only way that a wafer-scale GPU could work.

It is also fundamentally what SLI/CF do: copy assets to all GPUs then split the workload between them.

A wafer-scale GPU could do the same thing, just far more efficiently thanks to having much faster links and tighter integration between GPUs.
 

bit_user

Polypheme
Ambassador
It is the only way that a wafer-scale GPU could work.
Sure, I'll give you that. I would question how efficiently conventional GPU workloads would scale to so many chips, but it would probably be a great architecture for something like global illumination ray tracing.

It will be very interesting to see if the next generation of GPUs finally embrace multi-chip, and what approaches they use.

Anyway, speculating isn't bad and just because something is speculation doesn't necessarily make it wrong. It should just be characterized as such. You're obviously very knowledgeable and intelligent, and I understand it can be tempting to try to win an argument because you think you can. All I'm asking is just to be clear about the known facts vs. educated guesses.

For my part, I'll update this thread if Mr. D3D Driver Guy has anything useful to say. I'll probably see him in 3 weeks, but no guarantees.
 

bit_user

Polypheme
Ambassador
Getting things back on track, I found some more notable details (including power estimates!), here:


This truly is an amazing development. Every superlative I can muster barely does it justice. For AI, it could be every bit as revolutionary as GPUs were.

The biggest problem I see is that they can't scale it down to make a lower-cost, lower-power version. Because memory is distributed along with the compute, a smaller version (e.g. 1/4th the size) would only be able to handle models a quarter as big. Maybe a future version can address this by stacking some DRAM and paging in/out of SRAM, but that starts to look more like an array of GPUs, bringing along with it the very drawbacks Cerebras levels at GPUs.
 

InvalidError

Titan
Moderator
Sure, I'll give you that. I would question how efficiently conventional GPU workloads would scale to so many chips, but it would probably be a great architecture for something like global illumination ray tracing.
The biggest problem with multi-GPU scaling is the bandwidth needed to stitch the cores together as one; 32GB/s of high-latency PCIe 4.0 is nowhere near good enough. A wafer-scale GPU, on the other hand, could easily have a 100+GB/s low-latency interconnect to each of its eight immediate neighbors (1.6TB/s aggregate), which should work quite well with tiled rendering.
 

bit_user

Polypheme
Ambassador
The biggest problem with multi-GPU scaling is the bandwidth needed to stitch the cores together as one; 32GB/s of high-latency PCIe 4.0 is nowhere near good enough. A wafer-scale GPU, on the other hand, could easily have a 100+GB/s low-latency interconnect to each of its eight immediate neighbors (1.6TB/s aggregate), which should work quite well with tiled rendering.
A lot hinges on that word - efficiency. It burns potentially a lot of power to replicate all data, everywhere. Tiled rendering also has certain overheads that translate into algorithmic inefficiencies, depending on how you distribute the work - it's ideally implemented using a multi-pass approach, but that creates additional communication and synchronization, which is not great for scalability (or, at least, for scaling efficiently). From a data coherency perspective, it's definitely a win, but Vega only saw something like a 10% performance jump from enabling it. And how much do you really care about data coherency if you're copying all the data to be local, everywhere? If you're trying both to balance load and to minimize synchronization, there's no magic bullet.

In a modern renderer, there's actually a lot more synchronization than you'd expect. Over the years, increasing numbers and varieties of synchronization mechanisms have crept into GPUs. And, speaking of multi-pass, I don't honestly know how many passes are in modern renderers, but you can bet it's more than 1.
 

InvalidError

Titan
Moderator
It burns potentially a lot of power to replicate all data, everywhere.
How often do textures change in a typical game or 3D application? At level/zone loading time, and that's generally about it, so mass data propagation is a relatively rare event. There is also no memory coherency needed on fundamentally read-only stuff; that's something you only need to worry about on writes, and only when it cannot be software-managed or made intrinsic in some other way. That's why the GART layer allows drivers to skip hardware-level cache coherency overhead in favor of managing it themselves. Software-managed cache coherency is the preferred method on larger systems, where snooping bandwidth would otherwise overwhelm interconnects.
 

bit_user

Polypheme
Ambassador
How often do textures change in a typical game or 3D application?
I think you're a couple of decades out of date on 3D renderer design and construction.

Clearly, you're thinking of games like these: http://www.vintage3d.org/games.php#sthash.h9x8cOMb.dpbs

not games like these: https://www.tomshardware.com/reviews/nvidia-geforce-rtx-2080-super-turing-ray-tracing,6243-2.html

If you're genuinely interested in learning more about modern renderers, here's a book I'd recommend:



I sort of know one of the authors and the low-quality printing issue (the reason for most of the poor reviews) has been resolved.
 
Last edited:

bit_user

Polypheme
Ambassador
He should be able to at least tell you that user-land texture pixels are often represented in VRAM more than once for performance enhancement purposes.
I asked, and you're not going to like his answer.

He said that, as far as he knows, GPUs always interleave their memory channels, at a rather fine granularity. He said the way to get the fastest memory I/O is just to do a linear walk through the address space. He said you can verify this, yourself, with a compute shader program (as I thought, but I didn't have a chance to try).

He wasn't specific about the granularity, except I think he said something about a 256-byte read being enough to touch all memory controllers. That would suggest maybe interleaving at L2 cacheline granularity.

I then specifically asked about assets being duplicated in GPU memory - either by games or at the driver level - and he said the only thing he's ever heard of like that is where some console games will have multiple copies of certain assets on disk, to mitigate HDD seek/access time.

My take on it is that interleaving is a good solution from a software perspective. It gives both optimal performance for a single shader doing linear accesses and effective load-balancing of I/O across many shaders. It's only from a hardware perspective that it's bad, since it requires all shaders to have reasonably fast access to all memory controllers.
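For anyone who wants to try the experiment he described, here's a rough sketch - not a hand-written compute shader, but a quick PyTorch approximation I put together (the buffer size and stride are arbitrary choices of mine): compare the effective read bandwidth of a linear walk over a large buffer against a strided walk. On a GPU that interleaves its channels at fine granularity, the contiguous pass should come out well ahead.

```python
# Rough sketch of the experiment described above (my approximation, not his code):
# compare effective read bandwidth for a linear walk vs. a strided walk over a
# large GPU buffer. Buffer size and stride are arbitrary; assumes PyTorch + CUDA.
import torch

N = 256 * 1024 * 1024                   # 256M float32 elements = 1 GiB
x = torch.rand(N, device="cuda")
x.sum(); torch.cuda.synchronize()       # warm-up

def timed_sum(t):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    s = t.sum()                         # forces a read of every element in the view
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)      # milliseconds

ms_linear = timed_sum(x)                # contiguous: one linear walk
ms_strided = timed_sum(x[::64])         # strided view: one float out of every 256 bytes

gb_linear = x.numel() * 4 / 1e9
gb_strided = gb_linear / 64
print(f"linear : {gb_linear / (ms_linear / 1e3):.1f} GB/s")
# The strided pass reads 1/64 as many elements, but each one still drags in a
# whole DRAM burst, so its effective bandwidth should come out far lower.
print(f"strided: {gb_strided / (ms_strided / 1e3):.1f} GB/s (effective)")
```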
 
Last edited:

bit_user

Polypheme
Ambassador
Granularity likely goes along with whatever tile size textures get broken down to, so one piece of texture can be pulled in one read burst.
That's a function of the texture compression format, of which there are many. But, it definitely makes sense to try to align these.

I don't know if it's a coincidence, but the Polaris whitepaper says:
The block size is dynamically chosen based on access patterns and the data patterns to maximize the benefits. The peak compression ratio is 8:1 for a 256-byte block.
(emphasis added)
 
Last edited:

waynes

Honorable
Mar 4, 2013
29
0
10,530
This site is pretty <Mod Edit> now. My phone has been running super hot since I got here, and people I don't know have been contacting me about missed calls from me - calls I never made - while I was on the site. The ads eat up the side space and the page jumps as they load while scrolling down, and the options and news here are limited. I miss the days when I could get email from the original site owner.
 
Last edited by a moderator:

waynes

Honorable
Mar 4, 2013
29
0
10,530

Now, about these sea-of-processors designs: you could do an article. There were the IntellaSys SEAforth MISC processors, and those people moved on to GreenArrays chips. They were looking at doing something in the server field early last decade.
 
Last edited by a moderator: