News Dual NVMe Card With Eight E1.S SSDs Achieves 55 GB/s Performance

Very true. But most companies are not going to want to support these situations anyway. One client is developing not just a new PCIe card, but ASICs to go on it. If you know a server maker where that does not void the warranty, I would love to know. Eventually, of course, it's going off to UL before it is marked. But design (and possibly magic smoke escaping, much as no one wants it) comes first.
Wait, you're retired Air Force, right? So you probably already know that you cannot have a manufacturer come in, and certainly can't send whole systems out, when classified data is involved. I know that you all handle repairs in-house for classified systems, because unlike everyone except the US government, you didn't even send bad hard drives back when I worked for one of the major US computer manufacturers.
 
Very true. But most companies are not going to want to support these situations anyway. One client is developing not just a new PCIe card, but ASICs to go on it. If you know a server maker where that does not void the warranty, I would love to know. Eventually, of course, it's going off to UL before it is marked. But design (and possibly magic smoke escaping, much as no one wants it) comes first.
MS/Amazon/Google/HP/Dell would be more than willing to sit with you and consult.
 
Wait, you're retired Air Force, right? So you probably already know that you cannot have a manufacturer come in, and certainly can't send whole systems out, when classified data is involved. I know that you all handle repairs in-house for classified systems, because unlike everyone except the US government, you didn't even send bad hard drives back when I worked for one of the major US computer manufacturers.
Just last week, we had a company rep come in and replace a bunch of problematic Cooler Master power supplies.
In a secure area, escorted by me personally.

Eight systems; about three hours, start to finish.
 
If you're looking for specific recommendations, then I'd suggest opening a new thread in the appropriate forum, which I think would be one of:

When you post, do specify a budget and any other constraints you're working under. I think it's okay if you drop a link here, so we can follow you to the new thread.
Thank you very much for the link suggestions. I am going to see what I can find.
 
One of the reasons a number of the AI models out there are so resource-intensive is that some of them are (more or less completely, in some cases) written in Python. Since it's a scripting language, it's easier to make changes without waiting for huge files to recompile. But it is also slow to run.
No, I don't think so. If you write a custom layer in Python, then you could have CPU bottlenecks. However, I think that's not very common, and it's not as if you have to re-code the thing every time you train your network. I don't even know how common it is for people to make their own custom layers, as you'd think most of the useful layer types would already have optimized implementations in PyTorch and other popular frameworks.
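For what it's worth, here's a minimal sketch of what a "custom layer" usually looks like in practice (the module and dimensions below are purely illustrative): the Python class just wires together built-in operations, so all of the per-element math still runs in PyTorch's optimized C++/CUDA kernels rather than in the interpreter.

import torch
import torch.nn as nn

class GatedBlock(nn.Module):
    # A "custom" layer composed entirely of built-in ops: the Python code runs
    # once per forward pass, while the tensor math runs in native kernels.
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x + self.proj(x) * torch.sigmoid(self.gate(x)))

x = torch.randn(32, 256)   # a batch of 32 vectors with 256 features
y = GatedBlock(256)(x)
print(y.shape)             # torch.Size([32, 256])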

To do just this project effectively, you need to be able to quickly load different models and data sets and work on them. We are talking about a lot of data. The better implementations of LLaMA (the permissively licensed, though not exactly open source, large language model from Meta) and the others based on it use around 65-70 billion tokens, which tends to work out to somewhere around 70 to 80 gigabytes, for each different training data set.
But that's for training with thousands of GPUs costing $30k each. Once you consider the cost of the GPUs involved, and the communication infrastructure needed by them, the storage infrastructure is pretty much a minor detail. That's why it's Nvidia making bank off the AI boom, not so much Dell/EMC and the other storage experts.

Since I'm trying to redesign things so that training can be performed on an office network with centrally stored data, speed is important. Especially since, if I want to finish in a few years, I need to be able to test multiple designs, each on multiple data sets, at the same time.
You're trying to optimize the wrong problem.

A client is trying to use iterative design to produce a new aircraft design for DARPA. Obviously I am not going into detail. But it is a vast amount of modeling. And they need to run it in house for security reasons. Small company, it will be a breakthrough for them if they can make something compelling enough that they get the contract.

Another wants to simulate chemical plant designs for a complicated multi-step process and would prefer to do this in-house if they can.
It's nice if you can do it on the cheap, but whatever you do should have server-grade reliability. So, I wouldn't even contemplate hardware without ECC memory, or storage solutions with a potential "write hole". I guess that's the nice thing about buying a solution from a third-party vendor who specializes in scalable storage with mission-critical reliability. You pay more, but at least you know you're deploying something that's been properly engineered end to end.
 
MS/Amazon/Google/HP/Dell would be more than willing to sit with you and consult.
Perhaps. I don't know if Amazon or Google are the best options when I don't want to use their hosting, but maybe I am not aware of the possibilities. HP has some options too, and Dell certainly has some very nice server hardware. I did speak with an IBM rep, but at the time they were aggressively trying to grow their cloud business and most of the legacy side didn't seem like a good fit, so I should probably ask more. I do like some of the server and workstation options from Asus, and they offer a PCIe version of the Coral tensor processor as well. So far they seem like the best source I have found for prebuilt systems. And for things that don't require security or warranty-invalidating hardware additions, there's a pretty good array of both prebuilt and prevalidated build-it-yourself options. I like the way ASRock publishes easy-to-access lists of the CPU, RAM, and storage options tested with their motherboards.
I don't want to be different just to be different, or build from scratch where it doesn't make sense, so I appreciate the feedback.
 
A NAS with 400 Gbps Ethernet? For what? Serving all the video ads to the entire US West Coast?
:D

The bandwidth numbers are just impossibly high, for it to make sense as networked storage. Whatever people do with these, it's not that.
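Rough arithmetic, for scale, using the 55 GB/s figure from the headline: 55 GB/s × 8 bits per byte = 440 Gbps, so even a single 400 Gbps link couldn't quite carry the card's full local throughput over a network, before any protocol overhead.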


I know that's the standard answer for what people do with workstations, but I'm still having trouble wrapping my mind around what exactly any of those involves that requires quite so much raw bandwidth.

The only thing I can come up with is that you're using this for analyzing some scientific or GIS-type data set that won't fit in RAM.
Actually, there is a llama implementation that can use SSDs as a swap file; I think it's either Kobold or Oobabooga. I suppose that could be a less-bad option, but at this price, for that purpose, even for a cheapskate like me it seems to make more sense to get 128+ GB of DDR4 (or preferably DDR5) system RAM if you can't afford the GPU you need.
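If the goal is just to run a model that won't fit in VRAM, here is one hedged sketch of the spill-to-disk idea using Hugging Face transformers/accelerate, rather than any specific Kobold or Oobabooga feature; the model name and offload folder are placeholders.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"   # placeholder; any causal LM you have access to

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",          # spread layers across GPU and CPU RAM...
    offload_folder="offload",   # ...and spill whatever doesn't fit onto the SSD
)

prompt = "Storage bandwidth matters because"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))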
 
Wait, you're retired Air Force, right? So you probably already know that you cannot have a manufacturer come in, and certainly can't send whole systems out, when classified data is involved. I know that you all handle repairs in-house for classified systems, because unlike everyone except the US government, you didn't even send bad hard drives back when I worked for one of the major US computer manufacturers.
Very different from my experience, but I agree that it makes more sense to do it this way for a situation like you described. I probably need to get up to date on current rules and best practices. Thank you.
 
No, I don't think so. If you write a custom layer in Python, then you could have CPU bottlenecks. However, I think that's not very common, and it's not as if you have to re-code the thing every time you train your network. I don't even know how common it is for people to make their own custom layers, as you'd think most of the useful layer types would already have optimized implementations in PyTorch and other popular frameworks.


But that's for training with thousands of GPUs costing $30k each. Once you consider the cost of the GPUs involved, and the communication infrastructure needed by them, the storage infrastructure is pretty much a minor detail. That's why it's Nvidia making bank off the AI boom, not so much Dell/EMC and the other storage experts.


You're trying to optimize the wrong problem.


It's nice if you can do it on the cheap, but whatever you do should have server-grade reliability. So, I wouldn't even contemplate hardware without ECC memory, or storage solutions with a potential "write hole". I guess that's the nice thing about buying a solution from a third-party vendor who specializes in scalable storage with mission-critical reliability. You pay more, but at least you know you're deploying something that's been properly engineered end to end.
So I could easily be wrong. But have you looked at this analysis:

 
So I could easily be wrong.
Exactly how do you think Nvidia is raking in billions on H100 sales, if AI is so heavily CPU-bottlenecked? Wouldn't it be Intel who would be swamped with demand and a 1.5-year manufacturing backlog?

You know that deep learning has quite a lot of the smartest wet brains working on it. It's been a buzzword for probably 6 or 7 years, now. Do you think it didn't occur to anyone that using Python might be slow?

Finally, consider that PyTorch absorbed Caffe2, which was the successor to the Caffe framework - both of which are C++-based. Do you think PyTorch would've achieved such dominance and succeeded in eclipsing Caffe and subsuming the C++ interface of Caffe2 if it were as slow as you seem to think? Not only that, but there are frameworks not based on Python - how did they not dominate the Python-based ones?

These are just some sanity-checks that you should be doing, before you jump to such sweeping conclusions. You don't even need to delve down to the level of graph compilation, LLVM, or runtime code generation to suspect that all may not be as it seems.

But have you looked at this analysis:

When you search for something that seems to confirm your existing conclusion, it's called confirmation bias, which can lead to self-deception.

Instead, what you should do is search for answers to the original question. In fact, by searching for confirmation that Python is slow, you've missed two fundamental points, as well as some tertiary details:
  1. No, deep learning doesn't predominantly rely on Python performance.
  2. Even code which does rely on Python performance isn't necessarily interpreted. There are multiple ways you can compile Python to run as native code (i.e. not just bundling an interpreter, but actually compiling it like C++); see the sketch after this list.
  3. Python performance is a moving target. By looking at data from 2017, you're missing the continual optimizations that have been made to the official Python interpreter. Also, performance aggregates are just that. If you have a more specific workload in mind, then you should look at benchmarks for that workload.
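As an aside, here is a minimal sketch of point 2, using Numba as one example of compiling a numeric Python function to native machine code (the function itself is just an illustration):

import numpy as np
from numba import njit

@njit(cache=True)          # compiled to native code on first call, then cached
def dot(a, b):
    total = 0.0
    for i in range(a.shape[0]):   # this loop is compiled, not interpreted
        total += a[i] * b[i]
    return total

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)
print(dot(a, b))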

What I find funny is that your source isn't even fundamentally addressing the issue of Python performance, nor is it from a reputable source with expertise on such matters. What's worse is that it cites a paper published 6 years ago. You don't seem to weigh the quality of your source nearly as much as the fact that it agrees with your preconceptions. Did you even check the cited paper to look at its methodology, or whether the article might be taking its findings out of context?
 
I am aware of pretty much everything here, and unless you are a statistician, I probably have more experience in the field. I did mention being a biologist. So yes, I am aware of confirmation bias. But at this point I am fairly sure that we are talking past each other. Still, I will re-examine all of the points that you made.
 
My point was more that these models are run on GPUs in the first place because the way they operate makes them constrained primarily by the speed of memory access. Video memory is both fundamentally faster in a serial bit-transfer sense and built with more lanes of access.
I commented that the existing approach is slow. It involves using what are still fairly broadly designed tools like PyTorch, NumPy, etc. It's a great place to start, and popular at least in part because it's easier. But I am not sure why you think I believe the constraint is computational power, since it really isn't, whether we are talking CPU or GPU. It's faster to run a model on a GPU that has built-in tensor processors, of course. But even that is largely because they are designed to need fewer trips in and out of memory for the data. The issue is that I don't know of a way to minimize this problem using Python, but it should be possible in C, because, unless I am missing something, as annoying as straight C is to program in, it is better suited to low-level optimization tricks. Of course, it would be even better if I knew someone who actually wanted to write Assembly code for every possible combination of CPU and GPU. But I don't. :) Frankly, C is enough of a time sink.
But if you know a way to do this sort of thing in Python, and I am just showing my ignorance with this whole conversation, I sincerely would like to know.
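For whatever it's worth, here is a minimal, admittedly contrived sketch of why the framework authors don't treat the Python layer as the bottleneck: the same reduction done element by element in the interpreter versus as a single call into PyTorch's native kernel (exact timings will vary by machine).

import time
import torch

x = torch.randn(10_000_000)

t0 = time.perf_counter()
total_loop = 0.0
for v in x.tolist():        # every element passes through the Python interpreter
    total_loop += v
t1 = time.perf_counter()

total_native = torch.sum(x).item()   # one dispatch; the loop runs in C++
t2 = time.perf_counter()

print(f"python loop: {t1 - t0:.3f} s  (sum = {total_loop:.1f})")
print(f"torch.sum:   {t2 - t1:.5f} s  (sum = {total_native:.1f})")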
 
It's faster to run a model on a GPU that has built-in tensor processors, of course. But even that is largely because they are designed to need fewer trips in and out of memory for the data. The issue is that I don't know of a way to minimize this problem using Python, but it should be possible in C, because, unless I am missing something, as annoying as straight C is to program in, it is better suited to low-level optimization tricks.
You're still making bad assumptions. I guess there's no way to tell you what you need to know, and you're just going to have to find it for yourself.

it would be even better if I knew someone who actually wanted to write Assembly code for every possible combination of CPU and GPU.
That you seem to believe there aren't already libraries optimized to the hilt, for each CPU and GPU, is what I just can't get over. Like, no one ever thought of that!

BTW, you don't need to optimize for every pairwise combination - another assumption you're making without actually knowing how they interact. The CPU and GPU are sufficiently decoupled that it wouldn't buy you anything.
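A quick way to see some of this for yourself, if you have PyTorch installed (the output will differ by build): the framework reports which vendor-tuned backends it was compiled against and dispatches to at runtime.

import torch

print(torch.__config__.show())                 # lists BLAS/oneDNN, cuDNN, etc. for this build
print("cuDNN available:", torch.backends.cudnn.is_available())
print("MKL available:", torch.backends.mkl.is_available())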

But if you know a way to do this sort of thing in Python, and I am just showing my ignorance with this whole conversation, I sincerely would like to know.
I'm done trying to feed you answers. If you want to know how PyTorch works, you should look for a good explanation and not make simplistic assumptions.
 