[SOLVED] Are there any dual socket AMD EPYC Rome motherboards planned with 12 16x PCIe slots?

visionscaper

I'm considering building a new multi-GPU system to train deep learning models. I'm hoping to be able to use the rumoured Nvidia RTX 3090 GPUs, so I want to choose a motherboard and CPU that also support PCIe 4.0.

In this context, for a system with 6 GPUs, I could use an AMD EPYC Rome CPU and an ASRock ROMED8-2T motherboard. The EPYC Rome supports 128 PCIe lanes and the ROMED8-2T has 7 16x PCIe slots, or 6 if you use one slot for NVMe drives. To deal with the rumoured 3-slot width of the 3090 GPUs, I was thinking of using this Hydra VIII Modular 6.5U case for up to 10 GPUs.

If I wanted to build a system with more than 6 GPUs while retaining 16x lanes for each GPU, I was hoping to find a dual-socket motherboard for AMD EPYC Rome, effectively exposing 2 x 128 = 256 PCIe lanes, such that I might be able to scale up to 12 GPUs (or more). Unfortunately, I was not able to find such a motherboard.
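To make the lane budget concrete, here is a quick back-of-the-envelope check (just a sketch: it only counts GPU lanes, ignores lanes needed for NVMe drives and networking, and assumes every lane of each socket is available for slots):

```python
# Lane budget sketch: counts GPU lanes only, ignoring NVMe, NIC and chipset lanes.
LANES_PER_GPU = 16

def sockets_needed(num_gpus, lanes_per_socket=128):
    """How many EPYC Rome sockets are needed to give every GPU a full 16x link."""
    total_lanes = num_gpus * LANES_PER_GPU
    return -(-total_lanes // lanes_per_socket)  # ceiling division

for gpus in (6, 8, 12):
    print(f"{gpus:>2} GPUs -> {gpus * LANES_PER_GPU:>3} lanes -> {sockets_needed(gpus)} socket(s)")
# 6 GPUs -> 96 lanes -> 1 socket, 8 -> 128 -> 1, 12 -> 192 -> 2
```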

So, my question is: does such a dual-socket motherboard for AMD EPYC Rome, with 12 16x PCIe slots, exist? Or is one planned by some manufacturer?
What are the other options, if there are any?

Interested to hear your thoughts!
 
Solution
The problem with the number of PCIe slots you want is that the size of the board would be impractical. I think you would be better served to look at distributed rendering and create a cluster of simpler nodes.
Hi @kanewolf,

I think this is indeed the issue. The problem then becomes finding an interconnect solution that is fast enough (at a reasonable price), e.g. some sort of InfiniBand connection.

Do you have any experience with fast interconnects?
 
25GE has become a widespread technology. I would look there. InfiniBand (IB) is much more specialized, and it is more difficult to get help from others with it.
You could even start with a NIC that supports 10/25 GE -- https://www.intel.com/content/www/u...-25-40-gigabit-adapters/xxv710-da2-25gbe.html Start with 10GE and move to 25GE if needed.
 
Hmm, interesting, would the bandwidth not be too low between the nodes? I don't have experience with multi-node training. Thinking about it, likely only the final weight updates need to be transferred back to a primary node; after that, the updated weights need to be synced back to the secondary nodes. So maybe bandwidth on par with PCIe is not required between the nodes... this needs more thought/research.
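To get a rough feel for the numbers, here is a back-of-the-envelope sketch of how long one naive, full parameter transfer would take over different links (my own assumptions: FP32 values, a hypothetical 100M-parameter model, one full copy per sync, no protocol overhead or all-reduce cleverness):

```python
def sync_seconds(num_params, link_gbit_per_s, bytes_per_param=4):
    """Time to push one full copy of the model's parameters over the link."""
    payload_bytes = num_params * bytes_per_param      # FP32 weights/gradients
    link_bytes_per_s = link_gbit_per_s * 1e9 / 8
    return payload_bytes / link_bytes_per_s

params = 100e6  # hypothetical 100M-parameter model
for gbit in (10, 25, 100):
    print(f"{gbit:>3} Gbit/s link: ~{sync_seconds(params, gbit):.2f} s per full sync")
# 10 Gbit/s: ~0.32 s, 25 Gbit/s: ~0.13 s, 100 Gbit/s: ~0.03 s
```

So even 10GE keeps a naive sync well under a second at that model size; whether that is acceptable depends on how long each training step takes and how often the nodes sync.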

Thanks for the answer!
 
There are articles about setting up deep learning on clusters -- https://determined.ai/blog/gpu-cluster-p1/
 
Ah, yes, there are a lot of posts about that, but very little about inter-node bandwidth requirements for training deep learning models on clusters. I just found this post by Lambda Labs, "A Gentle Introduction to Multi GPU and Multi Node Distributed Training". In it they also describe how they designed their Lambda Hyperplane V100 server to act as a high-bandwidth node in a cluster setup: with 4 InfiniBand adapters it reaches a total inter-node bandwidth of ~42 GB/s (yes, that is with a big B!).

Apparently one of the advantages of using InfiniBand is that one can do memory transfers without involving the CPU (GPUDirect RDMA), which lowers latency. Unfortunately, you are right that this stuff is more obscure and harder to get help/instructions for.
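As a sanity check on that ~42 GB/s figure, assuming the four adapters are roughly 100 Gbit/s-class links (my assumption, the post doesn't state the exact link speed):

```python
adapters = 4
gbit_per_adapter = 100   # assumed ~100 Gbit/s-class InfiniBand per adapter
raw_gbyte_per_s = adapters * gbit_per_adapter / 8
print(f"Raw aggregate: {raw_gbyte_per_s:.0f} GB/s")   # 50 GB/s raw
# ~42 GB/s effective is in the same ballpark once encoding/protocol overhead is taken off.
```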
 
Ethernet cards can do RDMA too; it is driver dependent. And V100 nodes are different, since they aren't PCIe based.
I don't know how much cluster experience you have, in general. You could spend a little bit and use AWS nodes as a prototype.
You can also do things like having NVMe storage as a buffer on each node. Your cluster manager then loads the "next" data set while the node is working on one. You do some ping-pong buffering. You are then running from local storage and not limited by network bandwidth.
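A minimal sketch of that ping-pong idea, assuming a hypothetical fetch_shard() that copies a data shard from shared storage to local NVMe and a train_on() that runs training on it (both names are placeholders, not a real library API):

```python
import threading

def prefetch_and_train(shard_ids, fetch_shard, train_on):
    """Ping-pong buffering: copy the next shard to local NVMe while the GPUs work on the current one."""
    current = fetch_shard(shard_ids[0])    # the first shard has to be fetched up front
    for next_id in shard_ids[1:]:
        buffer = {}
        worker = threading.Thread(target=lambda i=next_id: buffer.update(shard=fetch_shard(i)))
        worker.start()                     # network -> NVMe copy overlaps with GPU work
        train_on(current)                  # training reads from local storage only
        worker.join()
        current = buffer["shard"]
    train_on(current)                      # last shard: nothing left to prefetch
```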
 
I have experience using single-node multi-GPU setups for training with PyTorch, both on self-built DL rigs and in the cloud. There are AWS multi-GPU instances with different network bandwidth capacities, so I can do some experimentation there.

In my case replicas of the training data would exist on each node, so I would not need to distribute that during training. I think that different communication regimes for training in a cluster (e.g. parameter server vs all-reduce) result in different bandwidth requirements. It might very well be that in some regimes the inter-node bandwidth requirements are less stringent, hopefully allowing me to use more established networking tech, as you mentioned earlier.

Thanks for all your input @kanewolf, it was really useful!
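For what it's worth, here is a minimal sketch of what the all-reduce regime looks like with PyTorch DistributedDataParallel across nodes; the model and training loop are placeholders, and NCCL over the inter-node link handles the gradient all-reduce during backward():

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, WORLD_SIZE and LOCAL_RANK for every process it spawns.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)   # placeholder model
    model = DDP(model, device_ids=[local_rank])            # gradients are all-reduced across all ranks
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):                                    # placeholder training loop
        x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        loss = model(x).sum()
        opt.zero_grad()
        loss.backward()                                    # NCCL all-reduce runs here, over the node interconnect
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Each node would launch this with something like torchrun --nnodes=2 --nproc_per_node=6 --rdzv_endpoint=<head-node-ip>:29500 train_ddp.py (script name and endpoint are placeholders).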