Question Nvidia Tesla T4 compatibility

May 18, 2019
I need to build several deep learning servers for a combined CPU + cuDNN workload. Each server needs to have as much CPU power as possible and at least one Nvidia Turing card, so I have my eyes on dense Epyc 2U4N servers such as the Gigabyte H261-Z60, the Supermicro 2123BT-HNR, or the Cisco UCS C4200:

https://www.gigabyte.com/Hyper-Converged-System/H261-Z60-rev-100#ov
https://www.cisco.com/c/en/us/products/servers-unified-computing/ucs-c4200-series-rack-server-chassis/index.html

Each of these servers has 4 nodes, with each node having 2 low-profile half-length PCI-e x16 slots near the rear. For example, this is what each node on the Gigabyte H261-Z60 looks like:

https://static.gigabyte.com/Product/107/6666/2018061913461032_src.png

My question is: can I put one or two Tesla T4 cards in those rear PCI-e x16 slots? These are low-profile, passively cooled GPGPU cards with a TDP of 70W, which is under the 75W a PCI-e x16 slot can supply:

https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-t4/t4-tensor-core-product-brief.pdf

In theory these cards should work there, as there is enough space, PCI-e lanes and power for them. My main concern is with cooling: there should be reasonable airflow through each card, but the air will already be warm from having passed over two 170W CPUs.
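To put a rough number on that concern, here is a minimal sketch of the standard heat-balance estimate, temperature rise = heat / (mass flow × specific heat). It assumes all CPU heat ends up in the air that later reaches the GPU and uses approximate sea-level air properties. The airflow values are hypothetical placeholders, not measurements of any of these chassis.

```python
# Rough estimate of how much the CPUs pre-heat the air before it reaches the GPU.
# Assumptions (not from the vendors): all CPU heat goes into the airstream,
# sea-level air properties, and hypothetical per-node airflow figures.

AIR_DENSITY_KG_M3 = 1.2          # approximate, air near 20 C at sea level
AIR_SPECIFIC_HEAT_J_KG_K = 1005  # approximate, dry air
CFM_TO_M3_PER_S = 0.000471947    # one cubic foot per minute in m^3/s


def air_temp_rise_kelvin(heat_watts: float, airflow_cfm: float) -> float:
    """Temperature rise of an airstream that absorbs heat_watts."""
    if airflow_cfm <= 0:
        raise ValueError("airflow_cfm must be positive")
    mass_flow_kg_s = airflow_cfm * CFM_TO_M3_PER_S * AIR_DENSITY_KG_M3
    return heat_watts / (mass_flow_kg_s * AIR_SPECIFIC_HEAT_J_KG_K)


if __name__ == "__main__":
    cpu_heat_watts = 2 * 170          # two 170W CPUs per node
    for airflow_cfm in (30, 50, 80):  # hypothetical airflow values
        rise = air_temp_rise_kelvin(cpu_heat_watts, airflow_cfm)
        print(f"{airflow_cfm} CFM: air arrives at the GPU about {rise:.1f} K warmer")
```

The result would have to be added to the datacenter inlet temperature and compared against the inlet limit in the T4 product brief. The real airflow through a node is something only the chassis vendor or a measurement can supply.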

For what it's worth, Gigabyte, Supermicro and Cisco are all careful to avoid any mention of using those slots for GPUs. I emailed Gigabyte and got the following predictable answer:

Dear Customer,

We cannot support using T4 inference card in our Computing designed system.
It is of course applicable but please understand we have not done any thermal testing and cannot provide support if issues arise related with the T4 card inside a H261-Z60.
Test at your own risk.

Best Regards,
Can someone here tell me if this setup will work, albeit unsupported, or if there are real concerns about cooling or something else?

Also, more broadly, I wonder why the Tesla T4 isn't certified for these types of systems. Isn't that the entire point of it? Every use of the Tesla T4 which I've seen was in spaces where bigger graphics cards would have worked just as well, such as:

https://www.anandtech.com/show/13619/scaling-inference-with-nvidias-t4-a-supermicro-solution-with-320-pcie-lanes

What is the purpose of those Tesla T4 cards if not for use in dense servers?
 

Eximo

The expectation is that anyone who wants to use them will contact Nvidia, or have an engineering team sit down and spec it out like you are doing. The vendors won't say it will work because they don't have the test data to back that claim up. Liability is a thing. They are also saying they won't spend the time and money to contact Nvidia for you if you can't make it work in their system.

If you think the airflow and ambient temperature in your (hopefully chilled) datacenter are good enough, I don't see why it wouldn't work. Otherwise I would somewhat agree that those 2U chassis choices aren't the best for a passively cooled GPU.
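If you do test at your own risk, one way to see whether the airflow is good enough is to log GPU temperature and power draw while the real workload runs. The sketch below is only an illustration: it assumes `nvidia-smi` is installed and on the PATH, and the function names and the temperature limit passed to `gpus_over_limit` are hypothetical. Take the actual limit from Nvidia's documentation for the card.

```python
import subprocess

# Fields requested from nvidia-smi, in the order they are parsed below.
QUERY_FIELDS = "index,temperature.gpu,power.draw"


def _to_float(text: str):
    """Return a float, or None when the driver reports a non-numeric value."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_readings(csv_text: str) -> list[dict]:
    """Parse 'index, temperature, power' lines into one dict per GPU."""
    readings = []
    for line in csv_text.strip().splitlines():
        index, temp_c, power_w = (field.strip() for field in line.split(","))
        readings.append(
            {"index": int(index), "temp_c": _to_float(temp_c), "power_w": _to_float(power_w)}
        )
    return readings


def read_gpus() -> list[dict]:
    """Query every GPU once via nvidia-smi (must be installed)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_readings(result.stdout)


def gpus_over_limit(readings: list[dict], limit_c: float) -> list[int]:
    """Indices of GPUs at or above limit_c (limit chosen by the caller)."""
    return [r["index"] for r in readings
            if r["temp_c"] is not None and r["temp_c"] >= limit_c]
```

Calling `read_gpus()` every few seconds during a long training or inference run, and watching whether `gpus_over_limit` ever returns anything, would give a first indication of whether the passive cards are getting enough air in that slot.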

If it were me, I would seriously look at a custom workstation if this is a one-off. Maybe even get some universal GPU blocks and water cool, just to be sure airflow isn't a problem. That said, I can't imagine the Supermicro platform having issues. I might be concerned about power delivery via the PCIe slots for each card. That would be 300W pumping through the board.
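For reference, the slot power arithmetic can be sketched as below. It takes the 75W that a PCIe x16 slot can supply as the worst-case draw per card. 300W corresponds to four such slots, while the layout described in the question (two slots per node, four nodes per chassis) works out to different totals. The function name is just for illustration.

```python
# Worst-case power drawn through PCIe slots, assuming each card pulls the
# full 75W that a PCIe x16 slot can supply and has no external power cable.

PCIE_SLOT_LIMIT_W = 75


def slot_power_watts(cards: int, per_card_watts: float = PCIE_SLOT_LIMIT_W) -> float:
    """Total power delivered through the slots for a given number of cards."""
    if cards < 0:
        raise ValueError("cards must be non-negative")
    return cards * per_card_watts


if __name__ == "__main__":
    per_node = slot_power_watts(2)          # two slots per node
    per_chassis = slot_power_watts(2 * 4)   # four nodes per 2U chassis
    print(f"per node board: {per_node} W, per chassis: {per_chassis} W")
    print(f"four slots on one board: {slot_power_watts(4)} W")
```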

Might be worthwhile to get two larger GPUs that have external power and proper cooling.
 
