[SOLVED] GTX 1080Ti in PCIe x4 (Gen 4) vs. PCIe x8 performance

Apr 27, 2020
Purchased an X570 motherboard (ASUS TUF Gaming X570-Plus (Wi-Fi)) that has a PCIe x16 (Gen 4) top slot and a PCIe x4 (Gen 4) southbridge slot (with physical space for x16). I'm considering adding a second GPU in the future and moving the 1080Ti down to the southbridge PCIe slot. It's currently in the top slot, so no performance issues.

Issue: What would happen if I moved the 1080Ti to the PCIe x4 (4.0) slot? And how would I go about running benchmarks to compare the two slots in Ubuntu 18.04 LTS or Windows?

I will be using the system for deep learning, so gaming benchmarks aren't really applicable, but I would like to know how to measure the difference in data throughput (possible bottlenecking), since a PCIe x4 slot would yield 4 GB/s on the 1080Ti while the spec sheet lists 11 Gbps data throughput. (I've read that moving the 1080Ti to PCIe 3.0 x8 does not yield a performance decrease; my motherboard only has one x16 (4.0) slot and one x4 (4.0) slot.)
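
For testing the transfer speed itself, something like this quick PyTorch script is what I had in mind; a rough sketch, assuming a CUDA-enabled PyTorch install, with arbitrary payload size and iteration count (NVIDIA's CUDA samples also ship a bandwidthTest utility that measures the same thing):

    # Rough sketch: time host-to-device copies to estimate usable PCIe bandwidth.
    # Assumes a CUDA-enabled PyTorch build; payload size and iterations are arbitrary.
    import time
    import torch

    N_BYTES = 256 * 1024 * 1024              # 256 MiB payload
    ITERS = 20

    def h2d_bandwidth(pinned: bool) -> float:
        src = torch.empty(N_BYTES, dtype=torch.uint8, pin_memory=pinned)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(ITERS):
            src.to("cuda", non_blocking=pinned)
        torch.cuda.synchronize()              # wait for every copy to finish
        return N_BYTES * ITERS / (time.perf_counter() - start) / 1e9  # GB/s

    print(f"pageable: {h2d_bandwidth(False):.2f} GB/s")
    print(f"pinned:   {h2d_bandwidth(True):.2f} GB/s")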

Would very much appreciate constructive feedback - thanks in advance!

(edit: possibly a continuation of this)
 
Solution
PCIe 3.0 x8 = PCIe 4.0 x4 in bandwidth. Deep learning should also use less PCIe bandwidth than gaming, so you shouldn't see any difference.

Also, 4.0 x4 is 8 GB/s, not 4. And the 11 Gbps figure is the VRAM data rate (data moving on the card itself), not traffic over PCIe. Plus it's a theoretical number.

You can run any benchmark and compare the performance.
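
To make the arithmetic explicit, a quick sketch (the per-lane figures are theoretical maxima after link-encoding overhead, so real transfers come in lower):

    # Theoretical PCIe bandwidth in GB/s per lane, per generation
    # (gen 3/4 use 128b/130b encoding; gen 1/2 use 8b/10b).
    GB_PER_LANE = {1: 0.25, 2: 0.5, 3: 0.985, 4: 1.969}

    def pcie_bandwidth(gen: int, lanes: int) -> float:
        return GB_PER_LANE[gen] * lanes

    print(pcie_bandwidth(3, 8))   # ~7.9 GB/s
    print(pcie_bandwidth(4, 4))   # ~7.9 GB/s -- same as 3.0 x8, and ~8 GB/s, not 4
    print(pcie_bandwidth(4, 16))  # ~31.5 GB/s -- the top slot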
 

Thank you @k1114 for responding so quickly! How would I run benchmarks? Do you have 1-2 recommendations?
I'm not sure which ones are recommended by people in the know. I understand gaming benchmarks and deep learning benchmarks will differ, since deep learning tasks are measured differently (training time, throughput, other metrics).
 

I've come across this on Medium - running a ResNet-50 benchmark for deep learning.

Thanks again! I can handle the rest on my own over the week!
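
For the record, this is roughly what I'm planning to run; a minimal ResNet-50 throughput sketch with PyTorch/torchvision on synthetic data (batch size 32 and 224x224 inputs are just the usual ImageNet defaults; the host-side pinned batch is copied in every step so the PCIe link is actually exercised):

    # Minimal ResNet-50 training-throughput benchmark on synthetic data.
    # Assumes torch + torchvision with CUDA; reports images per second.
    import time
    import torch
    import torchvision

    model = torchvision.models.resnet50().cuda()
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()

    x_cpu = torch.randn(32, 3, 224, 224).pin_memory()   # host-side fake batch
    y = torch.randint(0, 1000, (32,), device="cuda")    # fake labels

    def step():
        x = x_cpu.to("cuda", non_blocking=True)         # CPU->GPU copy over PCIe
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

    for _ in range(5):                                  # warm-up
        step()
    torch.cuda.synchronize()
    start = time.perf_counter()
    iters = 50
    for _ in range(iters):
        step()
    torch.cuda.synchronize()
    print(f"{iters * 32 / (time.perf_counter() - start):.1f} images/sec")

Run it once with the card in each slot and compare the images/sec.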
 
I don't know anything about deep learning benchmarks but this one has several different tests. http://ai-benchmark.com/alpha.html
Thanks! I looked into it further, and these two sources show that the number of PCIe lanes has <5% impact on performance:
  1. A Full Hardware Guide to Deep Learning
  2. Many care about the number of lanes per PCIE slot. It doesn’t matter at all with a couple of GPUs.
"
CPU and PCI-Express

People go crazy about PCIe lanes! However, the thing is that it has almost no effect on deep learning performance. If you have a single GPU, PCIe lanes are only needed to transfer data from your CPU RAM to your GPU RAM quickly. However, an ImageNet batch of 32 images (32x225x225x3) and 32-bit needs 1.1 milliseconds with 16 lanes, 2.3 milliseconds with 8 lanes, and 4.5 milliseconds with 4 lanes. These are theoretic numbers, and in practice you often see PCIe be twice as slow — but this is still lightning fast! PCIe lanes often have a latency in the nanosecond range and thus latency can be ignored.

Putting this together we have for an ImageNet mini-batch of 32 images and a ResNet-152 the following timing:

  • Forward and backward pass: 216 milliseconds (ms)
  • 16 PCIe lanes CPU->GPU transfer: About 2 ms (1.1 ms theoretical)
  • 8 PCIe lanes CPU->GPU transfer: About 5 ms (2.3 ms)
  • 4 PCIe lanes CPU->GPU transfer: About 9 ms (4.5 ms)

Thus going from 4 to 16 PCIe lanes will give you a performance increase of roughly 3.2%. However, if you use PyTorch’s data loader with pinned memory you gain exactly 0% performance. So do not waste your money on PCIe lanes if you are using a single GPU!

When you select CPU PCIe lanes and motherboard PCIe lanes make sure that you select a combination which supports the desired number of GPUs. If you buy a motherboard that supports 2 GPUs, and you want to have 2 GPUs eventually, make sure that you buy a CPU that supports 2 GPUs, but do not necessarily look at PCIe lanes.

PCIe Lanes and Multi-GPU Parallelism

Are PCIe lanes important if you train networks on multiple GPUs with data parallelism? I have published a paper on this at ICLR2016, and I can tell you if you have 96 GPUs then PCIe lanes are really important. However, if you have 4 or fewer GPUs this does not matter much. If you parallelize across 2-3 GPUs, I would not care at all about PCIe lanes. With 4 GPUs, I would make sure that I can get a support of 8 PCIe lanes per GPU (32 PCIe lanes in total). Since almost nobody runs a system with more than 4 GPUs as a rule of thumb: Do not spend extra money to get more PCIe lanes per GPU — it does not matter!
"