Question Retry errors with Intel X540-T2 10Gbit NICs

Aug 9, 2022
2
0
10
Hi

I have a problem with some network cards that puzzles me.
I run a Proxmox-based virtualisation system at my job.
It consists of 8 hosts, each with two network cards:
a normal on-board 1 Gbit Intel card and an Intel X540-T2 10 Gbit card.
The two ports of the 10 Gbit card are bundled together with LACP and connect to some HP 1950 10 Gbit switches that are joined together with IRF.
The storage for the Proxmox cluster is two big Synology NAS units that also connect to these 10 Gbit HP switches.
A few days ago I noticed that the Proxmox backup speed had gone down on SOME of the 8 hosts, not on all of them.
I did some testing with iperf3 and got this result:
On some hosts, receiving data is accompanied by a lot of TCP retransmissions (the Retr column in iperf3).
ALL hosts can transmit flawlessly IF they transmit to one of the hosts that are able to receive flawlessly, for example:


root@pve11:~# iperf3 -c 192.168.50.13
Connecting to host 192.168.50.13, port 5201
[ 5] local 192.168.50.11 port 40840 connected to 192.168.50.13 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 1.10 GBytes 9.41 Gbits/sec 7 1.28 MBytes
[ 5] 1.00-2.00 sec 1.09 GBytes 9.39 Gbits/sec 0 1.37 MBytes
[ 5] 2.00-3.00 sec 1.09 GBytes 9.39 Gbits/sec 6 1.38 MBytes
[ 5] 3.00-4.00 sec 1.09 GBytes 9.40 Gbits/sec 0 1.39 MBytes
[ 5] 4.00-5.00 sec 1.09 GBytes 9.38 Gbits/sec 7 1.40 MBytes
[ 5] 5.00-6.00 sec 1.09 GBytes 9.40 Gbits/sec 0 1.41 MBytes
[ 5] 6.00-7.00 sec 1.09 GBytes 9.39 Gbits/sec 0 1.41 MBytes
[ 5] 7.00-8.00 sec 1.09 GBytes 9.39 Gbits/sec 0 1.43 MBytes
[ 5] 8.00-9.00 sec 1.09 GBytes 9.39 Gbits/sec 0 1.45 MBytes
[ 5] 9.00-10.00 sec 1.09 GBytes 9.39 Gbits/sec 0 1.48 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 10.9 GBytes 9.39 Gbits/sec 20 sender
[ 5] 0.00-10.00 sec 10.9 GBytes 9.39 Gbits/sec receiver

But SOME hosts are not able to receive without retry errors, for example:

root@pve11:~# iperf3 -c 192.168.50.12
Connecting to host 192.168.50.12, port 5201
[ 5] local 192.168.50.11 port 42404 connected to 192.168.50.12 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 931 MBytes 7.81 Gbits/sec 3796 1.06 MBytes
[ 5] 1.00-2.00 sec 798 MBytes 6.69 Gbits/sec 2657 957 KBytes
[ 5] 2.00-3.00 sec 685 MBytes 5.75 Gbits/sec 1983 492 KBytes
[ 5] 3.00-4.00 sec 820 MBytes 6.88 Gbits/sec 2250 1.58 MBytes
[ 5] 4.00-5.00 sec 908 MBytes 7.61 Gbits/sec 4489 243 KBytes
[ 5] 5.00-6.00 sec 848 MBytes 7.11 Gbits/sec 1921 267 KBytes
[ 5] 6.00-7.00 sec 736 MBytes 6.18 Gbits/sec 4022 680 KBytes
[ 5] 7.00-8.00 sec 792 MBytes 6.65 Gbits/sec 2808 257 KBytes
[ 5] 8.00-9.00 sec 840 MBytes 7.05 Gbits/sec 7461 1.38 MBytes
[ 5] 9.00-10.00 sec 718 MBytes 6.02 Gbits/sec 1563 768 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 7.89 GBytes 6.77 Gbits/sec 32950 sender
[ 5] 0.00-10.00 sec 7.88 GBytes 6.77 Gbits/sec receiver
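
To compare runs like the two above without eyeballing them, the retransmit count can be normalised by the amount of data transferred. The following is only a minimal sketch: the function name `summarize` and the regular expression are assumptions for illustration, and it only handles summary lines reported in GBytes and Gbits/sec, as in the two runs shown here.

```python
import re

# Matches the final "sender" summary line of iperf3's plain-text output, e.g.
# "[ 5] 0.00-10.00 sec 7.89 GBytes 6.77 Gbits/sec 32950 sender"
# Assumption: transfer is reported in GBytes and bitrate in Gbits/sec.
SENDER_LINE = re.compile(
    r"sec\s+([\d.]+)\s+GBytes\s+([\d.]+)\s+Gbits/sec\s+(\d+)\s+sender"
)


def summarize(iperf_output: str) -> dict:
    """Extract transfer, bitrate and retransmits from one iperf3 client run."""
    match = SENDER_LINE.search(iperf_output)
    if match is None:
        raise ValueError("no sender summary line found")
    gbytes = float(match.group(1))
    gbits_per_sec = float(match.group(2))
    retr = int(match.group(3))
    return {
        "gbytes": gbytes,
        "gbits_per_sec": gbits_per_sec,
        "retr": retr,
        "retr_per_gbyte": retr / gbytes,
    }


# The two sender summary lines from the runs in this post.
to_host_13 = "[ 5] 0.00-10.00 sec 10.9 GBytes 9.39 Gbits/sec 20 sender"
to_host_12 = "[ 5] 0.00-10.00 sec 7.89 GBytes 6.77 Gbits/sec 32950 sender"

for label, line in (("192.168.50.13", to_host_13), ("192.168.50.12", to_host_12)):
    print(label, summarize(line))
```

For the two runs above this works out to roughly 2 retransmits per GByte towards 192.168.50.13 and roughly 4,200 per GByte towards 192.168.50.12.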

So a few hosts have this problem. Before I did all these iperf3 tests I was suspicious about the switch.
Now I lean more towards the NICs in the affected hosts having some issue.
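
The reasoning here (every host sends cleanly to a good receiver, but traffic towards certain hosts always sees retries) can be made systematic by testing every sender/receiver pair and grouping the retransmit counts by receiver. This is a rough sketch only: the host names and numbers below are placeholders, not measurements from this cluster, and the threshold of 1000 is an arbitrary assumption.

```python
from statistics import median


def suspect_receivers(retr, threshold=1000):
    """Flag receivers whose median retransmit count over all senders is high.

    retr maps (sender, receiver) -> retransmit count from one iperf3 run.
    threshold is an arbitrary cut-off chosen for illustration.
    """
    by_receiver = {}
    for (_sender, receiver), count in retr.items():
        by_receiver.setdefault(receiver, []).append(count)
    return sorted(
        receiver
        for receiver, counts in by_receiver.items()
        if median(counts) > threshold
    )


# Placeholder matrix with hypothetical hosts and made-up counts.
example = {
    ("hostA", "hostB"): 20,
    ("hostC", "hostB"): 5,
    ("hostA", "hostC"): 0,
    ("hostB", "hostC"): 12,
    ("hostB", "hostA"): 30000,
    ("hostC", "hostA"): 25000,
}
print(suspect_receivers(example))
```

If a host only shows up as a problem when it receives, regardless of who sends, that points at its receive path (NIC, cable, or switch port) rather than at the senders.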

What are your thoughts on this?
 
I would try unplugging one of the 10 Gbit connections to see whether the problem is related to the port bonding, or to how these 10G ports and the switch redundancy interact.
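
Before pulling cables, it may also help to look at what the Linux bonding driver reports for each member link in /proc/net/bonding/bond0. Below is a small sketch of a parser for that file. The bond name `bond0`, the interface names, and the sample text are assumptions for illustration; the excerpt is not output from the hosts in this thread.

```python
# Fields worth comparing between the two members of the bond.
WANTED = ("MII Status", "Speed", "Duplex", "Link Failure Count", "Aggregator ID")


def parse_bond_members(text):
    """Return {interface: {field: value}} for each member listed in the file."""
    members = {}
    current = None
    for line in text.splitlines():
        key, sep, value = line.strip().partition(": ")
        if not sep:
            continue  # blank lines and section headers
        if key == "Slave Interface":
            current = members.setdefault(value, {})
        elif current is not None and key in WANTED:
            current[key] = value
    return members


# Illustrative excerpt with hypothetical interface names.
SAMPLE = """\
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
MII Status: up

Slave Interface: enp3s0f0
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Aggregator ID: 1

Slave Interface: enp3s0f1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 4
Aggregator ID: 1
"""

for iface, fields in parse_bond_members(SAMPLE).items():
    print(iface, fields)
```

On a real host the text would come from `open("/proc/net/bonding/bond0").read()`. Things to look for are a member that negotiated a lower speed, a link failure count that keeps climbing, or members that ended up with different aggregator IDs, which would mean LACP did not bundle them together.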

Doing it this way is one of the fastest methods to provide redundant connections to the server that allows for things like reboot of switches without outage. Simple spanning tree can also work but is a bit slower.

I have spent so much time trying to fix problems like this that you sometimes wonder whether you get more issues because of a complex system than you would from an actual failure. It does, though, allow for things like software upgrades on equipment without causing an outage, when it is done correctly.
 
Aug 9, 2022
Hi bill001g

Thanks for your answer!
I already tried switching to the backup switch by unplugging the network cable connected to the running "Master" switch in the IRF stack.
The second NIC interface in that LACP trunk then takes over as it should. The IRF system works fine, but the retry errors persist.
So, in my opinion, it could still be the NIC that has trouble sending without errors for some reason.
I just went over the whole VLAN and trunk structure, and I cannot for the life of me see any logical errors in it.
And everything in the network is functional; there are no hiccups of any kind.
It still puzzles me.
 
Problems like this are why I used to get paid very well. It could be the NIC, but it can also be pretty much anything in the path. It could be bad ports in the switches, issues with the interconnection between the different boards in a switch, or the interconnection cables between the switches.
Then you have all the software stuff like VLANs, subnets, and so on. I used to see very strange errors because the data would take one path going to the gateway and a different one coming back. This was a design error where the settings for the redundant gateway did not match the routing protocol paths.

Not sure how much help I can be. It takes detailed knowledge of the network and the ability to carefully make changes to test. You can try the obvious: look at the reports and see whether you are getting data errors on the ports. You might have to turn the feature on or have the switch send SNMP traps to a monitoring server. This is where commercial switches are so nice compared to consumer stuff. Simple things like received error packets on a port will show you a bad cable or maybe a bad port.
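
The same kind of check can be done on the host end of each link, since Linux exposes per-interface error counters under /sys/class/net/<interface>/statistics/. This is a minimal sketch, assuming a Linux host; the interface name used at the bottom is hypothetical, and which counters a driver actually fills in varies.

```python
from pathlib import Path

# Standard Linux per-interface counters; a driver may leave some at zero.
COUNTERS = (
    "rx_errors",
    "rx_crc_errors",
    "rx_dropped",
    "rx_missed_errors",
    "tx_errors",
)


def read_counters(iface, base="/sys/class/net"):
    """Read the selected counters for one interface; skip any that are absent."""
    stats_dir = Path(base) / iface / "statistics"
    result = {}
    for name in COUNTERS:
        path = stats_dir / name
        if path.exists():
            result[name] = int(path.read_text().strip())
    return result


def nonzero(counters):
    """Keep only counters that have registered at least one event."""
    return {name: value for name, value in counters.items() if value > 0}


# "enp3s0f0" is a hypothetical interface name; substitute the bond members.
print(nonzero(read_counters("enp3s0f0")))
```

Running it before and after an iperf3 test and comparing the two results shows whether the errors are growing during the transfer. CRC errors growing on one host would point towards the cable or port, while missed or dropped packets would point more towards the host not keeping up.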

You of course should check the firmware levels on your switches.

After this it is a lot of digging around trying to figure out what is different about the ones that work and the ones that do not.
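
One way to organise that digging is to write down the same set of facts for every host (driver version, NIC firmware, MTU, switch port, and so on) and then look for settings whose values never overlap between the working and the failing group. The sketch below assumes the facts have already been collected by hand; all host names, keys, and values are hypothetical.

```python
def distinguishing_settings(good, bad):
    """Return settings whose values never overlap between the two groups.

    good and bad map host name -> {setting: value}.
    """
    keys = set()
    for settings in list(good.values()) + list(bad.values()):
        keys.update(settings)
    result = {}
    for key in sorted(keys):
        good_values = {settings.get(key) for settings in good.values()}
        bad_values = {settings.get(key) for settings in bad.values()}
        if good_values.isdisjoint(bad_values):
            result[key] = (
                sorted(str(v) for v in good_values),
                sorted(str(v) for v in bad_values),
            )
    return result


# Hypothetical inventory, for illustration only.
working = {
    "hostA": {"driver": "1.0", "mtu": 9000, "switch_member": 1},
    "hostB": {"driver": "1.0", "mtu": 9000, "switch_member": 2},
}
failing = {
    "hostC": {"driver": "1.1", "mtu": 9000, "switch_member": 1},
    "hostD": {"driver": "1.1", "mtu": 9000, "switch_member": 2},
}
print(distinguishing_settings(working, failing))
```

A setting that shows up here is only a candidate, not proof, but it narrows down what to change first when testing.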