Question Network detection and detection of mapped drives is slow?

Deleted member 2783327

Guest
I've installed Windows 10 21H2. It takes 12 seconds to detect the network. It takes a further 4 seconds to detect network drives.
On my 1809 LTSC installation network detection is almost instant, as is mapped drive detection and opening of network files.
All infrastructure is 10G.

I've done all the basics as recommended on dozens of other posts.

I've also disabled the on board 10G NIC, and plugged in a PCIe 10G NIC.
I updated the NIC driver from 2.1.21 to 3.1.6 (AQC107 and AQN107)
This is a fresh clean install, not an in-place upgrade.
I've checked I am not running SMBv1 and SMBv2 & v3 are enabled.
I've gone through the NIC configurations with things like LSO, RSC, Jumbo packets et al.

Of course, using a 1G NIC the copy speeds are about 90MB/s, but all other issues like slow detection and slow opening of network files are still present.
I've made sure there is not a single user application running, and I even tried disabling a combination of Windows services to see if any of them were interfering.
Tried BIOS updates, driver updates and reverting to older drivers.

This is driving me nuts. I'm stuck on Server 2012 R2 which is coming out of support for my server. I don't want to be stuck on 1809 LTSC forever as well for desktops.
I'm again at a loss as what to do next to resolve this.

I'm beginning to think that Microsoft is using the same broken TCP/IP stack on 21H2 as they do for Server 2019, because most of the symptoms are the same. The only difference is file copies from my PC to Server 2012 R2 don't stall, they are now just slow at 200 MB/s - 240MB/s, instead of around 1GB/s

As an aside; I've read through dozens of posts and couldn't find a solution, but I did come across one of my own posts from a year ago.
That issue was never resolved here. It turned out that Microsoft reinvented the network stack on Server 2019 which is what causes the issues.
There is still no known solution. It wasn't an LSO issue, though at the time that seemed like a good idea.
 

Ralston18

Titan
Moderator
This:

" I'm stuck on Server 2012 R2 which is coming out of support for my server. I don't want to be stuck on 1809 LTSC forever as well for desktops."

What are the reasons for being "stuck"?

Not my area at all (full disclosure). However; it appears that there are some constraints involved.

What are those constraints?
 
I'm not sure I understand the issue. Do you need to disconnect and connect the network physically very often? And when you do it takes time to recognize the physical connection? If you are connecting a couple times a day 15s doesn't sound like a big issue, but I can see how it would be annoying if you do it very often.

If so, please confirm whether it's the physical connection that takes time to come up or the network configuration (does the NIC show "up" quickly?). If you could try "Get-NetAdapter" in PowerShell while it's unavailable it would help establish the NIC state at that moment.

If the NIC takes too long to change state to "up", you could try looking at the Windows event log for clues. This would probably be related to the driver and the OS network stack, so there might be little we can do.

If the NIC changes to "up" quickly and it takes time after that to be able to use it, you could try checking with Wireshark to see if there's any form of communication on that interface when it happens.
 
Deleted member 2783327

Guest
Thanks for your reply.

Perhaps I'm spoiled. I've gotten used to a responsive OS for over a decade. Unresponsive OSs just drive me nuts regardless of how often I connect, which is 3-4 times a day, but I also have to roll this out to 10 other people, who have the same expectations as me. I suspect within a few days they'll be asking me to roll them back to 1809.

I look at it like a semi-flat tyre. Sure it's only one tyre, but my handling is reduced, so if it's flat I will pump it up and take it to the tyre people to be fixed properly. Same concept. Sorry if that's a bit of a lame analogy :)

I set up a task to ping yahoo.com in a startup script. It says "Request timed out" until the network is detected. The NCSI indicator (the globe) shows not connected. If I attempt to manually map a drive while it's not connected I get an error 53, network name not found.
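For anyone wanting to put a number on that startup delay, here is a minimal Python sketch of the same idea. It is an illustration, not the script used in this thread. It polls with a TCP connect rather than an ICMP ping (raw ICMP needs elevated rights in Python), and the target host, port, and polling interval are all assumptions.

```python
import socket
import time
from typing import Callable, Optional


def tcp_probe(host: str = "yahoo.com", port: int = 443, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (host/port are assumptions)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, timeouts, and unreachable-network errors.
        return False


def seconds_until_online(probe: Callable[[], bool] = tcp_probe,
                         interval: float = 0.5,
                         give_up_after: float = 60.0) -> Optional[float]:
    """Poll probe() until it succeeds; return elapsed seconds, or None if it never does."""
    start = time.monotonic()
    while True:
        if probe():
            return time.monotonic() - start
        if time.monotonic() - start >= give_up_after:
            return None
        time.sleep(interval)
```

Calling `seconds_until_online()` from a logon or startup task and logging the result would give a comparable figure for 1809 and 21H2 on the same hardware. The `probe` argument can be swapped for any other check, for example one that tests whether a mapped share path exists.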

It also causes my VPN to fail to connect - It times out. The VPN is also connected via a login script so the PCs are always on the VPN. This is required because people are working from home.

I suspect this is the same issue that exists for Server 2019 onward. Microsoft reinvented the wheel and it's never been responsive ever since. I've gone back to Server 2012 R2 as a result of that for the server.

But it's not just the slow connection that is an issue. The infrastructure here is all 10G. On 1809 the copy speeds are anywhere from 700MB/s to 1.2GB/s. On 21H2 this drops down to 230MB/s. As a lot of large files are moved around frequently, this has a noticeable impact on people's productivity.
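To put those figures side by side, a quick piece of arithmetic in Python. The only inputs are the speeds quoted above. The 10G line rate is taken as 10 Gbit/s divided by 8, ignoring protocol overhead, which is a simplification.

```python
# 10 Gbit/s expressed in MB/s (decimal units), ignoring Ethernet/IP/TCP overhead.
LINE_RATE_MB_S = 10_000 / 8


def percent_of(baseline_mb_s: float, observed_mb_s: float) -> float:
    """Observed throughput as a percentage of a baseline throughput."""
    return 100.0 * observed_mb_s / baseline_mb_s


observed = 230  # MB/s seen on 21H2
for label, baseline in [("1809 low end (700 MB/s)", 700),
                        ("1809 high end (1.2 GB/s)", 1200),
                        ("10G line rate", LINE_RATE_MB_S)]:
    print(f"{label}: {percent_of(baseline, observed):.0f}%")
# prints 33%, 19% and 18% respectively
```

In other words, 21H2 is delivering roughly a fifth to a third of what 1809 manages on the same hardware.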

On server 2019 the file copies actually stall - as in drop to 0 b/s for long periods of time. 21H2 is not that bad, but still not as good as 1809.
Even when the connection is up it still takes several seconds to discover the mapped drives, which are then mapped in a disconnected state.

There is also the fact that programs open network files significantly slower. The largest spreadsheet here takes 8 seconds to open on 1809. On 21H2 it's 23 seconds.

There are 3 registry tweaks I've got to try for that, but I'm not optimistic about those - IIRC I tried them on Server 2019 and they didn't help.

Essentially, all aspects of the network experience seem to be affected. Slow to connect and slow once connected.

The NIC drivers have been upgraded from 2.0.18 to 2.1.21, 2.2.3.0 to 3.1.6, none of which change anything.

Telemetry has been significantly reduced, and there are 100s of firewall rules to prevent things connecting to the internet. It's basically a whitelist. Only those programs that absolutely must connect out are allowed - email, browser, things like that.

There were no event log entries that I could find, except for OpenVPN, which logs an error when it successfully connects, but it's been doing that for years.

I'll have a go at the PS command you mentioned and get back to you.
 
Deleted member 2783327

Guest
" I'm stuck on Server 2012 R2 which is coming out of support for my server. I don't want to be stuck on 1809 LTSC forever as well for desktops."

What are the reasons for being "stuck"?

Not my area at all (full disclosure). However; it appears that there are some constraints involved.

What are those constraints?

Server
  • Server 2016 will not install with consumer graphics cards like the GTX 16xx and RTX 20xx series. The motherboard is X299 - meaning it does not have an iGPU.
  • Server 2019 takes 12 seconds to detect the network.
  • File copies of files > 5GB stall. They drop to 0 b/s, IIRC, for about 30 - 45 seconds, then start, then stall again. This is true whether the file is being copied to or from the server.
  • Clients attempting to open files on the server experience a much slower opening.
  • Accessing the Intranet website is slower from clients and on occasion times out.
  • The server runs ColdFusion for some internal websites. MySQL is used for the databases. On 2019 things sometimes timed out. I had to increase the timeout values to overcome this, but that just means applications are slower at the client.
I spent 3 months going through various server tuning exercises but nothing worked, then someone at Stack Overflow (or maybe it was superuser?) said that Microsoft had re-invented the network stack and this was a known issue. To this day there is apparently no solution.

Client
  • Exhibiting the same symptoms as Server 2019, though file copies are down to about 20% - 50% of speeds compared to 1809 but don't stall.
 
Deleted member 2783327

Guest
Ok, so get-netadapter shows the main NIC to be UP within seconds of start up, but things like ping, ncsi, mapping of drives, and so on don't work. Where does that leave me?

And the slower overall performance?
 

Ralston18

Titan
Moderator
This (with my underline):

"Telemetry has been significantly reduced, and there are 100s of firewall rules to prevent things connecting to the internet. It's basically a whitelist. Only those programs that absolutely must connect out are allowed - email, browser, things like that."

Quite likely that those rules are getting in the way one way or another. Conflicting, overriding sometimes, all in all consuming system resources to resolve whether or not some operation/process will or will not be allowed to connect.

Take some rules A, B, and C. A and B work together but B and C do not. You fix B and C, then A and C no longer work. Etc., etc. for 100s of rules. Immediately problematic.

Could be being made all the worse with older versions of software and hardware that may have support issues.

Another clue - the pings resulting in "Request timed out".

"In most cases, a "Request Timed Out" message is caused by a firewall blocking the connectivity. "

Reference:

https://support.logmeininc.com/cent...uest-timed-out-when-trying-to-ping-a-computer

Or simply some hop configured not to respond to ping requests.

My recommendation is to focus on discovering more about what is occurring during slowdowns and blockages.

Use Task Manager, Resource Monitor, Process Explorer (Microsoft, free) and Latency Monitor to take a very careful methodical look at your network.

Use all of the tools but only one tool at a time on any given device. Primarily to prevent the tools from interfering with each other. If it is proven that there is no interference then run the tools simultaneously.

https://learn.microsoft.com/en-us/sysinternals/downloads/process-explorer

Another tool that could help is Wireshark. There may be other tool suggestions as well.

= = = =

Overall objective being to determine or otherwise identify the specific source/cause of the problems being encountered be they hardware, software, or configuration. Or a combination.

You will need to create a very careful monitoring plan focused on specific servers and hosted computers. The plan should be observational at first.

Collect data without changing anything. Then make a single test change somewhere and again collect data.

If you do not have an overall working network diagram showing all connected devices you should create the diagram with as much information as you can about each device.

Device name, OS, version, IP address, specs, installed software, and connectivity within the network. The proverbial "big picture" view.

Use the network diagram to plan the initial observation points. Then use the observation results to refocus your observations on some more specific part of the network.

I also recommend using PowerShell as a diagnostic tool via the "Get" cmdlets. There are other commands that can be used to delve deeper into system hardware, software, configuration, and performance.

For the time being, just stay with the observation tools until, hopefully, the scope of the problem and perhaps the problem itself, can be narrowed down.

Key is to discover some scenario when the presence of X makes network performance slow. Without X all is well.

Takes time and effort to troubleshoot such things.

And bear in mind that there may just be some inherent latency: 12 seconds and 4 seconds just being "what it is" with respect to the current network environment. Hopefully that can be tweaked some but I would put a premium on smooth, reliable performance, and ease of maintenance (backups, recovery, updates, etc.) over saving a few seconds here and there.
 
Deleted member 2783327

Guest
@Ralston18. I thought about the firewall rules.
I guess I need to be clearer in the way that I explain myself. Sorry, totally my fault.
The production machines have the firewall rules. On the test machine, on which I am currently testing 21H2, the firewall is disabled to make sure that wasn't causing the problem. But regardless, the rules are not that complex. A program has a main executable, and that is what's blocked.

I like your explanation though. Well thought out and very logical.
I am no stranger to spending time on diagnostics. I'm probably a bit of a masochist, lol.

Yes, hops can be a black hole. The first hop though is the one that doesn't respond (which should be my router), I suspect because Windows thinks there is no active internet connection, despite the adapter being "up". As soon as the NCSI shows a connection, the ping starts to respond. btw: My internet is FTTP 100/20.

The problem I have with "it is what it is" is that it's the very definition of the frog in the simmering pot scenario. I'd like to believe that Microsoft wouldn't drop such garbage on users, but knowing Microsoft nothing would surprise me. E.g. reinventing the TCP/IP stack. I'm sure they thought they had a good reason, but it's caused nightmares for many a sysadmin.

If X=1809 all is well. If X=21H2 All is not well. :)

A plain vanilla 21H2 installation with absolutely no tweaks behaves poorly. If I then go and apply a number of tweaks, one at a time of course, as you suggest, working towards the configuration of my production machine nothing changes for better or worse. That seems to align with your thought of "it is what it is".

I won't go through every point as that would just be frustrating for you guys, but I've gone to the level of changing hardware everywhere, as well as software, drivers, services, GPOs, registry, and cabling.

And if that is how it is, sorry, I'll take a pass. It's unacceptable to drive a car with 3 wheels, and I won't run an OS that's unresponsive, especially when I have one that is responsive. And let's not forget, the slow discovery and connection to the network is just the start of the problems. It is slow everywhere.

I'm reasonably experienced in this stuff, and it sounds like you are too. But I'm human, I can miss things, and I generally only post when I get so frustrated from spending too much time that I sort of hope someone will say "Oh, you dummy, you forgot X".

I run Process Monitor, Process Explorer and Wireshark obsessively :)

I shall keep digging... Maybe I will get lucky. And if not, I can stay on 1809, or move to Linux. At least for me. For the other 10....??
 
Deleted member 2783327

Guest
If the NIC changes to "up" quickly and it takes time after that to be able to use it, you could try checking with Wireshark to see if there's any form of communication on that interface when it happens.

Other than the usual "Who has <ip>? Tell <ip>" status messages, Wireshark shows little to no traffic. Even after the link comes up. But then, at the current time, there is nothing installed on the 21H2 PC. I reinstalled it from scratch with just the drivers.
 
Ok, so get-netadapter shows the main NIC to be UP within seconds of start up, but things like ping, ncsi, mapping of drives, and so on don't work. Where does that leave me?

This tells us that the interface is up but the OS can't (or won't) use it yet. Just to be sure, are you using static addresses? I would also disable IPv6 and network discovery while troubleshooting (even if you need it, just to try and pinpoint the issue). Also, are you guys using 802.1x?

Other than the usual "Who has <ip>? Tell <ip>" status messages, Wireshark shows little to no traffic. Even after the link comes up. But then, at the current time, there is nothing installed on the 21H2 PC. I reinstalled it from scratch with just the drivers.

The fact that there's any traffic at all at that point suggests the issue lies higher up in the stack. Don't know if you've read this, but using 10G has some implications with TCP window size:

https://learn.microsoft.com/en-us/w...s/network-subsystem/net-sub-performance-tools
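As a rough illustration of why window size matters at 10G, here is the standard bandwidth-delay product arithmetic as a small Python sketch. The round-trip times are assumed LAN values, not measurements from this thread. 65,535 bytes is the largest TCP window without window scaling; Windows normally negotiates scaling, so treat the cap as a worst case.

```python
def bdp_bytes(bandwidth_bits_per_s: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to keep the link full."""
    return bandwidth_bits_per_s * rtt_s / 8


def window_limited_mb_s(window_bytes: float, rtt_s: float) -> float:
    """Upper bound on one connection's throughput for a given window, in MB/s."""
    return window_bytes / rtt_s / 1e6


for rtt_ms in (0.2, 1.0):  # assumed round-trip times, not measured values
    rtt = rtt_ms / 1000
    print(f"RTT {rtt_ms} ms: BDP at 10G = {bdp_bytes(10e9, rtt) / 1e6:.2f} MB, "
          f"unscaled 64 KB window caps at {window_limited_mb_s(65535, rtt):.0f} MB/s")
```

With a 1 ms round trip the link needs about 1.25 MB in flight, so anything that keeps the effective window small would cap a single copy well below line rate.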
 
Deleted member 2783327

Guest
Yes, I've done a lot of reading about 10G. But these problems also persist on my 1G NIC.
I haven't tweaked anything on 21H2 yet, but I did go through that exercise with Server 2019. I should probably revisit it. Thanks for that tip.

IP addresses have always been assigned via MAC address reservations on the DHCP server (the router). I guess I could hard code an IP address on the test PC and see if that makes any difference. I'll let you know.

IPv6 is totally disabled here.

Initially when the OS is installed network discovery is off. It is still slow. So I turned it on, but it didn't help.

No, no RADIUS server.

Also, no Wifi. All wifi adapters are disabled, and Wifi radios on the router are turned off.

The server and test PC are connected to the same switch, so I moved the test PC to a different 10G switch, but no difference.



Anyway, this isn't the solution but I did discover two things..

Damn NVIDIA! They keep changing how they collect telemetry as people were continually trying to block it. Now they are storing the display container exe in the Windows driver store, and the folder name is different for every installation, just to frustrate people who try to block their data collection... Mongrels :)

Easy enough to get around with a few lines of batch code to grab the folder name.

spoolsv.exe scans the first 40 IP addresses about 1 minute after boot up. Looking for network printers I guess. Don't know how to stop that one.
 

Ralston18

Titan
Moderator
Noted:

"IP addresses have always been assigned via MAC address reservations on the DHCP server (the router) "

Look for duplicate MACs.


MAC Addresses

From the link:

"Each MAC address is unique to the network card installed on a device, but the number of device-identifying bits is limited, which means manufacturers do reuse them. Each manufacturer has about 1.68 million available addresses, so when it burns a device with a MAC address ending in FF-FF-FF, it starts again at 00-00-00. "

Could be some error of omission or commission with respect to the MAC reservations being made.

Compare MACs and assigned IP addresses. Ensure that there is no overlap between the allowed DHCP IP address range and any static IPs in use.
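If the reservation table can be exported or typed out, both checks are easy to script. A minimal Python sketch follows. The table layout (IP mapped to MAC) and the sample addresses in the usage note are hypothetical, made up for illustration.

```python
from collections import Counter
from ipaddress import IPv4Address
from typing import Dict, List


def duplicate_macs(reservations: Dict[str, str]) -> List[str]:
    """reservations maps IP -> MAC; return any MAC that appears more than once."""
    # Normalise case and separator so AA-BB-.. and aa:bb:.. compare equal.
    counts = Counter(mac.lower().replace("-", ":") for mac in reservations.values())
    return sorted(mac for mac, n in counts.items() if n > 1)


def statics_inside_pool(statics: List[str], pool_start: str, pool_end: str) -> List[str]:
    """Return the static IPs that fall inside the DHCP pool range (inclusive)."""
    lo, hi = IPv4Address(pool_start), IPv4Address(pool_end)
    return [ip for ip in statics if lo <= IPv4Address(ip) <= hi]
```

For example, `statics_inside_pool(["192.168.1.5", "192.168.1.150"], "192.168.1.100", "192.168.1.200")` returns `["192.168.1.150"]`, flagging the one static address that collides with the pool.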
 
Deleted member 2783327

Guest
All clear with MAC addresses and IP addresses. No duplicates. No 169.254.x.x addresses. No DHCP overlap.
The only static IP addresses are a network printer and the file server.

Assigning a static IP to the PC did not improve things.
 

Ralston18

Titan
Moderator
Time to go to another level.

Are you at all familiar with PowerShell?

You can use PowerShell and "Get" cmdlets to learn all sorts of things about hardware, software, and configuration settings.

For example:

https://learn.microsoft.com/en-us/p...teradvancedproperty?view=windowsserver2022-ps

https://learn.microsoft.com/th-th/p...indowsserver2022-ps&viewFallbackFrom=win10-ps

https://stackoverflow.com/questions...rk-adapter-wi-fi-ethernet-bluetooth-in-window

Get cmdlets are safe because they do not make any changes to the system. However you do need to be careful about longer cmdlets or scripts that may make some change dependent on what the Get finds.

The third link uses:

get-wmiobject win32_networkadapter -filter "PhysicalAdapter = true" | select *

to collect all sorts of information.

And the cmdlet itself can be easily copied and pasted at the PowerShell PS> prompt. (Run PowerShell as Admin.)

Run the cmdlet(s) in a test environment until you are comfortable with doing so and get a sense of the results.

The objective being to take an in-depth look at the network adapter to determine if any of the configuration settings are not as expected, either on any given computer or as differences between computers.

Note: You can also easily find any number of scripts that will gather all sorts of data from network computers and other devices.

Also you mentioned printers:

PowerShell can get printer information as well via Get-Printer and other options.

https://stackoverflow.com/questions...-server-from-remote-computer-using-powershell

Get-CimInstance -ClassName CIM_Printer -ComputerName $arrayOfComputerName
 
Deleted member 2783327

Guest
Are you at all familiar with Powershell?
Yes.

You read my mind ;)

Sorry I've gone quiet. Been doing a lot of reading about TCP/IP tuning and 10G adapters. Gone into "deep research mode" :)

Have made some changes such as IRPStackSize, receive/transmit buffers, Receive Segment Coalescing (RSC) and much more.

Also investigating the switches. I read somewhere that there is a low level issue with the MS510TX and MS510TXPP switches. Not firmware - something Netgear would have to tweak. It seems to address my file copies slowing down but they won't disclose what they changed (trade secrets I guess). I've also updated the switches to the latest firmware. They were a tad out of date.

Very minimal improvements. I'll keep researching.... My brain hurts...

The only setting that wasn't as expected was RSC. This had somehow gotten enabled. I disable it when I install the OS. Everything else is as expected... so far :)
 

Ralston18

Titan
Moderator
Deleted member 2783327

Guest
Every suggestion is a good suggestion!

Worth a look.

I checked all scenarios.
  1. No wifi is enabled. All wifi capable NICs are disabled and router radios are turned off.
  2. There are no ethernet loops either on router, switches or devices.
  3. I haven't moved any cables around, but I checked port allocation again anyway. Looks good to me.
As for broadcast storms
  1. If such were occurring I would expect it to affect devices other than only the 21H2 PC.
  2. I would also expect to see symptoms in Wireshark.
  3. If the 21H2 device were causing the broadcast storm, again I'd expect to see symptoms in Wireshark.
  4. I watched the diagnostics on the switches when booting up and shutting down the 21H2 PC. Nothing out of the ordinary there.
I'm about to embark on the "Jumbo frames" expedition. That promises to be a lot of fun, going by the dramas that others have documented.
 
Deleted member 2783327

Guest
OMG!! I think I have fixed it.. or 75% of it..

NCSI
  1. IPv6 is disabled on all PCs here. Except the 10G nic has a tick in the IPv6 option in the NIC properties.
  2. When Windows starts it seems to be using IPv6 for the connect test if the above option is ticked. This causes Windows not to detect the Internet despite it being connected. Unticking this box results in normal connection indicator within 3 seconds. This seems rather bizarre given IPv6 is blocked in the firewall, blocked by policy and blocked at the service level with DisabledComponents:0xff
  3. RSC was enabled for both IPv4 and IPv6. My setupcomplete script must have failed. Disabled both settings.
DELAY IN NETWORK DISCOVERY
  1. I was initially using 2.0.18 of the AQC107 driver, which seems to work. I tried to update to 2.1.21, which failed as not compatible with 21H2. I updated to 3.1.6, which is the latest and which I thought was working.
  2. I found a later AQTion BIOS. I was on 3.1.84 and there was a 3.1.88 BIOS. Went to update that but it said that the firmware 3.1.6 was not compatible with the new bios.
  3. Uninstalled the device, rebooted and then updated the NIC BIOS to 3.1.8.8
  4. Started seeing link lost errors in event log at 6 seconds after reboot and reconnecting 6 seconds later.
  5. Installed v2.2.30 of the firmware and rebooted.
  6. Just for the sake of testing, I tried using the 2.0.18 AQC107 driver with the 3.1.8.8 BIOS and the delay returned. So it's a combination of updates that was required. Interesting though that 2.0.18 would install where 2.1.21 would not.
  7. I need to test 3.1.6 with the new BIOS. Will get back to you on that one.
OUTCOME
  1. Network connected at start up
  2. NCSI showing connection within 3 seconds.
  3. Mapped drives showing as connected immediately. No other delays
  4. No "Link lost" errors in event log.
NIC PROPERTIES
  1. Changed receive buffers to 1024 (was 512)
  2. Changed transmit buffers to 4096 (was 2048)
  3. Added LANManServer parameter IRPStackSize and set to 0x20 (32 decimal)
  4. Checked LSO and RSS were enabled - both good. Also tried disabling Interrupt Moderation. No difference so re-enabled it.
  5. Turned off "Allow Windows to turn off this device to save power"
  6. Reboot and test
FILE COPIES
  1. Copies to client of 55GB file @ 565MB/s for first 10GB then drops down to 235MB/s. This might be my 970 NVMe drive throttling or the ability of the source drive, given it is a spinner.
  2. Copies from Client to server of 55GB file 700MB/s dropped to 235MB/s after around 10GB.
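To see what that drop-off costs over a whole copy, here is a small Python calculation using only the figures above. It assumes decimal units (1 GB = 1000 MB) and a clean two-phase copy: a fast first 10GB, then a constant slower rate for the remainder.

```python
def copy_seconds(total_gb: float, fast_gb: float,
                 fast_mb_s: float, slow_mb_s: float) -> float:
    """Time for a copy that runs at fast_mb_s for the first fast_gb, then slow_mb_s."""
    fast_part = min(total_gb, fast_gb)
    slow_part = total_gb - fast_part
    return fast_part * 1000 / fast_mb_s + slow_part * 1000 / slow_mb_s


for label, fast in [("server -> client", 565), ("client -> server", 700)]:
    secs = copy_seconds(55, 10, fast, 235)
    print(f"{label}: {secs:.0f} s total, {55 * 1000 / secs:.0f} MB/s average")
```

Both directions come out at roughly three and a half minutes with an average in the 260s of MB/s, so the initial burst barely moves the total. The sustained 235MB/s phase is what dominates.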
OTHER TASKS
  1. Update MS510TX and MS510TXPP switches to latest firmware - I don't think this was contributing to the issues though.
  2. Replaced LAN cables with new cables.
  3. Performed several tasks as suggested by replies to this thread.
  4. Tested AutoTuning at Experimental setting. Didn't change anything.
  5. Using Jumbo frames broke the network connections so put it all back to disabled. I may have missed something.
  6. Enabled global flow control on all switches.
Several forums suggested setting the system to high performance power plan and setting BIOS to high performance and disabling C-States. With the price of electricity though I can't afford to do that. It could add hundreds to electricity bills each month if done across all 10 PCs.

I'm going to restore Windows from my backup image to how it was before I started making changes, then repeat the steps to see whether what I did was the actual solution.

oh, btw: I also found a known issue with LWFs. See link below. I'm using the WinPcap driver, not Npcap, but I thought that might have been the issue. Red herring. The info about the 90 second delay is interesting though.

https://learn.microsoft.com/en-US/troubleshoot/windows-hardware/drivers/lwf-causse-90-second-delay
 

Ralston18

Titan
Moderator
Three thoughts:

1) Continue to be methodical and change only one thing at a time.

2) Allow time between changes. All those devices etc. communicate with each other, do discovery, and do updates accordingly. E.g. ARP.

You may fix something only to have it "undone" by doing something else a bit too soon before another device has "caught up".

3) And at some point if problems persist it may be worthwhile to make some controlled and measured changes to the power plans. I understand that doing so will add to electrical costs but I expect that your investment in time and effort to resolve the problem(s) has likely already been at some cost.

If the power plans happen to prove to be the culprit or partially involved at least you will know that and can plan ahead in some manner.
 
Deleted member 2783327

Guest
If the power plans happen to prove to be the culprit or partially involved at least you will know that and can plan ahead in some manner.

At this stage I've stayed away from power plan changes. But you may be right - I may need to look into that at some point.

I have mostly finished my testing and have prepared the attached summary. It's long. For those who may be interested.

https://docdro.id/LEeUm59

I would like to thank all who contributed.