Fast Internet, slow page loads only on non-cached pages, only on some machines

Mar 8, 2018
All,

Hoping for some assistance on a peculiar problem.

I've just started managing a new environment and they've complained that their Cogent 1 Gig Internet has always been terrible. The first thing I did was upgrade all switches to managed ones and replace the router with a Cisco ISR so we could get insight into any issues.

Now, there is a lot of improvement, but SOME users have problems with Internet, where if they haven't visited the page in a long time (or ever) the browser hangs on 'connecting'. It takes so long that occasionally the browser will even flash 'timed_out' for a second, before the page suddenly loads itself. Once loaded, all sub pages load ok, and the page can be re-visited with normal response time.

The issue is not browser specific, though we primarily use Chrome.

Speedtest comes back at about 950 Mbps, although on the affected machines it sometimes has a problem initially finding a test server to connect to.

Sounds like DNS, right? Well, it's not =*( I've changed to OpenDNS and Google DNS, with no resolution. Within the last week, I've pointed everyone from the ISP DNS to the Windows DNS server, with no improvement.
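One way to back up the "it's not DNS" conclusion with numbers is to time the name lookup, the TCP connect and the TLS handshake separately on a good machine and a bad one. The following is only a rough sketch using the Python standard library; the function names `time_phases` and `slowest_phase` are made up for illustration, and it sticks to IPv4 on the assumption that IPv6 stays disabled on the NIC.

```python
import socket
import ssl
import time


def time_phases(host, port=443, timeout=30.0):
    """Return seconds spent in DNS lookup, TCP connect and TLS handshake."""
    t0 = time.perf_counter()
    # Resolve first so the name lookup is measured on its own.
    # AF_INET keeps this to IPv4 (assumption: IPv6 is disabled on the host).
    family, socktype, proto, _, addr = socket.getaddrinfo(
        host, port, family=socket.AF_INET, type=socket.SOCK_STREAM)[0]
    t1 = time.perf_counter()

    sock = socket.socket(family, socktype, proto)
    sock.settimeout(timeout)
    try:
        sock.connect(addr)  # TCP three-way handshake
        t2 = time.perf_counter()
        ctx = ssl.create_default_context()
        # wrap_socket performs the TLS handshake before returning.
        with ctx.wrap_socket(sock, server_hostname=host):
            t3 = time.perf_counter()
    finally:
        sock.close()
    return {"dns": t1 - t0, "tcp_connect": t2 - t1, "tls": t3 - t2}


def slowest_phase(timings):
    """Name of the phase that took the longest."""
    return max(timings, key=timings.get)
```

Running `time_phases("www.yahoo.com")` against a site the machine has not visited recently, then feeding the result to `slowest_phase`, should show whether the wait is in `dns`, `tcp_connect` or `tls`. A small `dns` value next to a large `tcp_connect` value would point away from name resolution and toward something on the path.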

Counters are running clean on all switchports. We don't see usage more than ~25MB on the high side for all of the 15 users combined.

Everyone runs 1 Gb NICs. I verified that speed/duplex are set to auto/auto on both switch and NIC and that they are negotiating to 1 Gb/full.

CPU/MEM/Disk I/O and network all look fine on the affected desktops, not close to maxed out. Network usage is about 7MB on the machine I am diagnosing on. I've tried disabling Windows Firewall and antivirus. Hosts are running Windows 10.

Updated latest drivers on the NIC.

Reset the 'hosts' file back to empty default.

Did 'netsh winsock reset'.

Cleared browser cache/reset to default.

Did ipconfig DNS flush.

Ensured 'Internet options' has automatic/proxy settings etc. disabled.

Disabled IPv6 on the NIC.

Hmmm... Anything else? Probably, but I can't remember. We also recently got a secondary ISP line, AT&T 1 GB fiber, which I think gives the same issue (but only 99.8% certain that is true).

It's clear that only certain users' machines are consistently affected, and it likely has something to do with the applications they are running, but there is absolutely nothing I can see to give me any sort of ammunition that that is actually true.

Can anyone give me a nudge in the right direction on how to effectively determine root cause? I am thinking of perhaps WireShark, but not that familiar with it, nor really clear to me how I could use it to determine the real culprit.

Thank you in advance!!!
 
Nov 24, 2018
Are you still experiencing the problem?

TL;DR: Provide a capture from a working machine trying to connect to www.yahoo.com, as well as one from a machine that doesn't work.
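Once both captures exist, one thing worth comparing is whether the bad machine sends the same SYN several times before it gets an answer. Here is a minimal sketch of that check, assuming the packets have been exported into a list of dicts. The keys `src`, `sport`, `dst`, `dport` and `flags`, and the flag letters `S` and `A`, are a made-up layout for illustration, not the export format of any particular tool.

```python
from collections import Counter


def syn_retransmissions(packets):
    """Count repeated client SYNs per flow.

    `packets` is an iterable of dicts with keys src, sport, dst, dport
    and flags (hypothetical layout; flags is a string such as "S" or "SA").
    Returns {flow: number_of_extra_SYNs} for flows with at least one repeat.
    """
    syn_counts = Counter()
    for pkt in packets:
        flags = pkt["flags"]
        # Client SYN only: SYN set, ACK not set (skips the server's SYN/ACK).
        if "S" in flags and "A" not in flags:
            flow = (pkt["src"], pkt["sport"], pkt["dst"], pkt["dport"])
            syn_counts[flow] += 1
    return {flow: n - 1 for flow, n in syn_counts.items() if n > 1}
```

If the non-working capture shows repeated SYNs to the same destination and the working one does not, the delay is happening before the connection is even established, which narrows things down to the path rather than the browser.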

Is the MTU set to the default 1500 everywhere, w/ no devices configured for jumbo frames?
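A quick way to check the MTU question is a ping with the don't-fragment bit set. With a 1500-byte MTU the largest ICMP payload that fits is 1500 - 20 (IP header) - 8 (ICMP header) = 1472 bytes. Below is a rough sketch that builds the ping command and searches for the largest payload that passes. The helper names are invented for illustration, and `probe` is any function you supply that returns True when a payload of that size gets through.

```python
import platform

IP_HEADER = 20    # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def ping_payload_for_mtu(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IP_HEADER - ICMP_HEADER


def df_ping_command(host, payload, system=None):
    """Build a single don't-fragment ping for Windows or Linux."""
    system = system or platform.system()
    if system == "Windows":
        # -f sets don't-fragment, -l sets the payload size
        return ["ping", "-n", "1", "-f", "-l", str(payload), host]
    # Linux iputils syntax; other systems use different flags
    return ["ping", "-c", "1", "-M", "do", "-s", str(payload), host]


def largest_passing_payload(probe, low=0, high=1472):
    """Binary search for the largest payload where probe(payload) is True."""
    best = None
    while low <= high:
        mid = (low + high) // 2
        if probe(mid):
            best = mid
            low = mid + 1
        else:
            high = mid - 1
    return best
```

If the largest passing payload on an affected machine comes out below 1472 while a healthy machine passes 1472, something on that path is using a smaller MTU. How you turn a ping result into True/False for `probe` depends on the OS, so check the ping output text rather than trusting the exit code alone.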

Are you doing network offloading on the NICs?

If using VLANs, are you doing inter-VLAN routing w/ SVI/VIP ACLs?

Are you running an application layer firewall or malware platform?

Does the problem exist when using Private mode / Incognito mode on the machines?

On a working machine, I would use ProcMon from https://docs.microsoft.com/en-us/sysinternals/downloads/procmon and create a filter for the browser process. I would then compare the results to a non-working machine.

On the non-working machines, have you created a new OS user profile to verify that the problem is not tied to the user profile?

Are you running any proxy, DLP, IDS/IPS that could be triggering on some machines?

This may sound weird, but in the past have any of these machines been connected to X25, Token Ring or any legacy platforms?

Even though machines and switches are all set to 1gb or auto and seem to be ok, have you done a packet capture to verify? I know in an old HP shop where I worked we had a ton of issues with the inter-mixing of Extreme Networks and Cisco Switches /w HP Servers and Desktops. Everyone was complaining of a similar issue and the desktops reported that proper negotiation was working, but when I did a packet capture it showed that instead of 100/full everything was running 100/half. We also found numerous desktops that were showing 100/full, but were on a hard coded 1gb/full port. There were several other weird variations of this behavior. The crappy thing is that depending on the desktop/server model the behavior was different. We ultimately came up with a spreadsheet for matching, then verified it was correct with packet captures. It was a royal pain in the ass since we had hundreds of servers /w teaming into redundant switch stacks.

If you have Mac or Linux hosts, do they experience the same problem? Install 'speedtest-cli' and run it from bash, does it experience the same behavior?

Since you appear to be using a cable provider and ATT fiber, both of which I run at home, I have a few other things I have run into that sound similar. I have Palo Alto and PFSense firewalls, both of which were problematic to set up.

Comcast Business Class provided me a modem that was supposed to be a simple bridge. I found out with packet captures on the cable side that they weren't properly bridging, and I was not seeing ARP poisoning, proxy ARP issues or duplicate IP address problems on my side of the provided device. I bought my own Cisco cable modem and replaced it and never had the problem again.

With ATT fiber 1g service they provide a handful of residential gateways (RG); Pace is the one I wound up with and continue to hate. There is no real bridge mode on any of their devices. They offer pass-through, router-behind-router and DMZ+ to forward to a single IP or MAC address, depending on the RG model. If I plug directly into the RG they provided I get 900/950mb speeds, but if I enable their methods to allow my firewall/router to get the public address the speeds are 20/120mb. I ended up taking my Palo Alto offline and going with a netgate running PFSense, which had the same exact problem. I configured the RG to hand out another RFC1918 range on the ports on their device and allowed my firewall to get an RFC1918 DHCP address. The problem went away, so I was curious. After a packet capture I found that when they were passing the public range to my firewall the RG was still holding onto the ARP for the /32 address, so both L2 devices were answering. After a fiber capture the RG was causing ARP poisoning and complaints of a duplicate IP address on the capture, but not in the RG logs.
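The "both L2 devices were answering" condition is easy to spot once the ARP replies are pulled out of a capture: the same IP shows up with more than one MAC. A minimal sketch of that check follows, assuming the replies have already been extracted as (ip, mac) pairs. The function name and the input format are made up for illustration.

```python
from collections import defaultdict


def duplicate_arp_claims(replies):
    """Find IPs that more than one MAC address answered for.

    `replies` is an iterable of (ip, mac) pairs taken from ARP replies
    in a capture (hypothetical input format).
    Returns {ip: sorted list of MACs} for every IP with two or more MACs.
    """
    macs_by_ip = defaultdict(set)
    for ip, mac in replies:
        # Normalise case so the same MAC is not counted twice.
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs)
            for ip, macs in macs_by_ip.items() if len(macs) > 1}
```

Any IP that comes back from this with two MACs is worth a closer look, especially if one of them belongs to the provider's gateway and the other to your own firewall.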

On both Comcast and ATT my Palo Alto and PFSense /w Snort see normal traffic as deceptive or spoofed due to some of the weird behavior that was occurring. I ended up just leaving the double NAT and moving along since I run services in AWS /w no hosting at my house.

If you already resolved the issue, let me know since I always enjoy hearing new problems and the ultimate resolution. :)


Ken