Question Strange networking problem

Jul 15, 2022
I have, what I believe to be, a very strange networking problem and haven't found any resolution. The short version is this: I have several machines on a home network (connected to each other via WiFi) that each have no issues browsing the Internet, and frequently have no issues talking to each other (e.g. via SSH/PING). However, almost as frequently, one or more machines cannot reach other machines, giving errors like 'ssh: connect to host 192.168.0.24 port 22: No route to host'. When this happens, the machine that is failing to connect can access the Internet fine, and can ping the router (192.168.0.1) fine. It just cannot connect to other machines on the local area network. If I put the machine in a "ping loop", eventually, it will be able to connect (or rather get an ICMP response). At that point, if I interrupt the ping loop, I can ssh to the destination machine fine.

I frequently use the following trick, which also works: I'll use the example of one machine (MacBook Pro) with IP address 192.168.0.105, trying to connect to a Thinkpad running Ubuntu 20.04 LTS. I try to "ssh" and it fails (as above). Then, I ssh from 192.168.0.105 to a virtual host on the internet that has an OpenVPN tunnel to/from 192.168.0.24. From the virtual host, I SSH to the 192.168.0.24 machine (using, of course, a tun IP address of the VPN). Then, from 192.168.0.24, I ssh to 192.168.0.105, and once connected (or even once prompted for a password), I disconnect all sessions. At this point, I can now connect from 192.168.0.105 to 192.168.0.24 successfully.

I have several machines (Macs, Linux) on my home network. The "issue" I described above is not limited to that pair of computers. I have the same issue when (from, say a Mac at 192.168.0.20), I want to print a document. I have an HP printer connected to the local network (via WiFi) and macOS, in this case, says it cannot connect to the printer (192.168.0.23). The document is queued and sometime later (sometimes as much as 30 minutes later), it prints, once macOS has been able to connect to the printer.

So, in short, I have an intermittent network connection issue. I can "hack" it to connect (as described above), but this is highly inconvenient, and only works between the few machines (including Raspberry Pis) that have an OpenVPN tunnel to my virtual host.

Now, a few additional details since I'm sure someone will ask: I have an Internet connection to the house via cable. It comes into a cable modem. That cable modem is connected to a TP-Link (wired) router. That router has the IP address 192.168.0.1 (remember I said that all machines have NO issues PINGing the router -- even when they can't connect to other machines on the local network). Plugged into the (wired) router, I have an ASUS WiFi router. It is that WiFi router that is servicing all the hosts on the 192.168.0.0/24 subnet. I actually have a second WiFi router (TP-Link) that is connected (via ethernet cable) to the wired router. It hands out IP addresses on the 192.168.1.0/24 subnet. I don't think it is really relevant to my issue, which is between hosts on the 192.168.0.0/24 subnet.

Can anyone suggest any things that might be causing this? Any things I can try to narrow down the problem? I'm currently suspecting that my ASUS Wifi router is "having issues", but I don't know that for sure. I have no idea why just "waiting a while" allows me to connect between hosts on the 192.168.0.0/24 subnet. Thoughts?
 

kanewolf

Titan
Moderator
If you have two DHCP servers that could be the problem.
 
Jul 15, 2022
I’m quite sure that I’ve disabled the DHCP server in the wifi router that is servicing the hosts that are having issues with connections. All addresses should be being allocated by the single wired router at 192.168.0.1.
 
Maybe try to disconnect the second router using the 192.168.1.x subnet just to make things a bit easier to troubleshoot.

What IP do you have the LAN set to on the ASUS wifi router? I am assuming you are using that as an AP?

Devices on a LAN only use MAC addresses to talk to each other; they only pretend to use IP addresses, and ARP is what maps each IP to a MAC. Check the ARP entries to be sure you get consistent results. Something like a duplicate IP might be causing issues with the IP/MAC mapping.
 
Jul 15, 2022
I'll try disconnecting the WIFI router that is using the 192.168.1.x subnet the next time this happens and I'll post about the effect, if any.

The ASUS RT-AC3100 wifi router is connected to my TP-Link wired router at address 192.168.0.22 (hardwired address). And yes, that wifi router is set in Access Point mode.

The other wifi router is a TP-Link Archer C1900 router and it is connected to my wired router at 192.168.0.112 -- the fact that the host address has a high number as its last octet suggests that it got its IP address from the TP-Link wired router via DHCP. I'm a little surprised I didn't set that one up with a hardwired IP address as well, but I guess I didn't. I cannot connect to that router from my 192.168.0.0/24 subnet. I need to switch wifi connections to connect to the Archer C1900 via wireless, and then I can connect via some 192.168.1.xxx address (it is probably 192.168.1.1). I haven't done that in ages. I don't think the Archer is in AP mode -- doing so requires me to set the LAN IP address of the router to an address on my 192.168.0.0/24 subnet, but when I try, the admin interface tells me that the LAN IP address cannot be on the same subnet as the WAN interface. That router is physically connected to my 192.168.0.0/24 subnet, and therefore the WAN address is on that subnet (192.168.0.112). So it won't let me change the LAN IP address to be on that subnet. Therefore, I picked 192.168.1.1 as its IP address. It does hand out DHCP addresses on the 192.168.1.0/24 subnet.
 
That would be expected, for the same reason you can't get to a router from the internet. The router with 192.168.1.x treats your main network as the internet. You likely can set it to allow admin from the "internet" and then you could use the 192.168.0.x IP to admin it.

Unless you have a very good reason for having multiple subnets I would not recommend you do that just because it makes some things complex.
 
Jul 15, 2022
I actually would prefer having one subnet, but couldn’t figure out how. One wifi router isn’t sufficient for my house, hence two. Both are plugged into the wired router. They have different SSIDs and for some reason, my ASUS router can serve the 192.168.0.0 subnet, but my TP-Link can’t, hence the second subnet.

My WAN is actually my wired router for both. My ASUS router can have its LAN subnet the same as my hardwired router, but my TP-Link can’t.
 
You should be able to run all the "other" routers as APs. Almost all routers have the AP option. If the router does not have an AP mode, you can still use it as an AP: change its LAN IP so it does not conflict, disable its DHCP server, and then connect it through a LAN port rather than the WAN port.

You can set the SSIDs the same or different, depending on what you want to do, but it is all one network since your main router is providing the IPs via DHCP.
 
Jul 15, 2022
Thanks. I’ll try to figure out how to force my TP-Link Archer wifi router into AP mode. It doesn’t have an explicit mode setting (like my ASUS wifi router) and my attempts so far have failed to get it to act like an AP — perhaps it is because I had not disabled the DHCP server. I’ll try again. So you’re suggesting I plug my wired network into a LAN port rather than the WAN — I hadn’t thought of that. Thanks. I’ll try that.
While that will solve my issue of two subnets, I’m not sure it is related to my connection issue, but I’ll give it a whirl.
 
Jul 15, 2022
I've now configured my TP-Link Archer C1900 wifi router to be an access point, and it is no longer acting as a DHCP server. So the sole DHCP server is the one in the wired TP-Link Gigabit Broadband VPN Router (TL-R600VPN).

I still cannot reach my printer (192.168.0.23) from my laptop (192.168.0.25). The printer driver claims the printer is offline. Attempts to ping it or use curl on its HTTP interface are met with messages indicating it is offline (unreachable). At this moment, I can reach the linux laptop at 192.168.0.24 and the chromebox at 192.168.0.109, but this is probably because I had open SSH sessions with these two machines from 192.168.0.25 (the machine I'm on right now). If I drop those connections, I suspect after some time, I'll also find them (again) unreachable.
 
I would check the ARP table and make sure the IP/MAC addresses appear to be correct. The devices should be able to talk directly to each other. If, for example, you clear the ARP table and then ping the IP, it should repopulate the ARP entry. It should do this even if the device does not actually respond to the ping. If ARP does not work, then I would check the subnet masks for consistency.
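
The subnet-mask consistency check can be sketched in Python with the standard `ipaddress` module. This is only an illustration: the addresses come from the thread, but the /24 masks are an assumption based on the 192.168.0.0/24 subnet described, and the /25 entry is a made-up example of what one misconfigured host would look like. The function name is hypothetical.

```python
import ipaddress

def implied_networks(hosts):
    """Return the set of distinct networks implied by each host's address/mask.

    `hosts` maps a label to an "address/prefix" string. If every host is
    configured consistently, the returned set has exactly one element.
    """
    return {ipaddress.ip_interface(cfg).network for cfg in hosts.values()}

# Assumed configuration: addresses are from the thread, masks are a guess.
consistent = {
    "macbook": "192.168.0.25/24",
    "thinkpad": "192.168.0.24/24",
    "printer": "192.168.0.23/24",
}

# Hypothetical misconfiguration: one host with a /25 mask instead of /24.
mismatched = dict(consistent, thinkpad="192.168.0.24/25")

print(len(implied_networks(consistent)), "network(s) for the consistent set")
print(len(implied_networks(mismatched)), "network(s) for the mismatched set")
```

More than one network in the result means at least one host disagrees with the others about where the local subnet ends, which changes which peers it will try to reach directly via ARP.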
 

kanewolf

The other possibility is that you need to reboot your devices ( including the printer ) to ensure that the DHCP request is now serviced by your one source.
 
Jul 15, 2022
So after waiting about 20 minutes, the print job that had been queued from my mac (192.168.0.25) to my printer (192.168.0.23) finally printed. Up to that point, the printer driver had been reporting that the printer was offline (it wasn't). Now, I can ping the printer just fine. But my linux laptop (192.168.0.24) is now failing to PING.

Code:
➜  ~ ping 192.168.0.24
PING 192.168.0.24 (192.168.0.24): 56 data bytes
Request timeout for icmp_seq 0
Request timeout for icmp_seq 1
^C
--- 192.168.0.24 ping statistics ---
3 packets transmitted, 0 packets received, 100.0% packet loss
So I ran arp, and got this:

Code:
➜  ~ arp -a
? (169.254.34.55) at ce:76:c3:80:cc:57 on en0 [ethernet]
? (192.168.0.1) at f4:f2:6d:27:11:b2 on en0 ifscope [ethernet]
imac (192.168.0.20) at 98:10:e8:f3:79:46 on en0 ifscope [ethernet]
? (192.168.0.21) at 98:9e:63:38:ba:3e on en0 ifscope [ethernet]
? (192.168.0.23) at 5c:b9:1:32:1b:d7 on en0 ifscope [ethernet]
thinkpad (192.168.0.24) at (incomplete) on en0 ifscope [ethernet]
? (192.168.0.27) at f4:f2:6d:bd:c2:9e on en0 ifscope [ethernet]
? (192.168.0.101) at d8:ec:5e:e:53:44 on en0 ifscope [ethernet]
? (192.168.0.103) at be:39:37:19:7b:98 on en0 ifscope [ethernet]
elasticsearch (192.168.0.104) at ac:67:b2:3f:2c:90 on en0 ifscope [ethernet]
? (192.168.0.106) at 4a:35:23:46:93:5e on en0 ifscope [ethernet]
? (192.168.0.108) at c0:f2:fb:34:10:57 on en0 ifscope [ethernet]
? (192.168.0.110) at d0:3:4b:6:a3:d4 on en0 ifscope [ethernet]
? (192.168.0.111) at 36:bc:f8:8c:4:43 on en0 ifscope [ethernet]
? (192.168.0.115) at d8:ec:5e:7:f5:d2 on en0 ifscope [ethernet]
? (192.168.0.116) at d8:ec:5e:8:0:4a on en0 ifscope [ethernet]
? (192.168.0.118) at 44:67:55:a:da:39 on en0 ifscope [ethernet]
? (192.168.0.119) at d8:ec:5e:e:51:f4 on en0 ifscope [ethernet]
? (192.168.0.123) at c8:d0:83:c7:d7:cb on en0 ifscope [ethernet]
? (192.168.0.124) at d8:ec:5e:8:6:46 on en0 ifscope [ethernet]
? (192.168.0.126) at d8:ec:5e:e:55:f0 on en0 ifscope [ethernet]
? (192.168.0.255) at ff:ff:ff:ff:ff:ff on en0 ifscope [ethernet]
? (224.0.0.251) at 1:0:5e:0:0:fb on en0 ifscope permanent [ethernet]
? (230.230.230.230) at 1:0:5e:66:e6:e6 on en0 ifscope permanent [ethernet]
? (239.255.255.250) at 1:0:5e:7f:ff:fa on en0 ifscope permanent [ethernet]
broadcasthost (255.255.255.255) at ff:ff:ff:ff:ff:ff on en0 ifscope [ethernet]

I made that request from 192.168.0.25, and note that my ping target (thinkpad, 192.168.0.24) is shown as "incomplete".
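
For anyone wanting to repeat this check across several snapshots, here is a minimal Python sketch that parses macOS-style `arp -a` output like the listing above, reports unresolved ("incomplete") entries, and compares two snapshots to see whether any IP changed MAC (one symptom of a duplicate IP). The function names are hypothetical, the regular expression only assumes the line format shown in this post, and the changed MAC in the second snapshot is invented for illustration.

```python
import re

# Matches lines such as:
#   thinkpad (192.168.0.24) at (incomplete) on en0 ifscope [ethernet]
ARP_LINE = re.compile(r"^(\S+) \((\d+\.\d+\.\d+\.\d+)\) at (\S+) on ")

def parse_arp(output):
    """Parse `arp -a` text into a dict of ip -> MAC (None if incomplete)."""
    table = {}
    for line in output.splitlines():
        match = ARP_LINE.match(line.strip())
        if not match:
            continue
        _name, ip, mac = match.groups()
        if mac == "(incomplete)":
            table[ip] = None
        else:
            # macOS drops leading zeros ("5c:b9:1:..."), so pad each octet.
            table[ip] = ":".join(octet.zfill(2) for octet in mac.split(":"))
    return table

def unresolved(table):
    """IPs for which ARP never received a reply."""
    return sorted(ip for ip, mac in table.items() if mac is None)

def changed(before, after):
    """IPs whose resolved MAC differs between two snapshots."""
    return {
        ip: (before[ip], after[ip])
        for ip in before.keys() & after.keys()
        if before[ip] and after[ip] and before[ip] != after[ip]
    }

SAMPLE = """\
? (192.168.0.1) at f4:f2:6d:27:11:b2 on en0 ifscope [ethernet]
? (192.168.0.23) at 5c:b9:1:32:1b:d7 on en0 ifscope [ethernet]
thinkpad (192.168.0.24) at (incomplete) on en0 ifscope [ethernet]
"""

snapshot = parse_arp(SAMPLE)
print("unresolved:", unresolved(snapshot))

# Hypothetical later snapshot in which the printer's IP maps to another MAC.
later = dict(snapshot, **{"192.168.0.23": "aa:bb:cc:dd:ee:ff"})
print("changed:", changed(snapshot, later))
```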

So I cleared the ARP table, and reran the arp command:

Code:
➜  ~ arp -a
? (192.168.0.1) at f4:f2:6d:27:11:b2 on en0 ifscope [ethernet]
? (224.0.0.251) at 1:0:5e:0:0:fb on en0 ifscope permanent [ethernet]

Then I PINGed the thinkpad:
Code:
➜  ~ ping 192.168.0.24
PING 192.168.0.24 (192.168.0.24): 56 data bytes
Request timeout for icmp_seq 0
Request timeout for icmp_seq 1
Request timeout for icmp_seq 2
^C
--- 192.168.0.24 ping statistics ---
4 packets transmitted, 0 packets received, 100.0% packet loss

And then reran arp:

Code:
➜  ~ arp -a
? (192.168.0.1) at f4:f2:6d:27:11:b2 on en0 ifscope [ethernet]
imac (192.168.0.20) at 98:10:e8:f3:79:46 on en0 ifscope [ethernet]
thinkpad (192.168.0.24) at (incomplete) on en0 ifscope [ethernet]
? (192.168.0.111) at 36:bc:f8:8c:4:43 on en0 ifscope [ethernet]
? (192.168.0.123) at c8:d0:83:c7:d7:cb on en0 ifscope [ethernet]
? (224.0.0.251) at 1:0:5e:0:0:fb on en0 ifscope permanent [ethernet]
➜  ~

From experience, I know that I can (from 192.168.0.25) SSH to a Linode host on the internet, SSH back to 192.168.0.24 over an OpenVPN tunnel on a TUN device, SSH to 192.168.0.25, and then kill all those connections. After that, I can successfully PING/SSH to 192.168.0.24. How is this even possible? What could possibly cause this weirdness?
 
Jul 15, 2022
"The other possibility is that you need to reboot your devices (including the printer) to ensure that the DHCP request is now serviced by your one source."

I did that, with no effect. However, I don't think that was the problem originally. The printer is connected (via wifi) to the ASUS wifi router (192.168.0.22), which in turn is connected to the TPLINK wired router, gateway, and DHCP server (192.168.0.1). The "other" subnet (which is no longer because, as described earlier, I got rid of it), which HAD been served by the other wifi router, was not actually reachable from the 192.168.0.0/24 subnet, and thus the DHCP server there (which is no longer enabled) could not have been used. It would have handed out addresses on the 192.168.1.0/24 subnet anyway, and my printer has always reported that its address was 192.168.0.23. Therefore, it could only have gotten its DHCP-acquired IP address from my TPLINK wired router -- which has a permanent entry using its mac address to get the fixed IP address 192.168.0.23 assigned.

In any case, I now only have one subnet (192.168.0.0/24), one DHCP server (from the wired router), and two WIFI routers -- each acting as an access point, and neither with its DHCP server enabled.

And the problem is still happening!
 
Jul 15, 2022
So going back to the basics and only using three devices so as not to add to the confusion, we have two laptops (192.168.0.25 and 192.168.0.24) and one printer (192.168.0.23) all connected wirelessly via an ASUS router. Frequently, I find that 192.168.0.25 cannot PING/SSH 192.168.0.24 and cannot connect to the printer. Sometimes, I can connect to one of the other devices from 192.168.0.25, but not the other. All three devices can PING/CURL/SSH to other devices on the Internet -- even when I'm in the can't-connect state. All of them, on their own, report their correct IP addresses. And then all it takes is some random amount of time, usually less than 10-20 minutes, before connections work between devices that previously didn't connect. ARP entries show "incomplete". There is a single DHCP server in the picture, on the wired router to which the WIFI routers are connected. Sometimes, I find that although I cannot connect from 192.168.0.25 to 192.168.0.24, if I log in to 192.168.0.24 directly (using a connected keyboard), I CAN ping/connect to 192.168.0.25. Perhaps that is always the case, because, as mentioned earlier, I can VPN into 192.168.0.24 from the internet and then connect to 192.168.0.25 that way.

Can anyone think of any reason why the three devices can get IP addresses via DHCP and can connect to the internet fine (always), but INTERMITTENTLY can't talk to each other?
 

kanewolf

Asus does have "guest" WIFI capabilities. You don't have that enabled do you? If you have it enabled, is it remembered and set to autoconnect?
 
Jul 15, 2022
It does actually, and I did enable it some time ago, but none of the devices in question know of it or its password. I’m pretty sure it is unrelated, but if you think I should disable it, I can do that.
 
Jul 15, 2022
I just tried another experiment. I found myself in the state where I couldn't connect from 192.168.0.25 to 192.168.0.24. Then, I used 192.168.0.25 to connect to the admin web interface of my ASUS wifi router and listed the clients. I didn't see 192.168.0.24 there. I used the keyboard on 192.168.0.24 to check its wifi connection. It still claimed 192.168.0.24 was its IP address, and the only connected interface was the wifi interface to the ASUS router. So I went to a web browser on 192.168.0.24 and successfully browsed the net. So clearly it had a connection through the ASUS router to the Internet. Its address was still 192.168.0.24. So I went back to the ASUS web interface, and the router didn't immediately show 192.168.0.24 as one of its clients, but eventually it did. I then went back to 192.168.0.25 and tried to connect to 192.168.0.24. Same issue as before -- "Operation timed out" trying to access port 22. I repeatedly tried to connect (using ssh) with the same result a few times, and then successfully connected.

I have no idea what that tells us, if anything. But why would 192.168.0.24 have that IP address (not hardwired, but acquired from the DHCP server running on my wired router at 192.168.0.1), be able to connect to the Internet and browse fine, but not show up in the client list for the ASUS WIFI router -- at least not contemporaneously with an active browser session, and not be able to be connected to from 192.168.0.25 -- and then, minutes later, all be "fine".

Oh, and I disabled the guest network on the ASUS wifi router, with no (positive/noticeable) effect.
 
So what happens if you brute force this?

Can you manually set the IP in the devices and then go in and add manual ARP entries? You really should do both sides.

It may also be interesting to run Wireshark and see what messages you are getting. You would think you would always see the ARP requests, since these are broadcast messages.
 
Jul 15, 2022
Ok, I updated the IP address config on both 192.168.0.24 and 192.168.0.25 to be fixed (not DHCP acquired). And on 192.168.0.25, I did an:

Code:
sudo arp -s 192.168.0.24 a0:88:b4:33:8f:cc

And on 192.168.0.24, I did an:
Code:
sudo arp -s 192.168.0.25 f0:f8:ff:c2:45:ee:85

Then, immediately after that, I attempted (successfully) to ssh from 192.168.0.25 to 192.168.0.24 and vice-versa. Now, this isn't conclusive, because there are certainly times when I had been able to connect from one to the other (and vice-versa) prior to this change. But, at least I do have a connection right now. I'll let you know whether I run into the "issue" again. Thanks for that suggestion. That should help narrow things down.
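
A static entry only helps if the hardware address in it is right, so it can be worth validating the address before handing it to `arp -s`. The following is a minimal sketch under that assumption; the function names are hypothetical, and the only external command it relies on is `arp -s IP MAC` as used above (run with `sudo`).

```python
import re

def normalize_mac(mac):
    """Return `mac` as six zero-padded lowercase hex octets, or raise ValueError."""
    octets = mac.split(":")
    well_formed = len(octets) == 6 and all(
        re.fullmatch(r"[0-9a-fA-F]{1,2}", octet) for octet in octets
    )
    if not well_formed:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return ":".join(octet.lower().zfill(2) for octet in octets)

def static_arp_command(ip, mac):
    """Build the argument list for `arp -s IP MAC`, validating the MAC first."""
    return ["arp", "-s", ip, normalize_mac(mac)]

print(static_arp_command("192.168.0.24", "a0:88:b4:33:8f:cc"))

# An address with the wrong number of octets is rejected rather than passed on.
try:
    static_arp_command("192.168.0.25", "00:11:22:33:44:55:66")
except ValueError as err:
    print("rejected:", err)
```

The resulting list could be passed to `subprocess.run` with `sudo` prepended; that step is left out here because it changes system state.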

Can I accomplish the same packet inspection with tcpdump as with wireshark? I'm using SSH to connect between a Mac (.25) and Linux (.24) and don't have Wireshark installed on either at this point.
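
For what it's worth, tcpdump and Wireshark both capture through libpcap and accept the same capture-filter syntax, so tcpdump over SSH should show the same ARP traffic. Below is a minimal sketch that assembles such a command line; the helper name is hypothetical, `en0` is the Mac interface seen in the `arp -a` output earlier, and the interface name on the Linux side is not given in the thread, so it would need to be looked up. Capturing normally requires root.

```python
import shlex

def arp_capture_command(interface, peer_ip=None):
    """Build a tcpdump argument list that prints ARP traffic on `interface`.

    If `peer_ip` is given, ICMP to or from that host is included as well,
    so pings and the ARP requests they trigger appear side by side.
    """
    # -n: no name lookups, -e: print link-level (MAC) headers, -i: interface.
    argv = ["tcpdump", "-n", "-e", "-i", interface]
    if peer_ip is None:
        expression = "arp"
    else:
        expression = f"arp or (icmp and host {peer_ip})"
    return argv + [expression]

# Print a shell-ready command to paste into an SSH session (prefix with sudo).
print(shlex.join(arp_capture_command("en0", "192.168.0.24")))
```

Running the capture on both machines while pinging would show whether the broadcast ARP request leaves one host and whether it ever arrives at the other.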
 
Jul 15, 2022
For the fun of it, I connected directly to a Chromebox (192.168.0.28) and proved that I could access the internet via a browser. I then tried to SSH to the Chromebox from 192.168.0.25 and got the "operation timed out" error. From the Chromebox, I tried to connect to 192.168.0.25 and got the "Destination host unreachable" error. I haven't done the arp hack here (hardcoding the other machine's MAC/IP binding on each machine). I'm loath to have to do this for the cross-product of the connections I wish to allow, but I can do it for this pair as well, if you think it will tell us anything.
 
Jul 15, 2022
Regarding the need for an option to arp to make the entries permanent: I didn't provide any; however, I do see in the arp -a output that the bindings are marked permanent. So maybe that is the default? (Or maybe only for this boot?)
 
Jul 15, 2022
For the fun of it, I added ARP table entries to 192.168.0.24 and 192.168.0.28 for the other, and now I can SSH without issue from one to the other. Just before I did this, I couldn't connect from either machine to the other.

So I now have two cases where adding ARP entries manually has allowed connections between the machines. What does that tell us? That the issue is with the ARP protocol going over the ASUS wifi router?