Funky ESXi route behavior

oikjn

Ars Scholae Palatinae
969
Subscriptor++
Hoping the hive-mind can help me out here. I have two ESXi hosts connected to NFS datastores. I created a virtual switch that has two portgroups attached to it: one for Management and one for storage. Both portgroups are set up on the default TCP/IP stack, and each has its own VMkernel NIC. For some reason, ONE of the two hosts has decided that it wants to talk to the NFS shares through the management NIC, even though the NFS share is on the same subnet as the storage NIC. The management NIC is the one with the default gateway. I can't find a setting difference between the two hosts, and the routing table appears OK to me.

Network     Netmask         Gateway     Interface  Source
----------  --------------  ----------  ---------  ------
default     0.0.0.0         10.55.55.1  vmk2       DHCP
10.55.55.0  255.255.255.0   0.0.0.0     vmk2       MANUAL
10.44.1.0   255.255.255.0   0.0.0.0     vmk0       MANUAL


The NFS share is at 10.44.1.123, so I would think it would want to go over the vmk0 interface, but it isn't for one host... any ideas why not?
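
For anyone who wants to poke at the same thing, this is roughly how I've been checking which vmkernel interface the NFS traffic actually uses from the ESXi shell (10.44.1.123 is my filer address from above; vmk numbers will differ per host):

# show live TCP connections to the filer; the Local Address column shows
# which vmkernel IP (and therefore which vmk) the NFS client is bound to
esxcli network ip connection list | grep 10.44.1.123

# dump the routing table for the default TCP/IP stack
esxcli network ip route ipv4 list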
 

oikjn

Ars Scholae Palatinae
969
Subscriptor++
Can you ping the storage from the CLI when specifying vmk0 as the source?


Is that VMK enabled for storage traffic and disabled for management? I forget the exact checkboxes.
Good question... I did think that maybe it was a VLAN config issue on a random switch port not accepting the storage network traffic, but I hadn't tested the ping through the interface as you suggested. I just ran vmkping -I vmk0 10.44.1.## for each NFS share on both hosts, and both are able to ping the storage share IPs without issue. Regarding which services are checked... all are identical, with Provisioning, Management, Replication, and NFC replication checked.
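
In case it's useful, here's roughly what I ran, plus a shell-side way to compare the service tags on each vmk instead of eyeballing the checkboxes (vmk0 and vmk2 are my storage and management interfaces; yours may differ):

# ping the filer while forcing the storage vmk as the source interface
vmkping -I vmk0 10.44.1.123

# list the services tagged on each vmkernel interface
esxcli network ip interface tag get -i vmk0
esxcli network ip interface tag get -i vmk2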

When you say 'has decided', you mean it was for sure working correctly and now isn't? Or maybe never was?

I would basically just double, double check everything is correct. Subnet mask on all the parts, link status etc.
Could be that I just noticed it... it's probably always been like this, but "always" is a bit of a short time period... we just migrated from one NetApp NAS to another... I set the two up identically (or so I thought :biggreen:). I think everything is OK at the layer-2 level. Guess it's time to open up that support case with VMware to help figure this one out.
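
For the record, this is roughly the checklist I went through on both hosts, nothing exotic, just the stock esxcli output:

# vmk interfaces, the port group and vSwitch each one sits on, and MTU
esxcli network ip interface list

# IP address and netmask bound to each vmk
esxcli network ip interface ipv4 get

# link state, speed, and duplex on the physical uplinks
esxcli network nic list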
 

oikjn

Ars Scholae Palatinae
969
Subscriptor++
Stupid question, but does that mean the NIC order is the same? I forget exactly where, but you can manually select the NIC failover order.
I'm not sure about that... are you talking about the physical NICs on the virtual switch? The issue isn't with them, but with the VMkernel NICs: traffic is going out the one VMkernel NIC that requires the data to transit a gateway instead of out the VMkernel NIC that is on the same VLAN as the storage. Both VMkernel NICs are on the same virtual switch and both are members of the default TCP/IP stack.
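
That said, the failover order is easy enough to check from the shell. Assuming a standard vSwitch named vSwitch0 and a port group literally named "Storage" (swap in the real names), it can be read at both levels:

# teaming/failover policy at the vSwitch level
esxcli network vswitch standard policy failover get -v vSwitch0

# per-port-group override of that policy, if any
esxcli network vswitch standard portgroup policy failover get -p "Storage"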
 

Danger Mouse

Ars Legatus Legionis
38,092
Subscriptor

I think this is it.

I seem to recall, either on YellowBrick Road or another VMware blogger site, that it could happen in 5.5 and even 6.

One would think there's no way it should be happening on 7.x and your environment.
 

oikjn

Ars Scholae Palatinae
969
Subscriptor++
Hmmm... well, I'm on 8.0 here, and I would hope it would have been long settled by now. The virtual switch's uplink team had two ports and one was set to standby. I set it to active, but I'm not seeing any difference. I have a ticket open with VMware at this point. I'll see what they come up with and post an update here when I have something to report.
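
For reference, I flipped the standby uplink to active from the shell; this is roughly the command, assuming a vSwitch named vSwitch0 with uplinks vmnic0 and vmnic1 (your names will differ):

# make both uplinks active on the vSwitch team (vmnic1 was previously standby)
esxcli network vswitch standard policy failover set --vswitch-name=vSwitch0 --active-uplinks=vmnic0,vmnic1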
 

Paladin

Ars Legatus Legionis
32,552
Subscriptor
ESXi can be surprisingly weird with networking because of its Linux underpinnings. (I've had repeated history with Linux sending traffic out the wrong interface, etc.)

I just had one the other day on the latest 7.x release: a single configured vSwitch with failover links (2 ports, one active and one standby). Both were linked up and looking fine. Management access to ESXi was fine, but the VM, which was configured for the same subnet, etc., would not work until we unplugged one of the links. I'm guessing something in the switch config was the real culprit, but I didn't get to see that part of it. I just suggested that everything looked fine but it might not hurt to try disconnecting one of the links and see what happens, and bingo. That brought it back online. 🤷‍♂️ Gotta love it when your reliability feature makes things unreliable.
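
If pulling cables isn't convenient, the same test can usually be done from the shell by dropping one uplink out of the standard vSwitch and adding it back afterwards (vSwitch0 and vmnic1 here are placeholders for whatever the host actually uses):

# temporarily remove one uplink from the vSwitch
esxcli network vswitch standard uplink remove -u vmnic1 -v vSwitch0

# put it back once you've seen whether the behavior changes
esxcli network vswitch standard uplink add -u vmnic1 -v vSwitch0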
 

oikjn

Ars Scholae Palatinae
969
Subscriptor++
@Paladin yes, I'm suspecting this is something like that. I've now got too many tickets open, with too many vendors, on what might all be the same issue... each ESXi host has four ports split into two port groups: 2 ports for VM traffic and 2 for mgt/storage. Both were set up with one port active and one in standby... I recently noticed thousands of STP flapping port notifications where the active VM data link would go down. Over the past couple of days it would happen on one host or the other. I got this to stop by making both ports active :confused: But this setup has been stable without issue for a long time, so I don't know what the deal is.
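
The flapping shows up on the ESXi side too if I go looking for it, which at least makes it easy to line the events up with the switch's STP notifications (these are the standard ESXi log locations):

# uplink redundancy lost/restored events per physical NIC
grep -i vmnic /var/log/vobd.log | tail -n 20

# raw link up/down messages from the vmkernel
grep -i link /var/log/vmkernel.log | tail -n 20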

This all started because suddenly my NFS datastores on my NetApp AFF units decided that they would take 15+ minutes to commit a snapshot... normal IO/throughput on the datastores appears fine, but snapshots take 15+ minutes to complete all of a sudden. So yeah... between Veeam, NetApp, VMware, and Fortinet... something isn't playing right anymore.
 

oikjn

Ars Scholae Palatinae
969
Subscriptor++
No update on the network issue, but I think I found the source of the snapshot problem at least... I had implemented snapshot.alwaysAllowNative = TRUE, and removing that setting got the snapshots creating at normal speed again. I'm assuming that is a VAAI functionality issue I'll have to work out. I had moved all the VMs off the NFS datastores, so until I move things back, I can't generate the NFS traffic needed to verify it is now going through the path it is supposed to.
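
Since snapshot.alwaysAllowNative looks tied to the NFS VAAI offload path, my plan is to confirm the NAS VAAI plugin and the datastores' hardware-acceleration status look sane before I move VMs back (treat the grep as a rough filter for whatever VIB name NetApp actually ships):

# is a NAS VAAI plugin VIB installed on the host?
esxcli software vib list | grep -i -E 'netapp|nas'

# NFS mounts and whether hardware acceleration is reported for them
esxcli storage nfs list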