A few days back I got involved in a virtual machine (VM) migration issue. The VMs were supposed to be migrated from one vSphere cluster to another, with no possibility of mounting a datastore from one cluster on the other. The VMs couldn’t be powered off during the migration, so the only option we had was to use the Migrate option in the vSphere Web Client and select “Change both host and datastore”.
We had to move about 10 VMs. Six of them completed without any errors, but for the remaining 4 VMs the process was interrupted at random. The same VM could run for anywhere between 10 minutes and 2 hours before failing, meaning the failure could occur on any of its VMDKs. VMs with only 1 VMDK failed as well. When changing both host and datastore, the VMDKs are migrated one at a time, not several in parallel as when you change only the datastore and not the host.
The error we got was:
- A general system error occurred: Migration to host <x.x.x.x> failed with error Already disconnected (0xbad002e)
- Migrating the active state of virtual machine
The initial investigation covered:
- IP information for all vMotion interfaces, including:
  - IP addresses
  - Subnet masks
- Network connectivity between all potential hosts
- Limiting the migration process to specific sender and receiver ESXi hosts
- The VM's vmware.log file in the VM home directory
- The vmkernel log file on both the sender and receiver ESXi hosts
- The ESXi hosts, checked for dropped packets using esxtop during the migration process
- The network switches, checked for dropped packets
- The Spanning Tree configuration, which was not actually in use but was on our list of things to go through
- Verifying that the latest NIC driver was in use by running the following command:
What got us on the right track was checking for errors on the ESXi host network adapters using the following command:
- ethtool -S vmnic0 | grep -i error
- The command reported a large number of “rx_missed_errors”
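When there are many hosts or adapters to check, the same filtering can be done on saved command output. The following is a minimal sketch in Python, assuming the output uses the usual "name: value" layout that ethtool -S prints; the function name and the sample numbers are made up for illustration and are not from our environment.

```python
def parse_error_counters(ethtool_output: str) -> dict:
    """Return the nonzero counters whose name contains 'error'."""
    counters = {}
    for line in ethtool_output.splitlines():
        name, sep, value = line.partition(":")
        name, value = name.strip(), value.strip()
        # Skip headers, non-error counters, and anything without a numeric value.
        if not sep or "error" not in name.lower() or not value.isdigit():
            continue
        if int(value) > 0:
            counters[name] = int(value)
    return counters


# Hypothetical sample text in the "name: value" layout (values are invented).
SAMPLE = """\
NIC statistics:
     rx_packets: 1840231
     tx_packets: 1523310
     rx_errors: 0
     rx_missed_errors: 48213
     tx_errors: 0
"""

print(parse_error_counters(SAMPLE))  # {'rx_missed_errors': 48213}
```

Only counters that are both error-related and nonzero are returned, which makes a single bad adapter easier to spot than scrolling through the full statistics.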
When I have seen this before, it has been related to “Flow Control” not being turned on. Once the networking team turned on Flow Control on the switches, the rx_missed_errors counter stopped increasing and the Storage vMotion processes completed successfully on the first attempt.
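The "counter did not increase" check boils down to comparing two snapshots of the counters, one taken at the start of a migration and one at the end. Here is a minimal sketch in Python, assuming the counters have already been collected into dictionaries; the helper name and all values are hypothetical.

```python
def increased_counters(before: dict, after: dict) -> dict:
    """Return {name: delta} for counters that grew between two snapshots."""
    deltas = {}
    for name, new_value in after.items():
        # A counter missing from the first snapshot is treated as starting at 0.
        delta = new_value - before.get(name, 0)
        if delta > 0:
            deltas[name] = delta
    return deltas


# Hypothetical snapshots (invented values) around two migration attempts.
without_fc_start = {"rx_missed_errors": 48213, "rx_errors": 0}
without_fc_end = {"rx_missed_errors": 51907, "rx_errors": 0}
with_fc_start = {"rx_missed_errors": 51907, "rx_errors": 0}
with_fc_end = {"rx_missed_errors": 51907, "rx_errors": 0}

print(increased_counters(without_fc_start, without_fc_end))  # {'rx_missed_errors': 3694}
print(increased_counters(with_fc_start, with_fc_end))        # {}
```

An empty result after the change is the signal we were looking for: the absolute counter value stays high, since it is cumulative, but it no longer grows while the migration is running.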