vCloud Director Management cluster disaster recovery

A while ago I needed to perform a vCloud Director (vCD) management cluster disaster recovery. This blog post has four main sections:

  • Infrastructure – describes the infrastructure used by the vCloud Director Management cluster environment.
  • Rollback reasons – describes why the HP LeftHand P4300 storage volume rollback was carried out.
  • Rollback preparation – describes the steps taken before the HP LeftHand P4300 storage volume rollback was carried out.
  • Rollback procedure – describes the steps taken during and after the HP LeftHand P4300 storage volume rollback.

Infrastructure

The customer vCD environment consists of:

  • One Management cluster running management virtual machines backed by 2 ESXi hosts running version 5.1.
  • One Resource cluster running only vCD based virtual machines backed by 5 ESXi hosts running version 5.1.
  • Storage is provided by an HP LeftHand P4300, to which the ESXi hosts connect using iSCSI.

[Figure: logical overview of the vSphere clusters backing the vCD environment]

The Management cluster consists of the following virtual machines:

  • vCenter Server version 5.1.0b 947939, including all vCenter related components, for the Management cluster.
  • vCenter Server version 5.1.0b 947939, including all vCenter related components, for the Resource cluster. 
  • vCloud Network and Security Manager (vCNS Mgr).
  • vCD cell running vCloud Director version 5.1.1 868405.
  • MSSQL version 2008 R2, where all the databases live.
  • NFS server for vCD.
  • HP Failover Manager (HP FOM) for HP LeftHand. This virtual machine runs on ESXi local storage.

The ESXi hosts in both the Management cluster and the Resource cluster use an iSCSI connection to the HP LeftHand P4300 storage. The Management cluster and the Resource cluster use different volumes.
An HP LeftHand P4300 snapshot is taken of all the storage volumes once a day, meaning we get a new crash-consistent version of all virtual machines every day.

All the virtual machines placed in the Management cluster, apart from the HP FOM, run on the same HP LeftHand P4300 storage volume. The volume is not accessed or used by any other ESXi hosts and does not contain any virtual machines apart from the Management cluster ones.

Rollback reasons

There were many reasons for performing the HP LeftHand storage volume rollback, including:

  • Misconfigurations in vCloud Director.
  • Misconfigurations in vCNS Mgr (VXLAN).
  • No vCNS Mgr backup existed.
  • An error in the vCNS Mgr UI prevented us from performing the required VXLAN configuration.
    [Screenshot: error shown in the vCNS Mgr UI]
    Even though this looks really bad, VXLAN functionality is not affected by the broken UI. Everything already configured and running continues to work.

Rollback preparation

These are the steps taken before the HP LeftHand P4300 storage volume rollback.

  1. Determine the rollback point in time – 2 days in our case.
  2. Determine how many virtual machines have been deployed or deleted in vCD after the chosen rollback point in time. I used scripts provided by Alan Renouf.
    The PowerCLI script to list deployed virtual machines (adjust the values to your specific environment):
    Get-VIEvent -MaxSamples 10000 -Start (Get-Date).AddDays(-2) | Where-Object {$_.GetType().Name -eq "VmCreatedEvent" -or $_.GetType().Name -eq "VmBeingClonedEvent" -or $_.GetType().Name -eq "VmBeingDeployedEvent"} | Sort-Object CreatedTime -Descending | Select-Object CreatedTime, FullFormattedMessage
    The PowerCLI script to list removed virtual machines (adjust the values to your specific environment):
    Get-VIEvent -MaxSamples 10000 -Start (Get-Date).AddDays(-2) | Where-Object {$_.GetType().Name -eq "VmRemovedEvent"} | Sort-Object CreatedTime -Descending | Select-Object CreatedTime, FullFormattedMessage
  3. There were 3 vApps (8 virtual machines) created between our chosen rollback point in time and the date we did the rollback.
    All vApps could be deleted before the rollback apart from 1 vApp containing one virtual machine. I added the vApp we needed to preserve to the vCD catalog and “Downloaded” the vApp Template to my management server.
  4. There were 8 vApps (14 virtual machines) deleted between our chosen rollback point in time and the date we did the rollback. This meant we had to remove these vApps from the vCD UI once the rollback was finished.
  5. Verified that the MSSQL backups taken every night to local disk in the virtual machine running MSSQL existed.
  6. Shut down all the vCD vApps.
  7. Shut down all the virtual machines running in the Management cluster, apart from the HP FOM, and made a clone/copy of them (just in case everything went wrong).
  8. Removed the Management cluster ESXi hosts' connections to the HP LeftHand P4300 storage:
    1. Shut down one ESXi host.
    2. Removed the iSCSI connection for the ESXi host running the HP FOM virtual machine.
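
The two event queries in step 2 can be folded into a single pass that labels each event as a creation or a removal. The following is a minimal sketch and not the scripts referenced above: the function names Get-VmChangeKind and Get-VmChangesSince are hypothetical, and it assumes an existing PowerCLI session (Connect-VIServer) to the vCenter Server managing the Resource cluster.

```powershell
# Event type names that indicate a virtual machine was added or removed.
$CreatedTypes = 'VmCreatedEvent', 'VmBeingClonedEvent', 'VmBeingDeployedEvent'
$RemovedTypes = @('VmRemovedEvent')

# Hypothetical helper: map a vCenter event type name to 'Created', 'Removed' or $null.
function Get-VmChangeKind {
    param([Parameter(Mandatory)][string]$EventTypeName)
    if ($CreatedTypes -contains $EventTypeName) { return 'Created' }
    if ($RemovedTypes -contains $EventTypeName) { return 'Removed' }
    return $null
}

# Hypothetical wrapper: list VM creations and removals since a given point in time.
# Assumes Connect-VIServer has already been run in this session.
function Get-VmChangesSince {
    param(
        [Parameter(Mandatory)][datetime]$Start,
        [int]$MaxSamples = 10000
    )
    Get-VIEvent -MaxSamples $MaxSamples -Start $Start |
        ForEach-Object {
            $kind = Get-VmChangeKind -EventTypeName $_.GetType().Name
            if ($kind) {
                [pscustomobject]@{
                    CreatedTime          = $_.CreatedTime
                    Change               = $kind
                    FullFormattedMessage = $_.FullFormattedMessage
                }
            }
        } |
        Sort-Object CreatedTime -Descending
}

# Example (requires a live vCenter connection), using the two-day window from step 1:
# Get-VmChangesSince -Start (Get-Date).AddDays(-2) | Export-Csv .\vm-changes.csv -NoTypeInformation
```

Keeping the classification in its own small function makes it easy to add further event types later without touching the query itself.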

Rollback procedure

These are the steps taken during and after the HP LeftHand P4300 storage volume rollback.

  1. Highlighted the appropriate volume in HP LeftHand P4300, right-clicked and selected Rollback, selected the appropriate snapshot and pressed OK.
  2. Connected the vSphere Client to the Management cluster ESXi host that was still powered on, connected the ESXi host to the HP LeftHand P4300 and performed a storage adapter rescan.
  3. Started the virtual machine running the databases and verified, by reading the MSSQL log file, that all databases were successfully recovered.
  4. Powered on the second ESXi host in the Management cluster.
  5. Started the Management vCenter Server, verified functionality and searched the log files for errors. Had to unregister and register a few Management cluster virtual machines since their current placement didn’t match the rolled-back vCenter Server database information.
  6. Started the virtual machine running vCenter Server for the Resource cluster, verified functionality and searched the log files for errors.
  7. Started the virtual machine providing the vCD NFS area.
  8. Started the vCNS Manager virtual machine and verified the correct VXLAN configuration was present and that the UI worked as expected.
  9. Started vCD and searched the log files for errors.
  10. Stopped (force stop) the 8 vApps marked as partially running because of the mismatch between the virtual machines known to vCD and the virtual machines existing in vCenter Server.
  11. Deleted, from vCD, the 8 vApps stopped in the previous step.
  12. Uploaded the “Downloaded” vApp Template to vCD.
    When uploading the vApp to vCD it is temporarily stored in the NFS area which is accessible through the directory /opt/vmware/vcloud-director/data/transfer/ on the vCD cell.
  13. Added the newly uploaded vApp Template to “My Cloud”.
  14. Consolidated the virtual machine in the newly created vApp.
  15. Started the newly created and consolidated vApp.
  16. Started all other vApps.
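
Steps 2 and 5 can also be scripted with PowerCLI. The sketch below is an illustration under stated assumptions rather than what was run during this rollback: the function names, host name, virtual machine name and datastore name are all hypothetical, and it assumes a Connect-VIServer session to the relevant ESXi host or vCenter Server. Note that Remove-VM without -DeletePermanently only removes the virtual machine from the inventory and leaves its files on the datastore.

```powershell
# Hypothetical helper: build the "[datastore] folder/name.vmx" path used when registering a VM.
# Assumes the VM folder and .vmx file share the VM name, which is common but not guaranteed.
function Get-VmxPath {
    param(
        [Parameter(Mandatory)][string]$Datastore,
        [Parameter(Mandatory)][string]$VMName
    )
    '[{0}] {1}/{1}.vmx' -f $Datastore, $VMName
}

# Step 2: rescan all storage adapters and VMFS volumes on a host after the rollback.
function Invoke-StorageRescan {
    param([Parameter(Mandatory)][string]$VMHostName)
    Get-VMHost -Name $VMHostName |
        Get-VMHostStorage -RescanAllHba -RescanVmfs | Out-Null
}

# Step 5: unregister a stale inventory entry and register the VM again from its .vmx file.
function Reset-VmRegistration {
    param(
        [Parameter(Mandatory)][string]$VMName,
        [Parameter(Mandatory)][string]$Datastore,
        [Parameter(Mandatory)][string]$VMHostName
    )
    # Without -DeletePermanently the VM files stay on the datastore.
    Remove-VM -VM (Get-VM -Name $VMName) -Confirm:$false
    $vmxPath = Get-VmxPath -Datastore $Datastore -VMName $VMName
    New-VM -VMFilePath $vmxPath -VMHost (Get-VMHost -Name $VMHostName)
}

# Example (placeholder names, requires a live connection):
# Invoke-StorageRescan -VMHostName 'esxi-mgmt-01'
# Reset-VmRegistration -VMName 'nfs01' -Datastore 'mgmt-vol01' -VMHostName 'esxi-mgmt-01'
```

Verify the actual .vmx location in the datastore browser before re-registering, since the path convention assumed here may not hold for renamed or cloned virtual machines.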

Overall this was a very successful process, so many thanks to VMware and HP for delivering functional, reliable and trustworthy software and hardware components. The HP LeftHand P4300 snapshot mechanism was self-explanatory and easy to use.
