The use of cloud services by organisations to transform their businesses is continually gaining momentum. Like many, we’ve enjoyed the benefits of using services in multiple clouds to dramatically evolve the services that we can deliver to and for our clients.
What we’ve found, however, is that there are several workloads that make sense for us (at least for now) to run in our datacentres. One reason is that these systems are monolithic in nature and, again at least for the moment, reside in big VMs rather than the ideal distributed architecture that we eventually want to get to.
These systems are as critical as what we run in the cloud, and if one of our datacentres experiences some kind of spontaneous combustion we will need to recover with minimal downtime and without data loss to meet our service level agreements. While re-deploying somewhere else with Ansible is the go-to, stateful systems need a VM-based recovery.
One of the on-premises platforms we have is VMware, which is using external storage replication between the datacentres.
This article describes how we went about evaluating, testing and putting into production an automated solution for VMware failover.
VMware Site Recovery Manager (SRM)
One positive we can take out of COVID-19 is that, as we moved to a distributed working model, we adopted a renewed focus on creating a backlog of things we wanted to improve and worked through them as a team. Keeping the team engaged and working on proactive initiatives, in addition to making sure BAU ran perfectly while remote, was something we fortunately saw a lot of success with.
Improving our VMware failover method is one of these proactive initiatives that we were able to see through to completion.
We’d recently migrated the systems that didn’t go to public cloud to a new network and VMware 6.7 environment across our two Equinix datacentres. The failover of workloads between the sites was semi-automated, but we wanted to be able to hit the “make it so” button in an emergency and have it all happen for us. The first and most obvious thing to do is to consider what the vendor can provide. VMware provides VMware Site Recovery Manager (SRM), which contains advanced failover capabilities as well as integration with storage to coordinate a site failover. We have been involved in some successful implementations of SRM with our clients and, depending on the requirements, would always entertain SRM as a viable option. VMware SRM is a comprehensive tool, but it comes at a cost that we chose to avoid.
Within Advent One the entire lifecycle management of our environments is achieved with Ansible. It’s a fantastic tool for deployment, configuration management and ad-hoc automated workflow tasks. While SRM makes sense for many, we didn’t feel it would make sense for us to adopt another tool given that we have a capable toolset at our disposal.
The position that we took was that if we can’t make it work with our existing tools, SRM would be there waiting for us, because not automating a critical process for us was completely out of the question.
Our existing disaster recovery process consisted of a combination of tools:
PowerCLI to automate the VMware processes
Bash scripts to integrate the storage components
The process was very complex, and any meaningful improvement in functionality would result in more complexity. The tough decision was made to let go of our mountain of shell scripts, even though they had served us well for over a decade, and develop a more efficient and practical solution.
Among the options considered was to start again using either of:
govc, a Go-based vSphere CLI built on govmomi
Python using pyvmomi
Both appeared attractive, particularly Python, as we use it a lot internally and have some development capability to make it happen. The decision was made not to use either and was based on one simple thing – was this the best use of time? Given it could be done more easily (and faster) with Ansible we elected to start with Ansible.
Red Hat Ansible Automation
Without going into too much detail on what Ansible is: in short, it’s an automation tool that runs “playbooks” written in YAML, which can be used for provisioning, configuration management, application deployment and network management, amongst a variety of automation use cases. Ansible Tower is a web interface which provides several enterprise features beyond what the Ansible command line tool provides.
We have a very mature Ansible Tower deployment, which we use for automating as much of our managed service operations as we can. Throwing another use case at it made sense for us, given that we already use Ansible for provisioning, configuration management, patching and so on in the environment.
Ansible modules, which provide the integration between Ansible playbooks and the systems they are managing, are available for VMware vSphere and a variety of external storage. These are the two key elements we need to automate in our environment to perform the DR failover. If at any point the environment changes, or we wish to reuse this in another environment which may have different storage, minimal work would be required, which is also going to be handy for us.
Using Ansible Facts to pick up the pieces
Moving on to the actual implementation, the first consideration was the storing of Ansible Facts. Ansible Facts are gathered as part of a playbook run and can be queried as part of a playbook execution. A simple example is using Ansible Facts to identify the location of a VM’s VMX file for import. The obvious challenge this presents is that during a DR we may not be able to gather facts about the production environment, as it may well be engulfed in flames.
Ansible Facts are stored in JSON format and can be easily queried. Ansible can also query data from web services, so our solution was to have a webserver running in the management environment (which Ansible also builds if it doesn’t exist) and post Ansible Facts to it every night.
During DR we can simply query the webserver to identify the state of the environment and use those facts to reconstruct it. Obviously, this webserver wouldn’t live in an environment that you are trying to recover so it should be stored externally. A perfect use case for public cloud.
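To make the lookup concrete, here is a minimal sketch in Python (our actual implementation is an Ansible playbook, so this is illustration only). It assumes a hypothetical snapshot layout with a virtual_machines list whose entries carry name, datastore and vmx_path fields; those field names, the VM names and the datastore names are all made up for the example. In practice the snapshot text would be fetched from the fact webserver rather than embedded in the script.

```python
import json

# Hypothetical snapshot of stored facts. The layout and every name in it
# are assumptions for this sketch, not the format our playbooks write.
SNAPSHOT = """
{
  "virtual_machines": [
    {"name": "app01", "datastore": "prod_ds01",
     "vmx_path": "[prod_ds01] app01/app01.vmx"},
    {"name": "db01", "datastore": "prod_ds02",
     "vmx_path": "[prod_ds02] db01/db01.vmx"}
  ]
}
"""


def find_vmx_path(snapshot_text, vm_name):
    """Return the recorded .vmx path for vm_name, or None if it is unknown."""
    facts = json.loads(snapshot_text)
    for vm in facts.get("virtual_machines", []):
        if vm.get("name") == vm_name:
            return vm.get("vmx_path")
    return None


if __name__ == "__main__":
    print(find_vmx_path(SNAPSHOT, "db01"))
```

The point of the design is that the lookup depends only on last night's snapshot, so it still works when the production vCenter cannot be reached.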
Advent One Implementation
Our implementation was quite simple. We have a management environment that runs in the public cloud. We have a variety of management, monitoring and metrics tools running in here, but the two that are relevant to this process are an Ansible Tower cluster and our webserver acting as a Fact repository. Equinix Cloud Exchange (ECX) provides our connectivity to the public cloud and Equinix Metro Connect provides our connectivity between the datacentres. What’s also not pictured here is Equinix Connect, which provides internet to both sites.
The following tasks are triggered by Ansible automation.
Ansible gathers facts from vCenter and our Storage and posts them to the webserver.
This is the dr_facts.yml playbook in our DR failover Git repository, which runs on a Tower schedule.
To fail over, the dr_activate.yml playbook is run via Tower. We can collect some information via a survey (whether this is a DR test or not, for instance) as well as passing credentials, and it does the following:
Maps the DR replica storage volumes (or a storage snapshot of them in a DR test) to the ESXi hosts at DR
Re-signatures the datastores, renames them and mounts them on the hosts
Looks up the location of each virtual machine’s .vmx file in the webserver and imports it
Generates a new UUID on the VMs, so that you aren’t asked if you moved or copied it at power on.
If it’s a DR test, it checks whether production is running and, if it’s not, powers up the VMs in the sequence we need. This sequence is stored in our variables, so we know which VMs to power on first.
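The power-on sequencing in the last step can be sketched roughly as follows. This is Python for illustration only, not our playbook: the BOOT_ORDER mapping of VM name to priority is a hypothetical stand-in for the variables mentioned above, and the VM names are invented. VMs with the same priority form one batch, and any imported VM without an entry is started in a final batch.

```python
def power_on_batches(boot_order, imported_vms):
    """Group imported VMs into batches, lowest boot priority first.

    boot_order maps VM name -> integer priority (hypothetical layout).
    VMs without an entry are placed in a final batch.
    """
    last = max(boot_order.values(), default=0) + 1
    by_priority = {}
    for vm in imported_vms:
        by_priority.setdefault(boot_order.get(vm, last), []).append(vm)
    # Sort priorities so batches come out in start order; sort names
    # inside a batch only to make the result deterministic.
    return [sorted(by_priority[p]) for p in sorted(by_priority)]


# Hypothetical example: directory services first, then database, then apps.
BOOT_ORDER = {"dc01": 1, "db01": 2, "app01": 3, "app02": 3}

if __name__ == "__main__":
    imported = ["app02", "web01", "db01", "dc01", "app01"]
    for batch in power_on_batches(BOOT_ORDER, imported):
        print(batch)
```

Keeping the order as data rather than as logic means a new VM only needs a new entry in the variables, not a change to the automation.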
Failback
To fail back, the dr_deactivate playbook is run via Tower. This does the following:
Powers off the virtual machines
Exports the VMs from inventory
Puts the datastores in maintenance mode
Removes storage mappings
Executes the failover process described above on the target side to bring it all in again
All up, a complete failover takes us a few minutes and a single click of the mouse.
The process is now so reliable that we are looking at running a Tower schedule to do a DR test automatically once a week and clean it up afterwards.
Traps for young players
For most IT teams this isn’t their first rodeo in the area of formulating and executing an effective disaster recovery plan. That said, there are some traps with the Ansible approach that we identified along the way that are worth sharing.
Understand what needs to be in place for this to work in a failure state
Externalise your management tools
This is a bit of an obvious one, but our approach is to make sure that your management tools are located in a totally separate environment to the environment you are failing over.
In our case we have a combination of SaaS services and tools we manage ourselves outside of the environment.
The key ones for us are:
Management tools (Tower, Jenkins, HashiCorp Vault)
Metrics (InfluxDB, Graphite, Grafana)
Monitoring tools (Nagios, PagerDuty)
Git System (Bitbucket)
Keep a vCenter and Domain Controller somewhere else
Again, this is a little obvious, but it’s important to realise that in the DR failure scenario, you need to make sure that essential services exist in multiple locations. In this scenario the key ones are vCenter and Active Directory. Making sure a vCenter and a domain controller exist in each location is the simple way to do this.
Be mindful of connectivity
Having an external management environment is great in theory, but you need to ensure there is resilient connectivity to both sites. In addition to that, you can fail over your systems to the DR site, but if connectivity can’t go with them, then it’s a waste of energy. We’ve partnered with Equinix, utilising their Metro Connect (DC connectivity), Cloud Exchange (cloud connectivity) and Equinix Connect (internet) to enable the connectivity we need.
What to do if there is no Ansible module for something
The first trap we came across was re-signaturing datastores when they are imported. There isn’t an Ansible module that provides this ability; however, we found something pretty close to what we wanted in Ansible Galaxy. We took a copy of this module, edited it to suit our needs and have been successfully using that.
We had another issue where, when we power up a VM, it asks if we moved it or copied it, due to a non-unique UUID. Again, there wasn’t an Ansible module for this, but Ansible Galaxy came to the rescue again with something that we could use straight off the bat as a custom module.
This highlights the benefit of people contributing content to the open source community so that others in the community can make use of shared automation. It is something we want to do ourselves: we are kicking off an internal project to look at using Ansible collections as a vehicle for contributing something back.
Don’t wipe your data on failback.
The best thing and the worst thing about automation is that it will do exactly what you tell it to do. On failback, if you use the vmware_host_datastore module, with state=absent it will do exactly that. Your datastore will be gonskies. The approach which worked perfectly for us was to:
Power off / Export the VMs on the datastores you want to failback
Put the datastores in maintenance mode
Unmap the storage volumes from the storage system
Run a rescan of HBAs on all the esxi hosts (twice for good measure)
There are Ansible modules available for all of the above.
The project was an overwhelming success for us. With the push of a button, we can recover from a disaster in a few minutes.
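As a rough sketch of that ordering, the Python below builds the failback steps as a plain list and refuses any plan that contains a datastore removal. It is illustration only: the step labels (power_off_vm, unmap_volume and so on) are made-up names for this example, not Ansible module names, and the VM, datastore, volume and host names are hypothetical.

```python
# Step labels are invented for this sketch; they are not Ansible modules.
FORBIDDEN_STEPS = {"remove_datastore"}


def failback_plan(vms, datastores, volumes, hosts, rescans=2):
    """Return failback steps in the order that leaves the data intact."""
    plan = [("power_off_vm", vm) for vm in vms]
    plan += [("export_vm_from_inventory", vm) for vm in vms]
    plan += [("datastore_maintenance_mode", ds) for ds in datastores]
    plan += [("unmap_volume", vol) for vol in volumes]
    # Rescan every host, then do it all again for good measure.
    for _ in range(rescans):
        plan += [("rescan_hbas", host) for host in hosts]
    return plan


def check_plan(plan):
    """Raise ValueError if the plan contains a destructive step."""
    for action, target in plan:
        if action in FORBIDDEN_STEPS:
            raise ValueError(f"refusing destructive step {action} on {target}")
    return plan


if __name__ == "__main__":
    steps = failback_plan(["db01"], ["dr_ds01"], ["vol01"], ["esx01", "esx02"])
    for action, target in check_plan(steps):
        print(action, target)
```

A check like this is cheap insurance: the datastores are only ever detached by unmapping and rescanning, never deleted.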
While it’s true that bears don’t eat porridge (well maybe they do, how would I know), I’ve pinched the idea of the Goldilocks and the three bears analogy from a colleague of mine to describe exactly what I’ve found with failover options for VMware.
If the business is not prepared to make the financial and technical investment in VMware SRM, it is often too costly and gets ruled out as a fit-for-purpose option, especially when you are already automating and have a toolset you are comfortable with.
At the other end of the scale, there are some great SDKs out there for VMware, enabling teams to roll their own failover automation. The complexity really kicks in when you start to integrate with networks, external storage, operating systems and applications, which makes this a less attractive option because of the risk of the solution becoming fragile.
Finally, Red Hat Ansible automation is really the fit-for-purpose (or just right) option:
Simple to use with a low barrier to entry
Extensible to storage, networks, applications, operating systems
Idempotent, with an ability to run in “check mode”
Able to be used with Tower, to externalise access control, workflows, credentials, approval processes
Where to from here?
Well, the amount of content in the open source community that can be either reused or tweaked for your own purposes is significant, and we’ve certainly seen the benefit of that. What’s important is to contribute something back, so watch this space as we sanitise some content so that it can be shared. The current thinking is to create an Ansible collection and some Galaxy roles. We have a lot of dependencies, such as lookups into our Vault instance, credentials passed from Tower and a few other things specific to our environment, which we need to make generic, so keep an eye on our blog and LinkedIn page for updates.
Finally, a shameless plug – if you or your organisation are interested in teaming up with us to work on a project like this, please contact Advent One, we’d love to hear from you.