Disaster Recovery for Azure VMs with Site Recovery

2017-05-31

Disaster recovery is, or should be, a must for many production applications. Having the ability to recover your application in a separate geographic location should a major incident occur is vital to the continued availability of your service. Microsoft have offered a DR service called Azure Site Recovery (ASR) for some time now, but this has been focused on taking on-premises applications and providing a DR solution for these in Azure. Customers whose primary application site is in Azure haven’t been able to take advantage of this service, and so building DR solutions for services that are already in Azure has been very difficult. This is set to change with the announcement today of the public preview of Azure Site Recovery for Azure Virtual Machines. I’ve had access to the private preview of this solution and here are my thoughts.

This new service allows you to take your existing Azure production workloads and configure them for replication and recovery into a separate Azure region. Once configured, ASR will continuously replicate your virtual machines and allow you to orchestrate a recovery of these VMs into another region in the event of a disaster. This PaaS service provides a number of benefits for users with production workloads in Azure who need a DR solution and don’t want to build their own or use third-party tools. Out of the box ASR provides:

First-class, native support for Azure VMs
Replication to any supported region in Azure
Cost savings: if your DR VMs aren’t active then you are paying only the ASR fee plus storage and network egress. You are not paying for any running VMs, which is the bulk of the cost in any DR environment
Recovery plans, so that you can boot the VMs in a specific order and run Azure Automation scripts natively integrated with ASR recovery plans
Low RTO and RPO with application level consistency
One-click test of your recovery plans without interrupting production workloads
Creation of virtual networks, availability sets and storage accounts in the DR region

As far as I am aware, the ability to replicate VMs between two regions using a PaaS service is unique to Azure and could be a significant time and cost saver when looking at how to deal with DR for production Azure workloads, particularly when combined with Azure Automation to run pre- and post-recovery tasks.

Full documentation on the service is available here.

Limitations

Despite all the benefits listed above, we do need to remember this is a preview, so there are still some limitations on the service you should be aware of before trying to use it:

No support for VMs with managed disks yet
No support for Server 2016 OS
Linux support limited to certain distributions
Management is currently only through the Azure Portal, no support for command line, PowerShell or REST yet
Virtual Machine Scale Sets are not supported
Azure Disk Encryption is not supported
Replication groups (the ability to group VMs so they can be replicated and recovered to the same recovery point) are not yet available
Support for Sovereign Clouds coming soon

The full list of supported and unsupported configurations can be found here.
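
As a rough way to reason about these limitations, the Python sketch below checks an inventory of VMs against some of the items in the list. The `VmInfo` fields and the `preview_blockers` helper are hypothetical names invented for illustration; they are not part of any Azure SDK, and the official support matrix remains the authority.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VmInfo:
    """Hypothetical description of a VM; the fields are illustrative only."""
    name: str
    os: str
    managed_disks: bool = False
    disk_encryption: bool = False
    in_scale_set: bool = False


def preview_blockers(vm: VmInfo) -> List[str]:
    """Return the preview limitations from the list above that this VM hits."""
    blockers = []
    if vm.managed_disks:
        blockers.append("managed disks")
    if vm.os == "Windows Server 2016":
        blockers.append("Server 2016 OS")
    if vm.disk_encryption:
        blockers.append("Azure Disk Encryption")
    if vm.in_scale_set:
        blockers.append("Virtual Machine Scale Set")
    return blockers


# Example inventory (made-up VM names).
inventory = [
    VmInfo("web01", "Windows Server 2012 R2"),
    VmInfo("sql01", "Windows Server 2016", managed_disks=True),
]
for vm in inventory:
    issues = preview_blockers(vm)
    print(vm.name, "OK" if not issues else "blocked by: " + ", ".join(issues))
```

Running something like this before you start gives you a quick view of which workloads could be on-boarded during the preview and which will have to wait.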

Obviously, as it’s a preview it also does not have a full GA SLA. However, support is available and production workloads are supported if they are within the qualified support matrix.

Other Resources

It should be noted that the purpose of ASR is to replicate your VMs (and the storage they use for disks) to your DR region. It does not cover any other resources. So if your application relies on Load Balancers, Public IPs, Azure SQL, Key Vault, Web Apps etc. you will need to make sure that you have either replicated or pre-created these in your DR region using other methods, so that they are available if you fail over.

It should also be made clear that ASR does not provide built-in methods to configure access to your environment. For example, if you are using a public IP to access your resources then you will need to configure your recovery plan to run an Azure Automation script to associate the public IP with your resources, and the same applies to a load balancer. We’ll cover using these recovery plans to set these things up in more detail in a future post.
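
One way to keep track of this is to list everything the application depends on and separate what ASR replicates from what you must handle yourself. The sketch below is plain Python with made-up resource labels and names; it does not call Azure, and the `split_dependencies` helper is purely illustrative.

```python
# Resource kinds that ASR replicates, per the text above: VMs and the storage
# used for their disks. The labels are illustrative, not Azure type names.
REPLICATED_BY_ASR = {"virtual machine", "vm disk storage"}


def split_dependencies(dependencies):
    """Split (name, kind) pairs into ASR-covered names and manual work."""
    covered, manual = [], []
    for name, kind in dependencies:
        (covered if kind in REPLICATED_BY_ASR else manual).append(name)
    return covered, manual


# Hypothetical application made up of a VM plus supporting resources.
app = [
    ("web01", "virtual machine"),
    ("web-lb", "load balancer"),
    ("web-pip", "public ip"),
    ("orders-db", "azure sql"),
]
covered, manual = split_dependencies(app)
print("Replicated by ASR:", covered)
print("Replicate or pre-create yourself:", manual)
```

Everything that ends up in the second list needs its own plan, whether that is pre-creating it in the DR region or wiring it up from a recovery plan script.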

Summary

Overall, I’m excited about the Azure-to-Azure recovery service. I think it’s going to provide a significantly simplified process for doing DR with environments hosted inside Azure, which is something that has been pretty hard to do for a while. Right now the preview is missing a number of key components that I personally need, mainly managed disks, encrypted disks and Server 2016, but these will come eventually. I particularly like the ability to team this up with Azure Automation to completely automate the whole process. This preview offers an opportunity to get up to speed with the service ready for when those components arrive, but until they are present I’ll have to hold off using it in production.

Given all of that, let’s take a look at how you set up ASR to protect your Azure VMs.


Protecting VMs

To start using ASR, the first thing you need to do is create a recovery vault to store your recovery metadata. You can find this under “Backup and Site Recovery (OMS)”.

Important Note: You should create your recovery vault in the region you want to use as DR, not your primary region. In this example my production VMs are in the West Europe region, so my vault is in North Europe. You are not restricted to using the regional pairs for your vault.

Once you have created your vault you can protect your VMs. Go into your vault and click “Replicate”. In the source select “Azure”, and then complete the fields to select the resource group where your production workload exists.

On the next screen we select which virtual machines we want to replicate. Note that VMs with managed disks will be greyed out as they are not currently supported.

The final screen will ask you to confirm the region you wish to replicate to. This must be the same region as your vault. It will also show you what resources it is going to create in the DR region to support your deployment. These will be resource groups, vNets, storage accounts and availability sets. They will use the same name as your prod resources with -asr appended.

On this screen you will also see the default replication policy that has been selected. This has two values:

Recovery point retention – how long your snapshots are retained for. The default here is 24 hours
App consistent snapshot frequency – ASR has two types of consistency: File Level, where recovery points are being taken all the time, and Application Level. Application level consistency uses Volume Shadow Copy to ensure that applications are in a consistent state when a snapshot is taken. This can have an impact on performance, so you will want to decide how often to take these app level snapshots. The lowest you can go is every hour
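
To make the trade-off between these two values concrete, here is a minimal Python sketch that models them and works out roughly how many app consistent points fall inside the retention window. `ReplicationPolicy` is an illustrative class of my own, not an Azure SDK type; the 24 hour default and the one hour minimum are taken from the description above.

```python
from dataclasses import dataclass


@dataclass
class ReplicationPolicy:
    """Illustrative model of the two policy values; not an Azure SDK type."""
    app_consistent_every_hours: int  # how often app level snapshots are taken
    retention_hours: int = 24        # default retention described above

    def __post_init__(self):
        # The portal does not allow app level snapshots more often than hourly.
        if self.app_consistent_every_hours < 1:
            raise ValueError("app consistent snapshots cannot be more frequent than hourly")
        if self.retention_hours < 1:
            raise ValueError("retention must be at least one hour")

    def app_consistent_points_retained(self) -> int:
        """Approximate number of app consistent points kept at any one time."""
        return self.retention_hours // self.app_consistent_every_hours


policy = ReplicationPolicy(app_consistent_every_hours=4)
print(policy.app_consistent_points_retained())  # 24 // 4 = 6
```

A longer interval reduces the performance impact on the VM, but leaves you with fewer application consistent points to choose from when you recover.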

Once you’re happy with this, click “create resources” to create the required resources in the ASR resource group, and then click “enable replication”. Your replication policy will be enabled and your VMs will start being on-boarded. It can take some time to create the required resources and protect the machines. You can check on the status of this on the Jobs tab.

The process may sit for a while in the “Enable Replication” state. You can see more details on where in that process it is by clicking on the step.

Eventually all the tasks will complete and if you go to the “Replicated Items” section you will see your VMs listed as protected.

Your VMs are now being replicated into your recovery vault, with continuous recovery points (one about every 5 minutes) and application consistent snapshots on your requested schedule.

Recovery Plans

Now that your VMs are protected you can fail over individual VMs. However, in a disaster it is unlikely you will need to fail over a single VM; instead it will be a number of VMs that run your application between them. To avoid having to go through each VM individually you can create recovery plans, which allow you to group VMs together to be failed over as a group. To set this up, select the “Recovery Plans (Site Recovery)” option, and then create a recovery plan.

Recovery plans also allow you to run scripts, using Azure Automation, before and after your recovery process. These can be used for things like associating public IPs or load balancers with your VMs, or making changes to DNS to transfer traffic to your DR site. To add these you need to go to your recovery plan and then select the group where you want the scripts to be run. Right-click on the group and select a Pre or Post action.

You can then configure the Azure Automation account and scripts to use in these tasks.
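
To picture how groups and actions fit together, the following Python sketch models a recovery plan as an ordered list of groups and prints the order the steps would be listed in. The `Group` class, the VM names and the action names are all hypothetical. The sketch only captures the group ordering and the pre/post positions described above, and says nothing about how ASR schedules VMs inside a single group.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Group:
    """One group in a recovery plan: VMs plus optional pre and post actions."""
    vms: List[str]
    pre_actions: List[str] = field(default_factory=list)
    post_actions: List[str] = field(default_factory=list)


def execution_order(groups: List[Group]) -> List[str]:
    """Flatten a plan into its steps: pre actions, VMs, then post actions, per group."""
    steps = []
    for number, group in enumerate(groups, start=1):
        steps += [f"group {number} pre: {a}" for a in group.pre_actions]
        steps += [f"group {number} fail over: {vm}" for vm in group.vms]
        steps += [f"group {number} post: {a}" for a in group.post_actions]
    return steps


# Database first, then the web tier, then scripts to restore access.
plan = [
    Group(vms=["sql01"]),
    Group(vms=["web01", "web02"], post_actions=["attach-public-ip", "update-dns"]),
]
for step in execution_order(plan):
    print(step)
```

Sketching the plan like this before building it in the portal helps to spot dependencies, such as a web tier that must not start before its database.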

We’ll cover using these tasks in your recovery process in more detail in another article.

Failover

We’re now at the point where our VMs are replicated, we have a recovery plan defined and we are ready to test that site recovery actually works.

Testing of your DR process is vital to ensuring you are ready to handle a real event should it occur. ASR provides a simple way to test your failover without impacting your production environment, so this can be done at any time. You can test failing over a single VM, or a whole recovery plan.

Test Failover

To test a single VM:

Go to “Replicated Items” in the Recovery vault in the portal
Select the VM you wish to test
Click the “Test Failover” button

You will be presented with a window asking which recovery point you wish to test. You will have the option of:

Latest processed – this is the very latest recovery point available for that VM; it may not be an app consistent recovery point
Latest app consistent – as the name suggests, this is the latest recovery point that is app consistent
Custom – if you need to use an older recovery point

Finally, you need to select a virtual network to fail over to. The portal will encourage you not to use your production network, to avoid impacting production workloads.
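
The difference between the recovery point options can be shown with a small Python sketch. The `RecoveryPoint` class and `pick_recovery_point` function are illustrative only, and the way "custom" is handled here (the most recent point at or before a chosen time) is my own assumption for the example; in the portal you pick the point directly.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class RecoveryPoint:
    taken_at: datetime
    app_consistent: bool


def pick_recovery_point(points: List[RecoveryPoint], choice: str,
                        custom_time: Optional[datetime] = None) -> RecoveryPoint:
    """Pick a recovery point according to one of the three options above."""
    if choice == "latest processed":
        candidates = points
    elif choice == "latest app consistent":
        candidates = [p for p in points if p.app_consistent]
    elif choice == "custom":
        if custom_time is None:
            raise ValueError("custom needs a point in time")
        # Assumption: take the newest point that is not later than custom_time.
        candidates = [p for p in points if p.taken_at <= custom_time]
    else:
        raise ValueError(f"unknown choice: {choice}")
    if not candidates:
        raise LookupError("no recovery point matches")
    return max(candidates, key=lambda p: p.taken_at)


# Made-up recovery points: one app consistent, two newer ones that are not.
points = [
    RecoveryPoint(datetime(2017, 5, 31, 9, 0), app_consistent=True),
    RecoveryPoint(datetime(2017, 5, 31, 9, 55), app_consistent=False),
    RecoveryPoint(datetime(2017, 5, 31, 10, 0), app_consistent=False),
]
print(pick_recovery_point(points, "latest processed").taken_at)
print(pick_recovery_point(points, "latest app consistent").taken_at)
```

In this made-up data the latest point is an hour newer than the latest app consistent one, which is exactly the trade-off you are making when you choose between the two.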

When you are ready, click OK to begin the test.
You can view the status of the test in the jobs section. Once the job is showing complete you should be able to access your VM in your recovery network and test it out.

To test a recovery plan, the process is identical to the above except you go to the “Recovery Plans (Site Recovery)” option in the menu, and select a recovery plan to test.

Cleanup Test Failover

Once you have tested that your failover works you are going to want to clean up the test resources so you are not charged for the VMs. ASR provides a simple process to do this.

Select the resource you tested (VM or Recovery Plan)
Click the “Cleanup test failover” button
Add any notes you wish and check the “delete virtual machines” checkbox

Click OK and the resources will be deleted. Note this will only delete the VMs in the test; the vNets and storage accounts for recovery will remain in place.

Failover

Now that we have tested the failover process we know that replication is working and failover is working. Hopefully that is the limit of what you need to do, aside from repeating your failover tests regularly. However, if the time comes where you need to do a real failover, the process is much the same as for a test:

Select the resource you want to fail over
Click the Failover button
Select which recovery point you wish to recover to
You also have the option to attempt to shut down the VM before failover. Obviously this will only work if the live site is still active and the VMs are accessible. If the process cannot shut down the VMs it will continue anyway

Click OK
Again, you can monitor the process in Jobs

Commit

Once the failover job has completed you can go in and test that everything is up and running. Once you are happy with this you then need to commit the failover, which indicates that you have now failed over and are live in your DR location. Once you commit, the machines’ previous recovery points are removed from the vault as these are no longer valid.

Re-Protect

Once you have failed over to the DR region and committed, your machines are no longer protected by recovery services. You will need to go through the process to protect the machines again now they are in the new region. Unfortunately, there is no way to automatically re-protect machines once they fail over. You are able to re-protect at any time. If you re-protect into your primary region then you will be able to re-use the existing data from before your failover.

Fail-Back

If you wish to fail back to your original region you essentially need to do a failover in reverse. Make sure you re-protect your machines in the DR region, then initiate a failover back to your main region. Once committed, make sure you re-protect the VMs again.
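
The whole failover, commit, re-protect and fail-back cycle can be summarised as a small state machine. The Python sketch below is my own simplification of the steps described in this post; the state names, the action names and the `run` helper are illustrative and do not correspond to any ASR API.

```python
# Allowed transitions: (current state, action) -> next state.
TRANSITIONS = {
    ("protected", "failover"): "failed over",
    ("failed over", "commit"): "committed",
    ("committed", "re-protect"): "protected",
}


def run(actions, state="protected", active="primary"):
    """Apply actions in order and return (final state, active region)."""
    for action in actions:
        key = (state, action)
        if key not in TRANSITIONS:
            raise ValueError(f"cannot {action!r} while {state}")
        state = TRANSITIONS[key]
        if action == "failover":
            # A failover moves the live workload to the other region.
            active = "dr" if active == "primary" else "primary"
    return state, active


# Fail over to DR, then fail back: the same three steps, run twice.
round_trip = ["failover", "commit", "re-protect"] * 2
print(run(round_trip))  # ('protected', 'primary')
```

Seen this way, a fail-back is just the same three steps run a second time, with the final re-protect putting you back where you started.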