Customer Initiated Storage Account Failover

From the early days of Azure, it has been possible to configure storage accounts to be geo-redundant. Geo-redundant storage is replicated three times in the local region and then a further three times in the paired region, so that should the primary region go down, data can be accessed in the secondary. The big problem with this, however, is that failover to the secondary region is under Microsoft’s control: Microsoft decides whether the primary region is impaired enough to require storage to be failed over to the secondary region. As far as I am aware, this has never happened.

This limitation makes a lot of people (myself included) uncomfortable with using geo-redundant storage for disaster recovery. My view of what constitutes a DR event for my application and Microsoft’s view of what constitutes a DR event for a whole region could be very different. If I need to fail over to the secondary region because my application is down, but Microsoft doesn’t fail over the storage, I cannot rely on that storage being available in a DR event as I define it, so I end up having to implement other techniques to work around this. Microsoft being in control of failover also means that I cannot test my DR scenario if it relies on geo-redundant storage, which is also unacceptable.

For many years, the only answer to this was to use read-access geo-redundant storage (RAGRS), which allows you to read from the replica in the secondary region. You still could not control the failover, but if you needed to fail over and the storage had not, you did have the option of copying the data to a new storage account in the secondary region. Obviously, with large data sets and small DR windows, this can be unworkable.

Given all this, I was pleased to see that this week Microsoft launched a preview of Customer Initiated Failover for storage accounts. This new service allows you to decide when to fail over the storage account and to initiate the failover yourself, rather than relying on Microsoft to do it for you. This is excellent news, as it now makes geo-redundant storage a viable option for DR, instead of having to rely on backups or third-party replication tools.

For the rest of this article, we will take a look at what this service is and how it works.

Preview Warning

This service is in preview, which means it is currently only supported in two regions: West US 2 and West Central US. It also means it is not supported in production yet. If you think this service is something you want to use, it is a great idea to get in now and start testing it, understanding how it works and providing feedback. However, you should not put production data on it or rely on it for production workloads.

Update: this service went GA in June 2020.

Limitations

There are some limitations with customer-initiated failover that you should be aware of before you look at using it:

  • Storage accounts configured to use Azure File Sync are not supported
  • Storage accounts using ADLS v2 are not supported
  • Storage accounts containing archive blobs are not supported
  • Storage accounts containing premium blobs are not supported

Pre-Requisites

To use PowerShell to check the replication stats or initiate a failover, you originally needed to update to a preview version of the Az.Storage module. This is now supported in the GA PowerShell cmdlets.

The latest version of the Azure CLI seems to support these commands.

Setup Account Failover

Storage Account Type

To be able to use customer initiated failover, you need to be using a storage account that supports geo-redundancy, so Geo-redundant Storage (GRS) or Read-access Geo-redundant Storage (RAGRS). Zone Redundant Storage is not supported currently, and Locally Redundant Storage does not have the replication required.

If you have an LRS or ZRS storage account that you want to use, you can convert it to GRS.
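As a quick way to check whether an existing account qualifies, here is a minimal sketch. The function name `is_geo_redundant_sku` is my own, and the list of accepted SKU names simply mirrors the two redundancy types named above (GRS and RAGRS); it assumes the SKU is reported as `Standard_GRS` or `Standard_RAGRS`.

```shell
# Sketch: decide whether a storage account SKU is one of the geo-redundant
# types this article lists as supporting customer initiated failover.
# is_geo_redundant_sku is a hypothetical helper, not part of the Azure CLI.
is_geo_redundant_sku() {
  case "$1" in
    Standard_GRS|Standard_RAGRS) return 0 ;;  # geo-redundant: failover possible
    *) return 1 ;;                            # LRS, ZRS, etc.: convert first
  esac
}

# Example usage (placeholder as in the commands elsewhere in this article):
# sku=$(az storage account show --name <storageAccountName> --query sku.name --output tsv)
# if is_geo_redundant_sku "$sku"; then echo "OK: $sku"; else echo "Convert $sku to GRS first"; fi
```

The check is kept separate from the `az` call so that it can be reused in scripts that already have the SKU to hand.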

Enable the Preview

To use customer initiated failover you need to enable the preview on your Azure account. You can do this through PowerShell or CLI with the following commands. Bear in mind that registration can take 1-2 days.

Register Provider

PowerShell

Register-AzProviderFeature -FeatureName CustomerControlledFailover -ProviderNamespace Microsoft.Storage

CLI

az feature register --name CustomerControlledFailover --namespace Microsoft.Storage

Check Registration

Once you have registered the provider, you can check on the status of your registration with these commands:

PowerShell

Get-AzProviderFeature -FeatureName CustomerControlledFailover -ProviderNamespace Microsoft.Storage

CLI

az feature show --name CustomerControlledFailover --namespace Microsoft.Storage

Replication Status

Before you attempt a failover, it is important to understand how geo-replication synchronisation and failover work. When data is written to your geo-redundant storage account, it is written to the primary region first and then synchronised to the secondary region. This synchronisation is not immediate, and Microsoft does not provide an SLA on how long the lag between writing to primary and syncing to secondary will be.

When you fail over, behind the scenes a DNS change is made that points your primary storage URL to the secondary region, and replication between the two regions stops. If you fail over before data has finished syncing to the secondary, you will lose that data. This is important to understand. Ideally, you will check that all data has been replicated before you initiate a failover. Obviously, in a situation where the primary is down this is not possible, so it will depend on your reason for failing over.

You can check the replication status of your storage account using PowerShell or CLI.

PowerShell

$(Get-AzStorageAccount -ResourceGroupName <resourceGroupname> -Name <storageAccountName> -IncludeGeoReplicationStats).GeoReplicationStats

CLI

az storage account show --name <storageAccountName> --expand geoReplicationStats

Running either command should show you the GeoReplicationStats object, which provides the status of the replication. This can be Live (replication is working), Bootstrap (initial data is being copied from the primary), or Unavailable. Of more use is the time of the last sync: you can use this to determine how long it has been since a sync completed and whether there would be any data loss.
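To turn the last sync time into something you can act on, a minimal sketch follows. The helper name `seconds_since_sync` is my own, and it assumes GNU `date` (for `date -d`) and that the last sync time arrives as an ISO 8601 timestamp; anything written to the primary after that time may not yet be in the secondary.

```shell
# Sketch: how many seconds have passed since the last completed sync?
# Assumes GNU date and an ISO 8601 timestamp such as 2020-01-01T12:00:00+00:00.
# seconds_since_sync is a hypothetical helper, not part of the Azure CLI.
#
# Usage: seconds_since_sync <lastSyncTime> [now]
seconds_since_sync() {
  local last_sync="$1"
  local now="${2:-now}"          # defaults to the current time
  local last_epoch now_epoch
  [ -n "$last_sync" ] || return 1  # an empty value would be misread by date
  last_epoch=$(date -u -d "$last_sync" +%s) || return 1
  now_epoch=$(date -u -d "$now" +%s) || return 1
  echo $(( now_epoch - last_epoch ))
}

# Example usage (placeholder as in the commands above):
# last=$(az storage account show --name <storageAccountName> \
#          --expand geoReplicationStats \
#          --query geoReplicationStats.lastSyncTime --output tsv)
# seconds_since_sync "$last"
```

The optional second argument exists only so the calculation can be checked against a fixed point in time.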

You can also view the replication status in the portal, just before you trigger a failover, which you will see below.

Failover

Now that we are sure our data is in sync, we are ready to trigger a failover. The failover can be initiated through the portal or with PowerShell/CLI.

Portal

  1. In the portal, select the storage account you wish to fail over

  2. Go to the Geo-Replication section. This page should show you the location of your primary and secondary region, and the status of both.

  3. Click the “Prepare for failover (preview)” button

  4. The page that opens will display the last sync time and ask you to confirm the failover by typing “yes” in the box.

  5. Type yes in the box and click “Failover” to initiate the failover. This can take some time.

PowerShell

Invoke-AzStorageAccountFailover -ResourceGroupName <resource-group-name> -Name <account-name> 

CLI

az storage account failover --name <storageAccountName>
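If you want to script the "check the sync, then fail over" sequence for a planned failover, one possible shape is sketched below. The function `guarded_failover` and its maximum-lag threshold are my own illustration, not something the service provides; it assumes GNU `date`, and the CLI may still prompt you to confirm the failover.

```shell
# Sketch: only fail over if the last sync is recent enough.
# guarded_failover and its max-lag argument are hypothetical; assumes GNU date.
#
# Usage: guarded_failover <account> <resource-group> <max-lag-seconds>
guarded_failover() {
  local account="$1" group="$2" max_lag="$3"
  local last_sync last_epoch now_epoch lag

  # Read the last sync time from the replication stats.
  last_sync=$(az storage account show --name "$account" --resource-group "$group" \
    --expand geoReplicationStats \
    --query geoReplicationStats.lastSyncTime --output tsv) || return 1
  [ -n "$last_sync" ] || return 1

  last_epoch=$(date -u -d "$last_sync" +%s) || return 1
  now_epoch=$(date -u +%s)
  lag=$(( now_epoch - last_epoch ))

  # Refuse to continue if more data could be lost than we are willing to accept.
  if [ "$lag" -gt "$max_lag" ]; then
    echo "Last sync was ${lag}s ago (limit ${max_lag}s); not failing over." >&2
    return 2
  fi

  az storage account failover --name "$account" --resource-group "$group"
}

# Example usage: guarded_failover <storageAccountName> <resourceGroupName> 900
```

In a real outage you may decide to accept a larger lag, so treat the threshold as a safety net for planned failovers and tests rather than a hard rule.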

Re-Replication and Failback

Once you complete a failover, the storage account in the DR region will revert to being a locally redundant storage account; it will no longer be geo-replicated. To re-enable geo-replication, you will need to change the account type back to GRS. Once you do this, you will need to wait until your data has replicated back to the other region before you are protected.
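Changing the account type back can also be scripted. The sketch below wraps the CLI's `az storage account update` command; the wrapper name `reenable_grs` is my own, and the default of `Standard_GRS` is an assumption (pass `Standard_RAGRS` if that is what the account used before).

```shell
# Sketch: switch a failed-over (now locally redundant) account back to
# geo-redundant storage. reenable_grs is a hypothetical wrapper.
#
# Usage: reenable_grs <account> <resource-group> [sku]
reenable_grs() {
  local account="$1" group="$2"
  local sku="${3:-Standard_GRS}"   # assumed default; use Standard_RAGRS for RAGRS
  az storage account update --name "$account" --resource-group "$group" --sku "$sku"
}

# Example usage: reenable_grs <storageAccountName> <resourceGroupName>
```

Remember that the command returning does not mean the data has been copied; check the replication status until the last sync time has caught up.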

If you wish to fail back to the original region, you will undertake another failover, in the opposite direction. Before you do this, you must make sure your data has finished replicating; otherwise, you will lose data.

Image Credits

Nature & Technology flickr photo by Theophilos shared under a Creative Commons (BY-NC-ND) license