Reducing Your Blast Radius

When you first start with Infrastructure as Code, it can be tempting to create one template to rule them all™ where you can deploy your whole infrastructure in a single deployment. If you’re only deploying a few resources, then this can be fine, but once you get beyond that and are deploying complex sets of infrastructure, it’s important to consider your blast radius.

So what do we mean by blast radius? It’s the amount of damage you can do with a mistake, error or bug in your IaC. If a single mistake can take out your whole production environment, you have a pretty big blast radius. So how do we shrink it? Below are some techniques we can use when creating our IaC.

Break up your code

One of the most significant changes you can make is to break up your code. If you have one big file containing all your infrastructure, any time you make a change, you have to re-run the whole file, so every resource in it is potentially exposed to an unintended change. If you break your resources into multiple smaller sets of code, you can run just the files you need to change. A common approach is to break your infrastructure into layers, such as network, storage and compute. Doing this lets you focus your changes on specific areas and ensures that even if something goes wrong, it only impacts a smaller set of resources.
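As a rough illustration of the layered approach, the sketch below works out which layers a change touches so that only those get redeployed. The directory layout (one top-level folder per layer) and the function name layers_to_deploy are assumptions made for this example, not part of any IaC tool.

```python
from pathlib import PurePosixPath

# Hypothetical layout: one top-level directory per layer, listed in
# the order the layers would be deployed.
LAYERS = ("network", "storage", "compute")


def layers_to_deploy(changed_files):
    """Return the layers touched by a set of changed files, in deployment order."""
    touched = set()
    for name in changed_files:
        parts = PurePosixPath(name).parts
        # Only files that live under a known layer directory count.
        if parts and parts[0] in LAYERS:
            touched.add(parts[0])
    return [layer for layer in LAYERS if layer in touched]


if __name__ == "__main__":
    # A change to a compute file and the README only touches the compute layer.
    print(layers_to_deploy(["compute/vm.tf", "README.md"]))
```

With a layout like this, a change under compute/ never re-runs the network or storage code, which is exactly the smaller blast radius we are after.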

Use modules

When you break up your code, you still need ways to chain your individual files together to deploy your whole infrastructure stack. Most IaC languages provide a way to do this:

  • ARM Templates - Nested Templates
  • Bicep - Modules
  • Terraform - Modules
  • Pulumi - Stack References

Using these techniques, you can split your code into smaller chunks of work but have a way to link them all together, pass values between them and deploy your stack in one go. As a bonus, it also makes reusing and sharing this code easier when packaged up as a module.

Understand your dependencies

One of the most common causes of catastrophic failures is dependencies: something is deleted or changed that other resources depend on, or a change to one resource cascades to its dependent resources. Because of this, it is crucial that you understand your dependencies and what impact any changes you make will have on other resources. Some IaC languages have tooling that can help with this. For example, the aptly named Blast Radius is a project for Terraform that can visualise the dependencies within Terraform code.
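To make the idea of cascading dependencies concrete, here is a minimal sketch that walks a dependency graph and lists everything that could be affected by changing one resource. The resource names and the affected_by function are hypothetical; real tools derive this graph from your IaC code rather than from a hand-written dictionary.

```python
from collections import deque

# Hypothetical resources; each one maps to the resources it depends on.
DEPENDS_ON = {
    "vnet": [],
    "subnet": ["vnet"],
    "storage": [],
    "vm": ["subnet", "storage"],
    "dns_record": ["vm"],
}


def affected_by(resource, depends_on):
    """Return every resource that directly or indirectly depends on `resource`."""
    # Invert the graph so we can look up "who depends on me?".
    dependents = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(name)

    # Breadth-first walk over the inverted graph.
    seen = set()
    queue = deque([resource])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


if __name__ == "__main__":
    print(sorted(affected_by("vnet", DEPENDS_ON)))
```

In this made-up graph, changing the vnet can reach the subnet, the VM and the DNS record, while changing the DNS record affects nothing else.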

Preview changes

Nearly every IaC language can preview the changes you will make before you make them. This is an essential tool for determining what impact your changes will have. A preview shows you which resources will be created, changed or deleted, so changes to resources you weren’t expecting to touch should stand out quickly. Ideally, you will review the preview before every deployment and only proceed once the changes match what you expect.
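Previews can also be checked by a script rather than by eye. The sketch below assumes a Terraform plan that has been exported as JSON (for example with terraform show -json) and picks out the resources that would be deleted or replaced. The sample plan is a hand-written, heavily trimmed fragment for illustration, not real tool output, and risky_changes is a hypothetical helper name.

```python
import json

# Hand-written, trimmed example of a plan in JSON form (an assumption
# for illustration; a real plan contains many more fields).
SAMPLE_PLAN = """
{
  "resource_changes": [
    {"address": "azurerm_virtual_network.main", "change": {"actions": ["no-op"]}},
    {"address": "azurerm_storage_account.data", "change": {"actions": ["delete", "create"]}},
    {"address": "azurerm_public_ip.old", "change": {"actions": ["delete"]}},
    {"address": "azurerm_subnet.app", "change": {"actions": ["update"]}}
  ]
}
"""


def risky_changes(plan_json):
    """Return (address, kind) pairs for resources the plan would delete or replace."""
    plan = json.loads(plan_json)
    risky = []
    for resource in plan.get("resource_changes", []):
        actions = resource.get("change", {}).get("actions", [])
        if "delete" in actions:
            # A delete paired with a create is a replacement.
            kind = "replace" if "create" in actions else "delete"
            risky.append((resource["address"], kind))
    return risky


if __name__ == "__main__":
    for address, kind in risky_changes(SAMPLE_PLAN):
        print(f"{kind}: {address}")
```

A check like this does not replace reading the preview, but it can draw attention to the changes most likely to hurt.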

Protect key resources

One of the most disruptive issues is when resources are accidentally deleted. This is especially concerning when using a language like Terraform or Pulumi, where if a resource doesn’t support in-place the change you are asking for, it will be deleted and then recreated. Sometimes this works fine, but sometimes this deletion and recreation is very disruptive, and you want to avoid it at all costs, even if that means stopping the deployment. Many IaC languages offer a way to protect resources to prevent accidental deletion:

  • ARM and Bicep - resource locks
  • Terraform - prevent_destroy
  • Pulumi - protect

Protecting resources does make managing these resources a bit harder. If you want to delete them at some point, you must remove the protection. However, if these resources are important and will be significantly disrupted by a deletion, it’s often worth the effort.
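The features listed above are built into each tool, and you should prefer them. Purely to illustrate the idea, the sketch below shows what such a guard does conceptually: it stops the deployment when a protected resource is about to be deleted. The class name, function name and resource names are all hypothetical.

```python
class ProtectedResourceError(Exception):
    """Raised when a deployment would delete a protected resource."""


def ensure_not_deleting_protected(planned_deletes, protected):
    """Stop the deployment if any protected resource is about to be deleted."""
    blocked = sorted(set(planned_deletes) & set(protected))
    if blocked:
        raise ProtectedResourceError(
            "refusing to delete protected resources: " + ", ".join(blocked)
        )


if __name__ == "__main__":
    # Hypothetical resource names.
    protected = {"prod-database", "prod-keyvault"}
    try:
        ensure_not_deleting_protected(["temp-vm", "prod-database"], protected)
    except ProtectedResourceError as error:
        print(error)
```

Removing an entry from the protected set is the deliberate extra step mentioned above: deleting a key resource should always require a conscious decision.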

Test your changes

If your infrastructure changes go straight to production, then at some point, you are going to break production. You should have a test environment that you can push changes to first to catch problems before they reach production. This environment should mimic the production one as closely as possible to ensure that the tests can detect any issues impacting production.

The best way to do this is to automate your tests. Once someone checks in a change to the IaC, you run a pipeline that pushes the change into the test environment and runs tests to see what has changed and whether anything is broken.
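The shape of such a pipeline can be sketched in a few lines. This is a simplified model, assuming a deploy function and a set of named checks that you would supply yourself; in practice your CI/CD system provides the stages, and run_pipeline is a hypothetical name.

```python
def run_pipeline(deploy, checks):
    """Deploy to test, run every check, and promote to production only if all pass."""
    deploy("test")
    failures = [name for name, check in checks.items() if not check()]
    if failures:
        # Production is never touched when a check fails.
        return {"promoted": False, "failures": failures}
    deploy("production")
    return {"promoted": True, "failures": []}


if __name__ == "__main__":
    deployed = []
    # Stand-in checks; real ones would probe the test environment.
    result = run_pipeline(
        deploy=deployed.append,
        checks={"site_responds": lambda: True, "db_reachable": lambda: False},
    )
    print(result, deployed)
```

The important property is the ordering: a change only reaches production after it has been deployed to the test environment and every check has passed there.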

Automate deployments

As well as automated testing, if you can automate the actual deployments, you will reduce the chance of human error or misconfiguration of a deployment. Using a CI/CD system for your IaC means you can create automated pipelines that will run your deployments the same way every time.