State Files, What Are They Good For?

2022-10-30

Absolutely nothi… oh wait, they might have some uses.

When choosing an Infrastructure as Code (IaC) tool, there are many differentiating factors among the options available, but one that often comes up is whether or not the tool uses a state file. Some tools, such as Terraform and Pulumi, use a state file to keep their own record, outside the cloud provider’s systems, of what has been deployed. In contrast, others, like ARM templates and Bicep, don’t have a state file and rely on the actual state of the resources in the cloud. Using a state file can have benefits but also adds complexity and issues that don’t exist in a stateless tool. In this article, we are going to take a look at the benefits and downsides of using a state file to help you decide which tool is best for you.


What is a State File

A state file is a file used by the IaC tool to record information about what has been deployed by the tool. Often these are JSON files and will contain a representation of every object that has been deployed by the tool, including all the configuration options specified when creating the resource and all the outputs returned by the cloud provider.
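To make this concrete, here is a minimal sketch of what such a file might contain. The layout, field names and resource ID below are hypothetical and heavily simplified; each real tool uses its own schema.

```python
import json

# Hypothetical, simplified state layout; real tools use their own schemas.
state = {
    "version": 1,
    "resources": [
        {
            "type": "storage_account",
            "name": "appdata",
            # Inputs: the configuration supplied when the resource was created.
            "inputs": {"location": "westeurope", "sku": "Standard_LRS"},
            # Outputs: values returned by the cloud provider after creation.
            "outputs": {"id": "/hypothetical/resource/id/appdata"},
            "depends_on": [],
        }
    ],
}

# The tool serialises this record after each deployment...
serialized = json.dumps(state, indent=2)

# ...and reads it back at the start of the next one.
restored = json.loads(serialized)
```

Everything the later sections describe (preview, destroy, sharing outputs) is built on reading a record like this back in.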

Generally, there will be a separate state file for each IaC project deployed, and often when you are deploying the same infrastructure multiple times with different parameters, there is also a state file for each deployment.

The state file must be stored in a place accessible to the IaC tool when running the deployment. This can just be held on the machine of the user running the deployment, but as teams grow and CI/CD pipelines do deployments, there is a need to move the state file to some sort of shared storage. Depending on what the IaC tool supports, this can be a cloud provider’s storage (such as Azure Blob Storage) or maybe even a commercial version of the IaC tool like Terraform Enterprise.

Usually, there is no need for the end user to access the state file directly and view its contents; it is purely used by the IaC tool.


Benefits of a State File

There are reasons why some IaC tools use a state file; otherwise, it would be a waste of time to manage this file. The state file lets the IaC tool keep track of what it has deployed and quickly look up the recorded state of those resources without needing to query the cloud provider. By doing this, the tool gets several key benefits.

Preview

This is usually the headline benefit of a state file: the ability to preview what a deployment is going to do and determine what impact it will have on your resources. Because the IaC tool has a record of what is currently deployed, it can compare this to the changes you are asking to make and provide information on the impact of these changes on your deployment. The tool can tell you what resources will be created, which existing resources will be changed and what these changes are, and which resources will be destroyed. This ability can be a vital feature when you need to be clear on what changes will be made before you deploy them and to recognise changes you weren’t expecting. This is especially true when a preview shows that a resource will be destroyed and you did not intend that.
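The comparison behind a preview can be sketched in a few lines. This is an illustrative model only, not how any particular tool implements it: the `preview` function and the resource names are made up for the example, and both inputs are plain dictionaries mapping a resource name to its properties.

```python
def preview(recorded: dict, desired: dict) -> dict:
    """Compare the recorded state with the desired configuration.

    Both arguments map a resource name to a dict of its properties.
    """
    return {
        # In the configuration but not in the state: will be created.
        "create": sorted(desired.keys() - recorded.keys()),
        # In both, but with different properties: will be changed.
        "update": sorted(
            name
            for name in desired.keys() & recorded.keys()
            if desired[name] != recorded[name]
        ),
        # In the state but no longer in the configuration: will be destroyed.
        "delete": sorted(recorded.keys() - desired.keys()),
    }


recorded = {
    "vnet": {"cidr": "10.0.0.0/16"},
    "vm": {"size": "small"},
    "old-db": {"tier": "basic"},
}
desired = {
    "vnet": {"cidr": "10.0.0.0/16"},
    "vm": {"size": "large"},
    "cache": {"tier": "basic"},
}

plan = preview(recorded, desired)
print(plan)
# {'create': ['cache'], 'update': ['vm'], 'delete': ['old-db']}
```

The unexpected `delete` entry for `old-db` is exactly the kind of surprise a preview is meant to surface before anything is changed.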

The preview process has also been extended by other tools to show even more information about resources. Infracost, for example, shows you the cost of the cloud resources you want to deploy as part of the preview.

For a long while, being able to undertake a preview operation was only possible in stateful tools like Terraform and Pulumi. However, the “What-If” option was recently introduced for ARM and Bicep, allowing a similar function for these languages. This feature is directly comparing against the cloud resources and so is slower and more prone to showing false positives. It also only works against Azure resources, whereas the preview in Terraform and Pulumi works against any resource types it can deploy.

Destruction

Another benefit of tracking what you deploy is that you can destroy it. Stateful tools know exactly what resources were deployed as part of a project because they are noted in the state file. This means that the tool can offer a destroy option where it will clean up resources that it created for you with a single command. This can be a significant benefit, especially with dev/test environments where you want to create, run some work and then quickly destroy them.

Stateless languages like ARM and Bicep have no concept of remembering what they deployed. Once you run an ARM template, it has no further interaction with the resources, so there is no way for it to go back and destroy these resources for you. You can work around this to some extent by deploying all your resources into a resource group and then deleting the resource group when you want to delete them. However, this can get complicated if you have projects sharing a resource group or you deploy some components in other resource groups. It’s even more painful if you deploy things that sit outside of resource groups like policies, permissions or even non-ARM resources like service principals.

Destroy and Recreate

Because the state file has a complete record of the resources that were created, including all the properties on each resource, the tool knows how to recreate them. This is useful if you need to change a property on a resource that does not support changing it. If you try this in an ARM or Bicep template, you will get an error stating that the resource does not support changing the property. In a stateful language, if this happens, it can delete and recreate the resource with the new setting to get it into the state you want. It can do this because it knows precisely what that resource needs to look like based on the state file.

Now, this can be a double-edged sword. In a stateless tool, you know that if you make a mistake and try to change a property that can’t be changed, it will just show an error. A stateful tool will recreate the resource instead, and if that resource is storing important data, the data could be lost. If you pay attention to the preview, you can spot when this happens, and there are ways to prevent the destruction of resources in this manner, but you need to be careful to avoid data loss.
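The decision between updating in place and replacing can be sketched as follows. This is a simplified illustration under stated assumptions: the `IMMUTABLE` set, the `plan_action` function and its `protect` flag are hypothetical stand-ins, with `protect` playing the role of the safeguards mentioned above.

```python
# Hypothetical set of properties that the provider cannot change in place.
IMMUTABLE = {"location", "name"}


def plan_action(recorded: dict, desired: dict, protect: bool = False) -> str:
    """Decide what to do with one resource, given its recorded and desired properties."""
    changed = {
        key
        for key in recorded.keys() | desired.keys()
        if recorded.get(key) != desired.get(key)
    }
    if not changed:
        return "no-op"
    if changed & IMMUTABLE:
        # An immutable property changed, so the only option is delete and recreate.
        if protect:
            raise RuntimeError("refusing to replace a protected resource")
        return "replace"
    return "update"


current = {"name": "appdata", "location": "uksouth", "sku": "standard"}

print(plan_action(current, {**current, "sku": "premium"}))         # update
print(plan_action(current, {**current, "location": "westeurope"}))  # replace
```

Marking a data-bearing resource as protected turns a silent replacement into an error, which is the behaviour you would get from a stateless tool by default.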

Sharing Data Between Projects

Sometimes you may have multiple deployment projects that need to share information between them. You might need to get an output from a resource deployed in one project to use as input into the next project. Most stateful tools have the facilities for a project to read the state file of a previous deployment and grab values from that earlier project to use in the current one. This can just be reading resource properties, or it can be more complex things like retrieving keys or secrets from resources created in the first project that need to be accessed in the second. This can be a significant benefit for resources where secrets are created once and cannot be read back from the provider afterwards, as this data can still be accessed through the state file (this comes with some downsides, so see the later section).

For stateless tools to replicate this functionality, you either need the later project to access the cloud resources directly and read the properties (which means it needs access to them) or you need to store these values in an intermediate tool like Key Vault or App Configuration.
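The stateful approach to sharing values could be sketched like this. It is an illustrative model only: the state layout, the `read_outputs` helper and the output names are hypothetical, and real tools expose this through their own mechanisms rather than by parsing the file by hand.

```python
import json

# Hypothetical state written by an earlier "network" project.
network_state_json = json.dumps(
    {"outputs": {"subnet_id": "subnet-123", "db_password": "example-secret"}}
)


def read_outputs(state_json: str, wanted: list[str]) -> dict:
    """Return the requested outputs from another project's state."""
    outputs = json.loads(state_json).get("outputs", {})
    missing = [key for key in wanted if key not in outputs]
    if missing:
        raise KeyError(f"outputs not found in state: {missing}")
    return {key: outputs[key] for key in wanted}


# The second project only asks for the values it needs.
shared = read_outputs(network_state_json, ["subnet_id"])
print(shared["subnet_id"])  # subnet-123
```

Note that the secret sits in the same file as the harmless subnet ID, so anything that can read one can read the other.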

Performance

When you update an existing deployment, the process can be much quicker with a stateful tool. Because the stateful tool knows the current state of the resources through the state file, it does not need to go back to the cloud provider for any resources that don’t need changes and only works with the resources that are changing.

In a stateless tool, all of the infrastructure properties must be sent to the cloud provider and run against the cloud infrastructure, even if there is only a tiny change. This can lead to a significant increase in time if there are lots of resources.

Dependency Management

The state file not only records what resources have been deployed but also which resources depend on each other. This should prevent a resource from being deleted while other resources that depend on it still exist. With a stateless tool, it is all too easy to manually delete resources that other things depend on.
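As a rough sketch of how recorded dependencies can drive ordering, the example below uses Python’s standard `graphlib` module. The resource names and the `blocking_dependents` helper are hypothetical, and real tools build a much richer graph.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency record: each resource maps to the resources it depends on.
depends_on = {
    "resource_group": [],
    "vnet": ["resource_group"],
    "vm": ["vnet"],
}

# Dependencies come first when creating...
create_order = list(TopologicalSorter(depends_on).static_order())
# ...and last when destroying.
destroy_order = list(reversed(create_order))


def blocking_dependents(name: str, graph: dict) -> list[str]:
    """Return the resources that still depend directly on `name`."""
    return sorted(res for res, deps in graph.items() if name in deps)


print(create_order)   # ['resource_group', 'vnet', 'vm']
print(destroy_order)  # ['vm', 'vnet', 'resource_group']
print(blocking_dependents("vnet", depends_on))  # ['vm']
```

A tool with this information can refuse to delete `vnet` while `vm` still exists, or delete them in the right order.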

Down Sides of a State File

As you saw, a state file brings many benefits, but it’s not all upsides. A state file adds complexity to your deployment process in several ways.

Storing the State File

It may seem obvious, but state files need to live somewhere. Storing it on a developer’s machine might work for a single user doing some development work, but as soon as you need more than one person working on the infrastructure or you need to deploy it to production, you need to find somewhere to store it. Most stateful tools offer various options for storing state, including cloud provider storage, databases, commercial offerings from the tool provider, etc. It’s pretty easy to set this up, but it’s another thing to manage.

If you choose to use a cloud provider backend like Azure Blob Storage rather than something like Terraform Enterprise, you then have the added complexity of:

Needing to back up the state file to protect against accidental deletion or corruption
Securing access to the state file. As we’ll see later, there can be very sensitive data in a state file, so you need to make sure it is secure at both the RBAC and the network access level

Secrets in State

As mentioned in the benefits section, a state file will store secrets. Any resources that create sensitive data, such as storage keys, function keys, Azure AD passwords etc., that are created with a stateful tool will have this data stored in the state file. This has several benefits around managing these secrets through the IaC tool and passing them around as needed to configure other resources, so you don’t want to get rid of it. However, this now means you have a very sensitive file to manage.

Some tools, like Pulumi, have built-in encryption for state files which can make this a bit easier, but you still need to manage how that encryption will work and where the key comes from. Terraform, however, does not offer state encryption out of the box. Some plugins and community add-ons can help you achieve this, but more work needs to be done.

Even with encryption in place, these secrets still need to be decrypted at deployment time, so you also need to ensure that you keep these secrets safe from those doing the deployments if they should not have access to them.

Additional Setup Work

A significant benefit of a stateless tool like Bicep is that I can take a .bicep file, point it at a resource group and run it, and it will deploy. If I want to rerun it with different parameters to create a second environment, I just rerun it with a different value passed to the parameters. For a stateful language, I must create the state file before running anything. This means figuring out where the state file is going to go, getting access to that location, securing the file, setting up any encryption for secrets etc.

This can be a real blocker for things like development environments where you want to create environments quickly, do some work and destroy them. It is possible to automate this process to make it easier; some of the work Pulumi has done with their automation API makes this pretty straightforward, but it’s still more work that needs to be done.

Existing Resources

Importing existing resources into a stateful project is painful in every tool that supports it. There have been multiple attempts to make this easier, but at the end of the day, you are trying to integrate a one-off imperative task that needs a lot of upfront information into a declarative process designed to be run over and over. If you have existing resources that you need to bring into a project, then there will be a lot of upfront manual work that needs to be done to manage that import process.

With a stateless language, I can import existing resources easily by just creating an entry in my IaC with the same name, setting the properties to match, and running the deployment.

This is for resources that need to become part of your project. If you need to reference existing resources to get some data from them, but they don’t need to be managed by your deployment, then most tools have a way to do this fairly simply.

State File Corruption

If you spend time working with stateful languages, you will experience issues with the state file having problems or getting corrupted. The most common cause of this is a situation where a deployment failed part way through, and the IaC tool thinks it is in an inconsistent state. It will usually mark certain elements in the state file as having problems, and you will need to go in and tell it what condition they are in or revert some changes it has made. Often this can be fixed relatively quickly, but it still takes time.

On some occasions, however, you can find yourself in a situation where what the state file says exists and what actually exists are very different. An error indicating that a resource already exists but is not in the state file, despite you knowing that the IaC tool is the one that created it, is very frustrating. In these scenarios, it can be easier to delete the whole deployment, clear out the state file, and start again. This is pretty easy to do in dev but not so in production. There, it may mean reverting to a backup of the state file, but then you need to deal with any resources or changes to resources that are not in that backup.

Most stateful tools have a “refresh” option that will compare the state file to what is deployed in the cloud provider and attempt to reconcile them, but this doesn’t always work.

Coping with Change

A stateful tool assumes that any resource changes are done within the tool. If you make changes outside of the tool, such as directly editing a resource through the cloud provider, then by default, the IaC tool isn’t going to be aware of the change. As far as it is concerned, the resource is in the same state as the state file says it is. This can lead to drift between the state file and reality. As mentioned, there are refresh options for most providers that will compare the difference and try to reconcile them, but this can add significant time to your deployment, especially if you need to rectify some of these changes.

Terraform runs refresh by default on any deployment, whereas Pulumi does not. There are options to force a refresh (or not) in both tools, but you need to be aware of this.

In a stateless tool, every deployment is compared against what is in the cloud provider, so changes made outside the tool are always detected.
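The idea behind drift detection and a refresh can be sketched as a comparison between the recorded state and what the provider reports. This is a simplified model: `detect_drift`, `refresh` and the sample data are hypothetical, and `actual` stands in for the result of querying the cloud provider.

```python
def detect_drift(recorded: dict, actual: dict) -> dict:
    """Report tracked resources whose real properties differ from the state."""
    drift = {}
    for name, props in recorded.items():
        if name not in actual:
            drift[name] = "missing from cloud"
            continue
        changed = sorted(key for key in props if actual[name].get(key) != props[key])
        if changed:
            drift[name] = "changed: " + ", ".join(changed)
    return drift


def refresh(recorded: dict, actual: dict) -> dict:
    """Return a new state that matches reality for the resources the tool tracks."""
    return {name: dict(actual[name]) for name in recorded if name in actual}


recorded = {"vm": {"size": "small"}, "disk": {"gb": 100}, "nic": {"ip": "10.0.0.4"}}
# Someone resized the VM and deleted the NIC through the portal.
actual = {"vm": {"size": "large"}, "disk": {"gb": 100}}

print(detect_drift(recorded, actual))
# {'vm': 'changed: size', 'nic': 'missing from cloud'}
```

Every tracked resource has to be queried to build `actual`, which is why a refresh adds time to a deployment.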

Conclusions

State files can be a very powerful tool and bring several benefits. However, they also bring complexity and opportunities for more things to go wrong. Consider carefully whether the benefits of the state file are useful to you, and if they are not, whether a stateless tool might make your life easier. Remember that one of the benefits of the state file is being able to use the tools that rely on it. Even if you do not need any of the benefits we have listed here, if you want the multi-cloud abilities of Terraform or Pulumi, then the state file is part of the package, and you may need to accept it as the cost of using those languages.

From personal experience, working with a state file can be a bit of a pain, but you soon put processes in place to manage it and create automation where it is needed, and it becomes second nature fairly quickly. So don’t let the requirement for a state file put you off trying these tools, but do consider what it is bringing you.