The Mystery of Pulumi and the Disappearing NSG Rules

Recently I’ve been working with Pulumi to deploy cloud infrastructure. If you’ve not heard of it before, Pulumi is an infrastructure-as-code tool similar to ARM templates and Terraform. The big differentiator with Pulumi is that you can write your templates in general-purpose programming languages. Rather than using a DSL like ARM’s JSON or Terraform’s HCL, you can use languages like C#, TypeScript, Python and Go.

I’ve been enjoying working with Pulumi, and I’ll be writing some more about my experience in the future. If you’ve not checked it out before, I recommend you have a look at their website. For this article, however, I just wanted to quickly cover an issue that caught me out when I was getting started and left me confused for a while. Hopefully, if you come across this issue when getting started with Pulumi, this will help you clear it up sooner.

The Issue

I was creating a relatively simple Pulumi template that was deploying a network security group and some security group rules. As with Terraform, you have two ways to apply NSG rules, either inline inside your NSG or as separate top-level objects, but you can only use one option. I tend to stick with using separate objects, as this makes it easier to add more rules later, especially if you want to add them in a different template.

The issue I found was that when I first ran my template to create the resources, it all worked fine: the NSG was created and the rules were added. However, if I then made a change to my NSG, for example adding a tag, and re-ran my deployment, I found that all of my network rules had been deleted!

I spent several hours re-running the deployment and testing out theories before I found the cause of the problem.

The Root Cause

The root cause of this problem is a combination of the way the Azure REST API works and a design choice by Pulumi. Thanks to fellow MVP David O’Brien for pointing out this issue, after his experience with something similar.

The critical issue is the way the Azure REST API works. It affects NSGs as well as any other resource that supports both inline and child resources. When we first create the NSG, Pulumi sends multiple requests to the API: one to create the NSG, then one for each of the separate rule objects. This works fine. However, when we edit the NSG, the only thing that has changed is the NSG itself, not any of the rules, so Pulumi sends only the updated NSG object. The Azure API, though, expects the PUT request that updates the NSG to contain all the information about the NSG, including the inline list of NSG rules.

    {
        "type": "Microsoft.Network/networkSecurityGroups",
        "apiVersion": "2020-05-01",
        "name": "[parameters('networkSecurityGroups_nsg_name')]",
        "location": "westeurope",
        "properties": {
            "securityRules": []
        }
    }

What Pulumi has stored in its state file, however, is the original representation of the NSG: an NSG with no inline rules, and a set of separate top-level rules. Because only the NSG has changed, Pulumi only sends back the NSG object. The Azure API sees an empty list and removes all the rules. In my view, this is a real flaw in providers that have the option for both inline and separate resources.
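To make the sequence of requests concrete, here is a minimal sketch in Python. It is a toy model of the behaviour described above, not Azure or Pulumi code: the `FakeArm` class, its methods, and the rule names are all made up for illustration, and the only assumption it encodes is that a PUT on the NSG replaces the whole object, rules included.

```python
import copy


class FakeArm:
    """Hypothetical stand-in for the Azure API: PUT replaces the whole NSG."""

    def __init__(self):
        self.nsgs = {}

    def put_nsg(self, name, body):
        # Full replacement: whatever securityRules the body carries wins.
        self.nsgs[name] = copy.deepcopy(body)

    def put_rule(self, nsg_name, rule):
        # A separate rule request adds the rule to the parent NSG.
        self.nsgs[nsg_name]["properties"]["securityRules"].append(dict(rule))


api = FakeArm()

# First deployment: one request for the NSG, then one per separate rule.
nsg_body = {"tags": {}, "properties": {"securityRules": []}}
api.put_nsg("nsg", nsg_body)
api.put_rule("nsg", {"name": "allow-https", "priority": 100})
api.put_rule("nsg", {"name": "allow-ssh", "priority": 110})
rules_after_create = len(api.nsgs["nsg"]["properties"]["securityRules"])

# Second deployment: only a tag changed, so only the NSG body is re-sent,
# still carrying the empty rule list it was originally declared with.
nsg_body["tags"]["env"] = "dev"
api.put_nsg("nsg", nsg_body)
rules_after_update = len(api.nsgs["nsg"]["properties"]["securityRules"])

print(rules_after_create, rules_after_update)
```

In this model the tag change is applied correctly, but the second PUT leaves the NSG with no rules, which matches what I saw in the portal.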

However, what makes this a bit more confusing is that the same issue does not happen with Terraform, and this is where the design decision from Pulumi comes in. In Terraform, when you run an “apply” operation, it always does a refresh first, which grabs the current state of the object from Azure and uses that data to build the update request. Pulumi decided to default to not refreshing the state, with an option to refresh if you want to. Because of this, Terraform always gets the updated NSG object from Azure, which does have the rules list, whereas Pulumi does not.
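The effect of the refresh step can be sketched in a few lines of Python. This is only an illustration of the explanation above, not how either tool is implemented: `build_nsg_update`, the `recorded` and `cloud` dictionaries, and the rule names are all hypothetical.

```python
def build_nsg_update(recorded_state, cloud, tags, refresh):
    """Return the NSG body a tool would send for a tag change.

    recorded_state: what the tool stored after the last deployment
    cloud: what actually exists in Azure right now
    refresh: whether to re-read the cloud copy before building the body
    """
    base = cloud if refresh else recorded_state
    return {"tags": dict(tags), "securityRules": list(base["securityRules"])}


# State recorded at creation time: the NSG was declared with no inline rules.
recorded = {"tags": {}, "securityRules": []}

# What Azure holds once the separate rule resources have been applied.
cloud = {
    "tags": {},
    "securityRules": [{"name": "allow-https"}, {"name": "allow-ssh"}],
}

without_refresh = build_nsg_update(recorded, cloud, {"env": "dev"}, refresh=False)
with_refresh = build_nsg_update(recorded, cloud, {"env": "dev"}, refresh=True)

# Number of rules that would survive the update in each case.
print(len(without_refresh["securityRules"]), len(with_refresh["securityRules"]))
```

Without the refresh, the body is built from stale state and carries an empty rule list. With it, the rules that already exist in Azure are carried along in the update.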

Pulumi made this decision to speed up preview and update operations, and in some respects, it makes sense. The refresh action can take a long time. If you are confident that your infrastructure is solely being managed by Pulumi and no one is making changes outside it (or, if they are, you want to overwrite those changes), then there should be no need to refresh. However, because of the way Microsoft designed the API, refreshing now becomes critical to ensuring your deployments are consistent.

In my view, because of this flaw in the API, it would be good to make running refresh the default, so as not to catch you out. Having NSG rules, or other sub-objects like subnets, disappear when you’re not expecting them to, and with this not highlighted in your preview, is at best confusing and at worst dangerous. Having the option to opt out of refreshing when you know you don’t need it seems a better approach.

The Solution

So, how do we stop this issue from happening? There are two ways to resolve it:

  1. Stop using top-level resources for things that can also be defined as sub-resources, and always declare them as inline child resources

  2. Run refresh every time by appending the -r flag to Pulumi commands

    pulumi preview -r
    pulumi up -r
    

I’ve gone with option 2, to give me more flexibility, but it’s not ideal as I now need to make sure anyone running these templates uses the refresh flag. If they forget and don’t check the resources, they may never find out that their NSG rules got removed.
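One way to reduce the risk of someone forgetting the flag is to run Pulumi through a small wrapper that adds it. The following is a minimal sketch in Python, not something from Pulumi itself: `with_refresh` and `run_pulumi` are hypothetical helper names, and it assumes the subcommand (`preview` or `up`) is the first argument and that the refresh flag is written as `-r`, `--refresh`, or `--refresh=<value>` on its own rather than combined with other short flags.

```python
import subprocess

REFRESH_FLAGS = ("-r", "--refresh")
REFRESHABLE = ("preview", "up")


def with_refresh(args):
    """Return pulumi arguments with --refresh added for preview and up."""
    args = list(args)
    # Leave other subcommands (stack, config, ...) untouched.
    if not args or args[0] not in REFRESHABLE:
        return args
    # Respect a refresh flag the caller already supplied.
    if any(a in REFRESH_FLAGS or a.startswith("--refresh=") for a in args):
        return args
    return args + ["--refresh"]


def run_pulumi(args):
    """Run the real pulumi CLI with the (possibly amended) arguments."""
    return subprocess.call(["pulumi", *with_refresh(args)])
```

Calling `run_pulumi(["up"])` would then run `pulumi up --refresh`, while an invocation that already passes `-r` is left as it is. This only helps if everyone goes through the wrapper, so it reduces the risk rather than removing it.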

There is a GitHub issue open with Pulumi to look at making refresh the default action. If you’ve been affected by this issue, I would recommend you comment on it. That said, I see this more as an Azure issue than a Pulumi one, so hopefully the API can be fixed in the future to prevent this being an issue in the first place.