How and when to ignore lifecycle changes in Terraform
Introduction
Terraform is a very useful tool for managing infrastructure as code (IaC). It allows us to define our desired state in a common language (HCL or JSON); the Terraform engine then parses this desired state, compares it to the actual state, and makes the necessary updates to our environment(s). Arguably the biggest benefit of Terraform is this delta-based approach, whereby we only ever create, update, or destroy the minimum number of things necessary and keep our environment consistent.
But what happens when Terraform doesn’t work quite the way you expect it to? Just why does Terraform sometimes cause your environments to break? And, more importantly, when and how can you tell Terraform to ignore things you know to be safe?
As with all my blog posts, if you are in a rush and want the TL;DR version of this blog then you’ll find that header down at the bottom for your convenience 🙂
It’s All About State
Ok, so before we go any further I just want to make sure that we cover the basics of how Terraform works and how it decides what to (and what not to) do.
In the simplest of terms, Terraform is a state-based engine. When Terraform runs, it first looks at what you want (desired state) and scans what is in your cloud platform (actual state). Once it knows both, the next thing Terraform does is calculate a simple delta:
- If something exists in your desired state, but not in actual state, it gets created.
- If something exists in both desired state and actual state, but is configured differently, it gets updated.
- If something exists in actual state (and Terraform is tracking it in its state file) but not in desired state, it gets deleted.
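To make the three rules concrete, here is a minimal sketch of a desired state containing a single resource. It assumes the azurerm provider; the resource name, location, and tag are hypothetical and not taken from the original project. The comments describe what a plan would propose in each situation.

```hcl
terraform {
  required_providers {
    azurerm = {
      source = "hashicorp/azurerm"
    }
  }
}

provider "azurerm" {
  features {}
}

# Desired state: one resource group (hypothetical name, location, and tag).
resource "azurerm_resource_group" "example" {
  name     = "example-rg"
  location = "westeurope"

  tags = {
    owner = "platform-team"
  }
}

# What a plan would propose:
# - Not in Azure yet: create the resource group.
# - In Azure, but its tags differ from the block above: update it in place.
# - Block removed from the .tf files while Terraform still tracks the
#   resource in its state file: destroy it.
```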
Seems simple in principle right? But sadly sometimes it is this simple state based mechanism employed by Terraform that can cause us some massive headaches.
When is a change not a change?
Ok, let’s assume we have a hypothetical scenario where we write some Terraform and deploy it:
This is a very simplified diagram, but the point I’m trying to convey is that usually you will have your desired state defined in .tf files. Terraform also knows what it deployed last time, because it persisted that information somewhere in a .tfstate file. If this was your first ever deployment then that file wouldn’t exist and Terraform creates it. Then in our cloud provider (Azure) we have the actual state: the real manifestation of the desired state.
Now let’s zoom into a thin slice of this and explore a bit further. Here is a hypothetical example of our logicapp.tf file:
This HCL in theory will deploy our hypothetical Logic App. I’ve added some smarts to load the ARM template of the Logic App into a local variable, and then use the MD5 of the file as the name for the template deployment. Now this file is simple enough, but let’s examine line 9:
tags = ["test"]
So why have I asked you to consider this line? It’s because this attribute is a likely example of something that could be changed manually, or be set automatically in Azure, without Terraform knowing anything about it.
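The original listing is not reproduced here, so the following is only a sketch of what such a logicapp.tf could look like, under some assumptions: the azurerm provider is configured elsewhere, the resource names, resource group, location, and template path are hypothetical, and the line numbers will not match the original file. The azurerm provider expects tags as a map of strings, so the tag is written as a key/value pair rather than the list form quoted above.

```hcl
locals {
  # Hypothetical path to the exported ARM template of the Logic App.
  arm_template = file("${path.module}/logicapp.json")
}

resource "azurerm_logic_app_workflow" "example" {
  name                = "example-logic-app"
  location            = "westeurope"
  resource_group_name = "example-rg"

  # The attribute under discussion: something outside Terraform
  # (a person, a script, an Azure Policy) may change it later.
  tags = {
    environment = "test"
  }
}

resource "azurerm_resource_group_template_deployment" "example" {
  # The MD5 of the template becomes part of the deployment name,
  # so a changed template produces a new deployment.
  name                = "logicapp-${md5(local.arm_template)}"
  resource_group_name = "example-rg"
  deployment_mode     = "Incremental"
  template_content    = local.arm_template

  depends_on = [azurerm_logic_app_workflow.example]
}
```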
I’m going through changes 🎶
Let’s assume we have deployed our Terraform shown above, and our Logic App is all deployed and is working just the way we expect it to. Next, along comes some automatic process or Azure Policy that puts a number of tags onto our Logic App. Tags are just metadata, so they don’t really affect the resources themselves, but they make admin much easier. But what happens when we next deploy our code?
Well, the first thing that happens is Terraform tries to figure out the delta between what we have and what we want. Remember: our local configuration doesn’t include those extra tags, but the state of the resource in Azure says it has them! So what does Terraform do?
Turns out Terraform will do exactly what it thinks we want. We never asked for those tags, so it will remove them. Whoops!
So what’s the big deal?!?!? It’s just tags
Ok, so I will admit that the example I gave above is quite contrived, but what about when the potential changes Terraform can make are more destructive?
Well it turns out that Logic Apps and their ARM templates are actually quite complicated. There are a lot of automatically generated parameters that Azure magically handles for us under the hood. A prime example of this is when you are using Secure connection parameters. From this example please consider this snippet:
The snippet, taken from the example page, shows how secure connection parameters embed themselves into the ARM template as properties with the name $connections.
So our Logic App is deployed via Terraform, and we make an update to the schema… what happens now when we try to reconcile the state? Well, before we go there, let’s take a look at our Terraform tfstate file:
Hang on a minute! Where did that parameters thing come from on line 20? And why does it contain $connections set to an empty string?!?!
This is one of the many quirks of Terraform you will likely hit the more you use this technology. Remember what I said earlier about the state reconciliation step: during deployments, Terraform pulls back the state of the objects it provisions and stores it in its state file. So to answer our question, this is the actual state representation of the Logic App we deployed. Whether we asked for it or not, this resource has some properties. Our running Logic App in Azure needs these connections to be able to work (let’s say it uses them to connect to another service such as a blob store), but now we have a problem:
Same as before, Terraform is going to reconcile state, decide that we no longer want those $connections and wipe them out… so suddenly bang goes our previously working Logic App through no fault of our own!!!
So how can I avoid this????
Well luckily, dear reader, help is at hand in the form of a little-known Terraform feature called the “lifecycle” meta-argument (I know…. really catchy name right????)
Using this meta-argument allows us to override the default behaviour of Terraform in a number of ways, but the main one we are interested in here is the ignore_changes argument:
The ignore_changes feature is intended to be used when a resource is created with references to data that may change in the future, but should not affect said resource after its creation. In some rare cases, settings of a remote object are modified by processes outside of Terraform, which Terraform would then attempt to "fix" on the next run.
So what does this look like in our code? Well, if we make the necessary changes to our example Terraform HCL we should end up with something like this:
As you can see, by adding the lifecycle argument to our resource definition, we can basically prevent Terraform from ever making further changes to the parameters or tags associated with the Logic App. You can think of this as a “once and only once” action now. The first time we run terraform apply, the Logic App gets created with the values we asked for. On later runs Terraform will still update everything else to do with the Logic App as normal, but if the tags or parameters change, those changes will be ignored.
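For reference, a minimal sketch of the Logic App resource with the lifecycle block added might look like the following. The resource name, resource group, location, and tag are hypothetical; the lifecycle block and its ignore_changes list are the relevant part.

```hcl
resource "azurerm_logic_app_workflow" "example" {
  name                = "example-logic-app"
  location            = "westeurope"
  resource_group_name = "example-rg"

  tags = {
    environment = "test"
  }

  lifecycle {
    # Set these on create, then leave whatever Azure (or anyone else)
    # puts there alone on later plans and applies.
    ignore_changes = [
      parameters,
      tags,
    ]
  }
}
```

With this in place, a later plan should no longer propose removing the $connections parameter or any tags that were added outside of Terraform.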
Important
I would like to be really clear here on a couple of points:
- The parameters mentioned here are not the same ones passed into the ARM template deployment, so you are safe to change those and they will get deployed.
- This is not an ideal situation, and seems to be an odd quirk between the Terraform behaviour and how some resources/providers function.
- Be careful where and when you use this lifecycle argument; I would highly recommend that it should be considered a last resort to prevent breaking changes.
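One way to keep the blast radius small is to ignore only what you have to. Terraform lets ignore_changes reference individual map elements using index notation, so a sketch like the one below ignores a single tag key while still managing all other tags. The tag key CostCenter is a hypothetical example of a tag applied by an Azure Policy, and the other names are placeholders as before.

```hcl
resource "azurerm_logic_app_workflow" "example" {
  name                = "example-logic-app"
  location            = "westeurope"
  resource_group_name = "example-rg"

  tags = {
    environment = "test"
  }

  lifecycle {
    ignore_changes = [
      # Still ignore the provider-populated parameters as a whole.
      parameters,
      # Ignore only the tag that is set outside Terraform (hypothetical key);
      # changes to any other tag are still detected and reconciled.
      tags["CostCenter"],
    ]
  }
}
```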
Another Example
So far I’ve shown what can happen when Terraform accidentally leaves our working infrastructure in a broken state. But what about when Terraform makes changes that cause intermittent outages and performance issues?
Take for example the following Terraform HCL:
This was a real example from a project I recently worked on. I’ve already included the necessary ignore_changes directive, so it should begin to give you a hint as to what the issue was.
The above configuration snippet was designed to provision some subnets within a VNet, and one of the things we needed to place in a subnet was an Azure Container Instance (ACI) that was acting as a self-hosted Azure DevOps (AzDO) build agent. In order to allow our ACI to correctly join our VNet and have the necessary permissions to access things, we give the subnet some service_delegation properties.
Similar to what happens with our Logic App however, once this resource has been created, those delegations get changed behind the scenes by Azure. This means when we run our Terraform deployment again our subnet gets changed each time, resulting in our ACI losing the VNet integration for a brief window while the subnet configurations are updated.
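The project's actual listing is not reproduced here, so the following is only a sketch of a delegated subnet with the ignore_changes directive in place. The subnet, VNet, and resource group names and the address range are hypothetical, and the delegation shown is an assumption based on the usual setup for Azure Container Instances.

```hcl
resource "azurerm_subnet" "build_agents" {
  name                 = "build-agents"
  resource_group_name  = "example-rg"
  virtual_network_name = "example-vnet"
  address_prefixes     = ["10.0.1.0/24"]

  # Delegate the subnet so container groups can be deployed into it.
  delegation {
    name = "aci-delegation"

    service_delegation {
      name    = "Microsoft.ContainerInstance/containerGroups"
      actions = ["Microsoft.Network/virtualNetworks/subnets/action"]
    }
  }

  lifecycle {
    # Azure adjusts the delegation after creation; ignoring it stops
    # Terraform from rewriting the subnet on every apply.
    ignore_changes = [delegation]
  }
}
```

If ignoring the whole delegation block feels too broad, a narrower address such as delegation[0].service_delegation[0].actions is worth trying first.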
This is a perfect example of a transient issue caused by Terraform making unexpected changes, which can cause us quite a bit of head scratching while we figure out why our build agent is suddenly telling us that it has no access to any of our network resources such as storage etc.
TL;DR
- Sometimes, through no fault of our own, Terraform can break our infrastructure.
- Terraform does its best to reconcile our desired state and the actual state of our cloud provider, but sometimes this can have adverse effects.
- When resources are deployed, cloud resource providers can add additional properties/attributes, behind the scenes, without our knowledge.
- Most of these additional properties/attributes are completely benign, but others can impact functioning infrastructure or devops processes.
- When you want Terraform to ignore changes between subsequent apply commands you can use the lifecycle ignore_changes meta-argument.
- The ignore_changes argument means that Terraform will set the value when the resource is first deployed and then forever ignore any changes to it.
- The lifecycle meta-argument has a lot of interesting functionality in addition to ignore_changes and I’d highly recommend you give it a try.
- Check your Terraform plan and apply output to see that you are updating/changing only the things you expect to see.
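As a pointer to that other lifecycle functionality, here is a short sketch of three options you are likely to come across. The resource and its names are hypothetical; the arguments themselves (prevent_destroy, create_before_destroy, and the all keyword for ignore_changes) are part of Terraform's lifecycle block.

```hcl
resource "azurerm_resource_group" "example" {
  name     = "example-rg"
  location = "westeurope"

  lifecycle {
    # Reject any plan that would destroy this resource.
    prevent_destroy = true

    # When a replacement is unavoidable, create the new object before
    # destroying the old one. Left at the default here, because a second
    # resource group with the same name could not be created.
    create_before_destroy = false

    # The bluntest form: ignore changes to every attribute after creation.
    # Terraform can still create and destroy the resource, but will never
    # propose an update to it.
    ignore_changes = all
  }
}
```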
Further Reading
- Terraform breaking Azure Logic App connections — Stack Overflow — This StackOverflow post was the original lead which helped me to discover this issue and the fix
- https://www.terraform.io/docs/language/meta-arguments/lifecycle.html — Terraform Lifecycle documentation, a very good place to start reading up on things
- Overview — Automate deployment for Azure Logic Apps — Azure Logic Apps | Microsoft Docs — If you are planning on doing deployments of Azure Logic Apps via IaC then this is the page you need to read