Human-in-the-loop infrastructure remediation

AUTHOR
Chris Reuter
PUBLISH DATE
March 17, 2025

Picture this: you are a security team, and you just adopted a posture management tool like Wiz. You hooked it up to all of your AWS accounts, and ran a scan.

BAM! You’ve got 1,588 vulnerabilities that you need to fix. Some of these don’t apply to infrastructure, and you can remediate them without worry. However, most of them are likely related to cloud infrastructure. If you are using Terraform, how are you going to remediate these misconfigurations?

Why not auto-remediation

Before infrastructure as code became popular, automatic remediation tools started to take hold. It makes total sense: if I have a vulnerability, why not fix it?

AWS Config, Cloud Conductor, and CSPMs like Wiz and Orca can automatically remediate by simply hitting the cloud provider’s API and making the change.

There are two big problems with automated remediation that follows this pattern:

  1. Infrastructure changes are now handled with infrastructure as code (Terraform, OpenTofu, Pulumi)
  2. The definition of misconfiguration is debatable

How Terraform works

First, let’s cover how Terraform works. If you’re a Terraform expert, skip to the next section 👇

Terraform is declarative infrastructure as code. With it, developers can define their cloud infrastructure using code - instead of manually setting up resources in a cloud provider’s console, they write configuration files that describe their desired environment. Terraform then translates this declarative configuration into API calls, managing resources consistently.

Terraform can integrate seamlessly into a CI/CD pipeline with the terraform plan and terraform apply commands. These allow organizations to preview changes, review them, and enact them in an automated way. Using declarative infrastructure as code ensures that changes to your cloud environments are version-controllable, auditable, and repeatable - reducing the risk of manual errors and improving overall deployment reliability.

During deployment, Terraform evaluates the current state of your infrastructure and compares it to the desired state defined in your configuration files. It creates an execution plan detailing the changes required and then applies these changes in a controlled manner, updating its state file to reflect the new configuration. This state management is critical, as it helps Terraform track resources across multiple deployments, ensuring that subsequent changes are applied accurately.
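To make the plan step concrete, here is a toy sketch of the comparison in Python. It is not Terraform’s actual algorithm: the resource addresses, attributes, and the `plan` and `apply` helpers are all made up for illustration, and real Terraform also handles dependencies, providers, and refresh.

```python
from typing import Any

# Resource address -> attributes. A stand-in for both config and state.
Resources = dict[str, dict[str, Any]]


def plan(desired: Resources, current: Resources) -> list[tuple[str, str]]:
    """Return (action, address) pairs needed to move current toward desired."""
    actions: list[tuple[str, str]] = []
    for address in sorted(desired.keys() | current.keys()):
        if address not in current:
            actions.append(("create", address))
        elif address not in desired:
            actions.append(("delete", address))
        elif desired[address] != current[address]:
            actions.append(("update", address))
    return actions


def apply(desired: Resources, current: Resources) -> Resources:
    """Pretend to apply the plan; the returned dict stands in for the new state."""
    for action, address in plan(desired, current):
        print(f"{action}: {address}")
    return {address: dict(attrs) for address, attrs in desired.items()}


# Hypothetical configuration (desired) and recorded state (current).
desired = {
    "aws_s3_bucket.logs": {"versioning": True},
    "aws_instance.web": {"instance_type": "t3.small"},
}
current = {
    "aws_instance.web": {"instance_type": "t3.micro"},
    "aws_instance.old": {"instance_type": "t2.micro"},
}

state = apply(desired, current)
```

Once the new state matches the configuration, planning again yields no actions. Anything that changes the live infrastructure without going through this loop breaks that guarantee.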

Terraform drift

When configuration changes are made directly through cloud APIs, as is the case with automated remediation, the result is Terraform drift: Terraform state no longer matches the infrastructure that is actually deployed. AWS Config, Cloud Conductor, and CSPMs are not aware of infrastructure as code state, so every misconfiguration they “fix” automatically creates drift immediately.

Ironically, many companies also use drift remediation tools that automatically bring cloud resources that have “drifted” back in line with the declared infrastructure as code. This puts your teams into a never-ending cycle: misconfigurations are remediated but your state is unaware, then drift is remediated and the misconfigurations are reintroduced, on and on and on.

flowchart LR
    A["Monitor Infrastructure"]
    B["Detect Misconfigurations"]
    C["Automated Misconfiguration Remediation"]
    D["Assess Infrastructure Drift"]
    E["Automated Drift Remediation"]

    A --> B
    B --> C
    C --> D
    D --> E
    E --> A
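The cycle can be simulated in a few lines of Python. This is only an illustration: the `http_tokens` setting, the two tool functions, and the values are hypothetical stand-ins for a scanner that fixes live resources and a drift tool that reverts them.

```python
# What the Terraform code declares (hypothetical example setting).
DECLARED = {"http_tokens": "optional"}
# What the security scanner's policy wants.
POLICY = {"http_tokens": "required"}


def auto_remediate(live: dict) -> dict:
    """Scanner-side fix: change the live resource directly, bypassing Terraform."""
    return {**live, **POLICY}


def remediate_drift(live: dict) -> dict:
    """Drift tool: force the live resource back to the declared configuration."""
    return {**live, **DECLARED}


def simulate(cycles: int) -> list[str]:
    """Record the live setting after each tool takes its turn."""
    live = dict(DECLARED)
    history = [live["http_tokens"]]
    for _ in range(cycles):
        live = auto_remediate(live)
        history.append(live["http_tokens"])
        live = remediate_drift(live)
        history.append(live["http_tokens"])
    return history


print(" -> ".join(simulate(2)))
# optional -> required -> optional -> required -> optional
```

Neither tool is wrong on its own terms. The setting flips forever because the fix never reaches the declared configuration.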

Some configuration isn’t mis

Automated remediation typically doesn’t take into account context, which is often embedded into the brains of the individuals who wrote the infrastructure as code in the first place. There are many examples of misconfiguration that may not be appropriate to automatically remediate. Let’s consider a few:

One common finding concerns IMDSv2 compliance, which requires session tokens when querying AWS instance metadata. However, any scripts, applications, or other tooling that access instance metadata must first be updated to use session tokens. If you automatically switch your EC2 instances from IMDSv1 to IMDSv2, you risk turning a “misconfiguration” into a “production incident”.
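To show what callers have to change, here is a minimal Python sketch of the two request styles. The endpoint and header names are the ones AWS documents for the instance metadata service; the helper functions are our own, and `fetch` only works from inside an EC2 instance.

```python
import urllib.request
from typing import Optional

IMDS = "http://169.254.169.254"


def token_request(ttl_seconds: int = 21600) -> urllib.request.Request:
    """IMDSv2 step 1: PUT for a session token, with a TTL header."""
    return urllib.request.Request(
        f"{IMDS}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def metadata_request(path: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build a metadata GET. Without a token this is the IMDSv1 style,
    which is rejected once the instance requires IMDSv2."""
    headers = {"X-aws-ec2-metadata-token": token} if token else {}
    return urllib.request.Request(f"{IMDS}/latest/meta-data/{path}", headers=headers)


def fetch(path: str) -> str:
    """Full IMDSv2 flow: get a token, then send it with the metadata request."""
    with urllib.request.urlopen(token_request(), timeout=2) as resp:
        token = resp.read().decode()
    with urllib.request.urlopen(metadata_request(path, token), timeout=2) as resp:
        return resp.read().decode()
```

Every script that today does the equivalent of `metadata_request("instance-id")` with no token has to move to the two-step flow before the instance is switched over.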

Another example is overly permissive IAM. Imagine automatically deleting every IAM policy that allows actions on all resources (resource = *). While such a policy is objectively NOT a best practice, it could also be powering critical applications or services.
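A safer pattern is to flag these statements for a human rather than delete them. The Python sketch below assumes a policy document in the standard IAM JSON shape; the example policy, its `Sid` values, and the helper names are invented for illustration.

```python
def as_list(value):
    """IAM allows a single value or a list in several places; normalize to a list."""
    return value if isinstance(value, list) else [value]


def wildcard_statements(policy: dict) -> list[dict]:
    """Return Allow statements whose Resource includes "*". Nothing is deleted."""
    flagged = []
    for statement in as_list(policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        if "*" in as_list(statement.get("Resource", [])):
            flagged.append(statement)
    return flagged


# Hypothetical policy: one scoped statement, one wildcard statement.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadLogs",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-logs/*",
        },
        {
            "Sid": "LegacyBatchJob",
            "Effect": "Allow",
            "Action": "s3:*",
            "Resource": "*",
        },
    ],
}

for statement in wildcard_statements(policy):
    print(f"needs review: {statement['Sid']}")
```

The output is a review queue, not a change. Someone who knows what the batch job touches can then narrow the resource list safely.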

In both of these cases, automatically applying configuration changes is dangerous and possibly business-impacting.

Manual remediation

For the reasons outlined above, most companies favor manual remediation of infrastructure misconfiguration. However, this is a time-consuming and painful exercise in its own right.

Project management

Suppose you don’t have automatic remediation configured. Congratulations, you no longer need to worry about it causing drift. Instead, you are now responsible for a gigantic project management exercise. You must:

  1. Identify misconfiguration that needs review from your CSPM
  2. Dump those into a spreadsheet, or create Jira tickets…or both!
  3. Classify them
  4. Identify all of the resources impacted
  5. If you’re nice, find the exact code that is broken and a suggested fix
  6. Find the responsible party and distribute to them
  7. Track status and follow-up, either by continuing to scan or
  8. Manage exceptions and track context based on each developer’s knowledge locked in their brain

Not only is this work painful and time-consuming, but it can result in negligence. Security teams, leaders, and CISOs are accountable for remediating vulnerabilities. Manual remediation can turn into a black hole of nothingness, where vulnerabilities linger and eventually cause incidents and data breaches.

Developer time + expertise

From the developer perspective, remediation is also a painful experience. Connecting a Jira ticket (or spreadsheet entry) to the code that needs to change isn’t straightforward. After that, knowing what changes to make is a puzzle. Developers need to research what good looks like, and understand how the change will impact downstream applications and their performance.

This can add up: with 100 misconfigurations per month at 4 hours each, and a developer annual salary of $170,000, a company is wasting almost $400,000 annually in misconfiguration triage and remediation.
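The arithmetic behind that figure can be reproduced as follows. The 2,080 working hours per year (40 hours for 52 weeks) is our assumption for converting salary to an hourly rate; the other inputs come from the paragraph above.

```python
FINDINGS_PER_MONTH = 100
HOURS_PER_FINDING = 4
ANNUAL_SALARY = 170_000
WORK_HOURS_PER_YEAR = 40 * 52  # assumption: 2,080 working hours per year

hours_per_year = FINDINGS_PER_MONTH * HOURS_PER_FINDING * 12
hourly_rate = ANNUAL_SALARY / WORK_HOURS_PER_YEAR
annual_cost = hours_per_year * hourly_rate

print(f"{hours_per_year:,} hours per year")  # 4,800 hours per year
print(f"${hourly_rate:,.2f} per hour")       # $81.73 per hour
print(f"${annual_cost:,.0f} per year")       # $392,308 per year
```

Put differently, 4,800 hours is more than two full-time developers doing nothing but triage and remediation.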

Human-in-the-loop remediation

We’ve found human-in-the-loop remediation to be the most effective way to fix misconfigured infrastructure. So much so that we built a tool for it: Resourcely Campaigns.

Human-in-the-loop remediation combines the best of both worlds: automating the hard parts of manual remediation, while avoiding the pitfalls of automated remediation. This is done by:

Scanning combined with Terraform remediation

Resourcely can scan your Terraform state, and help you generate state for existing cloud resources that don’t have it. Scan against common vulnerabilities, or against a single targeted policy.
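As a rough idea of what scanning state against a single targeted policy involves, here is a Python sketch. It is not how Resourcely works internally. The input is modeled on the JSON that `terraform show -json` emits, and the exact layout of `metadata_options` for `aws_instance` is an assumption; the sample resources are invented.

```python
def iter_resources(module: dict):
    """Yield resources from a module and, recursively, its child modules."""
    yield from module.get("resources", [])
    for child in module.get("child_modules", []):
        yield from iter_resources(child)


def imds_v2_violations(state: dict) -> list[str]:
    """Return addresses of aws_instance resources that do not require IMDSv2."""
    violations = []
    root = state.get("values", {}).get("root_module", {})
    for resource in iter_resources(root):
        if resource.get("type") != "aws_instance":
            continue
        # Assumed shape: metadata_options is a list holding one block.
        options = resource.get("values", {}).get("metadata_options") or [{}]
        if options[0].get("http_tokens") != "required":
            violations.append(resource["address"])
    return violations


# Hypothetical, heavily trimmed state document.
state = {
    "values": {
        "root_module": {
            "resources": [
                {
                    "address": "aws_instance.web",
                    "type": "aws_instance",
                    "values": {"metadata_options": [{"http_tokens": "optional"}]},
                },
                {
                    "address": "aws_s3_bucket.logs",
                    "type": "aws_s3_bucket",
                    "values": {},
                },
            ],
            "child_modules": [
                {
                    "resources": [
                        {
                            "address": "module.api.aws_instance.app",
                            "type": "aws_instance",
                            "values": {"metadata_options": [{"http_tokens": "required"}]},
                        }
                    ]
                }
            ],
        }
    }
}

print(imds_v2_violations(state))  # ['aws_instance.web']
```

Because the result is a list of Terraform resource addresses rather than cloud resource IDs, each finding already points at something a developer can look up in code.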

Secret sauce: mapping policies to Terraform

After scanning, each violation is mapped to the Terraform that triggered it.

IDE with suggested remediation

Developers are given a development environment that shows the exact Terraform file and line(s) of code that caused a vulnerability. The exact policy with a suggested fix is displayed inline with code - reducing cycle times and manual research.

Remediation through infrastructure as code pipelines

Version control pull or merge requests are automatically created, checked for other violations as part of your pipeline, and submitted for review. This maintains the integrity of your Terraform state, so that remediation isn’t causing drift.

Integrated project management

Violations can be tracked and assigned, giving security teams the ability to report on progress at any moment.

Support for context and exceptions

Collect context and update violations immediately, or allow developers with specialized knowledge to request exceptions for application-breaking changes.

Conclusion

Infrastructure remediation is not a perfect science. The two primary methods today (manual and automated) result in wasted time and potentially breaking changes. Human-in-the-loop remediation gives security teams and developers the best of both worlds: infrastructure gets remediated faster, without breaking your Terraform environment.

Try out Resourcely Campaigns today at https://www.resourcely.io/register!
