Just another GPT wrapper

AUTHOR
Ryland Goldstein
PUBLISH DATE
August 13, 2024

It's weird out there

AI is a weird space to be in right now. Some argue that the entire problem has already been solved by ChatGPT, and any new offering is simply a vaporware wrapper around it. Others believe we haven’t reached anything notable yet and that the majority of interesting work is yet to come—"ChatGPT isn’t even intelligent; it’s just fancy statistics."

AI isn't the first hype-y trend to dominate the news, even in the last 3 years (cough cough crypto), but what makes it unique is the meta-ness of the discussions around it. The group hyping it seems unusually self-aware of the negative public perceptions around AI and tries to combat this by being self-deprecating and overly critical. Even within the community, it’s hard to tell whether the latest AI offering is something you could replicate with ChatGPT in a few hours or if it’s an impressive feat of human engineering.

As with most things, the truth is somewhere in the middle. Hopefully, we can elucidate that in this post by exploring an experiment conducted within Resourcely, a startup focused on secure and manageable infrastructure provisioning. Our goal was to flip the experience of creating infrastructure on its head, using AI. If we were successful, we could make infrastructure dramatically more secure and easier to deploy. Worst case, we would end up with a fairly large OpenAI bill.

Setting some goals

Before we dive into the AI specifics, I'll start by setting some context about Resourcely, myself, and the goals and motivations of the experiment we conducted. Resourcely is a startup dedicated to solving the holistic problem of defining, creating, and provisioning secure and manageable infrastructure. For many developers, it’s impossible to separate the application logic they write from the infrastructure that underlies it. As a result, developers are often responsible for designing and defining the infrastructure topology underlying their services.

The most popular modern solution for this is Terraform, which enables users to create declarative Infrastructure as Code. Terraform is great, and developers can usually create something that functionally works with it, but there is a big gap between functionally working and being practically correct. Specifically, it’s very easy for developers to create flawed Terraform that introduces security holes and leaks sensitive data.

Resourcely’s product solves this problem in two main ways:

  1. Guardrails: Providing security and ops teams with a set of controls and tools to prevent insecure Terraform from ever making it through a PR into production.
  2. Blueprints: Giving developers higher-level building blocks that still provide them the control they need, without exposing complexities that don’t add value and are ripe for misconfiguration.

Generally, the solution works fabulously, and Resourcely customers are very happy on both the security and development sides of the equation. However, the current solution still requires developers to understand infrastructure details that don’t actually matter to them.

It's worth expanding on this point: our thesis is that developers don’t care about the details of infrastructure; they care about the guarantees it provides and how it affects their application logic. Developers tend to think in terms of consistency, idempotence, and other similar concepts, while the infrastructure considerations that create those conditions are not well understood. In a perfect developer world, they would never write a line of Terraform. They would write their application logic and describe the guarantees and behavior they need to run it in plain English.

Frankly, that isn’t what Resourcely offers today. But what if it did?

An experiment emerges

It's a good opportunity to introduce myself. I’m Ryland Goldstein, a product leader, advisor, investor, and engineer in the startup ecosystem with a background in applied mathematics and machine learning. I’ve been friends with Travis, the CEO and co-founder of Resourcely, for some time, so when he started Resourcely, he asked me to come on as a product advisor.

A few months back, Travis and I were hanging out discussing AI (as two Silicon Valley types do in their free time), and an idea emerged that we thought could fundamentally accelerate the progress of Resourcely and massively improve the user experience. We imagined a world where developers never think about infrastructure, never write Terraform, and simply describe what they need in plain English—and everything else is done for them.

This is obviously a lofty goal, but some practical work I had been recently doing around LLM-based applications made us think it could actually work. So we decided to run an experiment.

The initial idea of the experiment was to drastically change the developer UX of Resourcely. Instead of having developers assemble groups of blueprints, we would have them describe their intent in plain English, and we would do the rest for them. We’re acutely aware of the limitations of modern AI, especially the non-determinism. In other contexts, this might be a showstopper, but Resourcely’s design offers a unique advantage with this dynamic.

Resourcely’s engine is built to synthesize and validate IaC like Terraform. This means we already had a way to guarantee that any Terraform we generate is syntactically correct (at minimum). Furthermore, Resourcely’s guardrails already go beyond basic syntax and make it possible for security and ops teams to define generalized rules about what is and isn’t allowed within Terraform and, by extension, the cloud. They also enable things to be automatically set on behalf of the developer, such as important tags on cloud resources.

These factors massively reduce the complexity of the solution to our problem. In fact, we realized that as long as the solution could eventually (with enough iterations) generate the right thing, we would be able to simply keep retrying it until it did. That meant the bulk of the problem was centered around having an LLM generate Terraform that matched the needs of the user.
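
To make the retry idea concrete, here is a minimal sketch of that loop in Python. The `generate_terraform` and `validate_terraform` callables are hypothetical stand-ins (Resourcely's actual engine and prompts are not shown); the point is only that a deterministic validator lets you keep regenerating until something passes.

```python
from typing import Callable, Optional

def generate_until_valid(
    intent: str,
    generate_terraform: Callable[[str], str],    # hypothetical: LLM call returning HCL text
    validate_terraform: Callable[[str], bool],   # hypothetical: deterministic syntax + guardrail check
    max_attempts: int = 5,
) -> Optional[str]:
    """Keep regenerating until the deterministic validator accepts the output."""
    for _ in range(max_attempts):
        candidate = generate_terraform(intent)
        if validate_terraform(candidate):
            return candidate
    return None   # the caller decides how to surface repeated failure
```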

Should be easy, right? Just some simple prompt engineering?

Won’t AI just generate Terraform for me?

One of my biggest frustrations with the perception of modern AI and LLMs stems from the immense gap between something working in a straightforward, limited demo and actually being usable in a real-life scenario. If there’s one thing ChatGPT is great at, it’s instilling false confidence in its ability to reliably achieve certain tasks. This is why many products are dismissed as "just a wrapper around ChatGPT."

Users see some LLM-based functionality advertised by a product and often think, "I could just have ChatGPT do that," and in some ways, they aren’t wrong. However, what I’ve noticed is that users often:

  • Simplicity fallacy - Validate with an unusually simple prompt that doesn’t take the entire domain problem space into account.
  • AI is smarter - Don’t have the domain expertise to ensure the generated artifact is correct.
  • Reproducibility fallacy - Forget that LLMs (at least GPT) are non-deterministic, and that getting the answer right once does not mean they will get it right again.

To understand this, let’s dive into using ChatGPT to generate Terraform based on a user's intent (prompt). While the LLM is trained quite well on the syntax of Terraform, it struggles considerably with the valid provider-specific values for various Terraform properties (like AWS instance types). This is partly because they are dynamic and could have changed after the training period!

Furthermore, Terraform projects are often composed of hundreds of resources (if not more). These resources do not usually exist in isolation, but are highly referential. This means that in addition to knowing the correct Terraform syntax and valid values for provider-specific Terraform properties, the LLM also needs to understand all other resources defined in the project and how they relate to each other. Therefore, any solution would need to feed the LLM a list of relevant resources/properties and coax it into relying on those resources referentially.
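
As a rough illustration of what "feeding the LLM the relevant resources" could look like, the sketch below embeds existing resource definitions into the prompt and instructs the model to reference them rather than redefine them. The prompt wording and the resource-selection step are assumptions, not Resourcely's implementation.

```python
from typing import Dict

def build_generation_prompt(intent: str, existing_resources: Dict[str, str]) -> str:
    """Embed the Terraform blocks a new resource may need to reference into the prompt.

    existing_resources maps a Terraform address (e.g. "aws_vpc.main") to its HCL block.
    Deciding *which* resources are relevant is its own problem and is not shown here.
    """
    context_blocks = "\n\n".join(
        f"# {address}\n{hcl}" for address, hcl in existing_resources.items()
    )
    return (
        "You are adding resources to an existing Terraform project.\n"
        "Reference the resources below by their addresses instead of redefining them:\n\n"
        f"{context_blocks}\n\n"
        f"User intent: {intent}\n"
        "Return only valid HCL."
    )
```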

In our experience, even when the LLM manages to handle these references in a functionally correct way, it rarely uses Terraform best practices, leading to very disorganized and confusing projects. Worse, sometimes it just completely makes up new best practices nobody asked for. Even then, what happens when the newly generated Terraform plus the existing relevant resources exceeds the LLM's context window? We are also willfully ignoring the cost dynamic, which scales directly with the amount of context you add.

While all of what I just said is true, it’s not even the biggest issue. Even if LLMs could generate perfect Terraform, it wouldn’t be enough, because human intent is ambiguous. Most humans do not have perfect command of their written language, and therefore it’s often impossible to reliably translate that intent into exactly what they expect. This is an area where modern LLMs totally fail—they do not have the “intelligence” to provide an iterative feedback loop to validate what someone is asking for. They usually just make a guess and “hope” for the best.

Any proper solution to this problem needs to have a concept of “doubt,” where it does not assume the user knows what they want and imagines that it may know better. This would also need to be coupled with multiple iterative feedback/confirmation phases to ensure that what is being executed is actually what the user intended.

So to circle back, the answer is that “yes, ChatGPT can generate valid Terraform, but only in straightforward cases where your intent is totally unambiguous and does not depend on any referenced resources.” And you also have to hope the provider values the LLM was trained on are up to date!

A high-level philosophy for AI-based logic

Given the complexities we've discussed, how can we reasonably expect to solve this problem using AI? There are multiple parts to the answer:

  1. Break Things Down: Decompose tasks repeatedly until you get to a point where the LLM can reliably produce what you expect.
  2. Assume Ambiguity: Never assume the LLM can reliably interpret user intent, and never assume the user correctly expressed their intent in all cases.
  3. Rely on Traditional Logic: Whenever possible, use traditional algorithmic logic instead of AI. AI should only be used for parts that absolutely require it.

Deducing intent

Let’s apply this philosophy to our experiment with Resourcely. The flow begins when a user expresses intent via a prompt that will eventually result in a set of materialized Terraform files. The first thing we needed to build was a set of agents and processing logic to deduce what the user’s intent is and whether there is too much ambiguity within the language to proceed without follow-up.

Imagine the following prompt: “I want a website bucket. Arbitrary users should be able to write but not read files in the bucket. Bucket should have versioning enabled.” While it may seem straightforward, immense amounts of ambiguity lurk in this prompt.

First, the user did not explicitly state what cloud provider they want. One might infer AWS from the word "bucket," but there are other platforms that use this language. Furthermore, "bucket" has become a general term for object storage, so clarification from the LLM is almost certainly required to ensure the user wants an AWS S3 bucket.

This prompt also introduces a general dilemma: the purpose of this experiment is to reduce the cognitive load on developers. If you interpret everything as ambiguous, the developer will likely end up doing the same amount or more work compared to the manual approach. For instance, "Arbitrary users should be able to write but not read files in the bucket" could be accomplished using an ACL or other valid mechanisms within AWS.

Our intent deducer handles these situations by breaking down intents into sub-intents and creating heuristics (like confidence) for the LLM to rank the level of ambiguity. If the ambiguity crosses a certain threshold, the LLM will conduct one or more clarification passes with the user to reach a high enough level of confidence in the intent. The amount of specificity required by the developer in a clarification pass corresponds to the amount of confidence the LLM has in the original intent and the importance of the property/resource being targeted. 
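
A minimal sketch of that flow is shown below, with hypothetical data shapes and helpers (`decompose`, `clarify`): the LLM scores each sub-intent, and anything falling below an importance-weighted confidence threshold triggers a clarification pass instead of a guess.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SubIntent:
    description: str    # e.g. "create an S3 bucket for a static website"
    confidence: float   # 0.0-1.0, produced by the LLM alongside its interpretation
    importance: float   # 0.0-1.0, how costly a wrong guess would be for this resource/property

BASE_THRESHOLD = 0.8      # illustrative, not a tuned value
MAX_CLARIFICATIONS = 3    # avoid asking the user forever

def resolve_intent(
    prompt: str,
    decompose: Callable[[str], List[SubIntent]],   # hypothetical LLM pass: prompt -> scored sub-intents
    clarify: Callable[[SubIntent], SubIntent],     # hypothetical: ask the user, then re-score the sub-intent
) -> List[SubIntent]:
    """Break a prompt into sub-intents and clarify any that are too ambiguous to act on."""
    resolved = []
    for sub in decompose(prompt):
        # The more important the resource/property, the less ambiguity we tolerate.
        threshold = BASE_THRESHOLD * sub.importance
        for _ in range(MAX_CLARIFICATIONS):
            if sub.confidence >= threshold:
                break
            sub = clarify(sub)   # one clarification pass with the user
        resolved.append(sub)
    return resolved
```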

Intent is not the only part of the pipeline requiring a feedback loop with the user, as we will see in the next section.

Generating valid output

Assuming we’ve reached a point where we have high confidence in the user’s intent, it’s time to start generating artifacts that represent it. In our case, this has to be a multi-pass solution for several reasons:

  • While we have confidence in user intent, additional inputs might be required based on the materialized Terraform from the intent.
  • Resourcely guardrails impose limitations on Terraform resources, often resulting in additional resources needed to satisfy the guardrail.
  • Resourcely supports a concept of "context," which are user-specified values that can modify guardrail behavior, like environment variables.
  • Resourcely has a knowledge graph that contains the major clouds, their resources, the properties of those resources, and valid syntax/values for those properties.

The first step we took was creating a pipeline phase responsible for mapping high-confidence user intent to a set of proposed infrastructure resources. This may seem simple, but there is lurking complexity. First, we needed to provide the LLM with the value schemas for provider-specific Terraform fields. Just for AWS, the "knowledge graph" of these values is far larger than a 120k-token context window, so we had to break the problem down.

We implemented a pre-pass to deduce which keys in the knowledge graph are relevant based on user intent. Then, we fetched those portions of the knowledge graph and fed only that subset to the LLM along with the user intent. The relevant knowledge graph elements were then used to identify required fields and initiate another interactive feedback loop with the user.

To make this concrete, even though we’ve deduced that “I want a website bucket” means an AWS S3 bucket, the knowledge graph will indicate that the user is required to provide a name for the S3 bucket. There may be many such cases identified by cross-referencing intent against the knowledge graph, so the user needs to address them before proceeding.
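
Here is a hedged sketch of that pre-pass, with all helpers (`select_relevant_keys`, `fetch_subgraph`) assumed rather than real: a cheap pass picks the relevant knowledge-graph keys, only that subset is fetched, and the required-but-unset fields become questions back to the user.

```python
from typing import Callable, Dict, List

def plan_required_fields(
    intent: str,
    all_graph_keys: List[str],                                     # e.g. ["aws_s3_bucket", "aws_iam_role", ...]
    select_relevant_keys: Callable[[str, List[str]], List[str]],   # hypothetical cheap LLM pre-pass
    fetch_subgraph: Callable[[List[str]], Dict[str, dict]],        # hypothetical knowledge-graph lookup
) -> Dict[str, List[str]]:
    """Return, per proposed resource type, the required fields the user still has to supply."""
    relevant_keys = select_relevant_keys(intent, all_graph_keys)
    subgraph = fetch_subgraph(relevant_keys)   # only this subset is ever shown to the LLM

    missing: Dict[str, List[str]] = {}
    for resource_type, schema in subgraph.items():
        required = [
            name for name, prop in schema.get("properties", {}).items()
            if prop.get("required") and "default" not in prop
        ]
        if required:
            missing[resource_type] = required   # e.g. {"aws_s3_bucket": ["bucket"]}
    return missing
```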

Next, we generated initial Terraform artifacts. While ChatGPT is great at generating syntactically correct Terraform, it did a pretty terrible job of producing best-practice, scalable Terraform, or Terraform that meets a specific organization’s standards. To solve this, we rely on Resourcely’s guardrails and their own generative capability: the LLM generates the minimum Terraform that matches user intent, and Resourcely’s guardrails then set the required properties for best practices. In small scenarios, this is easily achieved, but generally, the more numerous your instructions to an LLM, the higher the chance it will not take them all into account. You also quickly find that seemingly arbitrary things like ordering or innocuous wording matter.
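
In code, that division of labor might look like the sketch below: the LLM produces the smallest Terraform that expresses the intent, and a non-AI guardrail step (the hypothetical `apply_guardrails` here) layers on required tags, encryption settings, and other organizational defaults.

```python
from typing import Callable

def generate_with_guardrails(
    intent: str,
    generate_minimal_terraform: Callable[[str], str],   # hypothetical LLM call: smallest HCL matching the intent
    apply_guardrails: Callable[[str], str],             # hypothetical deterministic pass: tags, encryption, naming, ...
) -> str:
    """Keep the LLM's job small; let deterministic guardrails own best practices and org standards."""
    minimal_hcl = generate_minimal_terraform(intent)    # fewer instructions -> fewer ignored instructions
    return apply_guardrails(minimal_hcl)
```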

Once we had an initial artifact, we wanted to gather context (e.g., which environment the infrastructure is for: production, staging, or test). The context a user needs to specify is derived from the set of guardrails that will be applied to the defined infrastructure. This meant we needed to do a low-effort pass to identify relevant guardrails (from the global set of guardrails on a user’s account) for the initial artifact. Then, we collected all valid context from those guardrails and compared it to the set of pre-specified global context on the account. This left us with a set of blocking context that a user would need to specify before proceeding.

This resulted in an interactive feedback loop with the user, presented on the web UI as required form fields. Once we retrieved the context, the next step was to generate an updated set of Terraform based on the relevant guardrails and valid context values.
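
The blocking-context computation itself needs no AI; it is plain set arithmetic over the guardrail definitions. A minimal sketch with hypothetical data shapes:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

@dataclass
class Guardrail:
    name: str
    required_context: Set[str] = field(default_factory=set)   # e.g. {"environment", "cost_center"}

def blocking_context(
    relevant_guardrails: List[Guardrail],
    global_context: Dict[str, str],   # context values already set account-wide
) -> Set[str]:
    """Context keys the user must still supply before generation can proceed."""
    required: Set[str] = set()
    for guardrail in relevant_guardrails:
        required |= guardrail.required_context
    return required - set(global_context)   # anything not covered globally becomes a form field
```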

We’ve now finally reached a point where we might have a valid artifact. We used a series of "checker" agents to conduct a common-sense evaluation of the Terraform to ensure it matched what the user originally wanted and didn’t break any ideological rules. Syntax was not our concern at this point because we relied on Resourcely’s existing non-LLM-based engine for ensuring that our Terraform was valid and adding the relevant best practices, security requirements, etc.
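
The checker pass can be as simple as fanning the finished artifact out to a few narrow "does this match the intent?" prompts and only accepting it if they all pass. A sketch, with the actual prompts and LLM call left as assumptions:

```python
from typing import Callable, List, Tuple

CHECKS = [
    "Does this Terraform create everything the user asked for, and nothing extra?",
    "Is every resource configured sensibly for its stated purpose?",
]

def run_checkers(
    intent: str,
    terraform: str,
    check_passes: Callable[[str], bool],   # hypothetical: poses the question to an LLM, True if it passes
) -> Tuple[bool, List[str]]:
    """Fan out narrow common-sense checks; syntax and guardrails are handled by the non-LLM engine."""
    failures = []
    for question in CHECKS:
        prompt = f"{question}\n\nUser intent:\n{intent}\n\nTerraform:\n{terraform}"
        if not check_passes(prompt):
            failures.append(question)
    return (not failures, failures)
```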

Proposing a future architecture

After getting the experiment mostly working end-to-end, we were happy to say that the proof of concept was validated, at least at a high level. However, we immediately agreed that the initial architecture was far too brittle to keep, and that we would need to redesign it from first principles based on what we learned.

Here are some of the key changes we proposed:

  1. Implementing Doubt Everywhere: At any place in the pipeline where user input is required, we need to implement doubt. You can never expect that a user gives you what you asked for or asks for what they want. As part of that, having explicit agents that try to weed out bad/garbage/nonsense inputs from user intent is crucial.
  2. Granular Breakdown: Break down parts of the pipeline over and over again until they reach a granularity that will be reproducible and deterministic. For example, instead of trying to deduce user intent in a single pass, we broke it into a multi-phase evaluation. First, we decide which set of providers (AWS, GCP, Azure, GitHub, etc.) the user is likely targeting, and then create a resource-provider hierarchy from the intent. We validate each of those assumptions and hierarchies with the user if our confidence is not extremely high. Only then do we map the individual resource-level intents to likely resource constructs offered by the provider. 

In other words, start by saying that “bucket” is probably associated with “AWS,” and only once that is validated, translate “bucket” into “AWS S3 bucket.” A sketch of this phased flow follows the list.

  3. Isolated Resource Generation: Have a first pass where each Terraform resource is generated in isolation, and then use another agent to exclusively stitch them together. This separates the problem of referential declaration from creating the correct resource for a sub-intent.
  4. Validating Context and Guardrails Without AI: Use traditional methods to validate context and guardrails without the use of AI. This is not something AI is needed for, and it only adds more complexity and headache.
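
To illustrate the phased breakdown from point 2, here is a sketch of how intent resolution might be staged: each hypothetical LLM pass does one small, checkable job, and user confirmation gates the transition between stages.

```python
from typing import Callable, Dict, List

def phased_intent_resolution(
    prompt: str,
    guess_providers: Callable[[str], List[str]],                   # hypothetical pass 1: e.g. ["aws"]
    confirm_with_user: Callable[[str], bool],                      # hypothetical UI hook
    group_nouns_by_provider: Callable[[str, List[str]], Dict[str, List[str]]],  # pass 2: {"aws": ["bucket"]}
    map_to_resource_type: Callable[[str, str], str],               # pass 3: ("aws", "bucket") -> "aws_s3_bucket"
) -> Dict[str, List[str]]:
    """Resolve intent in small, validated steps instead of one big guess."""
    providers = [
        p for p in guess_providers(prompt)
        if confirm_with_user(f"Is {p} the provider you meant?")
    ]
    resolved: Dict[str, List[str]] = {}
    for provider, nouns in group_nouns_by_provider(prompt, providers).items():
        # Only after the provider is confirmed do we translate "bucket" -> "aws_s3_bucket".
        resolved[provider] = [map_to_resource_type(provider, noun) for noun in nouns]
    return resolved
```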

Conclusion

In conclusion, while AI can significantly aid in generating IaC, it requires a carefully designed architecture with iterative feedback loops and traditional logic to ensure reliability. 

The journey we embarked on with Resourcely shows that while AI can bring us closer to a world where developers focus solely on their application logic, achieving this requires addressing the inherent ambiguities and limitations of AI systems. By breaking down tasks, implementing doubt, and relying on traditional logic where possible, we can create robust solutions that harness the power of AI without falling prey to its pitfalls.

