Companies are constantly looking for ways to capture and take advantage of data. Putting together a more complete picture of customers, environmental conditions, and operations can quickly become a competitive advantage.
Capturing new sources of data with traditional ETL tools can be expensive and cumbersome. A modern, increasingly popular pattern for building data pipelines combines serverless compute, message queues, and object storage. A simple architecture like SNS + Lambda + S3 on AWS, or Pub/Sub + Cloud Functions + Cloud Storage on GCP, lets a data team spin up streaming data pipelines quickly.
Data teams are typically not cloud infrastructure experts: while they may want to set up new data pipelines, they are likely not comfortable writing and deploying Terraform. In this blog, we’ll walk through how a platform team can equip a data team with the tools needed to quickly deploy their own streaming pipelines on GCP. We’ll cover the user interface, the end result, and the benefits of doing so.
You can find an in-depth, step-by-step tutorial in our documentation.
Proposed architecture
We’ll follow the basic structure proposed here in the GCP documentation.
Google Pub/Sub: A messaging topic that receives incoming events. Let’s say we’re a retail chain, and we want to stream transactions so that we can have a real-time view of inventory in our stores. We can use Pub/Sub to receive and publish these transactions.
Google Cloud Run function: Simple serverless compute that we can subscribe to our Pub/Sub topic. Our function executes each time a new message arrives on the topic.
Google Cloud Storage: Object storage where we can store function results, as well as our function code itself.
Why Streamline Data Pipeline Creation
The architecture we just described may seem simple, but the devil is in the details. How would a data team know whether a Pub/Sub topic should be encrypted? How many retries and how much memory should their Cloud Run function have, and how do they inject environment variables?
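To make the first of those questions concrete, here is roughly what the encryption decision alone looks like in raw Terraform, with all names illustrative (the retry, memory, and environment-variable knobs appear in the full example in the next section):

```hcl
# Encrypting a Pub/Sub topic with a customer-managed key (names are illustrative)
resource "google_kms_key_ring" "pipeline" {
  name     = "pipeline-keys"
  location = "us-central1"
}

resource "google_kms_crypto_key" "topic" {
  name     = "transactions-topic-key"
  key_ring = google_kms_key_ring.pipeline.id
}

resource "google_pubsub_topic" "transactions" {
  name = "transactions"
  # Omit kms_key_name and the topic silently falls back to Google-managed keys.
  # The Pub/Sub service agent also needs roles/cloudkms.cryptoKeyEncrypterDecrypter
  # on this key, or publishes will fail.
  kms_key_name = google_kms_crypto_key.topic.id
}
```

None of this is obvious to someone who doesn’t live in the GCP provider docs.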
A data engineer could spend hours or even days learning Terraform and getting a deployment working, with a very real possibility that a misconfiguration in what they deployed causes a data breach or incident down the road.
Platform teams, for their part, are incentivized to streamline the creation of cloud resources for their constituents: if they don’t, they’ll be stuck troubleshooting broken pipelines or writing one-off Terraform every time somebody wants to start capturing a new data source.
The point is that cloud infrastructure components are built for a heterogeneous group of companies, each with different requirements. Data teams at your company need guidance and recommendations for creating infrastructure, along with the flexibility to make changes when they need to.
Terraform
Creating an architecture like the above with Terraform would look something like the example below, which follows the pattern in GCP’s documentation.
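Here’s a representative sketch; the project ID, bucket names, region, runtime, and entry point are all placeholders you would change for your environment:

```hcl
provider "google" {
  project = "my-project-id" # placeholder
  region  = "us-central1"
}

# Topic that receives transaction events
resource "google_pubsub_topic" "transactions" {
  name = "transactions"
}

# Bucket that holds the zipped function source code
resource "google_storage_bucket" "source" {
  name                        = "my-project-id-function-source" # must be globally unique
  location                    = "US"
  uniform_bucket_level_access = true
}

resource "google_storage_bucket_object" "source" {
  name   = "function-source.zip"
  bucket = google_storage_bucket.source.name
  source = "function-source.zip" # built and zipped separately
}

# Bucket that stores the pipeline's results
resource "google_storage_bucket" "results" {
  name                        = "my-project-id-pipeline-results"
  location                    = "US"
  uniform_bucket_level_access = true
}

# Function that fires on every new Pub/Sub message
resource "google_cloudfunctions2_function" "pipeline" {
  name     = "transaction-pipeline"
  location = "us-central1"

  build_config {
    runtime     = "python312"
    entry_point = "handle_transaction" # hypothetical entry point in the zipped source
    source {
      storage_source {
        bucket = google_storage_bucket.source.name
        object = google_storage_bucket_object.source.name
      }
    }
  }

  service_config {
    available_memory = "256M" # one of the choices from the previous section
    environment_variables = {
      RESULTS_BUCKET = google_storage_bucket.results.name
    }
  }

  event_trigger {
    trigger_region = "us-central1"
    event_type     = "google.cloud.pubsub.topic.v1.messagePublished"
    pubsub_topic   = google_pubsub_topic.transactions.id
    retry_policy   = "RETRY_POLICY_RETRY" # another of those choices: retry failed deliveries
  }
}
```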
This code-first implementation has no UI, no defaults, no guidance, and no indication of what a proper configuration looks like.
Resourcely Blueprint
There is a way to give developers a guided experience for deploying cloud infrastructure, so they don’t spend excessive time or suffer heartache getting across the finish line: Resourcely Blueprints.
Blueprints are customizable templates for deploying infrastructure as code. They uplevel your Terraform into guided forms with defaults, descriptions, automated linking, picklists, and more.
Blueprint code
Blueprint code is created by platform teams, on behalf of the developers they serve. It looks remarkably like Terraform:
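What follows is a rough, hypothetical sketch of the idea rather than Resourcely’s exact Blueprint syntax (the tutorial linked above has the real thing): Terraform with templated placeholders that Resourcely renders as form fields.

```hcl
# Hypothetical sketch only -- see Resourcely's docs for actual Blueprint syntax.
# Each {{ placeholder }} becomes a form field, with defaults, descriptions,
# and picklists supplied by the platform team.
resource "google_pubsub_topic" "{{ topic_name }}" {
  name = "{{ topic_name }}"
}

resource "google_cloudfunctions2_function" "{{ function_name }}" {
  name     = "{{ function_name }}"
  location = "{{ region }}" # e.g. a picklist of approved regions

  service_config {
    available_memory = "{{ memory }}" # e.g. a safe default of 256M
  }

  event_trigger {
    event_type   = "google.cloud.pubsub.topic.v1.messagePublished"
    pubsub_topic = google_pubsub_topic.{{ topic_name }}.id
    retry_policy = "RETRY_POLICY_RETRY" # pinned by the platform team
  }
}
```

The dangerous or tedious decisions are made once, by the platform team; the data engineer only sees the fields that matter to them.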
Resulting UI
Blueprint code like the above is rendered as an interactive form in the Resourcely UI.
In this case, data engineers can use this form to customize and create their own streaming data pipeline, without having to know Terraform and without having to be an expert in Pub/Sub, Cloud Run functions, or Cloud Storage.
Benefits
With this UI:
- Platform teams stop having to be involved in manual one-off requests
- New data pipelines are created consistently and securely
- Data engineers can move faster, and with confidence
- All the benefits of infrastructure as code are preserved, since Resourcely still emits infrastructure as code once the form is filled out
Conclusion
You don’t need to gatekeep cloud infrastructure. There is a way to streamline your deployment experience, guiding and unlocking developers while reclaiming your own time.
Sign up for your free Resourcely account here, follow the full tutorial to create your own streaming data pipeline template, or check out the video to see how it is done step-by-step.