Companies are constantly looking for ways to capture and take advantage of data. Putting together a more complete picture of customers, environmental conditions, and operations can quickly become a competitive advantage.
Capturing new sources of data with traditional ETL tools can be expensive and cumbersome. A modern, increasingly popular pattern for building data pipelines combines serverless compute, message queues, and object storage. A simple architecture like SNS + Lambda + S3 on AWS, or Pub/Sub + Cloud Functions + Cloud Storage on GCP, lets a data team spin up streaming data pipelines quickly.
Data teams are typically not cloud infrastructure experts: while they may want to set up new data pipelines, they are likely not comfortable writing and deploying Terraform. In this blog, we’ll walk through how a platform team can equip a data team with the tools needed to quickly deploy their own streaming pipelines on GCP. We’ll cover the user interface, the end result, and the benefits of doing so.
You can find an in-depth, step-by-step tutorial in our documentation.
Proposed architecture
We’ll follow the basic structure proposed here in the GCP documentation.
Google Pub/Sub: A messaging topic that receives incoming events. Let’s say we’re a retail chain, and we want to stream transactions so that we can have a real-time view of inventory in our stores. We can use Pub/Sub to receive and publish these transactions.
Google Cloud Run function: Simple serverless compute that we can subscribe to our Pub/Sub topic. Our function executes each time a new message arrives on the topic.
Google Cloud Storage: Object storage where we can store function results, as well as our function code itself.
Why Streamline Data Pipeline Creation
The architecture we just described may seem simple, but the devil is in the details. How would a data team know whether a Pub/Sub topic should be encrypted? How many retries and how much memory should their Cloud Run function have, and how do they inject environment variables?
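To make the first of those questions concrete, here is roughly what the encryption decision alone looks like in raw Terraform, with all names illustrative (the retry, memory, and environment-variable knobs appear in the full example in the next section):

```hcl
# Encrypting a Pub/Sub topic with a customer-managed key (names are illustrative)
resource "google_kms_key_ring" "pipeline" {
  name     = "pipeline-keys"
  location = "us-central1"
}

resource "google_kms_crypto_key" "topic" {
  name     = "transactions-topic-key"
  key_ring = google_kms_key_ring.pipeline.id
}

resource "google_pubsub_topic" "transactions" {
  name = "transactions"
  # Omit kms_key_name and the topic silently falls back to Google-managed keys.
  # The Pub/Sub service agent also needs roles/cloudkms.cryptoKeyEncrypterDecrypter
  # on this key, or publishes will fail.
  kms_key_name = google_kms_crypto_key.topic.id
}
```

None of this is obvious to someone who doesn’t live in the GCP provider docs.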
A data engineer could spend hours or even days learning Terraform and getting a deployment working, with a very real possibility that a misconfiguration in what they deployed causes a data breach or incident down the road.
Platform teams, for their part, are incentivized to streamline the creation of cloud resources for their constituents: if they don’t, they’ll be stuck troubleshooting broken pipelines or writing one-off Terraform every time somebody wants to start capturing a new data source.
The point is that cloud infrastructure components are built for a heterogeneous group of companies, each with different requirements. Data teams at your company need guidance and recommendations for creating infrastructure, along with the flexibility to make changes when they need to.
Terraform
Creating an architecture like the above with Terraform would look something like the example below, which follows the pattern in GCP’s documentation.
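Here’s a representative sketch; the project ID, bucket names, region, runtime, and entry point are all placeholders you would change for your environment:

```hcl
provider "google" {
  project = "my-project-id" # placeholder
  region  = "us-central1"
}

# Topic that receives transaction events
resource "google_pubsub_topic" "transactions" {
  name = "transactions"
}

# Bucket that holds the zipped function source code
resource "google_storage_bucket" "source" {
  name                        = "my-project-id-function-source" # must be globally unique
  location                    = "US"
  uniform_bucket_level_access = true
}

resource "google_storage_bucket_object" "source" {
  name   = "function-source.zip"
  bucket = google_storage_bucket.source.name
  source = "function-source.zip" # built and zipped separately
}

# Bucket that stores the pipeline's results
resource "google_storage_bucket" "results" {
  name                        = "my-project-id-pipeline-results"
  location                    = "US"
  uniform_bucket_level_access = true
}

# Function that fires on every new Pub/Sub message
resource "google_cloudfunctions2_function" "pipeline" {
  name     = "transaction-pipeline"
  location = "us-central1"

  build_config {
    runtime     = "python312"
    entry_point = "handle_transaction" # hypothetical entry point in the zipped source
    source {
      storage_source {
        bucket = google_storage_bucket.source.name
        object = google_storage_bucket_object.source.name
      }
    }
  }

  service_config {
    available_memory = "256M" # one of the choices from the previous section
    environment_variables = {
      RESULTS_BUCKET = google_storage_bucket.results.name
    }
  }

  event_trigger {
    trigger_region = "us-central1"
    event_type     = "google.cloud.pubsub.topic.v1.messagePublished"
    pubsub_topic   = google_pubsub_topic.transactions.id
    retry_policy   = "RETRY_POLICY_RETRY" # another of those choices: retry failed deliveries
  }
}
```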
This code-first implementation has no UI, no defaults, no guidance, and no indication of what a proper configuration looks like.
Resourcely Blueprint
There is a way to give developers a guided experience for deploying cloud infrastructure, so they don’t spend excessive time or suffer heartache getting across the finish line: Resourcely Blueprints.
Blueprints are customizable templates for deploying infrastructure as code. They uplevel your Terraform into guided forms with defaults, descriptions, automated linking, picklists, and more.
Blueprint code
Blueprint code is created by platform teams, on behalf of the developers they serve. It looks remarkably like Terraform:
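What follows is a rough, hypothetical sketch of the idea rather than Resourcely’s exact Blueprint syntax (the tutorial linked above has the real thing): Terraform with templated placeholders that Resourcely renders as form fields.

```hcl
# Hypothetical sketch only -- see Resourcely's docs for actual Blueprint syntax.
# Each {{ placeholder }} becomes a form field, with defaults, descriptions,
# and picklists supplied by the platform team.
resource "google_pubsub_topic" "{{ topic_name }}" {
  name = "{{ topic_name }}"
}

resource "google_cloudfunctions2_function" "{{ function_name }}" {
  name     = "{{ function_name }}"
  location = "{{ region }}" # e.g. a picklist of approved regions

  service_config {
    available_memory = "{{ memory }}" # e.g. a safe default of 256M
  }

  event_trigger {
    event_type   = "google.cloud.pubsub.topic.v1.messagePublished"
    pubsub_topic = google_pubsub_topic.{{ topic_name }}.id
    retry_policy = "RETRY_POLICY_RETRY" # pinned by the platform team
  }
}
```

The dangerous or tedious decisions are made once, by the platform team; the data engineer only sees the fields that matter to them.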
Resulting UI
Blueprint code like the above is rendered as an interactive form in the Resourcely UI.
In this case, data engineers can use this form to customize and create their own streaming data pipeline, without having to know Terraform and without having to be an expert in Pub/Sub, Cloud Run functions, or Cloud Storage.
Benefits
With this UI:
- Platform teams stop having to be involved in manual one-off requests
- New data pipelines are created consistently and securely
- Data engineers can move faster, and with confidence
- All the benefits of infrastructure as code are preserved, since Resourcely still emits infrastructure as code once the form is filled out
Conclusion
You don’t need to gatekeep cloud infrastructure. There is a way to streamline your deployment experience, guiding and unlocking developers while reclaiming your own time.
Sign up for your free Resourcely account here, follow the full tutorial to create your own streaming data pipeline template, or check out the video to see how it is done step-by-step.