It's pretty common that folks building streaming platforms still need to onboard batches of data, often provided by 3rd parties via cloud object storage. A pattern I've come across:
- 3rd party packages up structured data (XML, JSON, etc.) into an archive, like a compressed tarball.
- 3rd party uploads it to a bucket in S3, GCS, Azure Blob, etc.
- Streaming platform relies on some scheduled or triggered service to process the data.
- Data ends up produced to a topic for downstream consumption.
Using things like Apache NiFi, Apache Flink, or Apache Kafka Connect to address step 3 is often:
- overkill -- platforms like Flink can do batch processing, but work best for stateful stream transformations
- painful -- you need to deploy and manage backing infrastructure due to the n-tiered design (e.g. control vs. task nodes)
Since cloud providers all offer some form of triggerable serverless application framework, where code can be invoked automatically when an object appears in a bucket, the simplest way to package up the ETL is as a single "lambda function" app, invoked on-demand and only when needed.
Luckily, Redpanda Connect (f.k.a. Benthos) scales down quite well!
So, why not just trigger a Redpanda Connect "job" using AWS Lambda?
The original Benthos project supported "serverless" deployments, but assumed you wanted to develop a pipeline and output strategy based on processing just the incoming raw event from the trigger. In this case, that means we still need to go fetch the object in S3...and currently there's no processor for that.
However! There is an AWS S3 input. We just need to programmatically configure it.
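As a point of reference, here's a minimal sketch of what programmatically configuring that input could look like. The `S3_BUCKET`, `S3_KEY`, and `DELETE_OBJECTS` environment variables are hypothetical placeholders, resolved by Redpanda Connect's `${VAR}` interpolation when the config is parsed:

```yaml
# Minimal aws_s3 input sketch; the env vars are illustrative stand-ins
# for values injected at invocation time.
input:
  aws_s3:
    bucket: "${S3_BUCKET}"
    # An exact object key should work as a prefix, limiting the input
    # to the single object that triggered the event.
    prefix: "${S3_KEY}"
    # Optionally clean up after a successful read.
    delete_objects: ${DELETE_OBJECTS:false}
```

The same fields can also be overridden per-run with CLI flags, which is the approach this project takes.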
This project provides a 2-phase approach to processing S3 objects on creation:
- `http.yaml` -- an "outer" RP Connect configuration that uses `http_client` inputs and outputs to receive and respond to the AWS Lambda service, receiving the event when triggered and providing a response when completed. It takes the event data and calls out to...
- `config.yaml` -- an "inner" RP Connect configuration that uses an `aws_s3` input and `kafka_franz` output along with a processing pipeline to do the unpacking and transformation of the gzipped tarballs of JSON documents.
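To make that concrete, here's a rough sketch of the "outer" shape, speaking AWS Lambda's custom runtime HTTP API. The project's actual config may differ; in particular, the `--set` override flags and the metadata key for the request id are assumptions on my part:

```yaml
# Sketch of an "outer" http.yaml speaking the Lambda runtime API.
input:
  http_client:
    # Long-poll the Lambda runtime API for the next invocation event.
    url: http://${AWS_LAMBDA_RUNTIME_API}/2018-06-01/runtime/invocation/next
    verb: GET

pipeline:
  processors:
    # Pull the bucket and object key out of the S3 event notification.
    - mapping: |
        meta bucket = this.Records.index(0).s3.bucket.name
        meta key = this.Records.index(0).s3.object.key
    # Shell out to the "inner" pipeline, overriding the aws_s3 input
    # fields on the command line.
    - command:
        name: rpk
        args_mapping: |
          [
            "connect", "run",
            "--set", "input.aws_s3.bucket=" + meta("bucket"),
            "--set", "input.aws_s3.prefix=" + meta("key"),
            "config.yaml"
          ]

output:
  http_client:
    # Report completion for this request id (assuming the input surfaced
    # the Lambda-Runtime-Aws-Request-Id response header as metadata).
    url: http://${AWS_LAMBDA_RUNTIME_API}/2018-06-01/runtime/invocation/${! meta("Lambda-Runtime-Aws-Request-Id") }/response
    verb: POST
```

(The call to the Secrets Lambda Extension for the Redpanda credentials is omitted here for brevity.) With those two halves in mind, the end-to-end flow looks like this: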
- 3rd party service generates a gzipped tarball of ndjson (newline-delimited JSON) files, each containing numerous JSON documents.
- 3rd party uploads the file into the S3 bucket.
- S3 bucket invokes Lambda function.
- If one isn't running, AWS Lambda spins up an instance.
- The "outer" logic polls the HTTP endpoint to retrieve the event,
extracts details (e.g. bucket, object key, etc.) and invokes the
"inner" logic using a
commandprocessor, overriding config used by theaws_s3input via cli args. The access credentials for Redpanda are retrieved using an HTTP client calling AWS's Secrets Lambda Extension. - The "inner" logic retrieves the object, processes it, and publishes
to a topic configured on the
kafka_franzoutput - The "outer" logic takes over and sends an HTTP POST to let Lambda
know the result (pass/fail) using an
http_clientoutput.
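And the "inner" half in spirit: gunzip, untar, split into lines, produce. Again just a sketch; the component names are real Redpanda Connect processors/outputs, but the project's actual pipeline, env var names, and credential wiring may differ:

```yaml
# Sketch of an "inner" config.yaml: S3 object -> gunzip -> untar -> lines -> topic.
input:
  aws_s3:
    bucket: placeholder   # overridden per-invocation by the outer config
    prefix: placeholder

pipeline:
  processors:
    - decompress:
        algorithm: gzip
    # One message per file in the tarball...
    - unarchive:
        format: tar
    # ...then one message per ndjson line.
    - unarchive:
        format: lines

output:
  kafka_franz:
    seed_brokers: [ "${REDPANDA_BROKERS}" ]
    topic: "${REDPANDA_TOPIC}"
    tls:
      enabled: true
    sasl:
      # Assumed mechanism; in the real flow the credentials come from
      # AWS's Secrets Lambda Extension rather than plain env vars.
      - mechanism: SCRAM-SHA-256
        username: "${REDPANDA_USER}"
        password: "${REDPANDA_PASS}"
```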
There are two phases to deployment:
- Packaging up the application.
- Deploying the AWS infrastructure.
You should be able to just run the `build.sh` script. It will package up the Redpanda Connect configuration files (the YAML files), download a particular Linux arm64 release of `rpk`, and assemble the Lambda layer zip files.
For AWS, it's easiest to use the AWS environment variables for authentication. Grab those from your account and export them:
```bash
export AWS_ACCESS_KEY_ID="your-key-id"
export AWS_SECRET_ACCESS_KEY="some-access-key"
export AWS_SESSION_TOKEN="your-long-long-session-token"
```

If you make config changes, make sure to re-run `build.sh`.
Deployment uses Terraform / OpenTofu. All you need is an AWS account in which you have permissions to create infrastructure, including IAM roles/policies, S3 buckets, and Lambda functions. You should also have a Redpanda instance accessible from the Lambda function, so go grab one at: https://cloud.redpanda.com/sign-up
To keep it simple, prepare a Terraform variables file `terraform.tfvars` like the one below, but with your unique details.
```hcl
s3_bucket_name          = "my-cool-bucket"
redpanda_bootstrap      = "tktktktktktkt.any.us-east-1.mpx.prd.cloud.redpanda.com:9092"
redpanda_topic          = "documents"
redpanda_username       = "my-redpanda-user"
redpanda_password       = "my-super-cool-password"
redpanda_use_tls        = true
redpanda_delete_objects = true
```

Then fire off the deployment:
```bash
tofu apply
```

You should now be able to drop gzipped tarballs of ndjson into the S3 bucket and watch the fireworks.
The Terraform stuff probably won't pick up on zip file / layer changes; I haven't tested that part. Keep that in mind if you make a config change and re-run `build.sh`. It might be easiest to run `tofu destroy` and then `tofu apply` to rebuild the environment.