Oxbow is a simple project that converts an existing storage location containing Apache Parquet files into a Delta Lake table. It is intended to run either as an AWS Lambda function or as a command line application.
The project is named after Oxbow lakes to keep with the lake theme.
Executing cargo build --release from a clone of this repository will build
the command line binary oxbow which can be used directly to convert a
directory full of .parquet files into a Delta table.
This is an in-place operation and will convert the specified table location into a Delta table!
% oxbow --table ./path/to/my/parquet-files

% export AWS_REGION=us-west-2
% export AWS_SECRET_ACCESS_KEY=xxxx
# Set other AWS environment variables
% oxbow --table s3://my-bucket/prefix/to/parquet

The deployment.tf file contains the necessary Terraform to provision the
function, a DynamoDB table for locking, S3 bucket, and IAM permissions.
After configuring the necessary authentication for Terraform, the following steps can be used to provision:
cargo lambda build --release --output-format zip --bin oxbow-lambda
terraform init
terraform plan
terraform apply

Note: Terraform configures the Lambda to run with the smallest amount of memory
allowed. For bucket locations with massive
Building and testing can be done with cargo: cargo test.
In order to deploy this in AWS Lambda, it must first be built with the cargo
lambda command line tool, e.g.:
cargo lambda build --features lambda --release --output-format zip

This will produce the file target/lambda/oxbow-lambda/bootstrap.zip, which can be
uploaded directly in the web console, or referenced in the Terraform (see
deployment.tf).
When run from the command line, oxbow performs a one-time operation. It will
take an existing directory or location full of .parquet files and create a
Delta table out of it.
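
One way to confirm that a conversion took effect is to look for the Delta transaction log. Delta Lake tables keep their commits in a _delta_log directory at the table root, so a minimal sketch for a local path might look like the following. The helper name is_delta_table is hypothetical and not part of oxbow, and the check only covers local directories, not s3:// locations.

```shell
#!/bin/sh
# Hypothetical helper (not part of oxbow): succeeds when the given local
# directory contains a Delta transaction log with at least one JSON commit.
is_delta_table() {
    table_dir="$1"

    # Delta Lake stores its commits under <table>/_delta_log/
    if [ ! -d "$table_dir/_delta_log" ]; then
        return 1
    fi

    # Commit files are zero-padded version numbers ending in .json.
    # If the glob matches nothing it stays literal and the -f test fails.
    for commit in "$table_dir"/_delta_log/*.json; do
        if [ -f "$commit" ]; then
            return 0
        fi
    done

    return 1
}

# Usage (assumes oxbow has already been run against this path):
#   is_delta_table ./path/to/my/parquet-files && echo "Delta log found"
```

Because the command line conversion is in place, the same directory that held the .parquet files is the one to check afterwards.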
This repository is intentionally licensed under the AGPL 3.0. If your organization is interested in re-licensing this function for re-use, contact me via email for commercial licensing terms: [email protected]