Migrate data pipelines

This document describes how you can migrate your upstream data pipelines, which load data into your data warehouse. You can use this document to better understand what a data pipeline is, what procedures and patterns a pipeline can employ, and which migration options and technologies are available for a data warehouse migration.

What is a data pipeline?

In computing, a data pipeline is a type of application that processes data through a sequence of connected processing steps. As a general concept, data pipelines can be applied, for example, to data transfer between information systems; extract, transform, and load (ETL); data enrichment; and real-time data analysis. Typically, data pipelines are operated either as a batch process, which executes and processes data when run, or as a streaming process, which executes continuously and processes data as it becomes available to the pipeline.
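As an illustration, the following minimal Python sketch shows both modes of operation. The step functions and record format are invented for this example:

```python
# A pipeline is a sequence of connected processing steps: the output
# of each step is the input of the next.
def parse(record):
    return record.strip().split(",")

def enrich(fields):
    return fields + ["enriched"]

steps = [parse, enrich]

def run_batch(records):
    """Batch mode: process a finite collection when the pipeline runs."""
    results = []
    for record in records:
        for step in steps:
            record = step(record)
        results.append(record)
    return results

def run_streaming(source):
    """Streaming mode: process each record as it becomes available."""
    for record in source:  # source can be an unbounded iterator
        for step in steps:
            record = step(record)
        yield record
```

The processing steps are identical in both modes; the difference is whether the input is a finite collection or an unbounded source.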

In the context of data warehousing, data pipelines are commonly used to read data from transactional systems, apply transformations, and then write data to the data warehouse. Each transformation is described by a function, and the input for any given function is the output of the previous function or functions. These connected functions form a graph, often referred to as a directed acyclic graph (DAG). The graph is directed because data flows in one direction, from source to destination, and acyclic because the input for any function cannot depend on the output of a function downstream in the DAG; in other words, loops are not permitted. Each node of the graph is a function, and each edge represents the data flowing from one function to the next. The initial functions are sources, or connections to source data systems. The final functions are sinks, or connections to destination data systems.
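To make this concrete, the following Python sketch represents a small DAG and executes its nodes in dependency order. The function names and data are invented for this example:

```python
# Each node is a function; edges declare which upstream outputs feed it.
def read_orders():                      # source: no inputs
    return [{"id": 1, "amount": 100.0}]

def add_tax(orders):                    # transformation
    return [{**o, "tax": o["amount"] * 0.1} for o in orders]

def write_warehouse(orders):            # sink: final node
    print("loading:", orders)

# The DAG: node -> list of upstream nodes whose outputs it consumes.
dag = {
    read_orders: [],
    add_tax: [read_orders],
    write_warehouse: [add_tax],
}

def execute(dag):
    """Run nodes in dependency order; acyclicity guarantees termination."""
    outputs = {}
    while len(outputs) < len(dag):
        for node, upstream in dag.items():
            if node not in outputs and all(u in outputs for u in upstream):
                outputs[node] = node(*(outputs[u] for u in upstream))
    return outputs

execute(dag)
```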

In the context of data pipelines, sources are usually transactional systems—for example, an RDBMS—and the sink connects to a data warehouse. This type of graph is referred to as a data flow DAG. You can also use DAGs to orchestrate data movement between data pipelines and other systems. This usage is referred to as an orchestration or control flow DAG.
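As an example of a control flow DAG, the following sketch uses Apache Airflow, the open source orchestrator that underlies Cloud Composer. The DAG id and the task callables are hypothetical:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    ...  # pull data from the transactional source

def transform():
    ...  # apply the transformation functions

def load():
    ...  # write the result to the data warehouse

with DAG(
    dag_id="orders_daily",          # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",     # batch: run once a day
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Control flow edges: extract runs before transform, transform before load.
    t_extract >> t_transform >> t_load
```

Note that the edges here express execution order between tasks, not the flow of records; that is what distinguishes a control flow DAG from a data flow DAG.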

When to migrate the data pipelines

When you migrate a use case to BigQuery, you can choose to offload it or to fully migrate it.

On the one hand, when you offload a use case, you don't need to migrate its upstream data pipelines up front. You first migrate the use case schema and data from your existing data warehouse into BigQuery. You then establish an incremental copy from the old to the new data warehouse to keep the data synchronized. Finally, you migrate and validate downstream processes such as scripts, queries, dashboards, and business applications.
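One way to implement the incremental copy is a scheduled job that tracks a watermark column. The following sketch uses the google-cloud-bigquery client; the table name and the fetch_rows_since helper, which reads from the existing warehouse, are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()
TARGET = "my_project.my_dataset.orders"  # hypothetical BigQuery table


def fetch_rows_since(watermark):
    """Hypothetical helper: return rows from the existing warehouse whose
    updated_at is later than `watermark` (or all rows if it is None),
    as a list of dicts."""
    ...


# 1. Find the high-water mark of data already copied into BigQuery.
query = f"SELECT MAX(updated_at) AS watermark FROM `{TARGET}`"
watermark = next(iter(client.query(query).result())).watermark

# 2. Pull only the newer rows from the old warehouse and append them.
rows = fetch_rows_since(watermark)
if rows:
    job_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    client.load_table_from_json(rows, TARGET, job_config=job_config).result()
```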

At this point, your upstream data pipelines are unchanged and are still writing data to your existing data warehouse. You can include the offloaded use cases in the migration backlog again to be fully migrated in a subsequent iteration.

On the other hand, when you fully migrate a use case, the upstream data pipelines required for the use case are migrated to Google Cloud. Full migration requires you to offload the use case first. After the full migration, you can deprecate the corresponding legacy tables in the on-premises data warehouse because data is ingested directly into BigQuery.

During an iteration, you can choose one of the following options:

  • Offload only your use case.
  • Fully migrate a use case that was previously offloaded.
  • Fully migrate a use case from scratch by offloading it first in the same iteration.

When all of your use cases are fully migrated, you can elect to switch off the old warehouse, which is an important step for reducing overhead and costs.

How to migrate the data pipelines

The rest of this document addresses how to migrate your data pipelines, including which approach and procedures to use and which technologies to employ. Options range from repurposing existing data pipelines (redirecting them to load data into BigQuery) to rewriting the data pipelines to take advantage of managed services on Google Cloud.
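For example, if an existing pipeline is written in Apache Beam, repurposing it can be as simple as swapping the sink. The following sketch, with a hypothetical source path, table, and schema, redirects the load step to BigQuery:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/orders/*.csv")  # hypothetical source
        | "Parse" >> beam.Map(lambda line: dict(zip(["id", "amount"], line.split(","))))
        # Previously this step wrote to the legacy warehouse; now the sink is BigQuery.
        | "Load" >> beam.io.WriteToBigQuery(
            "my_project:my_dataset.orders",      # hypothetical table
            schema="id:STRING,amount:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```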

Procedures and patterns for data pipelines

You can use data pipelines to execute a number of procedures and patterns. The following are the ones most commonly used in data warehousing. You might have batch data pipelines or streaming data pipelines. Batch data pipelines run on data collected over a period of time (for example, once a day). Streaming data pipelines handle real-time events as they are generated by your operational systems, for example, a stream of change events produced by your online transaction processing (OLTP) databases.