Optimal data and metadata formats for lakehouses
This document guides you through the optimal data and metadata formats as you
design your data lakehouse with BigLake.
A data lakehouse is a data architecture that combines the structure of a data
warehouse with the raw data flexibility of a data lake. This architecture
provides flexibility and scalability for a wide range of data use cases. The
Google Cloud data lakehouse solution is called
BigLake, which connects Google Cloud and open
source services to create a unified interface for analytics and AI. A data
lakehouse that's built with BigLake consists of the following key
components:
- Storage capabilities: Cloud Storage or BigQuery, with Apache Iceberg as the recommended open table format
- A metastore: BigLake metastore
- A query engine: BigQuery, Apache Spark, Apache Flink, Trino, or other open source engines
- A tool for data writing and analytics: various BigQuery and open source connections
BigLake packages all of these components in a single experience
with uniform governance. For more information on BigLake
architecture and innovations, see
BigLake evolved.
Select a metastore
For your metastore, we recommend using BigLake metastore, a fully managed,
serverless metastore for your lakehouse on Google Cloud. It provides a single
source of truth for metadata from multiple sources and is accessible from
BigQuery and various open data processing engines, so you don't need custom
tools to copy and synchronize metadata between repositories. BigLake metastore
is supported by Dataplex Universal Catalog, which provides unified,
fine-grained access controls across all supported engines and enables
end-to-end governance, including comprehensive lineage, data quality, and
discoverability capabilities.
Select a table format
With BigLake metastore as the metastore for your open lakehouse, you have the
following choices for the format of your tables:
- Choose standard BigQuery tables for data managed in BigQuery. These tables are fully managed by BigQuery and have the most advanced data analytics and management features. You can still connect these tables to BigLake metastore. Choose this option for non-Iceberg tables.
- Choose BigLake Iceberg tables in BigQuery for a fully managed experience on BigQuery. These tables are Iceberg tables that you create from BigQuery and store in Cloud Storage. Like all tables that use BigLake metastore, they can be read by open source engines or BigQuery. However, BigQuery is the only engine that can directly write to them. Choose this option if you want your extract, transform, and load (ETL) workflow to be managed by BigQuery.
- Choose BigLake Iceberg tables for a semi-managed experience on Google Cloud. These tables are Iceberg tables that you create from open source engines and store in Cloud Storage. Like all tables that use BigLake metastore, they can be read by open source engines or BigQuery. However, the open source engine that created the table is the only engine that can write to it. Choose this option if you want your ETL workflow to be managed by the open source engine.
- Choose external tables for tables outside of BigLake metastore. You manage the data and metadata of these tables yourself and rely entirely on the capabilities of the open table format (such as Iceberg, Apache Hudi, or Delta Lake). BigQuery can only read from these tables. Choose this option for data and metadata that you want to manage on your own in a third-party catalog.
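The four choices above can be read as a small decision procedure. The following Python sketch is one possible encoding of it, not an official tool: the `Requirements` fields, the `recommend_table_format` function, and the order in which the checks run are all assumptions made for illustration and are not part of any Google Cloud API.

```python
from dataclasses import dataclass
from enum import Enum


class TableFormat(Enum):
    STANDARD_BIGQUERY = "Standard BigQuery tables"
    BIGLAKE_ICEBERG_IN_BIGQUERY = "BigLake Iceberg tables in BigQuery"
    BIGLAKE_ICEBERG = "BigLake Iceberg tables"
    EXTERNAL = "External tables"


@dataclass(frozen=True)
class Requirements:
    # Hypothetical inputs that paraphrase the criteria in the list above.
    self_managed_catalog: bool  # data and metadata managed in a third-party catalog
    needs_iceberg: bool         # tables must be Iceberg tables
    etl_engine: str             # "bigquery" or "open_source": which engine writes


def recommend_table_format(req: Requirements) -> TableFormat:
    """Map requirements to one of the four options (assumed check order)."""
    if req.self_managed_catalog:
        # Tables outside of BigLake metastore; BigQuery can only read them.
        return TableFormat.EXTERNAL
    if not req.needs_iceberg:
        # Non-Iceberg data that is fully managed by BigQuery.
        return TableFormat.STANDARD_BIGQUERY
    if req.etl_engine == "bigquery":
        # BigQuery is the only engine that writes directly.
        return TableFormat.BIGLAKE_ICEBERG_IN_BIGQUERY
    if req.etl_engine == "open_source":
        # The open source engine that created the table writes to it.
        return TableFormat.BIGLAKE_ICEBERG
    raise ValueError(f"unknown etl_engine: {req.etl_engine!r}")


if __name__ == "__main__":
    choice = recommend_table_format(
        Requirements(self_managed_catalog=False, needs_iceberg=True, etl_engine="open_source")
    )
    print(choice.value)
```

Real deployments often mix several of these formats in one lakehouse, so treat the result as a starting point for a single table rather than a project-wide rule.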
Use the following table to compare your table format options:
| | External tables | BigLake Iceberg tables | BigLake Iceberg tables in BigQuery | Standard BigQuery tables |
|---|---|---|---|---|
| Metastore | External or self-hosted metastore | BigLake metastore | BigLake metastore | BigLake metastore |
| Storage | Cloud Storage / Amazon S3 / Azure | Cloud Storage | Cloud Storage | BigQuery |
| Management | Customer or third party | Google | Google (highly managed experience) | Google (most managed experience) |
| Read / Write | Open source engines (read/write); BigQuery (read only) | Open source engines (read/write); BigQuery (read only) | Open source engines (read only with Iceberg libraries, read/write interoperability with BigQuery Storage API); BigQuery (read/write) | Open source engines (read/write interoperability with BigQuery Storage API); BigQuery (read/write) |
| Use cases | Migrations, staging tables for BigQuery loads, self-management | Open lakehouse | Open lakehouse, enterprise-grade storage for analytics, streaming, and AI | Enterprise-grade storage for analytics, streaming, and AI |
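When planning pipelines, it can help to keep the Read / Write row of the comparison in a form that code can query. The sketch below simply transcribes that row into a Python mapping; the `Access` type, the `READ_WRITE` name, and the helper function are hypothetical names chosen for this illustration, and the access strings are copied from the comparison rather than derived from any API.

```python
from typing import List, NamedTuple


class Access(NamedTuple):
    open_source: str  # access level for open source engines
    bigquery: str     # access level for BigQuery


# Read / Write row of the comparison, keyed by table format.
READ_WRITE = {
    "External tables": Access("read/write", "read only"),
    "BigLake Iceberg tables": Access("read/write", "read only"),
    "BigLake Iceberg tables in BigQuery": Access(
        "read only with Iceberg libraries, "
        "read/write interoperability with BigQuery Storage API",
        "read/write",
    ),
    "Standard BigQuery tables": Access(
        "read/write interoperability with BigQuery Storage API",
        "read/write",
    ),
}


def formats_where_bigquery_writes() -> List[str]:
    """Return the table formats that BigQuery can both read and write."""
    return [
        name for name, access in READ_WRITE.items()
        if access.bigquery == "read/write"
    ]


if __name__ == "__main__":
    for name in formats_where_bigquery_writes():
        print(name)
```

Keeping the access levels as plain strings preserves the nuance in the comparison, such as the difference between direct writes and interoperability through the BigQuery Storage API, at the cost of needing exact string matches in queries.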
What's next
Learn more about BigLake metastore.
Last updated 2025-09-04 UTC.