BigQuery DataFrames

BigQuery DataFrames provides a Pythonic DataFrame and machine learning (ML) API powered by the BigQuery engine.

  • bigframes.pandas provides a pandas-compatible API for analytics.

  • bigframes.ml provides a scikit-learn-like API for ML.

BigQuery DataFrames is an open-source package. You can run pip install --upgrade bigframes to install the latest version.

Documentation

Quickstart

Prerequisites

Code sample

Import bigframes.pandas for a pandas-like interface. The read_gbq method accepts either a fully-qualified table ID or a SQL query.

import bigframes.pandas as bpd

# your_gcp_project_id is a placeholder for your Google Cloud project ID.
bpd.options.bigquery.project = your_gcp_project_id

# Read from a table directly, or from the results of a SQL query.
df1 = bpd.read_gbq("project.dataset.table")
df2 = bpd.read_gbq("SELECT a, b, c FROM `project.dataset.table`")

Locations

BigQuery DataFrames uses a BigQuery session internally to manage metadata on the service side. This session is tied to a location. BigQuery DataFrames uses the US multi-region as the default location, but you can use session_options.location to set a different location. Every query in a session is executed in the location where the session was created. BigQuery DataFrames auto-populates bf.options.bigquery.location if the user starts with read_gbq/read_gbq_table/read_gbq_query() and specifies a table, either directly or in a SQL statement.

If you want to reset the location of the created DataFrame or Series objects, you can close the session by executing bigframes.pandas.close_session(). After that, you can reuse bigframes.pandas.options.bigquery.location to specify another location.

read_gbq() requires you to specify a location if the dataset you are querying is not in the US multi-region. If you try to read a table that resides outside the session's location, you get a NotFound exception.
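
For example, here is a minimal sketch of switching a session from one location to another using the options and close_session() call described above (the table IDs are hypothetical):

import bigframes.pandas as bpd

# Pin the session to the EU multi-region before running the first query.
bpd.options.bigquery.location = "EU"
df_eu = bpd.read_gbq("my_project.eu_dataset.events")  # hypothetical table

# To work in a different location, close the current session first,
# then set the location for the next session.
bpd.close_session()
bpd.options.bigquery.location = "asia-northeast1"
df_tokyo = bpd.read_gbq("my_project.tokyo_dataset.events")  # hypothetical table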

Project

If bf.options.bigquery.project is not set, the $GOOGLE_CLOUD_PROJECT environment variable is used; this variable is set in the notebook runtime that serves BigQuery Studio and Vertex AI notebooks.
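
For example, a minimal sketch that mirrors this fallback explicitly (the project ID shown is a placeholder):

import os
import bigframes.pandas as bpd

# Use the project from the environment, or a placeholder default.
bpd.options.bigquery.project = os.environ.get("GOOGLE_CLOUD_PROJECT", "my-gcp-project")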

ML Capabilities

The ML capabilities in BigQuery DataFrames let you preprocess data, and then train models on that data. You can also chain these actions together to create data pipelines.
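
For example, here is a minimal sketch of such a pipeline, assuming the scikit-learn-style StandardScaler, LinearRegression, and Pipeline classes in bigframes.ml and a hypothetical table with feature columns a, b and label column c:

import bigframes.pandas as bpd
from bigframes.ml.linear_model import LinearRegression
from bigframes.ml.pipeline import Pipeline
from bigframes.ml.preprocessing import StandardScaler

df = bpd.read_gbq("project.dataset.table")  # hypothetical table
X = df[["a", "b"]]  # feature columns
y = df[["c"]]       # label column

# Chain preprocessing and training into a single pipeline, then predict.
pipe = Pipeline([("scaler", StandardScaler()), ("model", LinearRegression())])
pipe.fit(X, y)
predictions = pipe.predict(X)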

Preprocess data

Create transformers to prepare data for use in estimators (models) by using the bigframes.ml.preprocessing module and the bigframes.ml.compose module. BigQuery DataFrames offers the following transformations:

  • Use the KBinsDiscretizer class in the bigframes.ml.preprocessing module to bin continuous data into intervals.

  • Use the