BigQuery DataFrames
BigQuery DataFrames provides a Pythonic DataFrame and machine learning (ML) API powered by the BigQuery engine.
- bigframes.pandas provides a pandas-compatible API for analytics.
- bigframes.ml provides a scikit-learn-like API for ML.
BigQuery DataFrames is an open-source package. You can run
pip install --upgrade bigframes
to install the latest version.
Documentation
Quickstart
Prerequisites
- Install the bigframes package.
- Create a Google Cloud project and billing account.
In an interactive environment (such as a notebook, the Python REPL, or the command line), bigframes performs authentication on the fly if needed. Otherwise, see how to set up application default credentials for various environments. For example, to pre-authenticate on your laptop you can install and initialize the gcloud CLI, and then generate application default credentials by running gcloud auth application-default login.
Code sample
Import bigframes.pandas for a pandas-like interface. The read_gbq method accepts either a fully qualified table ID or a SQL query.

import bigframes.pandas as bpd

bpd.options.bigquery.project = your_gcp_project_id
df1 = bpd.read_gbq("project.dataset.table")
df2 = bpd.read_gbq("SELECT a, b, c FROM `project.dataset.table`")
Locations
BigQuery DataFrames uses a BigQuery session internally to manage metadata on the service side. This session is tied to a location. BigQuery DataFrames uses the US multi-region as the default location, but you can use session_options.location to set a different location. Every query in a session is executed in the location where the session was created.
BigQuery DataFrames auto-populates bf.options.bigquery.location if the user starts with read_gbq/read_gbq_table/read_gbq_query() and specifies a table, either directly or in a SQL statement.
If you want to reset the location of the created DataFrame or Series objects, you can close the session by executing bigframes.pandas.close_session(). After that, you can reuse bigframes.pandas.options.bigquery.location to specify another location.
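The session and location options described above can be sketched as follows (the table reference in the comment is a placeholder, not a real resource):

```python
import bigframes.pandas as bpd

# Close the current session so its location no longer applies
# to subsequently created DataFrame or Series objects.
bpd.close_session()

# Point the next session at a different location, e.g. the EU multi-region.
bpd.options.bigquery.location = "EU"

# The next read creates a new session in the configured location, e.g.:
# df = bpd.read_gbq("project.eu_dataset.table")
```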
read_gbq() requires you to specify a location if the dataset you are querying is not in the US multi-region. If you try to read a table from another location, you get a NotFound exception.
Project
If bf.options.bigquery.project is not set, the $GOOGLE_CLOUD_PROJECT environment variable is used; this variable is set in the notebook runtimes serving BigQuery Studio and Vertex AI notebooks.
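The resolution order can be illustrated with a small sketch (resolve_project is a hypothetical helper for illustration only, not part of the bigframes API):

```python
import os

def resolve_project(explicit_project=None):
    """Hypothetical illustration of the project-resolution order:
    an explicitly configured project wins; otherwise fall back to
    the GOOGLE_CLOUD_PROJECT environment variable."""
    if explicit_project:
        return explicit_project
    return os.environ.get("GOOGLE_CLOUD_PROJECT")

# With an explicit project configured, the environment variable is ignored.
os.environ["GOOGLE_CLOUD_PROJECT"] = "env-project"
print(resolve_project("my-project"))  # my-project
print(resolve_project())              # env-project
```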
ML Capabilities
The ML capabilities in BigQuery DataFrames let you preprocess data and then train models on that data. You can also chain these actions together to create data pipelines.
Preprocess data
Create transformers to prepare data for use in estimators (models) by using the bigframes.ml.preprocessing module and the bigframes.ml.compose module. BigQuery DataFrames offers the following transformations:
- Use the KBinsDiscretizer class in the bigframes.ml.preprocessing module to bin continuous data into intervals.
- Use the