AI/ML Recipes for Vertex AI, Serverless Spark and BigQuery

The AI/ML Recipes for Vertex AI, Serverless Spark and BigQuery open-source project is an effort to jumpstart your development of data processing and machine learning notebooks using Vertex AI, BigQuery and Dataproc's distributed processing capabilities.

We are releasing a set of machine-learning-focused notebooks for you to adapt, extend, and use to solve your use cases with your own data.
You can clone the repo and start executing the notebooks right away, using your Dataproc cluster or Dataproc Serverless Runtime for the PySpark notebooks, and any environment for the BigQuery DataFrames (Bigframes) notebooks.

Notebooks

Please refer to the documentation in each notebook's folder for more information:

| Title | Industry | Topic | Sub Topic | Main Technologies |
|---|---|---|---|---|
| Fine-tuning Gemini to translate multiple languages | Media & Entertainment | Generative AI | Fine tuning | PySpark, Iceberg, Gemini |
| PDF summarization using Gemini | Financial | Generative AI | Summarization | PySpark, Spark ML, Gemini, BigQuery |
| Movie Reviews sentiment analysis using Gemini | Media & Entertainment | Generative AI | Sentiment Analysis | PySpark, Spark ML, Gemini, BigQuery |
| Generate description from videos | Retail | Generative AI | Content Generation | PySpark, GCS, Gemini |
| Product attributes and description from image | Retail | Generative AI | Content Generation | PySpark, GCS, Gemini |
| SMS Spam Filtering | Telecom | Classification | Multilayer Perceptron Classifier | PySpark, Spark ML, GCS |
| Predictive Maintenance | Manufacturing | Classification | Linear Support Vector Machine | PySpark, Spark ML, GCS |
| Wine Quality Classification | Retail | Classification | Logistic Regression | PySpark, Spark ML, GCS |
| Housing Prices Prediction | Financial | Regression | Decision Tree Regression | PySpark, Spark ML, GCS |
| Bike Trip Duration Prediction | Mobility | Regression | Random Forest Regression | PySpark, Spark ML, BigQuery |
| Customer Price Index | Financial | Sampling | Monte Carlo method | PySpark, GCS, NumPy |
| Banner advertising understanding | Retail | Generative AI | Content Generation | BigFrames, GCS, Gemini, BigQuery |
| Predict penguin weight | Environmental | Regression | Linear Regression | BigFrames, BigQuery |
| Toxicity classification using Gemini fine-tuned | Gaming | Generative AI | Classification | BigFrames, Gemini, Vertex AI |
| Contract Risk and Compliance Review | Financial | Generative AI | Summarization | BigQuery, SQL, Gemini |
| Asset Price Forecast using Iceberg and Prophet | Finance | Forecast | Prophet | PySpark, Dataproc Serverless, Apache Iceberg, Prophet, BigQuery, GCS |
| Purchase Predictions with PySpark in BigQuery Studio | Retail | Analytics | Purchase Predictions | PySpark, Spark ML, BigQuery, Dataproc, GCS |
| Time Series Analysis with TimesFM and ARIMA in BigQuery | Retail | Forecast | ARIMA and TimesFM | BigQuery, BigQuery ML, ARIMA, TimesFM, Python, Matplotlib |
| Assessing Environmental Risks to Protect Agricultural Investments | Agriculture | Quickstart | Geospatial | BigQuery, Google Earth Engine, BigFrames, GeoPandas, BigQuery ML |
| A Data Science Approach to Investigating Poor Product Sales Performance | Retail | Analytics | Sales Performance Analysis | BigQuery, Vertex AI, XGBoost, Pandas |
| Creating an Image-Based Home Search Engine | Real Estate | Analytics | Image Search | BigQuery, BigQuery ML, Gemini, GCS, SQL, Python |
| Identifying Customer Segments for Targeted Marketing | Retail | Analytics | Identifying Customer Segments | BigQuery, BigQuery ML, Gemini, Generative AI, SQL, K-Means Clustering |

Google Cloud products quickstarts:

Title Topic Sub Topic Main Technologies
Delta format in GCS Quickstart Quickstart Delta PySpark, GCS, Delta
Dataproc Metastore Quickstart Dataproc Metastore PySpark, Dataproc Metastore
Dataproc cluster insights with BigQuery Quickstart Dataproc BigQuery, Dataproc
Bigframes Quickstart Quickstart Bigframes BigFrames, BigQuery, Gemini
Apache Iceberg on BQ Quickstart Quickstart Iceberg BigQuery, Apache Iceberg
Agent2Agent Quickstart Quickstart Agent2Agent Gemini, Google ADK, A2A, Vertex AI

Public Datasets

The notebooks read from our public GCS bucket, which contains several publicly available datasets.

The datasets documentation lists the available datasets, which are located in gs://dataproc-metastore-public-binaries.
It also describes each dataset and links to its original page, where you can find its license terms.
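
As a rough illustration of how a notebook might read from that bucket, here is a minimal Python sketch. Only the bucket name comes from this README. The helper names `dataset_uri` and `read_public_csv`, the object path used in the usage note, and the assumption that the file is a CSV with a header row are all hypothetical; check the datasets documentation for the real paths and formats.

```python
# Bucket name taken from this README.
PUBLIC_BUCKET = "dataproc-metastore-public-binaries"


def dataset_uri(relative_path: str) -> str:
    """Build a gs:// URI for an object inside the public bucket."""
    return f"gs://{PUBLIC_BUCKET}/{relative_path.lstrip('/')}"


def read_public_csv(spark, relative_path: str):
    """Read a CSV object from the public bucket into a Spark DataFrame.

    `spark` is an existing SparkSession whose runtime can read gs:// paths
    (Dataproc runtimes ship with the GCS connector). The CSV format and the
    header row are assumptions made for this sketch.
    """
    return (
        spark.read
        .option("header", True)       # first line holds the column names
        .option("inferSchema", True)  # let Spark guess the column types
        .csv(dataset_uri(relative_path))
    )
```

Inside a PySpark notebook, something like `read_public_csv(spark, "some-dataset/data.csv")` would then return a DataFrame, where `some-dataset/data.csv` is a placeholder for a real object path.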

Cloud Code VSCode extension

These notebooks are available from your VSCode IDE when using the Cloud Code extension. You can go to Notebook Templates and download each template to your environment:

Usage in Vertex AI Workbench notebooks

These notebooks are available from within the Vertex AI Workbench notebooks environment.
Navigate to the JupyterLab home screen and click Notebooks to see the list of notebooks, along with a button to download or copy them into your environment.

Usage in your local environment

  1. Install the gcloud CLI
  2. Run gcloud init to set up your default GCP configuration
  3. Clone this repository by running
    git clone https://github.com/GoogleCloudPlatform/ai-ml-recipes.git
  4. Install the requirements by running pip install -r requirements.txt
  5. For the PySpark notebooks, use one of the following approaches with the Dataproc Jupyter Plugin:
    • 5.1) [PySpark Serverless Runtime on Google Cloud] Create a Runtime Template with your desired runtime config, and use it to run your PySpark notebooks.
    • 5.2) [Local runtime] Use your local PySpark runtime
  6. For the Bigframes notebooks, you do not need PySpark. Any kernel/environment works, and the processing runs in BigQuery in your GCP project
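
To make the last step more concrete, here is a minimal Python sketch of what a Bigframes cell can look like: the code runs in any Python kernel, while the aggregation itself is executed by BigQuery. The helper names `qualified_table` and `mean_by_group` are hypothetical and are not part of this repository. The sketch assumes the `bigframes` package is installed and that your credentials can run BigQuery jobs in the given project.

```python
def qualified_table(project: str, dataset: str, table: str) -> str:
    """Return a fully qualified BigQuery table ID: project.dataset.table."""
    return f"{project}.{dataset}.{table}"


def mean_by_group(project_id: str, table_id: str, group_col: str, value_col: str):
    """Compute a per-group mean in BigQuery and return it as a pandas Series."""
    # Imported lazily so this module still loads where bigframes is missing.
    import bigframes.pandas as bpd

    # Project that is billed for the BigQuery jobs; set it before the
    # first query of the session.
    bpd.options.bigquery.project = project_id

    df = bpd.read_gbq(table_id)  # the table data stays in BigQuery
    result = df.groupby(group_col)[value_col].mean()
    return result.to_pandas()    # download only the aggregated result
```

For example, `mean_by_group("my-project", qualified_table("bigquery-public-data", "ml_datasets", "penguins"), "species", "body_mass_g")` would return the average body mass per species, where `my-project` is a placeholder for your own GCP project ID.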

BigQuery Jupyter Plugin

We recommend the BigQuery Jupyter Plugin, which is installed in your local environment when you run pip install -r requirements.txt. It enables you to:

  • Connect your JupyterLab notebooks from anywhere to Dataproc
  • Develop in Python, SQL, Java/Scala, and R
  • Manage Dataproc clusters and jobs
  • Run notebooks in your favorite IDE that supports Jupyter using Dataproc as kernel
  • Deploy a notebook as a recurring job
  • View cloud and Spark logs inside JupyterLab
  • View your BigQuery dataset schemas inside JupyterLab
  • Manage your files on Google Cloud Storage (GCS)

Contributing

See the contributing instructions to get started.

License

All solutions within this repository are provided under the Apache 2.0 license. Please see the LICENSE file for more detailed terms and conditions.

Disclaimer

This repository and its contents are not an official Google Product.

Contact

Questions, issues, and comments can be raised via GitHub issues.
