Dataproc release notes

These release notes apply to the core Dataproc service, and include:

  • Announcements of the latest Dataproc image versions installed on the Compute Engine VMs used in Dataproc clusters. See the Dataproc version list for a list of supported Dataproc images, with links to pages that list the software components installed on current and recently released Dataproc images

  • Announcements of new and updated Dataproc and Serverless for Apache Spark features, bug fixes, known issues, and deprecated functionality

Release schedule: The release of the latest Dataproc images can take up to one week to roll out to all regions. Until the rollout is complete, the latest Dataproc images may not be available in your region.

You can see the latest product updates for all of Google Cloud on the Google Cloud page, browse and filter all release notes in the Google Cloud console, or programmatically access release notes in BigQuery.

To get the latest product updates delivered to you, add the URL of this page to your feed reader, or add the feed URL directly.

October 06, 2025

Dataproc on Compute Engine: The following diagnostic properties are now enabled by default for new Dataproc clusters created with 2.0+ image versions:

Note: To disable any of these features, set the corresponding property to false during cluster creation.

To continue using the Ops Agent initialization action opsagent.sh to ingest syslogs from Dataproc cluster nodes, do one of the following:

  • Recommended: Use opsagent_nosyslog.sh since VM syslogs are emitted by default from Dataproc clusters.
  • Set dataproc:dataproc.logging.syslog.enabled=false and continue using opsagent.sh to ingest syslogs.
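
The two options above can be sketched as `gcloud dataproc clusters create` argument lists, built in Python so they are easy to inspect. This is a minimal illustration only: the cluster name, the region, and the Cloud Storage folder that holds the initialization action scripts are hypothetical placeholders, not values taken from these notes.

```python
import shlex

# Hypothetical placeholders; replace with your own values.
CLUSTER = "example-cluster"
REGION = "us-central1"
INIT_ACTIONS_DIR = "gs://example-bucket/opsagent"  # assumed script location


def create_command(script, extra_properties=None):
    """Build a `gcloud dataproc clusters create` argument list."""
    args = [
        "gcloud", "dataproc", "clusters", "create", CLUSTER,
        f"--region={REGION}",
        f"--initialization-actions={INIT_ACTIONS_DIR}/{script}",
    ]
    if extra_properties:
        pairs = ",".join(f"{k}={v}" for k, v in extra_properties.items())
        args.append(f"--properties={pairs}")
    return args


# Option 1 (recommended): Dataproc emits syslogs, the Ops Agent skips them.
recommended = create_command("opsagent_nosyslog.sh")

# Option 2: turn off the default syslog emission and keep opsagent.sh.
legacy = create_command(
    "opsagent.sh",
    {"dataproc:dataproc.logging.syslog.enabled": "false"},
)

print(shlex.join(recommended))
print(shlex.join(legacy))
```

Only the second command carries a `--properties` flag, because the first relies on the new default behavior.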

Serverless for Apache Spark: Upgraded Apache Spark to version 3.5.3 in the latest 2.3 Serverless for Apache Spark runtime versions.

October 03, 2025

New Dataproc on Compute Engine subminor image versions:

  • 2.0.150-debian10, 2.0.150-ubuntu18, 2.0.150-rocky8
  • 2.1.99-debian11, 2.1.99-ubuntu20, 2.1.99-ubuntu20-arm, 2.1.99-rocky8
  • 2.2.67-debian12, 2.2.67-ubuntu22, 2.2.67-ubuntu22-arm, 2.2.67-rocky9
  • 2.3.14-debian12, 2.3.14-ubuntu22, 2.3.14-ubuntu22-arm, 2.3.14-ml-ubuntu22, 2.3.14-rocky9
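
Subminor image version names such as those listed above appear to follow a `major.minor.subminor-os` shape. The following is a small sketch, under that assumption, of how a script might split them into parts; the `ImageVersion` type and `parse_image_version` function are illustrative names, and preview labels such as 3.0.0-RC1 are not handled.

```python
import re
from typing import NamedTuple


class ImageVersion(NamedTuple):
    major: int
    minor: int
    subminor: int
    os_suffix: str  # for example "debian12" or "ml-ubuntu22"


# Assumed shape: three dot-separated numbers, a hyphen, then the OS suffix.
_PATTERN = re.compile(r"^(\d+)\.(\d+)\.(\d+)-(.+)$")


def parse_image_version(text):
    """Split a subminor image version string into its components."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a subminor image version: {text!r}")
    major, minor, subminor, os_suffix = match.groups()
    return ImageVersion(int(major), int(minor), int(subminor), os_suffix)


line = "2.3.14-debian12, 2.3.14-ubuntu22-arm, 2.3.14-ml-ubuntu22"
versions = [parse_image_version(item) for item in line.split(",")]
for version in versions:
    print(version)
```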

September 15, 2025

New Dataproc on Compute Engine subminor image versions:

  • 2.0.149-debian10, 2.0.149-ubuntu18, 2.0.149-rocky8
  • 2.1.98-debian11, 2.1.98-ubuntu20, 2.1.98-ubuntu20-arm, 2.1.98-rocky8
  • 2.2.66-debian12, 2.2.66-ubuntu22, 2.2.66-ubuntu22-arm, 2.2.66-rocky9
  • 2.3.13-debian12, 2.3.13-ubuntu22, 2.3.13-ubuntu22-arm, 2.3.13-ml-ubuntu22, 2.3.13-rocky9

September 11, 2025

New Serverless for Apache Spark runtime versions:

  • 1.2.61
  • 2.2.61
  • 2.3.12
  • 3.0.0-RC4

September 08, 2025

Announcing the Preview release of Dataproc on Compute Engine image version 3.0.0-RC1:

  • Spark 4.0.0
  • Hadoop 3.4.1
  • Hive 4.1.0
  • Tez 0.10.5
  • Cloud Storage Connector 3.1.4
  • Conda 24.11
  • Java 17
  • Python 3.11
  • R 4.3
  • Scala 2.13

Announcing the Preview release of Serverless for Apache Spark 3.0.0-RC3 runtime:

  • Spark 4.0.0
  • BigQuery Spark Connector 0.42.3
  • Cloud Storage Connector 3.1.5
  • Conda 25.3.0
  • Java 21
  • Python 3.12
  • R 4.4
  • Scala 2.13

New Dataproc on Compute Engine subminor image versions:

  • 2.3.11-debian12, 2.3.11-ubuntu22, 2.3.11-ubuntu22-arm, 2.3.11-ml-ubuntu22, 2.3.11-rocky9

September 05, 2025

September 02, 2025

Multi-tenant clusters are now available in Preview. Multiple data engineers and data scientists can share a multi-tenant cluster and run their workloads in isolation from each other.

August 29, 2025

New Dataproc on Compute Engine subminor image versions:

  • 2.0.147-debian10, 2.0.147-ubuntu18, 2.0.147-rocky8
  • 2.1.96-debian11, 2.1.96-ubuntu20, 2.1.96-ubuntu20-arm, 2.1.96-rocky8
  • 2.2.64-debian12, 2.2.64-ubuntu22, 2.2.64-ubuntu22-arm, 2.2.64-rocky9
  • 2.3.10-debian12, 2.3.10-ubuntu22, 2.3.10-ubuntu22-arm, 2.3.10-ml-ubuntu22, 2.3.10-rocky9

August 22, 2025

August 21, 2025

Serverless for Apache Spark: Fixed a bug in Dataproc Batches that occasionally caused higher latency before an application was started.

August 19, 2025

New Dataproc on Compute Engine subminor image versions:

  • 2.0.146-debian10, 2.0.146-ubuntu18, 2.0.146-rocky8
  • 2.1.95-debian11, 2.1.95-ubuntu20, 2.1.95-ubuntu20-arm, 2.1.95-rocky8
  • 2.2.63-debian12, 2.2.63-ubuntu22, 2.2.63-ubuntu22-arm, 2.2.63-rocky9
  • 2.3.9-debian12, 2.3.9-ubuntu22, 2.3.9-ubuntu22-arm, 2.3.9-ml-ubuntu22, 2.3.9-rocky9

August 14, 2025

August 12, 2025

New Dataproc on Compute Engine subminor image versions:

  • 2.0.145-debian10, 2.0.145-ubuntu18, 2.0.145-rocky8
  • 2.1.94-debian11, 2.1.94-ubuntu20, 2.1.94-ubuntu20-arm, 2.1.94-rocky8
  • 2.2.62-debian12, 2.2.62-ubuntu22, 2.2.62-ubuntu22-arm, 2.2.62-rocky9
  • 2.3.8-debian12, 2.3.8-ubuntu22, 2.3.8-ubuntu22-arm, 2.3.8-ml-ubuntu22, 2.3.8-rocky9

Dataproc on Compute Engine: Image versions 2.2 and 2.3: The Iceberg optional component supports the BigLake Iceberg REST catalog.

Dataproc on Compute Engine: Sharing checkpoint diagnostic data: Setting the dataproc:diagnostic.capture.access=GOOGLE_DATAPROC_DIAGNOSE property during cluster creation shares all of the temp bucket contents with Google Cloud support if uniform bucket-level access is enabled on the temp bucket. If object-level access control is in effect on the temp bucket, only the checkpoint diagnostic data folder corresponding to the cluster in Cloud Storage is shared.
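
The sharing rule in the note above depends on how access control is configured on the temp bucket. A minimal sketch of that decision follows; the function name and return strings are illustrative, and the code only restates the note rather than querying Cloud Storage.

```python
# Property and value as named in the release note.
ACCESS_PROPERTY = "dataproc:diagnostic.capture.access"
ACCESS_VALUE = "GOOGLE_DATAPROC_DIAGNOSE"


def shared_with_support(uniform_bucket_level_access: bool) -> str:
    """Describe what is shared when the access property is set."""
    if uniform_bucket_level_access:
        # Bucket-wide permissions: everything in the temp bucket is shared.
        return "all temp bucket contents"
    # Object-level access control: sharing is limited to one folder.
    return "checkpoint diagnostic data folder for the cluster"


print(f"{ACCESS_PROPERTY}={ACCESS_VALUE}")
print("uniform access ->", shared_with_support(True))
print("object-level access ->", shared_with_support(False))
```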

August 11, 2025

New Dataproc on Compute Engine subminor image versions:

  • 2.1.93-debian11, 2.1.93-rocky8, 2.1.93-ubuntu20, 2.1.93-ubuntu20-arm
  • 2.2.61-debian12, 2.2.61-rocky9, 2.2.61-ubuntu22, 2.2.61-ubuntu22-arm

July 31, 2025

New Dataproc Serverless for Spark runtime versions:

  • 1.1.111
  • 1.2.55
  • 2.2.55
  • 2.3.6

Dataproc Serverless for Spark: Subminor version 1.1.111 is the last release of runtime version 1.1, which will no longer be supported and will not receive new releases.

July 25, 2025

New Dataproc on Compute Engine subminor image versions:

  • 2.3.7-debian12, 2.3.7-ubuntu22, 2.3.7-ubuntu22-arm, 2.3.7-ml-ubuntu22, 2.3.7-rocky9

The 2.3.7-ml-ubuntu22 image extends the 2.3 base image with ML-specific libraries.

July 15, 2025

Dataproc on Compute Engine: Starting August 18, 2025, the following diagnostic properties will be enabled by default for newly created Dataproc clusters:

Note: To disable any of these features, set the corresponding property to false during cluster creation.

New Dataproc on Compute Engine subminor image versions:

  • 2.3.6-debian12, 2.3.6-ubuntu22, 2.3.6-ml-ubuntu22, 2.3.6-rocky9

The 2.3.6-ml-ubuntu22 image extends the 2.3 base image with ML-specific libraries.

Dataproc now supports dynamic updates of multi-tenancy clusters.

July 07, 2025

The Cluster Scheduled Stop feature is available in preview. You can use this feature to stop clusters after a specified idle period, at a specified future time, or after a specified period from the cluster creation or update request.

July 04, 2025

New Dataproc on Compute Engine subminor image versions:

  • 2.3.5-debian12, 2.3.5-ubuntu22, 2.3.5-ml-ubuntu22, 2.3.5-rocky9

The 2.3.5-ml-ubuntu22 image extends the 2.3 base image with ML-specific libraries.

Serverless for Apache Spark (formerly known as Dataproc Serverless for Spark) now supports OS Login organization policy. Organizations, folders, and projects that enforce the OS Login policy can now use Serverless for Apache Spark.

July 01, 2025

New Dataproc Serverless for Spark runtime versions:

  • 1.1.110
  • 1.2.54
  • 2.2.54
  • 2.3.5

June 20, 2025

New Dataproc Serverless for Spark runtime versions:

  • 1.1.109
  • 1.2.53
  • 2.2.53
  • 2.3.4

Dataproc Serverless for Spark: Upgraded the Cloud Storage connector version to 2.2.28 in the 1.1 runtime.

Dataproc Serverless for Spark: The built-in Iceberg now supports the BigLake Iceberg REST catalog on the 2.2 runtime.

New Dataproc on Compute Engine subminor image versions:

  • 2.0.144-debian10, 2.0.144-rocky8, 2.0.144-ubuntu18
  • 2.1.92-debian11, 2.1.92-rocky8, 2.1.92-ubuntu20, 2.1.92-ubuntu20-arm
  • 2.2.60-debian12, 2.2.60-rocky9, 2.2.60-ubuntu22
  • 2.3.4-debian12, 2.3.4-rocky9, 2.3.4-ubuntu22, and 2.3.4-ml-ubuntu22.

The 2.3.4-ml-ubuntu22 image extends the 2.3 base image with ML-specific libraries.

Dataproc on Compute Engine: Upgraded the Cloud Storage connector version to 2.2.28 in the latest 2.0 and 2.1 images.

Dataproc on Compute Engine: Dataproc now automatically configures Knox Gateway configuration properties gateway.dispatch.whitelist.services and gateway.dispatch.whitelist for component web UIs within the cluster.

Dataproc on Compute Engine: Fixed a bug in trino-jvm cluster properties. To configure Trino JVM options prefixed with trino-jvm, follow these guidelines:

  • Specify JVM options that start with -XX: without the colon (:). For JVM flags that take no value, append = to the flag. For example, set trino-jvm:-XX+HeapDumpOnOutOfMemoryError= to add -XX:+HeapDumpOnOutOfMemoryError to jvm.config.
  • Specify JVM system properties, which use a -D prefix, in the same way. For example, trino-jvm:-Dsystem.property.name=value.
  • Any value containing : cannot be provided as a cluster property.
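
The guidelines above can be sketched as a small conversion helper. This is an illustration of the stated rules only, not Dataproc's own implementation: `to_trino_jvm_property` is a hypothetical name, and option forms other than -XX: and -D are rejected because the note does not describe them.

```python
PREFIX = "trino-jvm:"


def to_trino_jvm_property(jvm_option: str) -> str:
    """Rewrite a jvm.config option as a trino-jvm cluster property."""
    if jvm_option.startswith("-XX:"):
        # Drop the colon that follows -XX.
        body = "-XX" + jvm_option[len("-XX:"):]
    elif jvm_option.startswith("-D"):
        body = jvm_option
    else:
        raise ValueError(f"option form not covered by this sketch: {jvm_option!r}")
    if ":" in body:
        # Values that contain ':' cannot be provided as a cluster property.
        raise ValueError(f"value contains ':': {jvm_option!r}")
    if "=" not in body:
        # Flags without a value get a trailing '='.
        body += "="
    return PREFIX + body


print(to_trino_jvm_property("-XX:+HeapDumpOnOutOfMemoryError"))
print(to_trino_jvm_property("-Dsystem.property.name=value"))
```

The two calls reproduce the examples given in the guidelines.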

Dataproc on Compute Engine & Dataproc Serverless: Backported GH-3198 in Parquet addressing CVE-2025-46762.

June 10, 2025

New Dataproc Serverless for Spark runtime versions:

  • 1.1.108
  • 1.2.52
  • 2.2.52
  • 2.3.3

June 09, 2025

Announcing the GA release of Dataproc on Compute Engine image version 2.3:

Image Version 2.3 is a lightweight image that contains only core components, reducing exposure to Common Vulnerabilities and Exposures (CVEs). For higher security compliance requirements, use the image version 2.3 or later when creating a Dataproc cluster. Optional components can still be deployed on-demand.

The following images are the latest available 2.3 subminor image versions:

  • 2.3.3-debian12, 2.3.3-rocky9, 2.3.3-ubuntu22, and 2.3.3-ml-ubuntu22.

The 2.3.3-ml-ubuntu22 image extends the 2.3 base image with ML-specific libraries.

June 06, 2025

New Dataproc Serverless for Spark runtime versions:

  • 1.1.107
  • 1.2.51
  • 2.2.51
  • 2.3.2

Dataproc Serverless for Spark: Fixed a bug that prevented the spark.executorEnv property from correctly setting specific executor environment variables across all runtimes.

June 01, 2025

New Dataproc on Compute Engine subminor image versions:

  • 2.0.143-debian10, 2.0.143-rocky8, 2.0.143-ubuntu18
  • 2.1.91-debian11, 2.1.90-rocky8, 2.1.91-ubuntu20, 2.1.91-ubuntu20-arm
  • 2.2.59-debian12, 2.2.59-rocky9, 2.2.59-ubuntu22

Dataproc on Compute Engine: Fixed the ordering of log entries generated from clusters created with 2.2+ image versions by assigning timestamps closer to the log generation time.

May 30, 2025

New Dataproc Serverless for Spark runtime versions:

  • 1.1.106
  • 1.2.50
  • 2.2.50
  • 2.3.1

The support dates for Dataproc on Compute Engine image versions 2.0, 2.1, and 2.2 have been extended, as follows:

  • Image version 2.2: Supported until 03/31/2027
  • Image version 2.1: Supported until 03/31/2026
  • Image version 2.0: Supported until 09/30/2025
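
As a rough illustration, the extended support dates above can be encoded in a lookup table and checked against a date. The sketch assumes the dates are written as MM/DD/YYYY and that the end date itself still counts as supported; `is_supported` is a hypothetical helper.

```python
from datetime import date

# End-of-support dates from the list above (read as MM/DD/YYYY).
SUPPORTED_UNTIL = {
    "2.2": date(2027, 3, 31),
    "2.1": date(2026, 3, 31),
    "2.0": date(2025, 9, 30),
}


def is_supported(image_version: str, on: date) -> bool:
    """Return True if the image version's minor track is supported on `on`."""
    minor_track = ".".join(image_version.split(".")[:2])
    if minor_track not in SUPPORTED_UNTIL:
        raise KeyError(f"no support date recorded for {minor_track}")
    # Assumption: the end date is inclusive.
    return on <= SUPPORTED_UNTIL[minor_track]


print(is_supported("2.0.143-debian10", date(2025, 10, 1)))
print(is_supported("2.2.59-debian12", date(2025, 10, 1)))
```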

May 28, 2025

Announcing the General Availability release of Spark on BigQuery, which lets you create a serverless Spark session in a BigQuery Studio notebook. Use this feature to create, run, and test Spark jobs quickly and easily. For more information, see Run PySpark code in BigQuery Studio notebooks.

New Dataproc Serverless for Spark runtime versions:

  • 1.1.105
  • 1.2.49
  • 2.2.49
  • 2.3.0

Announcing the General Availability (GA) release of Dataproc Serverless for Spark runtime versions 2.3, which include the following components:

  • Spark 3.5.1
  • BigQuery Spark Connector 0.42.3
  • Cloud Storage Connector 3.1.2
  • Java 17
  • Python 3.11
  • R 4.3
  • Scala 2.13

May 23, 2025

Dataproc now supports the creation of zero-scale clusters, available in preview. This feature provides a cost-effective way to use Dataproc clusters, as they utilize only secondary workers that can be scaled down to zero when not in use.

New Dataproc on Compute Engine subminor image versions:

  • 2.0.142-debian10, 2.0.142-rocky8, 2.0.142-ubuntu18
  • 2.1.90-debian11, 2.1.90-rocky8, 2.1.90-ubuntu20, 2.1.90-ubuntu20-arm
  • 2.2.58-debian12, 2.2.58-rocky9, 2.2.58-ubuntu22

May 22, 2025

May 15, 2025

New Dataproc on Compute Engine subminor image versions:

  • 2.0.141-debian10, 2.0.141-rocky8, 2.0.141-ubuntu18
  • 2.1.89-debian11, 2.1.89-rocky8, 2.1.89-ubuntu20, 2.1.89-ubuntu20-arm
  • 2.2.57-debian12, 2.2.57-rocky9, 2.2.57-ubuntu22

May 12, 2025

Dataproc Serverless for Spark: Spark UI for Dataproc Serverless batches and interactive sessions, which lets you monitor and debug your serverless Spark workloads, now features Event Timeline and Task Quantile views for enhanced troubleshooting.

May 09, 2025

New Dataproc on Compute Engine subminor image versions:

  • 2.0.140-debian10, 2.0.140-rocky8, 2.0.140-ubuntu18
  • 2.1.88-debian11, 2.1.88-rocky8, 2.1.88-ubuntu20, 2.1.88-ubuntu20-arm
  • 2.2.56-debian12, 2.2.56-rocky9, 2.2.56-ubuntu22

May 08, 2025

May 07, 2025

Dataproc on Compute Engine: The default enabling of the following cluster properties previously announced to occur on May 10, 2025 (see the February 10, 2025 release note) has been postponed to a future date. The future date will be announced in a release note at least one month in advance of the change. Until then, these diagnostic properties will continue to be set to false by default unless set to true by the user.

  • dataproc:diagnostic.capture.enabled
  • dataproc:dataproc.logging.extended.enabled
  • dataproc:dataproc.logging.syslog.enabled
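
Until the defaults change, these properties have to be set explicitly. The following sketch builds the properties map for a cluster creation request; the `softwareConfig.properties` nesting reflects an understanding of the clusters API resource layout and should be checked against the API reference before use.

```python
import json

# Property names from the list above.
DIAGNOSTIC_PROPERTIES = (
    "dataproc:diagnostic.capture.enabled",
    "dataproc:dataproc.logging.extended.enabled",
    "dataproc:dataproc.logging.syslog.enabled",
)


def software_config(enabled: bool) -> dict:
    """Return a config fragment that sets all three properties at once."""
    value = "true" if enabled else "false"
    properties = {name: value for name in DIAGNOSTIC_PROPERTIES}
    # Assumed nesting of the properties map in the cluster config.
    return {"softwareConfig": {"properties": properties}}


print(json.dumps(software_config(True), indent=2))
```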

May 02, 2025

New Dataproc on Compute Engine subminor image versions:

  • 2.0.139-debian10, 2.0.139-rocky8, 2.0.139-ubuntu18
  • 2.1.87-debian11, 2.1.87-rocky8, 2.1.87-ubuntu20, 2.1.87-ubuntu20-arm
  • 2.2.55-debian12, 2.2.55-rocky9, 2.2.55-ubuntu22

Dataproc on Compute Engine: Upgraded NodeProblemDetector to a 0.8.20-based version for the 2.2 image.

Dataproc on Compute Engine: Upgraded oauth2l to v1.3.3 to address CVEs.

Dataproc on Compute Engine: Fixed an issue with Apache Hudi that caused failures in the Hudi CLI.

May 01, 2025

Native Query Execution now supports reading Apache ORC complex types.

Dataproc Serverless: Backported GH-3168 in Parquet addressing CVE-2025-30065.

April 29, 2025

New Dataproc on Compute Engine subminor image versions:

  • 2.0.138-debian10, 2.0.138-rocky8, 2.0.138-ubuntu18
  • 2.1.86-debian11, 2.1.86-rocky8, 2.1.86-ubuntu20, 2.1.86-ubuntu20-arm
  • 2.2.54-debian12, 2.2.54-rocky9, 2.2.54-ubuntu22

Dataproc on Compute Engine: Fixed Job ID retrieval in Dataproc job logs for clusters created with 2.0 and 2.1 image versions by ignoring the timestamp prefix.

Dataproc on Compute Engine: Added a temporary object hold on the spark-job-history folder in Cloud Storage to prevent deletion by Cloud Storage lifecycle management.

Dataproc on Compute Engine: Backported GH-3168 in Parquet addressing CVE-2025-30065.

April 18, 2025

April 17, 2025

New Dataproc on Compute Engine subminor image versions:

  • 2.0.137-debian10, 2.0.137-rocky8, 2.0.137-ubuntu18
  • 2.1.85-debian11, 2.1.85-rocky8, 2.1.85-ubuntu20, 2.1.85-ubuntu20-arm
  • 2.2.53-debian12, 2.2.53-rocky9, 2.2.53-ubuntu22

Dataproc on Compute Engine: The Spark BigQuery connector has been upgraded to version 0.34.1 in the latest 2.2 image version.

Fixed a bug in which Jupyter failed to restart upon cluster restart on Personal Authentication clusters.

April 09, 2025

Dataproc Serverless for Spark: Gemini Cloud Assist Investigations is available in Preview for the following runtimes:

  • 1.1
  • 1.2
  • 2.2

April 08, 2025

New Dataproc on Compute Engine subminor image versions:

  • 2.2.52-debian12, 2.2.52-rocky9, 2.2.52-ubuntu22

Dataproc on Compute Engine: Fixed an issue with the retrieval of an access token when using the ranger-gcs-plugin with 2.2 images.

April 03, 2025

Dataproc Serverless for Spark: Installed CUDA, cuDNN and NCCL NVIDIA libraries in 1.2 and 2.2 runtimes.

April 01, 2025

New Dataproc on Compute Engine subminor image versions:

  • 2.2.51-debian12, 2.2.51-rocky9, 2.2.51-ubuntu22

Dataproc on Compute Engine: Hyperdisk-Balanced is now the default primary disk type when creating a cluster from the console.

Dataproc on Compute Engine: Fixed incorrectly attributed Dataproc job logs in Cloud Logging for clusters created with 2.2+ image versions. This happened when multiple Dataproc jobs were running concurrently on the same cluster.

March 31, 2025

March 28, 2025

Dataproc Serverless for Spark: Hadoop Native libraries are installed by default in all runtimes.

March 17, 2025

New Dataproc on Compute Engine subminor image versions:

  • 2.0.136-debian10, 2.0.136-rocky8, 2.0.136-ubuntu18
  • 2.1.84-debian11, 2.1.84-rocky8, 2.1.84-ubuntu20, 2.1.84-ubuntu20-arm
  • 2.2.50-debian12, 2.2.50-rocky9, 2.2.50-ubuntu22

Dataproc on Compute Engine: Spark upgraded to version 3.5.3 in the latest Dataproc image version 2.2.

Dataproc on Compute Engine: The latest Dataproc 2.2 image version now supports Spark data lineage.

Dataproc on Compute Engine: Added support for Enhanced Flexibility Mode (EFM) with primary worker shuffle mode on Spark for image version 2.2.50 and above.

March 14, 2025

March 10, 2025

New Dataproc on Compute Engine subminor image versions:

  • 2.0.135-debian10, 2.0.135-rocky8, 2.0.135-ubuntu18
  • 2.1.83-debian11, 2.1.83-rocky8, 2.1.83-ubuntu20, 2.1.83-ubuntu20-arm
  • 2.2.49-debian12, 2.2.49-rocky9, 2.2.49-ubuntu22

March 04, 2025

Dataproc is now available in the europe-north2 region (Stockholm, Sweden).

March 03, 2025

March 01, 2025

New Dataproc on Compute Engine subminor image versions:

  • 2.0.134-debian10, 2.0.134-rocky8, 2.0.134-ubuntu18
  • 2.1.82-debian11, 2.1.82-rocky8, 2.1.82-ubuntu20, 2.1.82-ubuntu20-arm
  • 2.2.48-debian12, 2.2.48-rocky9, 2.2.48-ubuntu22

Dataproc on Compute Engine: Explicitly disabled the sha1 and md5 algorithms for use with the kex and kex-gss sshd features.

February 24, 2025

New Dataproc on Compute Engine subminor image versions:

  • 2.0.133-debian10, 2.0.133-rocky8, 2.0.133-ubuntu18
  • 2.1.81-debian11, 2.1.81-rocky8, 2.1.81-ubuntu20, 2.1.81-ubuntu20-arm
  • 2.2.47-debian12, 2.2.47-rocky9, 2.2.47-ubuntu22

February 17, 2025

New Dataproc on Compute Engine subminor image versions:

  • 2.0.132-debian10, 2.0.132-rocky8, 2.0.132-ubuntu18
  • 2.1.80-debian11, 2.1.80-rocky8, 2.1.80-ubuntu20, 2.1.80-ubuntu20-arm
  • 2.2.46-debian12, 2.2.46-rocky9, 2.2.46-ubuntu22

February 11, 2025

Data Lineage for Dataproc Hive is now in Public Preview. You can enable it by using the Hive Lineage initialization action.

February 10, 2025

Dataproc on Compute Engine: To help diagnose Dataproc clusters, you can set the following cluster properties to true when you create a cluster:

  • dataproc:diagnostic.capture.enabled
  • dataproc:dataproc.logging.extended.enabled
  • dataproc:dataproc.logging.syslog.enabled

Note: Starting May 10, 2025, these properties will be set to true by default.

February 09, 2025

February 07, 2025

New Dataproc on Compute Engine subminor image versions:

  • 2.0.131-debian10, 2.0.131-rocky8, 2.0.131-ubuntu18
  • 2.1.79-debian11, 2.1.79-rocky8, 2.1.79-ubuntu20, 2.1.79-ubuntu20-arm
  • 2.2.45-debian12, 2.2.45-rocky9, 2.2.45-ubuntu22

Spark UI for Dataproc Serverless Batches and Interactive sessions, which lets you monitor and debug your serverless Spark workloads, is now available for CMEK (Customer-Managed Encryption Keys) and Assured Workloads. The Spark UI is available by default and free of cost.

February 02, 2025

January 31, 2025

New Dataproc on Compute Engine subminor image versions:

  • 2.0.130-debian10, 2.0.130-rocky8, 2.0.130-ubuntu18
  • 2.1.78-debian11, 2.1.78-rocky8, 2.1.78-ubuntu20, 2.1.78-ubuntu20-arm
  • 2.2.44-debian12, 2.2.44-rocky9, 2.2.44-ubuntu22

The Hyperdisk Balanced primary disk type is now available on Dataproc clusters.

New machine types are available for the Hyperdisk Balanced disk type on clusters: C4, C4A, and N4.

January 30, 2025

Dataproc on Compute Engine: Private Google Access is now automatically enabled in the configured subnetwork when creating clusters with internal IP addresses.

Dataproc Serverless for Spark: Private Google Access is now automatically enabled in the configured subnetwork when running batch workloads and interactive sessions.

January 24, 2025

New Dataproc on Compute Engine subminor image versions:

  • 2.0.129-debian10, 2.0.129-rocky8, 2.0.129-ubuntu18
  • 2.1.77-debian11, 2.1.77-rocky8, 2.1.77-ubuntu20, 2.1.77-ubuntu20-arm
  • 2.2.43-debian12, 2.2.43-rocky9, 2.2.43-ubuntu22

Dataproc cluster caching now supports ARM images.

Added the Zeppelin component to 2.1-ubuntu20-arm images.

January 23, 2025

January 17, 2025

New Dataproc on Compute Engine subminor image versions:

  • 2.0.128-debian10, 2.0.128-rocky8, 2.0.128-ubuntu18
  • 2.1.76-debian11, 2.1.76-rocky8, 2.1.76-ubuntu20, 2.1.76-ubuntu20-arm
  • 2.2.42-debian12, 2.2.42-rocky9, 2.2.42-ubuntu22

Dataproc Serverless for Spark:

  • Added support for XGBoost 2.1 in the 2.2 runtime.
  • Changed the spark.sql.maxMetadataStringLength default value to 5000 for the 1.2 and 2.2 runtimes.

January 13, 2025

Dataproc Serverless for Spark: On March 10, 2025, the Dataproc Resource Manager API will be enabled as part of General Availability (GA) for Dataproc Serverless 3.0+ versions.

User action will not be required in response to this API enablement change.

The Dataproc Resource Manager will be implemented as a stand-alone Google Cloud API, dataprocrm.googleapis.com. It will allow Dataproc distributions of open source software, particularly Apache Spark, to directly communicate resource requirements.

January 10, 2025

New Dataproc on Compute Engine subminor image versions:

  • 2.0.127-debian10, 2.0.127-rocky8, 2.0.127-ubuntu18
  • 2.1.75-debian11, 2.1.75-rocky8, 2.1.75-ubuntu20, 2.1.75-ubuntu20-arm
  • 2.2.41-debian12, 2.2.41-rocky9, 2.2.41-ubuntu22

December 12, 2024

New Dataproc on Compute Engine subminor image versions:

  • 2.0.126-debian10, 2.0.126-rocky8, 2.0.126-ubuntu18
  • 2.1.74-debian11, 2.1.74-rocky8, 2.1.74-ubuntu20, 2.1.74-ubuntu20-arm
  • 2.2.40-debian12, 2.2.40-rocky9, 2.2.40-ubuntu22

Dataproc on Compute Engine: Updated the Dataproc Metastore (DPMS) gRPC proxy image version to v0.0.70.

November 20, 2024

Dataproc Serverless for Spark: Spark Lineage is available for all supported Dataproc Serverless for Spark runtime versions.

November 18, 2024

Dataproc is now available in the northamerica-south1 region (Queretaro, Mexico).

November 11, 2024

Announcing the General Availability (GA) of Flexible shapes for Dataproc secondary workers, which lets you provide a ranked selection of machine types to use when creating VMs.

Announcing the General Availability (GA) of Spot and non-preemptible VM mixing for Dataproc secondary workers, which lets you mix Spot and non-preemptible secondary workers when you create a Dataproc cluster.

October 31, 2024

New Dataproc on Compute Engine subminor image versions:

  • 2.0.125-debian10, 2.0.125-rocky8, 2.0.125-ubuntu18
  • 2.1.73-debian11, 2.1.73-rocky8, 2.1.73-ubuntu20, 2.1.73-ubuntu20-arm
  • 2.2.39-debian12, 2.2.39-rocky9, 2.2.39-ubuntu22

Note: When using Dataproc version 2.0.125 with the ranger-gcs-plugin, create a customer support request for your project to use the enhanced version of the plugin prior to its GA release. This note does not apply to Dataproc on Compute Engine image versions 2.1 and 2.2.

Disabled HiveServer2 Ranger policy synchronization in non-HA clusters for the latest 2.1 and later image versions. Policy synchronization caused instability in the HiveServer2 process when it tried to connect to ZooKeeper, which is not active by default in non-HA clusters.

October 25, 2024

New Dataproc on Compute Engine subminor image versions:

  • 2.0.124-debian10, 2.0.124-rocky8, 2.0.124-ubuntu18
  • 2.1.72-debian11, 2.1.72-rocky8, 2.1.72-ubuntu20, 2.1.72-ubuntu20-arm
  • 2.2.38-debian12, 2.2.38-rocky9, 2.2.38-ubuntu22

Dataproc Serverless for Spark: The Hadoop Google Secret Manager Credential Provider feature is now available in the Dataproc Serverless for Spark 1.2 and 2.2 runtimes.

Dataproc Serverless for Spark: Added common AI/ML Python packages by default to Dataproc Serverless for Spark 1.2 and 2.2 runtimes.

Dataproc Serverless for Spark: Upgraded Cloud Storage connector to 3.0.3 version in the latest 1.2 and 2.2 runtimes.

October 21, 2024

Announcing the General Availability (GA) release of Spark UI for Dataproc Serverless Batches and Interactive sessions, which lets you monitor and debug your serverless Spark workloads. Spark UI is available by default and free of cost for all Dataproc Serverless workloads.

October 18, 2024

October 17, 2024

October 14, 2024

Dataproc Clusters created with image versions 2.0.57+, 2.1.5+, or 2.2+: Secondary workers' control plane operations are made by the Dataproc Service Agent service account (service-<project-number>@dataproc-accounts.iam.gserviceaccount.com). They will no longer use the Google APIs Service Agent service account (<project-number>@cloudservices.gserviceaccount.com).
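
Because the acting service account changes, IAM bindings or audit-log filters that reference the old account may need review. A small sketch that derives both account names from a project number, using the formats given in the note above (the project number shown is a made-up example):

```python
def dataproc_service_agent(project_number: int) -> str:
    """Account now used for secondary worker control plane operations."""
    return f"service-{project_number}@dataproc-accounts.iam.gserviceaccount.com"


def google_apis_service_agent(project_number: int) -> str:
    """Account that was used previously."""
    return f"{project_number}@cloudservices.gserviceaccount.com"


PROJECT_NUMBER = 123456789012  # hypothetical project number
print("new:", dataproc_service_agent(PROJECT_NUMBER))
print("old:", google_apis_service_agent(PROJECT_NUMBER))
```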

October 11, 2024

October 08, 2024

October 04, 2024

September 30, 2024

Blocklisted the following Dataproc on Compute Engine subminor image versions:

  • 2.0.120-debian10, 2.0.120-rocky8, 2.0.120-ubuntu18
  • 2.1.68-debian11, 2.1.68-rocky8, 2.1.68-ubuntu20, 2.1.68-ubuntu20-arm
  • 2.2.34-debian12, 2.2.34-rocky9, 2.2.34-ubuntu22

September 23, 2024

Dataproc Serverless for Spark: In runtime versions 1.2 and 2.2, minimized the dynamic memory footprint of the Spark application by setting -XX:MaxHeapFreeRatio to 30% and -XX:MinHeapFreeRatio to 10%.

Dataproc Serverless for Spark: Added the google-cloud-dlp Python package by default to the Dataproc Serverless for Spark runtimes.

Dataproc Serverless for Spark: Fixed an issue that would cause some batches and sessions to fail to start when using the premium compute tier.

September 21, 2024

Blocklisted the following Dataproc on Compute Engine subminor image versions:

  • 2.0.119-debian10, 2.0.103-rocky8, 2.0.103-ubuntu18
  • 2.1.67-debian11, 2.1.51-rocky8, 2.1.51-ubuntu20, 2.1.51-ubuntu20-arm
  • 2.2.33-debian12, 2.2.17-rocky9, 2.2.17-ubuntu22

September 16, 2024

New Dataproc on Compute Engine subminor image versions:

  • 2.0.118-debian10, 2.0.118-rocky8, 2.0.118-ubuntu18
  • 2.1.66-debian11, 2.1.66-rocky8, 2.1.66-ubuntu20, 2.1.66-ubuntu20-arm
  • 2.2.32-debian12, 2.2.32-rocky9, 2.2.32-ubuntu22

September 13, 2024

Dataproc Serverless for Spark: Fixed a bug that caused some batches and sessions to fail to start when using the premium compute tier.

September 06, 2024

New Dataproc on Compute Engine subminor image versions:

  • 2.0.117-debian10, 2.0.117-rocky8, 2.0.117-ubuntu18
  • 2.1.65-debian11, 2.1.65-rocky8, 2.1.65-ubuntu20, 2.1.65-ubuntu20-arm
  • 2.2.31-debian12, 2.2.31-rocky9, 2.2.31-ubuntu22

Dataproc on Compute Engine: The latest 2.2 image versions now support Hudi 0.15.0.

Dataproc on Compute Engine: The latest 2.2 image versions support Hudi Trino integration natively. If both components are selected when you create a Dataproc cluster, Trino will be configured to support Hudi automatically.

September 04, 2024

Dataproc on Compute Engine: Dataproc image version 2.2 will become the default Dataproc on Compute Engine image version on September 6, 2024.

September 03, 2024

New Dataproc on Compute Engine subminor image versions:

  • 2.0.116-debian10, 2.0.116-rocky8, 2.0.116-ubuntu18
  • 2.1.64-debian11, 2.1.64-rocky8, 2.1.64-ubuntu20, 2.1.64-ubuntu20-arm
  • 2.2.30-debian12, 2.2.30-rocky9, 2.2.30-ubuntu22

Dataproc on Compute Engine: Apache Spark upgraded to version 3.5.1 in image version 2.2 starting with image version 2.2.30.

Dataproc on GKE runtime version 2.0 (Spark 3.1) is deprecated.

August 26, 2024

New Dataproc on Compute Engine subminor image versions:

  • 2.0.115-debian10, 2.0.115-rocky8, 2.0.115-ubuntu18
  • 2.1.63-debian11, 2.1.63-rocky8, 2.1.63-ubuntu20, 2.1.63-ubuntu20-arm
  • 2.2.29-debian12, 2.2.29-rocky9, 2.2.29-ubuntu22

August 22, 2024

New Dataproc Serverless for Spark runtime versions:

  • 1.1.77
  • 1.2.21
  • 2.0.85
  • 2.2.21

Dataproc Serverless for Spark: Subminor version 2.0.85 is the last release of runtime version 2.0, which will no longer be supported and will not receive new releases.

August 19, 2024

New Dataproc on Compute Engine subminor image versions:

  • 2.0.114-debian10, 2.0.114-rocky8, 2.0.114-ubuntu18
  • 2.1.62-debian11, 2.1.62-rocky8, 2.1.62-ubuntu20, 2.1.62-ubuntu20-arm
  • 2.2.28-debian12, 2.2.28-rocky9, 2.2.28-ubuntu22

syslog is now available for Dataproc cluster nodes in Cloud Logging. See Dataproc logs for cluster and job log information.

August 15, 2024

New Dataproc Serverless for Spark runtime versions:

  • 1.1.76
  • 1.2.20
  • 2.0.84
  • 2.2.20

August 12, 2024

New Dataproc Serverless for Spark runtime versions:

  • 1.1.75
  • 1.2.19
  • 2.0.83
  • 2.2.19

July 31, 2024

New Dataproc Serverless for Spark runtime versions:

  • 1.1.74
  • 1.2.18
  • 2.0.82
  • 2.2.18

Dataproc Serverless for Spark: Upgraded Spark BigQuery connector to version 0.36.4 in the latest 1.2 and 2.2 Dataproc Serverless for Spark runtime versions.

July 26, 2024

New Dataproc Serverless for Spark runtime versions:

  • 1.1.73
  • 1.2.17
  • 2.0.81
  • 2.2.17

July 25, 2024

New Dataproc on Compute Engine subminor image versions:

  • 2.0.113-debian10, 2.0.113-rocky8, 2.0.113-ubuntu18
  • 2.1.61-debian11, 2.1.61-rocky8, 2.1.61-ubuntu20, 2.1.61-ubuntu20-arm
  • 2.2.27-debian12, 2.2.27-rocky9, 2.2.27-ubuntu22

Enabled user sync by default for clusters using Ranger.

Replaced Spark external packages with connector folder on Dataproc 2.2 clusters.

Fixed a bug that caused intermittent delays and failures in clusters with 3 HDFS.

July 22, 2024

Hyperdisks for Dataproc clusters are now created with default throughput and IOPS. When this behavior becomes configurable, it will be announced in a future release note.

Added support for N4 and C4 machine types for Dataproc image versions 2.1 and above. The following default configurations are now applied to clusters created with N4 or C4 machine types:

  • bootdisktype = "hyperdisk-balanced"
  • nictype = "gvnic"

When a Cluster, Job, AutoscalingPolicy, or WorkflowTemplate API resource does not exist and the requestor does not have access to the project, a 403 error code is now issued instead of a 404 error code.
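
Client code that treated a 404 as the only "resource is missing" signal may need adjusting, because a 403 no longer implies that the resource exists. A minimal sketch of one way to classify lookup results follows; the function name and labels are illustrative and are not part of any Dataproc client library.

```python
def interpret_lookup_status(status_code: int) -> str:
    """Classify the HTTP status of a resource lookup."""
    if status_code == 200:
        return "found"
    if status_code == 404:
        # The caller can access the project and the resource does not exist.
        return "missing"
    if status_code == 403:
        # The caller lacks access; the resource may or may not exist.
        return "missing-or-forbidden"
    return "error"


for code in (200, 403, 404, 500):
    print(code, interpret_lookup_status(code))
```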

July 19, 2024

New Dataproc Serverless for Spark runtime versions:

  • 1.1.72
  • 1.2.16
  • 2.0.80
  • 2.2.16

Note: Dataproc Serverless for Spark runtime versions 1.1.71, 1.2.15, 2.0.79, and 2.2.15 were not released.

July 18, 2024

New Dataproc on Compute Engine subminor image versions:

  • 2.0.112-debian10, 2.0.112-rocky8, 2.0.112-ubuntu18
  • 2.1.60-debian11, 2.1.60-rocky8, 2.1.60-ubuntu20, 2.1.60-ubuntu20-arm
  • 2.2.26-debian12, 2.2.26-rocky9, 2.2.26-ubuntu22

July 17, 2024

New Dataproc Serverless for Spark runtime versions:

  • 1.1.70
  • 1.2.14
  • 2.0.78
  • 2.2.14

July 12, 2024

New Dataproc on Compute Engine subminor image versions:

  • 2.0.111-debian10, 2.0.112-rocky8, 2.0.112-ubuntu18
  • 2.1.59-debian11, 2.1.60-rocky8, 2.1.60-ubuntu20, 2.1.60-ubuntu20-arm
  • 2.2.25-debian12, 2.2.26-rocky9, 2.2.26-ubuntu22

July 11, 2024

New Dataproc Serverless for Spark runtime versions:

  • 1.1.69
  • 1.2.13
  • 2.0.77
  • 2.2.13

July 08, 2024

New Dataproc on Compute Engine subminor image versions:

  • 2.0.110-debian10, 2.0.110-rocky8, 2.0.110-ubuntu18
  • 2.1.58-debian11, 2.1.58-rocky8, 2.1.58-ubuntu20, 2.1.58-ubuntu20-arm
  • 2.2.24-debian12, 2.2.24-rocky9, 2.2.24-ubuntu22

July 05, 2024

New Dataproc Serverless for Spark runtime versions:

  • 1.1.68
  • 1.2.12
  • 2.0.76
  • 2.2.12

July 03, 2024

Added Cloud Profiler support in Dataproc Serverless for Spark. Enable profiling via the dataproc.profiling.enabled=true property and configure it via dataproc.profiling.name=<PROFILE_NAME>.
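
The two profiling properties can be assembled into a single `--properties` flag value for a batch submission. This is a sketch under the assumption that the properties are passed in the usual comma-separated key=value form; the profile name is a hypothetical example.

```python
def profiling_properties(profile_name: str) -> dict:
    """Return the two profiling properties named in the note above."""
    if not profile_name:
        raise ValueError("profile_name must not be empty")
    return {
        "dataproc.profiling.enabled": "true",
        "dataproc.profiling.name": profile_name,
    }


def properties_flag(properties: dict) -> str:
    """Join properties into a comma-separated --properties flag."""
    pairs = ",".join(f"{key}={value}" for key, value in properties.items())
    return f"--properties={pairs}"


print(properties_flag(profiling_properties("example-profile")))
```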

New Dataproc on Compute Engine subminor image versions:

  • 2.0.109-debian10, 2.0.109-rocky8, 2.0.109-ubuntu18
  • 2.1.57-debian11, 2.1.57-rocky8, 2.1.57-ubuntu20, 2.1.57-ubuntu20-arm
  • 2.2.23-debian12, 2.2.23-rocky9, 2.2.23-ubuntu22

Dataproc on Compute Engine: Apache Hadoop upgraded to version 3.2.4 in image version 2.0 starting with image version 2.0.109.

June 28, 2024

New Dataproc on Compute Engine subminor image versions:

  • 2.0.108-debian10, 2.0.108-rocky8, 2.0.108-ubuntu18
  • 2.1.56-debian11, 2.1.56-rocky8, 2.1.56-ubuntu20, 2.1.56-ubuntu20-arm
  • 2.2.22-debian12, 2.2.22-rocky9, 2.2.22-ubuntu22

Backported fixes for HIVE-25958 and HIVE-20220 (new configuration hive.groupby.enable.deterministic.distribution=false/true).

June 26, 2024

New Dataproc Serverless for Spark runtime versions:

  • 1.1.67
  • 1.2.11
  • 2.0.75
  • 2.2.11

Dataproc Serverless for Spark: To fix compatibility with open table formats (Apache Iceberg, Apache Hudi, and Delta Lake), the ANTLR version was downgraded from 4.13.1 to 4.9.3 in Dataproc Serverless for Spark runtime versions 1.2 and 2.2.

June 25, 2024

The Dataproc Component Gateway is now activated by default when you create a Dataproc on Compute Engine cluster using the Google Cloud console.

June 24, 2024

New Dataproc on Compute Engine subminor image versions:

  • 2.0.107-debian10, 2.0.107-rocky8, 2.0.107-ubuntu18
  • 2.1.55-debian11, 2.1.55-rocky8, 2.1.55-ubuntu20, 2.1.55-ubuntu20-arm
  • 2.2.21-debian12, 2.2.21-rocky9, 2.2.21-ubuntu22

June 21, 2024

Dataproc Serverless for Spark: To fix compatibility with open table formats (Apache Iceberg, Apache Hudi and Delta Lake), the ANTLR version will be downgraded from 4.13.1 to 4.9.3 in Dataproc Serverless for Spark runtime versions 1.2 and 2.2 on June 26, 2024.

June 20, 2024

Dataproc Serverless for Spark: Spark runtime version 2.2 will become the default Dataproc Serverless for Spark runtime version on September 6, 2024.

New Dataproc Serverless for Spark runtime versions:

  • 1.1.66
  • 1.2.10
  • 2.0.74
  • 2.2.10

June 13, 2024

New Dataproc on Compute Engine subminor image versions:

  • 2.0.106-debian10, 2.0.106-rocky8, 2.0.106-ubuntu18
  • 2.1.54-debian11, 2.1.54-rocky8, 2.1.54-ubuntu20, 2.1.54-ubuntu20-arm
  • 2.2.20-debian12, 2.2.20-rocky9, 2.2.20-ubuntu22

New Dataproc Serverless for Spark runtime versions:

  • 1.1.65
  • 1.2.9
  • 2.0.73
  • 2.2.9

Dataproc Serverless for Spark: Upgraded Spark BigQuery connector to version 0.36.3 in the latest 1.2 and 2.2 Dataproc Serverless for Spark runtime versions.

Added a configuration option to prevent expensive HiveMetaStore metrics database queries. To prevent expensive queries during HiveMetaStore startup, set the Hive property metastore.initial.metadata.count.enabled to false.

June 11, 2024

The Apache Spark in BigQuery feature is available in Private Preview. This feature lets you create a Spark session in a BigQuery notebook that you can use to develop and submit PySpark code from BigQuery. To access this feature, fill in and submit the Dataproc Preview access request form.

June 06, 2024

New Dataproc on Compute Engine subminor image versions:

  • 2.0.105-debian10, 2.0.105-rocky8, 2.0.105-ubuntu18
  • 2.1.53-debian11, 2.1.53-rocky8, 2.1.53-ubuntu20, 2.1.53-ubuntu20-arm
  • 2.2.19-debian12, 2.2.19-rocky9, 2.2.19-ubuntu22

Dataproc on Compute Engine: When creating a cluster with the latest Dataproc on Compute Engine image versions, the secondary worker boot disk type now defaults to the primary worker boot disk type, which is pd-standard if the primary worker boot disk type is not specified.

June 05, 2024

New Dataproc Serverless for Spark runtime versions:

  • 1.1.64
  • 1.2.8
  • 2.0.72
  • 2.2.8

June 03, 2024

Dataproc on Compute Engine: Updated restartable job error messages to include job IDs.

Dataproc Serverless for Spark: The goog-dataproc-session-id, goog-dataproc-session-uuid, and goog-dataproc-location labels are now automatically applied to session resources.

May 30, 2024

New Dataproc on Compute Engine subminor image versions:

  • 2.0.104-debian10, 2.0.104-rocky8, 2.0.104-ubuntu18
  • 2.1.52-debian11, 2.1.52-rocky8, 2.1.52-ubuntu20, 2.1.52-ubuntu20-arm
  • 2.2.18-debian12, 2.2.18-rocky9, 2.2.18-ubuntu22

New Dataproc Serverless for Spark runtime versions:

  • 1.1.63
  • 1.2.7
  • 2.0.71
  • 2.1.50
  • 2.2.7

Dataproc Serverless for Spark: Subminor version 2.1.50 is the last release of runtime version 2.1, which will no longer be supported and will not receive new releases.

Dataproc Serverless for Spark: Removed Spark data lineage support for runtime version 1.2.

Dataproc Serverless for Spark: Enabled Spark checkpoint (spark.checkpoint.compress) and RDD (spark.rdd.compress) compression in the latest 1.2 and 2.2 runtime versions.

May 23, 2024

Blocklisted the following Dataproc on Compute Engine subminor image versions:

  • 2.0.103-debian10, 2.0.103-rocky8, 2.0.103-ubuntu18
  • 2.1.51-debian11, 2.1.51-rocky8, 2.1.51-ubuntu20, 2.1.51-ubuntu20-arm
  • 2.2.17-debian12, 2.2.17-rocky9, 2.2.17-ubuntu22

May 22, 2024

New Dataproc Serverless for Spark runtime versions:

  • 1.1.62
  • 1.2.6
  • 2.0.70
  • 2.1.49
  • 2.2.6

Upgraded Spark BigQuery connector to version 0.36.2 in the latest 1.2 and 2.2 Dataproc Serverless for Spark runtime versions.

May 16, 2024

New Dataproc on Compute Engine subminor image versions:

  • 2.0.102-debian10, 2.0.102-rocky8, 2.0.102-ubuntu18
  • 2.1.50-debian11, 2.1.50-rocky8, 2.1.50-ubuntu20, 2.1.50-ubuntu20-arm
  • 2.2.16-debian12, 2.2.16-rocky9, 2.2.16-ubuntu22

Anaconda's default channel is disabled for package installations on Dataproc on Compute Engine.

May 09, 2024

New Dataproc on Compute Engine subminor image versions:

  • 2.0.101-debian10, 2.0.101-rocky8, 2.0.101-ubuntu18
  • 2.1.49-debian11, 2.1.49-rocky8, 2.1.49-ubuntu20, 2.1.49-ubuntu20-arm
  • 2.2.15-debian12, 2.2.15-rocky9, 2.2.15-ubuntu22

May 08, 2024

New Dataproc Serverless for Spark runtime versions:

  • 1.1.61
  • 1.2.5
  • 2.0.69
  • 2.1.48
  • 2.2.5

May 06, 2024

New Dataproc on Compute Engine subminor image versions:

  • 2.0.100-debian10, 2.0.100-rocky8, 2.0.100-ubuntu18
  • 2.1.48-debian11, 2.1.48-rocky8, 2.1.48-ubuntu20, 2.1.48-ubuntu20-arm
  • 2.2.14-debian12, 2.2.14-rocky9, 2.2.14-ubuntu22

Dataproc on Compute Engine:

May 01, 2024

New Dataproc Serverless for Spark runtime versions:

  • 1.1.60
  • 1.2.4
  • 2.0.68
  • 2.1.47
  • 2.2.4

Dataproc Serverless for Spark:

  • Upgraded Spark RAPIDS to version 24.04.0 in 1.2 and 2.2 Dataproc Serverless for Spark runtimes.

When you submit a Dataproc Serverless Batch with a CMEK key:

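  • In addition to encrypting disk and Cloud Storage data, Dataproc Serverless will use your CMEK to also encrypt batch job arguments. As announced in the January 17, 2024 release note, this change requires that you assign the Cloud KMS CryptoKey Encrypter/Decrypter and the Service Usage Consumer role to the Dataproc Service Agent service account.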
  • batches.list will return an unreachable field that lists any batches with job arguments that couldn't be decrypted. You can issue a batches.get request to obtain more information on an unreachable batch.
  • Multi-regional and cross-regional CMEKs will no longer be permitted. The key (CMEK) must be located in the same location as the encrypted resource. For example, the CMEK used to encrypt a batch that runs in the us-central1 region must also be located in the us-central1 region.
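
The same-location requirement can be checked before submitting a batch. The sketch below assumes the key is given as a full Cloud KMS resource name of the form projects/…/locations/…/keyRings/…/cryptoKeys/…; the function names are hypothetical, and the check is a local string comparison, not a call to any Google Cloud API.

```python
import re

# Assumed resource-name layout for a Cloud KMS key.
_KEY_PATTERN = re.compile(
    r"^projects/(?P<project>[^/]+)/locations/(?P<location>[^/]+)"
    r"/keyRings/(?P<ring>[^/]+)/cryptoKeys/(?P<key>[^/]+)$"
)


def kms_key_location(key_name: str) -> str:
    """Extract the location segment from a KMS key resource name."""
    match = _KEY_PATTERN.match(key_name)
    if match is None:
        raise ValueError(f"not a KMS key resource name: {key_name!r}")
    return match.group("location")


def key_matches_batch_region(key_name: str, batch_region: str) -> bool:
    """Return True only when the key lives in exactly the batch's region."""
    return kms_key_location(key_name) == batch_region


key = "projects/example-project/locations/us-central1/keyRings/ring/cryptoKeys/key"
print(key_matches_batch_region(key, "us-central1"))
```

Under this check, a key in a multi-regional location such as us, or in a different region, would be rejected for a us-central1 batch.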

April 29, 2024

New Dataproc on Compute Engine subminor image versions:

  • 2.0.99-debian10, 2.0.99-rocky8, 2.0.99-ubuntu18
  • 2.1.47-debian11, 2.1.47-rocky8, 2.1.47-ubuntu20, 2.1.47-ubuntu20-arm
  • 2.2.13-debian12, 2.2.13-rocky9, 2.2.13-ubuntu22

April 26, 2024

New Dataproc Serverless for Spark runtime versions:

  • 1.1.59
  • 1.2.3
  • 2.0.67
  • 2.1.46
  • 2.2.3

April 21, 2024

New Dataproc on Compute Engine subminor image versions:

  • 2.0.98-debian10, 2.0.98-rocky8, 2.0.98-ubuntu18
  • 2.1.46-debian11, 2.1.46-rocky8, 2.1.46-ubuntu20, 2.1.46-ubuntu20-arm
  • 2.2.12-debian12, 2.2.12-rocky9, 2.2.12-ubuntu22

April 20, 2024

Dataproc Workflow Templates now support the CMEK organization policy.

April 18, 2024

New Dataproc Serverless for Spark runtime versions:

  • 1.1.58
  • 1.2.2
  • 2.0.66
  • 2.1.45
  • 2.2.2

Set the soft delete policy of newly created Dataproc staging and temp Cloud Storage buckets to 0 days.

Updated the default autoscaling V2 cool-down time from 2m to 1m to reduce scaling latency.

Fixed a bug where Dataproc Serverless sessions that live longer than 48 hours are underbilled.

April 09, 2024

Dataproc Serverless for Spark: The preview release of Advanced troubleshooting, including Gemini-assisted troubleshooting, is now available for Spark workloads submitted with the following or later-released runtime versions:

  • 1.1.55
  • 1.2.0-RC1
  • 2.0.63
  • 2.1.42
  • 2.2.0-RC15
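
Whether a given runtime version is "the following or later-released" can be sketched as a per-line comparison. The code below assumes that, within a runtime line, a higher subminor number is later, that an RC build precedes the final release with the same subminor number, and that only the five listed runtime lines are known. The names `MINIMUM_VERSIONS` and `supports_advanced_troubleshooting` are hypothetical.

```python
import re
from typing import Tuple

# Minimum versions copied from the list above, keyed by runtime line.
MINIMUM_VERSIONS = {
    "1.1": "1.1.55",
    "1.2": "1.2.0-RC1",
    "2.0": "2.0.63",
    "2.1": "2.1.42",
    "2.2": "2.2.0-RC15",
}

_VERSION = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:-RC(\d+))?$")


def version_key(version: str) -> Tuple[str, Tuple[int, float]]:
    """Return (runtime line, sortable key); a final release sorts after any RC."""
    match = _VERSION.match(version)
    if match is None:
        raise ValueError(f"unrecognized runtime version: {version!r}")
    major, minor, subminor, rc = match.groups()
    rc_rank = float("inf") if rc is None else int(rc)
    return f"{major}.{minor}", (int(subminor), rc_rank)


def supports_advanced_troubleshooting(version: str) -> bool:
    line, key = version_key(version)
    minimum = MINIMUM_VERSIONS.get(line)
    if minimum is None:
        return False  # runtime line not covered by this sketch
    return key >= version_key(minimum)[1]


print(supports_advanced_troubleshooting("2.2.0-RC14"))
print(supports_advanced_troubleshooting("2.2.0"))
```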

Dataproc Serverless for Spark: Announcing the preview release of Autotuning Spark workloads.

April 04, 2024

New Dataproc Serverless for Spark runtime versions:

  • 1.1.57
  • 1.2.1
  • 2.0.65
  • 2.1.44
  • 2.2.1

Added the bigframes Python package by default in Dataproc Serverless for Spark runtime versions 1.2 and 2.2.

April 02, 2024

The following previously released sub-minor versions of Dataproc on Compute Engine images have been rolled back and can only be used when updating existing clusters that already use them:

  • 2.0.97-debian10, 2.0.97-rocky8, 2.0.97-ubuntu18
  • 2.1.45-debian11, 2.1.45-rocky8, 2.1.45-ubuntu20, 2.1.45-ubuntu20-arm
  • 2.2.11-debian12, 2.2.11-rocky9, 2.2.11-ubuntu22
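
The rollback rule above (usable only when updating clusters that already run the version) can be expressed as a small guard. This is a sketch under stated assumptions: the `ROLLED_BACK` set contains only the versions from this note, and `can_use_image` is a hypothetical helper that treats every other version as usable.

```python
# Subminor versions rolled back in this note (OS suffixes omitted).
ROLLED_BACK = {"2.0.97", "2.1.45", "2.2.11"}


def subminor_of(image_version: str) -> str:
    """Strip the OS suffix: '2.1.45-ubuntu20-arm' -> '2.1.45'."""
    return image_version.split("-", 1)[0]


def can_use_image(image_version: str, cluster_already_uses_it: bool) -> bool:
    """Allow a rolled-back image only for clusters that already run it."""
    if subminor_of(image_version) not in ROLLED_BACK:
        return True
    return cluster_already_uses_it


print(can_use_image("2.1.45-ubuntu20-arm", cluster_already_uses_it=False))
```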

March 29, 2024

Dataproc Serverless for Spark: Runtime version 2.2 will become the default Dataproc Serverless for Spark runtime version on May 3, 2024.

Note: This announcement was updated in the April 19, 2024 release note.

March 28, 2024

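Note: The Dataproc on Compute Engine subminor image versions released on this date (2.0.97, 2.1.45, and 2.2.11, as listed in the April 2, 2024 release note) were rolled back on April 2, 2024.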

Dataproc on Compute Engine: New Hadoop Google Secret Manager Credential Provider feature introduced in latest Dataproc on Compute Engine 2.0 image versions.

March 27, 2024

New Dataproc Serverless for Spark runtime versions:

  • 1.1.56
  • 1.2.0
  • 2.0.64
  • 2.1.43
  • 2.2.0

Announcing the General Availability (GA) release of Dataproc Serverless for Spark runtime versions 1.2 and 2.2, which include the following components:

  • Spark 3.5.1
  • BigQuery Spark Connector 0.36.1
  • Cloud Storage Connector 3.0.0
  • Conda 24.1
  • Java 17
  • Python 3.12
  • R 4.3
  • Scala 2.12 (1.2 runtime) and Scala 2.13 (2.2 runtime)

Dataproc Serverless for Spark:

  • Upgraded Spark to version 3.5.1 in the latest 1.2 and 2.2 runtimes.
  • Upgraded Conda to version 24.1 in the latest 1.2 and 2.2 runtimes.
  • Upgraded Spark BigQuery connector to version 0.36.1 in the latest 1.2 and 2.2 runtimes.

March 21, 2024

New Dataproc on Compute Engine subminor image versions:

  • 2.0.96-debian10, 2.0.96-rocky8, 2.0.96-ubuntu18
  • 2.1.44-debian11, 2.1.44-rocky8, 2.1.44-ubuntu20, 2.1.44-ubuntu20-arm
  • 2.2.10-debian12, 2.2.10-rocky9, 2.2.10-ubuntu22

March 20, 2024

Announcing the Preview release of Dataproc Serverless for Spark 1.2 runtime:

  • Spark 3.5.0
  • BigQuery Spark Connector 0.35.1
  • Cloud Storage Connector 3.0.0
  • Conda 23.11
  • Java 17
  • Python 3.12
  • R 4.3
  • Scala 2.12

New Dataproc Serverless for Spark runtime versions:

  • 1.1.55
  • 1.2.0-RC1
  • 2.0.63
  • 2.1.42
  • 2.2.0-RC15

Dataproc Serverless for Spark:

  • Upgraded Spark RAPIDS plugin to version 24.2.0 in the latest runtimes.
  • Upgraded Spark to version 3.3.4 in the latest 1.1 and 2.0 runtimes.
  • Backported SPARK-44198 in the latest 1.2 and 2.2 runtimes.

March 14, 2024

New Dataproc on Compute Engine subminor image versions:

  • 2.0.95-debian10, 2.0.95-rocky8, 2.0.95-ubuntu18
  • 2.1.43-debian11, 2.1.43-rocky8, 2.1.43-ubuntu20, 2.1.43-ubuntu20-arm
  • 2.2.9-debian12, 2.2.9-rocky9, 2.2.9-ubuntu22

New Dataproc Serverless for Spark runtime versions:

  • 1.1.54
  • 2.0.62
  • 2.1.41
  • 2.2.0-RC14

Added the bigframes (BigQuery DataFrames) Python package in the Dataproc Serverless for Spark 2.1 runtime.

March 07, 2024

New Dataproc Serverless for Spark runtime versions:

  • 1.1.53
  • 2.0.61
  • 2.1.40
  • 2.2.0-RC13

Dataproc Serverless for Spark: Upgraded Cloud Storage connector to version 2.2.20 in the latest 1.1, 2.0, and 2.1 runtimes.

March 06, 2024

New Dataproc on Compute Engine subminor image versions:

  • 2.0.94-debian10, 2.0.94-rocky8, 2.0.94-ubuntu18
  • 2.1.42-debian11, 2.1.42-rocky8, 2.1.42-ubuntu20, 2.1.42-ubuntu20-arm
  • 2.2.8-debian12, 2.2.8-rocky9, 2.2.8-ubuntu22

Dataproc on Compute Engine: Upgraded Cloud Storage connector version to 2.2.20 for 2.0 and 2.1 images.

Dataproc on Compute Engine: Mounted Java cacerts into containers by default when the Docker-on-YARN feature is enabled.

March 04, 2024

Dataproc Serverless for Spark: Extended Spark metrics collected for a batch now include executor:resultSize, executor:shuffleBytesWritten, and executor:shuffleTotalBytesRead.

February 29, 2024

New Dataproc Serverless for Spark runtime versions:

  • 1.1.52
  • 2.0.60
  • 2.1.39
  • 2.2.0-RC12

February 28, 2024

New Dataproc on Compute Engine subminor image versions:

  • 2.0.93-debian10, 2.0.93-rocky8, 2.0.93-ubuntu18
  • 2.1.41-debian11, 2.1.41-rocky8, 2.1.41-ubuntu20, 2.1.41-ubuntu20-arm
  • 2.2.7-debian12, 2.2.7-rocky9, 2.2.7-ubuntu22

Dataproc on Compute Engine: The new Secret Manager credential provider feature is available in the latest 2.1 image versions.

Dataproc on Compute Engine:

  • Upgraded Zookeeper to 3.8.3 for Dataproc 2.2.
  • Upgraded ORC for Hive to 1.5.13 for Dataproc 2.1.
  • Upgraded ORC for Spark to 1.7.10 for Dataproc 2.1.
  • Extended expiry for the internal Knox Gateway certificate from one year to five years from cluster creation for Dataproc images 2.0, 2.1, and 2.2.

Dataproc on Compute Engine: Fixed ZooKeeper startup failures in image 2.2 HA (High Availability) clusters that use fully qualified hostnames.

February 22, 2024

New Dataproc Serverless for Spark runtime versions:

  • 1.1.51
  • 2.0.59
  • 2.1.38
  • 2.2.0-RC11

February 16, 2024

Dataproc on Compute Engine: The internalIpOnly cluster configuration setting now defaults to true for clusters created with 2.2 image versions. Also see Create a Dataproc cluster with internal IP addresses only.

February 15, 2024

New Dataproc Serverless for Spark runtime versions:

  • 1.1.50
  • 2.0.58
  • 2.1.37
  • 2.2.0-RC10

Dataproc Serverless for Spark: Spark Lineage is available for Dataproc Serverless for Spark 1.1 runtime.

February 08, 2024

New Dataproc on Compute Engine subminor image versions:

  • 2.0.92-debian10, 2.0.92-rocky8, 2.0.92-ubuntu18
  • 2.1.40-debian11, 2.1.40-rocky8, 2.1.40-ubuntu20, 2.1.40-ubuntu20-arm
  • 2.2.6-debian12, 2.2.6-rocky9, 2.2.6-ubuntu22

Dataproc on Compute Engine Ranger Cloud Storage enhancement:

  • Enabled downscoping
  • Added caching of tokens in local cache

Both settings are configurable and can be enabled by customers. See Use Ranger with caching and downscoping.

Dataproc on Compute Engine: The new Secret Manager credential provider feature is available in the latest 2.2 image versions.

Dataproc on Compute Engine: Backported patch for HADOOP-18652.

New Dataproc Serverless for Spark runtime versions:

  • 1.1.49
  • 2.0.57
  • 2.1.36
  • 2.2.0-RC9

Dataproc Serverless for Spark: Backported patch for HADOOP-18652.

February 02, 2024

Dataproc on Compute Engine: Bucket TTL validation now also runs for buckets created by Dataproc.

Dataproc on Compute Engine: Added a warning during cluster creation if the cluster Cloud Storage staging bucket is using the legacy fine-grained/ACL IAM configuration instead of the recommended Uniform bucket-level access controls.

Dataproc Serverless for Spark: When dynamic allocation is enabled, the initial number of executors is the maximum of spark.dynamicAllocation.initialExecutors and spark.executor.instances.
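
The rule can be sketched as follows. The function name is hypothetical, and the sketch deliberately does not assume default values for either property: it considers only the values that are explicitly set.

```python
from typing import Mapping, Optional

_KEYS = (
    "spark.dynamicAllocation.initialExecutors",
    "spark.executor.instances",
)


def initial_executor_count(conf: Mapping[str, str]) -> Optional[int]:
    """Return the larger of the two settings, or None if neither is set."""
    values = [int(conf[key]) for key in _KEYS if key in conf]
    return max(values) if values else None


conf = {
    "spark.dynamicAllocation.initialExecutors": "2",
    "spark.executor.instances": "5",
}
print(initial_executor_count(conf))
```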

February 01, 2024

New Dataproc on Compute Engine subminor image versions:

  • 2.0.91-debian10, 2.0.91-rocky8, 2.0.91-ubuntu18
  • 2.1.39-debian11, 2.1.39-rocky8, 2.1.39-ubuntu20, 2.1.39-ubuntu20-arm
  • 2.2.5-debian12, 2.2.5-rocky9, 2.2.5-ubuntu22

New Dataproc Serverless for Spark runtime versions:

  • 1.1.48
  • 2.0.56
  • 2.1.35
  • 2.2.0-RC8

Dataproc on Compute Engine: Backported patches for HIVE-21214, HIVE-23154, HIVE-23354 and HIVE-23614.

January 31, 2024

Dataproc is now available in the africa-south1 region (Johannesburg, South Africa).

The GitHub Ops Agent initialization action installs the Ops Agent on a Dataproc cluster, and provides metrics similar to the metrics that were enabled with the --metric-sources=monitoring-agent-defaults setting available for use with Dataproc image versions prior to version 2.2.

January 25, 2024

New Dataproc Serverless for Spark runtime versions:

  • 1.1.47
  • 2.0.55
  • 2.1.34
  • 2.2.0-RC7

January 24, 2024

New Dataproc on Compute Engine subminor image versions:

  • 2.0.90-debian10, 2.0.90-rocky8, 2.0.90-ubuntu18
  • 2.1.38-debian11, 2.1.38-rocky8, 2.1.38-ubuntu20, 2.1.38-ubuntu20-arm
  • 2.2.4-debian12, 2.2.4-rocky9, 2.2.4-ubuntu22

Backported HIVE-19568: Active/Passive HiveServer2 HA: Disallow direct connection to passive instance.

Backported HIVE-27715: Remove ThreadPoolExecutorWithOomHook.

January 19, 2024

New Dataproc on Compute Engine subminor image versions:

  • 2.0.89-debian10, 2.0.89-rocky8, 2.0.89-ubuntu18
  • 2.1.37-debian11, 2.1.37-rocky8, 2.1.37-ubuntu20, 2.1.37-ubuntu20-arm
  • 2.2.3-debian12, 2.2.3-rocky9, 2.2.3-ubuntu22

Dataproc on Compute Engine: The default yarn.nm.liveness-monitor.expiry-interval-ms Hadoop YARN setting has been changed in the latest image versions from 15000 (15 seconds) to 120000 (2 minutes).

Dataproc on Compute Engine: Upgraded Cloud Storage connector version to 2.2.19 in the latest 2.0 and 2.1 images.

Dataproc on Compute Engine: Upgraded Miniconda to 23.11, Python to 3.11, and curl to 8.5 to fix CVE-2023-38545 in the latest 2.2 images.

Dataproc on Compute Engine: Fixed the gsutil: command not found error in the latest Ubuntu images.

Dataproc on Compute Engine: Fixed Trino startup issue in the latest 2.2 images.

New Dataproc Serverless for Spark runtime versions:

  • 1.1.46
  • 2.0.54
  • 2.1.33
  • 2.2.0-RC6

Dataproc Serverless for Spark: Upgraded Cloud Storage connector to version 2.2.19 in the latest 1.1, 2.0, and 2.1 runtimes.

January 17, 2024

Beginning March 31, 2024, when you submit a Dataproc Serverless Batch with a CMEK key:

  • In addition to encrypting disk and Cloud Storage data, Dataproc Serverless will use your CMEK to also encrypt batch job arguments. This change will require that you assign the Cloud KMS CryptoKey Encrypter/Decrypter and the Service Usage Consumer role to the Dataproc Service Agent service account.
  • batches.list will return an unreachable field that lists any batches with job arguments that couldn't be decrypted. You can issue a batches.get request to obtain more information on an unreachable batch.
  • Multi-regional and cross-regional CMEKs will no longer be permitted. The key (CMEK) must be located in the same location as the encrypted resource. For example, the CMEK used to encrypt a batch that runs in the us-central1 region must also be located in the us-central1 region.

January 15, 2024

New Dataproc Serverless for Spark runtime versions:

  • 1.1.45
  • 2.0.53
  • 2.1.32
  • 2.2.0-RC5

Dataproc Serverless for Spark:

  • Upgraded Spark RAPIDS to version 23.12.1
  • Upgraded the following components to the following versions in the latest 2.2 runtime:

    • Spark BigQuery connector version 0.35.0
    • Cloud Storage connector version 3.0.0
    • Conda version 23.11

January 05, 2024

New Dataproc Serverless for Spark runtime versions:

  • 1.1.44
  • 2.0.52
  • 2.1.31
  • 2.2.0-RC4

January 04, 2024

The following previously released sub-minor versions of Dataproc images have been rolled back and can only be used when updating existing clusters that already use them:

  • 2.0.88-debian10, 2.0.88-rocky8, 2.0.88-ubuntu18
  • 2.1.36-debian11, 2.1.36-rocky8, 2.1.36-ubuntu20, 2.1.36-ubuntu20-arm
  • 2.2.2-debian12, 2.2.2-rocky9, 2.2.2-ubuntu22

January 02, 2024

New Dataproc on Compute Engine subminor image versions:

  • 2.0.88-debian10, 2.0.88-rocky8, 2.0.88-ubuntu18
  • 2.1.36-debian11, 2.1.36-rocky8, 2.1.36-ubuntu20, 2.1.36-ubuntu20-arm
  • 2.2.2-debian12, 2.2.2-rocky9, 2.2.2-ubuntu22

  • Rollback Notice: See the January 4, 2024 release note rollback notice.

Dataproc on Compute Engine: Changed the Hive Server2 and MetaStore maximum default JVM heap size to 32GiB. Previously, the limit was set to 1/4 of total node memory, which could be too large on large-memory machines.
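
To see why the previous limit could be too large, the arithmetic can be worked through for a few node sizes. This sketch only computes the old one-quarter limit and compares it with the new 32 GiB cap; the node sizes are illustrative, and the helper names are hypothetical.

```python
NEW_MAX_HEAP_GIB = 32  # new default cap stated in the note above


def previous_heap_limit_gib(node_memory_gib: float) -> float:
    """Old default limit: one quarter of total node memory."""
    return node_memory_gib / 4


def previous_limit_exceeds_new_cap(node_memory_gib: float) -> bool:
    """True when the old limit would have been above the new 32 GiB cap."""
    return previous_heap_limit_gib(node_memory_gib) > NEW_MAX_HEAP_GIB


for memory_gib in (64, 128, 512):  # illustrative node memory sizes
    print(
        memory_gib,
        previous_heap_limit_gib(memory_gib),
        previous_limit_exceeds_new_cap(memory_gib),
    )
```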

Dataproc on Compute Engine: Backported the patch for YARN-10975 in the latest 2.0 images.

December 21, 2023

New Dataproc Serverless for Spark runtime versions:

  • 1.1.43
  • 2.0.51
  • 2.1.30
  • 2.2.0-RC3

December 18, 2023

New Dataproc on Compute Engine subminor image versions:

  • 2.0.87-debian10, 2.0.87-rocky8, 2.0.87-ubuntu18
  • 2.1.35-debian11, 2.1.35-rocky8, 2.1.35-ubuntu20, 2.1.35-ubuntu20-arm
  • 2.2.1-debian12, 2.2.1-rocky9, 2.2.1-ubuntu22

December 14, 2023

New Dataproc Serverless for Spark runtime versions:

  • 1.1.42
  • 2.0.50
  • 2.1.29
  • 2.2.0-RC2

Added the google-cloud-secret-manager Python package in the latest Dataproc Serverless for Spark runtimes.

December 11, 2023

Announcing the GA release of Dataproc on Compute Engine image version 2.2:

  • 2.2.0-debian12, 2.2.0-rocky9, 2.2.0-ubuntu22

The 2.2.0 release includes the following components:

  • Debian-12 / Ubuntu-2204 / RockyLinux 9
  • Apache Hadoop 3.3.6
  • Apache Spark 3.5.0
  • Spark-BigQuery Connector 0.34.0
  • Cloud Storage Connector 3.0.0
  • Trino 432
  • Apache Flink 1.17.0
  • Apache Ranger 2.4.0
  • Apache Solr 9.2.1
  • R 4.2
  • Hue 4.11.0
  • JupyterLab Notebook 3.6

Monitoring-agent-defaults metrics are not available in Dataproc on Compute Engine image version 2.2 clusters unless the Ops Agent is installed. Other metrics for Dataproc provided components will continue to work.

Blocklisted the following Dataproc on Compute Engine images due to an issue that increased startup time:

  • 2.0.86-debian10, 2.0.86-rocky8, 2.0.86-ubuntu18
  • 2.1.34-debian11, 2.1.34-rocky8, 2.1.34-ubuntu20, 2.1.34-ubuntu20-arm

December 06, 2023

Announcing the Preview release of Dataproc Serverless for Spark 2.2 runtime:

  • Spark 3.5.0
  • BigQuery Spark Connector 0.34.0
  • Cloud Storage Connector 3.0.0-RC1
  • Conda 23.10
  • Java 17
  • Python 3.12
  • R 4.3
  • Scala 2.13

New Dataproc Serverless for Spark runtime versions:

  • 1.1.41
  • 2.0.49
  • 2.1.28
  • 2.2.0-RC1

December 04, 2023

Added the Confidential Computing option on the "Manage Security" panel on the "Create a Dataproc cluster on Compute Engine" page in the Google Cloud console.

New Dataproc on Compute Engine subminor image versions:

  • 2.0.85-debian10, 2.0.85-rocky8, 2.0.85-ubuntu18
  • 2.1.33-debian11, 2.1.33-rocky8, 2.1.33-ubuntu20, 2.1.33-ubuntu20-arm

Updated the Zookeeper component version from 3.8.0 to 3.8.3 in the latest Dataproc on Compute Engine 2.1 image version.

Fixed a Dataproc Hub issue in the latest Dataproc on Compute Engine 2.1 image.

Backported HIVE-21698 in the Hive 3.1.3 component in the latest Dataproc on Compute Engine image versions.

December 01, 2023

The Cloud Storage connector has been upgraded to version 2.2.18 in all Dataproc Serverless for Spark runtimes.

November 17, 2023

New Dataproc on Compute Engine subminor image versions:

  • 2.0.84-debian10, 2.0.84-rocky8, 2.0.84-ubuntu18
  • 2.1.32-debian11, 2.1.32-rocky8, 2.1.32-ubuntu20, 2.1.32-ubuntu20-arm
  • 2.2.0-RC3-debian11, 2.2.0-RC3-ubuntu22, 2.2.0-RC3-rocky9

Upgraded the Cloud Storage connector version to 2.2.18 in the latest 2.0 and 2.1 Dataproc on Compute Engine image versions.

In the Flink component in the latest Dataproc on Compute Engine 2.1 image version, added the following java-storage client properties:

Fixed a regression in the Zeppelin websocket rules that caused a websocket error in Zeppelin notebooks.

The Python kernel does not work in Zeppelin on the Dataproc on Compute Engine 2.1 image version. Other kernels are not impacted.

The Zeppelin REST API does not work (drops query parameters) on Dataproc on Compute Engine 2.0 and 2.1 image versions via the Component Gateway. Other Zeppelin interactions can also break as a result of dropped query parameters.

November 15, 2023

You can use CMEK (Customer Managed Encryption Keys) with encrypted Dataproc cluster data, including persistent disk data, job arguments and queries submitted with Dataproc jobs, and cluster data saved in the cluster Dataproc staging bucket. See Use CMEK with cluster data for more information.

November 10, 2023

Announcing the General Availability (GA) release of Dataproc Jupyter Plugin and its availability in Vertex AI Workbench instance notebooks.

New Dataproc on Compute Engine subminor image versions:

  • 2.0.83-debian10, 2.0.83-rocky8, 2.0.83-ubuntu18
  • 2.1.31-debian11, 2.1.31-rocky8, 2.1.31-ubuntu20, 2.1.31-ubuntu20-arm

November 08, 2023

Announcing the release of Workflow Template CMEK (Customer Managed Encryption Key) encryption. Use this feature to apply CMEK encryption to workflow template job arguments. For example, when this feature is enabled, the query string of a workflow template SparkSQL job is encrypted using CMEK.

You can now use Dataproc Serverless autoscaling V2 to help you manage Dataproc Serverless workloads, improve workload performance, and save costs.

November 07, 2023

Set spark.shuffle.mapOutput.minSizeForBroadcast=128m to fix SPARK-38101 when Dataproc Serverless Spark dynamic allocation is enabled.

November 01, 2023

Announcing the Preview release of Dataproc Flexible VMs. This feature lets you specify prioritized lists of secondary worker VM types that Dataproc will select from when creating your cluster. Dataproc will select the VM type with sufficient available capacity while taking quotas and reservations into account.

October 30, 2023

New Dataproc on Compute Engine subminor image versions:

  • 2.0.82-debian10, 2.0.82-rocky8, 2.0.82-ubuntu18
  • 2.1.30-debian11, 2.1.30-rocky8, 2.1.30-ubuntu20, 2.1.30-ubuntu20-arm

Added spark.dataproc.scaling.version=2 config to let customers control the Dataproc Serverless for Spark autoscaling version.

Increased the TTL for Dataproc on Compute Engine custom images from 60 days to 365 days.

Fixed Knox rewrite rules for Zeppelin URLs in some cases in the latest 2.0 and 2.1 Dataproc on Compute Engine image versions.

October 23, 2023

Dataproc on Compute Engine: Dataproc now collects the dataproc.googleapis.com/job/yarn/vcore_seconds and dataproc.googleapis.com/job/yarn/memory_seconds job-level resource attribution metrics to track YARN application vcore and memory usage during job execution. These metrics are collected by default and are not chargeable to customers.

Dataproc on Compute Engine: Dataproc now collects a dataproc.googleapis.com/node/yarn/nodemanager/health health metric to track the health of individual YARN node managers running on VMs. This metric is written against the gce_instance monitored resource to help you find suspect nodes. It is collected by default and is not chargeable to customers.

Dataproc on Compute Engine: Properties dataproc:agent.ha.enabled and dataproc:componentgateway.ha.enabled now default to true to provide high availability for the Dataproc Agent and Component Gateway.

October 12, 2023

New Dataproc on Compute Engine subminor image versions:

  • 2.0.80-debian10, 2.0.80-rocky8, 2.0.80-ubuntu18
  • 2.1.28-debian11, 2.1.28-rocky8, 2.1.28-ubuntu20, 2.1.28-ubuntu20-arm

October 06, 2023

New Dataproc on Compute Engine image version 2.2 is available for preview with upgraded components.

New Dataproc on Compute Engine subminor image versions:

  • 2.0.79-debian10, 2.0.79-rocky8, 2.0.79-ubuntu18
  • 2.1.27-debian11, 2.1.27-rocky8, 2.1.27-ubuntu20, 2.1.27-ubuntu20-arm
  • 2.2.0-RC2-debian11, 2.2.0-RC2-rocky9, 2.2.0-RC2-ubuntu22

Upgraded Hadoop version from 3.3.3 to 3.3.6 in the latest Dataproc on Compute Engine 2.1 image version.

Upgraded the Cloud Storage connector version to 2.2.17 in the latest Dataproc Serverless for Spark runtimes.

Added the gs.http.connect-timeout and gs.http.read-timeout properties in Flink to set the connection timeout and read timeout for java-storage client in the latest Dataproc on Compute Engine 2.1 image version.

Added the gs.filesink.entropy.enabled property in Flink to enable entropy injection in filesink Cloud Storage path in the latest Dataproc on Compute Engine 2.1 image version.

September 28, 2023

New Dataproc on Compute Engine subminor image versions:

  • 2.0.78-debian10, 2.0.78-rocky8, 2.0.78-ubuntu18
  • 2.1.26-debian11, 2.1.26-rocky8, 2.1.26-ubuntu20, 2.1.26-ubuntu20-arm

Upgraded the Cloud Storage connector version to 2.2.17 in the latest 2.0 and 2.1 Dataproc on Compute Engine image versions.

Upgraded Hive version from 3.1.2 to 3.1.3 in the latest Dataproc on Compute Engine 2.0 image version.

September 22, 2023

New Dataproc on Compute Engine subminor image versions:

  • 2.0.77-debian10, 2.0.77-rocky8, 2.0.77-ubuntu18
  • 2.1.25-debian11, 2.1.25-rocky8, 2.1.25-ubuntu20, 2.1.25-ubuntu20-arm

In the latest Dataproc on Compute Engine 2.0 and 2.1 image versions, unset the CLOUDSDK_PYTHON variable to allow the gcloud command-line tool to use its bundled Python interpreter.

Fixed a Jupyter notebook bug that made Scala compilation errors invisible with the Toree kernel in Dataproc on Compute Engine 2.1 images.

September 19, 2023

Dataproc is now available in the me-central2 region (Dammam, Saudi Arabia).

September 15, 2023

New Dataproc on Compute Engine subminor image versions:

  • 2.0.76-debian10, 2.0.76-rocky8, 2.0.76-ubuntu18
  • 2.1.24-debian11, 2.1.24-rocky8, 2.1.24-ubuntu20, 2.1.24-ubuntu20-arm

Scala has been upgraded to version 2.12.18 and Apache Tez has been upgraded to version 0.10.2 in Dataproc on Compute Engine 2.1 images.

September 13, 2023

Announcing the Private Preview release of the Dataproc on Compute Engine Flink Jobs resource. During Private Preview, you can contact your Google Cloud Sales representative to have your project(s) added to an allowlist to allow you to submit Flink jobs to the Dataproc on Compute Engine service.

September 12, 2023

The dataproc.diagnostics.enabled property is now available to enable running diagnostics on Dataproc Serverless for Spark. The existing spark.dataproc.diagnostics.enabled property will be deprecated for use with newer runtimes.

September 08, 2023

Dataproc Auto zone placement for clusters is now available in the Google Cloud console by selecting the "Any" option for the cluster zone.

New Dataproc on Compute Engine subminor image versions:

  • 2.0.75-debian10, 2.0.75-rocky8, 2.0.75-ubuntu18
  • 2.1.23-debian11, 2.1.23-rocky8, 2.1.23-ubuntu20, 2.1.23-ubuntu20-arm

The Apache Spark version has been upgraded from 3.3.0 to 3.3.2 in Dataproc on Compute Engine 2.1 images.

September 04, 2023

Announcing the General Availability (GA) release of Data Lineage for Dataproc, which captures data transformations (lineage events) in Dataproc Spark jobs, and publishes them to Dataplex Lineage.

Dataproc Serverless Interactive sessions detail and list pages are now available in the Google Cloud console.

August 29, 2023

Announcing the Preview release of Dataproc Serverless for Spark Interactive sessions and the Dataproc Jupyter Plugin.

August 23, 2023

Fixed a Dataproc Serverless issue where Spark batches failed with unhelpful error messages.

August 22, 2023

Dataproc is now available in the europe-west10 region (Berlin).

August 17, 2023

New Dataproc on Compute Engine subminor image versions:

  • 2.0.74-debian10, 2.0.74-rocky8, 2.0.74-ubuntu18
  • 2.1.22-debian11, 2.1.22-rocky8, 2.1.22-ubuntu20, 2.1.22-ubuntu20-arm

Backported the patches for HIVE-20618 in the new Dataproc on Compute Engine 2.0 and 2.1 images.

August 11, 2023

New Dataproc on Compute Engine subminor image versions:

  • 2.0.73-debian10, 2.0.73-rocky8, 2.0.73-ubuntu18
  • 2.1.21-debian11, 2.1.21-rocky8, 2.1.21-ubuntu20, 2.1.21-ubuntu20-arm

Added new Dataproc Serverless Templates for batch workload creation:

  • Cloud Spanner to Cloud Storage
  • Cloud Storage to JDBC
  • Cloud Storage to Cloud Storage
  • Hive to BigQuery
  • JDBC to Cloud Spanner
  • JDBC to JDBC
  • Pub/Sub to Cloud Storage

Improved the reliability of Dataproc Serverless compute node initialization with a Premium disk tier option.

August 07, 2023

Added a dataproc:dataproc.cluster.caching.enabled flag to enable and disable Dataproc on Compute Engine cluster caching. The flag is false by default. Use this feature with the latest Dataproc on Compute Engine images.
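
Cluster properties such as the one above follow a prefix:property=value shape, where the prefix names the configuration target. The following Python sketch splits such a string into its parts; parse_cluster_property is a hypothetical helper written for illustration, not part of any Dataproc client library.

```python
def parse_cluster_property(spec: str) -> tuple[str, str, str]:
    """Split 'prefix:key=value' into (prefix, key, value).

    Hypothetical helper: it only illustrates the string layout and does
    not validate that the property exists.
    """
    name, _, value = spec.partition("=")   # text before/after the first '='
    prefix, _, key = name.partition(":")   # text before/after the first ':'
    return prefix, key, value


prefix, key, value = parse_cluster_property(
    "dataproc:dataproc.cluster.caching.enabled=true"
)
print(prefix, key, value)
```

Splitting on the first "=" and the first ":" keeps any later colons or equals signs inside the key or value intact.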

August 06, 2023

The following previously released sub-minor versions of Dataproc on Compute Engine images unintentionally reverted several dependency library versions. This caused a risk of backward-incompatibility for some workloads.

These sub-minor versions have been rolled back, and can only be used when updating existing clusters that already use them:

  • 2.0.71-debian10, 2.0.71-rocky8, 2.0.71-ubuntu18
  • 2.1.19-debian11, 2.1.19-rocky8, 2.1.19-ubuntu20, 2.1.19-ubuntu20-arm

August 05, 2023

New Dataproc on Compute Engine image versions:

  • 2.0.72-debian10, 2.0.72-rocky8, 2.0.72-ubuntu18
  • 2.1.20-debian11, 2.1.20-rocky8, 2.1.20-ubuntu20, 2.1.20-ubuntu20-arm

Upgraded Hudi to 0.12.3 and added the BigQuery Sync tool as part of the Hudi optional component.

Downgraded the Cloud Storage connector to version 2.2.15 in all Dataproc on Compute Engine image versions to prevent a potential performance regression.

Backported ZEPPELIN-5434 to image 2.1 to fix CVE-2022-2048.

Backported the patches for HIVE-22170 and HIVE-22331.

August 03, 2023

Downgraded the Cloud Storage connector to version 2.2.15 in all Dataproc Serverless for Spark runtimes to prevent a potential performance regression.

July 30, 2023

New Dataproc on Compute Engine image versions:

  • 2.0.71-debian10, 2.0.71-rocky8, 2.0.71-ubuntu18
  • 2.1.19-debian11, 2.1.19-rocky8, 2.1.19-ubuntu20, 2.1.19-ubuntu20-arm

Note: The above image versions were rolled back. See the August 6, 2023 release note.

The maximum total memory per core for Dataproc Serverless Premium compute tiers has increased to 24576m. The 7424m maximum for Standard compute tiers is unchanged. See Dataproc Serverless Resource allocation properties.
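
As an illustration of how these per-core ceilings might be checked before submitting a workload, here is a minimal Python sketch. It assumes that "total memory" means memory plus memory overhead and that sizes use an m or g suffix; the function names and the sample sizes are hypothetical.

```python
# Per-core ceilings in MiB, taken from the release note above.
MAX_MB_PER_CORE = {"standard": 7424, "premium": 24576}


def parse_mb(size: str) -> int:
    """Convert a size such as '7424m' or '8g' to MiB (assumed suffixes)."""
    units = {"m": 1, "g": 1024}
    return int(size[:-1]) * units[size[-1].lower()]


def within_per_core_limit(memory: str, overhead: str, cores: int, tier: str) -> bool:
    """Check (memory + overhead) / cores against the tier's ceiling."""
    total_mb = parse_mb(memory) + parse_mb(overhead)
    return total_mb / cores <= MAX_MB_PER_CORE[tier]


# 48 GiB over 4 cores is 12288 MiB per core: too much for Standard, fine for Premium.
print(within_per_core_limit("40g", "8g", 4, "standard"))
print(within_per_core_limit("40g", "8g", 4, "premium"))
```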

July 28, 2023

July 26, 2023

Clusters cannot be created with a driver node group if the cluster image version is older than 2.0.57 or 2.1.5, or if the permissions for the staging bucket are missing.

Added recommendation details in Autoscaler Stackdriver logs for the CANCEL and DO_NOT_CANCEL recommendations.

July 21, 2023

New Dataproc on Compute Engine image versions, which include a 2.1.18-ubuntu20-arm image that supports ARM machine types:

  • 2.0.70-debian10, 2.0.70-rocky8, 2.0.70-ubuntu18
  • 2.1.18-debian11, 2.1.18-rocky8, 2.1.18-ubuntu20, 2.1.18-ubuntu20-arm

Fixed a race condition in Spark startup that could lead to nodes failing to initialize when using premium disk tier.

July 14, 2023

Clusters that use a driver node group now configure YARN queues with user-limit-factor set to 2, allowing for a single user to burst to 2x utilization of capacity, which is set to 50. This achieves better resource utilization for workloads submitted by a single user.
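
The arithmetic behind this setting can be sketched in Python. The sketch assumes the usual YARN Capacity Scheduler interpretation, in which queue capacity is a percentage and user-limit-factor multiplies it; the function name is hypothetical, and other queue limits (such as a maximum capacity) are ignored.

```python
def max_single_user_share(queue_capacity_pct: float, user_limit_factor: float) -> float:
    """Largest percentage of cluster resources one user could reach.

    Assumes a user can take capacity * user-limit-factor, capped at the
    whole cluster (100 percent).
    """
    return min(queue_capacity_pct * user_limit_factor, 100.0)


# Values from the release note: capacity 50, user-limit-factor 2.
print(max_single_user_share(50, 2))
# With a factor of 1, the same user would be held to the queue's 50 percent.
print(max_single_user_share(50, 1))
```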

Upgraded the Cloud Storage connector version to 2.2.16 in Dataproc Serverless for Spark runtimes.

July 10, 2023

New Dataproc on Compute Engine image versions:

  • 2.0.69-debian10, 2.0.69-rocky8, 2.0.69-ubuntu18
  • 2.1.17-debian11, 2.1.17-rocky8, 2.1.17-ubuntu20

Upgraded the Cloud Storage connector version to 2.2.16 for Dataproc on Compute Engine 2.0 and 2.1 images.

July 07, 2023

Dataproc Serverless Spark 1.1 and 2.0 runtime subminor versions can now be used 365 days after their release (instead of 90 days).
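
To show what the longer window means in practice, this Python sketch computes the last usable day for a subminor version under both windows. The release date used here is a made-up example, and last_usable_day is a hypothetical helper.

```python
from datetime import date, timedelta


def last_usable_day(release_date: date, window_days: int = 365) -> date:
    """Return the date window_days after release_date."""
    return release_date + timedelta(days=window_days)


example_release = date(2023, 7, 7)  # hypothetical release date
print(last_usable_day(example_release))       # 365-day window
print(last_usable_day(example_release, 90))   # previous 90-day window
```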

The goog-dataproc-batch-id, goog-dataproc-batch-uuid and goog-dataproc-location labels are now automatically applied to Dataproc Serverless batch resources.

Dataproc Serverless for Spark now supports updating the BigQuery connector using the dataproc.sparkBqConnector.version and dataproc.sparkBqConnector.uri properties. See Use the BigQuery connector with Dataproc Serverless for Spark.

July 06, 2023

June 29, 2023

Added support for Premium compute and storage pricing tiers for Dataproc Serverless Spark workloads. Premium compute offers higher performance per core, and Premium storage offers higher throughput and IOPs. To use Premium compute and storage, set the following Spark runtime environment properties:

  • spark.dataproc.(driver|executor).compute.tier=premium
  • spark.dataproc.(driver|executor).storage.tier=premium
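
The (driver|executor) shorthand above stands for two separate property names per tier. A small Python sketch that expands the shorthand into concrete key=value pairs follows; the helper name is hypothetical, and the comma-joined output assumes the common key=value,key=value layout used when passing several properties at once.

```python
from itertools import product


def premium_tier_properties(roles=("driver", "executor"), kinds=("compute", "storage")):
    """Expand spark.dataproc.(driver|executor).<kind>.tier into concrete keys."""
    return {
        f"spark.dataproc.{role}.{kind}.tier": "premium"
        for role, kind in product(roles, kinds)
    }


props = premium_tier_properties()
# Join as comma-separated key=value pairs (assumed layout).
print(",".join(f"{key}={value}" for key, value in props.items()))
```

Passing a narrower roles or kinds tuple yields only the subset you want, for example Premium storage on executors only.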

June 28, 2023

New Dataproc on Compute Engine subminor image versions:

  • 2.0.68-debian10, 2.0.68-rocky8, 2.0.68-ubuntu18
  • 2.1.16-debian11, 2.1.16-rocky8, 2.1.16-ubuntu20

Backported ZEPPELIN-5755 to Zeppelin 0.10 in 2.1 images for Spark 3.3 support.

June 26, 2023

Added Dataproc Serverless Templates for batch creation:

  • Cloud Storage to BigQuery
  • Cloud Storage to Cloud Spanner
  • Hive to Cloud Storage
  • JDBC to BigQuery
  • JDBC to Cloud Storage

June 22, 2023

June 16, 2023

New Dataproc on Compute Engine subminor image versions:

  • 2.0.67-debian10, 2.0.67-rocky8, 2.0.67-ubuntu18
  • 2.1.15-debian11, 2.1.15-rocky8, 2.1.15-ubuntu20

Fixed a bug that caused cluster creation to fail when ATSv2 is enabled for tables that have a garbage collection policy other than maxversions.

June 14, 2023

June 08, 2023

June 02, 2023

Upgraded the Cloud Storage connector to version 2.2.14 in Dataproc Serverless for Spark runtimes.

June 01, 2023

New sub-minor versions of Dataproc images:

  • 2.0.66-debian10, 2.0.66-rocky8, 2.0.66-ubuntu18
  • 2.1.14-debian11, 2.1.14-rocky8, 2.1.14-ubuntu20

Upgraded the Cloud Storage connector to version 2.2.14 for 2.0 and 2.1 images.

Backport HIVE-22891, HIVE-21660, HIVE-21915 to 2.0 images.

Backport HIVE-22891, HIVE-21660, HIVE-25520, HIVE-25521 to 2.1 images.

May 26, 2023

New sub-minor versions of Dataproc images:

  • 2.0.65-debian10, 2.0.65-rocky8, 2.0.65-ubuntu18
  • 2.1.13-debian11, 2.1.13-rocky8, 2.1.13-ubuntu20

May 24, 2023

Upgraded the Cloud Storage connector to version 2.2.13 in Dataproc on Compute Engine 2.0 and 2.1 image versions.

Unauthorized callers attempting to get, delete, or terminate non-existent Sessions will now receive a 403 response code instead of a 404 response code. This does not impact authorized callers.

Fixed the Serverless history server endpoint URL when Persistent History Server (PHS) was set up without using a wildcard.

May 19, 2023

Upgraded the Cloud Storage connector to version 2.2.13 in Dataproc Serverless for Spark runtimes.

Fixed the NoClassDefFoundError for log4j class in Zeppelin BigQuery interpreter in 2.0 images.

Backported HIVE-22891 to 2.0 images.

May 18, 2023

New sub-minor versions of Dataproc images:

  • 2.0.64-debian10, 2.0.64-rocky8, 2.0.64-ubuntu18
  • 2.1.12-debian11, 2.1.12-rocky8, 2.1.12-ubuntu20

You can now use --properties=dataproc:componentgateway.ha.enabled=true to enable the Dataproc Component Gateway and Knox along with the Spark History Server (SHS) UI in HA mode.

May 11, 2023

May 05, 2023

Announcing the General Availability (GA) release of Dataproc Serverless for Spark runtime version 2.1, which includes the following components:

  • Spark 3.4.0
  • BigQuery Spark Connector 0.28.1
  • Cloud Storage Connector 2.2.11
  • Conda 23.3
  • Java 17
  • Python 3.11
  • R 4.2
  • Scala 2.13

April 28, 2023

New Dataproc Serverless for Spark runtime versions:

  • 1.1.12
  • 2.0.20
  • 2.1.0-RC8

Upgrade Spark to 3.4.0 and its dependencies in Dataproc Serverless for Spark 2.1 runtime:

  • Jetty to 9.4.51.v20230217
  • ORC to 1.8.3
  • Parquet to 1.13.0
  • Protobuf to 3.22.3

New sub-minor versions of Dataproc images:

  • 1.5.89-debian10, 1.5.89-rocky8, 1.5.89-ubuntu18
  • 2.0.63-debian10, 2.0.63-rocky8, 2.0.63-ubuntu18
  • 2.1.11-debian11, 2.1.11-rocky8, 2.1.11-ubuntu20

The hive principal is now used for Hive catalog queries made through Presto in Kerberos clusters.

April 24, 2023

Dataproc now supports the use of cross-project service accounts.

Autoscaler recommendation reasoning details are available now in Cloud Logging logs.

Default batch TTL is set to 4 hours for Dataproc Serverless for Spark runtime version 2.1.

April 20, 2023

New sub-minor versions of Dataproc images:

  • 1.5.88-debian10, 1.5.88-rocky8, 1.5.88-ubuntu18
  • 2.0.62-debian10, 2.0.62-rocky8, 2.0.62-ubuntu18
  • 2.1.10-debian11, 2.1.10-rocky8, 2.1.10-ubuntu20

Running Spark jobs with the DataprocFileOutputCommitter is now supported. Enable the committer for Spark applications that write to a Cloud Storage destination concurrently.

April 18, 2023

Add Autoscaler recommendation reasoning details in Cloud Logging.

The Dataproc on GKE SLM force delete timeout exception is now converted to a DataprocIoException.

April 17, 2023

Announcing Dataproc General Availability (GA) support for CMEK organization policy.

April 14, 2023

New Dataproc Serverless for Spark runtime versions:

  • 1.1.11
  • 2.0.19
  • 2.1.0-RC7

Make spark user an owner for all items in the driver working directory for Dataproc Serverless for Spark workloads to fix permissions issues after Hadoop upgrade to 3.3.5.

April 06, 2023

New Dataproc Serverless for Spark runtime versions:

  • 1.1.10
  • 2.0.18
  • 2.1.0-RC6

April 04, 2023

March 30, 2023

Dataproc is now available in the me-central1 region (Doha).

March 28, 2023

New sub-minor versions of Dataproc images:

  • 1.5.87-debian10, 1.5.87-rocky8, 1.5.87-ubuntu18
  • 2.0.61-debian10, 2.0.61-rocky8, 2.0.61-ubuntu18
  • 2.1.9-debian11, 2.1.9-rocky8, 2.1.9-ubuntu20

Dataproc cluster creation now supports the pd-extreme disk type.

Dataproc on GKE now disallows update operations.

Dataproc on GKE diagnose operation now verifies that the master agent is running.

March 27, 2023

New sub-minor versions of Dataproc images:

  • 1.5.86-debian10, 1.5.86-rocky8, 1.5.86-ubuntu18
  • 2.0.60-debian10, 2.0.60-rocky8, 2.0.60-ubuntu18
  • 2.1.8-debian11, 2.1.8-rocky8, 2.1.8-ubuntu20

March 24, 2023

Upgraded Python to 3.11 and Conda to 23.1 in Dataproc Serverless for Spark runtime 2.1.

March 23, 2023

Dataproc is now available in the europe-west12 region (Turin).

March 17, 2023

March 16, 2023

New sub-minor versions of Dataproc images:

  • 1.5.85-debian10, 1.5.85-rocky8, 1.5.85-ubuntu18
  • 2.0.59-debian10, 2.0.59-rocky8, 2.0.59-ubuntu18
  • 2.1.7-debian11, 2.1.7-rocky8, 2.1.7-ubuntu20

Upgraded Flink from 1.15.0 to 1.15.3 in 2.1 images.

March 10, 2023

Upgraded Spark BigQuery connector version to 0.28.1 in 1.1 and 2.1 Dataproc Serverless for Spark runtimes.

March 06, 2023

Added stronger validations to disallow upper-case characters in template IDs per Resource Names guidance, which allows Workflow template creation to fail fast instead of failing at workflow template instantiation.
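
The fail-fast idea can be illustrated with a short Python sketch that rejects an ID at creation time rather than later. It checks only for upper-case characters, which is the one rule this note mentions; the full Resource Names rules are not reproduced, and validate_template_id is a hypothetical client-side helper.

```python
def validate_template_id(template_id: str) -> str:
    """Raise ValueError if the ID contains any upper-case character.

    Hypothetical check: it covers only the upper-case rule described in
    the release note, not the complete resource naming rules.
    """
    bad = [ch for ch in template_id if ch.isupper()]
    if bad:
        raise ValueError(
            f"template ID {template_id!r} contains upper-case characters: {bad}"
        )
    return template_id


print(validate_template_id("nightly-etl-v2"))
```

Raising at creation time surfaces the problem before any workflow is instantiated from the template.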

Added decision metric field in Stackdriver autoscaler logs.

March 02, 2023

Release Dataproc Serverless for Spark runtime 2.1 preview:

  • Spark 3.4.0-rc1
  • BigQuery Spark Connector 0.28.0
  • Cloud Storage Connector 2.2.11
  • Conda 22.11
  • Java 17
  • Python 3.10
  • R 4.2
  • Scala 2.13

February 28, 2023

--properties=dataproc:agent.ha.enabled=true can now be used to enable the Dataproc Agent in high availability mode. This property is supported by Dataproc Image versions 2.0 and above.

February 23, 2023

Upgrade Spark to 3.3.2 and its dependencies in 1.1 and 2.0 Dataproc Serverless for Spark runtimes:

  • Jackson to 2.13.5
  • Jetty to 9.4.50.v20221201
  • ORC to 1.8.2
  • Protobuf to 3.21.12
  • RoaringBitmap to 0.9.39

February 17, 2023

New sub-minor versions of Dataproc images:

  • 1.5.82-debian10, 1.5.82-rocky8, 1.5.82-ubuntu18
  • 2.0.56-debian10, 2.0.56-rocky8, 2.0.56-ubuntu18
  • 2.1.4-debian11, 2.1.4-rocky8, 2.1.4-ubuntu20

February 10, 2023

Dataproc Serverless for Spark now supports an unconditional TTL for batches. The workload is terminated when the TTL expires, without waiting for work to complete.

Dataproc Serverless for Spark now supports statically sized batch workloads with more than 500 executors.

Added support for filters when listing batches. Batches can be filtered on one or more of batch_id, batch_uuid, state, or create_time (for example, state = RUNNING AND create_time < "2023-01-01T00:00:00Z"). See Filter expressions for more information.
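
A filter string of this shape can be assembled programmatically. The Python sketch below builds one from an optional state and an optional creation-time bound, formatting the timestamp in UTC with a trailing Z. The helper name batch_filter is hypothetical, and only the two fields used in the example above are handled.

```python
from datetime import datetime, timezone


def batch_filter(state=None, created_before=None):
    """Build a filter such as: state = RUNNING AND create_time < "..."."""
    clauses = []
    if state:
        clauses.append(f"state = {state}")
    if created_before:
        # Normalize to UTC and format as an RFC 3339 timestamp.
        stamp = created_before.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        clauses.append(f'create_time < "{stamp}"')
    return " AND ".join(clauses)


print(batch_filter("RUNNING", datetime(2023, 1, 1, tzinfo=timezone.utc)))
```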

Generate a warning when initialization actions are used in a cluster created with a driver node group.

The default Dataproc Serverless for Spark runtime version has changed to 2.0.

February 03, 2023

1.0.29 is the last release of the Dataproc Serverless for Spark runtime version 1.0. This runtime version will no longer be supported and will not receive new releases.

Upgraded the Cloud Storage connector to version 2.2.11 in Dataproc Serverless for Spark runtimes.

January 27, 2023

Announcing the General Availability (GA) release of Dataproc Serverless for Spark runtime version 1.1, which includes the following components:

  • Spark 3.3.1
  • BigQuery Spark Connector 0.28.0
  • Cloud Storage Connector 2.2.9
  • Conda 22.11
  • Java 11
  • Python 3.10
  • R 4.2
  • Scala 2.12

Dataproc Serverless for Spark runtime version 1.0 changed to non-LTS because of the release of backward-compatible Dataproc Serverless for Spark runtime version 1.1 LTS.

January 24, 2023

January 23, 2023

New sub-minor versions of Dataproc images:

  • 1.5.80-debian10, 1.5.80-rocky8, 1.5.80-ubuntu18
  • 2.0.54-debian10, 2.0.54-rocky8, 2.0.54-ubuntu18
  • 2.1.2-debian11, 2.1.2-rocky8, 2.1.2-ubuntu20

Added support for enabling Hive Metastore OSS metrics by passing hivemetastore to --metric-sources property during cluster creation.

Added support for Dataproc Metastore integration with Trino.

Upgraded Parquet to 1.12.2 for 2.1 images.

The value of hive.server2.builtin.udf.blacklist is now set by default to reflect,reflect2 in hive-site.xml to prevent arbitrary code execution.
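
For readers who want to see what such an entry looks like, the Python sketch below renders the setting in the standard Hadoop-style property layout (a property element with name and value children). It is illustrative only: it does not read or modify a real hive-site.xml, and hive_site_property is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def hive_site_property(name: str, value: str) -> str:
    """Render one <property> element in the Hadoop configuration layout."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")


snippet = hive_site_property("hive.server2.builtin.udf.blacklist", "reflect,reflect2")
print(snippet)
```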

January 13, 2023

December 19, 2022

New sub-minor versions of Dataproc images:

  • 1.5.79-debian10, 1.5.79-rocky8, 1.5.79-ubuntu18
  • 2.0.53-debian10, 2.0.53-rocky8, 2.0.53-ubuntu18
  • 2.1.1-debian11, 2.1.1-rocky8, 2.1.1-ubuntu20

Backported Spark patch in Dataproc Serverless for Spark runtime 1.0 and 2.0:

  • SPARK-40481: Ignore stage fetch failure caused by decommissioned executor.

December 12, 2022

General Availability (GA) release of Dataproc 2.1 images.

New sub-minor versions of Dataproc images:

  • 1.5.78-debian10, 1.5.78-rocky8, 1.5.78-ubuntu18
  • 2.0.52-debian10, 2.0.52-rocky8, 2.0.52-ubuntu18
  • 2.1.0-debian11, 2.1.0-rocky8, 2.1.0-ubuntu20

Upgrade Cloud Storage connector version to 2.1.9 for 1.5 images.

Upgrade Cloud Storage connector version to 2.2.9 for 2.1 images.

Dataproc Serverless for Spark runtime 1.0:

  • Upgrade Spark to 3.2.3
  • Upgrade Cloud Storage connector to 2.2.9
  • Upgrade Spark dependencies:
    • Jetty to 9.4.49.v20220914
    • ORC to 1.7.7
    • Protobuf to 3.19.6
    • RoaringBitmap to 0.9.35
    • Scala to 2.12.17

Dataproc Serverless for Spark runtime 2.0:

  • Upgrade Cloud Storage connector to 2.2.9
  • Upgrade Spark dependencies:
    • Protobuf to 3.21.9
    • RoaringBitmap to 0.9.35

Use jemalloc as a default OS memory allocator in Dataproc Serverless for Spark runtime.

Backport Spark patches in Dataproc Serverless for Spark runtime 1.0 and 2.0:

  • SPARK-39324: Log ExecutorDecommission as INFO level in TaskSchedulerImpl
  • SPARK-40168: Handle SparkException during shuffle block migration
  • SPARK-40269: Randomize the orders of peer in BlockManagerDecommissioner
  • SPARK-40778: Make HeartbeatReceiver as an IsolatedRpcEndpoint

December 09, 2022

Added the dataproc.googleapis.com/job/state metric to track Dataproc job states (such as RUNNING or PENDING). This metric is collected by default and is not chargeable to customers.

Dataproc job IDs are now queryable and viewable from MQL (Monitoring Query Language), and the metric can be used for long-running job monitoring and alerting.

December 06, 2022

Dataproc Serverless for Spark runtime version 2.0 will become the default Dataproc Serverless for Spark runtime version on January 24, 2023 (instead of December 13, 2022, as previously announced).

November 17, 2022

Dataproc Serverless for Spark supports Spark and System metrics. These metrics are enabled by default. Spark driver and executor metrics can be customised using overrides.

Added support for Dataproc to attach to a gRPC Dataproc Metastore in any region.

Secure Boot, Virtual trusted platform module (vTPM), and Integrity monitoring Shielded VM features are enabled by default for Dataproc on Compute Engine clusters that use 2.1 preview images.

Nodemanagers in DECOMMISSIONING, NEW, and SHUTDOWN state are now included in the /cluster/yarn/nodemanagers metric.

Dataproc Serverless for Spark now shows the subminor runtime version used in the runtimeConfig.version field.

Fixed a bug that caused a Dataproc cluster with a Dataproc Metastore service to fail the creation process, if the cluster was in the same network but different subnetworks.

November 14, 2022

Dataproc Serverless for Spark now uses runtime versions 1.0.23 and 2.0.3.

New sub-minor versions of Dataproc images:

1.5.77-debian10, 1.5.77-rocky8, 1.5.77-ubuntu18

2.0.51-debian10, 2.0.51-rocky8, 2.0.51-ubuntu18

preview 2.1.0-RC4-debian11, preview 2.1.0-RC4-rocky8, preview 2.1.0-RC4-ubuntu20

Downgraded google-auth-oauthlib Python package to fix gcsfs Python package for 2.0 and 2.1 images.

Backported HIVE-17317 in the latest 2.0 and 2.1 images.

Dataproc Serverless for Spark runtime versions 1.0.23 and 2.0.3 downgrade the google-auth-oauthlib Python package to fix the gcsfs Python package.

Upgraded Apache Commons Text to 1.10.0 for Knox in 1.5 images, and for Spark, Pig, Knox in 2.0 images, addressing CVE-2022-42889.

Dataproc Serverless for Spark runtime versions 1.0.23 and 2.0.3 add the PyMongo Python library.

November 11, 2022

Dataproc Serverless for Spark runtime versions 1.0.22 and 2.0.2 will be deprecated on 11/11/2022. New batch submissions that use these runtime versions will fail starting 11/11/2022. This is due to an update to the Google auth library that breaks PySpark batch workloads that depend on gcsfs. Upcoming runtime versions will address this issue.

Dataproc images 2.0.50 and preview 2.1.0-RC3 are deprecated, and cluster creation based on these images will fail starting 11/11/2022. This is due to an update to the Google auth library that breaks PySpark batch workloads that depend on gcsfs. Upcoming image versions will include a fix for this issue.

November 07, 2022

New sub-minor versions of Dataproc images:

1.5.76-debian10, 1.5.76-rocky8, 1.5.76-ubuntu18

2.0.50-debian10, 2.0.50-rocky8, 2.0.50-ubuntu18

preview 2.1.0-RC3-debian11, preview 2.1.0-RC3-rocky8, preview 2.1.0-RC3-ubuntu20

Dataproc Serverless for Spark now uses runtime versions 1.0.22 and 2.0.2.

If a Dataproc Metastore service uses the gRPC endpoint protocol, a Dataproc or self-managed cluster located in any region can attach to the service.

October 31, 2022

Dataproc Serverless for Spark now allows the customization of driver and executor memory using the following properties:

  • spark.driver.memory
  • spark.driver.memoryOverhead
  • spark.executor.memory
  • spark.executor.memoryOverhead

Dataproc Serverless for Spark now outputs approximate_usage after a workload finishes that shows the approximate DCU and shuffle storage resource consumption by the workload.

Removed the Auto Zone placement check for supported machine types.

October 28, 2022

The following preview Dataproc image versions are available:

  • 2.1.0-RC2-debian11
  • 2.1.0-RC2-rocky8
  • 2.1.0-RC2-ubuntu20

The following component versions are available for use with the 2.1.0-RC2 images (the HBase and Druid components are not supported in 2.1 image versions):

  • Apache Atlas 2.2.0
  • Apache Flink 1.15.0
  • Apache Hadoop 3.3.3
  • Apache Hive 3.1.3
  • Apache Hive WebHCat 3.1.3
  • Apache Kafka 3.1.0
  • Apache Pig 0.18.0-SNAPSHOT
  • Apache Spark 3.3.0
  • Apache Sqoop v1 1.5.0-SNAPSHOT
  • Apache Sqoop v2 1.99.6
  • Apache Tez 0.10.1
  • Cloud Storage Connector hadoop3-2.2.8
  • Conscrypt 2.5.2
  • Docker 20.10
  • Hue 4.10.0
  • Java temurin-11-jdk
  • JupyterLab Notebook 3.4
  • Oozie 5.2.1
  • Presto 376
  • Python 3.10
  • R 4.1
  • Ranger 2.2.0
  • Scala 2.12.14
  • Solr 9.0.0
  • Zeppelin Notebook 0.10.1
  • Zookeeper 3.8.0

Dataproc Serverless for Spark now uses runtime versions 1.0.21 and 2.0.1.

Dataproc Serverless for Spark runtime version 2.0.1 upgrades Apache Commons Text to 1.10.0, addressing CVE-2022-42889.

Dataproc Serverless for Spark runtime version 2.0.1 upgrades the following components:

October 26, 2022

All Dataproc Serverless for Spark runtime versions prior to 1.0.21 and 2.0.1 will be deprecated on November 2, 2022.

October 25, 2022

Dataproc Serverless for Spark runtime version 2.0 will become the default Dataproc Serverless for Spark runtime version on December 13, 2022.

October 24, 2022

Dataproc Serverless for Spark now supports the spark.dataproc.diagnostics.enabled property, which enables auto diagnostics on batch failure. Note that enabling auto diagnostics holds compute and storage quota after the batch is complete, until diagnostics finish.

October 21, 2022

New sub-minor versions of Dataproc images:

1.5.75-debian10, 1.5.75-rocky8, 1.5.75-ubuntu18

2.0.49-debian10, 2.0.49-rocky8, 2.0.49-ubuntu18

Announcing the General Availability (GA) release of Dataproc Serverless for Spark runtime 2.0.

Dataproc Serverless for Spark now uses runtime version 1.0.20 and 2.0.0.

Upgraded Cloud Storage connector version to 2.2.8 in the latest 2.0 images.

Upgraded the Conscrypt library to 2.5.2 in the latest 1.5 and 2.0 images.

Dataproc Serverless for Spark runtime version 2.0.0 upgrades the following components:

  • Conda to 22.9
  • Jetty to 9.4.49.v20220914
  • ORC to 1.8.0
  • Protobuf to 3.21.7
  • RoaringBitmap to 0.9.32

Disabled auto deletion of files under /tmp in the latest Rocky images. Previous Rocky images have files in the /tmp folder deleted every 10 days due to default OS system setting in /usr/lib/tmpfiles.d/tmp.conf.

Changed Hive TokenStoreDelegationTokenSecretManager in the latest 1.5 and 2.0 images so that it updates the base class's current key ID after generating a new master key. This is important for users of DBTokenStore, which generates key IDs based on a monotonically increasing sequence from the database. Prior to this fix, there was a race condition during master key rollover that could cause it to attempt updating the prior master key using an incorrect ID value. This would fail and then quickly retry, sometimes multiple times, causing too many rows in the database.

Set yarn:spark.yarn.shuffle.stop_on_failure to true by default in the latest 1.5 and 2.0 images. This change causes YARN node manager startup to fail if the Spark external shuffle service startup fails. On VM boot, Dataproc will continuously restart the YARN node manager until it is able to start. This change reduces Spark executor errors, such as: org.apache.spark.SparkException: Unable to register with external shuffle server due to : Failed to connect to <worker host>:7337, particularly when starting a stopped cluster. See Spark external shuffle service documentation for details.

Backported the patch for HADOOP-18316 in the latest 2.0 images, addressing CVE-2022-25168.

Backported the patch for HIVE-25468 in the latest 1.5 and 2.0 images, addressing CVE-2021-34538.

Addressing CVE-2022-23305, CVE-2022-23302, CVE-2021-4104, CVE-2019-17571, migrated log4j 1.2 to reload4j for Hadoop, Spark in the latest 1.5 images and Hadoop, Spark, ZooKeeper, Oozie, Knox in the latest 2.0 images.

Enabled Spark authentication and encryption for Kerberos enabled clusters created with the latest 1.5 and 2.0 images.

Set HDFS /user/<name> directory permission with owner=<name> and mode=700 for Kerberos enabled clusters created with the latest 1.5 and 2.0 images.
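
As a quick reminder of what mode 700 grants, this Python sketch renders an octal mode as the familiar permission string using the standard library. It only formats the number; it does not touch HDFS, and describe_dir_mode is a hypothetical helper.

```python
import stat


def describe_dir_mode(mode: int) -> str:
    """Render an octal permission mode for a directory, for example 0o700."""
    return stat.filemode(stat.S_IFDIR | mode)


# Owner has read, write, and execute; group and others have no access.
print(describe_dir_mode(0o700))
```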

Backported the patch for SPARK-36383 in the latest 2.0 images.

Backported the patch for HIVE-19310 in the latest 1.5 images.

Backported the patch for HIVE-20004 in the latest 2.0 images.

Fixed an issue in which Presto queries might fail when submitted to HA clusters in the latest 1.5 and 2.0 images.

Fixed a bug where metrics created based on the yarn:yarn.resourcemanager.metrics.runtime.buckets property were not exported to Cloud Monitoring, even though listed in --metric-overrides during cluster creation.

Fixed a "gsutil not found" issue in the latest 1.5 and 2.0 Ubuntu images.

Backported the patch for HIVE-26447 in the latest 2.0 images.

Backported the patch for HIVE-20607 in the latest 2.0 images.

October 05, 2022

Dataproc is now available in the me-west1 region (Tel Aviv, Israel).

October 03, 2022

Preemptible Spot VMs can be used as secondary workers in a Dataproc cluster. Unlike legacy preemptible VMs, which have a 24-hour maximum lifetime, Spot VMs have no maximum lifetime.

October 01, 2022

Dataproc Serverless for Spark now supports Artifact Registry with image streaming.

Dataproc Metastore: Fixed an endpoint resolution issue that caused 500 type errors for valid setups. The service was overly aggressive in describing networks and subnetworks attached to the service via the NetworkConfig field.

September 27, 2022

Dataproc Auto Zone Placement now takes ANY reservation into account by default.

September 26, 2022

Dataproc Serverless for Spark now uses runtime versions 1.0.19 and 2.0.0-RC4, which upgrade the Cloud Storage connector to 2.2.8 in both runtimes.

September 20, 2022

Dataproc Serverless for Spark: You can now use the spark.dynamicAllocation.executorAllocationRatio property to configure how aggressively to scale up Serverless workloads. A value of 1.0 provides maximum scale up.
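
To give a feel for the effect of this ratio, the Python sketch below follows the behavior described in the open-source Spark documentation, where the target executor count scales with the number of pending tasks multiplied by the ratio. It is a simplified model under that assumption: the function and parameter names are hypothetical, and any service-side limits on executors are not modeled.

```python
import math


def target_executors(pending_tasks: int, tasks_per_executor: int,
                     allocation_ratio: float = 1.0) -> int:
    """Estimate the executors requested for a backlog of tasks.

    With allocation_ratio = 1.0 every pending task gets a slot (maximum
    scale up); smaller ratios request proportionally fewer executors.
    """
    return math.ceil(pending_tasks * allocation_ratio / tasks_per_executor)


# 1000 pending tasks, 4 task slots per executor.
print(target_executors(1000, 4, 1.0))
print(target_executors(1000, 4, 0.5))
```

A lower ratio trades some parallelism for fewer executors, which can help when tasks are short and extra executors would sit idle.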

Dataproc Serverless for Spark: Reduced the latency between batch workload completion and when a batch is marked SUCCEEDED.

Dataproc Serverless for Spark: Increased initial and maximum Spark executor limits to 500 and 2,000, respectively.

Dataproc Serverless for Spark: Sets a maximum limit of 500 workers per scale up or scale down operation.

Dataproc on Compute Engine: Stop all master and worker VMs when starting a cluster fails due to stockout or insufficient quota.

September 19, 2022

Dataproc Serverless for Spark now uses runtime version 1.0.18 and 2.0.0-RC3.

September 12, 2022

Dataproc Serverless for Spark now uses runtime version 1.0.17 and 2.0.0-RC2.

September 08, 2022

Avoid using the following image versions when creating new clusters:

  • 2.0.31-debian10, 2.0.31-ubuntu18, 2.0.31-rocky8
  • 2.0.32-debian10, 2.0.32-ubuntu18, 2.0.32-rocky8
  • 2.0.33-debian10, 2.0.33-ubuntu18, 2.0.33-rocky8
  • 1.5.57-debian10, 1.5.57-ubuntu18, 1.5.57-rocky8
  • 1.5.58-debian10, 1.5.58-ubuntu18, 1.5.58-rocky8
  • 1.5.59-debian10, 1.5.59-ubuntu18, 1.5.59-rocky8

If your cluster uses one of these image versions, there is a small chance that the cluster might enter an ERROR_DUE_TO_UPDATE state while being updated, either manually or as a result of autoscaling. If that happens, contact support. You can avoid future occurrences by creating new clusters with a newer image version.

September 01, 2022

Fixed an issue where gcloud dataproc batches list hangs when a large number of batches are present.

August 24, 2022

Announcing the Preview release of Dataproc custom constraints, which can be used to allow or deny specific operations on Dataproc clusters.

August 22, 2022

Announcing Dataproc Serverless for Spark preview runtime version 2.0.0-RC1, which includes the following components:

  • Spark 3.3.0
  • Cloud Storage Connector 2.2.7
  • Java 17
  • Conda 4.13
  • Python 3.10
  • R 4.1
  • Scala 2.13

Dataproc Serverless for Spark now uses runtime version 1.0.16, which upgrades the following components to the following versions:

  • Spark 3.2.2
  • Avro 1.11.1
  • Hadoop 3.3.4
  • Jetty 9.4.48.v20220622
  • ORC 1.7.5
  • RoaringBitmap 0.9.31
  • Scala 2.12.16

August 13, 2022

New sub-minor versions of Dataproc images:

1.5.73-debian10, 1.5.73-rocky8, 1.5.73-ubuntu18

2.0.47-debian10, 2.0.47-rocky8, 2.0.47-ubuntu18

Enabled Spark authentication and encryption for Kerberos clusters in 1.5 and 2.0 images.

Dataproc Serverless for Spark now uses runtime version 1.0.15, which upgrades the following Spark dependencies to the following versions:

  • Jackson 2.13.3
  • Jetty 9.4.46.v20220331
  • ORC 1.7.4
  • Parquet 1.12.3
  • Protobuf 3.19.4
  • RoaringBitmap 0.9.28

Dataproc on Compute Engine images now have master VM memory protection enabled by default. Jobs may be terminated to prevent the master VM running out of memory.

FallbackHiveAuthorizerFactory is now set by default on newly created 1.5 and 2.0 image clusters that have any of the following features enabled:

If you encounter a Cannot modify <PARAM> or similar runtime error when running a SET statement in a Hive query, this means the parameter is not in the list of allowable runtime parameters. You can allow the parameter using hive.security.authorization.sqlstd.confwhitelist.append as a cluster property when you create a cluster.

Example:

--properties="hive:hive.security.authorization.sqlstd.confwhitelist.append=tez.application.tags,<ADDITIONAL-1>,<ADDITIONAL-2>"

August 01, 2022

New sub-minor versions of Dataproc images:

1.5.72-debian10, 1.5.72-rocky8, 1.5.72-ubuntu18

2.0.46-debian10, 2.0.46-rocky8, 2.0.46-ubuntu18

Upgraded Hadoop to version 3.2.3 in 2.0 images.

Upgraded Hadoop to version 2.10.2 in 1.5 images.

The default MySQL instance root password is now set to a random value in 1.5 and 2.0 images. The new password is stored in a MySQL configuration file that is accessible only by the OS-level root user.

Backported the patch for KNOX-1997 in 2.0 images.

Backported the patch for HIVE-19048 in 2.0 images.

Backported the patches for HIVE-19047 and HIVE-19048 in 1.5 images.

July 07, 2022

Dataproc support for the following images has been extended to the following dates:

July 01, 2022

New sub-minor versions of Dataproc images:

1.5.71-debian10, 1.5.71-rocky8, 1.5.71-ubuntu18

2.0.45-debian10, 2.0.45-rocky8, 2.0.45-ubuntu18

For 1.5 images and the 2.0.45-ubuntu18 image, backported the upstream fix for KNOX-1997.

June 21, 2022

New sub-minor versions of Dataproc images:

1.5.70-debian10, 1.5.70-rocky8, 1.5.70-ubuntu18

2.0.44-debian10, 2.0.44-rocky8, 2.0.44-ubuntu18

Dataproc Metastore: For 1.5 images, added a spark.hadoop.hive.eager.fetch.functions.enabled Spark Hive client property to control whether the client fetches all functions from Hive Metastore during initialization. The default setting is true, which preserves the existing behavior of fetching all functions. If set to false, the client will not fetch all functions during initialization, which can help reduce high latency during initialization, particularly when there are many functions and the Metastore is not located in the client's region.

For 1.5 and 2.0 images, backported YARN-9608 to fix the issue in graceful decommissioning.

June 14, 2022

Announcing the General Availability (GA) release of Dataproc Custom OSS Metrics, which collects Dataproc cluster OSS component metrics and integrates them into Cloud Monitoring.

New sub-minor versions of Dataproc images:

1.5.69-debian10, 1.5.69-rocky8, 1.5.69-ubuntu18

2.0.43-debian10, 2.0.43-rocky8, 2.0.43-ubuntu18

Backported the patch for HBASE-23287 to HBase 1.5.0 in 1.5 images.

June 13, 2022

Announcing the General Availability (GA) release of the Ranger Cloud Storage plugin. This plugin activates an authorization service on each Dataproc cluster VM, which evaluates requests from the Cloud Storage connector against Ranger policies and, if the request is allowed, returns an access token for the cluster VM service account.

Dataproc is now available in the us-south1 region (Dallas, Texas).

June 06, 2022

Announcing the General Availability (GA) release of Dataproc Persistent History Server, which provides web interfaces to view job history for jobs run on active or deleted Dataproc clusters.

Dataproc Serverless for Spark now uses runtime version 1.0.13.

New sub-minor versions of Dataproc images:

1.5.68-debian10, 1.5.68-rocky8, 1.5.68-ubuntu18

2.0.42-debian10, 2.0.42-rocky8, 2.0.42-ubuntu18

Dataproc Serverless for Spark runtime versions 1.0.2, 1.0.3 and 1.0.4 are unavailable for new batch submissions.

Dataproc on GKE Spark 3.1 images upgraded to Spark version 3.1.3.

Upgraded the Cloud Storage connector to version 2.1.8 for 1.5 images only.

Fixed a bug where HDFS directories initialization could fail when user names in a project contain special characters.

Fixed a Dataproc on GKE bug that caused upload of driver logs to Cloud Storage to fail.

June 01, 2022

Dataproc is now available in the us-east5 region (Columbus, Ohio).

May 31, 2022

Dataproc is now available in the europe-southwest1 region (Madrid, Spain).

Dataproc is now available in the europe-west9 region (Paris, France).

May 30, 2022

New sub-minor versions of Dataproc images:

1.5.67-debian10, 1.5.67-ubuntu18, 1.5.67-rocky8

2.0.41-debian10, 2.0.41-ubuntu18, 2.0.41-rocky8

Dataproc on GKE error messages now provide additional information.

Backported fixes for HIVE-22098, HIVE-23809, HIVE-20462, HIVE-21675 to Hive 3.1 in Dataproc 2.0 images.

Fixed a bug where properties related to Kerberos cross-realm trust were not properly set.

Fixed a bug where cluster create operations on older images (for example, 1.3.95) failed with the error message: "does not support specifying local SSD interface other than 'SCSI'".

May 23, 2022

New sub-minor versions of Dataproc images:

1.5.66-debian10, 1.5.66-ubuntu18, 1.5.66-rocky8

2.0.40-debian10, 2.0.40-ubuntu18, 2.0.40-rocky8

Upgraded Spark to 3.1.3 in Dataproc image version 2.0.

Fixed a bug where a job was not marked as terminated after a master node reboot.

Fixed a bug where Flink was not able to run on HA clusters.

Backported the fix for HIVE-20514 to Hive 2.3 in Dataproc image version 1.5.

Fixed a bug with HDFS directories initialization when core:fs.defaultFS is set to an external HDFS.

May 09, 2022

New sub-minor versions of Dataproc images:

1.5.65-debian10, 1.5.65-ubuntu18, 1.5.65-rocky8

2.0.39-debian10, 2.0.39-ubuntu18, 2.0.39-rocky8

Dataproc Serverless for Spark now uses runtime version 1.0.12.

Fixed an issue where chronyd systemd service failed to start due to a race condition between systemd-timesyncd and chronyd.

Dataproc Serverless for Spark runtime version 1.0.1 is unavailable for new batch submissions.

May 03, 2022

New sub-minor versions of Dataproc images:

1.5.64-debian10, 1.5.64-ubuntu18, 1.5.64-rocky8

2.0.38-debian10, 2.0.38-ubuntu18, 2.0.38-rocky8

Dataproc Serverless for Spark now uses runtime version 1.0.11.

If you request to cancel a job that is in one of the following states, Dataproc returns the job but does not initiate cancellation, because cancellation is already in progress or complete: CANCEL_PENDING, CANCEL_STARTED, or CANCELLED.
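
The cancellation rule above can be pictured with a small Python sketch. The function name should_initiate_cancel is hypothetical and only models the decision described in this note; it is not part of any Dataproc client library.

```python
# States in which a cancel request returns the job without starting
# a new cancellation (taken from the release note above).
CANCEL_ALREADY_REQUESTED = {"CANCEL_PENDING", "CANCEL_STARTED", "CANCELLED"}


def should_initiate_cancel(job_state: str) -> bool:
    """Return True if a cancel request would start a new cancellation.

    Hypothetical helper: it only models the rule in the note.
    """
    return job_state not in CANCEL_ALREADY_REQUESTED


for state in ("RUNNING", "CANCEL_PENDING", "CANCELLED"):
    print(state, should_initiate_cancel(state))
```

A caller could use a check like this to avoid treating a repeated cancel request as an error.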

When submitting a Dataproc job or workflow that selects a cluster that matches the specified labels, Dataproc will avoid choosing clusters that are in a state that disallows running jobs. Specifically, Dataproc will only choose among clusters in one of the following states: RUNNING, UPDATING, CREATING, or ERROR_DUE_TO_UPDATE.
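
As a rough illustration of the selection rule above, the following Python sketch filters candidate clusters by label match and by state. The Cluster class and the eligible_clusters function are hypothetical names used for illustration only; the actual selection logic inside the service may differ.

```python
from dataclasses import dataclass, field

# States in which a cluster may be chosen, per the release note above.
SELECTABLE_STATES = {"RUNNING", "UPDATING", "CREATING", "ERROR_DUE_TO_UPDATE"}


@dataclass
class Cluster:
    """Hypothetical, minimal stand-in for a cluster record."""
    name: str
    state: str
    labels: dict = field(default_factory=dict)


def eligible_clusters(clusters, required_labels):
    """Keep clusters that carry every required label and are in a selectable state."""
    return [
        c for c in clusters
        if c.state in SELECTABLE_STATES
        and all(c.labels.get(k) == v for k, v in required_labels.items())
    ]


candidates = [
    Cluster("a", "RUNNING", {"env": "prod"}),
    Cluster("b", "STOPPED", {"env": "prod"}),
    Cluster("c", "CREATING", {"env": "dev"}),
]
print([c.name for c in eligible_clusters(candidates, {"env": "prod"})])
```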

Added Dataproc Serverless support for updating the Cloud Storage connector using the dataproc.gcsConnector.version and dataproc.gcsConnector.uri properties.

Hive: Upgraded to Apache ORC 1.5.13 in image version 2.0. Notable in this release are two bug fixes, ORC-598 and ORC-672, related to handling ORC files with arrays larger than 1024 elements.

Dataproc correctly defaults NodePool locations when the GKE cluster is in us-east1 and europe-west1.

Dataproc Serverless for Spark runtime version 1.0.0 is unavailable for new batch submissions.

April 22, 2022

New sub-minor versions of Dataproc images:

1.5.63-debian10, 1.5.63-ubuntu18, 1.5.63-rocky8

2.0.37-debian10, 2.0.37-ubuntu18, 2.0.37-rocky8

Dataproc Serverless for Spark now uses runtime version 1.0.10.

Cloud Storage connector version upgraded to 2.2.6 in image version 2.0.

Hive: Bundled threeten classes in hive-exec.jar in image version 2.0. ORC now requires date handling classes in the org.threeten package, which were not present in hive-exec.jar at query time.

HIVE-22589 fixed this bug upstream, but it was part of a large new feature. Instead, this change applies a small targeted fix to address the bug.

April 20, 2022

Dataproc is now available in the europe-west8 region (Milan, Italy).

April 13, 2022

Announcing the General Availability (GA) release of Dataproc on GKE, which allows you to execute Big Data applications using the Dataproc jobs API on GKE clusters.

April 11, 2022

The dataproc:dataproc.performance.metrics.listener.enabled cluster property, which is enabled by default, listens on port 8791 on all master nodes to extract performance-related telemetry Spark metrics. The metrics are published to the Dataproc service for it to use to set better defaults and improve the service. To opt out of this feature, set dataproc:dataproc.performance.metrics.listener.enabled=false when creating a Dataproc cluster.
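
One way to assemble such a setting is sketched below in Python. The helper format_properties is hypothetical; it assumes the value is passed through the --properties flag of gcloud dataproc clusters create as comma-separated key=value pairs, so check the flag's documentation before relying on this format.

```python
def format_properties(props: dict) -> str:
    """Render properties as a comma-separated 'key=value' string.

    Hypothetical helper. Booleans are rendered in lowercase to match
    the 'false' spelling used in the release note.
    """
    parts = []
    for key, value in props.items():
        text = str(value).lower() if isinstance(value, bool) else str(value)
        parts.append(f"{key}={text}")
    return ",".join(parts)


# Assumed usage: the resulting string is the value of --properties.
flag = "--properties=" + format_properties(
    {"dataproc:dataproc.performance.metrics.listener.enabled": False}
)
print(flag)
```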

New sub-minor versions of Dataproc images:

1.5.62-debian10, 1.5.62-ubuntu18, and 1.5.62-rocky8

2.0.36-debian10, 2.0.36-ubuntu18, and 2.0.36-rocky8

Changed the owner of /usr/lib/knox/conf/gateway-site.xml from root:root to knox:knox.

Fixed an issue in which the Dataproc autoscaler would sometimes try to scale down a cluster by more than one thousand secondary worker nodes at one time. The autoscaler now scales down at most one thousand nodes at a time. When a scale-down would previously have removed more than one thousand nodes, the autoscaler removes at most one thousand and writes an entry to the autoscaler log noting the occurrence.
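
The capping behavior described above can be sketched in Python as follows. The function plan_scale_down and its return shape are hypothetical; only the limit of one thousand nodes comes from this note.

```python
# Upper bound on nodes removed in one scale-down, per the release note.
MAX_SCALE_DOWN_PER_OPERATION = 1000


def plan_scale_down(requested_removals: int):
    """Return (nodes_to_remove, was_capped) for a single scale-down.

    Hypothetical model: was_capped signals that the request exceeded the
    limit, which is the case the note says is recorded in the autoscaler log.
    """
    if requested_removals < 0:
        raise ValueError("requested_removals must be non-negative")
    was_capped = requested_removals > MAX_SCALE_DOWN_PER_OPERATION
    return min(requested_removals, MAX_SCALE_DOWN_PER_OPERATION), was_capped


print(plan_scale_down(1500))
print(plan_scale_down(200))
```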

Fixed bugs that could cause Dataproc to delay marking a job cancelled.

April 01, 2022

New sub-minor versions of Dataproc images:

1.5.61-debian10, 1.5.61-ubuntu18, and 1.5.61-rocky8

2.0.35-debian10, 2.0.35-ubuntu18, and 2.0.35-rocky8

Changed the owner of /var/lib/zookeeper/myid from root to zookeeper.

March 25, 2022

New sub-minor versions of Dataproc images:

1.5.60-debian10, 1.5.60-ubuntu18, and 1.5.60-rocky8

2.0.34-debian10, 2.0.34-ubuntu18, and 2.0.34-rocky8

March 17, 2022

New sub-minor versions of Dataproc images:

1.5.59-debian10, 1.5.59-ubuntu18, and 1.5.59-rocky8

2.0.33-debian10, 2.0.33-ubuntu18, and 2.0.33-rocky8

March 07, 2022

New sub-minor versions of Dataproc images:

1.5.58-debian10, 1.5.58-ubuntu18, and 1.5.58-rocky8

2.0.32-debian10, 2.0.32-ubuntu18, and 2.0.32-rocky8

Fixed a bug where clusters created via Dataproc Hub failed with the "Unit file jupyter.service does not exist" error.

Fixed a bug where clusters created with Kerberos failed with the "SSL Certificate string is too long" error.

February 18, 2022

Added support for Enhanced Flexibility Mode (EFM) with primary worker shuffle mode on Spark for image version 2.0.

General Availability (GA) release of new Rocky Linux based images: 1.5.57-rocky8 and 2.0.31-rocky8. These images replace CentOS images, which have reached end of life (EOL).

Dataproc Serverless for Spark now uses runtime version 1.0.4, which updates the Cloud Storage connector to version 2.2.5.

New sub-minor versions of Dataproc images:

1.5.57-debian10, 1.5.57-ubuntu18, and 1.5.57-rocky8

2.0.31-debian10, 2.0.31-ubuntu18, and 2.0.31-rocky8

Upgraded Cloud Storage connector version to 2.2.5 in image version 2.0.

Upgraded Cloud Storage connector version to 2.1.7 in image version 1.5.

CentOS images are EOL. 1.5.56-centos8 and 2.0.30-centos8 are the final CentOS based images. CentOS images are no longer supported and will not receive new releases.

February 17, 2022

A script that checks if a project or organization is using an unsupported Dataproc image is available for downloading (see Unsupported Dataproc versions).

February 15, 2022

Dataproc images prior to 1.3.95, 1.4.77, 1.5.53, and 2.0.27 are deprecated, and cluster creations based on these images will fail starting February 28, 2022.
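
A minimal Python sketch of the version comparison implied above follows. The function is_deprecated is hypothetical and is not the downloadable script mentioned in the February 17, 2022 note; it assumes image versions have the form MAJOR.MINOR.SUBMINOR with an optional OS suffix such as -debian10.

```python
# Lowest non-deprecated sub-minor version per image track, from the note above.
MINIMUM_SUBMINOR = {"1.3": 95, "1.4": 77, "1.5": 53, "2.0": 27}


def is_deprecated(image_version: str) -> bool:
    """Return True if the image is older than the minimum for its track.

    Hypothetical helper; raises ValueError for tracks the note does not list.
    """
    numeric = image_version.split("-", 1)[0]  # drop an OS suffix, if any
    major, minor, subminor = numeric.split(".")
    track = f"{major}.{minor}"
    if track not in MINIMUM_SUBMINOR:
        raise ValueError(f"unknown image track: {track}")
    return int(subminor) < MINIMUM_SUBMINOR[track]


for version in ("1.5.52-debian10", "2.0.27-ubuntu18"):
    print(version, is_deprecated(version))
```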

February 07, 2022

Added cluster_type field to job and operation metrics in Cloud Monitoring.

February 01, 2022

Enabled the Resource Manager UI and HA capable UIs in HA cluster mode.

1.4.80-debian10 and 1.4.80-ubuntu18 are the last releases for the 1.4 images. Dataproc 1.4 images will no longer be supported and will not receive new releases.

New sub-minor versions of Dataproc images:

1.4.80-debian10 and 1.4.80-ubuntu18

1.5.56-debian10, 1.5.56-ubuntu18, and 1.5.56-centos8

2.0.30-debian10, 2.0.30-ubuntu18, and 2.0.30-centos8

Configured Zeppelin Spark interpreter to run in YARN client mode by default for image version 2.0.

January 31, 2022

Dataproc Serverless for Spark now uses runtime version 1.0.2, which updates Spark to version 3.2.1.

January 24, 2022

Dataproc Serverless for Spark now uses runtime version 1.0.1, which includes improved error messaging for network connectivity issues.

January 19, 2022

Announcing the General Availability (GA) release of Dataproc Serverless for Spark, which allows you to run your Spark jobs on Dataproc without having to spin up and manage your own cluster.

January 18, 2022

Added support for Dataproc Metastore's beta NetworkConfig field. Beta services using this field can now be used in conjunction with v1 Dataproc clusters.

Dataproc extracts the warehouse directory from the Dataproc Metastore service for the cluster-local warehouse directory.

January 17, 2022

New sub-minor versions of Dataproc images:

1.4.79-debian10 and 1.4.79-ubuntu18

1.5.55-debian10, 1.5.55-ubuntu18, and 1.5.55-centos8

2.0.29-debian10, 2.0.29-ubuntu18, and 2.0.29-centos8

Dataproc images 1.4.79, 1.5.55, and 2.0.29, listed above, were updated with log4j version 2.17.1. It is strongly recommended that your clusters use previously released images 1.4.77, 1.5.53, or 2.0.27, or higher (see Supported Dataproc versions). Although it is neither urgent nor strongly recommended, Dataproc advises you to create or recreate Dataproc clusters with the latest sub-minor image versions when possible.

The Cloud Storage connector jar is installed on the Solr server (even if dataproc:solr.gcs.path property is not set). Applies to image versions 1.4, 1.5, and 2.0.

Migrated to Eclipse Temurin JDK in image versions 1.4, 1.5, and 2.0.

Fixed a bug where cluster restart disabled Solr and Ranger services even if the components are selected. Applies to image versions 1.4, 1.5, and 2.0.

YARN-8865: RMStateStore contains large number of expired RMDelegationToken. Applies to 1.5 images.

RANGER-3324: Make optimized db schema script idempotent for MySQL DB. Applies to 2.0 images.

January 09, 2022

New sub-minor versions of Dataproc images:

1.4.78-debian10 and 1.4.78-ubuntu18

1.5.54-centos8, 1.5.54-debian10, and 1.5.54-ubuntu18

2.0.28-centos8, 2.0.28-debian10, and 2.0.28-ubuntu18

Dataproc images 1.4.78, 1.5.54, and 2.0.28, listed above, were updated with log4j version 2.17.0. It is strongly recommended that your clusters use previously released images 1.4.77, 1.5.53, or 2.0.27, or higher (see Supported Dataproc versions). Although it is neither urgent nor strongly recommended, Dataproc advises you to create or recreate Dataproc clusters with the latest sub-minor image versions when possible.

Upgraded Cloud Storage connector version to 2.2.4 in image version 2.0.

Fixed an issue where jars added with the --jars flag in gcloud dataproc jobs submit spark-sql were missing at runtime.

December 21, 2021

Dataproc has released 1.3.95-debian10/-ubuntu18 images with a one-time patch that addresses the Apache Log4j 2 CVE-2021-44228 and CVE-2021-45046 vulnerabilities. Note that all 1.3 images remain unsupported, and Dataproc will not provide upgrades to 1.3 images.

December 18, 2021

Dataproc has released the following sub-minor image versions to address an Apache Log4j 2 vulnerability (also see Create a cluster and Recreate and update a cluster for more information). Note: These images supersede the 1.5 and 2.0 images listed in the December 16, 2021 release note:

1.5.53-centos8, 1.5.53-debian10, 1.5.53-ubuntu18,

2.0.27-centos8, 2.0.27-debian10, 2.0.27-ubuntu18

Removed the Geode interpreter from Zeppelin notebook, which is affected by https://nvd.nist.gov/vuln/detail/CVE-2021-45046.

December 16, 2021

Dataproc has released the following sub-minor image versions (https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-versions):

1.4.77-debian10, 1.4.77-ubuntu18,

1.5.52-centos8, 1.5.52-debian10, 1.5.52-ubuntu18,

2.0.26-centos8, 2.0.26-debian10, 2.0.26-ubuntu18

Upgraded log4j version to 2.16.0, which fixes https://nvd.nist.gov/vuln/detail/CVE-2021-45046.

December 13, 2021

Dataproc has added new images, listed in this release note, to address an Apache Log4j 2 vulnerability.

Note: These images have been superseded by the 12/16/21 images (see the December 16, 2021 release note). Also see Create a cluster and Recreate and update a cluster for more information.

New sub-minor versions of Dataproc images:

1.4.76-debian10, 1.4.76-ubuntu18,

1.5.51-centos8, 1.5.51-debian10, 1.5.51-ubuntu18,

2.0.25-centos8, 2.0.25-debian10, 2.0.25-ubuntu18

Upgraded log4j version to 2.15.0, which fixes https://nvd.nist.gov/vuln/detail/CVE-2021-44228.

HIVE-21040: msck does unnecessary file listing at last level of directory tree. Applies to 1.5 and 2.0 images.

Fixed executor log links on Spark History Server Web UI for running and completed applications. Applies to 1.4 and 1.5 images.

Fixed a bug where driver log links on PHS Web UI stop working once the job cluster is deleted. Applies to 1.4 and 1.5 images.

YARN-8990: Fixed a Fairscheduler race condition. Applies to 2.0 images.

SPARK-7768: Make user-defined type (UDT) API public. Applies to 2.0 images.

SPARK-35817: Queries against wide Avro tables can be slow. Applies to 2.0 images.

November 17, 2021

Dataproc is now available in the southamerica-west1 region (Santiago, Chile).

November 01, 2021

Added the following new Apache Spark properties to control Cloud Storage flush behavior for event logs for 1.4 and later images:

  • spark.history.fs.gs.outputstream.type (default: BASIC)
  • spark.history.fs.gs.outputstream.sync.min.interval.ms (default: 5000ms).

Note: The default configuration of these properties enables the display of running jobs in the Spark History Server UI for clusters using Cloud Storage to store Spark event logs.

Added support in 1.5 and 2.0 images to filter Spark Applications on the Spark History Server Web UI based on Cloud Storage path. Filtering is accomplished using the eventLogDirFilter parameter, which accepts any Cloud Storage path substring and will return applications that match the Cloud Storage path.
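
The substring matching described above can be illustrated with a short Python sketch. The function filter_applications, the application IDs, and the bucket paths are all hypothetical; the sketch only models "return applications whose Cloud Storage path contains the filter substring" and does not call the Spark History Server.

```python
def filter_applications(event_log_dirs: dict, event_log_dir_filter: str) -> list:
    """Return IDs of applications whose event log path contains the filter.

    event_log_dirs maps an application ID to its Cloud Storage event log path.
    """
    return sorted(
        app_id
        for app_id, path in event_log_dirs.items()
        if event_log_dir_filter in path
    )


# Hypothetical application IDs and paths, for illustration only.
apps = {
    "app-1": "gs://example-bucket/team-a/spark-events",
    "app-2": "gs://example-bucket/team-b/spark-events",
    "app-3": "gs://other-bucket/team-a/spark-events",
}
print(filter_applications(apps, "team-a"))
```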

New sub-minor versions of Dataproc images:

1.4.75-debian10, 1.4.75-ubuntu18,

1.5.50-centos8, 1.5.50-debian10, 1.5.50-ubuntu18,

2.0.24-centos8, 2.0.24-debian10, 2.0.24-ubuntu18

Removed Apache Iceberg and Delta Lake libraries in 2.0 images because they are not compatible with Spark 3.1.

Upgraded Cloud Storage connector to version 2.2.3 on 2.0 Images.

The previous Dataproc on GKE beta documentation has been replaced with a Dataproc on GKE private preview sign up form. Existing beta customers can continue using the beta release, but note that the beta release is planned to be deprecated and removed.

Patched Hive in 2.0 images with HIVE-20187, which fixes a bug where Hive returned incorrect query results when hive.convert.join.bucket.mapjoin.tez is set to true.

Backported SPARK-31946 in 2.0 images.

Backported SPARK-23182 in 1.4 and 1.5 images. This prevents long-running Spark shuffle servers from leaking connections when they are not cleanly terminated.

Fixed stdout and stderr links in the Spark History Server Web UI in 2.0 images.

October 22, 2021

The dataproc:dataproc.cluster-ttl.consider-yarn-activity cluster property is now set to true by default for image versions 1.4.64+, 1.5.39+, and 2.0.13+. For clusters created with these image versions, Dataproc Cluster Scheduled Deletion now considers YARN activity, in addition to Dataproc Jobs API activity, when determining cluster idle time. This change does not affect clusters with earlier image versions: cluster idle time for those clusters continues to be computed based on Dataproc Jobs API activity only. When using image versions 1.4.64+, 1.5.39+, and 2.0.13+, you can opt out of this behavior by setting this property to false when you create the cluster.
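
The effect of this property on idle time can be pictured with the Python sketch below. The function idle_time and its parameters are hypothetical, and the sketch assumes idle time is measured from the most recent activity of either kind; how the service measures it internally is not stated in this note.

```python
from datetime import datetime, timedelta


def idle_time(now, last_job_activity, last_yarn_activity=None,
              consider_yarn_activity=True):
    """Return how long the cluster has been idle.

    Hypothetical model: with consider_yarn_activity=True, the more recent of
    the Jobs API and YARN activity timestamps counts; otherwise only Jobs API
    activity counts.
    """
    last_activity = last_job_activity
    if consider_yarn_activity and last_yarn_activity is not None:
        last_activity = max(last_job_activity, last_yarn_activity)
    return now - last_activity


now = datetime(2021, 10, 22, 12, 0)
last_job = now - timedelta(hours=3)
last_yarn = now - timedelta(minutes=30)
print(idle_time(now, last_job, last_yarn))
print(idle_time(now, last_job, last_yarn, consider_yarn_activity=False))
```

In this model, a cluster running only YARN applications submitted outside the Jobs API stays "active" under the new default but looks idle under the old behavior.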

October 08, 2021

In a future announcement (on approximately October 22, 2021), Dataproc will announce that Cluster Scheduled Deletion by default will consider YARN activity, in addition to Dataproc Jobs API activity, when determining cluster idle time. This change will affect image versions 1.4.64+, 1.5.39+, and 2.0.13+. To test this feature now, create a cluster with a recent image, setting the dataproc:dataproc.cluster-ttl.consider-yarn-activity cluster property to true. Note: After this behavior becomes the default, you can opt out when you create a cluster by setting the property to false.

October 01, 2021

New sub-minor versions of Dataproc images:

1.4.73-debian10, 1.4.73-ubuntu18,

1.5.48-centos8, 1.5.48-debian10, 1.5.48-ubuntu18,

2.0.22-centos8, 2.0.22-debian10, 2.0.22-ubuntu18

Fixed an issue where complete YARN container logs were not visible in 1.5 and 2.0 Images.

HADOOP-15129: Fixed an issue in 2.0 images where a DataNode cached a NameNode DNS lookup failure and could not start up.

September 17, 2021

Updated August 19, 2021 release notes with cluster creation Failure Action feature.

September 13, 2021

New sub-minor versions of Dataproc images: 1.4.71-debian10, 1.4.71-ubuntu18, 1.5.46-centos8, 1.5.46-debian10, 1.5.46-ubuntu18, 2.0.20-centos8, 2.0.20-debian10, 2.0.20-ubuntu18

Added support for enabling/disabling Ubuntu Snap daemon with cluster property dataproc:dataproc.snap.enabled. The default value is "true". If set to "false", pre-installed Snap packages in the image won't be affected, but auto refresh will be disabled. Applies to all Ubuntu images.

HIVE-21018: Grouping/distinct on more than 64 columns should be possible. Applies to 2.0 images.

September 08, 2021

The following previously released sub-minor versions of Dataproc images included a bug where the dataproc user account was broken. This prevented some Dataproc services from functioning properly, which resulted in features being unavailable. In particular, this prevented Jupyter from running in clusters with Personal Cluster Authentication enabled.

These sub-minor versions have been rolled back, and can only be used when updating existing clusters that already use them:

  • 1.4.66-debian10, 1.4.66-ubuntu18
  • 1.4.67-debian10, 1.4.67-ubuntu18
  • 1.5.41-centos8, 1.5.41-debian10, 1.5.41-ubuntu18
  • 1.5.42-centos8, 1.5.42-debian10, 1.5.42-ubuntu18
  • 2.0.15-centos8, 2.0.15-debian10, 2.0.15-ubuntu18
  • 2.0.16-centos8, 2.0.16-debian10, 2.0.16-ubuntu18

September 07, 2021

Added additional information to the error messages for networking and IAM errors when creating a new cluster.

August 30, 2021

New sub-minor versions of Dataproc images: 1.4.70-debian10, 1.4.70-ubuntu18, 1.5.45-centos8, 1.5.45-debian10, 1.5.45-ubuntu18, 2.0.19-centos8, 2.0.19-debian10, 2.0.19-ubuntu18

Backported SPARK-34295: Added a new spark.yarn.kerberos.renewal.excludeHadoopFileSystems configuration option.

Image 2.0:

OOZIE-3599: Upgraded Jetty version to 9.4.

August 23, 2021

New sub-minor versions of Dataproc images: 1.4.69-debian10, 1.4.69-ubuntu18, 1.5.44-centos8, 1.5.44-debian10, 1.5.44-ubuntu18, 2.0.18-centos8, 2.0.18-debian10, and 2.0.18-ubuntu18.

Configured YARN ResourceManager to use port 8554 and Druid to use port 17071 for JMX Remote RMI port.

August 19, 2021

Added support for Dataproc Metastore in three recently launched regions: europe-west1, northamerica-northeast1, and asia-southeast1.

Users can now help ensure the successful creation of a cluster by automatically deleting any failed primary workers (the master(s) and at least two primary workers must be successfully provisioned for cluster creation to succeed). To delete any failed primary workers when you create a cluster:

  1. Using gcloud: Set the gcloud dataproc clusters create --action-on-failed-primary-workers flag to "DELETE".

  2. Using the Dataproc clusters.create API: Set the actionOnFailedPrimaryWorkers field to "DELETE".
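
The success condition stated above, when failed primary workers are deleted, can be sketched in Python. The function creation_can_succeed is hypothetical and models only the rule in this note: all masters and at least two primary workers must be provisioned.

```python
# Minimum number of successfully provisioned primary workers, per the note.
MIN_PRIMARY_WORKERS = 2


def creation_can_succeed(masters_requested: int, masters_provisioned: int,
                         primary_workers_provisioned: int) -> bool:
    """Model the success condition when failed primary workers are deleted.

    Hypothetical helper: failed workers are assumed to be removed, so only
    the provisioned counts matter.
    """
    return (
        masters_provisioned == masters_requested
        and primary_workers_provisioned >= MIN_PRIMARY_WORKERS
    )


# Ten workers requested, three failed and were deleted, seven remain.
print(creation_can_succeed(1, 1, 7))
# Only one worker was provisioned, which is below the minimum.
print(creation_can_succeed(1, 1, 1))
```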

Dataproc issues a warning message if the staging or temp bucket name contains an underscore.

August 13, 2021

New sub-minor versions of Dataproc images: 1.4.68-debian10, 1.4.68-ubuntu18, 1.5.43-centos8, 1.5.43-debian10, 1.5.43-ubuntu18, 2.0.17-centos8, 2.0.17-debian10, and 2.0.17-ubuntu18.

Upgraded Flink to version 1.12.5 in image 2.0.

HIVE-2527: Fixed slow Hive partition deletion for Cloud Object Stores with expensive ListFiles.

Fixed Jupyter startup on Personal Auth clusters on all images.

August 09, 2021

New sub-minor versions of Dataproc images: 1.4.67-debian10, 1.4.67-ubuntu18, 1.5.42-centos8, 1.5.42-debian10, 1.5.42-ubuntu18, 2.0.16-centos8, 2.0.16-debian10, and 2.0.16-ubuntu18.

SPARK-28290: Fixed an issue where Spark History Server failed to serve because of a wild card certificate in the 1.4 and 1.5 images.

August 03, 2021

Dataproc is now available in the northamerica-northeast2 region (Toronto).

August 02, 2021

1.3 images are no longer supported and will not receive new releases.

New sub-minor versions of Dataproc images: 1.4.66-debian10, 1.4.66-ubuntu18, 1.5.41-centos8, 1.5.41-debian10, 1.5.41-ubuntu18, 2.0.15-centos8, 2.0.15-debian10, and 2.0.15-ubuntu18.

In image 2.0, set the mapreduce.fileoutputcommitter.algorithm.version=2 property in Spark. This makes Spark commit algorithm version consistent with prior Dataproc image versions.

July 27, 2021

New sub-minor versions of Dataproc images: 1.3.94-debian10, 1.3.94-ubuntu18, 1.4.65-debian10, 1.4.65-ubuntu18, 1.5.40-centos8, 1.5.40-debian10, 1.5.40-ubuntu18, 2.0.14-centos8, 2.0.14-debian10, and 2.0.14-ubuntu18.

The following component versions were updated in image 2.0:

Fixed a rare bug in which the update operation would sometimes fail with the error 'Resource is not a member of' or 'Cannot delete instance that was already deleted' when scaling down the number of secondary workers in a cluster.

July 20, 2021

New sub-minor versions of Dataproc images: 1.3.93-debian10, 1.3.93-ubuntu18, 1.4.64-debian10, 1.4.64-ubuntu18, 1.5.39-centos8, 1.5.39-debian10, 1.5.39-ubuntu18, 2.0.13-centos8, 2.0.13-debian10, and 2.0.13-ubuntu18.

Upgraded Cloud Storage connector to version 2.2.2 on 2.0 images.

Fixed Hue installation on Ubuntu 2.0 images.

Fixed an issue on 1.4 and 1.5 images where temporary shuffle data could be leaked when running Enhanced Flexibility Mode (EFM) with Spark.

July 12, 2021

For 2.0+ image clusters, the dataproc:dataproc.master.custom.init.actions.mode cluster property can be set to RUN_AFTER_SERVICES to run initialization actions on the master after HDFS and any services that depend on HDFS are initialized. Examples of HDFS-dependent services include: HBase, Hive Server2, Ranger, Solr, and the Spark and MapReduce history servers. Default: RUN_BEFORE_SERVICES.
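
The ordering that this property controls can be shown with a simplified Python sketch. The function master_startup_order and the two step labels are hypothetical; the sketch reduces master startup to two steps and is not a description of the actual startup sequence.

```python
INIT_ACTIONS = "initialization actions"
HDFS_SERVICES = "HDFS and HDFS-dependent services"


def master_startup_order(mode: str = "RUN_BEFORE_SERVICES") -> list:
    """Return the simplified order of the two steps for the given mode.

    Hypothetical model of the property values named in the release note.
    """
    if mode == "RUN_BEFORE_SERVICES":
        return [INIT_ACTIONS, HDFS_SERVICES]
    if mode == "RUN_AFTER_SERVICES":
        return [HDFS_SERVICES, INIT_ACTIONS]
    raise ValueError(f"unsupported mode: {mode}")


print(master_startup_order())
print(master_startup_order("RUN_AFTER_SERVICES"))
```

Under this model, an initialization action that writes to HDFS would need RUN_AFTER_SERVICES so that HDFS is already available when it runs.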

July 09, 2021

Custom image limitation: New images announced in the Dataproc release notes are not available for use as the base for custom images until one week from their announcement date.

The Dataproc v1beta2 APIs are deprecated. Please use the Dataproc v1 APIs.

July 07, 2021

The end date of support for Dataproc image version 1.4 has been extended from August, 2021 to November, 2021.

July 05, 2021

New sub-minor versions of Dataproc images: 1.3.92-debian10, 1.3.92-ubuntu18, 1.4.63-debian10, 1.4.63-ubuntu18, 1.5.38-centos8, 1.5.38-debian10, 1.5.38-ubuntu18, 2.0.12-centos8, 2.0.12-debian10, and 2.0.12-ubuntu18.

Upgraded Spark version to 2.4.8 in the following images:

  • Image 1.4
  • Image 1.5

Upgraded HBase to version 2.2.7 in image 2.0.

Minimum boot disk sizes for Dataproc images:

  • Image 2.0: 30GB
  • Image 1.5: 20GB
  • Image 1.4: 15GB
  • Image 1.3: 15GB
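
A quick pre-flight check against the minimums listed above might look like the following Python sketch. The function boot_disk_is_large_enough is hypothetical; the sizes are copied from the list above.

```python
# Minimum boot disk size in GB per image track, from the list above.
MIN_BOOT_DISK_GB = {"2.0": 30, "1.5": 20, "1.4": 15, "1.3": 15}


def boot_disk_is_large_enough(image_track: str, boot_disk_gb: int) -> bool:
    """Return True if the boot disk meets the minimum for the image track.

    Hypothetical helper; raises ValueError for tracks not listed in the note.
    """
    if image_track not in MIN_BOOT_DISK_GB:
        raise ValueError(f"unknown image track: {image_track}")
    return boot_disk_gb >= MIN_BOOT_DISK_GB[image_track]


print(boot_disk_is_large_enough("2.0", 25))
print(boot_disk_is_large_enough("1.5", 20))
```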

Fixed stdout/stderr links on Spark History Server Web UI of the Persistent History Server in the following images:

  • Image 1.4
  • Image 1.5

Fixed a bug where personal auth credentials would not propagate to every VM in the cluster if VPC service controls were enabled.

June 29, 2021

Dataproc is now available in the asia-south2 region (Delhi).

The following previously released sub-minor versions of Dataproc images have been rolled back and can only be used when updating existing clusters that already use them:

  • 1.3.91-debian10, 1.3.91-ubuntu18
  • 1.4.62-debian10, 1.4.62-ubuntu18
  • 1.5.37-centos8, 1.5.37-debian10, 1.5.37-ubuntu18
  • 2.0.11-centos8, 2.0.11-debian10, and 2.0.11-ubuntu18.

Added support for Dataproc Metastore in three recently launched regions: europe-west3, us-west1, and us-east1.

Introduced a new ERROR_DUE_TO_UPDATE state, which indicates a cluster has encountered an irrecoverable error while scaling. Clusters in this state cannot be scaled, but can accept jobs.

Fixed an issue where a spurious unrecognized property warning was generated when the dataproc:jupyter.listen.all.interfaces cluster property is set.

June 21, 2021

Dataproc is now available in the australia-southeast2 region (Melbourne).

June 18, 2021

Dataproc Component Gateway URLs for any two new clusters that have the same project ID, region, and name will be identical unless Dataproc Personal Cluster Authentication is enabled.

June 08, 2021

Custom image limitation: Currently, the following Dataproc image versions are the latest images that can be used as the base for custom images:

  • 1.3.89-debian10, 1.3.89-ubuntu18
  • 1.4.60-debian10, 1.4.60-ubuntu18
  • 1.5.35-debian10, 1.5.35-ubuntu18, 1.5.35-centos8
  • 2.0.9-debian10, 2.0.9-ubuntu18, 2.0.11-centos8

June 01, 2021

New sub-minor versions of Dataproc images: 1.3.91-debian10, 1.3.91-ubuntu18, 1.4.62-debian10, 1.4.62-ubuntu18, 1.5.37-centos8, 1.5.37-debian10, 1.5.37-ubuntu18, 2.0.11-centos8, 2.0.11-debian10, and 2.0.11-ubuntu18.

Image 1.3 - 2.0

  • All jobs now share a single Job thread pool.

  • The number of Job threads in the Agent is configurable with the dataproc:agent.process.threads.job.min and dataproc:agent.process.threads.job.max cluster properties, defaulting to 10 and 100, respectively. The previous behavior was to always use 10 Job threads.

Image 2.0

  • Added snappy-jar dependency to Hadoop.
  • Upgraded versions of Python packages: nbdime 2.1 -> 3.0, pyarrow 2.0 -> 3.0, spyder 4.2 -> 5.0, spyder-kernels 1.10 -> 2.0, regex 2020.11 -> 2021.4.

Image 1.3 - 2.0

  • SPARK-35227: Replace Bintray with the new repository service for the spark-packages resolver in SparkSubmit.

Image 2.0

  • Fixed an issue where the PATH environment variable was not set in YARN containers.

  • SPARK-34731: ConcurrentModificationException in EventLoggingListener when redacting properties.

May 20, 2021

You can customize the Conda environment during cluster creation using new Conda-related cluster properties. See Using Conda-related cluster properties.

Added validation for clusters created with Dataproc Metastore services to determine compatibility between the Dataproc image's Hive version and the Dataproc Metastore service's Hive version.

April 23, 2021

Announcing Dataproc Confidential Compute: Dataproc clusters now support Compute Engine Confidential VMs.

New sub-minor versions of Dataproc images: 1.3.89-debian10, 1.3.89-ubuntu18, 1.4.60-debian10, 1.4.60-ubuntu18, 1.5.35-centos8, 1.5.35-debian10, 1.5.35-ubuntu18, 2.0.9-centos8, 2.0.9-debian10, and 2.0.9-ubuntu18.

Image 1.5

  • CentOS only: adoptopenjdk is set as the default Java environment.

Image 1.5 and 2.0

  • Updated Oozie to version 5.2.1.
  • The Jupyter optional component now uses the "GCS" subdirectory as the initial working directory when you open the JupyterLab UI.

April 16, 2021

Added the ability to stop and start high-availability clusters.

Fixed a bug where scale-down update cluster requests failed due to quota validation if the user project was over a quota limit.

April 05, 2021

Image 2.0:

March 31, 2021

Dataproc support of Dataproc Metastore services is now available in GA.

March 26, 2021

Image 2.0:

  • Changed default private IPv6 Google APIs access for 2.0 clusters from OUTBOUND to INHERIT_FROM_SUBNETWORK.

March 24, 2021

Dataproc is now available in the europe-central2 region (Warsaw).

March 23, 2021

The default Dataproc image is now image version 2.0.

New sub-minor versions of Dataproc images: 1.3.88-debian10, 1.3.88-ubuntu18, 1.4.59-debian10, 1.4.59-ubuntu18, 1.5.34-centos8, 1.5.34-debian10, 1.5.34-ubuntu18, 2.0.7-centos8, 2.0.7-debian10, and 2.0.7-ubuntu18.

Image 2.0:

  • Updated Iceberg to version 0.11.0.
  • Updated Flink to version 1.12.2.

Image 2.0:

  • HIVE-22373: File Merge tasks fail when containers are reused.

Fixed a bug that caused Hive jobs to fail on Ranger-enabled clusters.

The end date of support for Dataproc image version 1.3 has been extended from March, 2021 to July, 2021.

Fixed a bug where Spark event logs directory and history server directory could not be set to Cloud Storage correctly.

Fixed a bug where Presto property value with ';' could not be set correctly in the config file.

CVE-2020-13957: SOLR-14663: ConfigSets CREATE does not set trusted flag.

CVE-2020-1926: HIVE-22708: Test fix for http transport.

March 16, 2021

New sub-minor versions of Dataproc images: 1.3.87-debian10, 1.3.87-ubuntu18, 1.4.58-debian10, 1.4.58-ubuntu18, 1.5.33-centos8, 1.5.33-debian10, 1.5.33-ubuntu18, 2.0.6-centos8, 2.0.6-debian10, and 2.0.6-ubuntu18.

Image 2.0: Upgraded Spark to version 3.1.1.

March 08, 2021

Dataproc image version 2.0 will become the default Dataproc image version in 1 week, on March 15, 2021.

March 05, 2021

New sub-minor versions of Dataproc images: 1.3.86-debian10, 1.3.86-ubuntu18, 1.4.57-debian10, 1.4.57-ubuntu18, 1.5.32-centos8, 1.5.32-debian10, 1.5.32-ubuntu18, 2.0.5-debian10, and 2.0.5-ubuntu18

Image 2.0:

Fixed a bug where YARN applications launched by Hive jobs were not correctly tagged, leading to missing YARN application status from job state.

Fixed the permission for mounted SSD Hadoop directories.

March 02, 2021

Added the --cluster-labels flag to gcloud dataproc jobs submit to allow submitting jobs to a cluster that matches specified cluster labels. Also see Submitting a Dataproc job.

March 01, 2021

Dataproc image version 2.0 will become the default Dataproc image version in 2 weeks, on March 15, 2021.

February 26, 2021

New sub-minor versions of Dataproc images: 1.3.85-debian10, 1.3.85-ubuntu18, 1.4.56-debian10, 1.4.56-ubuntu18, 1.5.31-centos8, 1.5.31-debian10, 1.5.31-ubuntu18, 2.0.4-debian10, and 2.0.4-ubuntu18

Image 2.0: Upgraded Spark to version 3.1.1 RC2.

Allow stopping clusters that have autoscaling enabled, and allow enabling autoscaling on clusters that are STOPPED, STOPPING, or STARTING. If you stop a cluster that has autoscaling enabled, the Dataproc autoscaler will stop scaling the cluster. It will resume scaling the cluster once it has been started again. If you enable autoscaling on a stopped cluster, the autoscaling policy will only take effect once the cluster has been started (see Starting and stopping clusters).

Deactivated mysql and hive-metastore components for clusters created with a Dataproc Metastore service on an image that has the DISABLE_COMPONENT_HIVE_METASTORE and DISABLE_COMPONENT_MYSQL capabilities.

Image 1.3 - 1.5: HIVE-18871: Hive on Tez execution error when hive.aux.jars.path is set to hdfs://.

February 22, 2021

Dataproc 2.0 image version will become the default Dataproc image version in 3 weeks, on March 15, 2021.

February 16, 2021

New sub-minor versions of Dataproc images: 1.3.84-debian10, 1.3.84-ubuntu18, 1.4.55-debian10, 1.4.55-ubuntu18, 1.5.30-centos8, 1.5.30-debian10, 1.5.30-ubuntu18, 2.0.3-debian10, and 2.0.3-ubuntu18

Fixed a bug that prevented Dataproc on GKE cluster creation.

February 15, 2021

Dataproc 2.0 image version will become the default Dataproc image version in 4 weeks, on March 15, 2021.

February 09, 2021

New sub-minor versions of Dataproc images: 2.0.2-debian10, and 2.0.2-ubuntu18.

Image 2.0:

  • Upgraded Spark built-in Hive to version 2.3.8.
  • Upgraded Druid to version 0.20.1
  • HIVE-24436: Fixed Avro NULL_DEFAULT_VALUE compatibility issue.
  • SQOOP-3485: Fixed Avro NULL_DEFAULT_VALUE compatibility issue.
  • SQOOP-3447: Removed usage of org.codehaus.jackson and org.json packages.

Fixed a bug for beta clusters using a Dataproc Metastore Service where using a subnetwork for the cluster resulted in an error.

February 08, 2021

Dataproc 2.0 image version will become the default Dataproc image version in 5 weeks, on March 15, 2021.

February 01, 2021

Dataproc 2.0 image version will become the default Dataproc image version in 3 weeks, on February 22, 2021.

January 29, 2021

New sub-minor versions of Dataproc images: 1.3.83-debian10, 1.3.83-ubuntu18, 1.4.54-debian10, 1.4.54-ubuntu18, 1.5.29-centos8, 1.5.29-debian10, 1.5.29-ubuntu18, 2.0.1-debian10, and 2.0.1-ubuntu18.

Image 2.0:

January 25, 2021

Dataproc 2.0 image version will become the default Dataproc image version in 4 weeks, on February 22, 2021.

January 22, 2021

Announcing the General Availability (GA) release of Dataproc 2.0 images. This image will become the default Dataproc image version on February 22, 2021.

2.0 image clusters:

You can no longer pass the dataproc:dataproc.worker.custom.init.actions.mode property when creating a 2.0 image cluster. For 2.0+ image clusters, dataproc:dataproc.worker.custom.init.actions.mode is set to RUN_BEFORE_SERVICES. For more information, see Important considerations and guidelines—Initialization processing.

2.0 image clusters:

In 2.0 clusters, yarn.nm.liveness-monitor.expiry-interval-ms is set to 15000 (15 seconds). If the resource manager does not receive a heartbeat from a NodeManager during this period, it marks the NodeManager as LOST. This setting is important for clusters that use preemptible VMs. Usually, NodeManagers unregister with the resource manager when their VMs shut down, but in the rare cases when they are shut down ungracefully, it is important for the resource manager to notice this quickly.
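The expiry rule above can be pictured with a small Python sketch. This is a simplified model for illustration, not YARN's implementation: the function name, the "RUNNING" label, and the millisecond timestamps are assumptions made for the example.

```python
# Value of yarn.nm.liveness-monitor.expiry-interval-ms on 2.0 image clusters.
EXPIRY_INTERVAL_MS = 15_000


def node_manager_state(last_heartbeat_ms, now_ms, expiry_ms=EXPIRY_INTERVAL_MS):
    """Return "LOST" if no heartbeat arrived within the expiry interval.

    Simplified model: a NodeManager is considered LOST once more than
    expiry_ms milliseconds have passed since its last heartbeat.
    """
    return "LOST" if now_ms - last_heartbeat_ms > expiry_ms else "RUNNING"


# A node that last reported 20 seconds ago is past the 15-second window.
print(node_manager_state(last_heartbeat_ms=0, now_ms=20_000))
```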

New sub-minor versions of Dataproc images: 1.3.82-debian10, 1.3.82-ubuntu18, 1.4.53-debian10, 1.4.53-ubuntu18, 1.5.28-centos8, 1.5.28-debian10, 1.5.28-ubuntu18, 2.0.0-debian10, and 2.0.0-ubuntu18.

Fixed a bug affecting cluster scale-down: If Dataproc was unable to verify whether a master node exists, for example when hitting Compute Engine read quota limits, it would erroneously put the cluster into an ERROR state.

January 15, 2021

Announcing the Beta release of Dataproc Service Account Based Secure Multi-tenancy, which allows you to share a cluster with multiple users. With secure multi-tenancy, users can submit interactive workloads to the cluster with isolated user identities.

New sub-minor versions of Dataproc images: 1.3.81-debian10, 1.3.81-ubuntu18, 1.4.52-debian10, 1.4.52-ubuntu18, 1.5.27-centos8, 1.5.27-debian10, 1.5.27-ubuntu18, 2.0.0-RC23-debian10, and 2.0.0-RC23-ubuntu18.

Image 2.0 preview:

  • Upgraded Spark to version 3.1.0 RC1.

  • Upgraded Zeppelin to version 0.9.0.

  • Upgraded Cloud Storage Connector to version 2.2.0.

  • Upgraded JupyterLab to version 3.0.

The gcloud_dataproc_personal_cluster.py tool for the personal auth beta is no longer supported for new images. It will be replaced by an equivalent set of commands in an upcoming gcloud release.

January 12, 2021

Added support for user configuration of Compute Engine Shielded VMs in a Dataproc Cluster.

January 08, 2021

Added support for new persistent disk type, pd-balanced.

New sub-minor versions of Dataproc images: 1.3.80-debian10, 1.3.80-ubuntu18, 1.4.51-debian10, 1.4.51-ubuntu18, 1.5.26-centos8, 1.5.26-debian10, 1.5.26-ubuntu18, 2.0.0-RC22-debian10, and 2.0.0-RC22-ubuntu18.

Image 2.0 preview:

  • Upgraded Delta Hive connector to version 0.2.0.
  • Upgraded Flink to version 1.12.0.
  • Updated Iceberg to version 0.10.0.

Image 2.0 preview:

HIVE-21646: Tez: Prevent TezTasks from escaping thread logging context

December 17, 2020

New sub-minor versions of Dataproc images: 1.3.79-debian10, 1.3.79-ubuntu18, 1.4.50-debian10, 1.4.50-ubuntu18, 1.5.25-centos8, 1.5.25-debian10, 1.5.25-ubuntu18, 2.0.0-RC21-debian10, and 2.0.0-RC21-ubuntu18.

Image 2.0 preview:

Changed the default value of Spark SQL property spark.sql.autoBroadcastJoinThreshold to 0.75% of executor memory.
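To give a feel for the size of this threshold, the sketch below computes 0.75% of a given executor memory in bytes. The function name and the 4 GiB executor size are hypothetical, and the note does not say which memory figure Dataproc uses or how it rounds, so treat the result as an approximation.

```python
def broadcast_join_threshold_bytes(executor_memory_bytes, basis_points=75):
    """Return a fraction of executor memory in bytes.

    basis_points=75 corresponds to 0.75%. Integer arithmetic is used to
    avoid floating-point rounding surprises.
    """
    return executor_memory_bytes * basis_points // 10_000


# Hypothetical executor with 4 GiB of memory.
four_gib = 4 * 1024**3
print(broadcast_join_threshold_bytes(four_gib))
```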

Fixed SPARK-32436: Initialize numNonEmptyBlocks in HighlyCompressedMapStatus.readExternal

Image 1.4-1.5:

Fixed a NullPointerException in a primary worker shuffle that occurred when the BypassMergeSortShuffleWriter is used and some output partitions are empty.

Images 1.5-2.0 preview:

Fixed ZOOKEEPER-1936: Server exits when unable to create data directory due to race condition.

Fixed a bug where Dataproc agent logs had separate entries for exception stack trace in StackDriver.

December 15, 2020

Announcing the Beta release of the Dataproc cluster Stop/Start.

Announcing the General Availability (GA) release of the Dataproc Workflow Timeout feature, which allows users to set a timeout on their graph of jobs and automatically cancel their workflow after a specified period.

December 08, 2020

Restartable jobs: Added the ability for users to specify the maximum number of total failures when a job is submitted.

Image 2.0 preview

  • Using the n1-standard-1 machine type is no longer supported.

  • Changed default values of Spark SQL properties:

    • spark.sql.adaptive.enabled=true
    • spark.sql.autoBroadcastJoinThreshold=<2% of executor memory>

The Dataproc Metastore Service is now available in the us-east4, europe-west2, asia-northeast1, and australia-southeast1 regions in addition to the existing us-central1 region.

New sub-minor versions of Dataproc images: 1.3.78-debian10, 1.3.78-ubuntu18, 1.4.49-debian10, 1.4.49-ubuntu18, 1.5.24-debian10, 1.5.24-ubuntu18, 2.0.0-RC20-debian10, and 2.0.0-RC20-ubuntu18.

Image 1.5:

November 16, 2020

New sub-minor versions of Dataproc images: 1.3.77-debian10, 1.3.77-ubuntu18, 1.4.48-debian10, 1.4.48-ubuntu18, 1.5.23-debian10, 1.5.23-ubuntu18, 2.0.0-RC19-debian10, and 2.0.0-RC19-ubuntu18.

Image 2.0 preview

  • Upgraded Hue to version 4.8.0

November 09, 2020

Clusters that use Dataproc Metastore must be created in the same region as the Dataproc Metastore service that they will use.

New sub-minor versions of Dataproc images: 1.3.76-debian10, 1.3.76-ubuntu18, 1.4.47-debian10, 1.4.47-ubuntu18, 1.5.22-debian10, 1.5.22-ubuntu18, 2.0.0-RC18-debian10, and 2.0.0-RC18-ubuntu18.

Image 2.0 preview

Fixed a bug where the Jupyter optional component depended on the availability of GitHub at cluster creation time.

October 30, 2020

Added a dataproc:dataproc.cooperative.multi-tenancy.user.mapping cluster property which takes a list of comma-separated user-to-service account mappings. If a cluster is created with this property set, when a user submits a job, the cluster will attempt to impersonate the corresponding service account when accessing Cloud Storage through the Cloud Storage connector. This feature requires Cloud Storage connector version 2.1.4 or higher.
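A minimal sketch of how such a property value might be assembled follows. The note above only says the value is a comma-separated list of mappings; the `user:service-account` pair syntax and the example identities are assumptions made for illustration.

```python
USER_MAPPING_PROPERTY = "dataproc:dataproc.cooperative.multi-tenancy.user.mapping"


def user_mapping_property(mapping):
    """Format a user-to-service-account mapping as a cluster property.

    Assumes each entry is written as `user:service-account`, with entries
    joined by commas. The pair syntax is an assumption, not taken from the
    release note.
    """
    pairs = ",".join(f"{user}:{sa}" for user, sa in mapping.items())
    return f"{USER_MAPPING_PROPERTY}={pairs}"


# Hypothetical users and service accounts.
prop = user_mapping_property({
    "alice@example.com": "sa-alice@example-project.iam.gserviceaccount.com",
    "bob@example.com": "sa-bob@example-project.iam.gserviceaccount.com",
})
print(prop)
```

Because the value itself contains commas, passing it on a command line may need extra escaping; that is outside the scope of this sketch.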

New sub-minor versions of Dataproc images: 1.3.75-debian10, 1.3.75-ubuntu18, 1.4.46-debian10, 1.4.46-ubuntu18, 1.5.21-debian10, 1.5.21-ubuntu18, 2.0.0-RC17-debian10, and 2.0.0-RC17-ubuntu18.

Fixed a bug in the HBase optional component on HA clusters in which hbase.rootdir was always configured to be hdfs://${CLUSTER_NAME}-m-0:8020/hbase, which assumes that master 0 is the active namenode. Now it is configured to be hdfs://${CLUSTER_NAME}:8020/hbase, so that the active master is always chosen.

Image 1.3 to 2.0 preview:

Fixed HIVE-19202: CBO failed due to NullPointerException in HiveAggregate.isBucketedInput().

Image 2.0 preview:

Fixed HADOOP-15124: Slow FileSystem.Statistics counters implementation.

October 23, 2020

Decreased the minimum allowed value of Dataproc Scheduled Deletion LifecycleConfig.idleDeleteTtl (Dataproc API) and --max-idle flag (gcloud command-line tool) from 10 minutes to 5 minutes.

New sub-minor versions of Dataproc images: 1.3.74-debian10, 1.3.74-ubuntu18, 1.4.45-debian10, 1.4.45-ubuntu18, 1.5.20-debian10, 1.5.20-ubuntu18, 2.0.0-RC16-debian10, and 2.0.0-RC16-ubuntu18.

2.0 preview image versions:

Pin MySQL Java connector version to prevent breakage of the /usr/share/java/mysql-connector-java.jar symlink on long-running and old clusters caused by auto-upgrade to a new MySQL Java connector.

Requests to create or update a sole-tenant node cluster that use preemptible secondary workers, or that attach autoscaling policies that create preemptible secondary workers, are now correctly rejected.

All image versions:

  • Fixed a bug where files uploaded to Cloud Storage through the JupyterLab UI were incorrectly base64 encoded.

1.4 and 1.5 image versions:

  • SPARK-32708: Fixed SparkSQL query optimization failure to reuse exchange with DataSourceV2.

October 22, 2020

Announcing the Alpha release of the Dataproc Persistent History Server, which provides a UI to view job history for jobs run on active and deleted Dataproc clusters.

October 19, 2020

October 16, 2020

October 13, 2020

New sub-minor versions of Dataproc images: 1.3.72-debian10, 1.3.72-ubuntu18, 1.4.43-debian10, 1.4.43-ubuntu18, 1.5.18-debian10, 1.5.18-ubuntu18, 2.0.0-RC14-debian10, and 2.0.0-RC14-ubuntu18.

October 06, 2020

New sub-minor versions of Dataproc images: 1.3.71-debian10, 1.3.71-ubuntu18, 1.4.42-debian10, 1.4.42-ubuntu18, 1.5.17-debian10, 1.5.17-ubuntu18, 2.0.0-RC13-debian10, and 2.0.0-RC13-ubuntu18.

Image 1.4

Image 1.5

  • Upgraded Spark to version 2.4.7.
  • Installed google-cloud-bigquery-storage package by default in the Anaconda component.
  • Changed default value of zeppelin.notebook.storage in zeppelin-site.xml to "org.apache.zeppelin.notebook.repo.GCSNotebookRepo".

Image 2.0

  • Updated HBase to version 2.2.6.
  • Installed google-cloud-bigquery-storage in default conda environment.
  • Changed default value of zeppelin.notebook.storage in zeppelin-site.xml to "org.apache.zeppelin.notebook.repo.GCSNotebookRepo".

October 01, 2020

Launched Dataproc integration with Compute Engine sole-tenant nodes, which allows users to create a cluster in a sole-tenant node group.

September 30, 2020

Requests to create clusters and instantiate workflows that succeed even though the requester does not have ActAs permission on the service account now generate a warning field in the audit log request.

New sub-minor versions of Dataproc images: 1.3.70-debian10, 1.3.70-ubuntu18, 1.4.41-debian10, 1.4.41-ubuntu18, 1.5.16-debian10, 1.5.16-ubuntu18, 2.0.0-RC12-debian10, and 2.0.0-RC12-ubuntu18.

All supported images

Upgraded Conscrypt to version 2.5.1.

Image 1.5

Image 1.5 and Image 2.0 Preview

Image 2.0 preview

  • YARN-9607: Auto-configuring rollover-size of IFile format for non-appendable filesystems.

  • YARN-9525: IFile format is not working against s3a remote folder.

September 18, 2020

New sub-minor versions of Dataproc images: 1.3.69-debian10, 1.3.69-ubuntu18, 1.4.40-debian10, 1.4.40-ubuntu18, 1.5.15-debian10, 1.5.15-ubuntu18, 2.0.0-RC11-debian10, and 2.0.0-RC11-ubuntu18.

All image versions

Image 2.0 preview

September 11, 2020

Added the PrivateIpv6GoogleAccess API field to allow configuring IPv6 access to Dataproc clusters.

New sub-minor versions of Dataproc images: 1.3.68-debian10, 1.3.68-ubuntu18, 1.4.39-debian10, 1.4.39-ubuntu18, 1.5.14-debian10, 1.5.14-ubuntu18, 2.0.0-RC10-debian10, and 2.0.0-RC10-ubuntu18.

1.3-1.5 Images:

HIVE-18323: Vectorization: add the support of timestamp in VectorizedPrimitiveColumnReader for parquet

1.5 and 2.0 preview images:

Upgraded the jupyter-core and jupyter-client packages in the 1.5 and 2.0 images to be compatible with the installed notebook package version.

2.0 preview image:

Fixed a regression that could cause clusters to fail to start if user-supplied keystore/truststore are provided when enabling Kerberos.

September 04, 2020

Switched 1.3 and 1.3-debian image version aliases to point to 1.3 Debian 10 images.

When Enhanced Flexibility Mode is enabled, increased app master, task, and Spark stage retries to 10 to improve resiliency of applications to downscaling and preemption of preemptible VMs.

Support more than 8 local SSDs on VMs. Compute Engine supports 16 and 24 SSDs for larger machine types.

Changed secondary workers default boot disk size to 1000 GB in clusters created with 2.0 preview images.

Improved node memory utilization in clusters created with 2.0 preview images.

August 28, 2020

Launched Dataproc Workflow Timeout feature, which allows users to set a timeout on their graph of jobs and automatically cancel their workflow after a specified period.

Dataproc Metastore integration, which allows users to create a cluster using a Dataproc Metastore service as an external metastore, is now available for Alpha release testing.

August 21, 2020

Announcing the Beta release of Dataproc Enhanced Flexibility Mode (EFM), which manages shuffle data to minimize job progress delays caused by the removal of nodes from a running cluster.

New sub-minor versions of Dataproc images: 1.3.67-debian10, 1.3.67-ubuntu18, 1.4.38-debian10, 1.4.38-ubuntu18, 1.5.13-debian10, 1.5.13-ubuntu18, 2.0.0-RC9-debian10, and 2.0.0-RC9-ubuntu18.

Image 1.3 and 1.4: Upgraded the Cloud Storage connector to version 1.9.18.

Changed 1.4 and 1.4-debian image version aliases to point to 1.4-debian10. The version name 1.4-debian9 will continue to be available, but it won't get updates in future releases.

August 17, 2020

Launched new Personal Cluster Authentication feature, which allows the creation of single-user clusters that can access Cloud Storage using the user's own credentials instead of a VM service account.

August 14, 2020

Dataproc quotas are now regional: each region now has its own quota, which can be separately adjusted. All existing quota overrides have been migrated; customer action is not required.

Enabled Spark SQL parquet metadata cache (removed spark.sql.parquet.cacheMetadata=false from Spark default configuration).

New sub-minor versions of Dataproc images: 1.3.66-debian10, 1.3.66-ubuntu18, 1.4.37-debian10, 1.4.37-ubuntu18, 1.5.12-debian10, 1.5.12-ubuntu18, 2.0.0-RC8-debian10, and 2.0.0-RC8-ubuntu18.

Image 1.4:

  • Fixed a bug in Spark EFM HCFS shuffle where failures after partial commits are not recoverable.
  • Upgraded Spark to version 2.4.6.

Image 1.5:

  • Fixed a bug in Spark EFM HCFS shuffle where failures after partial commits are not recoverable.
  • Upgraded Spark to version 2.4.6.
  • Upgraded Zeppelin to version 0.9.0-preview2.
  • Included all plugins in Zeppelin installation by default.

Image 2.0 preview:

August 03, 2020

Dataproc users are required to have service account ActAs permission to deploy Dataproc resources, for example, to create clusters and submit jobs. See Managing service account impersonation for more information.

Opt-in for existing Dataproc customers: This change does not automatically apply to current Dataproc customers without ActAs permission. To opt in, see Securing Dataproc, Dataflow, and Cloud Data Fusion.

July 31, 2020

Enabled Kerberos automatic-configuration feature. When creating a cluster, users can enable Kerberos by setting the dataproc:kerberos.beta.automatic-config.enable cluster property to true. When using this feature, users do not need to specify the Kerberos root principal password with the --kerberos-root-principal-password and --kerberos-kms-key-uri flags.

New sub-minor versions of Dataproc images: 1.3.65-debian10, 1.3.65-ubuntu18, 1.4.36-debian10, 1.4.36-ubuntu18, 1.5.11-debian10, 1.5.11-ubuntu18, 2.0.0-RC7-debian10, and 2.0.0-RC7-ubuntu18.

1.3+ images (includes Preview image):

  • HADOOP-16984: Added support to read history files only from the done directory.

  • MAPREDUCE-7279: Display the Resource Manager name on the HistoryServer web page.

  • SPARK-32135: Show the Spark driver name on the Spark history web page.

  • SPARK-32097: Allow reading Spark history log files via the Spark history server from multiple directories.

Images 1.3 - 1.5:

  • HIVE-20600: Fixed Hive Metastore connection leak.

Images 1.5 - 2.0 preview:

Fixed an issue where optional components that depend on HDFS failed on single node clusters.

Fixed an issue that caused workflows to be stuck in the RUNNING state when managed clusters (created by the workflow) were deleted while the workflow was running.

July 24, 2020

Terminals started in Jupyter and JupyterLab now use login shells. The terminals behave as if you SSH'd into the cluster as root.

Upgraded the jupyter-gcs-contents-manager package to the latest version. This upgrade fixes a bug in which an attempt to create a file in the virtual top-level directory returned a 404 (NOT FOUND) error message instead of the expected 403 (PERMISSION DENIED) error message.

New sub-minor versions of Dataproc images: 1.3.64-debian10, 1.3.64-ubuntu18, 1.4.35-debian10, 1.4.35-ubuntu18, 1.5.10-debian10, 1.5.10-ubuntu18, 2.0.0-RC6-debian10, and 2.0.0-RC6-ubuntu18.

Fixed a bug in which the HDFS DataNode daemon was enabled on secondary workers but not started (except on VM reboot if started automatically by systemd).

Fixed a bug in which StartLimitIntervalSec=0 appeared in the Service section instead of the Unit section for systemd services, which disabled rate limiting for retries when systemd restarted a service.

July 17, 2020

Dataproc now uses Shielded VMs for Debian 10 and Ubuntu 18.04 clusters by default.

Component Gateway now accepts the Proxy-Authorization header in place of Authorization to authenticate programmatic API calls. If Proxy-Authorization is set to a bearer token, Component Gateway will forward the Authorization header to the backend if it does not contain a bearer token.

For example, this allows setting both Proxy-Authorization: Bearer <google-access-token> as well as setting Authorization: Basic ... to authenticate to HiveServer2 with HTTP basic auth.
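The header combination described above could be assembled as in the following Python sketch. The function name and the token, user name, and password values are hypothetical, and the sketch only builds the header dictionary without sending a request.

```python
import base64


def component_gateway_headers(access_token, username, password):
    """Build headers that carry two credentials at once.

    Proxy-Authorization holds the Google access token used to authenticate
    to Component Gateway; Authorization holds HTTP basic credentials meant
    for the backend (for example, HiveServer2).
    """
    basic = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {
        "Proxy-Authorization": f"Bearer {access_token}",
        "Authorization": f"Basic {basic}",
    }


# Placeholder credentials, for illustration only.
headers = component_gateway_headers("example-access-token", "user", "pass")
print(headers)
```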

Added support for Zeppelin Spark and shell interpreters in Kerberized clusters by default.

New sub-minor versions of Dataproc images: 1.3.63-debian10, 1.3.63-ubuntu18, 1.4.34-debian10, 1.4.34-ubuntu18, 1.5.9-debian10, 1.5.9-ubuntu18, 2.0.0-RC5-debian10, and 2.0.0-RC5-ubuntu18.

Image 2.0 preview:

If a project's regional Dataproc staging bucket is manually deleted, it will be recreated automatically when a cluster is subsequently created in that region.

July 10, 2020

Added --temp-bucket flag to gcloud dataproc clusters create and gcloud dataproc workflow-templates set-managed-cluster to allow users to configure a Cloud Storage bucket that stores ephemeral cluster and jobs data, such as Spark and MapReduce history files.

Extended Jupyter to support notebooks stored on VM persistent disk. This change modifies the Jupyter contents manager to create two virtual top-level directories, named GCS and Local Disk. The GCS directory points to the Cloud Storage location used by previous versions, and the Local Disk directory points to the persistent disk of the VM running Jupyter.

Dataproc images now include the oauth2l command line tool. The tool is installed in /usr/local/bin, which is available to all users in the VM.

New sub-minor versions of Dataproc images: 1.2.102-debian9, 1.3.62-debian9, 1.4.33-debian9, 1.3.62-debian10, 1.4.33-debian10, 1.5.8-debian10, 1.3.62-ubuntu18, 1.4.33-ubuntu18, 1.5.8-ubuntu18, 2.0.0-RC4-debian10, 2.0.0-RC4-ubuntu18

  • Images 1.3 - 1.5:

    • Fixed HIVE-11920: ADD JAR failing with URL schemes other than file/ivy/hdfs.
  • Images 1.3 - 2.0 preview:

    • Fixed TEZ-4108: NullPointerException during speculative execution race condition.

Fixed a race condition that could nondeterministically cause Hive-WebHCat to fail at startup when HBase is not enabled.

July 07, 2020

Announcing the General Availability (GA) release of Dataproc Component Gateway, which provides secure access to web endpoints for Dataproc default and optional components.

June 24, 2020

New subminor image versions: 1.2.100-debian9, 1.3.60-debian9, 1.4.31-debian9, 1.3.60-debian10, 1.4.31-debian10, 1.5.6-debian10, 1.3.60-ubuntu18, 1.4.31-ubuntu18, 1.5.6-ubuntu18, preview 2.0.0-RC2-debian10, and preview 2.0.0-RC2-ubuntu18.

  • Image 2.0 preview:

    • SPARK-22404: Set the spark.yarn.unmanagedAM.enabled property to true on clusters where Kerberos is not enabled. This runs the Spark Application Master in the driver (not managed in YARN) to improve job execution time.
    • Updated R to version 3.6.
    • Updated Spark to version 3.0.0.

  • Image 1.5

    • Updated R version to 3.6

Fixed a quota validation bug where accelerator counts were squared before validation. For example, previously if you requested 8 GPUs, Dataproc validated whether your project had quota for 8^2=64 GPUs.
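The effect of the bug can be reproduced with a few lines of arithmetic. The sketch below is purely illustrative and is not Dataproc's validation code; the function names and the quota of 16 GPUs are hypothetical.

```python
def required_quota_buggy(requested):
    """Old behavior: the accelerator count was squared before validation."""
    return requested ** 2


def required_quota_fixed(requested):
    """Fixed behavior: validate against the requested count itself."""
    return requested


def passes_validation(requested, available, required_fn):
    """Return True if the project's available quota covers the request."""
    return required_fn(requested) <= available


# Hypothetical project with quota for 16 GPUs, requesting 8.
print(passes_validation(8, 16, required_quota_buggy))  # 64 > 16, rejected
print(passes_validation(8, 16, required_quota_fixed))  # 8 <= 16, accepted
```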

June 11, 2020

Users can now configure a tempBucket in API calls. The temp bucket is a Cloud Storage bucket used to store ephemeral cluster and jobs data, such as Spark and MapReduce history files. If you do not specify a temp bucket, Dataproc will determine a Cloud Storage location (US, ASIA, or EU) for your cluster's temp bucket according to the Compute Engine zone where your cluster is deployed, and then create and manage this project-level, per-location bucket.

  • New subminor image versions: 1.2.99-debian9, 1.3.59-debian9, 1.4.30-debian9, 1.3.59-debian10, 1.4.30-debian10, 1.5.5-debian10, 1.3.59-ubuntu18, 1.4.30-ubuntu18, and 1.5.5-ubuntu18.

  • New preview images 2.0.0-RC1-debian10 and 2.0.0-RC1-ubuntu18, with the following components:

    • Anaconda 2019.10
    • Atlas 2.0.0
    • Druid 0.18.1
    • Flink 1.10.1
    • Hadoop 3.2.1
    • HBase 2.2.4
    • Hive 3.1.2 (with LLAP support)
    • Hue 4.7.0
    • JupyterLab 2.1.0
    • Kafka 2.3.1
    • Miniconda3 4.8.3
    • Pig 0.18.0
    • Presto SQL 333
    • Oozie 5.2.0
    • R 3.6.0
    • Ranger 2.0.0
    • Solr 8.1.1
    • Spark 3.0.0
    • Sqoop 1.5.0
    • Zeppelin 0.9.0
  • Image 1.3+

    • Patched HIVE-23496: Added a flag to disable materialized views cache warm-up.

The JVM and runtime properties of Druid's Historical and Broker processes are now calculated using server resources. Previously, only the MaxHeapSize property of the Historical and MiddleManager processes was calculated using server resources. This change modifies how new values for the MaxHeapSize and MaxDirectMemorySize properties are calculated for Broker and Historical processes. Also, the new runtime properties druid.processing.numThreads and druid.processing.numMergeBuffers are calculated using server resources.

If the project-level staging bucket is manually deleted, it will be recreated when a cluster is created.

Dataproc Job container logging now supports Dataproc Kerberized clusters.

Image 1.5:

  • Fixed a bug that prevented users from logging on to the Presto UI when using Component Gateway.

June 08, 2020

Dataproc is now available in the asia-southeast2 region (Jakarta).

May 27, 2020

Dataproc now provides beta support for Dataproc Hub.

May 21, 2020

You can now set core:fs.defaultFS to a location in Cloud Storage (for example, gs://bucket) when creating a cluster to set Cloud Storage as the default filesystem. This also sets core:fs.gs.reported.permissions, the reported permission returned by the Cloud Storage connector for all files, to 777. If Cloud Storage is not set as the default filesystem, this property will continue to return 700, the default value.
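The relationship between the two properties can be summarized in a short Python sketch. The helper names and the bucket name are hypothetical; the sketch only models the behavior stated above and does not query a cluster.

```python
def default_fs_properties(bucket):
    """Cluster properties that make a Cloud Storage bucket the default filesystem."""
    return {"core:fs.defaultFS": f"gs://{bucket}"}


def reported_permissions(default_fs):
    """Value of core:fs.gs.reported.permissions implied by the default filesystem.

    Per the note above: 777 when Cloud Storage is the default filesystem,
    otherwise the default value of 700.
    """
    return "777" if default_fs.startswith("gs://") else "700"


# Hypothetical bucket name.
props = default_fs_properties("example-bucket")
print(props, reported_permissions(props["core:fs.defaultFS"]))
```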

Image 1.4 and 1.5

HADOOP-16984: Enable persistent history server to read from done directory.

New sub-minor versions of Dataproc images: 1.2.98-debian9, 1.3.58-debian9, 1.4.29-debian9, 1.3.58-debian10, 1.4.29-debian10, 1.5.4-debian10, 1.3.58-ubuntu18, 1.4.29-ubuntu18, 1.5.4-ubuntu18.

Image 1.3, 1.4, and 1.5

  • Restrict Jupyter, Zeppelin, and Knox to only accept connections from localhost when Component Gateway is enabled. This restriction reduces the risk of remote code execution over unsecured notebook server APIs. To override this change, when you create the cluster, set the Jupyter, Zeppelin, and Knox cluster properties, respectively, as follows: dataproc:jupyter.listen.all.interfaces=true, zeppelin:zeppelin.server.addr=0.0.0.0, and knox:gateway.host=0.0.0.0.

  • Upgrade Hive to version 2.3.7.

Image 1.4 and 1.5

SPARK-29367: Add ARROW_PRE_0_15_IPC_FORMAT=1 in yarn-env.sh to fix the Pandas UDF issue with pyarrow 0.15.

Image 1.5

Hide the "Quit" button from Jupyter notebook (c.NotebookApp.quit_button = False) when using the Jupyter optional component. The Jupyter environment is shut down when the cluster is deleted.

Set the hive.localize.resource.num.wait.attempts property to 25 to improve reliability of Hive queries.

Image 1.5

Fix a race condition in which hbase-master would try to write /hbase/.tmp/hbase.version to HDFS before HDFS was initialized. This can increase cluster creation time for clusters created with HBase.

  • Fix a race condition in which, when the am.primary_only property is provided, the "non-preemptible" node label was not added to the resource manager's node label store before node managers started registering with the resource manager.

  • Store resource manager node labels in Cloud Storage when am.primary_only property is provided.

The dataproc:alpha.state.shuffle.hcfs.enabled cluster property has been deprecated. To enable Enhanced Flexibility Mode (EFM) for Spark, set dataproc:efm.spark.shuffle=hcfs. To enable EFM for MapReduce, set dataproc:efm.mapreduce.shuffle=hcfs.

May 05, 2020

Clusters can now be created with non-preemptible secondary workers.

May 01, 2020

Announcing the Beta release of Dataproc Component Gateway, which provides secure access to web endpoints for Dataproc default and optional components.

April 27, 2020

Dataproc on GKE version 1.4.27-beta is available with minor fixes.

April 24, 2020

Image 1.5

Upgraded Delta Lake to version 0.5.0. Delta Lake Hive Connector 0.1.0 is also added to the 1.5 image.

Customers can now adjust the amount of time the Dataproc startup script will wait for Presto Coordinator service to start before deciding that their startup has succeeded. This is set via dataproc:startup.component.service-binding-timeout.presto-coordinator property and takes a value in seconds. The maximum respected value is 1800 (30 minutes).
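As a sketch of how a client-side helper might prepare this property, the following Python function formats the timeout and caps it at the stated maximum. Capping locally is an assumption about what "maximum respected value" implies; the note does not say how the service treats larger values. The function name is hypothetical.

```python
PRESTO_TIMEOUT_PROPERTY = (
    "dataproc:startup.component.service-binding-timeout.presto-coordinator"
)
MAX_TIMEOUT_SECONDS = 1800  # 30 minutes, the maximum respected value


def presto_timeout_property(seconds):
    """Format the timeout property, capping the value at the stated maximum."""
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    effective = min(seconds, MAX_TIMEOUT_SECONDS)
    return f"{PRESTO_TIMEOUT_PROPERTY}={effective}"


print(presto_timeout_property(600))
print(presto_timeout_property(7200))  # capped at 1800 in this sketch
```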

New sub-minor image versions: 1.2.96-debian9, 1.3.56-debian9, 1.4.27-debian9, 1.3.56-debian10, 1.4.27-debian10, 1.5.2-debian10, 1.3.56-ubuntu18, 1.4.27-ubuntu18, 1.5.2-ubuntu18

Image 1.5

Cloud Storage connector upgraded to version 2.1.2 (for more information, review the change notes in the GitHub repository)

Image 1.5

Notebook bug fixes: fixed a bug in Zeppelin and Jupyter that resulted in a failure to render images when using Component Gateway. Also fixed a Jupyter Notebooks bug that caused notebook downloads to fail.

April 20, 2020

Dataproc is now available in the us-west4 region (Las Vegas).

April 17, 2020

Announcing the Beta release of Dataproc on Google Kubernetes Engine. Customers can now create Dataproc on GKE clusters to run Spark jobs on Kubernetes via the Dataproc jobs API.

April 15, 2020

Image 1.5

Jupyter on Dataproc now supports exporting notebooks as PDFs.

Image 1.5

Presto now includes two default catalogs:

  • bigquery pointing to the datasets of the cluster's project

  • bigquery_public_data pointing to the public datasets

Image 1.3, 1.4 and 1.5

Added Component Gateway support for Dataproc clusters secured with Kerberos.

New sub-minor versions of Dataproc images: 1.2.95-debian9, 1.3.55-debian9, 1.4.26-debian9, 1.3.55-debian10, 1.4.26-debian10, 1.5.1-debian10, 1.3.55-ubuntu18, 1.4.26-ubuntu18, 1.5.1-ubuntu18.

Image 1.5

Updated Presto to version 331.

Created cloud-sql-proxy log file for the Cloud SQL Proxy initialization action and for Dataproc clusters with Ranger that use Cloud SQL Proxy.

Image 1.3 and 1.4

Debian 10 images will become the default images for the 1.3 and 1.4 image tracks, and Debian 9 images will no longer be released for these tracks after June 30, 2020.

Images 1.4 and 1.5

SPARK-29080: Support R file extension case-insensitively when submitting Spark R jobs.

Image 1.3, 1.4 and 1.5

Fixed a bug where Jupyter was unable to read and write notebooks stored in Cloud Storage buckets with CMEK enabled.

Image 1.3, 1.4 and 1.5

HIVE-17275: Auto-merge fails on writes of UNION ALL output to ORC file with dynamic partitioning.

April 03, 2020

Added Presto and SparkR job type support to Dataproc Workflows.

Fixed an Auto Zone Placement bug that incorrectly returned INVALID_ARGUMENT errors as INTERNAL errors, and didn't propagate these error messages to the user.

April 01, 2020

Announcing the General Availability (GA) release of Dataproc Presto job type, which can be submitted to a cluster using the gcloud dataproc jobs submit presto command. Note: The Dataproc Presto Optional Component must be enabled when the cluster is created to submit a Presto job to the cluster.

March 25, 2020

Added pagination support to the Clusters List methods, which makes the API's pageSize parameter functional. This feature allows users to specify a page size to receive paginated data in the response. The default page size is 200 and the maximum page size is 1000.
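The paging behavior can be pictured with a small simulation. The code below does not call the Dataproc API: `list_clusters_page` is a hypothetical stand-in for a list call, the numeric page token is invented for the example, and clamping oversized page sizes to the maximum is an assumption.

```python
DEFAULT_PAGE_SIZE = 200
MAX_PAGE_SIZE = 1000


def list_clusters_page(all_clusters, page_size=None, page_token=None):
    """Simulated list call that returns (clusters, next_page_token)."""
    # Fall back to the default size, and keep the size within 1..MAX_PAGE_SIZE.
    size = max(1, min(page_size or DEFAULT_PAGE_SIZE, MAX_PAGE_SIZE))
    start = int(page_token) if page_token else 0
    page = all_clusters[start:start + size]
    next_start = start + size
    next_token = str(next_start) if next_start < len(all_clusters) else None
    return page, next_token


def list_all_clusters(all_clusters, page_size=None):
    """Follow page tokens until the simulated service reports no more pages."""
    results, token = [], None
    while True:
        page, token = list_clusters_page(all_clusters, page_size, token)
        results.extend(page)
        if token is None:
            return results


# 450 hypothetical clusters arrive as pages of 200, 200, and 50.
clusters = [f"cluster-{i}" for i in range(450)]
print(len(list_all_clusters(clusters)))
```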

Added alphabetical sort order to Workflow Templates List methods.

Dataproc clusters can now be created on the GKE platform by setting the GkeClusterConfig instead of the GceClusterConfig via the Beta API. This feature allows jobs to be submitted that will run on the Kubernetes cluster.

Announcing the General Availability (GA) release of Dataproc 1.5 images.

New sub-minor versions of Dataproc images: 1.2.94-debian9, 1.3.54-debian9, 1.4.25-debian9, 1.5.0-debian10, 1.3.54-ubuntu18, 1.4.25-ubuntu18, and 1.5.0-ubuntu18

Image 1.5

Upgraded the Cloud Storage connector to version 2.1.1.

Images 1.2 and 1.4

Dataproc 1.4 will be the default image version after April 30, 2020.

Dataproc 1.2 will have no further releases after June 30, 2020.

Images 1.3, 1.4, and 1.5

Fixed HDFS UI in the Component Gateway on HA clusters.

Fixed issue where Jupyter hangs when loading a directory containing many large files. This also improves responsiveness when listing directories.

March 18, 2020

Added the following flag to the gcloud dataproc clusters update command:

  • --num-secondary-workers

The following flag to gcloud dataproc clusters update has been deprecated:

  • --num-preemptible-workers

See the related change, above, for the new flag to use in place of this deprecated flag.

March 17, 2020

Added a dataproc:job.history.to-gcs.enabled cluster property that allows persisting MapReduce and Spark history files to the Dataproc temp bucket. This property defaults to true for image versions 1.5 and up. Users can override the locations of job history file persistence through the following properties:

  • mapreduce.jobhistory.done-dir
  • mapreduce.jobhistory.intermediate-done-dir
  • spark.eventLog.dir
  • spark.history.fs.logDirectory

Added support for n2-, c2-, e2-, n2d-, and m2- machine types when using Auto Zone Placement. Previously, users could only specify n1- or custom- machine types when using Auto Zone Placement.

Added a mapreduce.jobhistory.always-scan-user-dir cluster property that enables the MapReduce job history server to read the history files (recommended when writing history files to Cloud Storage). The default is true.

Customers can now enable the Cloud Profiler when submitting a Dataproc job by setting the cloud.profiler.enable property. To use profiling, customers must enable the Cloud Profiler API for their project and create the cluster with --scopes=cloud-platform. The following profiling properties can also be set when submitting a Dataproc job:

  1. cloud.profiler.name: to collect profiler data under the specified name. If not specified, it defaults to the job UUID.

  2. cloud.profiler.service.version: to compare profiler information from different job runs. If not specified, defaults to the job UUID.
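The profiler properties above might be collected as in the following Python sketch. The value "true" for cloud.profiler.enable, the helper names, and the example name and version strings are assumptions made for illustration. The sketch only formats a --properties flag value and does not submit a job.

```python
def profiler_job_properties(name=None, service_version=None):
    """Collect Cloud Profiler job properties.

    The optional properties are left out when not given, in which case they
    default to the job UUID as described above.
    """
    props = {"cloud.profiler.enable": "true"}  # assumed value
    if name is not None:
        props["cloud.profiler.name"] = name
    if service_version is not None:
        props["cloud.profiler.service.version"] = service_version
    return props


def to_properties_flag(props):
    """Format a property dict as a comma-separated --properties flag."""
    return "--properties=" + ",".join(f"{k}={v}" for k, v in props.items())


# Hypothetical profiler name and version.
flag = to_properties_flag(profiler_job_properties("nightly-etl", "v2"))
print(flag)
```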

New