This page describes how to run distributed training jobs on Vertex AI.
Code requirements
Use an ML framework that supports distributed training. In your training code, you can use the CLUSTER_SPEC or TF_CONFIG environment variables to reference specific parts of your training cluster.
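For example, a minimal sketch in Python of reading CLUSTER_SPEC on a replica. The top-level keys shown ("cluster" and "task") follow the CLUSTER_SPEC format that Vertex AI sets; treat the specific values in the comments as illustrative:

import json
import os

# Vertex AI sets CLUSTER_SPEC on every replica as a JSON string that
# describes the whole cluster and this replica's task within it.
cluster_spec = json.loads(os.environ.get("CLUSTER_SPEC", "{}"))

# "cluster" maps worker pool names (such as "workerpool0") to lists of
# host:port addresses.
cluster = cluster_spec.get("cluster", {})

# "task" identifies this replica: its worker pool and its index in that pool.
task = cluster_spec.get("task", {})
task_type = task.get("type")    # e.g. "workerpool0" for the primary replica
task_index = task.get("index")  # 0-based index within the worker pool

print(f"Running as {task_type}[{task_index}] in a cluster of {len(cluster)} pool(s)")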
Structure of the training cluster
If you run a distributed training job with Vertex AI, you specify multiple machines (nodes) in a training cluster. The training service allocates the resources for the machine types you specify. Your running job on a given node is called a replica. A group of replicas with the same configuration is called a worker pool.
Each replica in the training cluster is given a single role or task in distributed training. For example:
Primary replica: Exactly one replica is designated the primary replica. This task manages the others and reports status for the job as a whole.
Worker(s): One or more replicas may be designated as workers. These replicas do their portion of the work as you designate in your job configuration.
Parameter server(s): If supported by your ML framework, one or more replicas may be designated as parameter servers. These replicas store model parameters and coordinate shared model state between the workers.
Evaluator(s): If supported by your ML framework, one or more replicas may be designated as evaluators. These replicas can be used to evaluate your model. If you are using TensorFlow, note that TensorFlow generally expects that you use no more than one evaluator.
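If you are using TensorFlow, the TF_CONFIG environment variable encodes these same roles, so a replica can branch on the role it was assigned. A minimal sketch, assuming the role names "chief", "worker", "ps", and "evaluator" that TensorFlow conventionally uses; the work in each branch is a placeholder:

import json
import os

tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
role = tf_config.get("task", {}).get("type", "chief")

if role == "chief":
    # Primary replica: coordinate training, write checkpoints, export the model.
    pass
elif role == "worker":
    # Worker replica: run its shard of the training work.
    pass
elif role == "ps":
    # Parameter server: store variables and serve shared model state.
    pass
elif role == "evaluator":
    # Evaluator: periodically evaluate the latest checkpoint.
    pass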
Configure a distributed training job
You can configure any custom training job as a distributed training job by defining multiple worker pools. You can also run distributed training within a training pipeline or a hyperparameter tuning job.
To configure a distributed training job, define your list of worker pools (workerPoolSpecs[]), designating one WorkerPoolSpec for each type of task:
Position in workerPoolSpecs[] | Task performed in cluster
---|---
First (workerPoolSpecs[0]) | Primary, chief, scheduler, or "master"
Second (workerPoolSpecs[1]) | Secondary, replicas, workers
Third (workerPoolSpecs[2]) | Parameter servers, Reduction Server
Fourth (workerPoolSpecs[3]) | Evaluators
You must specify a primary replica, which coordinates the work done by all the other replicas. Use the first worker pool specification only for your primary replica, and set its replicaCount to 1:
{
  "workerPoolSpecs": [
    // `WorkerPoolSpec` for worker pool 0, primary replica, required
    {
      "machineSpec": {...},
      "replicaCount": 1,
      "diskSpec": {...},
      ...
    },
    // `WorkerPoolSpec` for worker pool 1, optional
    {},
    // `WorkerPoolSpec` for worker pool 2, optional
    {},
    // `WorkerPoolSpec` for worker pool 3, optional
    {}
  ]
  ...
}
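If you use the Vertex AI SDK for Python rather than the REST API directly, the same structure is expressed as a list of dictionaries passed to aiplatform.CustomJob. A minimal sketch; the project, bucket, image URI, machine types, and replica counts are placeholder values:

from google.cloud import aiplatform

# Placeholder project, region, and staging bucket.
aiplatform.init(
    project="your-project",
    location="us-central1",
    staging_bucket="gs://your-bucket",
)

job = aiplatform.CustomJob(
    display_name="distributed-training-job",
    worker_pool_specs=[
        # Worker pool 0: the primary replica; replica_count must be 1.
        {
            "machine_spec": {"machine_type": "n1-standard-8"},
            "replica_count": 1,
            "container_spec": {"image_uri": "gcr.io/your-project/your-training-image"},
        },
        # Worker pool 1: additional workers.
        {
            "machine_spec": {"machine_type": "n1-standard-8"},
            "replica_count": 3,
            "container_spec": {"image_uri": "gcr.io/your-project/your-training-image"},
        },
    ],
)
job.run()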
Specify additional worker pools
Depending on your ML framework, you may specify additional worker pools for other purposes. For example, if you are using TensorFlow, you could specify worker pools to configure worker replicas, parameter server replicas, and evaluator replicas.
The order of the worker pools that you specify in the workerPoolSpecs[] list determines the type of worker pool. To skip a worker pool that you don't want to use while still specifying a later one that you do, set the unused pool to an empty value in the workerPoolSpecs[] list. For example:
If you want to specify a job that has only a primary replica and a parameter server worker pool, you must set an empty value for worker pool 1:
{
  "workerPoolSpecs": [
    // `WorkerPoolSpec` for worker pool 0, required
    {
      "machineSpec": {...},
      "replicaCount": 1,
      "diskSpec": {...},
      ...
    },
    // `WorkerPoolSpec` for worker pool 1, optional
    {},
    // `WorkerPoolSpec` for worker pool 2, optional
    {
      "machineSpec": {...},
      "replicaCount": 1,
      "diskSpec": {...},
      ...
    },
    // `WorkerPoolSpec` for worker pool 3, optional
    {}
  ]
  ...
}
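The same skip pattern applies when you build the worker pool list programmatically: an empty dictionary holds the position of an unused pool. A sketch, reusing the placeholder values from the earlier example and assuming the empty spec is passed through to the API unchanged:

worker_pool_specs = [
    # Worker pool 0: primary replica.
    {
        "machine_spec": {"machine_type": "n1-standard-8"},
        "replica_count": 1,
        "container_spec": {"image_uri": "gcr.io/your-project/your-training-image"},
    },
    # Worker pool 1: left empty so that pool 2 keeps its position in the list.
    {},
    # Worker pool 2: parameter servers.
    {
        "machine_spec": {"machine_type": "n1-standard-4"},
        "replica_count": 1,
        "container_spec": {"image_uri": "gcr.io/your-project/your-training-image"},
    },
]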
Reduce training time with Reduction Server
When you train a large ML model using multiple nodes, communicating gradients between nodes can contribute significant latency. Reduction Server is an all-reduce algorithm that can increase throughput and reduce latency for distributed training. Vertex AI makes Reduction Server available in a Docker container image that you can use for one of your worker pools during distributed training.
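Because Reduction Server occupies the third worker pool (workerPoolSpecs[2], per the table above), a configuration might be sketched as follows. The machine types, replica counts, and accelerator settings are placeholders, and the Reduction Server image URI shown is an assumption based on Vertex AI's documentation; verify it is current before use:

worker_pool_specs = [
    # Worker pool 0: primary replica with GPUs.
    {
        "machine_spec": {
            "machine_type": "n1-standard-16",
            "accelerator_type": "NVIDIA_TESLA_V100",
            "accelerator_count": 2,
        },
        "replica_count": 1,
        "container_spec": {"image_uri": "gcr.io/your-project/your-training-image"},
    },
    # Worker pool 1: GPU workers.
    {
        "machine_spec": {
            "machine_type": "n1-standard-16",
            "accelerator_type": "NVIDIA_TESLA_V100",
            "accelerator_count": 2,
        },
        "replica_count": 3,
        "container_spec": {"image_uri": "gcr.io/your-project/your-training-image"},
    },
    # Worker pool 2: CPU-only reducers running the Reduction Server image.
    {
        "machine_spec": {"machine_type": "n1-highcpu-16"},
        "replica_count": 4,
        "container_spec": {
            # Assumed image URI; check the Vertex AI documentation for the
            # currently published Reduction Server container image.
            "image_uri": "us-docker.pkg.dev/vertex-ai-restricted/training/reductionserver:latest",
        },
    },
]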
To learn about how Reduction Server works, see