This page describes how to run distributed training jobs on Vertex AI.
Code requirements
Use an ML framework that supports distributed training. In your training code, you can use the CLUSTER_SPEC or TF_CONFIG environment variables to reference specific parts of your training cluster.
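For example, a minimal sketch in Python of reading CLUSTER_SPEC on a replica. The top-level keys shown ("cluster" and "task") follow the CLUSTER_SPEC format that Vertex AI sets; treat the specific values in the comments as illustrative:

import json
import os

# Vertex AI sets CLUSTER_SPEC on every replica as a JSON string that
# describes the whole cluster and this replica's task within it.
cluster_spec = json.loads(os.environ.get("CLUSTER_SPEC", "{}"))

# "cluster" maps worker pool names (such as "workerpool0") to lists of
# host:port addresses.
cluster = cluster_spec.get("cluster", {})

# "task" identifies this replica: its worker pool and its index in that pool.
task = cluster_spec.get("task", {})
task_type = task.get("type")    # e.g. "workerpool0" for the primary replica
task_index = task.get("index")  # 0-based index within the worker pool

print(f"Running as {task_type}[{task_index}] in a cluster of {len(cluster)} pool(s)")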
Structure of the training cluster
If you run a distributed training job with Vertex AI, you specify multiple machines (nodes) in a training cluster. The training service allocates the resources for the machine types you specify. Your running job on a given node is called a replica. A group of replicas with the same configuration is called a worker pool.
Each replica in the training cluster is given a single role or task in distributed training. For example:
Primary replica: Exactly one replica is designated the primary replica. This task manages the others and reports status for the job as a whole.
Worker(s): One or more replicas may be designated as workers. These replicas do their portion of the work as you designate in your job configuration.
Parameter server(s): If supported by your ML framework, one or more replicas may be designated as parameter servers. These replicas store model parameters and coordinate shared model state between the workers.
Evaluator(s): If supported by your ML framework, one or more replicas may be designated as evaluators. These replicas can be used to evaluate your model. If you are using TensorFlow, note that TensorFlow generally expects that you use no more than one evaluator.
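If you are using TensorFlow, the TF_CONFIG environment variable encodes these same roles, so a replica can branch on the role it was assigned. A minimal sketch, assuming the role names "chief", "worker", "ps", and "evaluator" that TensorFlow conventionally uses; the work in each branch is a placeholder:

import json
import os

tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
role = tf_config.get("task", {}).get("type", "chief")

if role == "chief":
    # Primary replica: coordinate training, write checkpoints, export the model.
    pass
elif role == "worker":
    # Worker replica: run its shard of the training work.
    pass
elif role == "ps":
    # Parameter server: store variables and serve shared model state.
    pass
elif role == "evaluator":
    # Evaluator: periodically evaluate the latest checkpoint.
    pass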
Configure a distributed training job
You can configure any custom training job as a distributed training job by defining multiple worker pools. You can also run distributed training within a training pipeline or a hyperparameter tuning job.
To configure a distributed training job, define your list of worker pools (workerPoolSpecs[]), designating one WorkerPoolSpec for each type of task:
Position in workerPoolSpecs[] | Task performed in cluster
---|---
First (workerPoolSpecs[0]) | Primary, chief, scheduler, or "master"
Second (workerPoolSpecs[1]) | Secondary, replicas, workers
Third (workerPoolSpecs[2]) | Parameter servers, Reduction Server
Fourth (workerPoolSpecs[3]) | Evaluators
You must specify a primary replica, which coordinates the work done by all the other replicas. Use the first worker pool specification only for your primary replica, and set its replicaCount to 1:
{
  "workerPoolSpecs": [
    // `WorkerPoolSpec` for worker pool 0, primary replica, required
    {
      "machineSpec": {...},
      "replicaCount": 1,
      "diskSpec": {...},
      ...
    },
    // `WorkerPoolSpec` for worker pool 1, optional
    {},
    // `WorkerPoolSpec` for worker pool 2, optional
    {},
    // `WorkerPoolSpec` for worker pool 3, optional
    {}
  ]
  ...
}
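If you use the Vertex AI SDK for Python rather than the REST API directly, the same structure is expressed as a list of dictionaries passed to aiplatform.CustomJob. A minimal sketch; the project, bucket, image URI, machine types, and replica counts are placeholder values:

from google.cloud import aiplatform

# Placeholder project, region, and staging bucket.
aiplatform.init(
    project="your-project",
    location="us-central1",
    staging_bucket="gs://your-bucket",
)

job = aiplatform.CustomJob(
    display_name="distributed-training-job",
    worker_pool_specs=[
        # Worker pool 0: the primary replica; replica_count must be 1.
        {
            "machine_spec": {"machine_type": "n1-standard-8"},
            "replica_count": 1,
            "container_spec": {"image_uri": "gcr.io/your-project/your-training-image"},
        },
        # Worker pool 1: additional workers.
        {
            "machine_spec": {"machine_type": "n1-standard-8"},
            "replica_count": 3,
            "container_spec": {"image_uri": "gcr.io/your-project/your-training-image"},
        },
    ],
)
job.run()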
Specify additional worker pools
Depending on your ML framework, you may specify additional worker pools for other purposes. For example, if you are using TensorFlow, you could specify worker pools to configure worker replicas, parameter server replicas, and evaluator replicas.
The order of the worker pools that you specify in the workerPoolSpecs[] list determines the type of worker pool. To skip a worker pool that you don't want to use while still specifying a later one that you do, set the unused pool to an empty value in the workerPoolSpecs[] list. For example:
If you want to specify a job that has only a primary replica and a parameter server worker pool, you must set an empty value for worker pool 1:
{
  "workerPoolSpecs": [
    // `WorkerPoolSpec` for worker pool 0, required
    {
      "machineSpec": {...},
      "replicaCount": 1,
      "diskSpec": {...},
      ...
    },
    // `WorkerPoolSpec` for worker pool 1, optional
    {},
    // `WorkerPoolSpec` for worker pool 2, optional
    {
      "machineSpec": {...},
      "replicaCount": 1,
      "diskSpec": {...},
      ...
    },
    // `WorkerPoolSpec` for worker pool 3, optional
    {}
  ]
  ...
}
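The same skip pattern applies when you build the worker pool list programmatically: an empty dictionary holds the position of an unused pool. A sketch, reusing the placeholder values from the earlier example and assuming the empty spec is passed through to the API unchanged:

worker_pool_specs = [
    # Worker pool 0: primary replica.
    {
        "machine_spec": {"machine_type": "n1-standard-8"},
        "replica_count": 1,
        "container_spec": {"image_uri": "gcr.io/your-project/your-training-image"},
    },
    # Worker pool 1: left empty so that pool 2 keeps its position in the list.
    {},
    # Worker pool 2: parameter servers.
    {
        "machine_spec": {"machine_type": "n1-standard-4"},
        "replica_count": 1,
        "container_spec": {"image_uri": "gcr.io/your-project/your-training-image"},
    },
]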
Reduce training time with Reduction Server
When you train a large ML model using multiple nodes, communicating gradients between nodes can contribute significant latency. Reduction Server is an all-reduce algorithm that can increase throughput and reduce latency for distributed training. Vertex AI makes Reduction Server available in a Docker container image that you can use for one of your worker pools during distributed training.
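Because Reduction Server occupies the third worker pool (workerPoolSpecs[2], per the table above), a configuration might be sketched as follows. The machine types, replica counts, and accelerator settings are placeholders, and the Reduction Server image URI shown is an assumption based on Vertex AI's documentation; verify it is current before use:

worker_pool_specs = [
    # Worker pool 0: primary replica with GPUs.
    {
        "machine_spec": {
            "machine_type": "n1-standard-16",
            "accelerator_type": "NVIDIA_TESLA_V100",
            "accelerator_count": 2,
        },
        "replica_count": 1,
        "container_spec": {"image_uri": "gcr.io/your-project/your-training-image"},
    },
    # Worker pool 1: GPU workers.
    {
        "machine_spec": {
            "machine_type": "n1-standard-16",
            "accelerator_type": "NVIDIA_TESLA_V100",
            "accelerator_count": 2,
        },
        "replica_count": 3,
        "container_spec": {"image_uri": "gcr.io/your-project/your-training-image"},
    },
    # Worker pool 2: CPU-only reducers running the Reduction Server image.
    {
        "machine_spec": {"machine_type": "n1-highcpu-16"},
        "replica_count": 4,
        "container_spec": {
            # Assumed image URI; check the Vertex AI documentation for the
            # currently published Reduction Server container image.
            "image_uri": "us-docker.pkg.dev/vertex-ai-restricted/training/reductionserver:latest",
        },
    },
]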
To learn about how Reduction Server works, see