Add support for multihost pmaps. #1376
Conversation
You should ideally get a review from Matt if he's able to.
    WrapHashably, Hashable, prod, split_list)
  from .lib.xla_bridge import (canonicalize_dtype, device_count,
-                              local_device_count, devices, host_id)
+                              local_device_count, devices, host_id, host_count)
Nit, not new: we're now importing 5 different things from xla_bridge; I think I'd prefer we just imported the module and qualified the names.
These are imported to bring them into the `jax` namespace, e.g. so you can call jax.host_count(). I would prefer we dealt with this in a different way, but haven't gotten around to trying anything...
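For readers following the discussion, a small illustrative sketch of the two styles being compared; this is not code from the PR, just the usage the comments describe:

```python
import jax
from jax.lib import xla_bridge

# Re-exported name: works because host_count is imported into the jax namespace.
n_hosts = jax.host_count()

# Module-qualified alternative, along the lines of the nit above.
n_hosts_alt = xla_bridge.host_count()
```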
jax/interpreters/pxla.py
Outdated
_get_global_axis_size_pmapped = None

def _get_global_axis_size(local_axis_size):
  """Uses pmap to sum `local_axis_size` across all hosts.
While this approach is cunning, don't we know the global topology from the low-level runtime stack? Can't we use that here instead?
In particular the module dependency structure seems a bit odd to me with this design.
It also feels strange to me that we have to special case multihost computations here; it feels to me like the same code should work in both single host and multihost cases.
As discussed offline, I got rid of this for now and instead require that multihost pmaps run on all devices or have `devices` specified.
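A hedged illustration of the two allowed modes mentioned above (run across all devices, or pass `devices` explicitly); the names here are made up for the example:

```python
import jax
import jax.numpy as jnp
from jax import lax, pmap

# Mode 1: no `devices` argument -- the pmap is assumed to span all hosts' devices.
f = pmap(lambda x: lax.psum(x, 'i'), axis_name='i')

# Mode 2: `devices` specified explicitly; every participating host must pass
# the same global device list (same ordering on each host).
g = pmap(lambda x: lax.psum(x, 'i'), axis_name='i', devices=jax.devices())

# Each host feeds in a shard sized to its local participating devices.
out = g(jnp.ones(jax.local_device_count()))
```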
Force-pushed from 77ce8f6 to 1f36e3d.
LGTM! Clear explanations, simple `global_axis_size` implementation, plus logging and error checking. Thanks for adding this!
Force-pushed from 72512f9 to 9dd43b3.
All participating hosts are assumed to be running the same pmap code. Conceptually, this can be considered a single pmap over an array sharded on its leading pmapped dimension across the hosts. Each host passes its input shard to its pmapped function call, which returns the corresponding output shard (i.e. an array of the same leading dimension size). However, any collective operations will be run across the entire "global" array.

If the `devices` argument to pmap is None, the pmap is assumed to be running across all hosts visible to XLA (as returned by jax.host_count()). Each host can pass in an input array of leading dimension size equal to or less than the number of devices local to that host. Note that this doesn't change the current behavior for single-host platforms. If `devices` are specified, the participating hosts are dictated by the devices' host_ids, and each host must pass in an input array of leading dimension size equal to the number of local participating devices.

Implementation-wise, each host independently compiles the computation, which we assume yields the same executable on all hosts (follow-up work will add more error checking). The hosts must know the global axis size of the sharded array, e.g. to provide the correct replica count to XLA. This is equal to the length of `devices` if specified, but if not, pmap is recursively called (with `devices` specified) to use `psum` to compute the global axis size.
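To make the shard-in/shard-out semantics above concrete, a hedged sketch of the kind of program each host would run (names are illustrative, not from the PR):

```python
import jax
import jax.numpy as jnp
from jax import lax, pmap

# Every host runs this same program. This host's input shard has leading
# dimension at most jax.local_device_count(); conceptually, all the hosts'
# shards together form one global array sharded across every host's devices.
local_shard = jnp.arange(1, jax.local_device_count() + 1, dtype=jnp.float32)

def normalize(x):
    # The psum runs over the *global* pmapped axis, i.e. across all devices on
    # all participating hosts, even though each host only sees its own shard.
    return x / lax.psum(x, 'i')

local_out = pmap(normalize, axis_name='i')(local_shard)
# local_out has the same leading dimension as local_shard; the other hosts get
# their own output shards from the same conceptual global computation.
```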
Thanks Matt! FYI I reworked the multi-host pmap example in the pmap docstring based on offline suggestions from @necula01. I'm gonna submit now as-is, but if anyone has further comments, I can iterate in subsequent commits.