2x slowdown using pmap with DeviceArrays on TPU.

Recently we've observed 2x slower training when passing jax arrays (DeviceArrays) to a pmapped function on TPU. Converting the data to numpy arrays before the call is the current workaround.

The training loop looks roughly like this:
```
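import jax
import numpy as onp  # original NumPy, as opposed to jax.numpy

# update and data_gen are defined elsewhere in the training code
update_params = jax.pmap(update)
for data in data_gen():
  # Converting to a numpy array here removes the slowdown:
  # data = onp.array(data)
  params = update_params(data, params)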
```
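
For anyone who wants to try this locally, here is a minimal, self-contained sketch of how the two code paths could be timed against each other. The `update` body, the array shapes, and the `time_loop` helper are hypothetical stand-ins rather than our actual training code, and this `data_gen` takes arguments that the real one does not. It is meant only to show the shape of the comparison, not to guarantee that the slowdown reproduces on every setup.

```python
import time

import jax
import jax.numpy as jnp
import numpy as onp

n_dev = jax.local_device_count()


def update(data, params):
    # Hypothetical stand-in for the real update step (assumption).
    grad = jnp.mean(data, axis=0)
    return params - 0.01 * grad


update_params = jax.pmap(update)


def data_gen(num_batches, as_numpy):
    # Yields batches with a leading device axis, as pmap expects.
    for i in range(num_batches):
        batch = jnp.full((n_dev, 128, 256), float(i))
        # as_numpy=True applies the workaround: copy to a host numpy array.
        yield onp.array(batch) if as_numpy else batch


def time_loop(as_numpy, num_batches=20):
    params = jnp.zeros((n_dev, 256))
    # Warm-up call so compilation time is excluded from the measurement.
    params = update_params(next(data_gen(1, as_numpy)), params)
    params.block_until_ready()
    start = time.perf_counter()
    for data in data_gen(num_batches, as_numpy):
        params = update_params(data, params)
    # Dispatch is asynchronous, so wait for the result before stopping the clock.
    params.block_until_ready()
    return time.perf_counter() - start, params


if __name__ == "__main__":
    t_jax, _ = time_loop(as_numpy=False)
    t_np, _ = time_loop(as_numpy=True)
    print(f"jax array inputs:   {t_jax:.4f}s")
    print(f"numpy array inputs: {t_np:.4f}s")
```

The measured time for the numpy path includes the `onp.array` conversion itself, which matches how the workaround is applied in the loop above.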