
Conversation

martinrohbeck
Contributor

This PR removes the use of the DataParallel container because it seems to cause issues.

  1. The issues only appear with CUDA enabled, because otherwise DataParallel is not used. When running python ss_vae_M2.py --cuda, memory is allocated on more than one GPU, but nothing seems to happen.
    However, after dropping the --cuda flag, i.e. running python ss_vae_M2.py, everything works fine. The code also works with CUDA_VISIBLE_DEVICES=1 python ss_vae_M2.py --cuda, so it is the multi-GPU training that creates the trouble.

  2. In any case, the PyTorch documentation recommends using DistributedDataParallel instead of DataParallel (a minimal, purely illustrative sketch of a typical DDP setup is included at the end of this description).

  3. I think the lines can simply be dropped, since MNIST is not a dataset that needs multi-GPU training anymore ;).

I installed Pyro from the latest dev branch (v1.8.5) and PyTorch v2.0.1.
The PR also contains some minor housekeeping.
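For context on point 2, here is a minimal sketch of how DistributedDataParallel is typically set up with plain PyTorch and launched via torchrun. This is not the code removed by this PR, the stand-in model and the file name in the comment are hypothetical, and integrating DDP with Pyro's SVI loop in ss_vae_M2.py would need additional care:

```python
# Illustrative only: a minimal DistributedDataParallel (DDP) setup with plain
# PyTorch, launched via torchrun. This is NOT the code touched by this PR and
# does not show how to integrate DDP with Pyro's SVI loop.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every spawned process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Tiny stand-in model; in the real example this would be the SS-VAE networks.
    model = torch.nn.Linear(784, 10).to(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    # ... build the optimizer / data loader and train on ddp_model ...

    dist.destroy_process_group()


if __name__ == "__main__":
    # e.g. torchrun --nproc_per_node=2 ddp_sketch.py  (file name is hypothetical)
    main()
```

The reason for the recommendation is that nn.DataParallel replicates the module inside a single process on every forward pass, while DDP runs one process per GPU, which scales better and avoids the kind of hang described above.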

Member

@fritzo fritzo left a comment

Thanks for cleaning up!

@fritzo fritzo merged commit 727aff7 into pyro-ppl:dev Jun 8, 2023
@martinrohbeck martinrohbeck deleted the fix-parallelisation-ss-vae branch June 9, 2023 06:21