
Conversation

martinrohbeck
Contributor

This PR removes the use of the DataParallel container because it seems to cause issues.

  1. The issues only appear with CUDA enabled, because otherwise DataParallel is not used. When running python ss_vae_M2.py --cuda, memory is allocated on more than one GPU, but nothing seems to happen.
    However, after dropping the --cuda flag, i.e. running python ss_vae_M2.py, everything works fine. The code also works with CUDA_VISIBLE_DEVICES=1 python ss_vae_M2.py --cuda, so it is the multi-GPU training that creates the trouble.

  2. In any case, the PyTorch documentation recommends using DistributedDataParallel instead of DataParallel (a minimal, purely illustrative sketch of a typical DDP setup is included at the end of this description).

  3. I think the lines can simply be dropped, since MNIST is not a dataset that needs multi-GPU training anymore ;).

I installed Pyro from the latest dev branch (v1.8.5) and PyTorch v2.0.1.
The PR also contains some minor housekeeping.
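For context on point 2, here is a minimal sketch of how DistributedDataParallel is typically set up with plain PyTorch and launched via torchrun. This is not the code removed by this PR, the stand-in model and the file name in the comment are hypothetical, and integrating DDP with Pyro's SVI loop in ss_vae_M2.py would need additional care:

```python
# Illustrative only: a minimal DistributedDataParallel (DDP) setup with plain
# PyTorch, launched via torchrun. This is NOT the code touched by this PR and
# does not show how to integrate DDP with Pyro's SVI loop.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every spawned process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Tiny stand-in model; in the real example this would be the SS-VAE networks.
    model = torch.nn.Linear(784, 10).to(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    # ... build the optimizer / data loader and train on ddp_model ...

    dist.destroy_process_group()


if __name__ == "__main__":
    # e.g. torchrun --nproc_per_node=2 ddp_sketch.py  (file name is hypothetical)
    main()
```

The reason for the recommendation is that nn.DataParallel replicates the module inside a single process on every forward pass, while DDP runs one process per GPU, which scales better and avoids the kind of hang described above.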

Member

@fritzo fritzo left a comment

Thanks for cleaning up!

@fritzo fritzo merged commit 727aff7 into pyro-ppl:dev Jun 8, 2023
@martinrohbeck martinrohbeck deleted the fix-parallelisation-ss-vae branch June 9, 2023 06:21