abidwael
Contributor

Currently, when we resume model training from a checkpoint, we download artifacts from storage only on the coordinator process. DDP assumes every worker can read the model params, so the download needs to happen in every worker process.
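
To illustrate the problem described above, here is a minimal sketch rather than the code changed in this PR. It assumes each worker reads from its own local directory (as it would on separate nodes) and simulates the storage download with a plain file copy. The names `download_artifacts`, `prepare_resume`, and `workers_missing_checkpoint` are hypothetical and exist only for this example.

```python
import shutil
import tempfile
from pathlib import Path


def download_artifacts(remote_dir: Path, local_dir: Path) -> None:
    """Hypothetical stand-in for a storage download: copy files locally."""
    local_dir.mkdir(parents=True, exist_ok=True)
    for src in remote_dir.iterdir():
        if src.is_file():
            shutil.copy2(src, local_dir / src.name)


def prepare_resume(rank: int, remote_dir: Path, local_root: Path,
                   coordinator_only: bool) -> Path:
    """Return the local checkpoint directory that this worker reads from."""
    # Assumption: every worker has a private directory, mimicking one node each.
    local_dir = local_root / f"worker_{rank}"
    local_dir.mkdir(parents=True, exist_ok=True)
    # Old behaviour: only rank 0 downloads. New behaviour: every rank downloads.
    if not coordinator_only or rank == 0:
        download_artifacts(remote_dir, local_dir)
    return local_dir


def workers_missing_checkpoint(world_size: int, coordinator_only: bool) -> list:
    """Simulate all ranks and list those that cannot find the checkpoint."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        remote = root / "remote"
        remote.mkdir()
        (remote / "model.ckpt").write_bytes(b"weights")

        missing = []
        for rank in range(world_size):
            local = prepare_resume(rank, remote, root / "local", coordinator_only)
            if not (local / "model.ckpt").exists():
                missing.append(rank)
        return missing


if __name__ == "__main__":
    print(workers_missing_checkpoint(4, coordinator_only=True))   # [1, 2, 3]
    print(workers_missing_checkpoint(4, coordinator_only=False))  # []
```

With a coordinator-only download, every non-zero rank is left without the checkpoint file; downloading on every rank leaves none missing.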

@github-actions

Unit Test Results

6 files (±0), 6 suites (±0), duration 2h 29m 33s (-32m 43s)
1 589 tests (-19): 1 564 passed (-7), 24 skipped (-12), 1 failed (±0)
1 609 runs (-32): 1 578 passed (-22), 30 skipped (-10), 1 failed (±0)

For more details on these failures, see this check.

Results for commit fb81a3d. ± Comparison against base commit 5fbcca7.

@abidwael abidwael merged commit 5e57e7d into master Apr 25, 2023
@abidwael abidwael deleted the resume-files-exist branch April 25, 2023 17:16

3 participants