How to start training again from last checkpoint #4488
If you're using exp_manager, it has two flags for resuming training. Without it, you can try loading your checkpoint using load_from_checkpoint() and then calling trainer.fit() again, but we haven't tested that.
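A minimal sketch of the exp_manager route. The reply does not name the two flags; in NeMo's exp_manager config they are, as far as I know, `resume_if_exists` and `resume_ignore_no_checkpoint`. The `exp_dir` and `name` values are hypothetical placeholders and would need to match the interrupted run so that the same log directory is found.

```python
# Sketch of exp_manager settings for resuming an interrupted run.
# exp_dir and name are hypothetical placeholders; use the values from
# the original run so the existing checkpoints directory is reused.
exp_manager_cfg = {
    "exp_dir": "nemo_experiments",       # placeholder path
    "name": "my_model",                  # placeholder experiment name
    "create_checkpoint_callback": True,  # keep saving checkpoints
    "resume_if_exists": True,            # resume from an existing checkpoint
    "resume_ignore_no_checkpoint": True, # start fresh if none is found
}

# With NeMo installed, the config would typically be applied like this
# (not executed here):
#   from nemo.utils.exp_manager import exp_manager
#   exp_manager(trainer, exp_manager_cfg)
#   trainer.fit(model)
```

With `resume_ignore_no_checkpoint` enabled, the same config should work for both the first launch and later restarts, since a missing checkpoint is not treated as an error.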
Are you talking about this one? create_checkpoint_callback. It is set to True @titu1994
Note that you will need to fix either
We also have some documentation here: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/core/exp_manager.html
You could continue training from the latest checkpoint following the steps:
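The steps themselves are not shown in this reply. As a rough sketch of one common approach, and not necessarily the one the author had in mind, you can pick the most recently modified `.ckpt` file and pass it to PyTorch Lightning's `trainer.fit(..., ckpt_path=...)`. The helper name `find_latest_checkpoint` and the directory in the usage comment are hypothetical.

```python
from pathlib import Path
from typing import Optional


def find_latest_checkpoint(ckpt_dir: str) -> Optional[Path]:
    """Return the most recently modified .ckpt file in ckpt_dir, or None."""
    candidates = list(Path(ckpt_dir).glob("*.ckpt"))
    if not candidates:
        return None
    # Newest file by modification time.
    return max(candidates, key=lambda p: p.stat().st_mtime)


# Usage with PyTorch Lightning (not executed here; path is a placeholder):
#   ckpt = find_latest_checkpoint("nemo_experiments/my_model/checkpoints")
#   trainer.fit(model, ckpt_path=str(ckpt) if ckpt else None)
```

Passing `ckpt_path` to `fit` is meant to restore the full training state (optimizer, epoch, step) rather than just the weights, which is what keeps training from starting at the beginning.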
@FatimaArshad-DS Do you have any other questions about this?
Is it possible to save a checkpoint from MoE models? I ran into some issues trying to save one. The only way I see right now is saving the weights, but that doesn't work for resuming training.
Hi,
My training got interrupted and I'm trying to restart it from the last checkpoint. However, training starts from the beginning. How do I make it start from the last checkpoint?