SylarTiaNII
Contributor

PR types

Bug fixes

PR changes

Others

Description

Fix asynchronous saving of the optimizer state.

paddle-bot bot commented Aug 5, 2023

Thanks for your contribution!


optimizer_name = _add_variant(OPTIMIZER_NAME, self.args.optimizer_name_suffix)

if self.args.use_hybrid_parallel:
Contributor

Why was the location of the save code changed?

Contributor Author

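When the optimizer finishes saving, a saved_signal is written, which is used to check whether the checkpoint is complete. The save code was moved here so that the asynchronous and synchronous write paths execute the same way.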
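To illustrate the point above, here is a minimal sketch of the "save, then write a completion signal" pattern, with the sync and async paths sharing one helper. It is not the code from this PR: the helper names (`_save_then_signal`, `save_optimizer`, `is_checkpoint_complete`), the use of `pickle` and `threading`, and the file names are all assumptions made for illustration.

```python
import os
import pickle
import tempfile
import threading


def _save_then_signal(state_dict, path, saved_signal_path, protocol=4):
    """Hypothetical helper: write the payload, then the completion marker."""
    with open(path, "wb") as f:
        pickle.dump(state_dict, f, protocol=protocol)
    # The marker is written only after the payload has been written,
    # so its presence implies the save step finished.
    with open(saved_signal_path, "w") as f:
        f.write("1")


def save_optimizer(state_dict, path, saved_signal_path, use_async=False):
    """Run the same helper either inline (sync) or in a background thread."""
    if not use_async:
        _save_then_signal(state_dict, path, saved_signal_path)
        return None
    task = threading.Thread(
        target=_save_then_signal, args=(state_dict, path, saved_signal_path)
    )
    task.start()
    return task  # the caller is expected to call task.join() later


def is_checkpoint_complete(saved_signal_path):
    """A checkpoint counts as complete once the marker file exists."""
    return os.path.exists(saved_signal_path)


# Usage: both modes leave the same files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "optimizer.bin")       # illustrative file name
    signal = os.path.join(tmp, "saved_signal")      # illustrative file name
    task = save_optimizer({"step": 1}, path, signal, use_async=True)
    task.join()
    print(is_checkpoint_complete(signal))
```

Because both modes call the same helper, the signal is written at the same point relative to the payload regardless of how the save is scheduled, which is the consistency the reply describes.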

task.join()


def async_save_optimizer(optimizer_state_dict, path, saved_signal_path, protocol=4, sync_other_task=False):
Contributor

It would be best to put these in a separate file, for example under the trainer/plugins directory.

Contributor Author

OK, I will move this in a follow-up.

ZHUI merged commit 28d4e0c into PaddlePaddle:refactor-training-loop on Aug 7, 2023