Refactor training loop #6098
base: develop
Thanks for your contribution!
c689509 to 6b690ec
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #6098 +/- ##
===========================================
- Coverage 63.53% 63.46% -0.07%
===========================================
Files 514 514
Lines 73620 73690 +70
===========================================
- Hits 46773 46767 -6
- Misses 26847 26923 +76
[Distributed]Fix trainer for distributed training
[Dygraph] Support PP new strategy - delay_scale_loss & dp_comm_overlap
pipeline compatible with non-master-grad
support sharding stage1 in hybrid parallel.
Add time statistics for nccl-connection.
allow to use `main-grad` under TF32/FP32
[hot-fix] resume from accumulation-step wrong
[fix] only the pp01 model was saved when using pp without mp
…sume, warning when model-weight has missing keys
online hot fix, raise Error when optimizer/lr scheduler no show in re…
Support asynchronous save
support sharding save and load
pangengzheng seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
fix sharding group
This Pull Request is stale because it has been open for 60 days with no activity.
PR types
PR changes
Description