[Auto Parallel] Support semi-auto trainer and fit Llama2 training #7885
Conversation
Thanks for your contribution!
Force-pushed: 9668320 to 97498b9
ZHUI left a comment:
The changes are fairly large, so I'm requesting changes for now.
Force-pushed: 97498b9 to 16bca68
        )
        return optimizer

    def _wrap_dist_loader(self, train_dataloader):
Is it not used in dynamic mode?
Done, it's now used in both dynamic and static modes.
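For context, a minimal sketch of what such a wrapper can look like with Paddle's semi-auto API (the _get_meshes helper name and the exact arguments are assumptions based on the surrounding diff, not the PR's exact code):

import paddle.distributed as dist

def _wrap_dist_loader(self, train_dataloader):
    # Shard the plain DataLoader across the process meshes so that each
    # data-parallel rank reads its own slice of every batch; used in both
    # dynamic and static semi-auto modes.
    return dist.shard_dataloader(
        dataloader=train_dataloader,
        meshes=self._get_meshes(),  # one mesh per pipeline stage (assumed helper)
        shard_dims="dp",            # shard the batch dimension along the data-parallel axis
    )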
paddlenlp/trainer/auto_trainer.py (outdated)
            meshes.append(_get_mesh(pp_idx))
        return meshes

    def _wrap_dist_loader(self, train_dataloader):
What's the difference from _wrap_dist_loader in run_pretrain_3D_auto.py?
paddlenlp/trainer/auto_trainer.py (outdated)
            shard_dims="dp",
        )

    def _wrap_for_static(self, model, train_dataloader):
It seems unused?
It's called by the Trainer in paddlenlp/trainer/trainer.py to wrap the model into a DistModel in static mode.
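For reference, a minimal sketch of how a dynamic-graph model can be wrapped into a DistModel with paddle.distributed.to_static (the criterion/optimizer attribute names and the return shape are assumptions, not the trainer's exact code):

import paddle.distributed as dist

def _wrap_for_static(self, model, train_dataloader):
    # Build the sharded dataloader first, then trace model, loss and optimizer
    # into a DistModel that runs the distributed static graph.
    dist_loader = self._wrap_dist_loader(train_dataloader)
    dist_model = dist.to_static(model, dist_loader, self.criterion, self.optimizer)
    return dist_model, dist_loader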
        position_ids = paddle.arange(seq_length, dtype="int64").expand((batch_size, seq_length))
        # NOTE(zhaoyingli): infer spmd does not support [seq_len] --> [batch, seq_len] in data_parallel
-       position_ids = dist.shard_tensor(position_ids, get_mesh(), [dist.Shard(0), dist.Replicate()])
+       position_ids = dist.shard_tensor(position_ids, get_mesh(), [dist.Replicate(), dist.Replicate()])
Why change it to replicated?
Because in static mode, infer SPMD does not yet support the case "[seq_len] --> [batch, seq_len]".
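To illustrate the two placements being discussed, a standalone sketch (the 2-D mesh below is hypothetical and stands in for get_mesh()):

import paddle
import paddle.distributed as dist

# Hypothetical ["dp", "mp"] mesh standing in for get_mesh().
mesh = dist.ProcessMesh([[0, 1], [2, 3]], dim_names=["dp", "mp"])

batch_size, seq_length = 4, 128
position_ids = paddle.arange(seq_length, dtype="int64").expand((batch_size, seq_length))

# Dynamic-mode placement: shard the batch dimension across the "dp" axis.
# position_ids = dist.shard_tensor(position_ids, mesh, [dist.Shard(0), dist.Replicate()])

# Placement used in this PR for static mode: fully replicated, because infer-SPMD
# cannot yet propagate the [seq_len] --> [batch, seq_len] expansion under data parallel.
position_ids = dist.shard_tensor(position_ids, mesh, [dist.Replicate(), dist.Replicate()])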
Force-pushed: 0afdc96 to 624abd7
paddlenlp/trainer/trainer.py (outdated)
        if self.args.use_auto_parallel and self.args.run_static_semi_auto:
            model = self._wrap_for_static(model, train_dataloader)

        self.model = model
Suggested change:
        if self.args.use_auto_parallel and self.args.run_static_semi_auto:
            model = self._wrap_for_static(model, train_dataloader)
        self.model = model
Done
Force-pushed: 624abd7 to 6a381c3
Codecov Report (Attention)
Additional details and impacted files:

@@ Coverage Diff @@
##           develop    #7885      +/-   ##
===========================================
- Coverage    56.80%   56.57%    -0.23%
===========================================
  Files          588      589        +1
  Lines        89536    89900      +364
===========================================
+ Hits         50858    50865        +7
- Misses       38678    39035      +357

View full report in Codecov by Sentry.
Force-pushed: 4df557e to dee9d04
Force-pushed: e3dfa0b to eda936c
Force-pushed: eda936c to e541379
        # all_gather + mean() to get average loss over all processes
-       tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
+       tr_loss_scalar = self._get_item_from_loss(self._nested_gather(tr_loss).mean())
PaddleNLP/paddlenlp/trainer/trainer.py, lines 1199 to 1209 in fe6b45d:

    def _get_item_from_loss(self, loss):
        assert isinstance(loss, paddle.Tensor) and loss._is_initialized()
        return loss.item()

    def _maybe_log_save_evaluate(self, tr_loss, model, epoch, ignore_keys_for_eval, **kwargs):
        if self.control.should_log:
            logs: Dict[str, float] = {}

            # all_gather + mean() to get average loss over all processes
            tr_loss_scalar = self._get_item_from_loss(self._nested_gather(tr_loss).mean())
I see you are reusing the _maybe_log_save_evaluate function here, and it is already wrapped in a guard outside. Why add the assert isinstance(loss, paddle.Tensor) and loss._is_initialized() check here?
This check can be removed here; the semi-auto logic can simply be overridden in auto_trainer.
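A possible shape for such an override (a sketch under assumptions; the actual logic in auto_trainer.py may differ):

import paddle
from paddlenlp.trainer.trainer import Trainer


class AutoTrainer(Trainer):
    def _get_item_from_loss(self, loss):
        # In semi-auto runs (especially with pipeline parallelism) the gathered
        # loss may only be materialized on some ranks, so check before reading
        # the scalar instead of asserting in the base Trainer.
        if isinstance(loss, paddle.Tensor) and loss._is_initialized():
            return loss.item()
        return 0.0  # placeholder on ranks that do not hold the loss (assumption)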
PR types
Bug fixes
PR changes
Others
Description
[Auto Parallel] Support semi-auto trainer and fit Llama2 training