RMS norm implementation #67
Conversation
@@ -31,7 +31,7 @@ train_dataset:
   component_key: dataset
   variant_key: packed_mem_map_dataset_megatron
   config:
-    raw_data_path: /raid/s3/opengptx/max_lue/LLMgym/data/redpyjama_v2_default_DE_num_docs_16777216.pbin
+    raw_data_path: /raid/s3/opengptx/max_lue/modalities/data/sample_datasets/redpajama_v2/mem_map/redpajama_v2_gpt2_tokenized_num_samples_1050391.pbin
We should probably use relative paths here (and in other configs, too).
class RMSLayerNorm(LayerNormIF):
    def __init__(self, ndim: int, epsilon: float = 1e-6):
Should we not implement an (optional) bias for RMSLayerNorm, just like we do for ZLayerNorm? The original RMSNorm paper uses a bias by default.
Good point! I also checked the original RMSNorm implementation, and they had it as well (see: https://github.com/bzhangGo/rmsnorm/blob/2e726f1a3f106bb719056422f4f9b6aca03d3ce6/rmsnorm_torch.py#L32). Added the bias to this implementation, too.
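For context, a minimal sketch of an RMS norm with an optional bias term (my own illustration of the idea discussed above; the class name is made up and this is not the code merged in this PR):

import torch
import torch.nn as nn

class RMSNormSketch(nn.Module):
    """Illustrative RMS norm with an optional bias."""

    def __init__(self, ndim: int, epsilon: float = 1e-6, bias: bool = True):
        super().__init__()
        self.epsilon = epsilon
        self.gain = nn.Parameter(torch.ones(ndim))
        # the bias is optional; the discussion above notes that the original implementation includes one
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # scale by the root mean square over the last dimension; unlike LayerNorm, no mean is subtracted
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.epsilon)
        out = x / rms * self.gain
        return out + self.bias if self.bias is not None else out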
Args:
    ndim (int): The dimension of the input tensor.
    epsilon (float, optional): A small value added to the denominator for numerical stability. Default is 1e-6.
Epsilon is 1e-6 in the LLaMA implementation by default. However, it seems that they actually used 1e-5 themselves, see here. 1e-5 is also the default value in PyTorch for LayerNorm and is used elsewhere for RMSNorm (e.g. here), so it seems like a standard value that we should perhaps also use instead of 1e-6?
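For what it's worth, the PyTorch default can be checked directly (a quick sanity check, not part of this PR):

import torch.nn as nn

# torch.nn.LayerNorm defaults to eps=1e-5, the value referred to above
print(nn.LayerNorm(768).eps)  # 1e-05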
return copied_instance
class ZLayerNorm(LayerNormIF):
Why is it called ZLayerNorm, i.e. what does the Z stand for? Is this only to differentiate it from the more generic LayerNormIF class?
The layer norm that is implemented in PyTorch basically calculates the z-score for each vector component (with two additional, learnable affine transformation parameters):
https://en.wikipedia.org/wiki/Standard_score
I found the name "layer norm" too generic, as RMSNorm is also a layer norm. What naming would you suggest?
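For reference, the distinction being drawn here, written out explicitly (standard definitions, not copied from this PR's code): $\mathrm{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \odot \gamma + \beta$, whereas $\mathrm{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 + \epsilon}} \odot \gamma$, where $\mu$ and $\sigma^2$ are the mean and variance over the feature dimension. So the PyTorch layer norm computes a z-score before applying the learnable affine parameters, while RMSNorm only rescales.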
Michael and Mehdi suggested using the original LayerNorm name as well. Having been overruled, it's LayerNorm again :-)
Also, we don't use a custom LayerNorm wrapper anymore. I found a way to simplify that part so that we don't have to override __copy__().
class RMSLayerNormConfig(BaseModel):
    ndim: Annotated[int, Field(strict=True, ge=1)]
    epsilon: Annotated[float, Field(gt=0, default=1e-6)]
Is there a reason we do not use strict=True here (and above in ZLayerNormConfig)?
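For reference, a strict variant could look roughly like this (Pydantic v2 Field accepts strict=True, as the ndim field above already shows; the class name here is illustrative and this is not necessarily the exact change made in the PR):

from typing import Annotated
from pydantic import BaseModel, Field

class RMSLayerNormConfigStrict(BaseModel):
    ndim: Annotated[int, Field(strict=True, ge=1)]
    # strict=True disables type coercion, e.g. the string "1e-6" would no longer be accepted as a float
    epsilon: Annotated[float, Field(strict=True, gt=0, default=1e-6)]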
fixed.
Looks good to me! I added a few comments and questions.
…orch implementation without the need for a wrapper. Removed the __copy__ overrides, as deepcopy is already capable of recursively copying an nn.Module. Introduced a bias to RMSLayerNorm.
The layer norm was originally instantiated individually inside every attention block (see modalities/src/modalities/models/gpt2/gpt2_model.py, line 193 at dd0db07).
For every new layer norm type, we would have had to add an if-clause to check which layer norm to instantiate. As a workaround, we now pass the layer norm object into the GPT2 model from outside and copy it in every attention block. Note that we override the copy function in the layer norm implementations.
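A minimal sketch of this copy-the-prototype approach (the class and attribute names are mine, not the actual modalities API):

import copy
import torch.nn as nn

class AttentionBlockSketch(nn.Module):
    """Illustrative attention block that receives its own copy of the norm module."""

    def __init__(self, norm: nn.Module):
        super().__init__()
        self.norm = norm
        # ... attention and MLP submodules would follow here

class GPT2Sketch(nn.Module):
    """The norm prototype is passed in once from outside and copied per block."""

    def __init__(self, norm_prototype: nn.Module, n_layers: int):
        super().__init__()
        self.blocks = nn.ModuleList(
            # deepcopy gives every block an independent set of norm parameters
            [AttentionBlockSketch(copy.deepcopy(norm_prototype)) for _ in range(n_layers)]
        )

model = GPT2Sketch(norm_prototype=nn.LayerNorm(768), n_layers=12)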
For the future, it would make sense to have the possibility to instantiate lists of components. For instance, a GPTModel would have a dependency on a list of attention blocks. We would specify a single attention block and instantiate it n times (see num_instances in the YAML below). Each attention block would then have a dependency on a layer norm and would not have to be copied internally anymore. This is an example: