RMS norm implementation #67
Conversation
@@ -31,7 +31,7 @@ train_dataset:
   component_key: dataset
   variant_key: packed_mem_map_dataset_megatron
   config:
-    raw_data_path: /raid/s3/opengptx/max_lue/LLMgym/data/redpyjama_v2_default_DE_num_docs_16777216.pbin
+    raw_data_path: /raid/s3/opengptx/max_lue/modalities/data/sample_datasets/redpajama_v2/mem_map/redpajama_v2_gpt2_tokenized_num_samples_1050391.pbin
We should probably use relative paths here (and in other configs, too).
class RMSLayerNorm(LayerNormIF):
    def __init__(self, ndim: int, epsilon: float = 1e-6):
Should we not implement an (optional) bias for RMSLayerNorm, just like we do for ZLayerNorm? The original RMSNorm paper uses a bias by default.
Good point! I also checked the original RMSNorm implementation, and they had it as well (see: https://github.com/bzhangGo/rmsnorm/blob/2e726f1a3f106bb719056422f4f9b6aca03d3ce6/rmsnorm_torch.py#L32). Added the bias to this implementation, too.
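For context, a minimal sketch of an RMS norm with an optional bias term (my own illustration of the idea discussed above; the class name is made up and this is not the code merged in this PR):

import torch
import torch.nn as nn

class RMSNormSketch(nn.Module):
    """Illustrative RMS norm with an optional bias."""

    def __init__(self, ndim: int, epsilon: float = 1e-6, bias: bool = True):
        super().__init__()
        self.epsilon = epsilon
        self.gain = nn.Parameter(torch.ones(ndim))
        # the bias is optional; the discussion above notes that the original implementation includes one
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # scale by the root mean square over the last dimension; unlike LayerNorm, no mean is subtracted
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.epsilon)
        out = x / rms * self.gain
        return out + self.bias if self.bias is not None else out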
Args:
    ndim (int): The dimension of the input tensor.
    epsilon (float, optional): A small value added to the denominator for numerical stability. Default is 1e-6.
Epsilon is 1e-6 in the LLaMA implementation by default. However, it seems that they actually used 1e-5 themselves, see here. 1e-5 is also the default value in PyTorch for LayerNorm and is used elsewhere for RMSNorm (e.g. here), so it seems like a standard value that we should perhaps also use instead of 1e-6?
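For what it's worth, the PyTorch default can be checked directly (a quick sanity check, not part of this PR):

import torch.nn as nn

# torch.nn.LayerNorm defaults to eps=1e-5, the value referred to above
print(nn.LayerNorm(768).eps)  # 1e-05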
return copied_instance
class ZLayerNorm(LayerNormIF):
Why is it called ZLayerNorm, i.e. what does the Z stand for? Is this only to differentiate it from the more generic LayerNormIF class?
The layer norm that is implemented in PyTorch basically calculates the z-score for each vector component (with two additional, learnable affine transformation parameters):
https://en.wikipedia.org/wiki/Standard_score
I found the name "layer norm" too generic, as RMSNorm is also a layer norm. What naming would you suggest?
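For reference, the distinction being drawn here, written out explicitly (standard definitions, not copied from this PR's code): $\mathrm{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \odot \gamma + \beta$, whereas $\mathrm{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 + \epsilon}} \odot \gamma$, where $\mu$ and $\sigma^2$ are the mean and variance over the feature dimension. So the PyTorch layer norm computes a z-score before applying the learnable affine parameters, while RMSNorm only rescales.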
Michael and Mehdi suggested using the original LayerNorm name as well. Having been overruled, it's LayerNorm again :-)
Also, we don't use a custom LayerNorm wrapper anymore. I found a way to simplify that part so that we don't have to override __copy__().
class RMSLayerNormConfig(BaseModel):
    ndim: Annotated[int, Field(strict=True, ge=1)]
    epsilon: Annotated[float, Field(gt=0, default=1e-6)]
Is there a reason we do not use strict=True here (and above in ZLayerNormConfig)?
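For reference, a strict variant could look roughly like this (Pydantic v2 Field accepts strict=True, as the ndim field above already shows; the class name here is illustrative and this is not necessarily the exact change made in the PR):

from typing import Annotated
from pydantic import BaseModel, Field

class RMSLayerNormConfigStrict(BaseModel):
    ndim: Annotated[int, Field(strict=True, ge=1)]
    # strict=True disables type coercion, e.g. the string "1e-6" would no longer be accepted as a float
    epsilon: Annotated[float, Field(strict=True, gt=0, default=1e-6)]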
fixed.
Looks good to me! I added a few comments and questions.
…orch implementation without the need for a wrapper. Removed the __copy__ overrides, as deepcopy is already capable of recursively copying an nn.Module. Introduced a bias to RMSLayerNorm.
The layer norm was originally instantiated individually inside every attention block (see modalities/src/modalities/models/gpt2/gpt2_model.py, line 193 at dd0db07).
For every new layer norm type, we would have had to add an if-clause to check which layer norm to instantiate. As a workaround, we now pass the layer norm object into the GPT2 model from outside and copy it in every attention block. Note that we override the copy function in the layer norm implementations.
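A minimal sketch of this copy-the-prototype approach (the class and attribute names are mine, not the actual modalities API):

import copy
import torch.nn as nn

class AttentionBlockSketch(nn.Module):
    """Illustrative attention block that receives its own copy of the norm module."""

    def __init__(self, norm: nn.Module):
        super().__init__()
        self.norm = norm
        # ... attention and MLP submodules would follow here

class GPT2Sketch(nn.Module):
    """The norm prototype is passed in once from outside and copied per block."""

    def __init__(self, norm_prototype: nn.Module, n_layers: int):
        super().__init__()
        self.blocks = nn.ModuleList(
            # deepcopy gives every block an independent set of norm parameters
            [AttentionBlockSketch(copy.deepcopy(norm_prototype)) for _ in range(n_layers)]
        )

model = GPT2Sketch(norm_prototype=nn.LayerNorm(768), n_layers=12)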
For the future, it would make sense to have the possibility to instantiate lists of components. For instance, a GPTModel would have a dependency on a list of attention blocks. We would specify a single attention block and instantiate it n times (see num_instances in the YAML below). Each attention block would then have a dependency on a layer norm and would not have to be copied internally anymore. This is an example: