
Conversation

janEbert
Contributor

Only relevant if elementwise_affine=True.
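
A quick sketch of what that note means in practice, assuming the `bias` keyword added here behaves as described (sizes are arbitrary):

```python
import torch.nn as nn

# bias only matters when elementwise_affine=True:
ln = nn.LayerNorm(64, elementwise_affine=True, bias=False)
print(ln.weight is not None, ln.bias is None)   # True True

# With elementwise_affine=False there are no learnable parameters at all,
# so the bias flag has no effect:
ln = nn.LayerNorm(64, elementwise_affine=False)
print(ln.weight is None, ln.bias is None)       # True True
```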

@pytorch-bot

pytorch-bot bot commented May 17, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/101683

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 9ff21d5:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@drisspg drisspg added the triaged label May 17, 2023
@jbschlosser
Contributor

Hey @janEbert, thanks for the PR - is there an associated issue for this?

@mikaylagawarecki
Contributor

For my understanding, is this change related to the porting of RMSNorm from this comment?

@janEbert
Contributor Author

Hey @jbschlosser, I tried querying issues and PRs matching "LayerNorm without bias" and similar, but didn't find anything. I haven't opened an issue for this but can do it if it makes administration easier.
Also if you have a tip for searching in very large open-source projects like PyTorch, I'd really appreciate it for the future!

The motivation for this PR is T5-style models, as discussed in the issue @mikaylagawarecki linked (which I didn't find in my search). The PaLM paper also reports increased training stability for large models when biases, including the LayerNorm bias, are removed.


@albanD mentioned:

yes bias is not great as it is usually just the bias weight but here it also means the centering is also removed (you don't remove the average bias). Do you think rms_only=True would be a better name?

Should this be discussed? I personally think `bias` is clear enough, given the consistency with other PyTorch layers and that "centering" in the statistical sense usually refers to the mean.
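
To illustrate the consistency argument, here is a minimal sketch assuming the `bias` keyword proposed in this PR (the feature size 512 is just an example):

```python
import torch.nn as nn

# The same keyword disables the additive bias term across modules:
linear = nn.Linear(512, 512, bias=False)
ln = nn.LayerNorm(512, bias=False)           # the argument added by this PR
print(linear.bias is None, ln.bias is None)  # True True
```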

@janEbert
Contributor Author

janEbert commented Jul 5, 2023

Hey, any new opinions on this? I'd be really happy to see this merged so that the PyTorch Transformers API becomes more flexible for scaling up. :)

@mikaylagawarecki mikaylagawarecki self-assigned this Jul 7, 2023
@mikaylagawarecki
Contributor

mikaylagawarecki commented Jul 14, 2023

@janEbert Apologies for the delay, so my understanding (which perhaps you were getting at) is that

LayerNorm is given by

$y = \frac{x - \mathrm{E}[x]}{ \sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma + \beta$
where the $\epsilon$ is added for numerical stability

With this PR, LayerNorm(elementwise_affine=True, bias=False) computes

$y = \frac{x - \mathrm{E}[x]}{ \sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma$

and RMSNorm is given by

$y = \frac{x}{\mathrm{RMS}(x)} * \gamma$ (with maybe an optional $+ \beta$ per this comment)

Where

$\mathrm{Var}[x] = \frac{1}{n}\sum\limits_{i=1}^{n} (x_i - \mathrm{E}[x])^2$

and

$\mathrm{RMS}(x) = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n} x_i^2}$ ($\epsilon$ could also be used for numerical stability in this calculation)

so RMSNorm differs from LayerNorm on 3 counts:

  1. it doesn't subtract the expectation in the numerator
  2. it uses $\mathrm{RMS}(x)$ rather than $\sqrt{\mathrm{Var}[x] + \epsilon}$ in the denominator
  3. (perhaps) it doesn't learn an elementwise affine bias

And this change is completely separate from RMSNorm. In conclusion, we are okay with adding a bias argument here to disable the affine bias for consistency with other modules (e.g. Bilinear, Linear) :)
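
To make the distinction concrete, a small numerical sketch (assuming the `bias` keyword from this PR; the tensor size and `eps` value are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 16)
eps = 1e-5

# LayerNorm(elementwise_affine=True, bias=False) from this PR:
# y = (x - E[x]) / sqrt(Var[x] + eps) * gamma
ln = nn.LayerNorm(16, eps=eps, bias=False)
mean = x.mean(-1, keepdim=True)
var = x.var(-1, unbiased=False, keepdim=True)
manual_ln = (x - mean) / torch.sqrt(var + eps) * ln.weight
print(torch.allclose(ln(x), manual_ln, atol=1e-5))  # True

# RMSNorm-style: y = x / RMS(x) * gamma, with no mean subtraction.
rms = torch.sqrt(x.pow(2).mean(-1, keepdim=True) + eps)
manual_rms = x / rms * ln.weight
print(torch.allclose(manual_ln, manual_rms))  # False in general (unless the mean of x is ~0)
```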

@mikaylagawarecki
Contributor

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased layernorm_nobias onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout layernorm_nobias && git pull --rebase)

@mikaylagawarecki mikaylagawarecki added the release notes: nn and topic: improvements labels Jul 14, 2023
@janEbert
Contributor Author

Thank you so much @mikaylagawarecki, that's an amazing summary that clears up any misunderstandings!

@mikaylagawarecki
Contributor

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Jul 17, 2023
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here
pytorchmergebot pushed a commit that referenced this pull request Jul 26, 2023
As used by T5 and PaLM, citing "increased training stability for large models" (https://arxiv.org/abs/2204.02311).

Depends on #101683, which allows disabling bias for `LayerNorm`s. Marked as draft due to this.
Pull Request resolved: #101687
Approved by: https://github.com/mikaylagawarecki
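
For reference, a minimal sketch of what such an RMS-style normalization module could look like; this is only an illustration of the formula discussed above, not the implementation from #101687, and the class name `SimpleRMSNorm` and its defaults are hypothetical:

```python
import torch
import torch.nn as nn

class SimpleRMSNorm(nn.Module):
    """Illustrative RMS normalization: y = x / RMS(x) * gamma, no centering, no bias."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # gamma, like LayerNorm's weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # RMS(x) = sqrt(mean(x^2) + eps), computed over the last dimension.
        rms = torch.sqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x / rms * self.weight
```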