
16-bit precision in Trainer leading to NaN #20619

@runyournode

Description & Motivation

In Trainer, when using precision="16-true" or precision="16-mixed", small float values (roughly anything below 1e-7 to 1e-8) can be rounded to zero in input batches, model parameters and their gradients, and optimizer hyper-parameters, leading to NaN.
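For illustration, here is a standalone snippet (not tied to Lightning) showing that values below float16's smallest subnormal, roughly 6e-8, flush to zero when cast down from float32, while bfloat16 keeps them because it shares float32's exponent range:

```python
import torch

x = torch.tensor([1e-6, 1e-7, 1e-8, 1e-9], dtype=torch.float32)

# float16: the smallest normal value is ~6.1e-5 (torch.finfo(torch.float16).tiny)
# and the smallest subnormal is ~6e-8, so 1e-8 and 1e-9 round to exactly zero.
print(x.to(torch.float16))   # 1e-8 and 1e-9 become 0.0

# bfloat16 has the same exponent range as float32, so these values survive
# (with reduced mantissa precision) -- consistent with bf16-* training being fine.
print(x.to(torch.bfloat16))  # all entries stay non-zero
```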

This does not raise any warning or error during training, as the training loop ignores NaNs.
After training, in inference mode, every output can be NaN (which makes for a rather poor model).

After debugging for several days, I found problems in my parameter weights (both with blank initialization and with pre-trained weights), but also in the very commonly used Adam optimizer: with its default eps=1e-8 it was updating all my parameters to NaN in the very first step, even with the learning rate set to 0.
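A minimal reproduction of the Adam behaviour outside Lightning, under the assumption that the cause is eps and the gradient both underflowing once everything is float16 (as with 16-true):

```python
import torch

# A float16 parameter, as you get with precision="16-true".
p = torch.nn.Parameter(torch.ones(3, dtype=torch.float16))
# lr=0 on purpose: even a zero step size does not prevent the NaN below.
opt = torch.optim.Adam([p], lr=0.0, eps=1e-8)

# A gradient small enough to underflow to exactly 0.0 in float16.
p.grad = torch.full_like(p, 1e-9)

opt.step()
print(p)  # expected: NaN everywhere. eps=1e-8 also flushes to 0 in float16,
          # so the update divides 0 by (sqrt(0) + 0), and 0 * NaN is still NaN.
```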

Pitch

I would suggest adding a caveat to the documentation warning users that they have to account for every possible small value in their model, optimizer, and input data. In particular, changing the optimizer's default values seems to be required (in my case at least), as in the sketch below.
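To make that concrete, here is a sketch of the kind of change I mean (the LightningModule below is a made-up example; the relevant line is the eps override in configure_optimizers):

```python
import torch
import lightning as L


class TinyModule(L.LightningModule):  # made-up example module
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        # Adam's default eps=1e-8 flushes to zero in float16; 1e-4 is
        # representable in half precision and keeps the denominator finite.
        return torch.optim.Adam(self.parameters(), lr=1e-3, eps=1e-4)


trainer = L.Trainer(precision="16-true", max_steps=10)
```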

Some default warnings when autocasting to lower precision leads to over- or underflow would also be nice.
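As a user-side stopgap, something along these lines can already be bolted on with a callback (a rough sketch; NonFiniteWarning is my own name, not a Lightning feature):

```python
import warnings

import torch
import lightning as L


class NonFiniteWarning(L.Callback):  # hypothetical helper, not part of Lightning
    """Warn as soon as a gradient or parameter stops being finite."""

    def on_after_backward(self, trainer, pl_module):
        for name, param in pl_module.named_parameters():
            if param.grad is not None and not torch.isfinite(param.grad).all():
                warnings.warn(f"non-finite gradient in {name} at step {trainer.global_step}")

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        for name, param in pl_module.named_parameters():
            if not torch.isfinite(param).all():
                warnings.warn(f"non-finite values in parameter {name} at step {trainer.global_step}")


# usage: L.Trainer(precision="16-mixed", callbacks=[NonFiniteWarning()])
```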

Thanks for the nice lib, by the way.

(edit: added the thanks and details about which precisions lead to NaNs; bf16-mixed and bf16-true were computing fine in my case)

Alternatives

No response

Additional context

No response

cc @lantiga @Borda


Labels: docs (Documentation related), feature (Is an improvement or enhancement), precision: half (Half-precision)
