
16-bit precision in Trainer leading to NaN #20619

@runyournode

Description & Motivation

In Trainer, when using precision="16-true" or precision="16-mixed", small float values (roughly anything below 1e-7 to 1e-8) can be rounded to zero in input batches, model parameters and their gradients, and optimizer hyper-parameters, leading to NaN.
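For illustration, here is a standalone snippet (not tied to Lightning) showing that values below float16's smallest subnormal, roughly 6e-8, flush to zero when cast down from float32, while bfloat16 keeps them because it shares float32's exponent range:

```python
import torch

x = torch.tensor([1e-6, 1e-7, 1e-8, 1e-9], dtype=torch.float32)

# float16: the smallest normal value is ~6.1e-5 (torch.finfo(torch.float16).tiny)
# and the smallest subnormal is ~6e-8, so 1e-8 and 1e-9 round to exactly zero.
print(x.to(torch.float16))   # 1e-8 and 1e-9 become 0.0

# bfloat16 has the same exponent range as float32, so these values survive
# (with reduced mantissa precision) -- consistent with bf16-* training being fine.
print(x.to(torch.bfloat16))  # all entries stay non-zero
```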

This does not raise any warning or error during training, as the training loop ignores NaNs.
After training, in inference mode, every output can be NaN (which makes for a rather poor model).

After debugging for several days, I found problems in my parameter weights (both with blank initialization and with pre-trained weights), but also in the very commonly used Adam optimizer: with its default eps=1e-8 it was updating all my parameters to NaN in the very first step, even with the learning rate set to 0.
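A minimal reproduction of the Adam behaviour outside Lightning, under the assumption that the cause is eps and the gradient both underflowing once everything is float16 (as with 16-true):

```python
import torch

# A float16 parameter, as you get with precision="16-true".
p = torch.nn.Parameter(torch.ones(3, dtype=torch.float16))
# lr=0 on purpose: even a zero step size does not prevent the NaN below.
opt = torch.optim.Adam([p], lr=0.0, eps=1e-8)

# A gradient small enough to underflow to exactly 0.0 in float16.
p.grad = torch.full_like(p, 1e-9)

opt.step()
print(p)  # expected: NaN everywhere. eps=1e-8 also flushes to 0 in float16,
          # so the update divides 0 by (sqrt(0) + 0), and 0 * NaN is still NaN.
```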

Pitch

I would suggest adding a caveat to the documentation warning users that they have to account for every possible small value in their model, optimizer, and input data. In particular, changing the optimizer's default values seems to be required (in my case at least), as in the sketch below.
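To make that concrete, here is a sketch of the kind of change I mean (the LightningModule below is a made-up example; the relevant line is the eps override in configure_optimizers):

```python
import torch
import lightning as L


class TinyModule(L.LightningModule):  # made-up example module
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        # Adam's default eps=1e-8 flushes to zero in float16; 1e-4 is
        # representable in half precision and keeps the denominator finite.
        return torch.optim.Adam(self.parameters(), lr=1e-3, eps=1e-4)


trainer = L.Trainer(precision="16-true", max_steps=10)
```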

Some default warnings when autocasting to lower precision leads to over- or underflow would also be nice.
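As a user-side stopgap, something along these lines can already be bolted on with a callback (a rough sketch; NonFiniteWarning is my own name, not a Lightning feature):

```python
import warnings

import torch
import lightning as L


class NonFiniteWarning(L.Callback):  # hypothetical helper, not part of Lightning
    """Warn as soon as a gradient or parameter stops being finite."""

    def on_after_backward(self, trainer, pl_module):
        for name, param in pl_module.named_parameters():
            if param.grad is not None and not torch.isfinite(param.grad).all():
                warnings.warn(f"non-finite gradient in {name} at step {trainer.global_step}")

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        for name, param in pl_module.named_parameters():
            if not torch.isfinite(param).all():
                warnings.warn(f"non-finite values in parameter {name} at step {trainer.global_step}")


# usage: L.Trainer(precision="16-mixed", callbacks=[NonFiniteWarning()])
```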

Thanks for the nice lib, by the way.

(edit: added the thanks and details about which precisions lead to NaNs; bf16-mixed and bf16-true were computing fine in my case)

Alternatives

No response

Additional context

No response

cc @lantiga @Borda


Labels: docs (Documentation related), feature (Is an improvement or enhancement), precision: half (Half-precision)
