Loss Scaling Free ((free)) | RELIABLE · Overview |

In the world of Deep Learning, Mixed Precision (MP) training has become the standard for speeding up models and reducing memory usage. By using 16-bit floating point numbers (FP16) instead of 32-bit (FP32), we get massive performance gains.

Here are some best practices for using loss scaling: loss scaling free

While the large gradient values are clipped (handled by gradient clipping), the small gradient values pose a different threat. During backpropagation, many gradients are tiny. In FP16, if a gradient value falls below that $\approx 6 \times 10^-5$ threshold, it doesn't just get rounded—it becomes . In the world of Deep Learning, Mixed Precision

return scaled_loss

BF16 has the , so gradients rarely underflow — even without loss scaling. The tradeoff: less precision (7 vs 10 mantissa bits), but for most deep learning tasks, BF16’s precision is sufficient. During backpropagation, many gradients are tiny

Loss scaling can be applied to various deep learning tasks, including: