Keep Attention Softmax FP32 during FP16/ZeRO Training #1474
-
| Hi all. Recent findings from GLM-130B and researchers at Tsinghua indicate that keeping the attention softmax in FP32 during FP16 training with ZeRO leads to much greater stability at scale. 
 Since ColossalAI manages floating-point precision during training, is there a recommended way to ensure that the softmax stays in FP32 and is not automatically overridden by the engine when fp16/ZeRO is initialized? That way, fp16 and ZeRO could both be enabled in the configuration while maintaining numerical stability. Thank you, Enrico | 
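For readers who want something concrete, here is a minimal sketch of the idea in plain PyTorch. It is not a ColossalAI API: `attention_probs_fp32` is a hypothetical helper name, and whether a given fp16/ZeRO engine leaves these explicit casts untouched is an assumption that would need to be checked against the engine in use.

```python
import math

import torch
import torch.nn.functional as F


def attention_probs_fp32(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: compute attention probabilities in FP32.

    q and k have shape (..., seq_len, head_dim) and may be FP16.
    """
    head_dim = q.size(-1)
    # Upcast before the matmul so large Q.K dot products cannot overflow FP16.
    scores = torch.matmul(q.float(), k.float().transpose(-2, -1))
    scores = scores / math.sqrt(head_dim)
    # The softmax itself runs in FP32 because its input is FP32.
    probs = F.softmax(scores, dim=-1)
    # Cast back so the subsequent matmul with V runs in the training dtype.
    return probs.to(q.dtype)
```

Only the scores and the softmax are kept in FP32 here; the rest of the attention block can stay in FP16, so the extra memory cost is limited to the temporary score tensor.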
Replies: 3 comments 1 reply
-
| Thanks for your advice! We will run more experiments on this technique. | 
-
| Could you create an issue? | 
-
| I just re-discovered that FP16 can be quite unstable when computing attention using Q and K. I had been getting NaN values for a day, and I finally realized that switching to FP32 in the attention computation resolved the issue. | 
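The failure mode described in this comment can be reproduced on a toy case. The sketch below uses NumPy with made-up values (they are illustrative assumptions, not numbers from the thread): one Q·K dot product exceeds the FP16 maximum of 65504, becomes `inf`, and the softmax then turns it into NaN, while the same computation in FP32 stays finite.

```python
import math

import numpy as np


def softmax(x):
    # Standard max-subtraction softmax; inf - inf yields NaN if a score overflowed.
    shifted = x - x.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


def attention_probs(q, k, compute_dtype):
    """Attention probabilities with scores and softmax in compute_dtype."""
    head_dim = q.shape[-1]
    with np.errstate(over="ignore", invalid="ignore"):
        scores = q.astype(compute_dtype) @ k.astype(compute_dtype).T
        scores = scores / math.sqrt(head_dim)
        return softmax(scores)


# Illustrative values: one query and two keys with head_dim = 64.
q = np.full((1, 64), 40.0, dtype=np.float16)
k = np.full((2, 64), 40.0, dtype=np.float16)
k[1] *= 0.5  # the second key is smaller, so its score still fits in FP16

probs_fp16 = attention_probs(q, k, np.float16)
probs_fp32 = attention_probs(q, k, np.float32)

print("FP16 contains NaN:", bool(np.isnan(probs_fp16).any()))
print("FP32 contains NaN:", bool(np.isnan(probs_fp32).any()))
```

The first dot product is 64 × 40 × 40 = 102400, which is above the FP16 range, so the overflow happens before the 1/√d scaling has a chance to shrink it.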