Per the discussion in https://arxiv.org/abs/1805.02867, I am wondering whether there is still a potential performance gain from the 2-pass online softmax. Flash attention, which is already available in this project, already fuses the softmax using the online normalizer, but wherever the standalone softmax op is still used there may be something to gain.
Whether it pays off ultimately depends on the model architecture and on the project's implementation. I hope someone can help with the analysis.
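For reference, here is a minimal scalar sketch of the algorithm from the paper, comparing the classic 3-pass softmax with the fused 2-pass online version. The function names are illustrative only and are not taken from this project's code.

```cpp
// Minimal scalar sketch of the online softmax from arXiv:1805.02867.
// The classic version needs 3 passes over the row; the online normalizer
// fuses the max and the denominator into one pass, leaving 2 passes total.
#include <cmath>
#include <cstddef>
#include <vector>

// Classic 3-pass softmax: max, denominator, normalize.
std::vector<float> softmax_3pass(const std::vector<float> & x) {
    float m = -INFINITY;
    for (float v : x) m = std::fmax(m, v);          // pass 1: max
    float d = 0.0f;
    for (float v : x) d += std::exp(v - m);         // pass 2: denominator
    std::vector<float> y(x.size());
    for (size_t i = 0; i < x.size(); ++i) {         // pass 3: normalize
        y[i] = std::exp(x[i] - m) / d;
    }
    return y;
}

// Online 2-pass softmax: the running max and the running denominator are
// maintained together, so the first two passes above are fused into one.
std::vector<float> softmax_2pass(const std::vector<float> & x) {
    float m = -INFINITY;  // running max
    float d = 0.0f;       // running denominator, relative to the current max
    for (float v : x) {
        const float m_new = std::fmax(m, v);
        d = d*std::exp(m - m_new) + std::exp(v - m_new);  // rescale old sum, add new term
        m = m_new;
    }
    std::vector<float> y(x.size());
    for (size_t i = 0; i < x.size(); ++i) {         // single remaining pass: normalize
        y[i] = std::exp(x[i] - m) / d;
    }
    return y;
}
```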
Another paper that presents a similar 2-pass softmax algorithm is https://arxiv.org/abs/2001.04438, though their focus is on CPUs.
I tried implementing it for CUDA/HIP to see what the performance would look like. On an RX 5700XT, test-backend-ops for softmax showed roughly a 0.93-1.02x speedup compared to master, depending on the case. On a GTX 1050, performance was much worse, generally around a 0.80x speedup, with some outliers in both directions.
For very large tensors that need global memory instead of shared memory, which aren't compiled into test-backend-ops by default, performance was around 20% to 40% faster than master on both GPUs.
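For anyone curious what the fused reduction could look like on the GPU side, below is a rough one-block-per-row, shared-memory sketch. It is only an illustration of the general approach with assumed names (softmax_online_row, BLOCK_SIZE); it is not the kernel benchmarked above and makes no claim about its performance.

```cuda
// Illustrative one-block-per-row kernel using a fused (max, denominator)
// reduction in shared memory. Names are assumptions for this sketch only.
#include <math.h>

#define BLOCK_SIZE 256

__global__ void softmax_online_row(const float * x, float * y, int ncols) {
    const int row = blockIdx.x;
    const int tid = threadIdx.x;
    const float * xr = x + (size_t) row*ncols;
    float       * yr = y + (size_t) row*ncols;

    // Pass 1: each thread scans a strided slice of the row, keeping a running
    // (max, denominator) pair exactly as in the scalar 2-pass version.
    float m = -INFINITY;
    float d = 0.0f;
    for (int i = tid; i < ncols; i += BLOCK_SIZE) {
        const float v     = xr[i];
        const float m_new = fmaxf(m, v);
        d = d*expf(m - m_new) + expf(v - m_new);
        m = m_new;
    }

    // Tree reduction of the per-thread partial pairs in shared memory.
    __shared__ float sm[BLOCK_SIZE];
    __shared__ float sd[BLOCK_SIZE];
    sm[tid] = m;
    sd[tid] = d;
    __syncthreads();
    for (int s = BLOCK_SIZE/2; s > 0; s >>= 1) {
        if (tid < s) {
            const float m_a = sm[tid],     d_a = sd[tid];
            const float m_b = sm[tid + s], d_b = sd[tid + s];
            const float m_new = fmaxf(m_a, m_b);
            float d_new = 0.0f;
            // guards avoid (-inf) - (-inf) = NaN for threads that saw no elements
            if (m_a > -INFINITY) d_new += d_a*expf(m_a - m_new);
            if (m_b > -INFINITY) d_new += d_b*expf(m_b - m_new);
            sm[tid] = m_new;
            sd[tid] = d_new;
        }
        __syncthreads();
    }
    const float m_row = sm[0];
    const float d_row = sd[0];

    // Pass 2: normalize.
    for (int i = tid; i < ncols; i += BLOCK_SIZE) {
        yr[i] = expf(xr[i] - m_row)/d_row;
    }
}

// launch: softmax_online_row<<<nrows, BLOCK_SIZE>>>(x, y, ncols);
```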