Per the discussion in https://arxiv.org/abs/1805.02867, I am wondering whether there is still a potential performance gain from the 2-pass online softmax. Flash attention, which is already available in this project, already fuses the softmax using the online normalizer, but wherever the standalone softmax op is still used there may be something to gain.
Whether it pays off ultimately depends on the model architecture and on the project's implementation. I hope someone can help with the analysis.
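For reference, here is a minimal scalar sketch of the algorithm from the paper, comparing the classic 3-pass softmax with the fused 2-pass online version. The function names are illustrative only and are not taken from this project's code.

```cpp
// Minimal scalar sketch of the online softmax from arXiv:1805.02867.
// The classic version needs 3 passes over the row; the online normalizer
// fuses the max and the denominator into one pass, leaving 2 passes total.
#include <cmath>
#include <cstddef>
#include <vector>

// Classic 3-pass softmax: max, denominator, normalize.
std::vector<float> softmax_3pass(const std::vector<float> & x) {
    float m = -INFINITY;
    for (float v : x) m = std::fmax(m, v);          // pass 1: max
    float d = 0.0f;
    for (float v : x) d += std::exp(v - m);         // pass 2: denominator
    std::vector<float> y(x.size());
    for (size_t i = 0; i < x.size(); ++i) {         // pass 3: normalize
        y[i] = std::exp(x[i] - m) / d;
    }
    return y;
}

// Online 2-pass softmax: the running max and the running denominator are
// maintained together, so the first two passes above are fused into one.
std::vector<float> softmax_2pass(const std::vector<float> & x) {
    float m = -INFINITY;  // running max
    float d = 0.0f;       // running denominator, relative to the current max
    for (float v : x) {
        const float m_new = std::fmax(m, v);
        d = d*std::exp(m - m_new) + std::exp(v - m_new);  // rescale old sum, add new term
        m = m_new;
    }
    std::vector<float> y(x.size());
    for (size_t i = 0; i < x.size(); ++i) {         // single remaining pass: normalize
        y[i] = std::exp(x[i] - m) / d;
    }
    return y;
}
```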
Another paper that presents a similar 2-pass softmax algorithm is https://arxiv.org/abs/2001.04438, though their focus is on CPUs.
I tried implementing it for CUDA/HIP to see what the performance would look like. On an RX 5700XT, test-backend-ops for softmax showed roughly a 0.93-1.02x speedup compared to master, depending on the case. On a GTX 1050, performance was much worse, generally around a 0.80x speedup, with some outliers in both directions.
For very large tensors that need global memory instead of shared memory, which aren't compiled into test-backend-ops by default, performance was around 20% to 40% faster than master on both GPUs.
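For anyone curious what the fused reduction could look like on the GPU side, below is a rough one-block-per-row, shared-memory sketch. It is only an illustration of the general approach with assumed names (softmax_online_row, BLOCK_SIZE); it is not the kernel benchmarked above and makes no claim about its performance.

```cuda
// Illustrative one-block-per-row kernel using a fused (max, denominator)
// reduction in shared memory. Names are assumptions for this sketch only.
#include <math.h>

#define BLOCK_SIZE 256

__global__ void softmax_online_row(const float * x, float * y, int ncols) {
    const int row = blockIdx.x;
    const int tid = threadIdx.x;
    const float * xr = x + (size_t) row*ncols;
    float       * yr = y + (size_t) row*ncols;

    // Pass 1: each thread scans a strided slice of the row, keeping a running
    // (max, denominator) pair exactly as in the scalar 2-pass version.
    float m = -INFINITY;
    float d = 0.0f;
    for (int i = tid; i < ncols; i += BLOCK_SIZE) {
        const float v     = xr[i];
        const float m_new = fmaxf(m, v);
        d = d*expf(m - m_new) + expf(v - m_new);
        m = m_new;
    }

    // Tree reduction of the per-thread partial pairs in shared memory.
    __shared__ float sm[BLOCK_SIZE];
    __shared__ float sd[BLOCK_SIZE];
    sm[tid] = m;
    sd[tid] = d;
    __syncthreads();
    for (int s = BLOCK_SIZE/2; s > 0; s >>= 1) {
        if (tid < s) {
            const float m_a = sm[tid],     d_a = sd[tid];
            const float m_b = sm[tid + s], d_b = sd[tid + s];
            const float m_new = fmaxf(m_a, m_b);
            float d_new = 0.0f;
            // guards avoid (-inf) - (-inf) = NaN for threads that saw no elements
            if (m_a > -INFINITY) d_new += d_a*expf(m_a - m_new);
            if (m_b > -INFINITY) d_new += d_b*expf(m_b - m_new);
            sm[tid] = m_new;
            sd[tid] = d_new;
        }
        __syncthreads();
    }
    const float m_row = sm[0];
    const float d_row = sd[0];

    // Pass 2: normalize.
    for (int i = tid; i < ncols; i += BLOCK_SIZE) {
        yr[i] = expf(xr[i] - m_row)/d_row;
    }
}

// launch: softmax_online_row<<<nrows, BLOCK_SIZE>>>(x, y, ncols);
```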