
Possible performance boost with 2-pass online softmax #7306

Open
zixuanweeei opened this issue May 15, 2024 · 2 comments

@zixuanweeei

Per the discussion in https://arxiv.org/abs/1805.02867, I am wondering whether there is still a potential performance boost from the 2-pass online softmax. Flash attention, which is already enabled in this project, already fuses the softmax using the online normalizer, but wherever the standalone softmax op is still used there may be some benefit.
Whether it pays off ultimately depends on the model architecture and the project's implementation. I hope someone could help with the analysis.
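For reference, here is a minimal CPU-side sketch of the 2-pass online softmax from the paper, contrasted with the classic 3-pass formulation. It is illustrative only, not llama.cpp's actual kernel, and the function names are made up for this example:

```cpp
// Illustrative only: reference 3-pass softmax vs. the 2-pass online softmax
// from https://arxiv.org/abs/1805.02867. Not llama.cpp's actual kernel.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <limits>
#include <vector>

// Classic 3-pass softmax: pass 1 finds the max, pass 2 sums the exponentials,
// pass 3 normalizes.
static std::vector<float> softmax_3pass(const std::vector<float> & x) {
    const float m = *std::max_element(x.begin(), x.end());
    float d = 0.0f;
    for (float v : x) d += std::exp(v - m);
    std::vector<float> y(x.size());
    for (size_t i = 0; i < x.size(); ++i) y[i] = std::exp(x[i] - m) / d;
    return y;
}

// 2-pass online softmax: pass 1 keeps a running maximum m and a running
// normalizer d, rescaling d whenever the maximum grows; pass 2 normalizes.
static std::vector<float> softmax_2pass_online(const std::vector<float> & x) {
    float m = -std::numeric_limits<float>::infinity();
    float d = 0.0f;
    for (float v : x) {
        const float m_new = std::max(m, v);
        d = d * std::exp(m - m_new) + std::exp(v - m_new);
        m = m_new;
    }
    std::vector<float> y(x.size());
    for (size_t i = 0; i < x.size(); ++i) y[i] = std::exp(x[i] - m) / d;
    return y;
}

int main() {
    const std::vector<float> x = {1.0f, 2.0f, 3.0f, 4.0f};
    const auto a = softmax_3pass(x);
    const auto b = softmax_2pass_online(x);
    for (size_t i = 0; i < x.size(); ++i) {
        std::printf("%zu: 3-pass %.6f  2-pass %.6f\n", i, a[i], b[i]);
    }
    return 0;
}
```

The saving is one fewer sweep over the input: the 2-pass version reads each row twice instead of three times, which should matter most when the row doesn't fit in cache or shared memory.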

@Engininja2
Contributor

Another paper that presents a similar 2-pass softmax algorithm is https://arxiv.org/abs/2001.04438 though their focus is on CPUs.

I tried implementing it for CUDA/HIP to see what the performance would look like. On an RX 5700 XT, the softmax cases in test-backend-ops showed roughly a 0.93-1.02x speedup over master depending on the case. On a GTX 1050 performance was much worse, generally around a 0.80x speedup with some outliers in both directions.

For very large tensors that need global memory instead of shared memory, which aren't compiled into test-backend-ops by default, performance was around 20% to 40% faster than master on both GPUs.

You can look at this branch if you're interested: https://github.com/Engininja2/llama.cpp/tree/2pass-softmax

@zixuanweeei
Author

Hi @Engininja2. Thanks for the comments. I tried an initial implementation based on the online normalizer in https://arxiv.org/abs/1805.02867, which performs better than https://arxiv.org/abs/2001.04438 on almost all the default cases in test-backend-ops. You can give it a try if you're interested: https://github.com/zixuanweeei/llama.cpp/tree/zx/two-pass-softmax
