
sched : support async weight copy #7315

Draft · wants to merge 1 commit into master from sl/async-weight-copy
Conversation

@slaren (Collaborator) commented May 15, 2024

Adds support for copying the weights asynchronously with partial offload, so that the next weight can be uploaded while the current one is being used.
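To illustrate the general technique (a minimal double-buffering sketch using the CUDA runtime API, not the actual scheduler code from this PR; `host_weight`, `weight_size`, `max_weight_size`, and `run_matmul` are hypothetical placeholders):

```c
// Sketch: overlap weight uploads with compute using two staging buffers.
// While the compute stream consumes weight i from staging[i % 2], the copy
// stream uploads weight i + 1 into the other buffer. Events enforce the
// cross-stream dependencies.
cudaStream_t compute_stream, copy_stream;
cudaEvent_t  uploaded[2], consumed[2];
void *       staging[2]; // device staging buffers, one per in-flight weight

cudaStreamCreate(&compute_stream);
cudaStreamCreate(&copy_stream);
for (int b = 0; b < 2; b++) {
    cudaEventCreate(&uploaded[b]);
    cudaEventCreate(&consumed[b]);
    cudaMalloc(&staging[b], max_weight_size);
}

for (int i = 0; i < n_weights; i++) {
    const int buf = i % 2;
    // the copy stream must wait until compute is done reading this buffer
    if (i >= 2) {
        cudaStreamWaitEvent(copy_stream, consumed[buf], 0);
    }
    // upload the next weight asynchronously; host_weight should be pinned memory
    cudaMemcpyAsync(staging[buf], host_weight[i], weight_size[i],
                    cudaMemcpyHostToDevice, copy_stream);
    cudaEventRecord(uploaded[buf], copy_stream);

    // compute waits only for its own weight, not for later uploads
    cudaStreamWaitEvent(compute_stream, uploaded[buf], 0);
    run_matmul(compute_stream, staging[buf]); // hypothetical compute call
    cudaEventRecord(consumed[buf], compute_stream);
}
```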

Benchmarks with `-ngl 0 -fa 1`:

| GPU | Model | Test | t/s master | t/s sl/async-weight-copy | Speedup |
| --- | --- | --- | --- | --- | --- |
| RTX 3090 Ti | llama 7B Q4_0 | pp512 | 750.72 | 956.93 | 1.27 |
| RTX 3090 Ti | llama 7B Q4_0 | pp1024 | 1241.37 | 1656.96 | 1.33 |
| RTX 3090 Ti | llama 7B Q4_0 | pp2048 | 1543.42 | 2492.75 | 1.62 |
| RTX 3090 Ti | llama 7B Q4_0 | pp4096 | 1830.87 | 2478.10 | 1.35 |
| RTX 3090 Ti | llama 7B Q4_0 | pp8192 | 1868.97 | 2205.80 | 1.18 |

While it improves performance significantly, it is still far below what should be possible, because the KV cache is still copied synchronously; that results in a stall in every layer, which pretty much destroys the performance. Fixing that is going to be more complicated.

@mofosyne added the labels `performance` (speed related topics) and `review complexity : medium` (generally require more time to grok, but manageable by beginner-to-medium expertise level) on May 16, 2024
@JohannesGaessler (Collaborator) commented:
The code changes make sense to me but my understanding of the ggml backend code is not very good.

Comment on lines +1571 to +1573:

```c
sched->copy_streams[cur_backend_id][split->w_copy_stream_id].max_size = MAX(
    sched->copy_streams[cur_backend_id][split->w_copy_stream_id].max_size,
    ggml_backend_buft_get_alloc_size(sched->bufts[cur_backend_id], src));
```
Owner commented:

minor : `ggml_backend_buft_get_alloc_size()` will be evaluated twice due to the `MAX` macro
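One way to address it (a sketch only; the surrounding expressions are taken from the quoted diff) is to hoist the call into a local, so the textually-expanding `MAX` macro only evaluates it once:

```c
// evaluate the allocation size once instead of twice inside MAX()
const size_t alloc_size = ggml_backend_buft_get_alloc_size(sched->bufts[cur_backend_id], src);
sched->copy_streams[cur_backend_id][split->w_copy_stream_id].max_size = MAX(
    sched->copy_streams[cur_backend_id][split->w_copy_stream_id].max_size, alloc_size);
```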

@Dampfinchen commented May 17, 2024

It appears this increases VRAM usage. If so, I believe it's important to make this behavior optional should the PR get merged. The slowdown in text generation might outweigh the speedup in prompt processing. For example, here's a benchmark from someone on the KoboldAI Discord server with an RTX 2060, using a custom KoboldCpp build that includes this PR:

[screenshots: KoboldCpp benchmark results on master vs. a build with this PR]

As we can see, that hardware was only able to offload 25 layers instead of 30 on a 10.7B model. So while the prompt-processing speedup is impressive, text generation is noticeably slower, because the higher VRAM usage means fewer layers can be offloaded. The result is that the PR build is slower overall (2.42 t/s vs. 2.53 t/s).

As for my own tests, I was no longer able to offload 5 layers on Mixtral with this PR, so I have little reason to doubt these findings. If you wish, though, I can run a more apples-to-apples comparison between master and this PR in due time.

Otherwise, great work on this PR, as always. I think it's a great option for people who prioritize prompt-processing speed over text generation, but since it has this drawback, I suggest exposing it as a command-line option.

I'm interested to hear your thoughts!

@slaren (Collaborator, Author) commented May 17, 2024

It will always use more memory, since it requires reserving enough VRAM for multiple weights at the same time instead of only one. The number of copy streams can be configured with GGML_SCHED_MAX_COPY_STREAMS. It is currently set to 8 because at this point I am only concerned with performance, not memory usage, but with some optimizations it should be possible to reduce it to 2 with the same performance.
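As a back-of-the-envelope illustration of the trade-off (assuming each copy stream reserves a staging buffer sized to the largest weight it uploads; the helper below is hypothetical):

```c
#ifndef GGML_SCHED_MAX_COPY_STREAMS
#define GGML_SCHED_MAX_COPY_STREAMS 8 // current default in this PR; 2 is the eventual target
#endif

// extra VRAM grows linearly with the stream count: with a 100 MiB largest
// weight, 8 streams stage ~800 MiB, while 2 streams would stage ~200 MiB
static size_t copy_stream_vram_overhead(size_t max_weight_alloc_size) {
    return (size_t) GGML_SCHED_MAX_COPY_STREAMS * max_weight_alloc_size;
}
```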
