
sched : support async weight copy #7315

Draft · wants to merge 1 commit into master from sl/async-weight-copy
Conversation

@slaren (Collaborator) commented May 15, 2024

Adds support for copying the weights asynchronously with partial offload, so that the next weight can be uploaded while the current one is being used.
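To illustrate the general technique (a minimal double-buffering sketch using the CUDA runtime API, not the actual scheduler code from this PR; `host_weight`, `weight_size`, `max_weight_size`, and `run_matmul` are hypothetical placeholders):

```c
// Sketch: overlap weight uploads with compute using two staging buffers.
// While the compute stream consumes weight i from staging[i % 2], the copy
// stream uploads weight i + 1 into the other buffer. Events enforce the
// cross-stream dependencies.
cudaStream_t compute_stream, copy_stream;
cudaEvent_t  uploaded[2], consumed[2];
void *       staging[2]; // device staging buffers, one per in-flight weight

cudaStreamCreate(&compute_stream);
cudaStreamCreate(&copy_stream);
for (int b = 0; b < 2; b++) {
    cudaEventCreate(&uploaded[b]);
    cudaEventCreate(&consumed[b]);
    cudaMalloc(&staging[b], max_weight_size);
}

for (int i = 0; i < n_weights; i++) {
    const int buf = i % 2;
    // the copy stream must wait until compute is done reading this buffer
    if (i >= 2) {
        cudaStreamWaitEvent(copy_stream, consumed[buf], 0);
    }
    // upload the next weight asynchronously; host_weight should be pinned memory
    cudaMemcpyAsync(staging[buf], host_weight[i], weight_size[i],
                    cudaMemcpyHostToDevice, copy_stream);
    cudaEventRecord(uploaded[buf], copy_stream);

    // compute waits only for its own weight, not for later uploads
    cudaStreamWaitEvent(compute_stream, uploaded[buf], 0);
    run_matmul(compute_stream, staging[buf]); // hypothetical compute call
    cudaEventRecord(consumed[buf], compute_stream);
}
```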

Benchmarks with `-ngl 0 -fa 1`:

| GPU | Model | Test | t/s master | t/s sl/async-weight-copy | Speedup |
| --- | --- | --- | --- | --- | --- |
| RTX 3090 Ti | llama 7B Q4_0 | pp512 | 750.72 | 956.93 | 1.27 |
| RTX 3090 Ti | llama 7B Q4_0 | pp1024 | 1241.37 | 1656.96 | 1.33 |
| RTX 3090 Ti | llama 7B Q4_0 | pp2048 | 1543.42 | 2492.75 | 1.62 |
| RTX 3090 Ti | llama 7B Q4_0 | pp4096 | 1830.87 | 2478.10 | 1.35 |
| RTX 3090 Ti | llama 7B Q4_0 | pp8192 | 1868.97 | 2205.80 | 1.18 |

While it improves performance significantly, it is still far below what should be possible, because the KV cache is still copied synchronously; that results in a stall in every layer, which pretty much destroys the performance. Fixing that is going to be more complicated.

@mofosyne added the labels `performance` (speed related topics) and `review complexity : medium` (generally require more time to grok, but manageable by beginner-to-medium expertise level) on May 16, 2024
@JohannesGaessler (Collaborator) commented:
The code changes make sense to me but my understanding of the ggml backend code is not very good.

Comment on lines +1571 to +1573:

```c
sched->copy_streams[cur_backend_id][split->w_copy_stream_id].max_size = MAX(
    sched->copy_streams[cur_backend_id][split->w_copy_stream_id].max_size,
    ggml_backend_buft_get_alloc_size(sched->bufts[cur_backend_id], src));
```
Owner commented:

minor : `ggml_backend_buft_get_alloc_size()` will be evaluated twice due to the `MAX` macro
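One way to address it (a sketch only; the surrounding expressions are taken from the quoted diff) is to hoist the call into a local, so the textually-expanding `MAX` macro only evaluates it once:

```c
// evaluate the allocation size once instead of twice inside MAX()
const size_t alloc_size = ggml_backend_buft_get_alloc_size(sched->bufts[cur_backend_id], src);
sched->copy_streams[cur_backend_id][split->w_copy_stream_id].max_size = MAX(
    sched->copy_streams[cur_backend_id][split->w_copy_stream_id].max_size, alloc_size);
```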

@Dampfinchen commented May 17, 2024

It appears this increases VRAM usage. If so, I believe it's important to make this behavior optional should the PR get merged. The slowdown in text generation might outweigh the speedup in prompt processing. For example, here's a benchmark from someone on the KoboldAI Discord server with an RTX 2060, using a custom KoboldCpp build that includes this PR:

[screenshots: KoboldCpp benchmark results on master vs. a build with this PR]

As we can see, that hardware was only able to offload 25 layers instead of 30 on a 10.7B model. So while the prompt-processing speedup is impressive, text generation is noticeably slower, because the higher VRAM usage means fewer layers can be offloaded. The result is that the PR build is slower overall (2.42 t/s vs. 2.53 t/s).

As for my own tests, I was no longer able to offload 5 layers on Mixtral with this PR, so I have little reason to doubt these findings. If you wish, though, I can run a more apples-to-apples comparison between master and this PR in due time.

Otherwise, great work on this PR, as always. I think it's a great option for people who prioritize prompt-processing speed over text generation, but since it has this drawback, I suggest exposing it as a command-line option.

I'm interested to hear your thoughts!

@slaren (Collaborator, Author) commented May 17, 2024

It will always use more memory, since it requires reserving enough VRAM for multiple weights at the same time instead of only one. The number of copy streams can be configured with GGML_SCHED_MAX_COPY_STREAMS. It is currently set to 8 because at this point I am only concerned with performance, not memory usage, but with some optimizations it should be possible to reduce it to 2 with the same performance.
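As a back-of-the-envelope illustration of the trade-off (assuming each copy stream reserves a staging buffer sized to the largest weight it uploads; the helper below is hypothetical):

```c
#ifndef GGML_SCHED_MAX_COPY_STREAMS
#define GGML_SCHED_MAX_COPY_STREAMS 8 // current default in this PR; 2 is the eventual target
#endif

// extra VRAM grows linearly with the stream count: with a 100 MiB largest
// weight, 8 streams stage ~800 MiB, while 2 streams would stage ~200 MiB
static size_t copy_stream_vram_overhead(size_t max_weight_alloc_size) {
    return (size_t) GGML_SCHED_MAX_COPY_STREAMS * max_weight_alloc_size;
}
```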
