
Enhanced GPU discovery and multi-gpu support with concurrency #4517

Open · wants to merge 9 commits into main from gpu_incremental

Conversation

dhiltgen (Collaborator) commented on May 18, 2024

Carries #4266 and #4441 (and obsoletes them if we move this one forward first).

This refines our GPU discovery by splitting it into two phases: a bootstrap pass where we discover information about the GPUs once at startup, and an incremental refresh that only updates free VRAM, instead of fully rediscovering the GPUs over and over.
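
As a rough illustration of the split (the struct and function names here are illustrative, not the actual gpu package API), the shape is something like:

package gpu

import "sync"

// GpuInfo stands in for the per-GPU record built during bootstrap; the real
// struct carries more fields (library paths, compute capability, driver info).
type GpuInfo struct {
    ID          string
    TotalMemory uint64
    FreeMemory  uint64
}

var (
    bootstrapOnce sync.Once
    gpus          []GpuInfo
)

// GetGPUInfo runs the expensive discovery (loading vendor libraries,
// enumerating devices) exactly once, and on every later call refreshes only
// the cheap, frequently changing piece: free VRAM.
func GetGPUInfo() []GpuInfo {
    bootstrapOnce.Do(func() {
        gpus = bootstrapGPUs()
    })
    refreshFreeMemory(gpus)
    return gpus
}

// bootstrapGPUs would call into the vendor libraries (CUDA, ROCm, oneAPI)
// to enumerate devices; stubbed here.
func bootstrapGPUs() []GpuInfo { return nil }

// refreshFreeMemory would re-query just the free VRAM for each device; stubbed here.
func refreshFreeMemory(infos []GpuInfo) {}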

Fixes #3158
Fixes #4198
Fixes #3765

dhiltgen force-pushed the gpu_incremental branch 4 times, most recently from 05ba1ca to 91be1fa on May 20, 2024 at 20:50
dhiltgen marked this pull request as ready for review on May 20, 2024 at 23:44
dhiltgen force-pushed the gpu_incremental branch 4 times, most recently from ecde7d9 to d788717 on May 28, 2024 at 21:29
dhiltgen force-pushed the gpu_incremental branch 2 times, most recently from f02b076 to 076450a on May 30, 2024 at 20:13
dhiltgen marked this pull request as draft on May 30, 2024 at 20:45
dhiltgen marked this pull request as ready for review on May 30, 2024 at 22:01
dhiltgen force-pushed the gpu_incremental branch 3 times, most recently from 12b47a0 to bfbb50e on May 31, 2024 at 21:30
The amdgpu driver's free VRAM reporting omits usage from some other apps, so leverage the upstream DRM driver, which keeps better tabs on things.
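
For context, the upstream amdgpu DRM driver exposes per-device VRAM accounting through sysfs. A minimal Linux-only sketch of reading it (the exact files and filtering this change relies on may differ):

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "strconv"
    "strings"
)

// readSysfsUint reads a single unsigned integer from a sysfs file.
func readSysfsUint(path string) (uint64, error) {
    buf, err := os.ReadFile(path)
    if err != nil {
        return 0, err
    }
    return strconv.ParseUint(strings.TrimSpace(string(buf)), 10, 64)
}

func main() {
    // mem_info_vram_used reflects allocations from all processes on the
    // device, not just our own, which is what the scheduler needs.
    totals, _ := filepath.Glob("/sys/class/drm/card[0-9]*/device/mem_info_vram_total")
    for _, totalPath := range totals {
        total, err := readSysfsUint(totalPath)
        if err != nil {
            continue
        }
        used, err := readSysfsUint(filepath.Join(filepath.Dir(totalPath), "mem_info_vram_used"))
        if err != nil {
            continue
        }
        fmt.Printf("%s: %d MiB free of %d MiB\n",
            filepath.Dir(totalPath), (total-used)>>20, total>>20)
    }
}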
Now that we call the GPU discovery routines many times to
update memory, this splits initial discovery from free memory
updating.
This worked remotely but wound up trying to spawn multiple servers locally, which doesn't work.
Still not complete; needs some refinement to our prediction so it understands each discrete GPU's available space and how many layers fit in each one. Since we can't split a single layer across multiple GPUs, we can't treat free space as one logical block.
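
Illustratively, the per-GPU fitting this implies looks something like the greedy sketch below (a real estimate also has to cover KV cache and graph buffers, so treat this as a simplification):

package main

import "fmt"

// fitLayers greedily assigns whole layers to GPUs: a layer either fits
// entirely in one GPU's remaining free VRAM or it stays on the CPU.
// Free space on different GPUs cannot be pooled for a single layer.
func fitLayers(freeVRAM []uint64, layerSize uint64, totalLayers int) (perGPU []int, offloaded int) {
    perGPU = make([]int, len(freeVRAM))
    remaining := append([]uint64(nil), freeVRAM...)
    for l := 0; l < totalLayers; l++ {
        placed := false
        for i := range remaining {
            if remaining[i] >= layerSize {
                remaining[i] -= layerSize
                perGPU[i]++
                offloaded++
                placed = true
                break
            }
        }
        if !placed {
            break // remaining layers stay on the CPU
        }
    }
    return perGPU, offloaded
}

func main() {
    // Two GPUs with 10 GiB and 6 GiB free, 500 MiB layers, 40-layer model.
    perGPU, n := fitLayers([]uint64{10 << 30, 6 << 30}, 500<<20, 40)
    fmt.Println(perGPU, n) // [20 12] 32
}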
Our default behavior today is to try to fit into a single GPU if possible.
Some users would prefer the old behavior of always spreading across
multiple GPUs even if the model can fit into one.  This exposes that
tunable behavior.
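
A sketch of that selection logic follows; the OLLAMA_SCHED_SPREAD variable name is an assumption for illustration, not necessarily the exact knob this change adds:

package sched

import (
    "os"
    "sort"
)

// pickGPUs chooses which GPUs to load a model onto. By default we try to fit
// the whole model on a single GPU; if the user opts into spreading, or no
// single GPU is big enough, we fall back to using all of them.
func pickGPUs(freeVRAM map[string]uint64, required uint64) []string {
    all := make([]string, 0, len(freeVRAM))
    for id := range freeVRAM {
        all = append(all, id)
    }
    sort.Strings(all)

    // OLLAMA_SCHED_SPREAD is assumed here; the real tunable name may differ.
    if os.Getenv("OLLAMA_SCHED_SPREAD") != "" {
        return all
    }
    // Prefer the single smallest GPU that still fits the whole model.
    best := ""
    for _, id := range all {
        if freeVRAM[id] >= required && (best == "" || freeVRAM[id] < freeVRAM[best]) {
            best = id
        }
    }
    if best != "" {
        return []string{best}
    }
    return all // doesn't fit on one GPU, so spread across all of them
}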
Adjust timing on some tests so they don't time out on small/slow GPUs.

switch runtime.GOOS {
case "windows":
oneapiMgmtName = "ze_intel_gpu64.dll"
jmorganca (Member) commented on Jun 2, 2024

This DLL gets installed on Windows with Intel iGPUs as part of the OS base install and doesn't always open reliably – it seems to be causing some crashes on both Win10 and Win11 and so we may want to put this behind a flag until we resolve those issues
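
One way to do that, sketched below with an assumed OLLAMA_INTEL_GPU opt-in variable (not necessarily what gets settled on), is to skip the oneAPI probe unless the user explicitly enables it:

package gpu

import (
    "log/slog"
    "os"
    "runtime"
)

// oneapiMgmtName is the Level Zero loader library probed on Windows.
var oneapiMgmtName string

// initOneAPI gates the oneAPI probe behind an opt-in flag so a flaky
// ze_intel_gpu64.dll cannot take down discovery by default.
func initOneAPI() {
    if runtime.GOOS != "windows" {
        return
    }
    // OLLAMA_INTEL_GPU is an assumed name used for illustration only.
    if os.Getenv("OLLAMA_INTEL_GPU") == "" {
        slog.Debug("oneAPI discovery disabled; set OLLAMA_INTEL_GPU=1 to enable")
        return
    }
    oneapiMgmtName = "ze_intel_gpu64.dll"
    // ... continue with loading and probing the library here
}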


type RocmGPUInfo struct {
GpuInfo
usedFilepath string // nolint: unused

Suggested change:
- usedFilepath string // nolint: unused
+ usedFilepath string

I believe this is used now below

type RocmGPUInfo struct {
GpuInfo
usedFilepath string // nolint: unused
index int // nolint: unused

Suggested change:
- index int // nolint: unused
+ index int

@@ -232,6 +228,10 @@ func NewLlamaServer(gpus gpu.GpuInfoList, model string, ggml *GGML, adapters, pr

params = append(params, "--parallel", fmt.Sprintf("%d", numParallel))

if estimate.TensorSplit != "" {
params = append(params, "--tensor-split", estimate.TensorSplit)
jmorganca (Member) commented on Jun 2, 2024

This is super cool! Can't wait to try it more on 2x, 4x and 8x gpu systems
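
For readers unfamiliar with the flag: --tensor-split is a comma-separated list of per-GPU proportions passed through to the llama.cpp server, which normalizes them, so only the ratios matter. A sketch of deriving the string from free VRAM (not necessarily how the estimate code builds it):

package main

import (
    "fmt"
    "strings"
)

// tensorSplit builds the comma-separated per-GPU proportion string that
// llama.cpp's --tensor-split flag expects, here weighted by free VRAM.
func tensorSplit(freeVRAM []uint64) string {
    parts := make([]string, len(freeVRAM))
    for i, free := range freeVRAM {
        parts[i] = fmt.Sprintf("%d", free)
    }
    return strings.Join(parts, ",")
}

func main() {
    // Two GPUs with 24 GiB and 8 GiB free -> "25769803776,8589934592"
    fmt.Println(tensorSplit([]uint64{24 << 30, 8 << 30}))
}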

#ifndef GGML_USE_CUBLAS
fprintf(stderr, "warning: llama.cpp was compiled without cuBLAS. Setting the split mode has no effect.\n");
#endif // GGML_USE_CUBLAS
#ifndef GGML_USE_CUDA

good catch on these...

jmorganca (Member) left a review comment

Overall looks great! Small comment re: some oneAPI DLL open panics we are seeing on Windows boxes with iGPUs; we'd want to avoid making that part of the critical path until we resolve this.
