
Another threadpool: Avoid creating hundreds of threads in GGML #7342

Closed
besnardjb wants to merge 2 commits into master from thread_pool

Conversation

besnardjb

TL/DR:

While profiling the code I noticed that hundreds of threads were being created, which makes profiling impractical and may add some performance overhead. I then attempted to build a simple threadpool inside GGML to see whether it makes any difference at all. Gains at small scale are not notable, and inference does not change much. However, the overhead grows with the number of threads, and more importantly the code becomes easier to profile (there are no longer hundreds of threads to look at).

This is my first attempt at hacking on the code, so there may be cleaner or better ways; I have seen PRs #710 and #851, which are similar. I'm also not sure how to properly benchmark this; I did run ./bin/benchmark.
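For illustration, here is a minimal sketch of the kind of persistent pool this refers to (hypothetical names, not the PR's actual code): workers are created once and then woken with a semaphore instead of being re-created for every graph evaluation.

```c
// Hypothetical sketch of a persistent worker pool (illustrative names only).
// Workers are spawned once and parked on a semaphore between tasks instead of
// being created per graph evaluation.
#include <pthread.h>
#include <semaphore.h>
#include <stdatomic.h>

typedef struct {
    pthread_t  tid;
    sem_t      go;            // posted by the main thread when work is ready
    sem_t      done;          // posted by the worker when the task finishes
    void     (*fn)(void *);   // task to run
    void      *arg;
    atomic_int stop;          // set to 1 to shut the worker down
} pool_worker;

static void * pool_worker_main(void * p) {
    pool_worker * w = p;
    for (;;) {
        sem_wait(&w->go);                  // sleep until work (or shutdown) arrives
        if (atomic_load(&w->stop)) break;
        w->fn(w->arg);                     // run one task
        sem_post(&w->done);                // signal completion, go back to sleep
    }
    return NULL;
}

static void pool_worker_start(pool_worker * w) {
    sem_init(&w->go,   0, 0);
    sem_init(&w->done, 0, 0);
    atomic_store(&w->stop, 0);
    pthread_create(&w->tid, NULL, pool_worker_main, w);
}
```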

Perf on x86-64 (Genoa)

  • One socket:
    • ./bin/benchmark -t 96 -i 100
      • without: (last line) average flops 1497.53
      • with: (last line) average flops 2970.68
  • Two sockets:
    • ./bin/benchmark -t 192 -i 100
      • without: (last line) average flops 1044.79
      • with: (last line) average flops 3488.96

Perf on x86-64 (Ivy Bridge)

  • One socket:
    • ./bin/benchmark -t 18 -i 100
      • without: (last line) average flops 367.60
      • with: (last line) average flops 378.21
  • Two sockets:
    • ./bin/benchmark -t 36 -i 100
      • without: (last line) average flops 633.22
      • with: (last line) average flops 706.8

Contributor

github-actions bot commented May 17, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 553 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=8475.39ms p(95)=19968.52ms fails=, finish reason: stop=499 truncated=54
  • Prompt processing (pp): avg=104.29tk/s p(95)=476.74tk/s
  • Token generation (tg): avg=47.01tk/s p(95)=46.82tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=thread_pool commit=c711b028b03e74e3e63118d847c2646d95e66c34

Time-series charts for the bench-server-baseline run on Standard_NC4as_T4_v3 (duration=10m, 553 iterations) omitted; panels: llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, llamacpp:requests_processing.

@slaren
Collaborator

slaren commented May 17, 2024

You can use llama-bench to test the overall performance. I ran a test with and without KV offload since this is one case where the threads are started multiple times in each evaluation.

LLAMA_CUDA=1 scripts/compare-commits.sh master thread_pool -nkvo 0,1

| GPU         | Model         | NKVO |        Test | t/s master | t/s thread_pool | Speedup |
| ----------- | ------------- | ---: | ----------: | ---------: | --------------: | ------: |
| RTX 3090 Ti | llama 7B Q4_0 |    0 |       pp512 |    4313.72 |         4525.64 |    1.05 |
| RTX 3090 Ti | llama 7B Q4_0 |    0 |       tg128 |     154.96 |          154.17 |    0.99 |
| RTX 3090 Ti | llama 7B Q4_0 |    0 | pp512+tg128 |     622.75 |          625.94 |    1.01 |
| RTX 3090 Ti | llama 7B Q4_0 |    1 |       pp512 |     619.01 |          527.14 |    0.85 |
| RTX 3090 Ti | llama 7B Q4_0 |    1 |       tg128 |      21.55 |           49.80 |    2.31 |
| RTX 3090 Ti | llama 7B Q4_0 |    1 | pp512+tg128 |      84.56 |          135.97 |    1.61 |

@mofosyne mofosyne added the labels performance (Speed related topics) and Review Complexity: Medium (Generally require more time to grok but manageable by beginner to medium expertise level) May 17, 2024
@besnardjb besnardjb force-pushed the thread_pool branch 3 times, most recently from bf0722c to c9a4102 Compare May 17, 2024 21:29
@besnardjb
Author

besnardjb commented May 17, 2024

Thanks for the pointer to llama-bench, and sorry for missing it. I did run a few CPU tests on a dual-socket machine. I was not able to use the convenient compare-commits.sh script yet, as some warning output seems to get piped into the SQL.

Overall, there seem to be occasional gains, but they are close to measurement noise. I've also cleaned up the code a bit and tried to fix the Windows build (⚠️ without access to a Windows machine, unfortunately). For Windows, since there is apparently no direct equivalent of a semaphore, I used busy waiting for work in the threads to avoid extra dependencies (its impact is also measured below on Linux, just to see what it does).

Ran: ./bin/llama-bench -m ../models/7B/mistral-7b-instruct-v0.2.Q4_0.gguf -nkvo 0,1

ThreadPool (Semaphore)

| model                          |       size |     params | backend    |    threads |          nkvo |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ------------: | ------------: | ---------------: |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             0 |         pp512 |    248.15 ± 1.38 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             0 |         tg128 |     22.39 ± 0.20 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             0 |   pp512+tg128 |     81.91 ± 1.01 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             0 |         pp512 |    211.54 ± 8.78 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             0 |         tg128 |     19.73 ± 0.17 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             0 |   pp512+tg128 |     70.52 ± 0.57 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             1 |         pp512 |   241.72 ± 10.81 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             1 |         tg128 |     22.25 ± 0.45 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             1 |   pp512+tg128 |     82.73 ± 0.39 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             1 |         pp512 |    227.08 ± 2.27 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             1 |         tg128 |     19.42 ± 0.14 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             1 |   pp512+tg128 |     70.71 ± 0.77 |

Threadpool (ACTIVE Poll -- as for windows measured on Linux)

| model                          |       size |     params | backend    |    threads |          nkvo |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ------------: | ------------: | ---------------: |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             0 |         pp512 |    246.28 ± 0.34 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             0 |         tg128 |     24.11 ± 0.10 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             0 |   pp512+tg128 |     84.40 ± 0.54 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             0 |         pp512 |    224.57 ± 0.51 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             0 |         tg128 |     19.38 ± 0.23 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             0 |   pp512+tg128 |     71.38 ± 0.79 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             1 |         pp512 |    247.67 ± 1.07 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             1 |         tg128 |     24.73 ± 0.48 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             1 |   pp512+tg128 |     83.10 ± 0.75 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             1 |         pp512 |    236.53 ± 1.82 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             1 |         tg128 |     19.02 ± 0.08 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             1 |   pp512+tg128 |     69.73 ± 0.14 |

Master (27b0406)

| model                          |       size |     params | backend    |    threads |          nkvo |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ------------: | ------------: | ---------------: |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             0 |         pp512 |   246.89 ± 11.27 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             0 |         tg128 |     19.84 ± 0.01 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             0 |   pp512+tg128 |     75.51 ± 0.43 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             0 |         pp512 |    238.93 ± 6.04 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             0 |         tg128 |     18.17 ± 0.15 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             0 |   pp512+tg128 |     69.23 ± 0.48 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             1 |         pp512 |   253.94 ± 14.74 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             1 |         tg128 |     19.85 ± 0.02 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             1 |   pp512+tg128 |     75.34 ± 1.20 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             1 |         pp512 |   236.17 ± 16.98 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             1 |         tg128 |     18.25 ± 0.08 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             1 |   pp512+tg128 |     69.36 ± 0.58 |

@slaren
Copy link
Collaborator

slaren commented May 17, 2024

To use compare-commits.sh you need to have sqlite3 installed. The nkvo parameter only has an effect when offloading a model to a GPU: with nkvo, the attention part of each layer is run on the CPU and the rest on the GPU, which causes a call to ggml_graph_compute on every layer, so the threads have to be launched once per layer.

@besnardjb
Author

besnardjb commented May 18, 2024

Got it, I basically ran the same test twice 😆. For compare-commits.sh I did install sqlite3 and gitpython, but I hit two issues (pure CPU run again):

  • In ggml.c a warning (namely GGML_PRINT("WARNING: /proc/sys/kernel/numa_balancing is enabled, this has been observed to impair performance\n");) was sent to stdout and ended up mixed with the SQL; it should probably go to stderr, and the script should probably only capture stdout (see the sketch after this list).

  • I then tried commenting that line out, but ended up with an out-of-range error in scripts/compare-llama-bench.py on lines 302-303, and did not take the time to investigate further:

        gpu_blas = bool(rows_full[0][KEY_PROPERTIES.index("gpu_blas")])
        ngl = int(rows_full[0][KEY_PROPERTIES.index("n_gpu_layers")])
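A minimal sketch of the first point, assuming the warning is simply redirected to stderr (GGML_PRINT_WARN is a hypothetical macro name, not something that exists in ggml.c):

```c
// Hypothetical: route warnings to stderr so that tools piping stdout into
// sqlite3 (as compare-commits.sh does) do not pick them up.
#include <stdio.h>
#define GGML_PRINT_WARN(...) fprintf(stderr, __VA_ARGS__)
```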

Thank you, I will look into it further.

@forworldm

It looks like you are always waiting on a variable to change. Have you tried replacing sem_t and sched_yield with a futex? See syscall(SYS_futex, ...) on Linux and WaitOnAddress on Windows.
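For reference, a minimal sketch of what the suggested futex wait could look like on Linux (illustrative helper names; the Windows counterpart would use WaitOnAddress/WakeByAddressAll):

```c
// Hypothetical futex-based wait/wake helpers (Linux only). A thread sleeps in
// the kernel until the shared 32-bit value changes, instead of sem_wait or a
// sched_yield spin loop.
#include <limits.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static void futex_wait_while(atomic_uint * addr, unsigned expected) {
    // FUTEX_WAIT only blocks if *addr still equals `expected`; re-check in a
    // loop because wakeups can be spurious.
    while (atomic_load(addr) == expected) {
        syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
    }
}

static void futex_wake_all(atomic_uint * addr) {
    syscall(SYS_futex, addr, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
}
```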

@besnardjb
Author

That is a very good idea. I will look into it after finding myself a Windows VM. Also, I'm currently not freeing the threads; I will need to refcount init and release.
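A rough sketch of the refcounted lifetime mentioned here (illustrative names only, not the PR's actual API):

```c
// Hypothetical refcounted pool lifetime: the first acquire spawns the workers,
// the last release tears them down, so repeated init/free does not leak threads.
#include <stdatomic.h>

static atomic_int g_pool_users = 0;

void pool_acquire(void) {
    if (atomic_fetch_add(&g_pool_users, 1) == 0) {
        // first user: create the worker threads here
    }
}

void pool_release(void) {
    if (atomic_fetch_sub(&g_pool_users, 1) == 1) {
        // last user: signal shutdown and join the worker threads here
    }
}
```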

- Add OpenMP to dependency detection
- Add GGML_NO_OMP if OpenMP is disabled or not found
- Keep previous approach if there is no OMP
@besnardjb
Author

besnardjb commented May 20, 2024

After looking into the futex approach I realized it would add a lot of ifdefs and several lines of relatively complex code. I then realized that I was effectively trying to implement an OpenMP parallel region, so I resorted to trying OpenMP directly. This basically translates into a single #pragma omp and avoids the manual handling. What is also nice is that it could be applied to other for loops elsewhere in the code without repeating the whole threadpool machinery (the runtime provides it for free). I think it ends up simpler, but all comments are of course welcome.
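In essence the change boils down to something like the following simplified sketch (not the exact patch; `worker_state` and `compute_worker` stand in for the existing per-thread state and worker body in ggml_graph_compute):

```c
// Simplified sketch: let the OpenMP runtime own the worker threads instead of
// spawning pthreads on every graph evaluation. The runtime reuses its pool
// across calls, so threads are created only once.
#include <omp.h>

typedef struct {
    int ith;   // index of this worker
    int nth;   // total number of workers
    // ... per-thread compute state would live here ...
} worker_state;

static void compute_worker(worker_state * w) {
    // the per-node compute loop of ggml_graph_compute would go here
    (void) w;
}

static void run_graph(worker_state * workers, int n_threads) {
    #pragma omp parallel num_threads(n_threads)
    {
        const int j = omp_get_thread_num();
        workers[j].ith = j;
        workers[j].nth = n_threads;
        compute_worker(&workers[j]);
    }
}
```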

I hope it works well on Windows; let's see.

I also realized that the performance counters in ggml_graph_compute were not accounting for the thread-creation cost, so I moved them a few lines up on master to take the same measurement in both cases.

72 Cores (moved timings for thread creation)

| CPU                                       | Model         |        Test | t/s master | t/s openmp | Speedup |
| ----------------------------------------- | ------------- | ----------: | ---------: | ---------: | ------: |
| Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz | llama 7B Q4_0 |       pp512 |      47.17 |      58.55 |    1.24 |
| Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz | llama 7B Q4_0 |       tg128 |       9.08 |      13.96 |    1.54 |
| Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz | llama 7B Q4_0 | pp512+tg128 |      27.37 |      35.15 |    1.28 |

192 cores (moved timings for thread creation)

| CPU                             | Model         |        Test | t/s master | t/s openmp | Speedup |
| ------------------------------- | ------------- | ----------: | ---------: | ---------: | ------: |
| AMD EPYC 9654 96-Core Processor | llama 7B Q4_0 |       pp512 |     226.86 |     241.64 |    1.07 |
| AMD EPYC 9654 96-Core Processor | llama 7B Q4_0 |       tg128 |      18.24 |      21.42 |    1.17 |
| AMD EPYC 9654 96-Core Processor | llama 7B Q4_0 | pp512+tg128 |      69.03 |      76.94 |    1.11 |

Old Laptop 4 cores (moved timings for thread creation)

| CPU                                       | Model          |        Test | t/s master | t/s openmp | Speedup |
| ----------------------------------------- | -------------- | ----------: | ---------: | ---------: | ------: |
| Intel(R) Core(TM) i7-4712HQ CPU @ 2.30GHz | phi3 3B Q4_K_M |       pp512 |      13.07 |      13.28 |    1.02 |
| Intel(R) Core(TM) i7-4712HQ CPU @ 2.30GHz | phi3 3B Q4_K_M |       tg128 |       7.51 |       7.60 |    1.01 |
| Intel(R) Core(TM) i7-4712HQ CPU @ 2.30GHz | phi3 3B Q4_K_M | pp512+tg128 |      11.38 |      11.28 |    0.99 |

H100 (moved timings for thread creation)

Node is 96 cores.

| GPU              | Model         | NKVO |        Test | t/s master | t/s openmp | Speedup |
| ---------------- | ------------- | ---: | ----------: | ---------: | ---------: | ------: |
| NVIDIA H100 PCIe | llama 7B Q4_0 |    0 |       pp512 |    6366.18 |    6222.38 |    0.98 |
| NVIDIA H100 PCIe | llama 7B Q4_0 |    0 |       tg128 |     130.68 |     120.55 |    0.92 |
| NVIDIA H100 PCIe | llama 7B Q4_0 |    0 | pp512+tg128 |     549.15 |     515.65 |    0.94 |
| NVIDIA H100 PCIe | llama 7B Q4_0 |    1 |       pp512 |     601.44 |     711.63 |    1.18 |
| NVIDIA H100 PCIe | llama 7B Q4_0 |    1 |       tg128 |       4.94 |      33.61 |    6.81 |
| NVIDIA H100 PCIe | llama 7B Q4_0 |    1 | pp512+tg128 |      22.21 |     128.38 |    5.78 |

@github-actions github-actions bot added the build (Compilation issues) label May 20, 2024

/* The loop is reversed because in the NO_OMP case we want the worker threads
   to start before the main thread (j == 0) */
#pragma omp parallel for shared(workers,state_shared)
Collaborator

Might need to add a num_threads(n_threads) here to make sure that omp always launches all the threads, otherwise it will deadlock.

Author

@besnardjb besnardjb May 20, 2024

You are right, it is indeed fragile. I'm not even sure that setting the thread count covers all cases (https://www.openmp.org/spec-html/5.0/openmpsu35.html#x55-880002.6.1).
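One defensive option, sketched below, is to read the actual team size inside the region and size the work accordingly, since num_threads() is only a request and the runtime may still provide fewer threads:

```c
// Sketch: num_threads() is a request, not a guarantee (thread limits, dynamic
// adjustment and nesting can all reduce the team size). Reading the real team
// size inside the region avoids a barrier that waits for absent threads.
#include <omp.h>

static int run_with_checked_team(int n_threads_requested) {
    int n_threads_actual = 1;
    #pragma omp parallel num_threads(n_threads_requested)
    {
        #pragma omp single
        n_threads_actual = omp_get_num_threads();
        // implicit barrier after `single`: every thread now sees the real size
        // ... distribute the graph nodes over n_threads_actual workers ...
    }
    return n_threads_actual;
}
```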

@slaren
Collaborator

slaren commented May 20, 2024

Unfortunately the omp version is slower for me, and it causes a significant performance regression without nkvo.

| GPU         | Model         | NKVO |        Test | t/s master | t/s thread_pool | Speedup |
| ----------- | ------------- | ---: | ----------: | ---------: | --------------: | ------: |
| RTX 3090 Ti | llama 7B Q4_0 |    0 |       pp512 |    4512.80 |         4076.89 |    0.90 |
| RTX 3090 Ti | llama 7B Q4_0 |    0 |       tg128 |     154.23 |          140.59 |    0.91 |
| RTX 3090 Ti | llama 7B Q4_0 |    0 | pp512+tg128 |     627.11 |          570.19 |    0.91 |
| RTX 3090 Ti | llama 7B Q4_0 |    1 |       pp512 |     526.36 |          489.66 |    0.93 |
| RTX 3090 Ti | llama 7B Q4_0 |    1 |       tg128 |      21.86 |           42.83 |    1.96 |
| RTX 3090 Ti | llama 7B Q4_0 |    1 | pp512+tg128 |      84.62 |          131.32 |    1.55 |

CPU only:

| CPU                                  | Model         |  Test | t/s master | t/s thread_pool | Speedup |
| ------------------------------------ | ------------- | ----: | ---------: | --------------: | ------: |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama 7B Q4_0 | pp128 |      61.98 |           62.86 |    1.01 |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama 7B Q4_0 |  tg32 |      19.78 |           18.48 |    0.93 |

@besnardjb
Author

Thank you very much for checking. It turns out it indeed cannot be that simple; it is hard to beat those hundreds of threads!

Since the thread pool did perform better on the GPU configurations, maybe OpenMP is doing some busy waiting between parallel regions that interferes with CUDA. I will first fix the risky parallel-for logic and then play a bit more with it out of curiosity.

@besnardjb
Author

Superseded by #7606 using OpenMP 👍. I still think there are gains to be had at the level of the barriers ggml_graph_compute_thread_sync_node and ggml_graph_compute_thread_sync_task.
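For what it's worth, a generic sketch of the kind of barrier-level tuning hinted at here (not the actual functions named above): spin briefly on the shared counter before yielding, to cut wake-up latency between nodes.

```c
// Generic spin-then-yield wait (illustrative, not ggml's actual sync code):
// spin a bounded number of times on the shared counter before falling back to
// sched_yield, trading a little CPU for lower latency at each barrier.
#include <sched.h>
#include <stdatomic.h>

static void wait_until_zero(atomic_int * n_active) {
    int spins = 0;
    while (atomic_load_explicit(n_active, memory_order_acquire) != 0) {
        if (++spins < 1000) {
            continue;            // short busy spin (a pause hint could go here)
        }
        spins = 0;
        sched_yield();           // give the core up if the wait drags on
    }
}
```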

@besnardjb besnardjb closed this Jun 4, 2024
Labels
build (Compilation issues), performance (Speed related topics), Review Complexity: Medium (Generally require more time to grok but manageable by beginner to medium expertise level)

4 participants