
Another threadpool: Avoid creating hundreds of threads in GGML #7342

Closed
besnardjb wants to merge 2 commits into master from thread_pool

Conversation

besnardjb

TL/DR:

While profiling the code I noticed that hundreds of threads were being created, which makes profiling impractical and may add some performance overhead. I then attempted to build a simple threadpool inside GGML to see whether it makes any difference at all. Gains at small scale are not notable, and inference does not change much. However, the overhead grows with the number of threads, and more importantly the code becomes easier to profile (there are no longer hundreds of threads to look at).

This is my first attempt at hacking on the code, so there may be cleaner or better ways; I have seen PRs #710 and #851, which are similar. I'm also not sure how to properly benchmark this; I did run ./bin/benchmark.
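For illustration, here is a minimal sketch of the kind of persistent pool this refers to (hypothetical names, not the PR's actual code): workers are created once and then woken with a semaphore instead of being re-created for every graph evaluation.

```c
// Hypothetical sketch of a persistent worker pool (illustrative names only).
// Workers are spawned once and parked on a semaphore between tasks instead of
// being created per graph evaluation.
#include <pthread.h>
#include <semaphore.h>
#include <stdatomic.h>

typedef struct {
    pthread_t  tid;
    sem_t      go;            // posted by the main thread when work is ready
    sem_t      done;          // posted by the worker when the task finishes
    void     (*fn)(void *);   // task to run
    void      *arg;
    atomic_int stop;          // set to 1 to shut the worker down
} pool_worker;

static void * pool_worker_main(void * p) {
    pool_worker * w = p;
    for (;;) {
        sem_wait(&w->go);                  // sleep until work (or shutdown) arrives
        if (atomic_load(&w->stop)) break;
        w->fn(w->arg);                     // run one task
        sem_post(&w->done);                // signal completion, go back to sleep
    }
    return NULL;
}

static void pool_worker_start(pool_worker * w) {
    sem_init(&w->go,   0, 0);
    sem_init(&w->done, 0, 0);
    atomic_store(&w->stop, 0);
    pthread_create(&w->tid, NULL, pool_worker_main, w);
}
```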

Perf on x86-64 (Genoa)

  • One socket:
    • ./bin/benchmark -t 96 -i 100
      • without: (last line) average flops 1497.53
      • with: (last line) average flops 2970.68
  • Two sockets:
    • ./bin/benchmark -t 192 -i 100
      • without: (last line) average flops 1044.79
      • with: (last line) average flops 3488.96

Perf on x86-64 (Ivy Bridge)

  • One socket:
    • ./bin/benchmark -t 18 -i 100
      • without: (last line) average flops 367.60
      • with: (last line) average flops 378.21
  • Two sockets:
    • ./bin/benchmark -t 36 -i 100
      • without: (last line) average flops 633.22
      • with: (last line) average flops 706.8

Contributor

github-actions bot commented May 17, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 553 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=8475.39ms p(95)=19968.52ms fails=, finish reason: stop=499 truncated=54
  • Prompt processing (pp): avg=104.29tk/s p(95)=476.74tk/s
  • Token generation (tg): avg=47.01tk/s p(95)=46.82tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=thread_pool commit=c711b028b03e74e3e63118d847c2646d95e66c34

Time-series charts for the bench-server-baseline run on Standard_NC4as_T4_v3 (duration=10m, 553 iterations) omitted; panels: llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, llamacpp:requests_processing.

@slaren
Collaborator

slaren commented May 17, 2024

You can use llama-bench to test the overall performance. I ran a test with and without KV offload since this is one case where the threads are started multiple times in each evaluation.

LLAMA_CUDA=1 scripts/compare-commits.sh master thread_pool -nkvo 0,1

| GPU         | Model         | NKVO |        Test | t/s master | t/s thread_pool | Speedup |
| ----------- | ------------- | ---: | ----------: | ---------: | --------------: | ------: |
| RTX 3090 Ti | llama 7B Q4_0 |    0 |       pp512 |    4313.72 |         4525.64 |    1.05 |
| RTX 3090 Ti | llama 7B Q4_0 |    0 |       tg128 |     154.96 |          154.17 |    0.99 |
| RTX 3090 Ti | llama 7B Q4_0 |    0 | pp512+tg128 |     622.75 |          625.94 |    1.01 |
| RTX 3090 Ti | llama 7B Q4_0 |    1 |       pp512 |     619.01 |          527.14 |    0.85 |
| RTX 3090 Ti | llama 7B Q4_0 |    1 |       tg128 |      21.55 |           49.80 |    2.31 |
| RTX 3090 Ti | llama 7B Q4_0 |    1 | pp512+tg128 |      84.56 |          135.97 |    1.61 |

@mofosyne mofosyne added the labels performance (Speed related topics) and Review Complexity: Medium (Generally require more time to grok but manageable by beginner to medium expertise level) May 17, 2024
@besnardjb besnardjb force-pushed the thread_pool branch 3 times, most recently from bf0722c to c9a4102 Compare May 17, 2024 21:29
@besnardjb
Author

besnardjb commented May 17, 2024

Thanks for the pointer to llama-bench, and sorry for missing it. I did run a few CPU tests on a dual-socket machine. I was not able to use the convenient compare-commits.sh script yet, as some warning output seems to get piped into the SQL.

Overall, there seem to be occasional gains, but they are close to measurement noise. I've also cleaned up the code a bit and tried to fix the Windows build (⚠️ without access to a Windows machine, unfortunately). For Windows, since there is apparently no direct equivalent of a semaphore, I used busy waiting for work in the threads to avoid extra dependencies (its impact is also measured below on Linux, just to see what it does).

Ran: ./bin/llama-bench -m ../models/7B/mistral-7b-instruct-v0.2.Q4_0.gguf -nkvo 0,1

ThreadPool (Semaphore)

| model                          |       size |     params | backend    |    threads |          nkvo |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ------------: | ------------: | ---------------: |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             0 |         pp512 |    248.15 ± 1.38 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             0 |         tg128 |     22.39 ± 0.20 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             0 |   pp512+tg128 |     81.91 ± 1.01 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             0 |         pp512 |    211.54 ± 8.78 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             0 |         tg128 |     19.73 ± 0.17 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             0 |   pp512+tg128 |     70.52 ± 0.57 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             1 |         pp512 |   241.72 ± 10.81 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             1 |         tg128 |     22.25 ± 0.45 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             1 |   pp512+tg128 |     82.73 ± 0.39 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             1 |         pp512 |    227.08 ± 2.27 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             1 |         tg128 |     19.42 ± 0.14 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             1 |   pp512+tg128 |     70.71 ± 0.77 |

Threadpool (ACTIVE Poll -- as for windows measured on Linux)

| model                          |       size |     params | backend    |    threads |          nkvo |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ------------: | ------------: | ---------------: |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             0 |         pp512 |    246.28 ± 0.34 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             0 |         tg128 |     24.11 ± 0.10 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             0 |   pp512+tg128 |     84.40 ± 0.54 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             0 |         pp512 |    224.57 ± 0.51 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             0 |         tg128 |     19.38 ± 0.23 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             0 |   pp512+tg128 |     71.38 ± 0.79 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             1 |         pp512 |    247.67 ± 1.07 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             1 |         tg128 |     24.73 ± 0.48 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             1 |   pp512+tg128 |     83.10 ± 0.75 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             1 |         pp512 |    236.53 ± 1.82 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             1 |         tg128 |     19.02 ± 0.08 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             1 |   pp512+tg128 |     69.73 ± 0.14 |

Master (27b0406)

| model                          |       size |     params | backend    |    threads |          nkvo |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ------------: | ------------: | ---------------: |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             0 |         pp512 |   246.89 ± 11.27 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             0 |         tg128 |     19.84 ± 0.01 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             0 |   pp512+tg128 |     75.51 ± 0.43 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             0 |         pp512 |    238.93 ± 6.04 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             0 |         tg128 |     18.17 ± 0.15 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             0 |   pp512+tg128 |     69.23 ± 0.48 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             1 |         pp512 |   253.94 ± 14.74 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             1 |         tg128 |     19.85 ± 0.02 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |         96 |             1 |   pp512+tg128 |     75.34 ± 1.20 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             1 |         pp512 |   236.17 ± 16.98 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             1 |         tg128 |     18.25 ± 0.08 |
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | CPU        |        192 |             1 |   pp512+tg128 |     69.36 ± 0.58 |

@slaren
Copy link
Collaborator

slaren commented May 17, 2024

To use compare-commits.sh you need to have sqlite3 installed. The nkvo parameter only has an effect when offloading a model to a GPU: with nkvo, the attention part of each layer is run on the CPU and the rest on the GPU, which causes a call to ggml_graph_compute on every layer, so the threads have to be launched once per layer.

@besnardjb
Author

besnardjb commented May 18, 2024

Got it, I basically ran the same test twice 😆. For compare-commits.sh I did install sqlite3 and gitpython, but I hit two issues (pure CPU run again):

  • In ggml.c a warning (namely GGML_PRINT("WARNING: /proc/sys/kernel/numa_balancing is enabled, this has been observed to impair performance\n");) was sent to stdout and ended up mixed with the SQL; it should probably go to stderr, and the script should probably only capture stdout (see the sketch after this list).

  • I then tried commenting that line out, but ended up with an out-of-range error in scripts/compare-llama-bench.py on lines 302-303, and did not take the time to investigate further:

        gpu_blas = bool(rows_full[0][KEY_PROPERTIES.index("gpu_blas")])
        ngl = int(rows_full[0][KEY_PROPERTIES.index("n_gpu_layers")])
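A minimal sketch of the first point, assuming the warning is simply redirected to stderr (GGML_PRINT_WARN is a hypothetical macro name, not something that exists in ggml.c):

```c
// Hypothetical: route warnings to stderr so that tools piping stdout into
// sqlite3 (as compare-commits.sh does) do not pick them up.
#include <stdio.h>
#define GGML_PRINT_WARN(...) fprintf(stderr, __VA_ARGS__)
```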

Thank you, I will look into it further.

@forworldm

It looks like you are always waiting on a variable to change. Have you tried replacing sem_t and sched_yield with a futex? See syscall(SYS_futex, ...) on Linux and WaitOnAddress on Windows.
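For reference, a minimal sketch of what the suggested futex wait could look like on Linux (illustrative helper names; the Windows counterpart would use WaitOnAddress/WakeByAddressAll):

```c
// Hypothetical futex-based wait/wake helpers (Linux only). A thread sleeps in
// the kernel until the shared 32-bit value changes, instead of sem_wait or a
// sched_yield spin loop.
#include <limits.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static void futex_wait_while(atomic_uint * addr, unsigned expected) {
    // FUTEX_WAIT only blocks if *addr still equals `expected`; re-check in a
    // loop because wakeups can be spurious.
    while (atomic_load(addr) == expected) {
        syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
    }
}

static void futex_wake_all(atomic_uint * addr) {
    syscall(SYS_futex, addr, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
}
```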

@besnardjb
Author

That is a very good idea. I will look into it after finding myself a Windows VM. Also, I'm currently not freeing the threads; I will need to refcount init and release.
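A rough sketch of the refcounted lifetime mentioned here (illustrative names only, not the PR's actual API):

```c
// Hypothetical refcounted pool lifetime: the first acquire spawns the workers,
// the last release tears them down, so repeated init/free does not leak threads.
#include <stdatomic.h>

static atomic_int g_pool_users = 0;

void pool_acquire(void) {
    if (atomic_fetch_add(&g_pool_users, 1) == 0) {
        // first user: create the worker threads here
    }
}

void pool_release(void) {
    if (atomic_fetch_sub(&g_pool_users, 1) == 1) {
        // last user: signal shutdown and join the worker threads here
    }
}
```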

- Add OpenMP to dependency detection
- Add GGML_NO_OMP if OpenMP is disabled or not found
- Keep previous approach if there is no OMP
@besnardjb
Author

besnardjb commented May 20, 2024

After looking into the futex approach I realized it would add a lot of ifdefs and several lines of relatively complex code. I then realized that I was effectively trying to implement an OpenMP parallel region, so I resorted to trying OpenMP directly. This basically translates into a single #pragma omp and avoids the manual handling. What is also nice is that it could be applied to other for loops elsewhere in the code without repeating the whole threadpool machinery (the runtime provides it for free). I think it ends up simpler, but all comments are of course welcome.
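In essence the change boils down to something like the following simplified sketch (not the exact patch; `worker_state` and `compute_worker` stand in for the existing per-thread state and worker body in ggml_graph_compute):

```c
// Simplified sketch: let the OpenMP runtime own the worker threads instead of
// spawning pthreads on every graph evaluation. The runtime reuses its pool
// across calls, so threads are created only once.
#include <omp.h>

typedef struct {
    int ith;   // index of this worker
    int nth;   // total number of workers
    // ... per-thread compute state would live here ...
} worker_state;

static void compute_worker(worker_state * w) {
    // the per-node compute loop of ggml_graph_compute would go here
    (void) w;
}

static void run_graph(worker_state * workers, int n_threads) {
    #pragma omp parallel num_threads(n_threads)
    {
        const int j = omp_get_thread_num();
        workers[j].ith = j;
        workers[j].nth = n_threads;
        compute_worker(&workers[j]);
    }
}
```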

I hope it works well on Windows; let's see.

I also realized that the performance counters in ggml_graph_compute were not accounting for the thread-creation cost, so I moved them a few lines up on master to take the same measurement in both cases.

72 Cores (moved timings for thread creation)

| CPU                                       | Model         |        Test | t/s master | t/s openmp | Speedup |
| ----------------------------------------- | ------------- | ----------: | ---------: | ---------: | ------: |
| Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz | llama 7B Q4_0 |       pp512 |      47.17 |      58.55 |    1.24 |
| Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz | llama 7B Q4_0 |       tg128 |       9.08 |      13.96 |    1.54 |
| Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz | llama 7B Q4_0 | pp512+tg128 |      27.37 |      35.15 |    1.28 |

192 cores (moved timings for thread creation)

| CPU                             | Model         |        Test | t/s master | t/s openmp | Speedup |
| ------------------------------- | ------------- | ----------: | ---------: | ---------: | ------: |
| AMD EPYC 9654 96-Core Processor | llama 7B Q4_0 |       pp512 |     226.86 |     241.64 |    1.07 |
| AMD EPYC 9654 96-Core Processor | llama 7B Q4_0 |       tg128 |      18.24 |      21.42 |    1.17 |
| AMD EPYC 9654 96-Core Processor | llama 7B Q4_0 | pp512+tg128 |      69.03 |      76.94 |    1.11 |

Old Laptop 4 cores (moved timings for thread creation)

| CPU                                       | Model          |        Test | t/s master | t/s openmp | Speedup |
| ----------------------------------------- | -------------- | ----------: | ---------: | ---------: | ------: |
| Intel(R) Core(TM) i7-4712HQ CPU @ 2.30GHz | phi3 3B Q4_K_M |       pp512 |      13.07 |      13.28 |    1.02 |
| Intel(R) Core(TM) i7-4712HQ CPU @ 2.30GHz | phi3 3B Q4_K_M |       tg128 |       7.51 |       7.60 |    1.01 |
| Intel(R) Core(TM) i7-4712HQ CPU @ 2.30GHz | phi3 3B Q4_K_M | pp512+tg128 |      11.38 |      11.28 |    0.99 |

H100 (moved timings for thread creation)

Node is 96 cores.

| GPU              | Model         | NKVO |        Test | t/s master | t/s openmp | Speedup |
| ---------------- | ------------- | ---: | ----------: | ---------: | ---------: | ------: |
| NVIDIA H100 PCIe | llama 7B Q4_0 |    0 |       pp512 |    6366.18 |    6222.38 |    0.98 |
| NVIDIA H100 PCIe | llama 7B Q4_0 |    0 |       tg128 |     130.68 |     120.55 |    0.92 |
| NVIDIA H100 PCIe | llama 7B Q4_0 |    0 | pp512+tg128 |     549.15 |     515.65 |    0.94 |
| NVIDIA H100 PCIe | llama 7B Q4_0 |    1 |       pp512 |     601.44 |     711.63 |    1.18 |
| NVIDIA H100 PCIe | llama 7B Q4_0 |    1 |       tg128 |       4.94 |      33.61 |    6.81 |
| NVIDIA H100 PCIe | llama 7B Q4_0 |    1 | pp512+tg128 |      22.21 |     128.38 |    5.78 |

@github-actions github-actions bot added the build (Compilation issues) label May 20, 2024

/* The loop is reversed because in the NO_OMP case we want the worker threads
   to start before the main thread (j == 0) */
#pragma omp parallel for shared(workers,state_shared)
Collaborator

Might need to add a num_threads(n_threads) here to make sure that omp always launches all the threads, otherwise it will deadlock.

Author

@besnardjb besnardjb May 20, 2024

You are right, it is indeed fragile. I'm not even sure that setting the thread count covers all cases (https://www.openmp.org/spec-html/5.0/openmpsu35.html#x55-880002.6.1).
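One defensive option, sketched below, is to read the actual team size inside the region and size the work accordingly, since num_threads() is only a request and the runtime may still provide fewer threads:

```c
// Sketch: num_threads() is a request, not a guarantee (thread limits, dynamic
// adjustment and nesting can all reduce the team size). Reading the real team
// size inside the region avoids a barrier that waits for absent threads.
#include <omp.h>

static int run_with_checked_team(int n_threads_requested) {
    int n_threads_actual = 1;
    #pragma omp parallel num_threads(n_threads_requested)
    {
        #pragma omp single
        n_threads_actual = omp_get_num_threads();
        // implicit barrier after `single`: every thread now sees the real size
        // ... distribute the graph nodes over n_threads_actual workers ...
    }
    return n_threads_actual;
}
```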

@slaren
Collaborator

slaren commented May 20, 2024

Unfortunately the omp version is slower for me, and it causes a significant performance regression without nkvo.

| GPU         | Model         | NKVO |        Test | t/s master | t/s thread_pool | Speedup |
| ----------- | ------------- | ---: | ----------: | ---------: | --------------: | ------: |
| RTX 3090 Ti | llama 7B Q4_0 |    0 |       pp512 |    4512.80 |         4076.89 |    0.90 |
| RTX 3090 Ti | llama 7B Q4_0 |    0 |       tg128 |     154.23 |          140.59 |    0.91 |
| RTX 3090 Ti | llama 7B Q4_0 |    0 | pp512+tg128 |     627.11 |          570.19 |    0.91 |
| RTX 3090 Ti | llama 7B Q4_0 |    1 |       pp512 |     526.36 |          489.66 |    0.93 |
| RTX 3090 Ti | llama 7B Q4_0 |    1 |       tg128 |      21.86 |           42.83 |    1.96 |
| RTX 3090 Ti | llama 7B Q4_0 |    1 | pp512+tg128 |      84.62 |          131.32 |    1.55 |

CPU only:

| CPU                                  | Model         |  Test | t/s master | t/s thread_pool | Speedup |
| ------------------------------------ | ------------- | ----: | ---------: | --------------: | ------: |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama 7B Q4_0 | pp128 |      61.98 |           62.86 |    1.01 |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama 7B Q4_0 |  tg32 |      19.78 |           18.48 |    0.93 |

@besnardjb
Author

Thank you very much for checking. It turns out it indeed cannot be that simple; it is hard to beat those hundreds of threads!

Since the thread pool did perform better on the GPU configurations, maybe OpenMP is doing some busy waiting between parallel regions that interferes with CUDA. I will first fix the risky parallel-for logic and then play a bit more with it out of curiosity.

@besnardjb
Author

Superseded by #7606 using OpenMP 👍. I still think there are gains to be had at the level of the barriers ggml_graph_compute_thread_sync_node and ggml_graph_compute_thread_sync_task.
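For what it's worth, a generic sketch of the kind of barrier-level tuning hinted at here (not the actual functions named above): spin briefly on the shared counter before yielding, to cut wake-up latency between nodes.

```c
// Generic spin-then-yield wait (illustrative, not ggml's actual sync code):
// spin a bounded number of times on the shared counter before falling back to
// sched_yield, trading a little CPU for lower latency at each barrier.
#include <sched.h>
#include <stdatomic.h>

static void wait_until_zero(atomic_int * n_active) {
    int spins = 0;
    while (atomic_load_explicit(n_active, memory_order_acquire) != 0) {
        if (++spins < 1000) {
            continue;            // short busy spin (a pause hint could go here)
        }
        spins = 0;
        sched_yield();           // give the core up if the wait drags on
    }
}
```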

@besnardjb besnardjb closed this Jun 4, 2024
Labels
build (Compilation issues), performance (Speed related topics), Review Complexity: Medium (Generally require more time to grok but manageable by beginner to medium expertise level)

4 participants