ggml: implement quantized KV cache for FA #7372

Merged: 1 commit into ggerganov:master on May 19, 2024

Conversation

JohannesGaessler (Collaborator)

This PR implements support for quantized KV caches in the CPU backend when using FlashAttention. This is done via switch statements in ggml_compute_forward_flash_attn_ext_f16; from what I can tell this does not significantly affect performance. I also added comments and made some small performance optimizations, such as converting Q only once to FP16/q8_0/q8_1 and using multiplications for scaling rather than divisions. The amount of wdata has been increased, but only on the order of kiB, which should be negligible. On my desktop with a Ryzen 5950X power-limited to 95 W the performance changes as follows:

| model | backend | threads | fa | test | t/s master | t/s PR | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | CPU | 16 | 1 | pp2048 | 33.43 | 35.29 | 1.06 |
| llama 7B Q4_0 | CPU | 16 | 1 | tg512 | 10.59 | 10.71 | 1.01 |

When benchmarking, be mindful of the order in which you run the tests since a hot CPU will perform worse and this difference can easily be larger than the performance difference from this PR. I did a warmup run prior to the actual measurements.
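
For reference, the per-row flow behind those two micro-optimizations looks roughly like the sketch below. It is only an illustration; all names (flash_attn_scores_sketch, quantize_row_fn, vec_dot_fn, k_row_stride) are placeholders rather than the actual ggml identifiers.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*quantize_row_fn)(const float *src, void *dst, int n);
typedef void (*vec_dot_fn)(int n, float *s, const void *x, const void *y);

/* Compute one row of KQ scores: convert/quantize the query exactly once,
 * then reuse it for every K row, and apply the precomputed scale as a
 * multiplication instead of dividing inside the loop. */
static void flash_attn_scores_sketch(
        int head_dim, int n_kv,
        const float   *q_f32,          /* one query row, FP32                 */
        const uint8_t *k,              /* K rows stored in the cache type     */
        size_t         k_row_stride,   /* bytes per K row                     */
        float          kq_scale,       /* precomputed, e.g. 1/sqrt(head_dim)  */
        void          *q_converted,    /* scratch buffer (wdata in ggml terms)*/
        quantize_row_fn quantize_q,
        vec_dot_fn      vec_dot,
        float          *scores) {
    quantize_q(q_f32, q_converted, head_dim);      /* done once, not per K row */

    for (int i = 0; i < n_kv; ++i) {
        float s;
        vec_dot(head_dim, &s, q_converted, k + (size_t) i * k_row_stride);
        scores[i] = s * kq_scale;                  /* multiply, do not divide  */
    }
}
```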

slaren (Collaborator) commented May 18, 2024

It would be preferable to use the more generic function pointers in type_traits in the same way mul_mat does.
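
Roughly, that means looking the callbacks up once from the traits table instead of switching on the KV type, something like the sketch below. It assumes the ggml_type_traits_t layout of this era (from_float, vec_dot, vec_dot_type) and uses the public ggml_internal_get_type_traits() accessor for illustration; inside ggml.c the static type_traits[] table would be indexed directly, and get_kv_dispatch is a made-up helper name.

```c
#include "ggml.h"

/* Fetch the generic conversion / dot-product callbacks for the K cache type
 * from the type_traits table instead of switching on the type by hand. */
static void get_kv_dispatch(enum ggml_type     k_type,
                            ggml_from_float_t *q_to_vec_dot,
                            ggml_vec_dot_t    *kq_vec_dot,
                            enum ggml_type    *vec_dot_type) {
    const ggml_type_traits_t kt = ggml_internal_get_type_traits(k_type);

    *vec_dot_type = kt.vec_dot_type;   /* type Q has to be converted to       */
    *q_to_vec_dot = ggml_internal_get_type_traits(kt.vec_dot_type).from_float;
    *kq_vec_dot   = kt.vec_dot;        /* K·Q dot product for that type pair  */
}
```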

github-actions bot (Contributor) commented May 19, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 547 iterations 🚀

Details (performance-related PR benchmark):
  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=8537.9ms p(95)=20441.75ms fails=, finish reason: stop=493 truncated=54
  • Prompt processing (pp): avg=101.07tk/s p(95)=423.38tk/s
  • Token generation (tg): avg=51.07tk/s p(95)=48.17tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=fa-quantize-3 commit=b7da2e86db9836b4b85a30675f39c0571de0ec94

prompt_tokens_seconds

[chart: llamacpp:prompt_tokens_seconds, llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 547 iterations]
predicted_tokens_seconds

[chart: llamacpp:predicted_tokens_seconds over the same run]

Details

kv_cache_usage_ratio

[chart: llamacpp:kv_cache_usage_ratio over the same run]
requests_processing

[chart: llamacpp:requests_processing over the same run]

JohannesGaessler (Collaborator, Author)

I tested using the same FP32 code path for both the FP16 and the quantized V cache, but the performance was worse.

ggml.c (Outdated)

```c
GGML_ASSERT(nbq0 == sizeof(float));
GGML_ASSERT(nbk0 == sizeof(ggml_fp16_t));
GGML_ASSERT(nbv0 == sizeof(ggml_fp16_t));
// input tensors must be contiguous
```
ggerganov (Owner)

Needs just contiguous rows, not the entire tensors
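
In ggml terms that amounts to checking only the innermost stride of each input rather than full contiguity, e.g. the following; this is a hedged sketch written against the kernel's q/k/v tensor locals, not the merged code.

```c
// Only the innermost stride has to match the element/block size; the outer
// strides nb[1..3] may still be permuted, which ggml_is_contiguous() would
// reject even though the kernel can handle it.
GGML_ASSERT(q->nb[0] == ggml_type_size(q->type)); // rows of Q are packed
GGML_ASSERT(k->nb[0] == ggml_type_size(k->type)); // rows of K are packed
GGML_ASSERT(v->nb[0] == ggml_type_size(v->type)); // rows of V are packed
```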

JohannesGaessler merged commit 5ca49cb into ggerganov:master on May 19, 2024
43 of 45 checks passed