grammars: early exit when no next_candidates to reject #7370

Closed · wants to merge 1 commit

Conversation

Collaborator

@ochafik ochafik commented May 18, 2024

Edit: superseded by #7424 (see benchmarks comparing different fix combinations)

This speeds up grammar sampling, but only in some cases (#4218).

(extracted from #6811, whose remaining changes need a bit more work)

For instance, the example taken from @AlienKevin (see #4218 (comment); download issue4218.gbnf and issue4218.txt) runs 1.4x faster in total time; sampling itself went from 28.5 ms per token to 11.7 ms per token, a 2.4x sampling speedup.
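
The actual diff isn't shown on this page, but the idea named in the title is simple: in the recursive candidate-rejection pass, return as soon as no candidates survive the current grammar position (next_candidates is empty), instead of building the follow-up stacks and recursing on an empty list. The sketch below only illustrates that shape, with made-up, simplified types rather than llama.cpp's real grammar structures; the hyperfine command that follows reproduces the comparison against master.

#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative stand-ins only, not llama.cpp internals.
struct candidate {
    int                   index;
    const std::uint32_t * code_points; // zero-terminated remaining codepoints of the token
};

// Stand-in for the per-position grammar check.
static bool matches(std::uint32_t cp, std::uint32_t expected) { return cp == expected; }

static std::vector<candidate> reject_candidates_for_stack(
        const std::vector<std::uint32_t> & stack,      // expected codepoints, in order
        const std::vector<candidate>     & candidates,
        std::size_t                        pos = 0) {
    std::vector<candidate> rejects;
    if (pos == stack.size()) {
        return rejects; // grammar position exhausted: nothing more can be rejected
    }

    std::vector<candidate> next_candidates;
    for (const auto & tok : candidates) {
        if (*tok.code_points == 0) {
            continue; // token fully consumed; it cannot be rejected at this position
        }
        if (matches(*tok.code_points, stack[pos])) {
            next_candidates.push_back({ tok.index, tok.code_points + 1 });
        } else {
            rejects.push_back(tok);
        }
    }

    // The early exit, in spirit: if nothing survived this position, the recursive
    // call below could only return an empty list, so skip it entirely.
    if (next_candidates.empty()) {
        return rejects;
    }

    for (const auto & tok : reject_candidates_for_stack(stack, next_candidates, pos + 1)) {
        rejects.push_back({ tok.index, tok.code_points - 1 });
    }
    return rejects;
}

int main() {
    const std::uint32_t abc[] = { 'a', 'b', 'c', 0 };
    const std::uint32_t axc[] = { 'a', 'x', 'c', 0 };
    const std::vector<candidate>     cands = { { 0, abc }, { 1, axc } };
    const std::vector<std::uint32_t> stack = { 'a', 'b' }; // the grammar expects "ab..."

    for (const auto & r : reject_candidates_for_stack(stack, cands)) {
        std::printf("rejected candidate %d\n", r.index); // prints: rejected candidate 1
    }
    return 0;
}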

# git remote add ochafik https://github.com/ochafik/llama.cpp.git
# git fetch ochafik
( export COMMON_ARGS=(
    -mu https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B-GGUF/resolve/main/Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf
    --prompt-cache issue4218.bin
    --grammar-file issue4218.gbnf
    -f issue4218.txt
    -c 3400
  ) && \
  hyperfine --warmup 1 --runs 5 \
    -L branch ochafik/grammars-early-exit,master \
    --setup "\
      git checkout {branch} && \
      make clean && make -j LLAMA_CURL=1 main && \
      rm -f issue4218.bin && \
      ./main ${COMMON_ARGS[*]} -n 1" \
    "BRANCH={branch} \
      ./main ${COMMON_ARGS[*]} -n 128 --prompt-cache-ro --seed 12345 --no-display-prompt" )
Benchmark 1: BRANCH=grammars-early-exit       ./main -mu https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B-GGUF/resolve/main/Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf --prompt-cache issue4218.bin --grammar-file issue4218.gbnf -f issue4218.txt -c 3400 -n 128 --prompt-cache-ro --seed 12345 --no-display-prompt
  Time (mean ± σ):      5.697 s ±  0.057 s    [User: 1.705 s, System: 0.247 s]
  Range (min … max):    5.639 s …  5.772 s    5 runs
 
Benchmark 2: BRANCH=master       ./main -mu https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B-GGUF/resolve/main/Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf --prompt-cache issue4218.bin --grammar-file issue4218.gbnf -f issue4218.txt -c 3400 -n 128 --prompt-cache-ro --seed 12345 --no-display-prompt
  Time (mean ± σ):      7.946 s ±  0.048 s    [User: 3.838 s, System: 0.257 s]
  Range (min … max):    7.915 s …  8.027 s    5 runs
 
  Warning: The first benchmarking run for this command was significantly slower than the rest (8.027 s). This could be caused by (filesystem) caches that were not filled until after the first run. You should consider using the '--warmup' option to fill those caches before the actual benchmark. Alternatively, use the '--prepare' option to clear the caches before each timing run.
 
Summary
  'BRANCH=grammars-early-exit       ./main -mu https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B-GGUF/resolve/main/Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf --prompt-cache issue4218.bin --grammar-file issue4218.gbnf -f issue4218.txt -c 3400 -n 128 --prompt-cache-ro --seed 12345 --no-display-prompt' ran
    1.39 ± 0.02 times faster than 'BRANCH=master       ./main -mu https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B-GGUF/resolve/main/Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf --prompt-cache issue4218.bin --grammar-file issue4218.gbnf -f issue4218.txt -c 3400 -n 128 --prompt-cache-ro --seed 12345 --no-display-prompt'

master:

llama_print_timings:        load time =     323.50 ms
llama_print_timings:      sample time =    3651.29 ms /   128 runs   (   28.53 ms per token,    35.06 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (     nan ms per token,      nan tokens per second)
llama_print_timings:        eval time =    3408.83 ms /   127 runs   (   26.84 ms per token,    37.26 tokens per second)
llama_print_timings:       total time =    7273.18 ms /   127 tokens

this PR:

llama_print_timings:        load time =     311.38 ms
llama_print_timings:      sample time =    1491.73 ms /   128 runs   (   11.65 ms per token,    85.81 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (     nan ms per token,      nan tokens per second)
llama_print_timings:        eval time =    3363.72 ms /   127 runs   (   26.49 ms per token,    37.76 tokens per second)
llama_print_timings:       total time =    5073.34 ms /   127 tokens

cc/ @HanClinto

@ochafik ochafik marked this pull request as ready for review May 18, 2024 21:29
ochafik added a commit to ochafik/llama.cpp that referenced this pull request May 18, 2024
Contributor

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 528 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8873.01ms p(95)=22039.97ms fails=, finish reason: stop=466 truncated=62
  • Prompt processing (pp): avg=103.75tk/s p(95)=500.38tk/s
  • Token generation (tg): avg=45.11tk/s p(95)=45.95tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=grammars-early-exit commit=1cc12bad500da6f88f3375733693530bca3ce05c

[Four time-series charts from "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 528 iterations": llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, and llamacpp:requests_processing]

@mofosyne added the "review complexity : low" and "performance" labels May 20, 2024
ochafik added a commit to ochafik/llama.cpp that referenced this pull request May 20, 2024
Collaborator Author

ochafik commented May 21, 2024

Superseded by #7424

@ochafik ochafik closed this May 21, 2024
ochafik added a commit to ochafik/llama.cpp that referenced this pull request May 21, 2024
grammars: cache decoded tokens

grammars: faster llama_grammar_copy

grammars: fix bad merge

grammars: keep llama_grammar_copy non-quadratic optim for later

grammars: move token caches to llama_context

grammars: cache codepoints in llama_new_context_with_model

grammar: nit (layout)

grammars: nits (revert const grammar sig, fix comment)

Update llama.cpp

Co-authored-by: Clint Herron <hanclinto@gmail.com>

grammars: mutex-guarded lazy caching of token pieces in llama_sample_grammar

grammars: remove early exit --> ggerganov#7370
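
As a side note on the last few commit messages above ("grammars: cache decoded tokens", "grammars: mutex-guarded lazy caching of token pieces in llama_sample_grammar"): a hypothetical sketch of that kind of cache, not the actual #7424 code, would decode each token's text at most once, on first use, behind a mutex so concurrent samplers can share it safely.

#include <cstdio>
#include <mutex>
#include <string>
#include <vector>

// Illustrative only: names and structure are made up, not llama.cpp's actual API.
struct token_piece_cache {
    std::vector<std::string> pieces; // indexed by token id
    std::vector<bool>        ready;  // whether pieces[id] has been decoded yet
    std::mutex               mtx;

    explicit token_piece_cache(std::size_t n_vocab) : pieces(n_vocab), ready(n_vocab, false) {}

    // `decode` stands in for whatever turns a token id into its text (e.g. the tokenizer).
    template <typename Decoder>
    const std::string & get(int token, Decoder && decode) {
        std::lock_guard<std::mutex> lock(mtx);
        if (!ready[token]) {
            pieces[token] = decode(token); // decoded lazily, exactly once
            ready[token]  = true;
        }
        return pieces[token];
    }
};

int main() {
    token_piece_cache cache(/*n_vocab=*/8);
    auto decode = [](int token) { return "<piece-" + std::to_string(token) + ">"; };

    std::printf("%s\n", cache.get(3, decode).c_str()); // decodes on first use
    std::printf("%s\n", cache.get(3, decode).c_str()); // served from the cache
    return 0;
}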