grammars: early exit when no next_candidates to reject #7370

Closed · wants to merge 1 commit

Conversation

Collaborator

@ochafik ochafik commented May 18, 2024

Edit: superseded by #7424 (see benchmarks comparing different fix combinations)

This speeds up grammar sampling, but only in some cases (#4218).

(extracted from #6811, whose remaining changes need a bit more work)

For instance, the example taken from @AlienKevin (see #4218 (comment); download issue4218.gbnf and issue4218.txt) runs 1.4x faster in total time; sampling itself went from 28.5 ms per token to 11.7 ms per token, a 2.4x sampling speedup.
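
The actual diff isn't shown on this page, but the idea named in the title is simple: in the recursive candidate-rejection pass, return as soon as no candidates survive the current grammar position (next_candidates is empty), instead of building the follow-up stacks and recursing on an empty list. The sketch below only illustrates that shape, with made-up, simplified types rather than llama.cpp's real grammar structures; the hyperfine command that follows reproduces the comparison against master.

#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative stand-ins only, not llama.cpp internals.
struct candidate {
    int                   index;
    const std::uint32_t * code_points; // zero-terminated remaining codepoints of the token
};

// Stand-in for the per-position grammar check.
static bool matches(std::uint32_t cp, std::uint32_t expected) { return cp == expected; }

static std::vector<candidate> reject_candidates_for_stack(
        const std::vector<std::uint32_t> & stack,      // expected codepoints, in order
        const std::vector<candidate>     & candidates,
        std::size_t                        pos = 0) {
    std::vector<candidate> rejects;
    if (pos == stack.size()) {
        return rejects; // grammar position exhausted: nothing more can be rejected
    }

    std::vector<candidate> next_candidates;
    for (const auto & tok : candidates) {
        if (*tok.code_points == 0) {
            continue; // token fully consumed; it cannot be rejected at this position
        }
        if (matches(*tok.code_points, stack[pos])) {
            next_candidates.push_back({ tok.index, tok.code_points + 1 });
        } else {
            rejects.push_back(tok);
        }
    }

    // The early exit, in spirit: if nothing survived this position, the recursive
    // call below could only return an empty list, so skip it entirely.
    if (next_candidates.empty()) {
        return rejects;
    }

    for (const auto & tok : reject_candidates_for_stack(stack, next_candidates, pos + 1)) {
        rejects.push_back({ tok.index, tok.code_points - 1 });
    }
    return rejects;
}

int main() {
    const std::uint32_t abc[] = { 'a', 'b', 'c', 0 };
    const std::uint32_t axc[] = { 'a', 'x', 'c', 0 };
    const std::vector<candidate>     cands = { { 0, abc }, { 1, axc } };
    const std::vector<std::uint32_t> stack = { 'a', 'b' }; // the grammar expects "ab..."

    for (const auto & r : reject_candidates_for_stack(stack, cands)) {
        std::printf("rejected candidate %d\n", r.index); // prints: rejected candidate 1
    }
    return 0;
}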

# git remote add ochafik https://github.com/ochafik/llama.cpp.git
# git fetch ochafik
( export COMMON_ARGS=(
    -mu https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B-GGUF/resolve/main/Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf
    --prompt-cache issue4218.bin
    --grammar-file issue4218.gbnf
    -f issue4218.txt
    -c 3400
  ) && \
  hyperfine --warmup 1 --runs 5 \
    -L branch ochafik/grammars-early-exit,master \
    --setup "\
      git checkout {branch} && \
      make clean && make -j LLAMA_CURL=1 main && \
      rm -f issue4218.bin && \
      ./main ${COMMON_ARGS[*]} -n 1" \
    "BRANCH={branch} \
      ./main ${COMMON_ARGS[*]} -n 128 --prompt-cache-ro --seed 12345 --no-display-prompt" )
Benchmark 1: BRANCH=grammars-early-exit       ./main -mu https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B-GGUF/resolve/main/Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf --prompt-cache issue4218.bin --grammar-file issue4218.gbnf -f issue4218.txt -c 3400 -n 128 --prompt-cache-ro --seed 12345 --no-display-prompt
  Time (mean ± σ):      5.697 s ±  0.057 s    [User: 1.705 s, System: 0.247 s]
  Range (min … max):    5.639 s …  5.772 s    5 runs
 
Benchmark 2: BRANCH=master       ./main -mu https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B-GGUF/resolve/main/Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf --prompt-cache issue4218.bin --grammar-file issue4218.gbnf -f issue4218.txt -c 3400 -n 128 --prompt-cache-ro --seed 12345 --no-display-prompt
  Time (mean ± σ):      7.946 s ±  0.048 s    [User: 3.838 s, System: 0.257 s]
  Range (min … max):    7.915 s …  8.027 s    5 runs
 
  Warning: The first benchmarking run for this command was significantly slower than the rest (8.027 s). This could be caused by (filesystem) caches that were not filled until after the first run. You should consider using the '--warmup' option to fill those caches before the actual benchmark. Alternatively, use the '--prepare' option to clear the caches before each timing run.
 
Summary
  'BRANCH=grammars-early-exit       ./main -mu https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B-GGUF/resolve/main/Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf --prompt-cache issue4218.bin --grammar-file issue4218.gbnf -f issue4218.txt -c 3400 -n 128 --prompt-cache-ro --seed 12345 --no-display-prompt' ran
    1.39 ± 0.02 times faster than 'BRANCH=master       ./main -mu https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B-GGUF/resolve/main/Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf --prompt-cache issue4218.bin --grammar-file issue4218.gbnf -f issue4218.txt -c 3400 -n 128 --prompt-cache-ro --seed 12345 --no-display-prompt'

master:

llama_print_timings:        load time =     323.50 ms
llama_print_timings:      sample time =    3651.29 ms /   128 runs   (   28.53 ms per token,    35.06 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (     nan ms per token,      nan tokens per second)
llama_print_timings:        eval time =    3408.83 ms /   127 runs   (   26.84 ms per token,    37.26 tokens per second)
llama_print_timings:       total time =    7273.18 ms /   127 tokens

this PR:

llama_print_timings:        load time =     311.38 ms
llama_print_timings:      sample time =    1491.73 ms /   128 runs   (   11.65 ms per token,    85.81 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (     nan ms per token,      nan tokens per second)
llama_print_timings:        eval time =    3363.72 ms /   127 runs   (   26.49 ms per token,    37.76 tokens per second)
llama_print_timings:       total time =    5073.34 ms /   127 tokens

cc/ @HanClinto

@ochafik ochafik marked this pull request as ready for review May 18, 2024 21:29
ochafik added a commit to ochafik/llama.cpp that referenced this pull request May 18, 2024
Contributor

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 528 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8873.01ms p(95)=22039.97ms fails=, finish reason: stop=466 truncated=62
  • Prompt processing (pp): avg=103.75tk/s p(95)=500.38tk/s
  • Token generation (tg): avg=45.11tk/s p(95)=45.95tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=grammars-early-exit commit=1cc12bad500da6f88f3375733693530bca3ce05c

[Four time-series charts from "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 528 iterations": llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, and llamacpp:requests_processing]

@mofosyne added the "review complexity : low" and "performance" labels May 20, 2024
ochafik added a commit to ochafik/llama.cpp that referenced this pull request May 20, 2024
Collaborator Author

ochafik commented May 21, 2024

Superseded by #7424

@ochafik ochafik closed this May 21, 2024
ochafik added a commit to ochafik/llama.cpp that referenced this pull request May 21, 2024
grammars: cache decoded tokens

grammars: faster llama_grammar_copy

grammars: fix bad merge

grammars: keep llama_grammar_copy non-quadratic optim for later

grammars: move token caches to llama_context

grammars: cache codepoints in llama_new_context_with_model

grammar: nit (layout)

grammars: nits (revert const grammar sig, fix comment)

Update llama.cpp

Co-authored-by: Clint Herron <hanclinto@gmail.com>

grammars: mutex-guarded lazy caching of token pieces in llama_sample_grammar

grammars: remove early exit --> ggerganov#7370
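
As a side note on the last few commit messages above ("grammars: cache decoded tokens", "grammars: mutex-guarded lazy caching of token pieces in llama_sample_grammar"): a hypothetical sketch of that kind of cache, not the actual #7424 code, would decode each token's text at most once, on first use, behind a mutex so concurrent samplers can share it safely.

#include <cstdio>
#include <mutex>
#include <string>
#include <vector>

// Illustrative only: names and structure are made up, not llama.cpp's actual API.
struct token_piece_cache {
    std::vector<std::string> pieces; // indexed by token id
    std::vector<bool>        ready;  // whether pieces[id] has been decoded yet
    std::mutex               mtx;

    explicit token_piece_cache(std::size_t n_vocab) : pieces(n_vocab), ready(n_vocab, false) {}

    // `decode` stands in for whatever turns a token id into its text (e.g. the tokenizer).
    template <typename Decoder>
    const std::string & get(int token, Decoder && decode) {
        std::lock_guard<std::mutex> lock(mtx);
        if (!ready[token]) {
            pieces[token] = decode(token); // decoded lazily, exactly once
            ready[token]  = true;
        }
        return pieces[token];
    }
};

int main() {
    token_piece_cache cache(/*n_vocab=*/8);
    auto decode = [](int token) { return "<piece-" + std::to_string(token) + ">"; };

    std::printf("%s\n", cache.get(3, decode).c_str()); // decodes on first use
    std::printf("%s\n", cache.get(3, decode).c_str()); // served from the cache
    return 0;
}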