avoid to get prompt in infill mode and embedding mode #7286

Merged: 4 commits, Jun 7, 2024

Conversation

woodx9
Contributor

@woodx9 woodx9 commented May 14, 2024

I run the server with the command `./server -m ../1.3b-instruct.gguf --port 8760`
and then run the Python client below to get the result.
```python
import asyncio
import json
import time

import aiohttp

# Not sent to /infill; kept only to show that 'prompt' is deliberately left out.
prompt = 'finish the code completion:'
input_prefix = 'def bubbleSort()\n'


async def test():
    async with aiohttp.ClientSession() as session:
        start_time = time.time()

        # Infill request: only prefix and suffix are supplied, no 'prompt'.
        payload = {
            'input_prefix': input_prefix,
            'input_suffix': '',
            # 'prompt': prompt,
            'n_predict': 10
        }

        async with session.post('http://127.0.0.1:8760/infill', data=json.dumps(payload)) as response:
            elapsed_time = time.time() - start_time
            response_json = await response.json()
            print(response_json)
            print(f"Elapsed time: {elapsed_time} seconds")

asyncio.run(test())
```

but I get an error:
{'error': {'code': 400, 'message': 'Either "prompt" or "messages" must be provided', 'type': 'invalid_request_error'}}
Elapsed time: 0.0010190010070800781 seconds

I read the code and found that the prompt is not needed in infill mode or embedding mode.
So I added an `if` check that skips the prompt requirement in those two modes. With that change, the request works.
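
To illustrate the idea, here is a minimal Python sketch of the kind of check described above. It is not the server's actual code (the server is written in C++), and the function name `validate_request` and the mode constants are hypothetical names chosen for this example. Only the error object is taken from the output shown earlier.

```python
from typing import Any, Mapping, Optional

# Hypothetical mode names, used only for this illustration.
COMPLETION, INFILL, EMBEDDING = "completion", "infill", "embedding"


def validate_request(body: Mapping[str, Any], mode: str) -> Optional[dict]:
    """Return an error object if the request is invalid, otherwise None."""
    # Assumption from the description above: infill and embedding
    # requests are allowed to omit 'prompt'.
    prompt_optional = mode in (INFILL, EMBEDDING)
    if not prompt_optional and "prompt" not in body and "messages" not in body:
        return {
            "error": {
                "code": 400,
                "message": 'Either "prompt" or "messages" must be provided',
                "type": "invalid_request_error",
            }
        }
    return None


infill_body = {"input_prefix": "def bubbleSort()\n", "input_suffix": "", "n_predict": 10}
print(validate_request(infill_body, INFILL))      # accepted: no error object
print(validate_request(infill_body, COMPLETION))  # rejected: 400 error object
```

With a check like this, the infill payload from the client above passes validation, while the same body sent as a plain completion request is still rejected.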

Contributor

github-actions bot commented May 14, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 528 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8880.66ms p(95)=21492.45ms fails=, finish reason: stop=474 truncated=54
  • Prompt processing (pp): avg=106.72tk/s p(95)=519.48tk/s
  • Token generation (tg): avg=30.93tk/s p(95)=44.32tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=master commit=eb42fb79da14121d98406fa9d58e593c484cccba
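
As a quick sanity check on the summary above, the two finish-reason counts should add up to the reported iteration count. The short sketch below uses only numbers copied from the report; the variable names are chosen for illustration.

```python
# Values copied from the benchmark summary above.
iterations = 528
finish_stop = 474
finish_truncated = 54

# Every iteration should end with exactly one finish reason.
total = finish_stop + finish_truncated
truncated_share = finish_truncated / total

print(total == iterations)       # True
print(f"{truncated_share:.1%}")  # 10.2%
```

So roughly one request in ten hit the token limit instead of stopping on its own.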

[Charts omitted: four time-series plots from "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 528 iterations", x-axis 1717464179 --> 1717464809: llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, llamacpp:requests_processing]

@mofosyne mofosyne added the examples, Review Complexity : Low, and server labels May 15, 2024
@mofosyne mofosyne marked this pull request as draft May 15, 2024 06:52
@ggerganov
Owner

The embedding CI seems to be failing

@woodx9
Contributor Author

woodx9 commented May 28, 2024

> The embedding CI seems to be failing

I will test it

@woodx9 woodx9 marked this pull request as ready for review June 2, 2024 11:58
@woodx9
Contributor Author

woodx9 commented Jun 2, 2024

It turns out I misunderstood how embedding mode works. I think it is the mode used for sentence-embedding models.

@woodx9
Contributor Author

woodx9 commented Jun 2, 2024

I tried this new PR with the DeepSeek code model, and it seems fine.

@woodx9
Contributor Author

woodx9 commented Jun 7, 2024

@ggerganov @mofosyne Hi, please have a look. This issue blocks infill mode. After this PR, I would also like to open a PR adding support for the DeepSeek code model.

@ggerganov ggerganov merged commit a5cabd7 into ggerganov:master Jun 7, 2024
73 checks passed

3 participants