I have been using the server for the past 3 months in a production environment on Kubernetes, with active users, and I am pretty happy with the current server version in terms of features, code quality, stability, tests, and performance. In particular, I think that if we continue in this direction we can claim that the server is production ready and that we have an efficient LLM serving solution. @ggerganov @ngxson What do you think?


---

I'm curious what your experience is with load balancing and cache redundancy across pods. Do you use sticky sessions in any way, or do you find cached tokens to be problematic under round-robin load balancing? I think the server's most underrated feature is caching tokens between completions, saving the time needed to re-tokenize and re-evaluate prompts that have already been read before.
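
One pattern that avoids sticky sessions while still keeping the prompt cache warm is to make routing a pure function of the prompt itself, so requests sharing a prefix land on the pod that already holds the matching KV cache. Below is a minimal sketch, not a production implementation: the pod URLs are hypothetical, and it assumes the server's `/completion` endpoint and `cache_prompt` request field (verify both against your server version's README).

```python
# Sketch: route each completion request to the pod most likely to hold a
# warm KV cache for its prompt, by hashing a fixed-size prompt prefix.
# POD_URLS are hypothetical; "/completion" and "cache_prompt" follow the
# llama.cpp server API, but verify them against your server version.
import hashlib
import json
import urllib.request

POD_URLS = [
    "http://llama-0.llama.svc.cluster.local:8080",
    "http://llama-1.llama.svc.cluster.local:8080",
    "http://llama-2.llama.svc.cluster.local:8080",
]

def pick_pod(prompt: str, prefix_chars: int = 512) -> str:
    """Prompts sharing a prefix (e.g. the same system prompt) hash to the
    same pod, so their cached tokens can be reused instead of re-evaluated."""
    digest = hashlib.sha256(prompt[:prefix_chars].encode("utf-8")).digest()
    return POD_URLS[int.from_bytes(digest[:4], "big") % len(POD_URLS)]

def complete(prompt: str, n_predict: int = 128) -> dict:
    body = json.dumps({
        "prompt": prompt,
        "n_predict": n_predict,
        "cache_prompt": True,  # ask the server to reuse cached prompt tokens
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{pick_pod(prompt)}/completion",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())
```

The obvious trade-off versus plain round-robin is load skew: a hot prefix pins all of its traffic to one pod, so in practice this would be combined with a load-aware fallback.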

---

I think what is still missing for a lot of use cases is something like a …

---

llama.cpp is considered production ready. But what about the server?

In general, a production-ready system can include several aspects.

Over the past couple of months, the following PRs have been added: `--threads` and `--threads`, `--ubatch-size`, `--log-disable` #6254.

What are the missing features or steps to make the server production ready?
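
For anyone wanting to try the flags above, here is a minimal smoke-test sketch: it starts the server with those options and polls `/health` until the model is loaded. The binary name, model path, and port are assumptions based on the defaults of the `server` example; adjust them to your build.

```python
# Sketch: launch the server with the flags discussed above and wait for
# /health before running any checks. Binary, model path, and port are
# illustrative defaults, not guaranteed to match every build.
import subprocess
import time
import urllib.request

proc = subprocess.Popen([
    "./server",                 # assumed binary name from examples/server
    "-m", "models/model.gguf",  # illustrative model path
    "--threads", "8",
    "--ubatch-size", "512",
    "--log-disable",
])

try:
    for _ in range(60):  # wait up to ~60 s for the model to load
        try:
            with urllib.request.urlopen("http://127.0.0.1:8080/health", timeout=2) as r:
                if r.status == 200:  # 200 once the model is loaded
                    break
        except OSError:  # connection refused, or non-200 while loading
            pass
        time.sleep(1)
    else:
        raise RuntimeError("server did not become healthy in time")
    # ... run smoke tests against the completion API here ...
finally:
    proc.terminate()
    proc.wait()
```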