I have been using the server for the past 3 months in a production environment on Kubernetes, with active users, and I am pretty happy with the current server version in terms of features, code quality, stability, tests, and performance. In particular, I think that if we continue in this direction we can claim that the server is production ready and that we have an efficient LLM serving solution. @ggerganov @ngxson What do you think?


---

I'm curious what your experience is with load balancing and cache redundancy across pods. Do you use sticky sessions in any way, or do you find cached tokens to be problematic under round-robin load balancing? I think the server's most underrated feature is caching tokens between completions, saving the time needed to re-tokenize and re-evaluate prompts that have already been read before.
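
One pattern that avoids sticky sessions while still keeping the prompt cache warm is to make routing a pure function of the prompt itself, so requests sharing a prefix land on the pod that already holds the matching KV cache. Below is a minimal sketch, not a production implementation: the pod URLs are hypothetical, and it assumes the server's `/completion` endpoint and `cache_prompt` request field (verify both against your server version's README).

```python
# Sketch: route each completion request to the pod most likely to hold a
# warm KV cache for its prompt, by hashing a fixed-size prompt prefix.
# POD_URLS are hypothetical; "/completion" and "cache_prompt" follow the
# llama.cpp server API, but verify them against your server version.
import hashlib
import json
import urllib.request

POD_URLS = [
    "http://llama-0.llama.svc.cluster.local:8080",
    "http://llama-1.llama.svc.cluster.local:8080",
    "http://llama-2.llama.svc.cluster.local:8080",
]

def pick_pod(prompt: str, prefix_chars: int = 512) -> str:
    """Prompts sharing a prefix (e.g. the same system prompt) hash to the
    same pod, so their cached tokens can be reused instead of re-evaluated."""
    digest = hashlib.sha256(prompt[:prefix_chars].encode("utf-8")).digest()
    return POD_URLS[int.from_bytes(digest[:4], "big") % len(POD_URLS)]

def complete(prompt: str, n_predict: int = 128) -> dict:
    body = json.dumps({
        "prompt": prompt,
        "n_predict": n_predict,
        "cache_prompt": True,  # ask the server to reuse cached prompt tokens
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{pick_pod(prompt)}/completion",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())
```

The obvious trade-off versus plain round-robin is load skew: a hot prefix pins all of its traffic to one pod, so in practice this would be combined with a load-aware fallback.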

---

I think what is still missing for a lot of use cases is something like a …

---

llama.cpp is considered production ready. But what about the server?

In general, a production-ready system can include several aspects.

Over the past couple of months, the following PRs have been added: `--threads` and `--threads`, `--ubatch-size`, `--log-disable` #6254.

What are the missing features or steps to make the server production ready?
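
For anyone wanting to try the flags above, here is a minimal smoke-test sketch: it starts the server with those options and polls `/health` until the model is loaded. The binary name, model path, and port are assumptions based on the defaults of the `server` example; adjust them to your build.

```python
# Sketch: launch the server with the flags discussed above and wait for
# /health before running any checks. Binary, model path, and port are
# illustrative defaults, not guaranteed to match every build.
import subprocess
import time
import urllib.request

proc = subprocess.Popen([
    "./server",                 # assumed binary name from examples/server
    "-m", "models/model.gguf",  # illustrative model path
    "--threads", "8",
    "--ubatch-size", "512",
    "--log-disable",
])

try:
    for _ in range(60):  # wait up to ~60 s for the model to load
        try:
            with urllib.request.urlopen("http://127.0.0.1:8080/health", timeout=2) as r:
                if r.status == 200:  # 200 once the model is loaded
                    break
        except OSError:  # connection refused, or non-200 while loading
            pass
        time.sleep(1)
    else:
        raise RuntimeError("server did not become healthy in time")
    # ... run smoke tests against the completion API here ...
finally:
    proc.terminate()
    proc.wait()
```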