
Releases: ggerganov/llama.cpp

b2995

25 May 10:48
faa0e69
ggml: aarch64: SVE kernels for q8_0_q8_0, q4_0_q8_0 vector dot (#7433)

* Add SVE support for q4_0_q8_0 and q8_0_q8_0

* remove ifdef
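
For context, a minimal sketch of the predicated SVE loop style such kernels build on, written as a plain f32 dot product rather than the actual quantized q4_0/q8_0 kernels from #7433; function name and structure are illustrative only (compile with -march=armv8-a+sve):

    #include <arm_sve.h>   // requires an SVE-capable compiler target
    #include <cstdint>

    // Dot product over n floats using scalable vectors; the predicate
    // produced by svwhilelt handles the tail without a scalar epilogue.
    float sve_dot_f32(const float * a, const float * b, int64_t n) {
        svfloat32_t acc = svdup_n_f32(0.0f);
        int64_t i = 0;
        svbool_t pg = svwhilelt_b32(i, n);
        while (svptest_any(svptrue_b32(), pg)) {
            svfloat32_t va = svld1_f32(pg, a + i);
            svfloat32_t vb = svld1_f32(pg, b + i);
            acc = svmla_f32_m(pg, acc, va, vb); // acc += va * vb on active lanes
            i  += svcntw();                     // f32 lanes per SVE vector
            pg  = svwhilelt_b32(i, n);
        }
        return svaddv_f32(svptrue_b32(), acc);  // horizontal reduction
    }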

b2994

25 May 10:41
9791f40
android : module (#7502)

* move ndk code to a new library

* add gradle file

b2993

25 May 04:24
902184d
fix missing slash in `fs_get_cache_directory()` (#7503)

* fix missing slash in fs_get_cache_directory()

* use LOCALAPPDATA for fs_get_cache_directory()

* better code style
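
An illustrative sketch of what such a helper looks like after the fix, assuming the LOCALAPPDATA and XDG conventions named above; the subdirectory name is an assumption and the real fs_get_cache_directory() in common/ differs in detail:

    #include <cstdlib>
    #include <string>

    static std::string get_cache_directory() {
        std::string dir;
    #ifdef _WIN32
        // On Windows, prefer the per-user %LOCALAPPDATA% directory.
        if (const char * local = std::getenv("LOCALAPPDATA")) {
            dir = std::string(local) + "\\llama.cpp\\";
        }
    #else
        // Elsewhere, follow the XDG base directory convention.
        if (const char * xdg = std::getenv("XDG_CACHE_HOME")) {
            dir = std::string(xdg) + "/llama.cpp/";
        } else if (const char * home = std::getenv("HOME")) {
            dir = std::string(home) + "/.cache/llama.cpp/";
        }
    #endif
        // The point of the fix: the returned path always ends with a
        // separator, so callers can append a file name directly.
        return dir;
    }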

b2992

25 May 02:29
5768433
Make tokenize CLI tool have nicer command line arguments. (#6188)

* Make the tokenize.cpp CLI tool nicer.

Before this commit, tokenize was a simple CLI tool like this:

  tokenize MODEL_FILENAME PROMPT [--ids]

This simple tool loads the model, takes the prompt, and shows how
llama.cpp tokenizes it.

This changeset makes the tokenize tool more sophisticated and more
useful for debugging and troubleshooting:

  tokenize [-m, --model MODEL_FILENAME]
           [--ids]
           [--stdin]
           [--prompt]
           [-f, --file]
           [--no-bos]
           [--log-disable]

It also behaves better on Windows now, correctly interpreting and
rendering Unicode from command-line arguments and pipes no matter what
code page the user has set on their terminal.

* style fix: strlen(str) == 0 --> *str == 0

* Simplify tokenize.cpp by getting rid of positional-style argument handling.

It must now be invoked with the long --model, --prompt, etc. arguments
only, which shortens the code.

* tokenize.cpp: iostream header no longer required

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: brian khuu <mofosyne@gmail.com>
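
A hypothetical sketch of a long-option parsing loop using the flag names from the commit message; the actual examples/tokenize/tokenize.cpp is organized differently:

    #include <string>

    struct tokenize_args {
        std::string model;
        std::string prompt;
        std::string file;
        bool ids         = false;
        bool read_stdin  = false;
        bool no_bos      = false;
        bool log_disable = false;
    };

    // Returns false on an unknown flag or a flag missing its value.
    static bool parse_args(int argc, char ** argv, tokenize_args & args) {
        for (int i = 1; i < argc; ++i) {
            const std::string arg = argv[i];
            if ((arg == "-m" || arg == "--model") && i + 1 < argc) {
                args.model = argv[++i];
            } else if (arg == "--prompt" && i + 1 < argc) {
                args.prompt = argv[++i];
            } else if ((arg == "-f" || arg == "--file") && i + 1 < argc) {
                args.file = argv[++i];
            } else if (arg == "--ids") {
                args.ids = true;
            } else if (arg == "--stdin") {
                args.read_stdin = true;
            } else if (arg == "--no-bos") {
                args.no_bos = true;
            } else if (arg == "--log-disable") {
                args.log_disable = true;
            } else {
                return false;
            }
        }
        return true;
    }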

b2989

24 May 14:23
27891f6
docker.yml: disable light-intel and server-intel test (#7515)

* docker.yml: disable light-intel test

* docker.yml: disable server-intel test

b2988

24 May 13:22
fbca2f2
Add support for ArcticForCausalLM (#7020)

* common : increase max number of experts to 128

* common : add tensor LLM_TENSOR_FFN_NORM_EXPS for the normalization applied before the MoE block that runs in parallel to attention + FFN

* gguf-py : add architecture-specific block mappings that override selected general block mappings

* convert-hf : add model conversion support for ArcticForCausalLM

* convert-hf : use added_tokens_decoder from tokenizer_config.json to redefine tokens from SentencePiece model (only for ArcticForCausalLM)

* llama : add inference support for LLM_ARCH_ARCTIC

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
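
The override mechanism itself lives in gguf-py (Python); purely as an illustration of the idea, a C++ sketch with hypothetical names in which architecture-specific entries replace selected general ones:

    #include <map>
    #include <string>

    // HF tensor name -> GGUF tensor name
    using tensor_name_map = std::map<std::string, std::string>;

    // Start from the general block mappings and let the architecture-specific
    // table override selected entries (e.g. for ArcticForCausalLM).
    static tensor_name_map build_tensor_name_map(const tensor_name_map & general,
                                                 const tensor_name_map & arch_overrides) {
        tensor_name_map result = general;
        for (const auto & kv : arch_overrides) {
            result[kv.first] = kv.second;  // arch-specific mapping wins
        }
        return result;
    }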

b2985

23 May 15:17
1debe72
ggml : silence UB sanitizer error during iq2_xxs quantization (#0)

b2984

23 May 15:11
007489e
Fix phi3 chat template confusion with zephyr (#7449)

* Fix phi3 template matching vs zephyr

* Add regression test for new phi3 chat template

* Implement review suggestions

* Fix phi3 jinja test templates & match by <|end|>

* Apply suggestion

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* Add all phi3 template variants in tests

* Remove unneeded message trimming

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* Fix tests to not expect trimmed messages

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
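
A minimal sketch of the marker-based matching idea behind the fix, distinguishing Phi-3 from Zephyr by the <|end|> turn terminator; the template detection in llama.cpp itself covers more cases:

    #include <string>

    enum class chat_template_kind { phi3, zephyr, unknown };

    static chat_template_kind detect_chat_template(const std::string & tmpl) {
        // Phi-3 templates use <|user|>/<|assistant|> roles and end each
        // turn with <|end|>.
        if (tmpl.find("<|end|>") != std::string::npos &&
            tmpl.find("<|user|>") != std::string::npos) {
            return chat_template_kind::phi3;
        }
        // Zephyr uses the same role markers but terminates turns with </s>,
        // which previously caused the two templates to be confused.
        if (tmpl.find("<|user|>") != std::string::npos) {
            return chat_template_kind::zephyr;
        }
        return chat_template_kind::unknown;
    }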

b2982

23 May 14:07
3015851
llama : add getters for n_threads/n_threads_batch (#7464)

* llama : add getters for n_threads/n_threads_batch

This commit adds two new functions to the llama API. The functions
can be used to get the number of threads used for generating a single
token and the number of threads used for prompt and batch processing
(multiple tokens).

The motivation for this is that we want to be able to get the number of
threads that a context is using. The main use case is
testing/verification that the number of threads is set correctly.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* squash! llama : add getters for n_threads/n_threads_batch

Rename the getters to llama_n_threads and llama_n_threads_batch.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
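
A minimal sketch of that verification use case, written against the llama.h API of this release series; the thread counts below are arbitrary example values:

    #include <cstdio>
    #include "llama.h"

    int main(int argc, char ** argv) {
        if (argc < 2) {
            fprintf(stderr, "usage: %s <model.gguf>\n", argv[0]);
            return 1;
        }

        llama_backend_init();

        llama_model_params mparams = llama_model_default_params();
        llama_model * model = llama_load_model_from_file(argv[1], mparams);
        if (model == NULL) {
            fprintf(stderr, "failed to load model\n");
            return 1;
        }

        llama_context_params cparams = llama_context_default_params();
        cparams.n_threads       = 4;  // threads for single-token generation
        cparams.n_threads_batch = 8;  // threads for prompt/batch processing

        llama_context * ctx = llama_new_context_with_model(model, cparams);

        // The new getters report what the context is actually using.
        printf("n_threads       = %u\n", llama_n_threads(ctx));
        printf("n_threads_batch = %u\n", llama_n_threads_batch(ctx));

        llama_free(ctx);
        llama_free_model(model);
        llama_backend_free();
        return 0;
    }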

b2981

23 May 14:06
55ac3b7
ci : use Pythia models instead of OpenLlama (#7470)

* ci : start using Pythia models over OpenLlama

ggml-ci

* ci : disable q2_k ppl tests

* ci : use convert-hf-to-gguf.py

* ci : update gg_get_model

* ci : fix convert outfile name

ggml-ci

* llama : gptneox arch use F32 attn prec

ggml-ci