Releases · ggerganov/llama.cpp

25 May 10:48

faa0e69

b2995

ggml: aarch64: SVE kernels for q8_0_q8_0, q4_0_q8_0 vector dot (#7433)

* Add SVE support for q4_0_q8_0 and q8_0_q8_0

* remove ifdef
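
The release note names the kernels but does not describe the computation. As rough orientation, the following is a scalar Python sketch of what a q8_0 by q8_0 block dot product computes, assuming the usual ggml layout of 32 signed 8-bit quants per block plus one per-block scale. The names `BlockQ8_0`, `quantize_q8_0`, and `vec_dot_q8_0_q8_0` are illustrative stand-ins, not the library's API; an SVE kernel would vectorize the inner integer sum rather than loop over elements.

```python
from dataclasses import dataclass
from typing import List

QK8_0 = 32  # elements per block (assumed to match ggml's q8_0 block size)


@dataclass
class BlockQ8_0:
    d: float        # per-block scale (fp16 in ggml; a plain float in this sketch)
    qs: List[int]   # QK8_0 signed 8-bit quantized values


def quantize_q8_0(values: List[float]) -> List[BlockQ8_0]:
    """Quantize floats block by block with scale = max|x| / 127."""
    assert len(values) % QK8_0 == 0
    blocks = []
    for i in range(0, len(values), QK8_0):
        chunk = values[i:i + QK8_0]
        amax = max(abs(v) for v in chunk)
        d = amax / 127.0
        inv = 1.0 / d if d else 0.0  # an all-zero block quantizes to zeros
        blocks.append(BlockQ8_0(d, [round(v * inv) for v in chunk]))
    return blocks


def vec_dot_q8_0_q8_0(x: List[BlockQ8_0], y: List[BlockQ8_0]) -> float:
    """Scalar reference: integer dot per block, then scale by both block scales."""
    total = 0.0
    for bx, by in zip(x, y):
        sumi = sum(a * b for a, b in zip(bx.qs, by.qs))  # integer accumulation
        total += sumi * bx.d * by.d
    return total
```

The integer sum inside each block is the part that maps naturally onto wide SIMD dot-product instructions; the two scales are applied once per block.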

25 May 10:41

github-actions

9791f40

b2994

android : module (#7502)

* move ndk code to a new library

* add gradle file

25 May 04:24

github-actions

902184d

b2993

fix missing slash in `fs_get_cache_directory()` (#7503)

* fix missing slash in fs_get_cache_directory()

* use LOCALAPPDATA for fs_get_cache_directory()

* better code style
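
The class of bug fixed here (a path built without its separator) can be illustrated with a small Python sketch. It is not the project's C++ code: `get_cache_directory`, `ensure_trailing_slash`, the `XDG_CACHE_HOME`/`HOME` fallback order, and the `llama.cpp` subdirectory name are assumptions made for illustration. Only the use of `LOCALAPPDATA` on Windows comes from the commit message.

```python
def ensure_trailing_slash(path: str, sep: str = "/") -> str:
    """Append the separator only if the path does not already end with it."""
    return path if path.endswith(sep) else path + sep


def get_cache_directory(env: dict, platform: str) -> str:
    """Hypothetical resolver: pick a base cache dir, then append a subdirectory.

    `env` is passed in explicitly (instead of reading os.environ) so the
    behavior is easy to test.
    """
    if platform == "windows":
        base = env.get("LOCALAPPDATA", "")
        sep = "\\"
    else:
        # Assumed fallback order for Unix-like systems.
        base = env.get("XDG_CACHE_HOME") or (env.get("HOME", "") + "/.cache")
        sep = "/"
    # Without this step, base + "llama.cpp" could yield ".../.cachellama.cpp".
    base = ensure_trailing_slash(base, sep)
    return ensure_trailing_slash(base + "llama.cpp", sep)
```

Normalizing the separator at both steps means callers can append a file name directly to the result, whether or not the environment variable itself ended in a slash.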

25 May 02:29

github-actions

5768433

b2992

Make tokenize CLI tool have nicer command line arguments. (#6188)

* Make tokenize.cpp CLI tool nicer.

Before this commit, tokenize was a simple CLI tool like this:

  tokenize MODEL_FILENAME PROMPT [--ids]

This simple tool loads the model, takes the prompt, and shows the tokens
llama.cpp is interpreting.

This changeset makes the tokenize tool more sophisticated and more useful
for debugging and troubleshooting:

  tokenize [-m, --model MODEL_FILENAME]
           [--ids]
           [--stdin]
           [--prompt]
           [-f, --file]
           [--no-bos]
           [--log-disable]

It also behaves better on Windows now, interpreting and rendering Unicode
from command-line arguments and pipes regardless of the code page set on
the user's terminal.

* style fix: strlen(str) == 0 --> *str == 0

* Simplify tokenize.cpp by removing the handling of positional-style arguments.

It must now be invoked with long arguments only (--model, --prompt, etc.).
This shortens the code.

* tokenize.cpp: iostream header no longer required
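
The flag layout described above can be mirrored with Python's `argparse` as a rough model of the interface. This is not the tool's actual C++ parser. The flag names are taken from the commit message; the value types, the required `--model`, and the mutual exclusivity of the three prompt sources are assumptions of this sketch.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of the named-argument interface; no positional arguments."""
    p = argparse.ArgumentParser(prog="tokenize")
    p.add_argument("-m", "--model", required=True, metavar="MODEL_FILENAME",
                   help="path to the model file")
    p.add_argument("--ids", action="store_true",
                   help="print token ids only")
    # Assumption: exactly one prompt source may be given.
    src = p.add_mutually_exclusive_group(required=True)
    src.add_argument("--stdin", action="store_true",
                     help="read the prompt from standard input")
    src.add_argument("--prompt", help="prompt text given on the command line")
    src.add_argument("-f", "--file", help="read the prompt from a file")
    p.add_argument("--no-bos", action="store_true",
                   help="do not add a BOS token")
    p.add_argument("--log-disable", action="store_true",
                   help="disable logging output")
    return p


if __name__ == "__main__":
    # Example invocation with a hypothetical model file name.
    args = build_parser().parse_args(["-m", "model.gguf", "--prompt", "Hello", "--ids"])
    print(args)
```

Dropping positional arguments means every value is labeled at the call site, which is the simplification the later commit in this release describes.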

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: brian khuu <mofosyne@gmail.com>

24 May 14:23

github-actions

27891f6

b2989

docker.yml: disable light-intel and server-intel test (#7515)

* docker.yml: disable light-intel test

* docker.yml: disable server-intel test

24 May 13:22

github-actions

fbca2f2

b2988

Add support for ArcticForCausalLM (#7020)

* common : increase max number of experts to 128

* common : add tensor LLM_TENSOR_FFN_NORM_EXPS for normalization before MoE that runs in parallel to attention + ffn

* gguf-py : add architecture-specific block mappings that override selected general block mappings

* convert-hf : add model conversion support for ArcticForCausalLM

* convert-hf : use added_tokens_decoder from tokenizer_config.json to redefine tokens from SentencePiece model (only for ArcticForCausalLM)

* llama : add inference support for LLM_ARCH_ARCTIC

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>

23 May 15:17

github-actions

1debe72

b2985

ggml : silence UB sanitizer error during iq2_xxs quantization (#0)

23 May 15:11

github-actions

007489e

b2984

Fix phi3 chat template confusion with zephyr (#7449)

* Fix phi3 template matching vs zephyr

* Add regression test for new phi3 chat template

* Implement review suggestions

* Fix phi3 jinja test templates & match by <|end|>

* Apply suggestion

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* Add all phi3 template variants in tests

* Remove unneeded message trimming

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* Fix tests to not expect trimmed messages
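
The confusion arises because template detection works on substrings, and both formats share role markers. A minimal Python sketch of the idea follows, assuming that Zephyr-style templates are recognized by `<|user|>` and that Phi-3 templates additionally contain `<|end|>` (the marker the commit message mentions). The function name and return values are hypothetical, not the library's API.

```python
def detect_chat_template(tmpl: str) -> str:
    """Guess the chat format from marker substrings in a template string."""
    # Check Phi-3 first: its templates also contain "<|user|>", so a
    # Zephyr check keyed on that marker alone would claim them.
    if "<|assistant|>" in tmpl and "<|end|>" in tmpl:
        return "phi3"
    if "<|user|>" in tmpl:
        return "zephyr"
    return "unknown"


# Illustrative template fragments (simplified, not the full Jinja templates).
phi3_like = "<|user|>\n{{ content }}<|end|>\n<|assistant|>\n"
zephyr_like = "<|user|>\n{{ content }}</s>\n<|assistant|>\n"

print(detect_chat_template(phi3_like))
print(detect_chat_template(zephyr_like))
```

The order of the checks carries the fix: the more specific marker has to be tested before the shared one.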

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

23 May 14:07

github-actions

3015851

b2982

llama : add getters for n_threads/n_threads_batch (#7464)

* llama : add getters for n_threads/n_threads_batch

This commit adds two new functions to the llama API. The functions
can be used to get the number of threads used for generating a single
token and the number of threads used for prompt and batch processing
(multiple tokens).

The motivation for this is that we want to be able to get the number of
threads that a context is using. The main use case is testing and
verifying that the number of threads is set correctly.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* squash! llama : add getters for n_threads/n_threads_batch

Rename the getters to llama_n_threads and llama_n_threads_batch.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

23 May 14:06

github-actions

55ac3b7

b2981

ci : use Pythia models instead of OpenLlama (#7470)

* ci : start using Pythia models over OpenLlama

ggml-ci

* ci : disable q2_k ppl tests

* ci : use convert-hf-to-gguf.py

* ci : update gg_get_model

* ci : fix convert outfile name

ggml-ci

* llama : gptneox arch use F32 attn prec

ggml-ci
