I think that all the tensors in the llama-2 model files distributed by Meta are BF16. When converting or quantizing the model to GGUF, some of these tensors (the small one-dimensional ones, such as norm weights) are always exported as FP32, regardless of the requested `--outtype`.
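One way to check this for a given file is to list each tensor's on-disk type with the `gguf` Python package's `GGUFReader`. A minimal sketch (the `model.gguf` path is a placeholder):

```python
from gguf import GGUFReader  # pip install gguf

# Hypothetical path to a converted model file.
reader = GGUFReader("model.gguf")

# Print the stored type of every tensor; in an F16 conversion the
# 1-D norm weights typically still show up as F32.
for tensor in reader.tensors:
    print(f"{tensor.name:40s} {tensor.tensor_type.name:6s} {tuple(tensor.shape)}")
```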
-
As I understand it, models like `meta-llama/Llama-2-13b-chat-hf` contain both fp16 and fp32 tensors, so I am wondering: with `--outtype fp16`, do all the fp32 tensors in the model get converted to fp16, while the tensors that are already fp16 are left unchanged? And with `--outtype fp32`, do all the fp16 tensors get converted to fp32, while the fp32 tensors are left unchanged? Thanks!
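To make the question concrete, here is a minimal sketch of the per-tensor casting rule being asked about. The `convert_tensor` helper is hypothetical, not the converter's actual code, and it assumes the always-FP32 behavior for 1-D tensors described in the reply above:

```python
import numpy as np

def convert_tensor(tensor: np.ndarray, outtype: str) -> np.ndarray:
    """Sketch of the dtype logic in question: cast every tensor to the
    requested output type, except tensors the converter is assumed to
    keep in FP32 (e.g. small 1-D norm weights)."""
    if tensor.ndim == 1:
        return tensor.astype(np.float32)  # assumed always-FP32 case
    target = np.float16 if outtype == "fp16" else np.float32
    # Casting to a dtype the tensor already has leaves values unchanged.
    return tensor.astype(target)
```

Under this rule, `--outtype fp16` would down-cast fp32 tensors and pass fp16 tensors through unchanged, and `--outtype fp32` would up-cast fp16 tensors, matching both cases the question asks about.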