Supporting phi-2 tokenizer #7022

Open
BramVanroy opened this issue May 1, 2024 · 4 comments · May be fixed by #7300
Labels
enhancement New feature or request

Comments

@BramVanroy

BramVanroy commented May 1, 2024

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

Provide support for phi-2. Running the following yields an error:

python -c "
from huggingface_hub import snapshot_download;
snapshot_download(repo_id='microsoft/phi-2', local_dir='phi-2', local_dir_use_symlinks=False)
"
python convert-hf-to-gguf.py phi-2/ --outtype f16

Error:

Traceback (most recent call last):
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 3001, in <module>
    main()
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 2988, in main
    model_instance.set_vocab()
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 75, in set_vocab
    self._set_vocab_gpt2()
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 331, in _set_vocab_gpt2
    tokens, toktypes, tokpre = self.get_vocab_base()
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 242, in get_vocab_base
    tokpre = self.get_vocab_base_pre(tokenizer)
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 323, in get_vocab_base_pre
    raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")
NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()

Phi-2 uses CodeGenTokenizer, which is a BPE tokenizer.

I'm not sure whether it is as simple as adding the following line here:

{ "name": "phi-2",          "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/microsoft/phi-2" },

Edit: I tried that; this is the generated hash check:

if chkhsh == "fcace8b9cac38ce847670c970cd5892031a753a1ef381abd1d9af00f713da085":
    # ref: https://huggingface.co/microsoft/phi-2
    res = "phi-2"
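For context on where that chkhsh value comes from: the update script appears to fingerprint a tokenizer by hashing the token IDs it produces for a fixed probe string, so two tokenizers that split the probe identically get the same hash. A minimal sketch of that idea (the probe text and the toy encoder below are stand-ins for illustration, not the script's actual ones):

```python
import hashlib

def pretokenizer_fingerprint(encode, probe_text: str) -> str:
    """Hash the token IDs an encoder produces for a probe string.

    Two tokenizers that split the probe identically yield the same
    fingerprint, which is how a converter can recognize a pre-tokenizer.
    """
    token_ids = encode(probe_text)
    return hashlib.sha256(str(token_ids).encode("utf-8")).hexdigest()

# Toy encoder standing in for transformers' tokenizer.encode.
toy_vocab = {"Hello": 0, " world": 1, "!": 2}

def toy_encode(text: str) -> list[int]:
    ids = []
    # Greedy longest-match lookup, purely illustrative.
    while text:
        for piece in sorted(toy_vocab, key=len, reverse=True):
            if text.startswith(piece):
                ids.append(toy_vocab[piece])
                text = text[len(piece):]
                break
        else:
            text = text[1:]  # skip characters not in the toy vocab
    return ids

fp = pretokenizer_fingerprint(toy_encode, "Hello world!")
```

This would also explain the observation further down this thread that phi-1 and phi-2 produce the same chkhsh: if both repos ship the same tokenizer files, the fingerprints collide by construction.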
@arch-btw
Contributor

arch-btw commented May 1, 2024

@BramVanroy #7024

@turian

turian commented May 6, 2024

Can you confirm that the HF tokenizer and the llama.cpp quantized GGUF tokenizer give identical results?

Particularly when the text contains special characters.

See #7049 and #7062

@BramVanroy
Author

@turian Any idea how I can easily test that?
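One way to approach this is to run both tokenizers over a suite of strings that tend to expose pre-tokenizer differences (leading spaces, digits, newlines, non-ASCII, emoji) and diff the resulting ID lists. A sketch of the comparison harness (the two encoders are passed in as plain callables; wiring them up to real models is described after the code):

```python
def first_mismatch(hf_tokens: list[int], gguf_tokens: list[int]):
    """Return (index, hf_id, gguf_id) of the first disagreement, or None."""
    for i, (a, b) in enumerate(zip(hf_tokens, gguf_tokens)):
        if a != b:
            return (i, a, b)
    if len(hf_tokens) != len(gguf_tokens):
        i = min(len(hf_tokens), len(gguf_tokens))
        return (i,
                hf_tokens[i] if i < len(hf_tokens) else None,
                gguf_tokens[i] if i < len(gguf_tokens) else None)
    return None

# Strings that tend to expose pre-tokenizer differences.
PROBES = [
    "Hello world",
    "   3 spaces then 12345 digits",
    "newline\nand\ttab",
    "Cześć, świecie! éè",
    "🦙 llama.cpp 🚀",
]

def compare(encode_hf, encode_gguf):
    """Run both encoders over the probe strings; return the failing probes."""
    failures = []
    for text in PROBES:
        mm = first_mismatch(encode_hf(text), encode_gguf(text))
        if mm is not None:
            failures.append((text, mm))
    return failures
```

With real models, `encode_hf` would be something like `AutoTokenizer.from_pretrained("microsoft/phi-2").encode`, and `encode_gguf` the GGUF side, e.g. via llama.cpp's tokenization example binary or llama-cpp-python's tokenize method (exact wiring depends on your setup). Any non-empty `failures` list would indicate a pre-tokenizer mismatch.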

@BramVanroy BramVanroy linked a pull request May 15, 2024 that will close this issue
@flatsiedatsie

flatsiedatsie commented Jun 8, 2024

Sorry if this is the wrong thread to post in, and I don't know if this is useful, but I thought I'd share a quick attempt I made to convert a Phi 2 model.

I ran into the above-mentioned error and tried modifying the two .py scripts for converting from Hugging Face.

To those files I added the extra Phi lines I found while searching the issues here, specifically this thread and:
#7219 (comment)

In convert-hf-to-gguf-update.py:

    {"name": "phi",            "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/microsoft/phi-1", },
    {"name": "phi-2",          "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/microsoft/phi-2", },
    {"name": "phi-3",          "tokt": TOKENIZER_TYPE.SPM, "repo": "https://huggingface.co/microsoft/Phi-3-mini-4k-instruct", },

...and two additions to convert-hf-to-gguf.py:

        if chkhsh == "fcace8b9cac38ce847670c970cd5892031a753a1ef381abd1d9af00f713da085":
            # ref: https://huggingface.co/microsoft/phi-1
            res = "phi"
        if chkhsh == "fcace8b9cac38ce847670c970cd5892031a753a1ef381abd1d9af00f713da085":
            # ref: https://huggingface.co/microsoft/phi-2
            res = "phi-2"

Note that the two chkhsh values are the same, presumably because phi-1 and phi-2 ship the same tokenizer files; as written, the second if branch will always overwrite res with "phi-2".

To the Phi2Model class I added a single add_tokenizer_pre line:

        self.gguf_writer.add_name("Phi2")
        self.gguf_writer.add_tokenizer_pre("gpt-2")
        self.gguf_writer.add_context_length(self.find_hparam(["n_positions", "max_position_embeddings"]))

Then I tried to run my Frankenstein creation. It seemingly worked, but when testing it I saw this error:

llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'phi'
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'model-q4_0.gguf'
main: error: unable to load model

I tried removing all the phi-1 references and ran it again. Now the error became:

llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'phi-2'
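The two errors above come from llama.cpp's model loader, not the Python converter: the loader only accepts pre-tokenizer names it has been taught about, so writing 'phi' or 'phi-2' into the GGUF metadata fails until the C++ side is extended to recognize those names as well. Conceptually the lookup behaves like this (a Python sketch of the pattern, not llama.cpp's actual C++ code; the name set is illustrative and incomplete):

```python
# Illustrative subset of pre-tokenizer names a loader might accept.
KNOWN_PRE_TOKENIZERS = {"default", "llama3", "gpt-2", "deepseek-llm"}

def load_vocab_pre(tokenizer_pre: str) -> str:
    """Resolve the pre-tokenizer name stored in the GGUF metadata."""
    if tokenizer_pre not in KNOWN_PRE_TOKENIZERS:
        raise ValueError(
            "error loading model vocabulary: "
            f"unknown pre-tokenizer type: '{tokenizer_pre}'"
        )
    return tokenizer_pre
```

This is also why the add_tokenizer_pre("gpt-2") line in the Phi2Model snippet above avoids the error: "gpt-2" is a name the loader already recognizes, whereas a newly invented name needs a matching change in llama.cpp itself.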
