Supporting phi-2 tokenizer #7022

Open
BramVanroy opened this issue May 1, 2024 · 4 comments · May be fixed by #7300
Labels
enhancement New feature or request

Comments

@BramVanroy

BramVanroy commented May 1, 2024

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

Provide support for phi-2. Running the following yields an error:

python -c "
from huggingface_hub import snapshot_download;
snapshot_download(repo_id='microsoft/phi-2', local_dir='phi-2', local_dir_use_symlinks=False)
"
python convert-hf-to-gguf.py phi-2/ --outtype f16

Error:

Traceback (most recent call last):
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 3001, in <module>
    main()
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 2988, in main
    model_instance.set_vocab()
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 75, in set_vocab
    self._set_vocab_gpt2()
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 331, in _set_vocab_gpt2
    tokens, toktypes, tokpre = self.get_vocab_base()
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 242, in get_vocab_base
    tokpre = self.get_vocab_base_pre(tokenizer)
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 323, in get_vocab_base_pre
    raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")
NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()

Phi-2 uses CodeGenTokenizer, which is a BPE tokenizer.

I'm not sure whether it is as simple as adding the following line here:

{ "name": "phi-2",          "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/microsoft/phi-2" },

Edit: I tried that; this is the generated hash check:

if chkhsh == "fcace8b9cac38ce847670c970cd5892031a753a1ef381abd1d9af00f713da085":
    # ref: https://huggingface.co/microsoft/phi-2
    res = "phi-2"
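For context on where that chkhsh value comes from: the update script appears to fingerprint a tokenizer by hashing the token IDs it produces for a fixed probe string, so two tokenizers that split the probe identically get the same hash. A minimal sketch of that idea (the probe text and the toy encoder below are stand-ins for illustration, not the script's actual ones):

```python
import hashlib

def pretokenizer_fingerprint(encode, probe_text: str) -> str:
    """Hash the token IDs an encoder produces for a probe string.

    Two tokenizers that split the probe identically yield the same
    fingerprint, which is how a converter can recognize a pre-tokenizer.
    """
    token_ids = encode(probe_text)
    return hashlib.sha256(str(token_ids).encode("utf-8")).hexdigest()

# Toy encoder standing in for transformers' tokenizer.encode.
toy_vocab = {"Hello": 0, " world": 1, "!": 2}

def toy_encode(text: str) -> list[int]:
    ids = []
    # Greedy longest-match lookup, purely illustrative.
    while text:
        for piece in sorted(toy_vocab, key=len, reverse=True):
            if text.startswith(piece):
                ids.append(toy_vocab[piece])
                text = text[len(piece):]
                break
        else:
            text = text[1:]  # skip characters not in the toy vocab
    return ids

fp = pretokenizer_fingerprint(toy_encode, "Hello world!")
```

This would also explain the observation further down this thread that phi-1 and phi-2 produce the same chkhsh: if both repos ship the same tokenizer files, the fingerprints collide by construction.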
@arch-btw
Contributor

arch-btw commented May 1, 2024

@BramVanroy #7024

@turian

turian commented May 6, 2024

Can you confirm that the HF tokenizer and the llama.cpp quantized GGUF tokenizer give identical results?

Particularly when the text contains special characters.

See #7049 and #7062

@BramVanroy
Author

@turian Any idea how I can easily test that?
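One way to approach this is to run both tokenizers over a suite of strings that tend to expose pre-tokenizer differences (leading spaces, digits, newlines, non-ASCII, emoji) and diff the resulting ID lists. A sketch of the comparison harness (the two encoders are passed in as plain callables; wiring them up to real models is described after the code):

```python
def first_mismatch(hf_tokens: list[int], gguf_tokens: list[int]):
    """Return (index, hf_id, gguf_id) of the first disagreement, or None."""
    for i, (a, b) in enumerate(zip(hf_tokens, gguf_tokens)):
        if a != b:
            return (i, a, b)
    if len(hf_tokens) != len(gguf_tokens):
        i = min(len(hf_tokens), len(gguf_tokens))
        return (i,
                hf_tokens[i] if i < len(hf_tokens) else None,
                gguf_tokens[i] if i < len(gguf_tokens) else None)
    return None

# Strings that tend to expose pre-tokenizer differences.
PROBES = [
    "Hello world",
    "   3 spaces then 12345 digits",
    "newline\nand\ttab",
    "Cześć, świecie! éè",
    "🦙 llama.cpp 🚀",
]

def compare(encode_hf, encode_gguf):
    """Run both encoders over the probe strings; return the failing probes."""
    failures = []
    for text in PROBES:
        mm = first_mismatch(encode_hf(text), encode_gguf(text))
        if mm is not None:
            failures.append((text, mm))
    return failures
```

With real models, `encode_hf` would be something like `AutoTokenizer.from_pretrained("microsoft/phi-2").encode`, and `encode_gguf` the GGUF side, e.g. via llama.cpp's tokenization example binary or llama-cpp-python's tokenize method (exact wiring depends on your setup). Any non-empty `failures` list would indicate a pre-tokenizer mismatch.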

@BramVanroy BramVanroy linked a pull request May 15, 2024 that will close this issue
@flatsiedatsie

flatsiedatsie commented Jun 8, 2024

Sorry if this is the wrong thread to post in, and I don't know if this is useful, but I thought I'd share a quick attempt I made to convert a Phi 2 model.

I ran into the above-mentioned error and tried modifying the two .py scripts for converting from Hugging Face.

To those files I added the extra Phi lines I found while searching the issues here, specifically this thread and:
#7219 (comment)

In convert-hf-to-gguf-update.py:

    {"name": "phi",            "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/microsoft/phi-1", },
    {"name": "phi-2",          "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/microsoft/phi-2", },
    {"name": "phi-3",          "tokt": TOKENIZER_TYPE.SPM, "repo": "https://huggingface.co/microsoft/Phi-3-mini-4k-instruct", },

...and two additions to convert-hf-to-gguf.py:

        if chkhsh == "fcace8b9cac38ce847670c970cd5892031a753a1ef381abd1d9af00f713da085":
            # ref: https://huggingface.co/microsoft/phi-1
            res = "phi"
        if chkhsh == "fcace8b9cac38ce847670c970cd5892031a753a1ef381abd1d9af00f713da085":
            # ref: https://huggingface.co/microsoft/phi-2
            res = "phi-2"

Note that the two chkhsh values are the same, presumably because phi-1 and phi-2 ship the same tokenizer files; as written, the second if branch will always overwrite res with "phi-2".

To the Phi2Model class I added a single add_tokenizer_pre line:

        self.gguf_writer.add_name("Phi2")
        self.gguf_writer.add_tokenizer_pre("gpt-2")
        self.gguf_writer.add_context_length(self.find_hparam(["n_positions", "max_position_embeddings"]))

Then I tried to run my Frankenstein creation. It seemingly worked, but when testing it I saw this error:

llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'phi'
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'model-q4_0.gguf'
main: error: unable to load model

I tried removing all the phi-1 references and ran it again. Now the error became:

llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'phi-2'
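The two errors above come from llama.cpp's model loader, not the Python converter: the loader only accepts pre-tokenizer names it has been taught about, so writing 'phi' or 'phi-2' into the GGUF metadata fails until the C++ side is extended to recognize those names as well. Conceptually the lookup behaves like this (a Python sketch of the pattern, not llama.cpp's actual C++ code; the name set is illustrative and incomplete):

```python
# Illustrative subset of pre-tokenizer names a loader might accept.
KNOWN_PRE_TOKENIZERS = {"default", "llama3", "gpt-2", "deepseek-llm"}

def load_vocab_pre(tokenizer_pre: str) -> str:
    """Resolve the pre-tokenizer name stored in the GGUF metadata."""
    if tokenizer_pre not in KNOWN_PRE_TOKENIZERS:
        raise ValueError(
            "error loading model vocabulary: "
            f"unknown pre-tokenizer type: '{tokenizer_pre}'"
        )
    return tokenizer_pre
```

This is also why the add_tokenizer_pre("gpt-2") line in the Phi2Model snippet above avoids the error: "gpt-2" is a name the loader already recognizes, whereas a newly invented name needs a matching change in llama.cpp itself.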
