
EOT token incorrectly set for Mistral-v0.2 trained with added ChatML tokens #7271

Open
xzuyn opened this issue May 14, 2024 · 4 comments

xzuyn commented May 14, 2024

It's setting the EOT token to 32000 and reporting that 32000 is <|im_end|>, but that's not what 32000 is in my model. My tokenizer_config.json shows that 32000 is <|im_start|>, which is how I trained it. This also seems to be causing my model to end responses with <|im_start|> instead of <|im_end|>.

I converted using `python3 convert-hf-to-gguf.py --outtype bf16 --outfile "./ggml-model-bf16.gguf" "./MyModelDir"`, then quantized using `quantize "./ggml-model-bf16.gguf" "./MyModel-q6_K.gguf" "q6_K"`.

Link to the model's QLoRA if it matters.

llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOT token        = 32000 '<|im_end|>'
    "32000": {
      "content": "<|im_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "32001": {
      "content": "<|im_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    }

It's like it's hardcoded to set <|im_start|> to 32001 and <|im_end|> to 32000 even if that's not what the model uses.
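To make the mismatch concrete, here's a minimal sketch of checking what the HF tokenizer actually maps these tokens to (assuming the local model directory from my convert command above; the expected IDs come from tokenizer_config.json):

from transformers import AutoTokenizer

# Assumption: "./MyModelDir" is the directory passed to convert-hf-to-gguf.py above
tokenizer = AutoTokenizer.from_pretrained("./MyModelDir")

print(tokenizer.convert_tokens_to_ids("<|im_start|>"))  # 32000 per tokenizer_config.json
print(tokenizer.convert_tokens_to_ids("<|im_end|>"))    # 32001 per tokenizer_config.json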

Jeximo commented May 14, 2024

It appears your model does not list <|im_start|> or <|im_end|> as special tokens ("special": false in the config you posted). There's logic in llama.cpp that handles a token differently when it isn't marked special.

If you're able, maybe try setting special to true for those tokens and reconverting.
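Something like this (a rough sketch, assuming your model directory is ./MyModelDir and the added_tokens_decoder layout matches what you posted):

# Rough sketch: mark the ChatML tokens as special in tokenizer_config.json,
# then re-run convert-hf-to-gguf.py. The path is an assumption.
import json

path = "./MyModelDir/tokenizer_config.json"
with open(path) as f:
    cfg = json.load(f)

for entry in cfg.get("added_tokens_decoder", {}).values():
    if entry.get("content") in ("<|im_start|>", "<|im_end|>"):
        entry["special"] = True

with open(path, "w") as f:
    json.dump(cfg, f, indent=2)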

xzuyn commented May 15, 2024

That wouldn't explain this though.

> It's like it's hardcoded to set <|im_start|> to 32001 and <|im_end|> to 32000 even if that's not what the model uses.

[Screenshot: Screenshot_from_2024-05-13_21-21-54]

My model also was not trained with those tokens set as special, so I shouldn't need to change that to get things to work.


Also, I think the issue you linked is similar/related to an issue page I made the other day about models that use legacy: true.

ngxson commented May 17, 2024

The link to your model is 404 not found.

Anyway, did you check if added_tokens.json is set correctly? (The JSON you posted above is from tokenizer_config.json)

xzuyn commented May 18, 2024

> The link to your model is 404 not found.

Sorry, I unprivated it.

> Anyway, did you check if added_tokens.json is set correctly? (The JSON you posted above is from tokenizer_config.json)

My model is fine, and added_tokens.json is also set correctly. The issue here is that the llama.cpp conversion doesn't match Transformers at all when it comes to added tokens.

{
  "<|im_end|>": 32001,
  "<|im_start|>": 32000
}
from transformers import AutoTokenizer
import requests


string_to_test = "<|im_start|>user\nTest Input<|im_end|><|im_start|>assistant\nTest Response<|im_end|>"

tokenizer = AutoTokenizer.from_pretrained("PJMixers/MV02-PB-Mixture-v1-run_15-SFT-7B-Latest-QLoRA")

# Model is converted and quantized with lcpp, running on the latest kcpp
koboldcpp_string_to_test = (
    requests.post(
        f"http://127.0.0.1:5001/api/extra/tokencount",
        json={"prompt": string_to_test},
    ).json()["ids"]
)

# Transformers output (Correct)
print(tokenizer.encode(string_to_test))
# [1, 32000, 2188, 13, 1963, 11232, 32001, 32000, 13892, 13, 1963, 12107, 32001]
# ['<s>', '<|im_start|>', '▁user', '<0x0A>', 'Test', '▁Input', '<|im_end|>', '<|im_start|>', '▁assistant', '<0x0A>', 'Test', '▁Response', '<|im_end|>']

# KoboldCPP/llama.cpp output (Very incorrect)
print(koboldcpp_string_to_test)
# [1, 32001, 1838, 13, 1963, 11232, 32000, 32001, 489, 11143, 13, 1963, 12107, 32000]
# ['<s>', '<|im_end|>', 'user', '<0x0A>', 'Test', '▁Input', '<|im_start|>', '<|im_end|>', 'ass', 'isstant', '<0x0A>', 'Test', '▁Response', '<|im_start|>']

You can also see the legacy: true issue show up here.
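For anyone trying to reproduce this without KoboldCPP, here's a rough sketch that tokenizes the same string directly against the converted GGUF (assuming a recent llama-cpp-python; the file name is the one from my quantize command above):

from llama_cpp import Llama

# Sketch only: load just the vocab from the quantized GGUF and tokenize the test string
string_to_test = "<|im_start|>user\nTest Input<|im_end|><|im_start|>assistant\nTest Response<|im_end|>"

llm = Llama(model_path="./MyModel-q6_K.gguf", vocab_only=True)
ids = llm.tokenize(string_to_test.encode("utf-8"), add_bos=True, special=True)
print(ids)  # should match the Transformers IDs above if the GGUF token metadata were correct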
