Add support for multilingual Viking models, please. #7309

Open
JohnClaw opened this issue May 15, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@JohnClaw

The convert.py script can't produce GGUF files for these models:

https://huggingface.co/LumiOpen/Viking-7B
https://huggingface.co/LumiOpen/Viking-13B
https://huggingface.co/LumiOpen/Viking-33B

@JohnClaw JohnClaw added the enhancement New feature or request label May 15, 2024
@compilade
Collaborator

compilade commented May 16, 2024

> Convert.py

You should use convert-hf-to-gguf.py for most models on HuggingFace. convert.py only supports Llama-like models.

However, the Viking models use a BPE tokenizer whose pre-tokenizer relies on a regex that is not yet defined in llama.cpp. From their tokenizer.json:

{
  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {
        "type": "Split",
        "pattern": {
          "Regex": " ?[^(\\s|[.,!?…。,、।۔،])]+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Digits",
        "individual_digits": true
      },
      {
        "type": "ByteLevel",
        "add_prefix_space": false,
        "trim_offsets": true,
        "use_regex": false
      }
    ]
  }
}

This will need to be handled for pre-tokenization to work correctly for this model family.
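
To make the behavior of the first two stages concrete, here is a rough Python sketch. It is an approximation, not the Hugging Face or llama.cpp implementation: the original pattern targets the Oniguruma engine, and this sketch assumes its nested character class can be flattened into a single negated class for Python's `re`. The names `VIKING_SPLIT`, `split_isolated`, `split_digits`, and `pre_tokenize` are made up for illustration, and the final ByteLevel stage is left out.

```python
import re

# Assumption: a Python `re` approximation of the Oniguruma pattern from
# tokenizer.json, with the nested character class flattened into one class.
VIKING_SPLIT = re.compile(r" ?[^\s()|.,!?…。,、।۔،]+")


def split_isolated(pattern, text):
    """Emit every match as its own piece and keep the gaps between matches."""
    pieces, pos = [], 0
    for m in pattern.finditer(text):
        if m.start() > pos:
            pieces.append(text[pos:m.start()])  # unmatched gap, e.g. "," or "!"
        pieces.append(m.group())
        pos = m.end()
    if pos < len(text):
        pieces.append(text[pos:])  # trailing unmatched text
    return pieces


def split_digits(pieces):
    """Isolate each digit, mimicking "individual_digits": true."""
    out = []
    for piece in pieces:
        out.extend(p for p in re.split(r"(\d)", piece) if p)
    return out


def pre_tokenize(text):
    """Apply the Split stage, then the Digits stage (ByteLevel is omitted)."""
    return split_digits(split_isolated(VIKING_SPLIT, text))
```

Under these assumptions, `pre_tokenize("Hello, world! 42")` returns `['Hello', ',', ' world', '!', ' ', '4', '2']`: punctuation is cut off from words, a leading space stays attached to the following word, and digits are emitted one at a time.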

(conversion currently fails when the pre-tokenizer doesn't match a known one)
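
One way a converter can detect an unknown pre-tokenizer is to tokenize a fixed probe text, hash the resulting token IDs, and look the hash up in a table of known pre-tokenizers. The sketch below illustrates that idea only; it is not the code in convert-hf-to-gguf.py, and `KNOWN_PRE_TOKENIZERS`, `fingerprint`, `register`, and `identify` are hypothetical names. The token IDs in the usage note are toy values, not real tokenizer output.

```python
from hashlib import sha256

# Hypothetical registry: fingerprint -> pre-tokenizer name. Real entries would
# come from tokenizing a fixed probe text with each supported tokenizer.
KNOWN_PRE_TOKENIZERS = {}


def fingerprint(token_ids):
    """Hash the token IDs a tokenizer produces for the probe text."""
    return sha256(str(token_ids).encode()).hexdigest()


def register(name, token_ids):
    """Record a supported pre-tokenizer under the hash of its probe output."""
    KNOWN_PRE_TOKENIZERS[fingerprint(token_ids)] = name


def identify(token_ids):
    """Return the registered name, or fail loudly for an unknown pre-tokenizer."""
    try:
        return KNOWN_PRE_TOKENIZERS[fingerprint(token_ids)]
    except KeyError:
        raise NotImplementedError("pre-tokenizer not recognized") from None
```

For example, after `register("toy-bpe", [1, 2, 3])`, calling `identify([1, 2, 3])` returns `"toy-bpe"`, while `identify([1, 2, 4])` raises `NotImplementedError`. Failing loudly here is a deliberate choice: silently falling back to a default pre-tokenizer would produce a file that loads but tokenizes text differently from the original model.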
