Viking tokenizer support #7328

akx · 2024-05-16T13:22:11Z

LumiOpen/Viking-7B has a variant tokenizer

Just using llama-bpe had the model generate sensible Finnish, but this is attempting to do things a bit more correctly.

See https://huggingface.co/LumiOpen/Viking-7B/discussions/2#664602c8b88f8519c3d50113 for discussion.

Converted model (with just llama-bpe): https://huggingface.co/akx/Viking-7B-gguf

ggerganov

The proper way is to update convert-hf-to-gguf-update.py and validate that the tokenization tests pass

akx · 2024-05-17T11:38:35Z

@ggerganov Thanks for the pointers. I think I'm doing things more correctly in this iteration, but the test-tokenizer-0 test is failing (output gist here), and I'm not quite sure where to go from there, even after trying to follow #6920...

test-tokenizer-1-bpe doesn't fail, but prints

llm_load_vocab: mismatch in special tokens definition ( 11/131072 vs 24/131072 ).

among other output.

cc @jonabur (of the Viking team)

ggerganov · 2024-05-17T12:07:52Z

If test-tokenizer-0 fails, this likely means that the pre-tokenizer config is not exactly the same as LLaMA3. I just checked the tokenizer.json file of that model and here is the relevant section:

  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {
        "type": "Split",
        "pattern": {
          "Regex": " ?[^(\\s|[.,!?…。，、।۔،])]+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Digits",
        "individual_digits": true
      },
      {
        "type": "ByteLevel",
        "add_prefix_space": false,
        "trim_offsets": true,
        "use_regex": false
      }
    ]
  },

So you have to implement the respective regexes in llama.cpp instead of re-using LLaMA3:

llama.cpp/llama.cpp

Lines 12287 to 12378 in 27b0406

    
           switch (vocab.type) {
               case LLAMA_VOCAB_TYPE_BPE:
                   switch (vocab.type_pre) {
                       case LLAMA_VOCAB_PRE_TYPE_LLAMA3:
                           ignore_merges = true;
                           word_collection = unicode_regex_split(text, {
                               // original regex from tokenizer.json
                               //"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",
                               // adapted: https://github.com/ggerganov/llama.cpp/pull/6920#issuecomment-2080233989
                               "(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",
                           });
                           break;
                       case LLAMA_VOCAB_PRE_TYPE_DBRX:
                           word_collection = unicode_regex_split(text, {
                               // same as llama3
                               "(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",
                           });
                           break;
                       case LLAMA_VOCAB_PRE_TYPE_DEEPSEEK_LLM:
                           word_collection = unicode_regex_split(text, {
                               "[\r\n]",
                               "\\s?[A-Za-zµÀ-ÖØ-öø-ƺƼ-ƿǄ-ʓʕ-ʯͰ-ͳͶͷͻ-ͽͿΆΈ-ΊΌΎ-ΡΣ-ϵϷ-ҁҊ-ԯԱ-ՖႠ-ჅᎠ-Ᏽᏸ-ᏽᲐ-ᲺᲽ-Ჿᴀ-ᴫᵫ-ᵷᵹ-ᶚḀ-ἕἘ-Ἕἠ-ὅὈ-Ὅὐ-ὗὙὛὝὟ-ώᾀ-ᾴᾶ-ᾼιῂ-ῄῆ-ῌῐ-ΐῖ-Ίῠ-Ῥῲ-ῴῶ-ῼℂℇℊ-ℓℕℙ-ℝℤΩℨK-ℭℯ-ℴℹℼ-ℿⅅ-ⅉⅎↃↄⰀ-ⱻⱾ-ⳤⳫ-ⳮⳲⳳꙀ-ꙭꚀ-ꚛꜢ-ꝯꝱ-ꞇꞋ-ꞎꭰ-ꮿﬀ-ﬆﬓ-ﬗＡ-Ｚａ-ｚ𐐀-𐑏𐒰-𐓓𐓘-𐓻𐲀-𐲲𐳀-𐳲𑢠-𑣟𞤀-𞥃]+", 
                               "\\s?[!-/:-~！-／：-～‘-‟　-。]+",
                               "\\s+$",
                               "[一-龥ࠀ-一가-퟿]+",
                               "\\p{N}+",
                           });
                           break;
                       case LLAMA_VOCAB_PRE_TYPE_DEEPSEEK_CODER:
                           word_collection = unicode_regex_split(text, {
                               "[\r\n]",
                               "\\s?\\p{L}+",
                               "\\s?\\p{P}+",
                               "[一-龥ࠀ-一가-퟿]+",
                               "\\p{N}",
                           });
                           break;
                       case LLAMA_VOCAB_PRE_TYPE_FALCON:
                           word_collection = unicode_regex_split(text, {
                               "[\\p{P}\\$\\+<=>\\^~\\|]+",
                               "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
                               "[0-9][0-9][0-9]",
                           });
                           break;
                       case LLAMA_VOCAB_PRE_TYPE_MPT:
                           // TODO: MPT pre-tokenization regexes are unknown
                           //       the following are close, but not exact. run the following:
                           //       ./bin/test-tokenizer-0 ../models/ggml-vocab-mpt.gguf
                           GGML_ASSERT("MPT pre-tokenization regexes are unknown - fixes needed");
                           word_collection = unicode_regex_split(text, {
                               "\\s?\\p{L}+",
                               "\\s?\\p{P}+",
                               "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
                           });
                           break;
                       case LLAMA_VOCAB_PRE_TYPE_STARCODER:
                       case LLAMA_VOCAB_PRE_TYPE_REFACT:
                       case LLAMA_VOCAB_PRE_TYPE_COMMAND_R:
                           word_collection = unicode_regex_split(text, {
                               "\\p{N}",
                               "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
                           });
                           break;
                       case LLAMA_VOCAB_PRE_TYPE_GPT2:
                       case LLAMA_VOCAB_PRE_TYPE_OLMO:
                           word_collection = unicode_regex_split(text, {
                               "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
                           });
                           break;
                       case LLAMA_VOCAB_PRE_TYPE_QWEN2:
                           word_collection = unicode_regex_split(text, {
                               // original regex from tokenizer.json
                               // "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
                               "(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",
                           });
                           break;
                       default:
                           // default regex for BPE tokenization pre-processing
                           word_collection = unicode_regex_split(text, {
                               "[\\p{P}\\$\\+<=>\\^~\\|]+",
                               "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
                               "\\p{N}+",
                               "[0-9][0-9][0-9]",
                           });
                           break;
                   }
                   break;
               default:
                   GGML_ASSERT(false);
                   break;
           }
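
To make the steps in that "pre_tokenizer" section easier to see at a glance, here is a minimal Python sketch (not part of the PR) that lists the steps of a `Sequence` pre-tokenizer from a tokenizer.json-style config. The embedded JSON is a trimmed copy of the section quoted above; `describe_pre_tokenizer` is a hypothetical helper name.

```python
import json

# Trimmed copy of the "pre_tokenizer" section quoted above.
TOKENIZER_JSON = r'''
{
  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {"type": "Split",
       "pattern": {"Regex": " ?[^(\\s|[.,!?…。，、।۔،])]+"},
       "behavior": "Isolated", "invert": false},
      {"type": "Digits", "individual_digits": true},
      {"type": "ByteLevel", "add_prefix_space": false,
       "trim_offsets": true, "use_regex": false}
    ]
  }
}
'''

def describe_pre_tokenizer(config):
    """Return one human-readable line per pre-tokenization step (hypothetical helper)."""
    pre = config["pre_tokenizer"]
    # A "Sequence" wraps several steps; anything else is treated as a single step.
    steps = pre["pretokenizers"] if pre["type"] == "Sequence" else [pre]
    lines = []
    for step in steps:
        if step["type"] == "Split":
            lines.append(f"Split on regex {step['pattern']['Regex']!r} ({step['behavior']})")
        elif step["type"] == "Digits":
            lines.append(f"Digits, individual_digits={step['individual_digits']}")
        else:
            lines.append(step["type"])
    return lines

steps = describe_pre_tokenizer(json.loads(TOKENIZER_JSON))
for line in steps:
    print(line)
```

Each of the three steps would need a counterpart on the llama.cpp side; the LLaMA3 case above only covers a single combined regex.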

jonabur · 2024-05-27T08:32:40Z

We've also run into this same issue providing a gguf'd version of Poro, using the same tokenizer regex. @akx have you made any progress converting the regex format?

akx · 2024-05-27T17:07:26Z

@jonabur I'm working on it (again) now :)

~~Btw, is the tokenizer the same for the other Vikings?~~

EDIT: evidently it is, same chkhsh!
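
For readers unfamiliar with the chkhsh: as far as I understand it, it is a fingerprint of how a tokenizer encodes a fixed probe text, so two models with the same hash tokenize that text identically. The sketch below illustrates the idea only; the toy encoders and the `tokenizer_fingerprint` name are hypothetical and this is not the converter's actual code.

```python
from hashlib import sha256

def tokenizer_fingerprint(encode, probe_text):
    """Hash the token ID sequence that `encode` produces for `probe_text`."""
    token_ids = encode(probe_text)
    return sha256(str(token_ids).encode("utf-8")).hexdigest()

# Toy stand-ins for real tokenizers (hypothetical, for illustration only).
def encode_by_bytes(text):
    return list(text.encode("utf-8"))

def encode_by_codepoints(text):
    return [ord(ch) for ch in text]

probe = "Hello ½ 123"
same = tokenizer_fingerprint(encode_by_bytes, probe) == tokenizer_fingerprint(encode_by_bytes, probe)
different = tokenizer_fingerprint(encode_by_bytes, probe) != tokenizer_fingerprint(encode_by_codepoints, probe)
print(same, different)
```

A matching hash only says the probe text is tokenized the same way; it does not by itself prove the tokenizers agree on every input.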

akx · 2024-05-27T17:46:08Z

I'm not sure what to make of the failing tokenizer tests even after adding in the regexp (and I suppose "digits" means another regexp should be splitting all digits into separate tokens?)

The actual detokenized output matches the expected output, but the token sequence doesn't. The first commit in this PR now improves the output of test-tokenizer-0 for easier diffing...

-expected tokens:
+got tokens:
    746 '
  '
   2392 '
 
  '
  55899 '
@@ -169,37 +169,43 @@
   3395 '""'
  30917 '......'
  17846 '!!!!'
   2420 '!!'
  13728 '????'
   3963 '??'
-  9873 ' I've'
+   383 ' I'
+  7029 ''ve'
   1912 ' been'
- 37493 ' 't'
+   630 ' ''
+   107 't'
    733 'old'
- 17600 ' he's'
+   627 ' he'
+   689 ''s'
   1923 ' there'
     35 ','
    630 ' ''
   1417 'RE'
    791 ' you'
   6189 ' sure'
     54 '?'
- 23586 ' 'M'
+   630 ' ''
+    68 'M'
    835 ' not'
   6189 ' sure'
- 18068 ' I'll'
+   383 ' I'
+  6704 ''ll'
   2463 ' make'
    590 ' it'
     35 ','
- 35018 ' 'D'
+   630 ' ''
+    59 'D'
    791 ' you'
   1647 ' like'
   2032 ' some'
  22940 ' tea'
     54 '?'
   2221 ' We'
     30 '''
   6815 'Ve'
    279 ' a'
  79905 ''l'
     67 'L'

and

-expected tokens:
+got tokens:
    348 '   '
  40540 ' Hello'
-   472 '
+   209 '
    '
+   348 '   '
  40540 ' Hello'
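
On the "digits" question: a rough Python emulation of the first two steps of the `Sequence` (the `Split` with "Isolated" behavior, then `Digits` with `individual_digits: true`) may help when comparing against the expected tokens. This is a sketch under assumptions, not the Hugging Face implementation: the regex is a flattened reading of the original (treating `(`, `|` and `)` as literal members of the negated class), and "digit" is taken to mean any character in a Unicode N* category.

```python
import re
import unicodedata

# Flattened form of the Split regex (an assumption about how the nested class is read).
SPLIT_RE = re.compile(r" ?[^\s.,!?…。，、।۔،()|]+")

def split_isolated(text, pattern):
    """Emit each regex match as its own piece, plus any unmatched gaps between matches."""
    pieces, pos = [], 0
    for m in pattern.finditer(text):
        if m.start() > pos:
            pieces.append(text[pos:m.start()])
        pieces.append(m.group())
        pos = m.end()
    if pos < len(text):
        pieces.append(text[pos:])
    return pieces

def split_individual_digits(piece):
    """Isolate every numeric character (Unicode category N*), keep other runs together."""
    out, run = [], ""
    for ch in piece:
        if unicodedata.category(ch).startswith("N"):
            if run:
                out.append(run)
                run = ""
            out.append(ch)
        else:
            run += ch
    if run:
        out.append(run)
    return out

def pre_tokenize(text):
    """Apply the Split step, then the Digits step, to every piece."""
    pieces = []
    for piece in split_isolated(text, SPLIT_RE):
        pieces.extend(split_individual_digits(piece))
    return pieces

print(pre_tokenize("I've been here 42 times."))
```

Under these assumptions, a piece such as " 42" is broken into " ", "4", "2", so the leading space ends up as its own piece rather than staying attached to the number.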

github-actions · 2024-05-27T20:22:37Z

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 550 iterations 🚀

Concurrent users: 8, duration: 10m
HTTP request : avg=8502.32ms p(95)=20008.53ms fails=, finish reason: stop=502 truncated=48
Prompt processing (pp): avg=98.22tk/s p(95)=429.3tk/s
Token generation (tg): avg=35.91tk/s p(95)=46.57tk/s
ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=viking-8b-b commit=2c8f62fd408bb6118561ff1d24423e8151925cc5

[Charts omitted: time series of llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio and llamacpp:requests_processing for bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 550 iterations, x-axis 1716840726 --> 1716841352.]
jonabur · 2024-05-30T06:43:15Z

Unfortunately I'm not familiar with the tokenizer regex; we inherited it from Bloom (maybe it should be called the bloom tokenizer instead?). It looks like it's using some possibly non-standard regex features, and may not translate directly to the regex format supported in llama.cpp.

In particular capturing groups inside character classes, or embedding character classes inside character classes both seem possibly non-standard to me, though it's been a long time since I've done regexes anywhere near this complicated.
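
One way to probe this is to compare candidate "flat" readings of the pattern on a sample string. The sketch below uses Python's `re` module and assumes, without having confirmed it against the Hugging Face tokenizer, that the original engine treats the inner brackets as a nested class and `(`, `|`, `)` as literal characters inside it. Under that reading the two candidates differ on text that contains parentheses.

```python
import re

EXCLUDED_PUNCT = ".,!?…。，、।۔،"

# Reading 1: drop the group syntax entirely.
simple = re.compile(r" ?[^\s" + EXCLUDED_PUNCT + r"]+")
# Reading 2: keep "(", "|" and ")" as literal members of the negated class.
literal = re.compile(r" ?[^\s" + EXCLUDED_PUNCT + r"()|]+")

def pieces(pattern, text):
    """Return the substrings matched by the pattern, in order."""
    return pattern.findall(text)

sample = "tea (hot)"
print(pieces(simple, sample))   # parentheses stay inside a piece
print(pieces(literal, sample))  # parentheses are excluded from pieces
```

If the reference tokenizer splits parentheses out on their own, that would point to the second reading.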

jonabur · 2024-05-30T13:50:12Z

llama.cpp

+                    tokenizer_pre == "llama3"    ||
+                    tokenizer_pre == "llama-v3"  ||
+                    tokenizer_pre == "llama-bpe" ||
+                    tokenizer_pre == "viking-7b") {


This needs two changes:

tokenizer_pre needs to be updated to match the "viking" in convert-hf-to-gguf.py

this needs its own if statement block which sets vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_VIKING
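
The intent of those two changes, sketched in Python for brevity (the real code is the C++ if/else chain shown in the diff; `PreType`, `PRE_TYPE_BY_NAME` and `resolve_pre_type` are hypothetical names): the string written by the converter has to match exactly, and "viking" should map to its own pre-tokenizer type rather than being an alias for LLaMA3. Raising on an unknown name is a choice made for this sketch.

```python
from enum import Enum, auto

class PreType(Enum):
    DEFAULT = auto()
    LLAMA3 = auto()
    VIKING = auto()

# Hypothetical lookup table standing in for the if/else chain in llama.cpp.
PRE_TYPE_BY_NAME = {
    "llama3": PreType.LLAMA3,
    "llama-v3": PreType.LLAMA3,
    "llama-bpe": PreType.LLAMA3,
    "viking": PreType.VIKING,  # own entry, not an alias of LLAMA3
}

def resolve_pre_type(tokenizer_pre):
    """Map the name stored by the converter to a pre-tokenizer type."""
    try:
        return PRE_TYPE_BY_NAME[tokenizer_pre]
    except KeyError:
        raise ValueError(f"unknown pre-tokenizer type: {tokenizer_pre!r}") from None

print(resolve_pre_type("viking"))
```

With this table, a name like "viking-7b" fails loudly instead of silently falling back to another model's regexes.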

jonabur · 2024-05-30T14:02:04Z

llama.cpp

@@ -12580,6 +12581,11 @@ struct llm_tokenizer_bpe {
                            "(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",
                        });
                        break;
+                    case LLAMA_VOCAB_PRE_TYPE_VIKING:
+                        word_collection = unicode_regex_split(text, {
+                            " ?[^(\\s|[.,!?…。，、।۔،])]+",


I think this regex works for the first test that fails, but I'm still left with one failing test I don't understand.

" ?[^\\s.,!?…。，、।۔،]+",

jonabur · 2024-05-30T14:06:01Z

I suggested two changes which enable the code to work and the tests to pass, but now a different test is failing and I don't understand why. It looks like a unicode character is being split? Any idea what's going on here?

I'm not confident in the updated regex, because I'm not sure what the expected behavior is for embedding a character class inside a character class, or for a capturing group within a character class ("[^(\\s|[...])]"). I've never seen that done before, but it seems like the simpler statement should work?

src: 'ied 4 ½ months'
res: 'ied 4 ½ months'
tok: 1502 231 43 882 145 9290
failed test:    'ied 4 ½ months'
detokenized to: 'ied 4 ½ months'
(which matches the expected output)
expected tokens:
  1502 'ied'
   231 ' '
    43 '4'
   231 ' '
  1177 '½'
  9290 ' months'

got tokens:
  1502 'ied'
   231 ' '
    43 '4'
   882 ' �'
   145 '�'
  9290 ' months'
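
The '�' entries in the "got tokens" look like the two UTF-8 bytes of '½' landing in different tokens (the space plus the first byte, then the second byte alone). The Python snippet below only shows the relevant facts about the character; whether the digit-splitting step is what separates ' ' from '½' in the expected output is a guess on my part, not a diagnosis.

```python
import unicodedata

ch = "½"
utf8 = ch.encode("utf-8")

print(f"U+{ord(ch):04X}", unicodedata.name(ch))
print("UTF-8 bytes:", [hex(b) for b in utf8])
print("category:", unicodedata.category(ch))  # Unicode general category
print("isdigit:", ch.isdigit(), "isnumeric:", ch.isnumeric())

# A lone byte of a multi-byte sequence is not valid UTF-8 on its own,
# so it is shown as U+FFFD when decoded leniently.
halves = [utf8[:1].decode("utf-8", errors="replace"),
          utf8[1:].decode("utf-8", errors="replace")]
print(halves)
```

So a pre-tokenizer that only isolates ASCII or decimal digits would leave " ½" together as one piece, while one that isolates every numeric character would split it into " " and "½".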

mofosyne added the model, python, and Review Complexity : Low labels May 16, 2024

ggerganov reviewed May 17, 2024

akx mentioned this pull request May 17, 2024

add Viking tokenizer support #7329

Closed

akx marked this pull request as draft May 17, 2024 10:52

akx force-pushed the viking-8b-b branch from 582491a to e69ffba Compare May 17, 2024 11:07

akx mentioned this pull request May 17, 2024

convert-hf-to-gguf-update improvements #7340

Merged

akx force-pushed the viking-8b-b branch 2 times, most recently from ab842e3 to 69f815d Compare May 17, 2024 11:32

akx changed the title ~~Add LumiOpen/Viking-7B to converter script~~ Viking-7B tokenizer support May 17, 2024

akx mentioned this pull request May 26, 2024

Readme: add akx/ggify to tools #1484

Merged

akx force-pushed the viking-8b-b branch from 69f815d to a89252a Compare May 27, 2024 17:24

github-actions bot added the testing label May 27, 2024

akx added 2 commits May 27, 2024 20:45

test-tokenizer-0: improve output, show how many tests failed

c28c996

Add Viking-7B tokenizer support

2c8f62f

akx force-pushed the viking-8b-b branch from a89252a to 2c8f62f Compare May 27, 2024 17:45

akx changed the title ~~Viking-7B tokenizer support~~ Viking tokenizer support May 27, 2024

jonabur reviewed May 30, 2024

ezosa mentioned this pull request Jun 3, 2024

Poro-34B-chat tokenizer support #7713

Open
