Hugging Face Fast Tokenizer
Specification
- tokenizer class: PreTrainedTokenizerFast
- model: Unigram byte-fallback
- vocab size: 50,570
Requirements
transformers>=4.34.0
tokenizers>=0.14.0
Usage
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-v1.0")
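
Once loaded, the tokenizer follows the standard transformers API. A minimal sketch of an encode/decode round trip (the sample sentence is illustrative):

text = "こんにちは、世界"  # illustrative input
ids = tokenizer.encode(text)                    # text -> token IDs
tokens = tokenizer.convert_ids_to_tokens(ids)   # inspect the subword pieces
print(tokens)
print(tokenizer.decode(ids))                    # token IDs -> text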
- The tokenizer configuration files are bundled with the LLM-jp models distributed on the Hugging Face Hub.
- The tokenizer can be instantiated in the usual way with AutoTokenizer.from_pretrained(model_name_or_path).
- The minimal set of files for the HF tokenizer is located in /hf/ver2.1/code10k_en20k_ja30k.ver2.1_hf_fast; it can also be loaded directly from that directory, as sketched below.
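
A minimal sketch of loading from that local file set instead of the Hub, assuming the path above resolves to a directory in your checkout of this repository:

from transformers import AutoTokenizer

# Path assumed to point at the bundled tokenizer files; adjust to your checkout.
tokenizer = AutoTokenizer.from_pretrained("/hf/ver2.1/code10k_en20k_ja30k.ver2.1_hf_fast")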
SentencePiece Tokenizer
Specification
- SentencePiece Unigram Byte-fallback model
- vocab size: 50,570
Requirements
sentencepiece>=0.1.99
protobuf<3.21.0
Usage
from sentencepiece import SentencePieceProcessor
sp = SentencePieceProcessor("models/ver2.1/code10k_en20k_ja30k.ver2.1.model")
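
A minimal sketch of encoding and decoding with the loaded processor (the sample text is illustrative):

# Encode to subword pieces and to token IDs, then decode back.
pieces = sp.encode("こんにちは、世界", out_type=str)
ids = sp.encode("こんにちは、世界", out_type=int)
print(pieces)
print(sp.decode(ids))  # round-trips to the original text
# Byte-fallback: characters absent from the vocabulary are split into byte pieces such as <0xE3>.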