
Releases: llm-jp/llm-jp-tokenizer

Release ver2.1

18 Oct 06:36
132f216

Hugging Face Fast Tokenizer

Specification

  • tokenizer class: PreTrainedTokenizerFast
    • Unigram Byte-fallback model
  • vocab size: 50,570

Requirements

  • transformers>=4.34.0
  • tokenizers>=0.14.0

Usage

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-v1.0")
  • The tokenizer configuration files are bundled with the LLM-jp models distributed on the Hugging Face Hub.
  • The tokenizer can be instantiated in the usual way with AutoTokenizer.from_pretrained(model_name_or_path).
  • The minimal set of Hugging Face tokenizer files is placed in /hf/ver2.1/code10k_en20k_ja30k.ver2.1_hf_fast.
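As a quick sanity check, the loaded tokenizer can round-trip text through token IDs; the sample sentence below is only illustrative.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-v1.0")
# encode text into token IDs, inspect the tokens, and decode back
ids = tokenizer("こんにちは、世界").input_ids
print(tokenizer.convert_ids_to_tokens(ids))
print(tokenizer.decode(ids))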

SentencePiece Tokenizer

Specification

  • SentencePiece Unigram Byte-fallback model
  • vocab size: 50,570

Requirements

  • sentencepiece>=0.1.99
  • protobuf<3.21.0

Usage

from sentencepiece import SentencePieceProcessor
sp = SentencePieceProcessor("models/ver2.1/code10k_en20k_ja30k.ver2.1.model")
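A minimal sketch of tokenizing with the loaded model (the sample sentence is only illustrative):

from sentencepiece import SentencePieceProcessor
sp = SentencePieceProcessor("models/ver2.1/code10k_en20k_ja30k.ver2.1.model")
# tokenize into subword pieces, then into IDs, and decode back to text
print(sp.encode("こんにちは、世界", out_type=str))
ids = sp.encode("こんにちは、世界")
print(sp.decode(ids))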

Release ver2.2

09 Oct 02:54
f0cd49d

What's Changed

New Contributors

Full Changelog: https://github.com/llm-jp/llm-ja-tokenizer/commits/v2.2