cometkim/unicode-segmenter

unicode-segmenter

A lightweight and fast, pure JavaScript library for Unicode segmentation.

Features

The library includes utilities to deal with:

  • Emojis and pictographs
  • Extended grapheme clusters
  • Non-Latin alphabets and numbers
  • Unicode characters and UTF-16 surrogates
  • Intl.Segmenter API polyfill

It has no dependencies, so you can use it even where built-in Unicode libraries aren't available, such as old browsers, edge runtimes, and embedded environments.

Unicode® version

Unicode® 15.1.0 Standard Annex #29 Revision 43 (2023-08-16)

Usage

Using TypeScript

No worries. The library is fully typed and ships *.d.ts files for you 😉

Export unicode-segmenter/emoji

Utilities for matching emoji-like characters

Example: Use Unicode emoji property matchers

import {
  isEmoji,             // match \p{Extended_Pictographic}
  isEmojiPresentation, // match \p{Emoji_Presentation}
} from 'unicode-segmenter/emoji';

isEmoji('😍'.codePointAt(0));
// => true
isEmoji('♡'.codePointAt(0));
// => true

isEmojiPresentation('😍'.codePointAt(0));
// => true
isEmojiPresentation('♡'.codePointAt(0));
// => false

Export unicode-segmenter/general

Utilities for matching alphanumeric characters

Example: Use Unicode general property matchers

import {
  isLetter,       // match \p{L}
  isNumeric,      // match \p{N}
  isAlphabetic,   // match \p{Alphabetic}
  isAlphanumeric, // match [\p{N}\p{Alphabetic}]
} from 'unicode-segmenter/general';
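
The property names in the comments above are the same ones the built-in RegExp `u` flag understands, which is the baseline used in the benchmarks further down. As a point of reference only, the four checks could be sketched with plain regular expressions. The `re*` patterns and the `matches` helper are made up for this illustration and are not part of unicode-segmenter. The sketch assumes a runtime with Unicode property escape support and, like the emoji example earlier, passes each character as a code point.

```javascript
// Hypothetical reference checks built on RegExp property escapes.
// None of these names come from unicode-segmenter.
const reLetter = /^\p{L}$/u;
const reNumeric = /^\p{N}$/u;
const reAlphabetic = /^\p{Alphabetic}$/u;
const reAlphanumeric = /^[\p{N}\p{Alphabetic}]$/u;

// Test a single code point (a number) against one of the patterns above.
function matches(re, codePoint) {
  return re.test(String.fromCodePoint(codePoint));
}

const results = {
  letterA: matches(reLetter, 'A'.codePointAt(0)),               // true
  letterHangul: matches(reLetter, '\uC548'.codePointAt(0)),     // true
  numericDigit: matches(reNumeric, '7'.codePointAt(0)),         // true
  alphabeticDigit: matches(reAlphabetic, '7'.codePointAt(0)),   // false
  alphanumericDigit: matches(reAlphanumeric, '7'.codePointAt(0)), // true
  alphanumericBang: matches(reAlphanumeric, '!'.codePointAt(0)),  // false
};

console.log(results);
```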

Export unicode-segmenter/grapheme

Utilities for text segmentation by extended grapheme cluster rules

Example: Count graphemes

import * as assert from 'node:assert/strict';
import { countGrapheme } from 'unicode-segmenter/grapheme';

assert.equal('👋 안녕!'.length, 6);
assert.equal(countGrapheme('👋 안녕!'), 5);

assert.equal('a̐éö̲'.length, 7);
assert.equal(countGrapheme('a̐éö̲'), 3);
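
The numbers differ because `.length` counts UTF-16 code units, while one grapheme cluster may span several code units or even several code points. The sketch below puts the three levels side by side. It uses the built-in Intl.Segmenter as a stand-in for countGrapheme, so it assumes a runtime where Intl.Segmenter exists. `countUnits` is a hypothetical helper, and the strings are written with escapes so their exact code points are visible.

```javascript
// Hypothetical helper: count one string at three different levels.
function countUnits(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return {
    codeUnits: text.length,                          // UTF-16 code units
    codePoints: [...text].length,                    // string iteration yields code points
    graphemes: [...segmenter.segment(text)].length,  // extended grapheme clusters
  };
}

// Waving hand (a surrogate pair), a space, two Hangul syllables, '!'
const greeting = countUnits('\u{1F44B} \uC548\uB155!');
// Base letters followed by combining marks: a+1 mark, e+1 mark, o+2 marks
const marks = countUnits('a\u0310e\u0301o\u0308\u0332');

console.log(greeting); // { codeUnits: 6, codePoints: 5, graphemes: 5 }
console.log(marks);    // { codeUnits: 7, codePoints: 7, graphemes: 3 }
```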

Example: Get grapheme segments

import { graphemeSegments } from 'unicode-segmenter/grapheme';

[...graphemeSegments('a̐éö̲\r\n')];
// 0: { segment: 'a̐', index: 0, input: 'a̐éö̲\r\n' }
// 1: { segment: 'é', index: 2, input: 'a̐éö̲\r\n' }
// 2: { segment: 'ö̲', index: 4, input: 'a̐éö̲\r\n' }
// 3: { segment: '\r\n', index: 7, input: 'a̐éö̲\r\n' }

Example: Build an advanced grapheme matcher

import { graphemeSegments, GraphemeCategory } from 'unicode-segmenter/grapheme';

function* matchEmoji(str) {
  // the internal `_cat` field is the GraphemeCategory value at the match index
  for (const { segment, _cat } of graphemeSegments(str)) {
    if (_cat === GraphemeCategory.Extended_Pictographic) {
      yield segment;
    }
  }
}

[...matchEmoji('1🌷2🎁3💩4😜5👍')]
// 0: 🌷
// 1: 🎁
// 2: 💩
// 3: 😜
// 4: 👍

Export unicode-segmenter/intl-adapter

Intl.Segmenter API adapter (only granularity: "grapheme" is available so far)

import { Segmenter } from 'unicode-segmenter/intl-adapter';

// Same API as `Intl.Segmenter`
const segmenter = new Segmenter();
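
Since the adapter is described as having the same API as Intl.Segmenter, the usual `segment()` iteration pattern should carry over. The sketch below runs against the built-in Intl.Segmenter so that it is self-contained. Swapping in the adapter's `Segmenter` is an assumption based on that description and has not been verified here.

```javascript
// Uses the built-in Intl.Segmenter; the adapter's Segmenter is assumed
// to be a drop-in replacement for grapheme granularity.
const nativeSegmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

// Base letters with combining marks, followed by CRLF
const input = 'a\u0310e\u0301o\u0308\u0332\r\n';

// segment() returns an iterable of { segment, index, input } objects
const indexes = [];
for (const { segment, index } of nativeSegmenter.segment(input)) {
  indexes.push(index);
  console.log(index, JSON.stringify(segment));
}
// indexes => [0, 2, 4, 7]
```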

Export unicode-segmenter/intl-polyfill

Intl.Segmenter API polyfill (only granularity: "grapheme" is available so far)

// Apply polyfill to the `globalThis.Intl` object.
import 'unicode-segmenter/intl-polyfill';

const segmenter = new Intl.Segmenter();

Export unicode-segmenter/utils

You can access some internal utilities for dealing with Unicode characters and UTF-16 in JavaScript

Example: Handle UTF-16 surrogate pairs

import {
  isHighSurrogate,
  isLowSurrogate,
  surrogatePairToCodePoint,
} from 'unicode-segmenter/utils';

const u32 = '😍';
const hi = u32.charCodeAt(0);
const lo = u32.charCodeAt(1);

if (isHighSurrogate(hi) && isLowSurrogate(lo)) {
  const codePoint = surrogatePairToCodePoint(hi, lo); // equivalent to u32.codePointAt(0)
}
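
For readers unfamiliar with surrogate pairs, the arithmetic behind this conversion is defined by UTF-16 itself and can be written out by hand. The helpers below are hypothetical stand-ins for illustration. They are not the library's implementation, only the standard formula.

```javascript
// Hypothetical helpers showing the standard UTF-16 surrogate arithmetic.
function isHigh(unit) {
  return unit >= 0xd800 && unit <= 0xdbff; // high (lead) surrogate range
}

function isLow(unit) {
  return unit >= 0xdc00 && unit <= 0xdfff; // low (trail) surrogate range
}

function combineSurrogates(hi, lo) {
  // Each surrogate carries 10 bits; the result is offset by 0x10000.
  return (hi - 0xd800) * 0x400 + (lo - 0xdc00) + 0x10000;
}

const smiley = '\u{1F60D}';
const hiUnit = smiley.charCodeAt(0); // 0xD83D
const loUnit = smiley.charCodeAt(1); // 0xDE0D

const combined = isHigh(hiUnit) && isLow(loUnit)
  ? combineSurrogates(hiUnit, loUnit)
  : hiUnit;

console.log(combined.toString(16)); // "1f60d"
```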

Example: Take a Unicode character from a JS string

import {
  takeChar,
  takeCodePoint,
} from 'unicode-segmenter/utils';

const str = '😍♡😍'; // .length = 5

let ch = '';
let cursor = 0;

ch = takeChar(str, cursor);              // => '😍'
ch = takeChar(str, cursor += ch.length); // => '♡'
ch = takeChar(str, cursor += ch.length); // => '😍'

// `takeCodePoint` does the same, but returns the Unicode code point instead
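
To make the cursor arithmetic concrete, here is a minimal sketch of how such helpers could behave, written with only built-in string methods. `takeCharSketch` and `takeCodePointSketch` are hypothetical names. The library's own implementation may differ.

```javascript
// Hypothetical equivalents built on String.prototype.codePointAt.
function takeCodePointSketch(str, cursor) {
  // Reads a full code point, joining a surrogate pair if one starts here.
  return str.codePointAt(cursor);
}

function takeCharSketch(str, cursor) {
  return String.fromCodePoint(takeCodePointSketch(str, cursor));
}

const sample = '\u{1F60D}\u2661\u{1F60D}'; // .length = 5
const chars = [];
let pos = 0;

while (pos < sample.length) {
  const ch = takeCharSketch(sample, pos);
  chars.push(ch);
  pos += ch.length; // advance by 1 or 2 code units
}

console.log(chars.length); // 3
```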

Benchmarks

This library aims to be lighter and faster than alternatives in the ecosystem.

Take a look at the benchmark to see how it works.

unicode-segmenter/emoji vs

  • built-in Unicode RegExp
  • emoji-regex@10.3.0 (101M+ weekly downloads on NPM)

Package stats

Name                     Unicode®         ESM?  Size    Size (min)  Size (min+gzip)  Size (min+br)
unicode-segmenter/emoji  15.1.0           ✔️    3,058   2,611       1,041            751
emoji-regex*             15.1.0 (varies)  ✔️    12,946  12,859      2,180            1,746
RegExp w/ u*             -                -     0       0           0                0

  • emoji-regex only supports the Emoji_Presentation property, not Extended_Pictographic.
  • You can build your own emoji-regex using emoji-test-regex-pattern.
  • RegExp Unicode data stays up to date as long as the runtime supports it.
  • RegExp Unicode may not be available in some old browsers, edge runtimes, or embedded environments.

Runtime performance

The runtime performance of unicode-segmenter/emoji is good enough for testing whether a text contains emoji.

It is ~2.5x slower than RegExp w/ u when matching all emoji, but that comparison has little real-world meaning because the alternatives don't take grapheme clusters into account.
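
To illustrate what ignoring grapheme clusters can mean in practice, the sketch below matches a family emoji (three pictographs joined by zero-width joiners) with a plain property-escape pattern and then segments the same string by grapheme. It uses only built-ins and assumes a runtime with Intl.Segmenter. It says nothing about how emoji-regex handles such sequences.

```javascript
// Man + ZWJ + woman + ZWJ + girl: rendered as a single family emoji
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';

// A code-point-level match sees each pictograph separately.
const regexMatches = family.match(/\p{Extended_Pictographic}/gu) ?? [];

// Grapheme segmentation keeps the whole ZWJ sequence together.
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
const clusters = [...graphemeSegmenter.segment(family)].map((s) => s.segment);

console.log(regexMatches.length); // 3
console.log(clusters.length);     // 1
```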

Details
cpu: Apple M1 Pro
runtime: node v20.13.1 (arm64-darwin)

benchmark                    time (avg)             (min … max)       p75       p99      p999
--------------------------------------------------------------- -----------------------------
• checking if any emoji
--------------------------------------------------------------- -----------------------------
unicode-segmenter/emoji   16.11 ns/iter     (15.28 ns … 339 ns)  16.32 ns  18.66 ns  43.42 ns
RegExp w/ unicode         19.03 ns/iter     (16.52 ns … 185 ns)   17.9 ns  46.28 ns  74.85 ns
emoji-regex               43.15 ns/iter   (41.54 ns … 73.51 ns)  43.58 ns  47.93 ns  65.73 ns

summary for checking if any emoji
  unicode-segmenter/emoji
   1.18x faster than RegExp w/ unicode
   2.68x faster than emoji-regex

• match all emoji
--------------------------------------------------------------- -----------------------------
unicode-segmenter/emoji   3'215 ns/iter     (2'958 ns … 189 µs)  3'208 ns  3'708 ns 11'833 ns
RegExp w/ unicode         1'285 ns/iter   (1'221 ns … 1'509 ns)  1'299 ns  1'449 ns  1'509 ns
emoji-regex              11'696 ns/iter    (11'125 ns … 239 µs) 11'667 ns 16'125 ns 20'375 ns

summary for match all emoji
  unicode-segmenter/emoji
   2.5x slower than RegExp w/ unicode
   3.64x faster than emoji-regex

unicode-segmenter/general vs

  • built-in unicode RegExp
  • XRegExp@5.1.1 (2.8M+ weekly downloads on NPM)

Package stats

Name                       Unicode®  ESM?  Size     Size (min)  Size (min+gzip)  Size (min+br)
unicode-segmenter/general  15.1.0    ✔️    21,505   20,972      5,792            3,564
XRegExp                    14.0.0    ✖️    383,156  194,202     62,986           39,871
RegExp w/ u*               -         -     0        0           0                0

  • RegExp Unicode data stays up to date as long as the runtime supports it.
  • RegExp Unicode may not be available in some old browsers, edge runtimes, or embedded environments.

Runtime performance

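unicode-segmenter/general is almost equivalent to RegExp w/ u: on par when checking for any alphanumeric character, and about 1.22x slower when matching all of them.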

Details
cpu: Apple M1 Pro
runtime: node v20.13.1 (arm64-darwin)

benchmark                      time (avg)             (min … max)       p75       p99      p999
----------------------------------------------------------------- -----------------------------
• checking any alphanumeric
----------------------------------------------------------------- -----------------------------
unicode-segmenter/general     236 ns/iter       (229 ns … 579 ns)    233 ns    304 ns    552 ns
XRegExp                       243 ns/iter       (239 ns … 319 ns)    242 ns    285 ns    317 ns
RegExp w/ unicode             236 ns/iter       (233 ns … 312 ns)    237 ns    263 ns    299 ns

summary for checking any alphanumeric
  unicode-segmenter/general
   1x faster than RegExp w/ unicode
   1.03x faster than XRegExp

• match all alphanumeric
----------------------------------------------------------------- -----------------------------
unicode-segmenter/general   1'883 ns/iter   (1'851 ns … 2'105 ns)  1'880 ns  2'027 ns  2'105 ns
XRegExp                     3'135 ns/iter   (3'109 ns … 3'300 ns)  3'137 ns  3'273 ns  3'300 ns
RegExp w/ unicode           1'540 ns/iter   (1'520 ns … 1'655 ns)  1'544 ns  1'643 ns  1'655 ns

summary for match all alphanumeric
  RegExp w/ unicode
   1.22x faster than unicode-segmenter/general
   2.04x faster than XRegExp

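unicode-segmenter/grapheme vs

  • graphemer
  • grapheme-splitter
  • unicode-segmentation (WASM build of the Rust library)
  • built-in Intl.Segmenter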

Package stats

Name                        Unicode®  ESM?  Size     Size (min)  Size (min+gzip)  Size (min+br)
unicode-segmenter/grapheme  15.1.0    ✔️    33,594   29,915      9,454            5,776
graphemer                   15.0.0    ✖️    410,424  95,104      15,752           10,660
grapheme-splitter           10.0.0    ✖️    122,241  23,680      7,852            4,841
unicode-segmentation*       15.0.0    ✔️    51,251   51,251      22,545           16,614
Intl.Segmenter*             -         -     0        0           0                0

  • unicode-segmentation size contains only the minimum WASM binary. It will be larger when more bindings are added.
  • Intl.Segmenter's Unicode data stays up to date as long as the runtime supports it.
  • Intl.Segmenter may not be available in some old browsers, edge runtimes, or embedded environments.

Runtime performance

unicode-segmenter/grapheme is 7~18x faster than other JS alternatives, 3~9x faster than the native Intl.Segmenter, and 1.5~3x faster than the WASM build of the Rust unicode-segmentation library.

The gap may increase depending on the environment. Bindings for browsers generally appear to perform worse. In most environments, unicode-segmenter/grapheme is over 6x faster than graphemer.

Details
cpu: Apple M1 Pro
runtime: node v20.13.1 (arm64-darwin)

benchmark                                        time (avg)             (min … max)       p75       p99      p999
----------------------------------------------------------------------------------- -----------------------------
• Lorem ipsum (ascii)
----------------------------------------------------------------------------------- -----------------------------
unicode-segmenter                             5'529 ns/iter     (5'166 ns … 188 µs)  5'334 ns  5'917 ns 50'584 ns
Intl.Segmenter                               50'420 ns/iter    (46'584 ns … 564 µs) 49'916 ns 57'667 ns    317 µs
graphemer                                    49'188 ns/iter    (46'833 ns … 395 µs) 47'959 ns 99'583 ns    221 µs
grapheme-splitter                            74'743 ns/iter    (73'292 ns … 229 µs) 73'917 ns 80'833 ns    192 µs
unicode-rs/unicode-segmentation (wasm-pack)  16'130 ns/iter    (15'625 ns … 292 µs) 16'000 ns 17'083 ns 86'042 ns

summary for Lorem ipsum (ascii)
  unicode-segmenter
   2.92x faster than unicode-rs/unicode-segmentation (wasm-pack)
   8.9x faster than graphemer
   9.12x faster than Intl.Segmenter
   13.52x faster than grapheme-splitter

• Emojis
----------------------------------------------------------------------------------- -----------------------------
unicode-segmenter                             1'693 ns/iter   (1'662 ns … 1'804 ns)  1'711 ns  1'787 ns  1'804 ns
Intl.Segmenter                               14'306 ns/iter  (12'209 ns … 1'277 µs) 13'500 ns 17'542 ns    776 µs
graphemer                                    13'101 ns/iter    (12'542 ns … 596 µs) 12'875 ns 15'417 ns    109 µs
grapheme-splitter                            27'259 ns/iter    (26'583 ns … 508 µs) 26'875 ns 31'500 ns 59'917 ns
unicode-rs/unicode-segmentation (wasm-pack)   5'590 ns/iter   (5'471 ns … 6'420 ns)  5'609 ns  6'361 ns  6'420 ns

summary for Emojis
  unicode-segmenter
   3.3x faster than unicode-rs/unicode-segmentation (wasm-pack)
   7.74x faster than graphemer
   8.45x faster than Intl.Segmenter
   16.1x faster than grapheme-splitter

• Demonic characters
----------------------------------------------------------------------------------- -----------------------------
unicode-segmenter                             1'742 ns/iter   (1'686 ns … 1'874 ns)  1'767 ns  1'867 ns  1'874 ns
Intl.Segmenter                                5'070 ns/iter   (3'498 ns … 9'027 ns)  8'009 ns  8'877 ns  9'027 ns
graphemer                                    27'235 ns/iter  (26'250 ns … 1'753 µs) 26'750 ns 29'416 ns    168 µs
grapheme-splitter                            19'988 ns/iter    (19'000 ns … 417 µs) 19'500 ns 24'000 ns    269 µs
unicode-rs/unicode-segmentation (wasm-pack)   2'518 ns/iter   (2'440 ns … 2'734 ns)  2'550 ns  2'641 ns  2'734 ns

summary for Demonic characters
  unicode-segmenter
   1.45x faster than unicode-rs/unicode-segmentation (wasm-pack)
   2.91x faster than Intl.Segmenter
   11.48x faster than grapheme-splitter
   15.64x faster than graphemer

• Tweet text (combined)
----------------------------------------------------------------------------------- -----------------------------
unicode-segmenter                             8'791 ns/iter     (8'125 ns … 268 µs)  8'750 ns  9'792 ns    122 µs
Intl.Segmenter                               72'326 ns/iter    (63'583 ns … 680 µs) 71'875 ns    111 µs    557 µs
graphemer                                    72'033 ns/iter    (66'791 ns … 357 µs) 71'875 ns    110 µs    300 µs
grapheme-splitter                               154 µs/iter       (147 µs … 501 µs)    157 µs    194 µs    480 µs
unicode-rs/unicode-segmentation (wasm-pack)  24'444 ns/iter    (23'666 ns … 355 µs) 24'083 ns 27'542 ns    179 µs

summary for Tweet text (combined)
  unicode-segmenter
   2.78x faster than unicode-rs/unicode-segmentation (wasm-pack)
   8.19x faster than graphemer
   8.23x faster than Intl.Segmenter
   17.55x faster than grapheme-splitter

• Code snippet (combined)
----------------------------------------------------------------------------------- -----------------------------
unicode-segmenter                            20'311 ns/iter    (19'541 ns … 220 µs) 19'958 ns 23'375 ns    129 µs
Intl.Segmenter                                  170 µs/iter     (151 µs … 1'633 µs)    161 µs    513 µs  1'282 µs
graphemer                                       163 µs/iter       (159 µs … 410 µs)    161 µs    286 µs    367 µs
grapheme-splitter                               352 µs/iter       (346 µs … 710 µs)    350 µs    463 µs    672 µs
unicode-rs/unicode-segmentation (wasm-pack)  57'798 ns/iter    (56'083 ns … 326 µs) 57'208 ns 63'625 ns    203 µs

summary for Code snippet (combined)
  unicode-segmenter
   2.85x faster than unicode-rs/unicode-segmentation (wasm-pack)
   8.04x faster than graphemer
   8.37x faster than Intl.Segmenter
   17.34x faster than grapheme-splitter

LICENSE

MIT

Note

The initial implementation was ported manually from Rust's unicode-segmentation library, which is licensed under the MIT license.