Accurate text lengths with `Intl.Segmenter` API


Guess what the following lines of JS would evaluate to (no cheating with the console!):

"hello".length

"🌝".length

"🇮🇳".length

"👨‍👩‍👧‍👦".length

"दरबार".length

It’s okay, take some time.

Done?

Alright, time for the grand reveal 🥁:

"hello".length // 5

"🌝".length    // 2

"🇮🇳".length    // 4

"👨‍👩‍👧‍👦".length    // 11

"दरबार".length  // 5

I’m taking a wild guess that most of you reading this only got the first one correct, since the rest of those numbers seem completely bizarre!

What the Unicode?! #

I think most of us would agree that 🌝, 🇮🇳 and 👨‍👩‍👧‍👦 should’ve yielded a length of 1, but wowza, why does that length keep increasing even though we keep seeing just one character? How does the family emoji 👨‍👩‍👧‍👦 end up with a length of 11? Surely even 4 would make more logical sense here, seeing four people inside the emoji, wouldn’t it? Or I guess 11 isn’t enough if you’re counting by Dom’s definition of family. The count of दरबार is also wrong for my Hindi-speaking friends: it should be 4, since बा is its own thang.

So, what the heck is going on? Enter the weird wide world of Unicode!

You see, the string values you create in JavaScript and most other modern programming languages use some sort of Unicode encoding, which is a set of pre-written rules on how to represent text. In the beginning, it was just 1 byte per character, but Americans quickly realized the rest of the world also wanted to use computers in their own damn language. Thus, Unicode was born.

This HTML page that got delivered to your web browser is encoded in UTF-8, which is the most popular encoding. JavaScript, being JavaScript, uses UTF-16, which works in 2-byte chunks, or code units in technical speak. 2 bytes = 2 × 8 bits = 16 bits. That should be good for about 2¹⁶ or 65,536 characters but alas, even that’s not nearly enough to represent every glyph from across known human languages, all the funny symbols and everyone’s favourite emojis. So, we have rules that state how to combine multiple, special 2-byte code units in order to produce a single human-readable character.

We can see how this looks internally in the case of 🌝 by typing the following into the browser console:

Array.from({ length: "🌝".length }, (_, i) => "🌝"[i])
// [ '\ud83c', '\udf1d' ]

Those two thingies in the array there are the UTF-16 code units that combine to make an emoji that makes sense to us. Thing is, when we’re writing code and printing things out on the web using our fancy frameworks, we usually never have to bother with these internal details because our editors, OS and browsers abstract all of this complexity away and just present us with pleasing, readable text.
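
If you want to poke at the individual pieces yourself, charCodeAt works on code units while codePointAt (and spreading the string) works on whole code points. A quick sketch you can paste into the console:

"🌝".charCodeAt(0).toString(16)  // 'd83c'  (first code unit)
"🌝".charCodeAt(1).toString(16)  // 'df1d'  (second code unit)
"🌝".codePointAt(0).toString(16) // '1f31d' (the single code point behind the pair)
[..."🌝"]                        // [ '🌝' ] (spreading iterates by code point, not code unit)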

But this becomes a problem when we as developers need to validate text length in a form input, say for a user’s name. If users type characters that take up multiple UTF-16 code units, things will go wrong real fast if you’re restricting the length of a name while thinking only in terms of English. Another example that comes to mind is micro-blogging platforms like Bluesky that intentionally restrict skeets to a certain character limit. If the underlying implementation of the composer uses String#length, some users will be left angry even though they typed far fewer characters than the limit to shout out the thing they want to tell the world, be it in their native language or in emoji-speak.
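
To make that concrete, here’s a made-up sketch of a naive max-length check (the limit and input are hypothetical) rejecting a value that a human would consider well within bounds:

const MAX_LENGTH = 10;     // hypothetical limit
const name = "👨‍👩‍👧‍👦 family"; // a human sees 8 characters here
name.length <= MAX_LENGTH  // false, because .length counts 18 UTF-16 code units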

Aside: If you’re targeting an international audience for your app, avoid restricting names to short lengths and avoid requiring a minimum length for first/last names. There are people out there without either of those, with one-letter name parts, or with name lengths that go beyond what you’re probably used to.

Swift gets this right #

I thought this weirdness was inherent in almost all languages out there, until I came across Swift. Swift’s String struct exposes a count property (instead of length) and it returns the right answer: a number that almost always matches the number of characters we can perceive with our senses.

"hello".count // 5

"🌝".count    // 1

"🇮🇳".count    // 1

"👨‍👩‍👧‍👦".count    // 1

"दरबार".count  // 4

Beautiful, isn’t it? I think we all can agree this makes a whole lot more sense and is what we’d expect in most circumstances.

Swift gets this right because the String#count property is doing a lot more work than just returning the fixed length of the array or buffer that’s holding the characters in memory. When you reference String#count, it computes the actual, human-perceptible length by iterating over the internal representation and combining things where it makes sense.

In Unicode speak, this combination that makes sense to the rest of us humans is the extended grapheme cluster. Here’s how the Swift docs define it:

Every instance of Swift’s Character type represents a single extended grapheme cluster. An extended grapheme cluster is a sequence of one or more Unicode scalars that (when combined) produce a single human-readable character.

This counting of graphemes doesn’t come for free though. Unlike a cheap, fixed-size array whose length can be computed in $O(1)$ time, String#count takes $O(n)$ time since we need to iterate through each and every byte. But if you ask me, it’s worth it.

Can JavaScript haz this? #

Turns out, yes we can! We can get this count without importing any third-party dependency — thanks to the Intl.Segmenter API. It has great support across all modern browsers as per caniuse.com, unless you’re still stuck needing to support IE 11 or older operating systems (I sympathize). Also considered Baseline Newly Available since April 2024, if that’s something you care about.

This API allows us to split text according to the rules of a locale and a granularity, such as "word", "sentence" and, last but not least, "grapheme". If you’re with me so far, you might’ve reckoned that this last one is exactly what we’re looking for. It’s also the default option, but we’re going to type it out anyway since I think it makes the intent clearer to the reader. Once we construct a Segmenter object, we can call its .segment(text) method to get an iterable that splits the text into human-readable character representations. All that’s left is converting this iterable into an array and getting its .length.
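
Before building a helper, it helps to see what that iterable actually yields: each item is an object carrying the segment text and the index where it starts. Roughly, for दरबार:

[...new Intl.Segmenter("en", { granularity: "grapheme" }).segment("दरबार")]
// [
//   { segment: 'द', index: 0, input: 'दरबार' },
//   { segment: 'र', index: 1, input: 'दरबार' },
//   { segment: 'बा', index: 2, input: 'दरबार' },
//   { segment: 'र', index: 4, input: 'दरबार' }
// ]

Notice how बा spans two code units (indexes 2 and 3), which is why the final र starts at index 4.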

Let’s make a little helper to get the answer we all deserve:

function realLength(text) {
	return Array.from(
		new Intl.Segmenter(
			"en",
			{ granularity: "grapheme" }
		).segment(text)
	).length;
}

The "en" locale parameter doesn’t need changing as far as I can tell when we use this with a granularity of "grapheme", but it does matter for other granularities.
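
Speaking of other granularities, here’s roughly what "word" gives you. Word segments carry an extra isWordLike flag that lets you filter out spaces and punctuation (exact splits may vary slightly between engines):

[...new Intl.Segmenter("en", { granularity: "word" }).segment("Hello, world! 🌝")]
	.filter((s) => s.isWordLike)
	.map((s) => s.segment)
// [ 'Hello', 'world' ]

But back to graphemes and our realLength helper.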

Moment of truth:

realLength("hello") // 5

realLength("🌝")    // 1

realLength("🇮🇳")    // 1

realLength("👨‍👩‍👧‍👦")    // 1

realLength("दरबार")  // 4

Feels great, doesn’t it? And this makes way more sense than whatever weird numbers were being spit out by String#length before.

Gotchas #

While this is close to being perfect, there are still some cases where both the Swift and JavaScript utilities don’t produce the human-understandable length we’re looking for, even though they both abide perfectly by the rules for parsing extended grapheme clusters. Here’s one such example, with control characters in the string:

realLength("text\u0001\u0002") // 6 instead of 4
"text\u{0001}\u{0002}".count   // 6 instead of 4

But the right answer here is probably to strip these non-printable characters before persisting the text anyway.
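
If you do go the sanitizing route, a Unicode property escape is one way to do it. A sketch (note that \p{Cc} also matches tabs and newlines, so adjust the pattern to taste):

const sanitized = "text\u0001\u0002".replace(/\p{Cc}/gu, ""); // strip control characters
realLength(sanitized) // 4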

Also keep in mind that counting this way is at least as expensive as Swift’s String#count property, so you shouldn’t rush to replace every use of String#length across your codebase with this utility. But do keep the things discussed here in mind whenever you come across a situation where you need to restrict or validate text length, and evaluate whether you need this utility or whether a simple String#length would suffice.
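
One cheap trick if all you need is a maximum-length check: a grapheme cluster always occupies at least one code unit, so String#length can never be smaller than the grapheme count. That makes it a safe fast path, with the segmenter only kicking in when the cheap check fails. A sketch, using a made-up fitsWithin helper built on realLength from above:

function fitsWithin(text, maxGraphemes) {
	// .length is always >= the grapheme count, so this fast path can't wrongly accept text
	if (text.length <= maxGraphemes) return true;
	return realLength(text) <= maxGraphemes;
}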

MDN has a great page on JavaScript strings, UTF-16 and grapheme clusters if you’re feeling curious about Unicode.