Unicode 15 released

[Posted September 14, 2022 by corbet]

Version 15 of the Unicode standard has been released.

This version adds 4,489 characters, bringing the total to 149,186 characters. These additions include two new scripts, for a total of 161 scripts, along with 20 new emoji characters, and 4,193 CJK (Chinese, Japanese, and Korean) ideographs.


Unicode 15 released

Posted Sep 14, 2022 18:16 UTC (Wed) by SLi (subscriber, #53131) [Link] (3 responses)

Why do we get to use the "bottom left part of glyph is damaged" modifier only for hieroglyphs? :(

Unicode 15 released

Posted Sep 14, 2022 21:03 UTC (Wed) by flussence (guest, #85566) [Link] (1 responses)

It makes sense given the percentage of source material that needs it, but I agree. Plenty of stuff written on paper could've used that too!

Unicode 15 released

Posted Sep 20, 2022 5:44 UTC (Tue) by willy (subscriber, #9762) [Link]

Or vellum (I may have just done the tourist thing in Dublin and been to see the Book of Kells)

Unicode 15 released

Posted Sep 15, 2022 9:10 UTC (Thu) by n8willis (subscriber, #43041) [Link]

Unicode 15 released

Posted Sep 14, 2022 19:05 UTC (Wed) by alspnost (guest, #2763) [Link] (1 responses)

Finally, a WiFi emoji!

Unicode 15 released

Posted Sep 14, 2022 21:15 UTC (Wed) by Sesse (subscriber, #53779) [Link]

And one for HONK

Unicode 15 released

Posted Sep 15, 2022 5:02 UTC (Thu) by suckfish (guest, #69919) [Link] (33 responses)

Unicode 15 released

Posted Sep 15, 2022 7:36 UTC (Thu) by dh (subscriber, #153) [Link] (14 responses)

Unicode has some 1,100,000 possible code points. With 150,000 assigned and 5,000 new ones per year, we're safe for another 190 years. So while this might lead to a year-2212 problem, I'd say it's a bit early to take action.
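
The estimate above can be checked with a quick sketch. It assumes a constant rate of new assignments, which is the comment's own simplification, and the function name `years_until_full` is made up for illustration.

```python
def years_until_full(capacity: int, assigned: int, per_year: int) -> int:
    """Whole years until `assigned` reaches `capacity` at a constant rate."""
    return (capacity - assigned) // per_year


# Round figures from the comment above.
years = years_until_full(1_100_000, 150_000, 5_000)
print(years, 2022 + years)  # 190 2212
```

With the exact figures instead (a code space of 0x110000 = 1,114,112 code points, 149,186 assigned, and this release's 4,489 additions taken as the yearly rate), the same calculation gives 214 years, so the conclusion does not change.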

Unicode 15 released

Posted Sep 15, 2022 14:20 UTC (Thu) by zwol (guest, #126152) [Link] (13 responses)

Unicode 15 released

Posted Sep 15, 2022 17:40 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (12 responses)

First we would have to persuade Microsoft and Oracle that UTF-16 was a bad idea. For backcompat reasons, that's basically never going to happen (i.e. Windows and Java still use UTF-16 extensively and seemingly have no plans to remove or deprecate it).

Unicode 15 released

Posted Sep 15, 2022 18:20 UTC (Thu) by devslashilly (guest, #124291) [Link] (11 responses)

Good news: since Java 18 the default charset has been UTF-8 (https://openjdk.org/jeps/400). Now we need to wait the 10 years for people to update their JVMs.

Unicode 15 released

Posted Sep 15, 2022 23:10 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (10 responses)

Unicode 15 released

Posted Sep 16, 2022 0:34 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

Unicode 15 released

Posted Sep 16, 2022 18:16 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (5 responses)

Unicode 15 released

Posted Sep 16, 2022 20:11 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

> FindFirstFileW/FindNextFileW and pass it directly to CreateFileW

I remember reading that Windows is starting to enforce at least some sanity in CreateFileW and doesn't allow some of the more malformed names. And that's discounting gotchas like "aux.txt".

Unicode 15 released

Posted Sep 16, 2022 20:17 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (3 responses)

Unicode 15 released

Posted Sep 16, 2022 20:33 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

Unicode 15 released

Posted Sep 16, 2022 21:35 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (1 responses)

Yeah, but those are all really unlikely in practice, so nobody's going to bother checking for them anyway. OTOH, "The user's name contains a character that is not in the current code page, and so everything under C:\Users\<name> is inaccessible and/or requires the use of a tilde name hack that will look ugly in your UI" is a much bigger problem... but if you're all-in on the -W functions, you probably assume(d) that you were safe from that.

Unicode 15 released

Posted Nov 8, 2022 2:36 UTC (Tue) by vtjnash (subscriber, #141755) [Link]

They could use the WTF-8 encoding instead. It is a superset of UTF-8 that also supports round-tripping malformed UTF-16. (The reverse is not fully true, since concatenating two such WTF-8 strings can yield a different result when the combination converts to a well-formed UTF-16 string.)
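
Both the round trip and the concatenation caveat can be sketched in Python. This is only an approximation: Python's standard library has no WTF-8 codec, so the sketch leans on the `surrogatepass` error handler, which encodes a lone surrogate as the same three bytes WTF-8 would. The assumption to keep in mind is that `surrogatepass` is more permissive than WTF-8 proper (it also accepts a surrogate pair written as two three-byte sequences), and the helper names are made up for the example.

```python
def encode_lenient(s: str) -> bytes:
    # Hypothetical helper: UTF-8 that lets unpaired surrogates through.
    return s.encode("utf-8", "surrogatepass")


def decode_lenient(b: bytes) -> str:
    # Hypothetical helper: inverse of encode_lenient.
    return b.decode("utf-8", "surrogatepass")


# Two unpaired surrogates, as might appear in malformed UTF-16 filenames.
high = "\ud83d"
low = "\ude00"

# Each one survives the round trip on its own.
print(decode_lenient(encode_lenient(high)) == high)

# Concatenating the two encoded strings gives six bytes...
joined = encode_lenient(high) + encode_lenient(low)

# ...but in UTF-16 the two halves now form a valid pair: one character.
utf16 = decode_lenient(joined).encode("utf-16-le", "surrogatepass")
char = utf16.decode("utf-16-le")

# Encoding that character directly gives four different bytes.
canonical = char.encode("utf-8")
print(joined.hex(), canonical.hex())
```

The two byte strings on the last line differ even though they convert to the same UTF-16, which is why byte-level concatenation is not safe without re-normalizing the seam.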

Unicode 15 released

Posted Sep 18, 2022 6:44 UTC (Sun) by jond (subscriber, #37669) [Link] (2 responses)

Unicode 15 released

Posted Sep 18, 2022 8:50 UTC (Sun) by ABCD (subscriber, #53650) [Link]

Unicode 15 released

Posted Sep 18, 2022 8:55 UTC (Sun) by dtlin (subscriber, #36537) [Link]

char being a 16-bit value is hard-baked into the JVM, and thus anything that uses a char[] is inherently operating on UTF-16.

Java 9 did add the -XX:+CompactStrings option (JEP 254), which changed the internal representation of String from char[] to byte[], along with a bit determining whether that representation is Latin-1 or UTF-16, with the former taking up half the space. But there was no change to the user-visible API; it is only an implementation detail.

(Java 9 did add String#codePoints() returning an IntStream of code points, but it's unrelated and you could have implemented that yourself with codePointAt()+offsetByCodePoints() anyway; it's just more convenient.)
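
The difference between UTF-16 code units (what a Java char holds) and code points can be illustrated outside Java as well. The sketch below is Python, not the JVM's implementation: it counts code units by encoding to UTF-16, then compares the byte cost of Latin-1 and UTF-16 for a string that fits in Latin-1, which is the saving the compact representation is after. The helper name `utf16_units` is made up for this example.

```python
def utf16_units(s: str) -> int:
    """Number of 16-bit code units needed to hold `s` in UTF-16."""
    return len(s.encode("utf-16-le")) // 2


text = "a\U0001F600"  # 'a' plus one character outside the BMP

print(len(text))          # code points: 2
print(utf16_units(text))  # code units: 3, the second character needs a surrogate pair

latin = "caf\u00e9"
print(len(latin.encode("latin-1")))    # 4 bytes, one per character
print(len(latin.encode("utf-16-le")))  # 8 bytes, two per character
```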

Unicode 15 released

Posted Sep 15, 2022 11:08 UTC (Thu) by grawity (subscriber, #80596) [Link] (4 responses)

Unicode 15 released

Posted Sep 16, 2022 1:21 UTC (Fri) by scientes (guest, #83068) [Link] (3 responses)

Unicode 15 released

Posted Sep 16, 2022 10:16 UTC (Fri) by grawity (subscriber, #80596) [Link] (2 responses)

Unicode 15 released

Posted Sep 16, 2022 12:54 UTC (Fri) by excors (subscriber, #95769) [Link] (1 responses)

Unicode 15 released

Posted Sep 19, 2022 4:37 UTC (Mon) by grawity (subscriber, #80596) [Link]

Unicode 15 released

Posted Sep 15, 2022 19:45 UTC (Thu) by plugwash (subscriber, #29694) [Link] (3 responses)

Unicode 15 released

Posted Sep 15, 2022 21:59 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

Unicode 15 released

Posted Sep 15, 2022 22:50 UTC (Thu) by khim (subscriber, #9252) [Link]

> Even Windows is supporting true UTF-8 APIs these days.

Externally but not internally.

I guess the first step would be to encourage moving programs to UTF-8.

Because that's the most important step, anyway.

Unicode 15 released

Posted Sep 20, 2022 4:03 UTC (Tue) by plugwash (subscriber, #29694) [Link]

Unicode 15 released

Posted Sep 16, 2022 17:35 UTC (Fri) by k8to (guest, #15413) [Link] (8 responses)

Unicode 15 released

Posted Sep 16, 2022 17:55 UTC (Fri) by sfeam (subscriber, #2841) [Link] (7 responses)

"Somehow I imagine we're not actually creating many new scripts and characters".

You underestimate the allure of cute emojis.

Unicode 15 released

Posted Sep 18, 2022 8:34 UTC (Sun) by Sesse (subscriber, #53779) [Link] (6 responses)

FWIW, the position of the Unicode Consortium is that emoji are a temporary solution that should be replaced by arbitrary “stickers” in the long run (and thus, presumably move out of their realm, as they do not define messaging protocols or file formats—stickers would live outside the concept of character sets).

Unicode 15 released

Posted Sep 18, 2022 19:29 UTC (Sun) by NYKevin (subscriber, #129325) [Link] (5 responses)

Unicode 15 released

Posted Sep 18, 2022 19:36 UTC (Sun) by Sesse (subscriber, #53779) [Link] (4 responses)

“Temporary solution” does not mean “needs to end right now”, though.

Unicode 15 released

Posted Sep 18, 2022 19:37 UTC (Sun) by Sesse (subscriber, #53779) [Link] (1 responses)

Also, there _is_ a deprecation process, it just doesn't end in removal: “The Unicode Standard may deprecate the character (that is, formally discourage its use), but it will not reallocate, remove, or reassign the character.”

Unicode 15 released

Posted Sep 19, 2022 19:16 UTC (Mon) by NYKevin (subscriber, #129325) [Link]

That's not a process. It's a one-off announcement. You write "this (range of) code point(s) (is/are) deprecated" on some formal piece of paper and call it a day. Everything else is the implementation's problem, not the standard's problem. So I stand by my claim that this would not be difficult for the Consortium to do immediately.

Unicode 15 released

Posted Sep 19, 2022 18:37 UTC (Mon) by plugwash (subscriber, #29694) [Link] (1 responses)

Unicode 15 released

Posted Sep 19, 2022 19:14 UTC (Mon) by NYKevin (subscriber, #129325) [Link]

There is nothing more permanent than a temporary solution.