What developers should know about Unicode and character sets in 2013
From the-pastry-box-project.net: "Never assume that the data you’re dealing with is UTF-8 — ASCII appears identical unless you view the hex to see if each character is taking one byte (ASCII) or three (UTF-8)."
Um, what? This is just wrong. ASCII-equivalent characters only take one byte in UTF-8. Other characters may take two, three, or four bytes.
If the author actually viewed ASCII text that, once in UTF-8, took three bytes per character... I don't know what they were looking at, but it wasn't UTF-8.
Also, if the data is ASCII, and includes only legal 7-bit ASCII characters -- it is simultaneously ALSO valid and legal UTF-8. UTF-8 is a superset of ASCII.
I'm not sure this guy understands what he's talking about.
The concluding statement is a bit weird: "ASCII appears identical unless you view the hex to see if each character is taking one byte (ASCII) or three (UTF-8)"
That isn't accurate: ASCII text would appear identical even if you 'view the hex', because it is byte-for-byte identical in UTF-8; that's the whole point of UTF-8. You'd have to look at non-ASCII characters to see how they're encoded.
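A quick Python sketch of what a hex view actually shows (snippet is mine, just illustrating the byte layouts):

    text = "hello"
    # Pure ASCII text produces identical bytes under both encodings.
    assert text.encode("ascii") == text.encode("utf-8")
    print(text.encode("utf-8").hex())   # 68656c6c6f  (one byte per character)

    # Only non-ASCII characters reveal the difference in a hex dump.
    print("é".encode("utf-8").hex())    # c3a9      (2 bytes)
    print("🎉".encode("utf-8").hex())   # f09f8e89  (4 bytes)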
Notepad also doesn't save as ASCII by default but as »ANSI«, the default legacy code page configured for your Windows installation.
Yes, the default Windows code page -- many pieces of software don't realize that registry keys, file paths, etc. are all encoded in a different code page if you are running, for example, Japanese Windows. (Also, it isn't exactly Shift-JIS...)
Yeah, I think he might have meant ISO-8859-1 or Windows-1252 rather than ASCII; but still, all of those characters would take up two bytes in UTF-8, not three, unless you used combining diacritics rather than precomposed forms.
All those characters? You mean except the straight ASCII-compatible ones, which will just take up one byte.
Yes, I meant all of the characters outside of the ASCII range. As in, there are no characters in ISO-8859-1 which take up more than 2 bytes in UTF-8. I guess there are a few in Windows-1252 which take up more than two bytes (like the Euro sign), so it's possible he meant Windows-1252 rather than ASCII.
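Easy to verify in Python, for what it's worth:

    # Every non-ASCII character in ISO-8859-1 is exactly 2 bytes in UTF-8...
    for ch in "éüñÿ":
        assert len(ch.encode("utf-8")) == 2

    # ...while the Euro sign (in Windows-1252 but not ISO-8859-1) takes 3.
    assert len("€".encode("utf-8")) == 3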
Some background not covered in an otherwise pretty good article:
"In general, don’t save a Byte Order Mark (BOM) — it’s not needed for UTF-8, and historically could cause problems."
This attitude comes from the agony of processing UTF-16 files. I interface with a group that finds it hilarious to send me textual data in UTF-16 format, and the first hard-won lesson you learn with UTF-16 is that, superficially, a randomly guessed byte order should be correct 50% of the time, yet somehow it's always wrong.

So say you read one line of a UTF-16 text file and process it after passing it through a UTF-16 decoder. OK, no problemo: it had a BOM as its first glyph/byte/character/whatever and was converted and interpreted correctly. Then you read another line, just like you'd read and process a line of ASCII or UTF-8. However, they only give me a BOM at the start of the file, not at the start of each line, so invariably I translate it to garbage because the bytes are swapped.
Now, there are programmatic methods to analyze the BOM and memorize it. Or you can read the whole blasted multi-gig file into memory at once, de-UTF-16 it all at once, and then go through it line by line. But fundamentally it's a simple one-liner sysadmin-type job to just shove the file through a UTF-16-to-UTF-8 translator before it hits my processing system. I already had to decrypt it, unzip it, and verify its hash so I know they sent the whole file to me (and correctly), so adding a conversion stage is no big deal.
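For what it's worth, most languages can do that conversion statefully without slurping the whole file; a minimal Python sketch (file names are made up):

    import io

    # A stateful decoder consumes the BOM once, remembers the byte order,
    # and then every subsequent line decodes correctly, streaming.
    with io.open("feed.utf16.txt", encoding="utf-16") as src, \
         io.open("feed.utf8.txt", "w", encoding="utf-8") as dst:
        for line in src:    # line by line; no multi-gig buffer needed
            dst.write(line)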
And this kind of UTF-16 experience is what leads people to say things like "oh, it's Unicode? That means I should squirt out BOMs as often as possible", even though that technically only applies to UTF-16 and is not helpful for UTF-8.
I hate to be "that SEO guy", but the OP needs to do some SEO. The submitted title here is nowhere to be seen, which is too bad because it's a great title and one that I would try to Google after forgetting to bookmark this page.
Luckily I do use Pinboard, which auto-grabs the title when one exists. But this is a helpful reference for many devs who don't read HN, and it's all but obscured.
Oh, one more fun fact: some emoji occupy more than one Unicode _code point_, and can be encoded in different ways depending on the device that uses them. (Before they were introduced into Unicode, they used private-use, platform-specific character codes.)
Debugging a text input field where user can enter emoji & RTL text is FUN.
Are there really multi-character emoji? Or is it that they are single characters on an astral plane which are encoded as two code units in UTF-16, and therefore behave rather like two characters if your language uses 16-bit chars?
Several code points, yes. And those code points, in turn, can be represented as high and low surrogate pairs in UTF-16.
http://apps.timwhitlock.info/emoji/tables/unicode
Look for flags and numbers. Here's the German flag in UTF-8: \xF0\x9F\x87\xA9\xF0\x9F\x87\xAA: 8 bytes, 2 Unicode code points, 4 UTF-16 code units.
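Easy to confirm in a Python 3 shell:

    flag = "\U0001F1E9\U0001F1EA"  # REGIONAL INDICATOR SYMBOLS D + E -> 🇩🇪
    print(len(flag))                           # 2  code points
    print(len(flag.encode("utf-8")))           # 8  bytes in UTF-8
    print(len(flag.encode("utf-16-le")) // 2)  # 4  UTF-16 code units
                                               #    (two surrogate pairs)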
This is not as strange as it might look at first glance.
A lot of ordinary characters can be represented as two (or more) Unicode code points - for instance an unaccented Latin letter and a combining accent.
Flags emoji seem more like a hack on the side of the font or text renderer. If you look at the Unicode representation it actually spells out the ISO country code. Some fonts probably define a ligature containing these two characters that looks like a flag instead of two separate Latin characters.
Representation of digits inside keycaps also makes sense to me: it's a normal digit eight (dating back to ASCII) plus a combining character that looks like a keycap.
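Both cases are easy to poke at in Python (note that emoji renderings often insert a variation selector, U+FE0F, between digit and keycap; this is the bare classic form):

    import unicodedata

    # "é" as one precomposed code point vs. a letter plus a combining accent
    precomposed = "\u00E9"   # é
    decomposed  = "e\u0301"  # 'e' + COMBINING ACUTE ACCENT
    assert unicodedata.normalize("NFC", decomposed) == precomposed

    # The keycap case: an ordinary ASCII digit plus a combining character
    keycap = "8\u20E3"
    print([unicodedata.name(c) for c in keycap])
    # ['DIGIT EIGHT', 'COMBINING ENCLOSING KEYCAP']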
Are these handled in the font as ligatures?
In what UI framework? When I worked on that, I decided to render them from a different texture that doesn't depend on the current font, but scales to its size.
Site appears to be down; Google cache: http://webcache.googleusercontent.com/search?q=cache:A8oNdl-...
Note that some browsers do use the <meta charset="UTF-8"> even if the Content-Type header already sent a charset.
Another thing to add: always open a database connection with an explicitly chosen charset. And if you are a PHP user (like I am): there are still functions that don't support multibyte strings, so be careful.
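The same advice holds outside PHP; here's a sketch of the idea in Python with the PyMySQL driver (credentials and database names are made up):

    import pymysql

    # Ask for UTF-8 explicitly at connection time instead of trusting the
    # server default; utf8mb4 rather than MySQL's 3-byte "utf8" so
    # astral-plane characters (emoji!) survive the round trip.
    conn = pymysql.connect(
        host="localhost",
        user="app",            # hypothetical credentials
        password="secret",
        database="mydb",
        charset="utf8mb4",
    )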
This is the biggest thing currently driving me to muster the effort to move off of PHP. Also, I had no end of trouble working with filenames that contained UTF-8 characters in PHP, and had to give up in the end.
> While there are a ton of encodings you could use, for the web use UTF-8. You want to use UTF-8 for your entire stack. So how do we get that?
You should use your language's internal Unicode representation, and decode from / encode to UTF-8 at the I/O boundaries.
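In Python 3 terms the pattern looks like this (file names made up):

    # Decode at the input boundary: bytes -> str (the internal Unicode type)
    with open("in.txt", "rb") as f:
        text = f.read().decode("utf-8")

    # ...all processing happens on str, never on raw bytes...
    text = text.upper()

    # Encode at the output boundary: str -> UTF-8 bytes
    with open("out.txt", "wb") as f:
        f.write(text.encode("utf-8"))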