Utext: Rich Unicode Documents

26 points by RomanHauksson 2 years ago · 10 comments

Blind people reliant on screenreaders often complain about people using unusual Unicode characters to spice up their writing, since the screenreader either can’t make sense of those characters or tries to pronounce them as they would be read in the language they were intended for. So, the approach toyed with here would be a veritable nightmare for those members of the community.

gwern 2 years ago

Unicode characters useful for typography don't come along that often, and many of the relevant characters here date back decades. So if screenreaders can't reliably ignore a character like SOFT HYPHEN which is older than many of the readers here, that's ultimately their fault.
In any case, because a Utext starts & ends as a plain text file, there is no reason you have to throw away the source text and serve readers only the compiled fancy Unicode text version; they can live happily in the same file. You can (and should, for reasons I outline) serve both the original 'source' and the 'compiled' version, and you can be clever about using whitespace or control characters to signal inband which is which, allowing the user to choose with a trivial grep: https://gwern.net/utext#utext-format And so in practice I think it would be superior in accessibility to many things.
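
A minimal sketch of the "signal inband which is which" idea, in Python rather than grep. The marker used here (a leading tab on every source line) is purely a hypothetical convention chosen for illustration, as are the names `SOURCE_MARKER` and `split_utext`; the actual proposal is described at the linked page and may differ.

```python
# Hypothetical convention, NOT the real Utext format: every line of the
# original source starts with a tab; compiled (fancy Unicode) lines do not.
SOURCE_MARKER = "\t"


def split_utext(document: str) -> tuple[str, str]:
    """Split a combined file into (source_text, compiled_text)."""
    source_lines = []
    compiled_lines = []
    # split("\n") is used instead of splitlines() so that control characters
    # inside a line are never treated as line breaks.
    for line in document.split("\n"):
        if line.startswith(SOURCE_MARKER):
            source_lines.append(line[len(SOURCE_MARKER):])
        else:
            compiled_lines.append(line)
    return "\n".join(source_lines), "\n".join(compiled_lines)


# One compiled line followed by its (assumed) tab-marked source line.
example = "x\u00b2 + y\u00b2\n\tx^2^ + y^2^"
source, compiled = split_utext(example)
```

Under that assumed convention, a screenreader user (or a copy-paste tool) could keep only `source` and never touch the compiled characters.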

Y_Y 2 years ago

Sounds like a problem with the screen reader. If they are supposed to replace the part where the user interprets a glyph on the screen, then they should act like a human would and interpret something that looks like a semicolon as a semicolon, even if it's a Greek question mark (except at the end of a Greek question).
(Do any of them have an OCR layer? The context sensitivity might be more challenging, but probably can be specialised to common cases or LLM-magicked away.)
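
For the Greek question mark specifically, no OCR is needed: Unicode defines U+037E as canonically equivalent to the ordinary semicolon, so standard normalization already folds one into the other. A rough sketch using Python's `unicodedata` module follows; the function name `normalize_for_speech` is made up for this example, and nothing here is claimed to be what any real screen reader does.

```python
import unicodedata


def normalize_for_speech(text: str) -> str:
    """Fold compatibility and canonical variants into their plain forms.

    NFKC maps U+037E GREEK QUESTION MARK to ';' and maps "styled" letters
    such as MATHEMATICAL BOLD CAPITAL H to plain 'H'.
    """
    return unicodedata.normalize("NFKC", text)


greek_question_mark = "\u037e"
bold_hello = "\U0001d407\U0001d41e\U0001d425\U0001d425\U0001d428"  # bold "Hello"

folded_mark = normalize_for_speech(greek_question_mark)
folded_hello = normalize_for_speech(bold_hello)
```

Two limits are worth noting. Normalization is context-free, so it cannot honor the "end of a Greek question" exception. It also only covers characters that Unicode itself declares equivalent; a letter from an unrelated script that merely looks like a Latin one is left unchanged.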
- OfSanguineFire 2 years ago
  
  The problem is that the use of unusual Unicode characters is driven by memes: there are constantly new ones, and it would be hard for any screenreader to keep up with them. And there are plenty of imaginative uses of letters from Indian scripts, from hieroglyphics, etc. where the screenreader can’t be expected to recognize what is intended as easily as a human being would.
- benj111 2 years ago
  
  It isn't (just) a problem of the screen reader.
  This is hacking Unicode to do things that Unicode isn't supposed to try to do.
  <List> <Item> element ....
  Tells me that this is a list of things.
  Reusing a Unicode thing that just happens to look like a dot doesn't give context to the screen reader. It doesn't see a fancy bullet point; it sees U184638 'libyan double sigilled C' or whatever.
  You're then relying on the hearer to know that U184638 looks like a fancy bullet point.
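
To make the point above concrete, here is a small sketch of what a program has to go on when it meets a decorative character: the code point and its formal Unicode name, not its appearance. The helper `spoken_names` is invented for this illustration, and reading out Unicode names is only an assumption about how a reader might fall back; real screen readers use their own pronunciation tables.

```python
import unicodedata


def spoken_names(text: str) -> list[str]:
    """Return the formal Unicode name of each character in `text`.

    Characters without a name fall back to their code point, e.g. 'U+E000'.
    """
    return [unicodedata.name(ch, f"U+{ord(ch):04X}") for ch in text]


plain_bullet = spoken_names("\u2022")   # the ordinary bullet character
fancy_bullet = spoken_names("\u25c9")   # a circle-in-circle glyph used as a bullet
```

The first call yields the name `BULLET`, which at least hints at a list. The second yields `FISHEYE`, which says nothing about list structure, and neither carries the meaning that list markup would.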

BlueTemplar 2 years ago

Sounds like a nightmare when trying to copy-paste text?

gwern 2 years ago

My suggested format would be to include the original Markdown-esque source code in any generated Utext: https://news.ycombinator.com/item?id=38104034 So if copy-paste of the fancy Unicode version doesn't work well (just as it usually doesn't work well for the most common document file formats like HTML, PDF, images, SVG, Flash, JS...), you can copy-paste the original text from the source code.

rurban 2 years ago

I can only counter with https://en.wikipedia.org/wiki/Ornament_and_Crime

gwern 2 years ago

Considering that buildings inspired by Loos make me want to barf, I'll take that as a compliment.
