Settings

Theme

Show HN: High-speed UTF-8 validation in Rust

github.com

127 points by kryps 5 years ago · 47 comments

Reader

tialaramex 5 years ago

One flavour I would expect to be valuable that isn't present here is this:

Process the input (as quickly as possible) and never fail, but replace each invalid sequence of bytes with U+FFFD (bytes 0xEF 0xBF 0xBD).

  • chrismorgan 5 years ago

    One major problem with this: you can’t do it in place. Invalid byte sequences could be 1–4 bytes long, but U+FFFD is exactly three bytes long.

    • AjtevhihBudhirv 5 years ago

      There're still faster approaches than naïve and probably common "copy valid byte sequences one by one into a resizable result buffer". For instance, scan through the input bytes all at once, keeping track of position and length of valid sequences, then memcopy each valid sequence into a preallocated buffer.

      Edit: Although, it looks like Rust's std already does this, except for preallocating an exactly correct size result buffer: https://doc.rust-lang.org/src/alloc/string.rs.html#538

      • duckerude 5 years ago

        > preallocating an exactly correct size result buffer

        Looks like it just uses the size of the original slice. If the average broken chunk is less than three bytes (maybe quite common?) then it'll have to grow the buffer, at least doubling it.

          >> let bytestring = b"foobar\xcc";
          >> bytestring.len()
          7
          >> let cleaned = String::from_utf8_lossy(bytestring).into_owned();
          >> cleaned.len()
          9
          >> cleaned.capacity()
          14
    • cryptonector 5 years ago

      Use an ASCII character like `?` or, even better, ASCII SUB (0x1A, substitute), which is specifically intended for this sort of thing. Failing that, there are a number of other unused ASCII control characters like VT (vertical tab), FS (file separator), GS (group separator), US (unit separator), CAN (cancel), etc.

      • tialaramex 5 years ago

        No. Don't do any of these things. The reason U+FFFD exists, even though ASCII has any number of fun things you can scribble in one byte is that U+FFFD specifically isn't any of the things your program probably didn't expect to appear unexpectedly after unrelated processing.

        It isn't a letter, or a digit, or whitespace, or punctuation, or a word separator, or a control character, it is neither uppercase nor lowercase, it doesn't have any canonical equivalents - it's just a codepoint that exists specifically for this purpose.

        As a result it's much less likely that if gibberish sneaks into your system somehow and gets turned into U+FFFD this causes something important to break elsewhere.

        And when sooner or later a human is shown this text, it's very obvious that U+FFFD isn't what they expected, whether that was E-acute, a Euro currency symbol, a cat emoji or whatever else, and the human will know something went wrong and can decide if they care about that.

        • cryptonector 5 years ago

          This is about what can be done in SIMD. See GP and GGP. I think using ASCII SUB is a perfectly reasonable thing to do in this context. You can always further post-process the result to turn ASCII SUB into U+FFFD.

  • nezirus 5 years ago

    Yeah, I'd like to see that as well, something like to_string_lossy(), but really fast, so I can call it whenever I expect occasional broken UTF-8 string (and don't care about few replaced characters).

  • nerdponx 5 years ago

    I don't know, you are throwing away the length of the invalid sequence that way. Maybe useful in certain situations, but probably not in most.

  • loa_in_ 5 years ago

    I bet one could modify this to save indexes of invalid chars and do what you say with just one additional copy

celeritascelery 5 years ago

I find it interesting that with the state of rust SIMD many implementations just opted for reimplementing the code multiple times for each intrinsic (SSE, Neon, AVX, etc). One would think that one of the generic libraries (faster, simdeez, packed_simd) would start to see widespread use, but they all seem to have issues that have prevented adoption.

  • mooman219 5 years ago

    I work on a SIMD optimized font library [0] and have stumbled into the same situation of hand writing SIMD intrinsics. Some things are just kinda hard to make sure they get optimized correctly, and there is enough difference between the platforms where that matters when fiddling with bits. I also kinda have fun writing SIMD code like this too.

    [0]: https://github.com/mooman219/fontdue/blob/master/src/platfor...

  • monocasa 5 years ago

    The various SIMD ISAs are different enough from eachother that you really want a custom implementation per ISA for perf reasons. Particularly if you're doing something off the beaten path like this rather than cranking through some vector math.

    • celeritascelery 5 years ago

      I don't know. Looking in their code, they have one for SSE and another for AVX2. The algorithm is identical expect for one is 128 and the other is 256 bits. It seems like something that would be easy to abstract over so you only had a single implementation (at least for x86 SIMD). But it is obviously a hard problem or it would be solved already.

    • Fronzie 5 years ago

      Even for vector math, there are subtle differences, such as the numerical precision of the hardware reciprocal, leading to different implementations for x/y.

      The Eigen library did recently combine most SSE and AVX code paths, however: https://eigen.tuxfamily.org/index.php?title=3.4

    • pgwhalen 5 years ago

      Can you comment on the Vector API that Java is set to include? Do you think it's entirely misguided, or more of a best effort given what one can reasonably expect from Java?

      http://openjdk.java.net/jeps/8261663

      (I'm completely ignorant on the subject)

  • BelenusMordred 5 years ago

    packed_simd is now packed_simd2, the original author is MIA and no one else has crate or repo permissions

    I really think the general push right now is towards stdsimd being the future for portability.

jfk13 5 years ago

I'd be curious how this compares with the encoding_rs crate (https://github.com/hsivonen/encoding_rs), which also offers UTF-8 validation (among many other things).

TwoBit 5 years ago

Given the current ubiquity of UTF-8, is there value in having native CPU instructions for it? Or do they exist somewhere already?

nerdponx 5 years ago

Does Rust compile code that can be used/called from other languages with an FFI? That to me is always one of the persistent advantages of C, I don't have to fully understand how compile machine code works on a technical level, but there are lots of different languages that let me use C libraries in their own language runtimes.

It would be great to have things like high-performance unicode handling with consistent semantics across multiple languages!

kouteiheika 5 years ago

So are there any plans to eventually contribute this to the standard library, e.g. just like hashbrown was?

meisel 5 years ago

How does this compare, speed-wise, to the UTF-8 validation done in simdjson?

  • krypsOP 5 years ago

    Check the benchmarks section (https://github.com/rusticstuff/simdutf8#Benchmarks), second table. simdutf8 is up to 28 % faster on my Comet Lake CPU. However with pure ASCII clang does something magical with simdjson and it beats my implementation by a lot. GCC-compiled simdjson is slower all around except for a few outliers with short byte sequences.

    The algorithm is the one from simdjson, the main difference is that it uses an extra step in the beginning to align reads to the SIMD block size.

  • tyingq 5 years ago

    If someone is wanting to make a code comparison, most of the UTF-8 validation in simdjson seems to be nicely summed up in this pull request that sped it up: https://github.com/simdjson/simdjson/pull/993/files

celeritascelery 5 years ago

> The implementation is similar to the one in simdjson except that it aligns reads to the block size of the SIMD extension, which leads to better peak performance compared to the implementation in simdjson.

I didn't really understand this part. Aligned to what? to the cache line? SIMD always reads the block size. Unless I am missing something here.

  • unwind 5 years ago

    I read it as "to the width of the SIMD registers" which I have seen in other quick scanners, but did not read the code here.

    • celeritascelery 5 years ago

      Aren’t all reads aligned to the width of the SIMD register? If I do an AVX512 command it will read 512 bits right?

      • jsheard 5 years ago

        It's about where you read data from, not how much data gets read. For example an AVX read is aligned if the address being read from is a multiple of 32 bytes, otherwise it's unaligned and runs slightly slower, and slower still if it happens to straddle two cachelines. The same applies to write instructions as well.

        It's less of an issue than it used to be, the penalty for unaligned access has steadily been reduced by newer CPU architectures, but it's still there.

parhamn 5 years ago

I was wonder about what the ARM/M1 support for simd like instructions are. It seems like it will be a while for these simd packages to be as performant on ARM. Is this correct?

  • jsheard 5 years ago

    ARM has it's own SIMD instruction sets (NEON and SVE) but the intrinsics have yet to be stabilized in Rust, so the Rust SIMD ecosystem is x86-centric for now.

    The plan is for Rust to eventually have a portable SIMD abstraction built into the standard library to reduce the need for CPU-specific code.

Keyboard Shortcuts

j
Next item
k
Previous item
o / Enter
Open selected item
?
Show this help
Esc
Close modal / clear selection