Show HN: High-speed UTF-8 validation in Rust

127 points by kryps 5 years ago · 47 comments

Reader

One flavour I would expect to be valuable that isn't present here is this:

Process the input (as quickly as possible) and never fail, but replace each invalid sequence of bytes with U+FFFD (bytes 0xEF 0xBF 0xBD).

chrismorgan 5 years ago

One major problem with this: you can’t do it in place. Invalid byte sequences could be 1–4 bytes long, but U+FFFD is exactly three bytes long.
- AjtevhihBudhirv 5 years ago
  
  There're still faster approaches than naïve and probably common "copy valid byte sequences one by one into a resizable result buffer". For instance, scan through the input bytes all at once, keeping track of position and length of valid sequences, then memcopy each valid sequence into a preallocated buffer.
  Edit: Although, it looks like Rust's std already does this, except for preallocating an exactly correct size result buffer: https://doc.rust-lang.org/src/alloc/string.rs.html#538
  - duckerude 5 years ago
    
    > preallocating an exactly correct size result buffer
    Looks like it just uses the size of the original slice. If the average broken chunk is less than three bytes (maybe quite common?) then it'll have to grow the buffer, at least doubling it.
    >> let bytestring = b"foobar\xcc"; >> bytestring.len() 7 >> let cleaned = String::from_utf8_lossy(bytestring).into_owned(); >> cleaned.len() 9 >> cleaned.capacity() 14
- cryptonector 5 years ago
  
  Use an ASCII character like `?` or, even better, ASCII SUB (0x1A, substitute), which is specifically intended for this sort of thing. Failing that, there are a number of other unused ASCII control characters like VT (vertical tab), FS (file separator), GS (group separator), US (unit separator), CAN (cancel), etc.
  - tialaramex 5 years ago
    
    No. Don't do any of these things. The reason U+FFFD exists, even though ASCII has any number of fun things you can scribble in one byte is that U+FFFD specifically isn't any of the things your program probably didn't expect to appear unexpectedly after unrelated processing.
    It isn't a letter, or a digit, or whitespace, or punctuation, or a word separator, or a control character, it is neither uppercase nor lowercase, it doesn't have any canonical equivalents - it's just a codepoint that exists specifically for this purpose.
    As a result it's much less likely that if gibberish sneaks into your system somehow and gets turned into U+FFFD this causes something important to break elsewhere.
    And when sooner or later a human is shown this text, it's very obvious that U+FFFD isn't what they expected, whether that was E-acute, a Euro currency symbol, a cat emoji or whatever else, and the human will know something went wrong and can decide if they care about that.
    
    cryptonector 5 years ago
    
    This is about what can be done in SIMD. See GP and GGP. I think using ASCII SUB is a perfectly reasonable thing to do in this context. You can always further post-process the result to turn ASCII SUB into U+FFFD.
nezirus 5 years ago

Yeah, I'd like to see that as well, something like to_string_lossy(), but really fast, so I can call it whenever I expect occasional broken UTF-8 string (and don't care about few replaced characters).
nerdponx 5 years ago

I don't know, you are throwing away the length of the invalid sequence that way. Maybe useful in certain situations, but probably not in most.
loa_in_ 5 years ago

I bet one could modify this to save indexes of invalid chars and do what you say with just one additional copy

celeritascelery 5 years ago

I find it interesting that with the state of rust SIMD many implementations just opted for reimplementing the code multiple times for each intrinsic (SSE, Neon, AVX, etc). One would think that one of the generic libraries (faster, simdeez, packed_simd) would start to see widespread use, but they all seem to have issues that have prevented adoption.

mooman219 5 years ago

I work on a SIMD optimized font library [0] and have stumbled into the same situation of hand writing SIMD intrinsics. Some things are just kinda hard to make sure they get optimized correctly, and there is enough difference between the platforms where that matters when fiddling with bits. I also kinda have fun writing SIMD code like this too.
[0]: https://github.com/mooman219/fontdue/blob/master/src/platfor...
monocasa 5 years ago

The various SIMD ISAs are different enough from eachother that you really want a custom implementation per ISA for perf reasons. Particularly if you're doing something off the beaten path like this rather than cranking through some vector math.
- celeritascelery 5 years ago
  
  I don't know. Looking in their code, they have one for SSE and another for AVX2. The algorithm is identical expect for one is 128 and the other is 256 bits. It seems like something that would be easy to abstract over so you only had a single implementation (at least for x86 SIMD). But it is obviously a hard problem or it would be solved already.
- Fronzie 5 years ago
  
  Even for vector math, there are subtle differences, such as the numerical precision of the hardware reciprocal, leading to different implementations for x/y.
  The Eigen library did recently combine most SSE and AVX code paths, however: https://eigen.tuxfamily.org/index.php?title=3.4
- pgwhalen 5 years ago
  
  Can you comment on the Vector API that Java is set to include? Do you think it's entirely misguided, or more of a best effort given what one can reasonably expect from Java?
  http://openjdk.java.net/jeps/8261663
  (I'm completely ignorant on the subject)
BelenusMordred 5 years ago

packed_simd is now packed_simd2, the original author is MIA and no one else has crate or repo permissions
I really think the general push right now is towards stdsimd being the future for portability.

jfk13 5 years ago

I'd be curious how this compares with the encoding_rs crate (https://github.com/hsivonen/encoding_rs), which also offers UTF-8 validation (among many other things).

cryptonector 5 years ago

This is SIMD based. If the encoding_rs crate isn't SIMD based, then this thing will beat its pants off.
- burntsushi 5 years ago
  
  It does use SIMD (when enabled). But also, the encoding_rs crate is doing more. It isn't just checking for validity. It's also doing transcoding and error detection.
  - krypsOP 5 years ago
    
    As far as I can see it does not use SIMD for UTF8 validation, only likely/unlikely instrinsics, see https://github.com/hsivonen/encoding_rs/blob/e98a2096ab09c92...
    
    jfk13 5 years ago
    
    That's not the only use of SIMD in the crate (e.g. see https://github.com/hsivonen/encoding_rs/blob/e98a2096ab09c92...), but I haven't looked into exactly where/how it's used further.

TwoBit 5 years ago

Given the current ubiquity of UTF-8, is there value in having native CPU instructions for it? Or do they exist somewhere already?

dahfizz 5 years ago

What would these cpu instructions do?
- TwoBit 5 years ago
  
  Read or write code points quickly, with validation.

nerdponx 5 years ago

Does Rust compile code that can be used/called from other languages with an FFI? That to me is always one of the persistent advantages of C, I don't have to fully understand how compile machine code works on a technical level, but there are lots of different languages that let me use C libraries in their own language runtimes.

It would be great to have things like high-performance unicode handling with consistent semantics across multiple languages!

hardwaresofton 5 years ago

Yes it does -- one of the great parts about Rust. There are lots of projects out there slotting rust into other languages and getting crazy speedups.
- https://doc.rust-lang.org/reference/linkage.html
- https://doc.rust-lang.org/book/ch19-01-unsafe-rust.html#call...
jhgb 5 years ago

A few days ago I read https://dev.to/veer66/calling-rust-from-common-lisp-45c5 , which illustrates presumably what you're talking about.
Disp4tch 5 years ago

Yes, from the python side of things there are tools like py03 that make integrating rust into python code really painless.
I have a sql parsing library (shameless plug) that is 50x faster than any other python implementation, it is just a super simple wrapper around a rust crate.
https://github.com/wseaton/sqloxide
gliptic 5 years ago

Yes, you can expose things with the C ABI. This is essential for how it's used in Firefox.
kzrdude 5 years ago

Yes, Rust natively can expose C ffi functions (for example).
The https://cxx.rs/ project is also a major crate for C++ interoperability.

kouteiheika 5 years ago

So are there any plans to eventually contribute this to the standard library, e.g. just like hashbrown was?

mrec 5 years ago

The author discusses this in the corresponding /r/rust thread:
https://www.reddit.com/r/rust/comments/mvc6o5/incredibly_fas...
tl;dr: it's not straightforward, mostly because the standard functionality lives in `core` which doesn't have access to OS support for CPU feature detection.
- bombcar 5 years ago
  
  Does Rust allow for libraries to “override” core functionality when available?
  - Kinrany 5 years ago
    
    If it was all userspace Rust, it would have been possible to choose the implementation at compile time conditional on the presence of std.

meisel 5 years ago

How does this compare, speed-wise, to the UTF-8 validation done in simdjson?

krypsOP 5 years ago

Check the benchmarks section (https://github.com/rusticstuff/simdutf8#Benchmarks), second table. simdutf8 is up to 28 % faster on my Comet Lake CPU. However with pure ASCII clang does something magical with simdjson and it beats my implementation by a lot. GCC-compiled simdjson is slower all around except for a few outliers with short byte sequences.
The algorithm is the one from simdjson, the main difference is that it uses an extra step in the beginning to align reads to the SIMD block size.
tyingq 5 years ago

If someone is wanting to make a code comparison, most of the UTF-8 validation in simdjson seems to be nicely summed up in this pull request that sped it up: https://github.com/simdjson/simdjson/pull/993/files

celeritascelery 5 years ago

> The implementation is similar to the one in simdjson except that it aligns reads to the block size of the SIMD extension, which leads to better peak performance compared to the implementation in simdjson.

I didn't really understand this part. Aligned to what? to the cache line? SIMD always reads the block size. Unless I am missing something here.

unwind 5 years ago

I read it as "to the width of the SIMD registers" which I have seen in other quick scanners, but did not read the code here.
- celeritascelery 5 years ago
  
  Aren’t all reads aligned to the width of the SIMD register? If I do an AVX512 command it will read 512 bits right?
  - jsheard 5 years ago
    
    It's about where you read data from, not how much data gets read. For example an AVX read is aligned if the address being read from is a multiple of 32 bytes, otherwise it's unaligned and runs slightly slower, and slower still if it happens to straddle two cachelines. The same applies to write instructions as well.
    It's less of an issue than it used to be, the penalty for unaligned access has steadily been reduced by newer CPU architectures, but it's still there.
    
    celeritascelery 5 years ago
    
    ahh, so it does come back to cache line alignment. Reading aligned data doesn't give any benefit in and of itself[1]. At least not on modern hardware. I guess the performance improvement would make sense since SIMD instructions are sized to be a multiple of the cache line size.
    [1] https://lemire.me/blog/2012/05/31/data-alignment-for-speed-m...

parhamn 5 years ago

I was wonder about what the ARM/M1 support for simd like instructions are. It seems like it will be a while for these simd packages to be as performant on ARM. Is this correct?

jsheard 5 years ago

ARM has it's own SIMD instruction sets (NEON and SVE) but the intrinsics have yet to be stabilized in Rust, so the Rust SIMD ecosystem is x86-centric for now.
The plan is for Rust to eventually have a portable SIMD abstraction built into the standard library to reduce the need for CPU-specific code.

Settings

Show HN: High-speed UTF-8 validation in Rust

Keyboard Shortcuts