Show HN: High-speed UTF-8 validation in Rust
github.com

One flavour I would expect to be valuable that isn't present here is this:
Process the input (as quickly as possible) and never fail, but replace each invalid sequence of bytes with U+FFFD (bytes 0xEF 0xBF 0xBD).
One major problem with this: you can’t do it in place. Invalid byte sequences could be 1–4 bytes long, but U+FFFD is exactly three bytes long.
There are still faster approaches than the naïve and probably common "copy valid byte sequences one by one into a resizable result buffer". For instance: scan through the input bytes all at once, keeping track of the position and length of each valid sequence, then memcpy each valid sequence into a preallocated buffer.
Edit: It looks like Rust's std already does this, except that it doesn't preallocate an exactly sized result buffer: https://doc.rust-lang.org/src/alloc/string.rs.html#538
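For reference, a condensed sketch of that loop (in the spirit of std's from_utf8_lossy, not the actual std code), using std's validator to find the valid runs:

    fn to_utf8_lossy(mut input: &[u8]) -> String {
        // Preallocate using the input length; this can still be outgrown
        // if many short invalid sequences become the 3-byte U+FFFD.
        let mut out = String::with_capacity(input.len());
        loop {
            match std::str::from_utf8(input) {
                Ok(valid) => {
                    out.push_str(valid);
                    return out;
                }
                Err(err) => {
                    let valid_up_to = err.valid_up_to();
                    // SAFETY: from_utf8 just verified this prefix is valid UTF-8.
                    out.push_str(unsafe {
                        std::str::from_utf8_unchecked(&input[..valid_up_to])
                    });
                    out.push('\u{FFFD}');
                    // error_len() is None for a sequence truncated at the end.
                    let skip = err.error_len().unwrap_or(input.len() - valid_up_to);
                    input = &input[valid_up_to + skip..];
                }
            }
        }
    }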
> it doesn't preallocate an exactly sized result buffer
Looks like it just uses the size of the original slice. If the average broken chunk is less than three bytes (maybe quite common?) then it'll have to grow the buffer, at least doubling it.
    >> let bytestring = b"foobar\xcc";
    >> bytestring.len()
    7
    >> let cleaned = String::from_utf8_lossy(bytestring).into_owned();
    >> cleaned.len()
    9
    >> cleaned.capacity()
    14
Use an ASCII character like `?` or, even better, ASCII SUB (0x1A, substitute), which is specifically intended for this sort of thing. Failing that, there are a number of other unused ASCII control characters like VT (vertical tab), FS (file separator), GS (group separator), US (unit separator), CAN (cancel), etc.
No. Don't do any of these things. The reason U+FFFD exists, even though ASCII has any number of fun things you can scribble in one byte, is that U+FFFD specifically isn't any of the things your program might act on when it appears unexpectedly after unrelated processing.
It isn't a letter, or a digit, or whitespace, or punctuation, or a word separator, or a control character, it is neither uppercase nor lowercase, it doesn't have any canonical equivalents - it's just a codepoint that exists specifically for this purpose.
As a result, if gibberish sneaks into your system somehow and gets turned into U+FFFD, it's much less likely to break something important elsewhere.
And when sooner or later a human is shown this text, it's very obvious that U+FFFD isn't what they expected, whether that was E-acute, a Euro currency symbol, a cat emoji or whatever else, and the human will know something went wrong and can decide if they care about that.
This is about what can be done in SIMD. See GP and GGP. I think using ASCII SUB is a perfectly reasonable thing to do in this context. You can always further post-process the result to turn ASCII SUB into U+FFFD.
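Something like this trivial (if allocating) post-processing pass, assuming the fast path marked bad sequences with SUB:

    // Hypothetical second pass: the fast SIMD stage is assumed to have
    // already replaced each invalid sequence with ASCII SUB (0x1A).
    fn sub_to_replacement_char(s: &str) -> String {
        s.replace('\u{1A}', "\u{FFFD}")
    }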
Yeah, I'd like to see that as well, something like to_string_lossy() but really fast, so I can call it whenever I expect the occasional broken UTF-8 string (and don't care about a few replaced characters).
I don't know, you are throwing away the length of the invalid sequence that way. Maybe useful in certain situations, but probably not in most.
I bet one could modify this to save the indices of invalid characters and do what you say with just one additional copy.
I find it interesting that, given the state of Rust SIMD, many implementations have just opted to reimplement the code multiple times, once per instruction set (SSE, NEON, AVX, etc.). One would think that one of the generic libraries (faster, simdeez, packed_simd) would start to see widespread use, but they all seem to have issues that have prevented adoption.
I work on a SIMD-optimized font library [0] and have stumbled into the same situation of hand-writing SIMD intrinsics. Some things are just kinda hard to get optimized correctly, and there is enough difference between the platforms that it matters when fiddling with bits. I also kinda have fun writing SIMD code like this.
[0]: https://github.com/mooman219/fontdue/blob/master/src/platfor...
The various SIMD ISAs are different enough from each other that you really want a custom implementation per ISA for perf reasons. Particularly if you're doing something off the beaten path like this rather than cranking through some vector math.
I don't know. Looking at their code, they have one implementation for SSE and another for AVX2. The algorithm is identical except that one is 128 bits wide and the other 256. It seems like something that would be easy to abstract over so you only had a single implementation (at least for x86 SIMD). But it is obviously a hard problem or it would be solved already.
Even for vector math, there are subtle differences, such as the numerical precision of the hardware reciprocal, leading to different implementations for x/y.
The Eigen library did recently combine most SSE and AVX code paths, however: https://eigen.tuxfamily.org/index.php?title=3.4
Can you comment on the Vector API that Java is set to include? Do you think it's entirely misguided, or more of a best effort given what one can reasonably expect from Java?
http://openjdk.java.net/jeps/8261663
(I'm completely ignorant on the subject)
packed_simd is now packed_simd2; the original author is MIA and no one else has crate or repo permissions.
I really think the general push right now is towards stdsimd being the future for portability.
I'd be curious how this compares with the encoding_rs crate (https://github.com/hsivonen/encoding_rs), which also offers UTF-8 validation (among many other things).
This is SIMD based. If the encoding_rs crate isn't SIMD based, then this thing will beat its pants off.
It does use SIMD (when enabled). But also, the encoding_rs crate is doing more. It isn't just checking for validity. It's also doing transcoding and error detection.
As far as I can see it does not use SIMD for UTF-8 validation, only likely/unlikely intrinsics; see https://github.com/hsivonen/encoding_rs/blob/e98a2096ab09c92...
That's not the only use of SIMD in the crate (e.g. see https://github.com/hsivonen/encoding_rs/blob/e98a2096ab09c92...), but I haven't looked into exactly where/how it's used further.
Given the current ubiquity of UTF-8, is there value in having native CPU instructions for it? Or do they exist somewhere already?
What would these CPU instructions do?
Read or write code points quickly, with validation.
Does Rust compile code that can be used/called from other languages with an FFI? That to me is always one of the persistent advantages of C: I don't have to fully understand how the compiled machine code works on a technical level, but there are lots of different languages that let me use C libraries in their own language runtimes.
It would be great to have things like high-performance unicode handling with consistent semantics across multiple languages!
Yes it does -- one of the great parts about Rust. There are lots of projects out there slotting Rust into other languages and getting crazy speedups.
- https://doc.rust-lang.org/reference/linkage.html
- https://doc.rust-lang.org/book/ch19-01-unsafe-rust.html#call...
A few days ago I read https://dev.to/veer66/calling-rust-from-common-lisp-45c5 , which presumably illustrates what you're talking about.
Yes, from the Python side of things there are tools like PyO3 that make integrating Rust into Python code really painless.
I have a SQL-parsing library (shameless plug) that is 50x faster than any other Python implementation; it is just a super simple wrapper around a Rust crate.
Yes, you can expose things with the C ABI. This is essential for how it's used in Firefox.
Yes, Rust can natively expose functions with the C ABI. For example:
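A minimal sketch (the function name is hypothetical; assumes the crate is built with crate-type = "cdylib" so C callers can link it):

    /// Returns true if `data[..len]` is valid UTF-8. Callable from C as
    /// `bool is_valid_utf8(const uint8_t *data, size_t len);`.
    ///
    /// # Safety
    /// `data` must point to at least `len` readable bytes.
    #[no_mangle]
    pub unsafe extern "C" fn is_valid_utf8(data: *const u8, len: usize) -> bool {
        let bytes = std::slice::from_raw_parts(data, len);
        std::str::from_utf8(bytes).is_ok()
    }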
The https://cxx.rs/ project is also a major crate for C++ interoperability.
So are there any plans to eventually contribute this to the standard library, e.g. just like hashbrown was?
The author discusses this in the corresponding /r/rust thread:
https://www.reddit.com/r/rust/comments/mvc6o5/incredibly_fas...
tl;dr: it's not straightforward, mostly because the standard functionality lives in `core` which doesn't have access to OS support for CPU feature detection.
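For context, the runtime dispatch that a user-space crate like this one can do looks roughly like the sketch below (helper names are hypothetical, and the scalar check stands in for real AVX2 code). The is_x86_feature_detected! macro lives in std, which is exactly what core lacks:

    #[cfg(target_arch = "x86_64")]
    fn validate_utf8(bytes: &[u8]) -> bool {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: only reached when AVX2 is present at runtime.
            unsafe { validate_avx2(bytes) }
        } else {
            validate_fallback(bytes)
        }
    }

    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "avx2")]
    unsafe fn validate_avx2(bytes: &[u8]) -> bool {
        // A real implementation would use AVX2 intrinsics here.
        std::str::from_utf8(bytes).is_ok()
    }

    fn validate_fallback(bytes: &[u8]) -> bool {
        std::str::from_utf8(bytes).is_ok()
    }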
Does Rust allow for libraries to “override” core functionality when available?
If it was all userspace Rust, it would have been possible to choose the implementation at compile time conditional on the presence of std.
How does this compare, speed-wise, to the UTF-8 validation done in simdjson?
Check the benchmarks section (https://github.com/rusticstuff/simdutf8#Benchmarks), second table. simdutf8 is up to 28% faster on my Comet Lake CPU. However, with pure ASCII, clang does something magical with simdjson and beats my implementation by a lot. GCC-compiled simdjson is slower all around, except for a few outliers with short byte sequences.
The algorithm is the one from simdjson, the main difference is that it uses an extra step in the beginning to align reads to the SIMD block size.
If someone is wanting to make a code comparison, most of the UTF-8 validation in simdjson seems to be nicely summed up in this pull request that sped it up: https://github.com/simdjson/simdjson/pull/993/files
> The implementation is similar to the one in simdjson except that it aligns reads to the block size of the SIMD extension, which leads to better peak performance compared to the implementation in simdjson.
I didn't really understand this part. Aligned to what? To the cache line? SIMD always reads the block size. Unless I am missing something here.
I read it as "to the width of the SIMD registers" which I have seen in other quick scanners, but did not read the code here.
Aren't all reads aligned to the width of the SIMD register? If I do an AVX-512 command it will read 512 bits, right?
It's about where you read data from, not how much data gets read. For example an AVX read is aligned if the address being read from is a multiple of 32 bytes, otherwise it's unaligned and runs slightly slower, and slower still if it happens to straddle two cachelines. The same applies to write instructions as well.
It's less of an issue than it used to be, the penalty for unaligned access has steadily been reduced by newer CPU architectures, but it's still there.
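Roughly, the alignment trick reads like this sketch (plain byte checks stand in for the SIMD loads, and the 64-byte block size is an assumption):

    /// Consume an unaligned prologue with scalar code so that the main
    /// loop starts at an address aligned to the SIMD block size.
    fn is_ascii(input: &[u8]) -> bool {
        const BLOCK: usize = 64; // assumed SIMD block size in bytes

        // Scalar prologue up to the first BLOCK-aligned address.
        let prologue_len = input.as_ptr().align_offset(BLOCK).min(input.len());
        let (prologue, rest) = input.split_at(prologue_len);
        if !prologue.iter().all(|&b| b < 0x80) {
            return false;
        }

        // Every chunk except possibly the last now starts BLOCK-aligned,
        // so a real implementation could use aligned loads that never
        // straddle a cache line.
        rest.chunks(BLOCK).all(|chunk| chunk.iter().all(|&b| b < 0x80))
    }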
ahh, so it does come back to cache-line alignment. Reading aligned data doesn't give any benefit in and of itself[1], at least not on modern hardware. I guess the performance improvement makes sense, though, since aligned SIMD accesses never straddle a cache line (the register widths evenly divide the cache-line size).
[1] https://lemire.me/blog/2012/05/31/data-alignment-for-speed-m...
I was wondering what the ARM/M1 support for SIMD-like instructions is. It seems like it will be a while before these SIMD packages are as performant on ARM. Is this correct?
ARM has its own SIMD instruction sets (NEON and SVE), but the intrinsics have yet to be stabilized in Rust, so the Rust SIMD ecosystem is x86-centric for now.
The plan is for Rust to eventually have a portable SIMD abstraction built into the standard library to reduce the need for CPU-specific code.
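On nightly that abstraction already exists behind the portable_simd feature. A tiny sketch (the API is unstable, so the exact names are an assumption):

    #![feature(portable_simd)] // nightly-only at the time of writing
    use std::simd::prelude::*;

    /// Returns true if all 16 bytes are ASCII, using the portable types.
    fn block_is_ascii(block: [u8; 16]) -> bool {
        let v = u8x16::from_array(block);
        v.simd_lt(u8x16::splat(0x80)).all()
    }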