Validating UTF-8 bytes using only 0.45 cycles per byte (AVX edition)

lemire.me

148 points by akarambir 7 years ago · 53 comments

the_clarence 7 years ago

I see a lot of applications trying to take advantage of SIMD, but what happens when you try to run them on systems that don't support these instructions? My guess is that you need to write multiple files taking advantage of different sets of instructions and then dynamically figure out which to use at runtime with cpuid, but isn't that cumbersome and a way to inflate a codebase dramatically?
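
A minimal sketch of the runtime-dispatch pattern being asked about, assuming GCC or Clang on x86 (the function names and placeholder bodies are made up; __builtin_cpu_supports does the CPUID check at run time):

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  static bool validate_utf8_scalar(const uint8_t *buf, size_t len) {
      (void)buf; (void)len;
      return true;                          /* placeholder: portable validator goes here */
  }

  __attribute__((target("avx2")))           /* this function alone may use AVX2 intrinsics */
  static bool validate_utf8_avx2(const uint8_t *buf, size_t len) {
      (void)buf; (void)len;
      return true;                          /* placeholder: AVX2 validator goes here */
  }

  typedef bool (*validate_fn)(const uint8_t *, size_t);

  static validate_fn pick_validator(void) {
      __builtin_cpu_init();                 /* populate the CPU feature flags */
      return __builtin_cpu_supports("avx2") ? validate_utf8_avx2
                                            : validate_utf8_scalar;
  }

  bool validate_utf8(const uint8_t *buf, size_t len) {
      static validate_fn impl;              /* resolved once, then cached */
      if (!impl)
          impl = pick_validator();
      return impl(buf, len);
  }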

  • en4bz 7 years ago
  • wmu 7 years ago

    Speaking of the Intel world, it's not that bad. There are three major versions right now: SSE4.1, AVX and AVX2 (AVX512 is not popular yet).

    In the past (roughly 10 years ago) it was a problem, as there were: MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, XOP, 3DNow and perhaps a few more extensions.

    it's not a typo, there are three 'S' :)

    • wmu 7 years ago

      Sorry, I forgot that in HN comments the asterisk char is an italics indicator. There should be a mark after SSSE3.

  • oconnor663 7 years ago

    > inflate a codebase dramatically

    This is usually only done for very specific algorithms. Unicode validation, hash functions, things like that. Unless you have an absolutely tiny application (which you might, if you're on some kind of microcontroller), it's going to be a small percentage of your overall code size.

    • londons_explore 7 years ago

      In a microcontroller, I don't think you'll be needing AVX2...

      • Rebelgecko 7 years ago

        I'm not sure where exactly the line is drawn between a microcontroller and a CPU, but even some of the lower end ARMs support SIMD instructions.

  • jmgrosen 7 years ago

    Generally speaking, I think if you care enough about performance to write manual SIMD code, being a little more cumbersome is a tradeoff you’re willing to make.

  • why_only_15 7 years ago

    In my understanding, when you use intrinsics and build for a processor without support for them, GCC (for example) will replace them with equivalent code.

    • mcbain 7 years ago

      Unfortunately, no.

      That is the case with GCC's __builtin functions. With a few exceptions, intrinsics are basically macros for inline asm that the compiler can reason about.

      If on x86-64 you use a _mm256* intrinsic and compile without AVX support you just get a compile error, not a pair of equivalent SSE instructions.

      • rurban 7 years ago

        Even worse: you mostly get run-time errors when the build machine supported that feature, your machine doesn't, and the features aren't separated via multiversioning or by loading different shared libs.
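
        One way to avoid shipping separate shared libs is function multiversioning in GCC (and recent Clang): a small sketch with target_clones, which emits one clone per listed target plus a resolver that picks among them at load time (the function here is just a stand-in, not the article's validator):

          #include <stddef.h>
          #include <stdint.h>

          /* The compiler builds an AVX2, an AVX and a baseline clone of this
             function and selects the best one for the running CPU via ifunc. */
          __attribute__((target_clones("avx2", "avx", "default")))
          uint64_t sum_bytes(const uint8_t *buf, size_t len) {
              uint64_t sum = 0;
              for (size_t i = 0; i < len; i++)   /* plain loop, vectorized per clone */
                  sum += buf[i];
              return sum;
          }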

    • eesmith 7 years ago

      That is true. Here are a couple of negatives. First, you still need to build once for each architecture, either as different executables or as different object files, and provide some dispatch mechanism to use the right one based on what hardware is available.

      Second, if the intrinsics aren't built-in then there may be faster alternatives than using the GCC emulated version.

      • BeeOnRope 7 years ago

        You must be thinking about GCC "builtins", because there is no emulation for x86 SIMD intrinsics (i.e. the things in <immintrin.h>).

        • eesmith 7 years ago

          Oh, indeed I was. Thanks for pointing out my error. I was specifically thinking about POPCNT.
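
          To illustrate the difference, a small sketch: the builtin always compiles and falls back to a software sequence (or a libgcc call) without -mpopcnt, while the intrinsic is only accepted when the target actually has POPCNT.

            #include <stdint.h>

            uint64_t count_bits_builtin(uint64_t x) {
                return (uint64_t)__builtin_popcountll(x);   /* portable, possibly emulated */
            }

            #ifdef __POPCNT__
            #include <nmmintrin.h>
            uint64_t count_bits_intrinsic(uint64_t x) {
                return (uint64_t)_mm_popcnt_u64(x);         /* needs -mpopcnt / -msse4.2 */
            }
            #endif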

  • saagarjha 7 years ago

    Darwin platforms ship binaries with different slices for different versions of Intel processors. You have the generic x86_64 and the newer x86_64h which supports more features.

bradleyjg 7 years ago

Under the new string model in Java > 8, a fairly frequent workflow is:

1) get external string

2) figure out if it is UTF-8, UTF-16, or some other recognizable encoding

3) validate the byte stream

4) figure out if the code points in the incoming string can be represented in Latin-1

5) instantiate a java string using either the Latin-1 encoder or the UTF-16 encoder

I know some or all of these steps are done using HotSpot intrinsics, and then the JIT/VM does inlining, folding and so on, but I wonder how fast a custom assembly function to do all these steps at once could be.
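
For a flavour of step 4, here is a sketch in C rather than Java (the function name is made up): once the bytes are known to be valid UTF-8, every code point fits in Latin-1 exactly when the stream contains only ASCII bytes and two-byte sequences whose lead byte is 0xC2 or 0xC3.

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  /* Assumes buf has already passed UTF-8 validation (step 3). */
  bool utf8_fits_latin1(const uint8_t *buf, size_t len) {
      for (size_t i = 0; i < len; i++) {
          uint8_t b = buf[i];
          if (b < 0x80)
              continue;                 /* ASCII: U+0000..U+007F */
          if (b == 0xC2 || b == 0xC3) {
              i++;                      /* skip the continuation byte: U+0080..U+00FF */
              continue;
          }
          return false;                 /* any other lead byte encodes a code point > U+00FF */
      }
      return true;
  }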

jwilk 7 years ago

Previous blog post on HN:

https://news.ycombinator.com/item?id=17081571

kissiel 7 years ago

I wonder about the Joules per byte. AFAIK AVX units are quite expensive energy-wise.

akarambir (OP) 7 years ago

What do Linux utilities like sed and awk use for text manipulation? They were very slow when I was changing a few table names in an SQL file.

  • zorked 7 years ago

    I don't think they use anything in common. Try setting your locale to "C", as otherwise string comparisons will do extra work handling your locale's notions of equivalent characters.

  • coldtea 7 years ago

    What was the size of the SQL file?

    A "few table names" doesn't mean much if the SQL file is 20GB.

    In any case, sed and awk are plenty fast, but not the fastest methods of text manipulation. You could write a custom C program for that.

    • Thiez 7 years ago

      While it sure is possible to do text manipulation in C, I don't think it should ever be the first choice, even if 'fastest' is a goal. A 0 byte is perfectly acceptable in a utf8 string (or any unicode string, really). But C has those annoying zero-terminated strings, so if you want to manipulate arbitrary unicode strings the first thing you can do is kiss the string functions in the C standard library goodbye. Which you probably want to do anyway because pascal-strings are simply better.

      I would use Rust or C++ for this task.

      • knome 7 years ago

        > A 0 byte is perfectly acceptable in a utf8 string (or any unicode string, really)

        What? My understanding was that utf8 was crafted specifically so that the only null byte in it was literally NUL. That all normal human language described by a utf8 string will never contain a NUL. They're comparable to C strings in that way, where it can be used safely as an end of string marker. If you have embedded NULs, it's not really utf8, is it?

        • masklinn 7 years ago

          > They're comparable to C strings in that way, where it can be used safely as an end of string marker. If you have embedded NULs, it's not really utf8, is it?

          It is. NUL is a C-string convention, as far as unicode is concerned NULL (U+0000) is a perfectly normal codepoint (very much unlike e.g. the U+D800–U+DFFF range).

        • Dylan16807 7 years ago

          > My understanding was that utf8 was crafted specifically so that the only null byte in it was literally NUL.

          Correct.

          > That all normal human language described by a utf8 string will never contain a NUL.

          Correct.

          > If you have embedded NULs, it's not really utf8, is it?

          Incorrect.

          NUL is a valid character. If you accept arbitrary utf-8, or arbitrary ascii, or arbitrary 8859-1, then there might be embedded NUL. You can filter them out if you want, but they're not invalid.
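
          A tiny illustration: the buffer below is valid UTF-8 (U+0000 encodes as the single byte 0x00, and no other code point's encoding contains a zero byte), yet C's NUL-terminated string functions stop at the first byte.

            #include <stdio.h>
            #include <string.h>

            int main(void) {
                const char buf[] = {'a', '\0', 'b'};   /* 3 bytes of valid UTF-8 */
                printf("bytes: %zu, strlen: %zu\n",
                       sizeof buf, strlen(buf));       /* prints "bytes: 3, strlen: 1" */
                return 0;
            }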

          • paavoova 7 years ago

            It's invalid for unix filenames to have a null character. Therefore, if your application is printing filenames in their unicode representation, it doesn't ever need to consider there to be a null byte. This of course isn't an arbitrary case, but it shows one can make assumptions regardless of the "validity" of a character. I believe for most cases of arbitrary input, the correct and safe thing to do is to assume a byte stream of unknown encoding.

            • Thiez 7 years ago

              Since we arrived at this null-character discussion by considering text manipulation in C, I suspect most comments in this thread are made under the assumption that the text must be manipulated in some way (mine are!), so treating it as a byte stream of unknown encoding doesn't really solve the problem.

              While null in filenames may be forbidden on Unix (and also on Windows), there are more exotic systems where it is allowed [1]. When writing portable software it's probably best not to make assumptions about what characters will never be in a filename.

              Naturally if you have a problem where you can get away with just moving bytes around and never making assumptions about its contents then that is a great solution.

              [1]: https://en.wikipedia.org/wiki/Filename#Comparison_of_filenam...

            • paulddraper 7 years ago

              It's also invalid for filenames to have a slash, but I don't think that's very relevant to the discussion at hand.

      • masklinn 7 years ago

        > Which you probably want to do anyway because pascal-strings are simply better.

        They're not though. While having an explicit length is great, p-strings mean the length is the first item of the data buffer, which is just awful, and is why Pascal was originally limited to 255-byte strings.

        Rust and C++ use record-strings, where the string type is a "rich" stack-allocated structure of (*buffer, length[, capacity], …) rather than just a buffer/pointer.
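
        Roughly, in C (field names made up; this just mirrors the (*buffer, length, capacity) shape that Rust's String and typical C++ std::string implementations keep in a small fixed-size header):

          #include <stddef.h>

          typedef struct {
              char  *buf;   /* separately allocated bytes, not necessarily NUL-terminated */
              size_t len;   /* bytes in use */
              size_t cap;   /* bytes allocated */
          } record_string;  /* small header, cheap to keep on the stack or in a register */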

        • Thiez 7 years ago

          That is a fair point, I misunderstood the term to refer to any type of string where the length is stored explicitly. I'll try and refer to them by their correct name ('record strings') from now on :-)

        • Dylan16807 7 years ago

          > p-strings means the length is the first item of the data buffer, which is just awful

          You can represent it as a struct of (length, char[]) which isn't awful.

          • masklinn 7 years ago

            > You can represent it as a struct of (length, char[]) which isn't awful.

            It kinda is still: if you're storing it on the stack you're dealing with an unsized on-stack structure which is painful, and if you're not you're paying a deref for accessing the length which you don't need to. If by `char[]` you mean `char*` then it's a record string, not a p-string.

            • Dylan16807 7 years ago

              I mean a variable-length array, all stored together.

              Presumably you'd allocate it on the heap in general. But a record string also requires a heap allocation.

              Most of the time you're touching the length you're probably touching the string data too, so that dereference isn't going to cost very much. And it comes with a tradeoff of more compact local data. So I stand by it being not awful! It may not be perfect, but it's a solid option.
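
              In C terms, that layout is a struct with a flexible array member, so the length and the bytes share one allocation (contrast with a record string, where the header and the buffer are separate objects); names here are made up:

                #include <stdlib.h>
                #include <string.h>

                typedef struct {
                    size_t len;
                    char   data[];   /* flexible array member: bytes follow the header inline */
                } pstring;

                /* one allocation holds both the length and the bytes */
                static pstring *pstring_new(const char *bytes, size_t len) {
                    pstring *s = malloc(sizeof *s + len);
                    if (s) {
                        s->len = len;
                        memcpy(s->data, bytes, len);
                    }
                    return s;
                }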

              • Thiez 7 years ago

                When you have record strings you get slicing for free though. Without the indirection of a pointer you have to copy data when you slice (or you must have a separate 'sliced string' type).

                • Dylan16807 7 years ago

                  "free" if you ignore the cost of doing lifetime management. So beneficial in some use cases but not others.

  • masklinn 7 years ago

    Note that your issue and TFA are not necessarily related: you're talking about performing unicode-aware text matching and manipulation, while TFA is solely about validating a buffer's content as UTF-8.

  • rurban 7 years ago

    They are still mostly not multi-byte string (i.e. unicode) aware after decades of work. That is, you cannot really search for strings with case-folding or normalized variants.

    See http://crashcourse.housegordon.org/coreutils-multibyte-suppo... and http://perl11.org/blog/foldcase.html for an overview of the performance problems.

    This tool only does the minor task of validating the UTF-8 encoding, nothing else. There are still the major tasks of decoding, folding and normalization to do.

  • akx 7 years ago

    How slow? On my 2013 MBP, `gsed` (GNU sed) can do a replacement like that at about 350 MiB/s (of which most seems to be spent writing to disk, since writing to /dev/null hikes it up to 800 MiB/s).

    • akarambir (OP) 7 years ago

      It was a sed substitute command on a ~800 MB file on a Thinkpad T470 with an SSD. It was taking around 40-50 seconds for each substitution. Though as others have pointed out, it may not be directly related to the article under discussion.

      • coldtea 7 years ago

        >It was taking around 40-50 sec for each substitution.

        Substitution is not really the relevant metric, as it wouldn't influence the result much. Sed/awk still have to go through the whole file to find all occurrences they should substitute (and when they do find an occurrence, the substitution itself takes nanoseconds).

        The size of the file is a better metric (e.g. how many seconds for that 800 MB in total).

        Also, whether you used regex in your awk/sed, and what kind. A badly written regex can slow down search very much.

      • etatoby 7 years ago

        Did you use any quadratic-or-worse regex, such as having more than one .* in a single pattern?

        Did you set LANG=C before running sed, to bypass the UTF-8 logic?

        Also, if you had a list of substitutions to perform, did you try writing them as a single sed script?
