Show HN: Accelerate SHA256 Computations in Go Using AVX512 instructions

74 points by y4m4 8 years ago · 35 comments

Vlad Krasnov, author of a bunch of the crypto assembly code in OpenSSL and in Go, recently wrote a blog post arguing that frequency scaling under AVX-512 can make it not worth using: https://blog.cloudflare.com/on-the-dangers-of-intels-frequen...

He doesn't like the title of the OP and provided links:

> Very misleading title. Could just as well name it "accelerate sha256 up to 134x". You need to compare apples to apples. If AVX2 was used in the same way AVX512 is used, the speedup would be 2X at most. Reminds me of two of my papers https://eprint.iacr.org/2012/371.pdf https://eprint.iacr.org/2012/067.pdf

(from https://twitter.com/thecomp1ler/status/940724783804645376)

EDIT: Thanks 'delhanty !

delhanty 8 years ago

The twitter link needs fixing perhaps:
https://twitter.com/thecomp1ler/status/940724783804645376
blasdel 8 years ago

He's using the "cheap" Xeon Silver chips that clock down all cores immediately and aggressively when any are using AVX-512.
Most of the Gold and Platinum series chips don't start frequency scaling down below baseline until around half the cores are using AVX512. The fanciest Platinum chips can use it on all cores with the only limit being that you can't Turbo quite as much: https://en.wikichip.org/wiki/intel/xeon_platinum/8180m
Without that capability, cloud providers wouldn't be able to offer multitenant VMs with access to the new instructions.
- thecompilr 8 years ago
  
  Can't turbo as much is still a net loss of performance. Even on Platinum turbo frequency is almost 10% lower with AVX512 on single core, and 30% lower with all cores. So you really want the majority of your software to use AVX512 to gain net benefit. It takes the system 2ms to recover after an AVX512 instruction. But you are correct that the Silvers are way worse. I suspect Intel intentionally killed AVX512 performance on the Silvers. I tested power consumption, and there is no reason to reduce the frequency, except for the sake of it. The sad thing is there is no CPUID flag to distinguish good AVX512 from useless AVX512. Would really be better if they disabled it completely on Silver. The way it is now will just hurt adoption.
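
A rough way to see why "the majority" matters is a simplified Amdahl-style model. This is only a sketch, not a measurement: it assumes the lower clock applies to the whole run, and it takes the 2x kernel speedup from the "2X at most" figure quoted at the top of the thread. The names `netSpeedup` and `breakEvenFraction` are made up for illustration.

```go
package main

import "fmt"

// netSpeedup models a workload where a fraction f of the baseline runtime is
// accelerated by a factor s using AVX-512, while the core runs at clock
// ratio c (0 < c <= 1) for the whole run because of frequency scaling.
// Assumption: the reduced clock applies to scalar code as well.
func netSpeedup(f, s, c float64) float64 {
	// Baseline time is 1. The vectorized run needs (1-f) + f/s units of
	// work, and each unit takes 1/c as long at the lower clock.
	return c / ((1 - f) + f/s)
}

// breakEvenFraction returns the fraction of runtime that must be vectorized
// for netSpeedup to reach exactly 1 (no gain, no loss).
func breakEvenFraction(s, c float64) float64 {
	return (1 - c) / (1 - 1/s)
}

func main() {
	// Clock ratios taken from the comment above: roughly 10% lower on a
	// single core and 30% lower with all cores active.
	for _, c := range []float64{0.9, 0.7} {
		fmt.Printf("clock ratio %.1f: break-even fraction %.2f\n",
			c, breakEvenFraction(2, c))
	}
}
```

Under these assumptions, a 30% clock reduction with a 2x kernel speedup only breaks even once about 60% of the runtime is vectorized; at a 10% reduction the threshold drops to about 20%.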
  - zx2c4 8 years ago
    
    Of interest regarding this might be: https://twitter.com/InstLatX64/status/934093081514831872
    > The sad thing is there is no CPUID flag to distinguish good AVX512 from useless AVX512.
    You can read the avx512_2ndFMA bit from the PIROM, according to this Intel datasheet: https://www.intel.com/content/www/us/en/processors/xeon/scal...
    Linux doesn't implement reading PIROM over SMBus, but it sure would be nice to expose this flag in /proc/cpuinfo.
    In WireGuard we're at the moment just disabling the zmm AVX512F implementation on Skylake-X, falling back to the still-fast-but-not-as-fast AVX512VL implementation that only touches ymm and doesn't downclock as much (following OpenSSL's reasoning on +/- Andy Polyakov's same implementation):
    https://git.zx2c4.com/WireGuard/tree/src/crypto/chacha20poly...
    I may look into trying to read the PIROM so that I can make a more informed decision. I've tested those Platinum boxes, and indeed it's a lot faster there, even with the [lesser] downclocking, whereas a Gold box didn't perform as well, making the ymm-only implementation necessary.
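
The fallback logic described above might look roughly like the following Go sketch. Everything here is hypothetical: the `cpuFeatures` struct, its `SecondFMA` field (standing in for the avx512_2ndFMA bit) and the `chooseImpl` function are illustrative names, nothing in it reads CPUID or the PIROM, and whether that bit is actually a good predictor of downclocking is an assumption, not something the thread settles.

```go
package main

import "fmt"

// cpuFeatures is a hypothetical summary of the host CPU. A real program
// would have to fill it in from CPUID (and, for SecondFMA, from some
// other source such as the PIROM); this sketch takes it as given.
type cpuFeatures struct {
	AVX2      bool
	AVX512F   bool
	AVX512VL  bool
	SecondFMA bool // assumption: used as a proxy for "AVX-512 that is worth using"
}

type impl string

const (
	implGeneric  impl = "generic"
	implAVX2     impl = "avx2"
	implAVX512VL impl = "avx512vl-ymm" // 256-bit registers only
	implAVX512F  impl = "avx512f-zmm"  // full 512-bit registers
)

// chooseImpl prefers the zmm code path only when the CPU is believed to
// handle it without heavy downclocking, and otherwise falls back to the
// ymm-only path, then AVX2, then generic code.
func chooseImpl(f cpuFeatures) impl {
	switch {
	case f.AVX512F && f.SecondFMA:
		return implAVX512F
	case f.AVX512F && f.AVX512VL:
		return implAVX512VL
	case f.AVX2:
		return implAVX2
	default:
		return implGeneric
	}
}

func main() {
	withoutBit := cpuFeatures{AVX2: true, AVX512F: true, AVX512VL: true}
	withBit := cpuFeatures{AVX2: true, AVX512F: true, AVX512VL: true, SecondFMA: true}
	fmt.Println(chooseImpl(withoutBit), chooseImpl(withBit))
}
```

Keeping the decision in one function means the policy can be changed (for example, to a benchmark-based check at startup) without touching the implementations themselves.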
    
    thecompilr 8 years ago
    
    If that is an issue for you, you could try using the implementation I wrote for boringssl. It avoids SIMD multiplications altogether and only uses simple AVX2 instructions, so there is no slowdown (AFAICT) although it is not as fast as AVX512VL from OpenSSL in benchmarks.

eloff 8 years ago

This is assembly, not pure Go, but it doesn't use CGO, which is probably what they mean.

Intel Cannon Lake processors will support the SHA instruction extensions (currently available only on Goldmont). It will be interesting to see how that compares with this approach of running 16 SHA computations in parallel. You would be able to get rid of the scheduling overhead of having to first queue up 16 SHA calculations from other threads.
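
To make the batching point concrete, here is a minimal Go sketch of grouping independent messages into batches of 16 (one per 32-bit lane of a 512-bit register). It is not the linked library's API: `hashLanes` and `hashAll` are hypothetical names, and the "kernel" simply loops over the standard `crypto/sha256` instead of running lanes in lockstep.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// lanes is the number of independent SHA-256 states that fit side by side
// in a 512-bit register, since SHA-256 works on 32-bit words.
const lanes = 16

// hashLanes stands in for a 16-way SIMD kernel. Here it just hashes each
// message in turn; a real kernel would advance all lanes together.
func hashLanes(batch [][]byte) [][32]byte {
	out := make([][32]byte, len(batch))
	for i, m := range batch {
		out[i] = sha256.Sum256(m)
	}
	return out
}

// hashAll groups messages into batches of up to `lanes` and makes one
// kernel call per batch, returning the digests in input order along with
// the number of kernel calls made.
func hashAll(msgs [][]byte) (digests [][32]byte, kernelCalls int) {
	for start := 0; start < len(msgs); start += lanes {
		end := start + lanes
		if end > len(msgs) {
			end = len(msgs) // last batch may leave some lanes idle
		}
		digests = append(digests, hashLanes(msgs[start:end])...)
		kernelCalls++
	}
	return digests, kernelCalls
}

func main() {
	msgs := make([][]byte, 20)
	for i := range msgs {
		msgs[i] = []byte(fmt.Sprintf("message %d", i))
	}
	digests, calls := hashAll(msgs)
	fmt.Printf("%d digests from %d kernel calls\n", len(digests), calls)
}
```

The sketch also shows where the overhead comes from: a caller with a single message either waits for 15 others to arrive or runs the kernel with most lanes idle, which is the cost a dedicated SHA instruction would avoid.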

TazeTSchnitzel 8 years ago

> Intel Cannon Lake processors will support the SHA instruction extensions (currently available only on Goldmont)
They're also already available on AMD Zen (Ryzen, Threadripper, Epyc, Ryzen Mobile).
stcredzero 8 years ago

> This is assembly, not pure Go
Well, if you're going to dip into pedantic mode, couldn't the language maintainers just define Go to include a few relevant Assembly instruction sets? (Not taking a dig at you but rather at the above level of pedantry.)
- MaxBarraclough 8 years ago
  
  Not without tying Go to one architecture, no.
  When C programmers write inline assembly, they don't pretend it's C code.
  - sythe2o0 8 years ago
    
    Go has its own form of assembly which it compiles to multiple architectures.
    
    npongratz 8 years ago
    
    This is interesting. A citation that helps strengthen the explanation:
    https://golang.org/doc/asm
    "The most important thing to know about Go's assembler is that it is not a direct representation of the underlying machine. Some of the details map precisely to the machine, but some do not... The details vary with architecture, and we apologize for the imprecision; the situation is not well-defined."
    
    MaxBarraclough 8 years ago
    
    I followed npongratz's link. Interesting read.
    Seems the point of it is to enable easier porting of assembly between architectures, by providing a consistent syntax.
    I was expecting something akin to LLVM assembly language, but no, they've come up with their own bizarre high-level assembly-language intended to map fairly directly to various different instruction-sets. It's not an abstraction layer in the usual sense; the exposed register-set and instruction-set are faithful to the target architecture.
    It's a finite register machine which isn't just faithfully exposing the underlying architecture. Not something we often see. iirc SPIR and GNU Lightning are both finite register machines, but, to quote Douglas Adams, this has made a lot of people very angry and been widely regarded as a bad move.
    How is it compiled? Presumably it doesn't get translated to LLVM as an intermediary.
    It strikes me as an awful lot of work. Does their high-level assembler really outperform LLVM? Would've thought a project of that sort would deserve to exist in its own right, not just as an obscure component of Go.
    
    stcredzero 8 years ago
    
    > It's a finite register machine which isn't just faithfully exposing the underlying architecture. Not something we often see. iirc SPIR and GNU Lightning are both finite register machines, but, to quote Douglas Adams, this has made a lot of people very angry and been widely regarded as a bad move.
    TAOS operating system used such a virtual ISA, and was able to achieve around 90% efficiency of native code. The worst case was PowerPC which fell to 80%. That's pretty darn good, IMO.
    
    MaxBarraclough 8 years ago
    
    Skimreading http://www.dickpountain.co.uk/home/computing/byte-articles/t... ( linked from https://news.ycombinator.com/item?id=9806607 , where you yourself commented )
    Interesting OS. Its 'VP code' looks like a precursor to Java bytecode/HotSpot, but much more low-level and RISC-ey.
    Inferno OS's 'Dis' VM took a similar approach to VP code, if I understand correctly.
    I presume that, in 1991 when the article was written, "JIT" wasn't yet in the techies' parlance. It's not used anywhere in the article.
    
    dbaupp 8 years ago
    
    It won't when using architecture specific instructions like AVX512.
  - pjmlp 8 years ago
    
    Sure they do; most developers aren't even aware that ANSI C does not define any kind of inline Assembly, and ANSI C++ only defines the existence of an asm keyword with implementation-defined behavior.
  - stcredzero 8 years ago
    
    > Not without tying Go to one architecture, no.
    Not trying hard enough to prove the null hypothesis. If you're going full pedantic mode, why stop at one ISA? Just throw in the most relevant dozen. (Heck, I even spelled that out!)
    
    MaxBarraclough 8 years ago
    
    That doesn't escape the problem. If you do that, your new definition of the core Go language is no longer cross-platform.
    
    stcredzero 8 years ago
    
    It's cross platform enough for most people.

foobarbazetc 8 years ago

One thing to note is that the benchmark is running on a Skylake Platinum chip which has two AVX512 FMAs.

You need a Gold 6000 series and above to see any benefit from AVX512. In most other cases the CPU throttles down some insane amount and there’s little to no benefit.

jandrewrogers 8 years ago

The much cheaper Xeon W-series, such as those in the iMac Pro, also have two AVX-512 FMAs.
thecompilr 8 years ago

You don't use FMA (or any multiplication) for SHA2. The throttling is a big issue.
- foobarbazetc 8 years ago
  
  True.
  Did you guys get to test Epyc at CloudFlare?
  The 7401P seems pretty special. Like really great $ per perf. I think SuperMicro are coming out with 1 socket Epyc boards/servers.
nextos 8 years ago

In this particular AVX512 use case or in most?

ComputerGuru 8 years ago

I blogged about the SHA instruction support in the x86_64 ISA a few months back; it’ll be nice to see it actually happen: https://neosmart.net/blog/2017/will-amds-ryzen-finally-bring...

dragonfax 8 years ago

Isn't this the kind of thing that was missing from the "go on different platforms" benchmark a little while back? The Intel platform has crazy optimizations for encryption algorithms, while ARM was severely lacking.

wolf550e 8 years ago

Search for "arm" on this list: https://dev.golang.org/reviews

mikebenfield 8 years ago

Possibly I'm confused, but in what sense is this "in Pure Go"?

derefr 8 years ago

I guess in the sense that you only need the Go toolchain, rather than also needing a C toolchain to compile+link an extra C object file into your binary.
gjem97 8 years ago

IMO, it's playing fast and loose with that term, but I guess the point is that it's not using CGO (i.e. calling into C code). It is, however, using the assembler packaged with the Go tools, so in that regard it's not "pure" go.
- stcredzero 8 years ago
  
  > IMO, it's playing fast and loose with that term
  The terminology in this context is already fast and loose: It is rigorous in a practical engineering sense and is far from a mathematical level of precision. As I pointed out above, the maintainers could just define Go to include a few Assembly languages.
bigdubs 8 years ago

It doesn't use CGO.
