Computing Adler32 Checksums at 41 GB/s

2 points by wooosh 4 years ago · 3 comments


Nyan 4 years ago

Nicely done.

> There is still a lot of room to micro-optimize both the avx and avx64 implementation

I personally couldn't see much; perhaps aligning loads and deferring `_mm256_madd_epi16` are the only ideas that come to mind. What did you have in mind?

wooosh (OP) 4 years ago

Not sure if any of these would result in meaningful performance gains, but a few ideas I had:
* An avx96/avx128 version, which requires more care than avx32/avx64 because you will overflow a signed 16-bit number if you simply extend the coefficient vectors from 0..32 to 0..96/128 (e.g. 255*96 + 254*96 > 32767). Looking at it now, though, I realize you shouldn't actually need more than one 0..32 coefficient vector.
* The chunk length could be longer because there are 8 separate 32 bit counters in each vector, which can be summed into a uint64_t instead of a uint32_t when computing the modulo.
* As you said, aligning the loads and deferring the `_mm256_madd_epi16` outside of the loop. For deferring the madd specifically, one option is to use two separate sum2 vectors and split the `mad` vector in two with `_mm256_and_si256(mad, _mm256_set1_epi32(0xFFFF))` and `_mm256_srli_epi32(mad, 16)`, which should improve upon the 5 cycle latency hit incurred by the madd.
Plus I am sure there are many other opportunities to optimize this I have not thought of :)
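
A rough scalar sketch of the arithmetic behind the three ideas above, in portable C rather than intrinsics. It assumes the `mad` vector holds 16-bit sums of two byte-times-coefficient products per lane (as an unsigned-byte multiply-add would produce); that is a guess about the article's loop, not something stated in this thread. All function names here are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ADLER_MOD 65521u

/* Idea 1: largest value one 16-bit lane can reach when two bytes (each at
 * most 255) are multiplied by the two largest coefficients, top_coeff and
 * top_coeff - 1, and the two products are added. */
int32_t max_pair_sum(int32_t top_coeff) {
    assert(top_coeff >= 1);
    return 255 * top_coeff + 255 * (top_coeff - 1);
}

/* True if v is representable as a signed 16-bit integer. */
bool fits_int16(int32_t v) {
    return v >= INT16_MIN && v <= INT16_MAX;
}

/* Idea 2: horizontal sum of eight 32-bit lanes into a 64-bit total before
 * the modulo, so the lanes together may exceed 2^32 - 1. */
uint32_t reduce_lanes_u64(const uint32_t lanes[8]) {
    uint64_t total = 0;
    for (int i = 0; i < 8; i++) {
        total += lanes[i];
    }
    return (uint32_t)(total % ADLER_MOD);
}

/* Same reduction with a 32-bit total: the sum wraps modulo 2^32 first,
 * which gives a wrong residue once the true total exceeds 2^32 - 1. */
uint32_t reduce_lanes_u32(const uint32_t lanes[8]) {
    uint32_t total = 0;
    for (int i = 0; i < 8; i++) {
        total += lanes[i]; /* may wrap */
    }
    return total % ADLER_MOD;
}

/* Idea 3: scalar model of one 32-bit lane that packs two 16-bit values.
 * Masking keeps the low half, a logical right shift keeps the high half.
 * Both halves come out zero-extended, which matches a signed widening
 * only while each 16-bit value is non-negative. */
uint32_t low_half(uint32_t lane)  { return lane & 0xFFFFu; }
uint32_t high_half(uint32_t lane) { return lane >> 16; }
```

With coefficients topping out at 32 the worst-case pair sum is 16065, which fits in a signed 16-bit lane; with 96 it is 48705, which does not. Because 16065 is non-negative, the zero-extending split in the third part agrees with a signed widening for that coefficient range.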
- Nyan 4 years ago
  
  Nice!
  > For deferring the madd specifically, using two separate sum2 vectors and splitting the `mad` vector into two
  Actually, the idea was to accumulate into 16-bit sums, and only do the madd to 32-bit every 4 loop iterations. I'm not sure splitting it up like that actually helps, since the latency can be easily hidden by an OoO processor, and it could actually be detrimental because it adds more uOps.
  One thing to note is that you've got a dependent add chain on sum2_v, so using two independent sums instead of one could help.
  > Plus I am sure there are many other opportunities to optimize this I have not thought of :)
  Other implementations I've seen don't go any further, e.g. https://github.com/zlib-ng/zlib-ng/blob/develop/arch/x86/adl... https://github.com/veluca93/fpnge/blob/9a9fc023870bacd06674f...
  So perhaps as you allude to, it isn't really worth it.
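
To make the "two independent sums" suggestion concrete, here is a minimal scalar sketch in plain C. It is not the article's AVX2 loop and `sum2_v` does not appear in it; the function name and the even/odd split are assumptions chosen only to show how one dependent add chain can be broken into two that are combined once per chunk. Whether this pays off in the real vector code would need measuring.

```c
#include <stddef.h>
#include <stdint.h>

#define ADLER_MOD 65521u
#define CHUNK 5552u /* classic zlib chunk bound; conservative for 64-bit sums */

/* Adler-32 where the per-chunk sums are split across two accumulator
 * pairs (even and odd byte positions), so consecutive additions do not
 * depend on each other. Uses the closed form for a chunk of n bytes:
 *   s1' = s1 + sum(d[i])
 *   s2' = s2 + n * s1 + sum((n - i) * d[i])
 */
uint32_t adler32_two_chains(const uint8_t *data, size_t len) {
    uint32_t s1 = 1, s2 = 0;
    while (len > 0) {
        size_t n = len < CHUNK ? len : CHUNK;
        uint64_t sum_a = 0, sum_b = 0;   /* plain byte sums */
        uint64_t wsum_a = 0, wsum_b = 0; /* position-weighted sums */
        size_t i = 0;
        for (; i + 1 < n; i += 2) {
            /* chain A: even positions */
            sum_a += data[i];
            wsum_a += (uint64_t)(n - i) * data[i];
            /* chain B: odd positions, independent of chain A */
            sum_b += data[i + 1];
            wsum_b += (uint64_t)(n - i - 1) * data[i + 1];
        }
        if (i < n) { /* odd-length chunk: one trailing byte */
            sum_a += data[i];
            wsum_a += (uint64_t)(n - i) * data[i];
        }
        /* combine the two chains once per chunk, then reduce */
        uint64_t new_s2 = s2 + (uint64_t)n * s1 + wsum_a + wsum_b;
        uint64_t new_s1 = s1 + sum_a + sum_b;
        s1 = (uint32_t)(new_s1 % ADLER_MOD);
        s2 = (uint32_t)(new_s2 % ADLER_MOD);
        data += n;
        len -= n;
    }
    return (s2 << 16) | s1;
}
```

For example, `adler32_two_chains((const uint8_t *)"Wikipedia", 9)` gives `0x11E60398`, the commonly cited checksum for that string.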
