Conversation
There are two patches here.
The first avoids recomputing constant field elements (#69 already has this, I just didn't see that PR).
The second patch tries different ways of handling overflows in the modular arithmetic, and picks a faster version for add. (One piece of the puzzle is that while the if looks scary, modern CPUs tend to have conditional move instructions, so the resulting code is actually branchless.)
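To make the pattern concrete: one common way to write such an overflow-aware addition over the prime P = 2^64 - 2^32 + 1 (the B field prime used here, judging by the constants in the disassembly below) looks like the following sketch. This is an illustration of the general pattern, not the exact code from the patch; the name add_mod_p is made up.

```rust
const P: u64 = 0xffff_ffff_0000_0001; // P = 2^64 - 2^32 + 1

// Sketch of a modular addition whose `if` can compile to a conditional
// move on x86-64: a carry means the true sum exceeded 2^64, and since
// 2^64 ≡ 2^32 - 1 (mod P), subtracting P with wraparound folds the
// carry back into the result.
pub fn add_mod_p(a: u64, b: u64) -> u64 {
    let (mut r, c) = a.overflowing_add(b);
    if c || r >= P {
        r = r.wrapping_sub(P);
    }
    r
}

fn main() {
    assert_eq!(add_mod_p(P - 1, 2), 1);         // wraps past P
    assert_eq!(add_mod_p(P - 1, P - 1), P - 2); // overflows u64
    assert_eq!(add_mod_p(123, 456), 579);       // no reduction needed
    println!("ok");
}
```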
With those two changes combined, the code is faster than v0.8.0 on an i7-6850K (so the Montgomery representation does indeed pay off):
rescue_prime_regular/hash_10/RescuePrimeRegular / Hash 10/10
time: [20.017 µs 20.045 µs 20.076 µs]
change: [-17.919% -17.781% -17.635%] (p = 0.00 < 0.05)
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
5 (5.00%) high mild
1 (1.00%) high severe
rescue_prime_regular/hash_varlen/RescuePrimeRegular / Hash Variable Length/16384
time: [35.703 ms 35.783 ms 35.877 ms]
change: [-14.785% -14.601% -14.401%] (p = 0.00 < 0.05)
Performance has improved.
Found 6 outliers among 50 measurements (12.00%)
2 (4.00%) high mild
4 (8.00%) high severe
Anyway, feel free to play around with this and cherry-pick whatever you like.
rescue_prime_regular/hash_10/RescuePrimeRegular / Hash 10/10
time: [24.362 µs 24.391 µs 24.422 µs]
change: [-35.922% -35.807% -35.704%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
rescue_prime_regular/hash_varlen/RescuePrimeRegular / Hash Variable Length/16384
time: [65.256 ms 65.293 ms 65.331 ms]
change: [-30.331% -30.268% -30.215%] (p = 0.00 < 0.05)
Performance has improved.
rescue_prime_regular/hash_10/RescuePrimeRegular / Hash 10/10
time: [20.067 µs 20.122 µs 20.193 µs]
change: [-17.079% -16.598% -15.936%] (p = 0.00 < 0.05)
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe
rescue_prime_regular/hash_varlen/RescuePrimeRegular / Hash Variable Length/16384
time: [36.210 ms 36.480 ms 36.808 ms]
change: [-44.515% -44.129% -43.708%] (p = 0.00 < 0.05)
Performance has improved.
Found 3 outliers among 50 measurements (6.00%)
3 (6.00%) high severe
So why is the code faster now in this context? It looks like in the inner loop of xlix_round(), the compiler (LLVM, I suppose) sees through the attempt to write branchless code inside add and produces a conditional jump instead!
b9a56: b8 ff ff ff ff        mov    $0xffffffff,%eax
b9a5b: 48 29 da              sub    %rbx,%rdx
b9a5e: 72 02                 jb     b9a62 <_ZN12twenty_first11shared_math20rescue_prime_regular18RescuePrimeRegular10xlix_round17he700661c81f8c263E.llvm.5740233521524750972+0xa12>
b9a60: 31 c0                 xor    %eax,%eax
b9a62:
That's terrible. In contrast, the plain if results in a conditional move:
b942d: 48 29 da              sub    %rbx,%rdx
b9430: b8 00 00 00 00        mov    $0x0,%eax
b9435: 49 0f 42 c5           cmovb  %r13,%rax    // %r13 is 0xFFFFFFFF here
No branch here, so the processor will be happy.
Interestingly, this does not happen in isolation; compiling the following module
pub fn foo(a: u64, b: u64) -> u64 {
    let (r, c) = a.overflowing_sub(b);
    r.wrapping_sub(0u32.wrapping_sub(c as u32) as u64)
}

pub fn bar(a: u64, b: u64) -> u64 {
    let (r, c) = a.overflowing_sub(b);
    if c { r.wrapping_add(0xffffffff_00000001) } else { r }
}
results in identical code for both functions, using a conditional move:
movq    %rdi, %rcx
subq    %rsi, %rcx
movabsq $-4294967295, %rax
addq    %rcx, %rax
subq    %rsi, %rdi
cmovaeq %rdi, %rax
retq
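For completeness, the two formulations really are semantically identical, not just compiled identically. A quick self-contained sanity check of the snippet (same foo and bar as above, with the literal 0xffffffff_00000001 written as a named constant P for readability):

```rust
const P: u64 = 0xffff_ffff_0000_0001;

pub fn foo(a: u64, b: u64) -> u64 {
    let (r, c) = a.overflowing_sub(b);
    r.wrapping_sub(0u32.wrapping_sub(c as u32) as u64)
}

pub fn bar(a: u64, b: u64) -> u64 {
    let (r, c) = a.overflowing_sub(b);
    if c { r.wrapping_add(P) } else { r }
}

fn main() {
    // Both compute (a - b) mod P for inputs already reduced mod P;
    // on borrow, adding P with wraparound equals subtracting 2^32 - 1.
    for &(a, b) in &[(0u64, 0u64), (3, 5), (P - 1, 1), (1, P - 1)] {
        assert_eq!(foo(a, b), bar(a, b));
    }
    assert_eq!(foo(3, 5), P - 2); // (3 - 5) mod P
    assert_eq!(foo(1, P - 1), 2); // (1 - (P - 1)) mod P
    println!("ok");
}
```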
Very, very nice. I can confirm that this gives a 100% speedup (halving the running time) on the rescue_prime_regular/hash_varlen/RescuePrimeRegular / Hash Variable Length/16384 benchmark and a 60% speedup on the merkle_tree/merkle_tree/65536 benchmark.
Is this a compiler bug that we have uncovered? And if so, do you think it's in LLVM or in rustc?
Thank you so much for uncovering this.
It's a missed optimization opportunity for sure; I find it hard to call it a bug (the code works, after all, and the missed optimization is only this impactful because this is a hot inner loop). The Rust compiler itself does not do any arithmetic optimizations, that's all LLVM. (But rustc can influence the optimization passes, so it's not completely off the hook.)
I think a minimal example of this could be interesting to both the Rust and the LLVM developers.
I quite like this way of communicating which approaches you tried and what their respective running times are.
Sword-Smith added a commit that referenced this pull request
- Avoids all branching (conditional jumps) in B field addition, see #70
- Makes the Rescue Prime round constants and matrices into const B field elements, as opposed to const u64 that had to go through `montyred` at runtime.

Co-authored-by: Ferdinand Sauer <ferdinand@neptune.cash>
Co-authored-by: Bertram Felgenhauer <int-e@gmx.de>
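The second bullet can be sketched as follows. This is a hypothetical illustration (the names to_monty and ROUND_CONSTANT are made up, and the real `montyred` reduction is not shown): the point is that the conversion into Montgomery form, x * 2^64 mod P, is expressible as a const fn, so constants can be stored as ready-to-use field elements instead of raw u64 values converted at runtime.

```rust
const P: u64 = 0xffff_ffff_0000_0001;

// Hypothetical sketch: convert a canonical u64 into Montgomery form
// (x * 2^64 mod P) at compile time. The intermediate product fits in
// u128, so plain shift-and-remainder works in a const context.
const fn to_monty(x: u64) -> u64 {
    (((x as u128) << 64) % (P as u128)) as u64
}

// A made-up constant for illustration; the real tables live in the
// Rescue Prime module.
const ROUND_CONSTANT: u64 = to_monty(12345);

fn main() {
    assert_eq!(to_monty(0), 0);
    assert_eq!(to_monty(1), 0xffff_ffff); // 2^64 mod P = 2^32 - 1
    println!("{}", ROUND_CONSTANT);
}
```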
Some snippets to perform this analysis. Full credit goes to @int-e.
# Look at the latest release build. You might want to run `cargo clean`
# followed by a benchmark for these files to be present.
> find target -name twenty_first\*
target/release/deps/twenty_first-60caca2773f2357d.d
target/release/deps/twenty_first-60caca2773f2357d
target/release/deps/twenty_first-242087d812946d75.d

# Find the relevant function in the compiled object.
> nm target/release/deps/twenty_first-60caca2773f2357d | grep xlix
00000000000b9050 t _ZN12twenty_first11shared_math20rescue_prime_regular18RescuePrimeRegular10xlix_round17he700661c81f8c263E.llvm.5740233521524750972

# Inspect the object file (compiler output) with `objdump` and look for
# expensive instructions like `jb`.
> objdump --disassemble=_ZN12twenty_first11shared_math20rescue_prime_regular18RescuePrimeRegular10xlix_round17he700661c81f8c263E.llvm.5740233521524750972 target/release/deps/twenty_first-60caca2773f2357d
[about 4000 lines of disassembly for twenty_first::shared_math::rescue_prime_regular::RescuePrimeRegular::xlix_round()]
http://paste.debian.net/plain/1265025
For isolated examples you can use
rustc -O --emit=asm --crate-type=lib <filename>
This command will not work if there are external dependencies. It outputs a *.s file containing the produced assembly.