Conversation
There are two patches here.
The first avoids recomputing constant field elements (#69 already has this, I just didn't see that PR).
The second patch tries different ways of handling overflows in the modular arithmetic, and picks a faster version for add. (One piece of the puzzle is that while the if looks scary, modern CPUs tend to have conditional move instructions, so the resulting code is actually branchless.)
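To make the pattern concrete: one common way to write such an overflow-aware addition over the prime P = 2^64 - 2^32 + 1 (the B field prime used here, judging by the constants in the disassembly below) looks like the following sketch. This is an illustration of the general pattern, not the exact code from the patch; the name add_mod_p is made up.

```rust
const P: u64 = 0xffff_ffff_0000_0001; // P = 2^64 - 2^32 + 1

// Sketch of a modular addition whose `if` can compile to a conditional
// move on x86-64: a carry means the true sum exceeded 2^64, and since
// 2^64 ≡ 2^32 - 1 (mod P), subtracting P with wraparound folds the
// carry back into the result.
pub fn add_mod_p(a: u64, b: u64) -> u64 {
    let (mut r, c) = a.overflowing_add(b);
    if c || r >= P {
        r = r.wrapping_sub(P);
    }
    r
}

fn main() {
    assert_eq!(add_mod_p(P - 1, 2), 1);         // wraps past P
    assert_eq!(add_mod_p(P - 1, P - 1), P - 2); // overflows u64
    assert_eq!(add_mod_p(123, 456), 579);       // no reduction needed
    println!("ok");
}
```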
With those two changes combined, the code is faster than v0.8.0 on an i7-6850K (so the Montgomery representation does indeed pay off):
rescue_prime_regular/hash_10/RescuePrimeRegular / Hash 10/10
time: [20.017 µs 20.045 µs 20.076 µs]
change: [-17.919% -17.781% -17.635%] (p = 0.00 < 0.05)
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
5 (5.00%) high mild
1 (1.00%) high severe
rescue_prime_regular/hash_varlen/RescuePrimeRegular / Hash Variable Length/16384
time: [35.703 ms 35.783 ms 35.877 ms]
change: [-14.785% -14.601% -14.401%] (p = 0.00 < 0.05)
Performance has improved.
Found 6 outliers among 50 measurements (12.00%)
2 (4.00%) high mild
4 (8.00%) high severe
Anyway, feel free to play around with this and cherry-pick whatever you like.
rescue_prime_regular/hash_10/RescuePrimeRegular / Hash 10/10
time: [24.362 µs 24.391 µs 24.422 µs]
change: [-35.922% -35.807% -35.704%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
rescue_prime_regular/hash_varlen/RescuePrimeRegular / Hash Variable Length/16384
time: [65.256 ms 65.293 ms 65.331 ms]
change: [-30.331% -30.268% -30.215%] (p = 0.00 < 0.05)
Performance has improved.
rescue_prime_regular/hash_10/RescuePrimeRegular / Hash 10/10
time: [20.067 µs 20.122 µs 20.193 µs]
change: [-17.079% -16.598% -15.936%] (p = 0.00 < 0.05)
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe
rescue_prime_regular/hash_varlen/RescuePrimeRegular / Hash Variable Length/16384
time: [36.210 ms 36.480 ms 36.808 ms]
change: [-44.515% -44.129% -43.708%] (p = 0.00 < 0.05)
Performance has improved.
Found 3 outliers among 50 measurements (6.00%)
3 (6.00%) high severe
So why is the code faster now in this context? It looks like in the inner loop of xlix_round(), the compiler (LLVM, I suppose) sees through the attempt to write branchless code inside add and produces a conditional jump instead!
b9a56: b8 ff ff ff ff        mov    $0xffffffff,%eax
b9a5b: 48 29 da              sub    %rbx,%rdx
b9a5e: 72 02                 jb     b9a62 <_ZN12twenty_first11shared_math20rescue_prime_regular18RescuePrimeRegular10xlix_round17he700661c81f8c263E.llvm.5740233521524750972+0xa12>
b9a60: 31 c0                 xor    %eax,%eax
b9a62:
That's terrible. In contrast, the plain if results in a conditional move:
b942d: 48 29 da              sub    %rbx,%rdx
b9430: b8 00 00 00 00        mov    $0x0,%eax
b9435: 49 0f 42 c5           cmovb  %r13,%rax    // %r13 is 0xFFFFFFFF here
No branch here, so the processor will be happy.
Interestingly, this does not happen in isolation; compiling the following module
pub fn foo(a: u64, b: u64) -> u64 {
    let (r, c) = a.overflowing_sub(b);
    r.wrapping_sub(0u32.wrapping_sub(c as u32) as u64)
}

pub fn bar(a: u64, b: u64) -> u64 {
    let (r, c) = a.overflowing_sub(b);
    if c { r.wrapping_add(0xffffffff_00000001) } else { r }
}
results in identical code for both functions, using a conditional move:
movq    %rdi, %rcx
subq    %rsi, %rcx
movabsq $-4294967295, %rax
addq    %rcx, %rax
subq    %rsi, %rdi
cmovaeq %rdi, %rax
retq
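For completeness, the two formulations really are semantically identical, not just compiled identically. A quick self-contained sanity check of the snippet (same foo and bar as above, with the literal 0xffffffff_00000001 written as a named constant P for readability):

```rust
const P: u64 = 0xffff_ffff_0000_0001;

pub fn foo(a: u64, b: u64) -> u64 {
    let (r, c) = a.overflowing_sub(b);
    r.wrapping_sub(0u32.wrapping_sub(c as u32) as u64)
}

pub fn bar(a: u64, b: u64) -> u64 {
    let (r, c) = a.overflowing_sub(b);
    if c { r.wrapping_add(P) } else { r }
}

fn main() {
    // Both compute (a - b) mod P for inputs already reduced mod P;
    // on borrow, adding P with wraparound equals subtracting 2^32 - 1.
    for &(a, b) in &[(0u64, 0u64), (3, 5), (P - 1, 1), (1, P - 1)] {
        assert_eq!(foo(a, b), bar(a, b));
    }
    assert_eq!(foo(3, 5), P - 2); // (3 - 5) mod P
    assert_eq!(foo(1, P - 1), 2); // (1 - (P - 1)) mod P
    println!("ok");
}
```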
Very, very nice. I can confirm that this gives a 100% speedup (halving the running time) on the rescue_prime_regular/hash_varlen/RescuePrimeRegular / Hash Variable Length/16384 benchmark and a 60% speedup on the merkle_tree/merkle_tree/65536 benchmark.
Is this a compiler bug that we have uncovered? And if so, do you think it's in LLVM or in rustc?
Thank you so much for uncovering this.
It's a missed optimization opportunity for sure; I find it hard to call it a bug (the code works, after all, and the missed optimization is only this impactful because this is a hot inner loop). The Rust compiler itself does not do any arithmetic optimizations, that's all LLVM. (But rustc can influence the optimization passes, so it's not completely off the hook.)
I think a minimal example of this could be interesting to both the Rust and the LLVM developers.
I quite like this way of communicating which approaches you tried and what their respective running times are.
Sword-Smith added a commit that referenced this pull request
- Avoids all branching (conditional jumps) in B field addition, see #70
- Makes the Rescue Prime round constants and matrices into const B field elements, as opposed to const u64 that had to go through `montyred` at runtime.

Co-authored-by: Ferdinand Sauer <ferdinand@neptune.cash>
Co-authored-by: Bertram Felgenhauer <int-e@gmx.de>
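The second bullet can be sketched as follows. This is a hypothetical illustration (the names to_monty and ROUND_CONSTANT are made up, and the real `montyred` reduction is not shown): the point is that the conversion into Montgomery form, x * 2^64 mod P, is expressible as a const fn, so constants can be stored as ready-to-use field elements instead of raw u64 values converted at runtime.

```rust
const P: u64 = 0xffff_ffff_0000_0001;

// Hypothetical sketch: convert a canonical u64 into Montgomery form
// (x * 2^64 mod P) at compile time. The intermediate product fits in
// u128, so plain shift-and-remainder works in a const context.
const fn to_monty(x: u64) -> u64 {
    (((x as u128) << 64) % (P as u128)) as u64
}

// A made-up constant for illustration; the real tables live in the
// Rescue Prime module.
const ROUND_CONSTANT: u64 = to_monty(12345);

fn main() {
    assert_eq!(to_monty(0), 0);
    assert_eq!(to_monty(1), 0xffff_ffff); // 2^64 mod P = 2^32 - 1
    println!("{}", ROUND_CONSTANT);
}
```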
Some snippets to perform this analysis. Full credit goes to @int-e.
# Look at the latest release build. You might want to run `cargo clean`
# followed by a benchmark for these files to be present.
> find target -name twenty_first\*
target/release/deps/twenty_first-60caca2773f2357d.d
target/release/deps/twenty_first-60caca2773f2357d
target/release/deps/twenty_first-242087d812946d75.d

# Find the relevant function in the compiled object.
> nm target/release/deps/twenty_first-60caca2773f2357d | grep xlix
00000000000b9050 t _ZN12twenty_first11shared_math20rescue_prime_regular18RescuePrimeRegular10xlix_round17he700661c81f8c263E.llvm.5740233521524750972

# Inspect the object file (compiler output) with `objdump` and look for
# expensive instructions like `jb`.
> objdump --disassemble=_ZN12twenty_first11shared_math20rescue_prime_regular18RescuePrimeRegular10xlix_round17he700661c81f8c263E.llvm.5740233521524750972 target/release/deps/twenty_first-60caca2773f2357d
[about 4000 lines of disassembly for twenty_first::shared_math::rescue_prime_regular::RescuePrimeRegular::xlix_round()]
http://paste.debian.net/plain/1265025
For isolated examples you can use
rustc -O --emit=asm --crate-type=lib <filename>
This command will not work if there are external dependencies. It outputs a *.s file containing the produced assembly.