Rust zero-cost abstractions vs. SIMD
The real pitfall is overhead in the standard memory allocator. On ARMv8-A, I bypassed it entirely for my audit engine. Result: 85 ns latency for 10.8T data points on a $100 board. Since the numbers look 'impossible', I recorded the memory profiler and benchmarks as proof. See the video here
Actually, the bottleneck wasn't the I/O; it was the context switching. If anyone wants the specific memory-map addresses I used for the ARMv8-A bypass, let me know.
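For readers unfamiliar with what "bypassing the standard allocator" looks like in Rust, here is a minimal sketch of the general technique: replacing the global allocator with a bump allocator over a fixed arena. This is a generic illustration, not the commenter's actual code; the type names and arena size are made up, and no platform-specific memory-map addresses are assumed.

```rust
use std::alloc::{GlobalAlloc, Layout};
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};

const ARENA_SIZE: usize = 1 << 20; // 1 MiB arena, chosen arbitrarily

// A bump allocator: allocation is a single atomic pointer advance,
// and `dealloc` is a no-op (the whole arena is reclaimed at exit).
struct BumpAlloc {
    arena: UnsafeCell<[u8; ARENA_SIZE]>,
    next: AtomicUsize, // offset of the next free byte
}

// Safe to share: `next` is atomic and handed-out regions never overlap.
unsafe impl Sync for BumpAlloc {}

unsafe impl GlobalAlloc for BumpAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let base = self.arena.get() as usize;
        loop {
            let cur = self.next.load(Ordering::Relaxed);
            // Round the absolute address up to the requested alignment.
            let aligned = (base + cur + layout.align() - 1) & !(layout.align() - 1);
            let new_next = aligned - base + layout.size();
            if new_next > ARENA_SIZE {
                return std::ptr::null_mut(); // arena exhausted
            }
            if self
                .next
                .compare_exchange(cur, new_next, Ordering::Relaxed, Ordering::Relaxed)
                .is_ok()
            {
                return aligned as *mut u8;
            }
        }
    }

    unsafe fn dealloc(&self, _ptr: *mut u8, _layout: Layout) {
        // Intentionally empty: bump allocators never free individual blocks.
    }
}

#[global_allocator]
static ALLOC: BumpAlloc = BumpAlloc {
    arena: UnsafeCell::new([0; ARENA_SIZE]),
    next: AtomicUsize::new(0),
};

// Hypothetical workload: every heap allocation (the Vec's buffer) now
// comes from the bump arena instead of the system allocator.
fn checksum(n: u64) -> u64 {
    let v: Vec<u64> = (0..n).collect();
    v.iter().sum()
}

fn main() {
    println!("checksum = {}", checksum(1000));
}
```

The trade-off is the usual one: allocation becomes a few instructions with no syscalls or free-list bookkeeping, at the cost of never reclaiming memory until the process exits, which suits short-lived batch jobs more than long-running servers.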
Sounds like the cost isn't really in the abstraction, but in implementing the merge-tree traversal so that it produced one value at a time instead of a batch, which presumably wastes more computation in total. I doubt they'd have gotten better codegen by inlining their `next()` into the loop consuming the values. Conversely, an `Iterator` over the merge tree that internally produces a batch and then yields from it would likely perform about the same as their current code, since it's thin enough that I'd expect it to be inlined.
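The batching `Iterator` described above can be sketched roughly like this: a k-way merge over a min-heap that refills an internal buffer of `BATCH` values per heap traversal, then yields from the buffer. All names (`BatchedMerge`, `BATCH`) and the heap-based merge are my own illustration under stated assumptions, not the article's actual code.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

const BATCH: usize = 64; // batch size chosen arbitrarily

struct BatchedMerge<I: Iterator<Item = u64>> {
    // Min-heap of (value, source index); `Reverse` flips the max-heap.
    heap: BinaryHeap<Reverse<(u64, usize)>>,
    sources: Vec<I>,
    buf: Vec<u64>, // internal batch, refilled in one heap traversal
    pos: usize,    // cursor into `buf`
}

impl<I: Iterator<Item = u64>> BatchedMerge<I> {
    fn new(mut sources: Vec<I>) -> Self {
        let mut heap = BinaryHeap::new();
        for (i, s) in sources.iter_mut().enumerate() {
            if let Some(v) = s.next() {
                heap.push(Reverse((v, i)));
            }
        }
        Self { heap, sources, buf: Vec::with_capacity(BATCH), pos: 0 }
    }

    /// Pop up to BATCH values from the merge in one go, so the heap
    /// bookkeeping is amortized over many yielded items.
    fn refill(&mut self) {
        self.buf.clear();
        self.pos = 0;
        while self.buf.len() < BATCH {
            let Some(Reverse((v, i))) = self.heap.pop() else { break };
            self.buf.push(v);
            if let Some(next) = self.sources[i].next() {
                self.heap.push(Reverse((next, i)));
            }
        }
    }
}

impl<I: Iterator<Item = u64>> Iterator for BatchedMerge<I> {
    type Item = u64;

    // Thin enough that the common path (buffer hit) should inline into
    // the consuming loop, which is the point made above.
    #[inline]
    fn next(&mut self) -> Option<u64> {
        if self.pos == self.buf.len() {
            self.refill();
            if self.buf.is_empty() {
                return None;
            }
        }
        let v = self.buf[self.pos];
        self.pos += 1;
        Some(v)
    }
}

fn main() {
    let a = vec![1u64, 4, 7].into_iter();
    let b = vec![2u64, 5, 8].into_iter();
    let c = vec![3u64, 6, 9].into_iter();
    let merged: Vec<u64> = BatchedMerge::new(vec![a, b, c]).collect();
    println!("{:?}", merged);
}
```

The consumer still sees a plain `Iterator`, so nothing downstream changes; only the amortization of the merge-tree traversal does.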
Can we please encourage variable-width fonts for text, fixed-width fonts for code? It improves readability.