Show HN: I built a CSV parser to try Go 1.26's new SIMD package

8 points by tokkyokky 3 months ago · 5 comments · 1 min read

Hey HN,

This is a CSV parser built on Go 1.26's experimental simd/archsimd package.

I wanted to see what the new SIMD API looks like in practice. CSV parsing is mostly "find these bytes in a buffer"—load 64 bytes, compare, get a bitmask of positions. The interesting part was handling chunk boundaries correctly (quotes and line endings can split across chunks).
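To make the "compare, get a bitmask of positions" idea concrete, here is a minimal scalar sketch in plain Go. It does not use simd/archsimd and is not taken from go-simdcsv; the names byteMask and positions are hypothetical. A SIMD version would produce the same 64-bit mask with a single vector compare instead of a loop.

```go
package main

import (
	"fmt"
	"math/bits"
)

// byteMask returns a 64-bit mask with bit i set when chunk[i] == target.
// Scalar stand-in for a vector compare; only the shape of the result matters.
func byteMask(chunk *[64]byte, target byte) uint64 {
	var mask uint64
	for i, b := range chunk {
		if b == target {
			mask |= uint64(1) << uint(i)
		}
	}
	return mask
}

// positions expands a mask into byte offsets, lowest offset first.
func positions(mask uint64) []int {
	var out []int
	for mask != 0 {
		out = append(out, bits.TrailingZeros64(mask))
		mask &= mask - 1 // clear the lowest set bit
	}
	return out
}

func main() {
	var chunk [64]byte
	copy(chunk[:], "a,b,c\n1,2,3\n")
	fmt.Println(positions(byteMask(&chunk, ',')))  // [1 3 7 9]
	fmt.Println(positions(byteMask(&chunk, '\n'))) // [5 11]
}
```

Iterating set bits with TrailingZeros64 means the cost scales with the number of delimiters in the chunk, not with the chunk size.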

- Drop-in replacement for encoding/csv
- ~20% faster for unquoted data on AVX-512
- Quoted data is slower (still optimizing)
- Scalar fallback for non-AVX-512

Requires GOEXPERIMENT=simd.

https://github.com/nnnkkk7/go-simdcsv

Feedback on edge cases or the SIMD implementation welcome.

peymo 3 months ago

Oh, this is really cool! I didn't know Go had added this!

I went on a similar adventure but in Zig. Since I had to prepare a benchmarking suite, I put out one in case anyone needs it. If you think it might be helpful, give it a go: https://github.com/peymanmortazavi/csv-race

In my testing, using 64-byte (512-bit) vectors, even where available, actually degraded performance. I also had to tune the vector width for different CPUs. For instance, on Apple I could go much wider, but on my CPU going to 64 bytes (512 bits) made things slower.

Another thing I explored was iterating over fields rather than records. This lets you avoid copying and dynamic memory allocation entirely, which should give you a pretty decent boost. You can add utility wrappers to match Go's record-based iteration when necessary.

Just some thoughts, but congrats on this!
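A rough Go sketch of the field-based idea, assuming unquoted input and "\n" line endings only. The helpers eachField and records are hypothetical names, not part of go-simdcsv or encoding/csv. Each field is handed to the callback as a sub-slice of the input, so nothing is copied until a caller asks for whole records.

```go
package main

import "fmt"

// eachField calls fn once per field. The field slice aliases data (no copy);
// endOfRecord is true for the last field of a line.
// Assumption: no quoting and no "\r\n" handling, to keep the sketch short.
func eachField(data []byte, fn func(field []byte, endOfRecord bool)) {
	start := 0
	for i, b := range data {
		if b == ',' || b == '\n' {
			fn(data[start:i], b == '\n')
			start = i + 1
		}
	}
	// Final field when the input does not end with a newline,
	// including an empty field after a trailing comma.
	if start < len(data) || (len(data) > 0 && data[len(data)-1] == ',') {
		fn(data[start:], true)
	}
}

// records rebuilds record-based iteration on top of eachField.
// This wrapper is the only place that allocates.
func records(data []byte) [][]string {
	var out [][]string
	var cur []string
	eachField(data, func(f []byte, end bool) {
		cur = append(cur, string(f))
		if end {
			out = append(out, cur)
			cur = nil
		}
	})
	return out
}

func main() {
	data := []byte("a,b\n1,2\n")
	eachField(data, func(f []byte, end bool) {
		fmt.Printf("%q end=%v\n", f, end)
	})
	fmt.Println(records(data)) // [[a b] [1 2]]
}
```

The callback must not retain the field slice past the lifetime of the input buffer, which is the usual trade-off of zero-copy iteration.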

juliusgeo 3 months ago

This is super cool! If I'm understanding your implementation correctly, you run bit-by-bit state machine logic to check whether quotes should be escaped, etc. You can do that in a single pass by using a carry-less polynomial multiplication instruction (_mm_clmulepi64_si128, part of the PCLMULQDQ extension on x86, I believe), or by just computing the prefix XOR directly on the quote mask and then ANDing the inverse with the bitmask for quotes. Simdjson uses this trick, and I use it as well in my Rust SIMD CSV parser:

https://github.com/juliusgeo/csimdv-rs/blob/681df3b036f30c5a...

This is a good write-up on how the approach works: https://nullprogram.com/blog/2021/12/04/
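For readers who have not seen the trick, here is a small Go sketch of the prefix-XOR variant, written without CLMUL. The function names are hypothetical and this is not code from simdjson, csimdv-rs, or go-simdcsv. As one assumed way of using the result, the inverse of the in-quotes mask is ANDed with a comma mask so that only commas outside quoted fields remain.

```go
package main

import "fmt"

// prefixXOR returns a mask where bit i is the XOR of bits 0..i of x.
// On a quote mask, bit i is set when an odd number of quotes has been
// seen up to and including position i, i.e. the position is inside quotes.
func prefixXOR(x uint64) uint64 {
	x ^= x << 1
	x ^= x << 2
	x ^= x << 4
	x ^= x << 8
	x ^= x << 16
	x ^= x << 32
	return x
}

// maskOf sets bit i when chunk[i] == target (chunk must be <= 64 bytes).
func maskOf(chunk []byte, target byte) uint64 {
	var m uint64
	for i, b := range chunk {
		if b == target {
			m |= uint64(1) << uint(i)
		}
	}
	return m
}

// structuralCommas returns the commas that are not inside a quoted field.
func structuralCommas(chunk []byte) uint64 {
	inside := prefixXOR(maskOf(chunk, '"'))
	return maskOf(chunk, ',') &^ inside
}

func main() {
	chunk := []byte(`a,"b,c",d`)
	fmt.Printf("%#x\n", prefixXOR(maskOf(chunk, '"'))) // 0x3c
	fmt.Printf("%#x\n", structuralCommas(chunk))       // 0x82
}
```

The six shift-and-XOR steps replace a per-bit loop, so the cost is constant per 64-byte chunk regardless of how many quotes it contains.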

tokkyokky (OP) 3 months ago

Thanks for the tip! Your comment prompted me to refactor the quote handling - replaced the bit-by-bit state machine loop with prefix XOR, and switched to adjacent bit masking for double-quote detection. Seeing a nice performance improvement in benchmarks. Go's simd/archsimd doesn't have CLMUL yet, but the XOR cascade works well. Appreciate your feedback!

zigzag312 3 months ago

Benchmark comparison with C# SIMD optimized CSV parser [1] would be fun to see.

[1] https://github.com/nietras/Sep

tokkyokky (OP) 3 months ago

Oh, nice! I’ll try to do it!!
