Show HN: Safely batch byte-pair merges to speed up tokenization in Python 30X

colab.research.google.com

1 point by alexandermorgan 2 years ago · 4 comments

yorwba 2 years ago

For further optimization, custom index data structures à la https://en.wikipedia.org/wiki/Re-Pair would likely help.

  • alexandermorgan (OP) 2 years ago

    Thanks for sharing, I wasn't aware of this. I'm having trouble seeing how it differs from the byte-pair encoding algorithm. More importantly though, how can it be linear time when it's recursive and you have to tally the counts of each pair again after each merge?

    • yorwba 2 years ago

      I think the main difference is in the intellectual lineage. Re-Pair was designed as a compression algorithm where you keep merging until you're left with a single token and then use the sequence of merges as a compressed representation of the original string. But you can also stop it partway through and treat it as a particularly clever byte-pair encoding implementation.

      The algorithm is recursive in the sense that merged tokens can participate in further merges just as in byte-pair encoding, but it involves a whole lot of pointer-chasing, so the core is inherently quite iterative. Those pointers allow skipping over all tokens which are unaffected by a merge, so updating the counts is very cheap. You can imagine each initial token being equipped with a fixed compute budget, and merging two tokens uses up the compute budget of one token, but the compute budget for the other token remains for the merge result to use. So the overall time is bounded by (compute budget per token) x (number of tokens).
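The amortized argument above can be sketched in code: keep the tokens in a doubly-linked list so that merging a pair splices out one node and only the counts of the two immediately neighboring pairs need updating. This is a minimal illustration of that local-update idea, not Re-Pair's full machinery (which also keeps per-pair occurrence lists and a priority queue to jump straight to each occurrence instead of rescanning); all names here are made up for the sketch.

```python
from collections import Counter

def bpe_merge_linked(tokens, num_merges):
    # Doubly-linked list via index arrays: nxt[i]/prv[i] are neighbor
    # indices, -1 means "no neighbor". Spliced-out nodes are marked dead.
    toks = list(tokens)
    nxt = list(range(1, len(toks))) + [-1]
    prv = [-1] + list(range(len(toks) - 1))
    alive = [True] * len(toks)

    # Initial tally of adjacent pairs (done once, O(n)).
    counts = Counter()
    for i in range(len(toks) - 1):
        counts[(toks[i], toks[i + 1])] += 1

    for _ in range(num_merges):
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 1:
            break
        merged = pair[0] + pair[1]
        i = 0  # head node is never spliced out, so 0 stays valid
        while i != -1:
            j = nxt[i]
            if j != -1 and (toks[i], toks[j]) == pair:
                # Remove the merged pair and the two pairs that
                # bordered it from the counts...
                counts[pair] -= 1
                if prv[i] != -1:
                    counts[(toks[prv[i]], toks[i])] -= 1
                if nxt[j] != -1:
                    counts[(toks[j], toks[nxt[j]])] -= 1
                # ...splice node j out, folding it into node i...
                toks[i] = merged
                alive[j] = False
                nxt[i] = nxt[j]
                if nxt[j] != -1:
                    prv[nxt[j]] = i
                # ...and add back the two new bordering pairs. Only
                # these local counts change, so the update is O(1).
                if prv[i] != -1:
                    counts[(toks[prv[i]], toks[i])] += 1
                if nxt[i] != -1:
                    counts[(toks[i], toks[nxt[i]])] += 1
            i = nxt[i]
        counts = +counts  # drop entries that fell to zero

    return [toks[i] for i in range(len(toks)) if alive[i]]
```

For example, `bpe_merge_linked(list("abab"), 1)` yields `['ab', 'ab']`, and a second merge collapses that to `['abab']`. The scan over the list makes each merge round O(n) here; the pointer structures the comment describes exist precisely to skip that scan and touch only the occurrences of the winning pair.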
