GitHub - kuprel/minbpe-pytorch: Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization, with PyTorch/CUDA

minbpe-pytorch

Minimal, clean code for the (byte-level) Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization. The BPE algorithm is "byte-level" because it runs on UTF-8 encoded strings.

This adds PyTorch/CUDA training and encoding support to Andrej Karpathy's minbpe. It takes 67.4 seconds on an H100 with SXM5 to train the BasicTokenizer with a vocab_size of 512 on 308MB of Enron emails. The original code takes 2hrs 15min on an M2 Air with Python 3.11 to do this. That is a 120x speedup.

quick start

Install requirements:

$ pip install -r requirements.txt

Download Enron emails and save a 308MB text file to tests/enron.txt:

$ python get_enron_emails.py

Train a BasicTokenizer on the large text file:

The model will be saved in the models directory.

tests

The pytest library is used for tests. All of them are located in the tests/ directory. First pip install pytest, then:

todo

Speed up encode method for RegexTokenizer
Implement train method for RegexTokenizer
Support MPS device for MacBooks, currently breaks for torch.unique

License

MIT