# rustbpe

The missing tiktoken training code
A lightweight Rust library for training GPT-style BPE tokenizers. The tiktoken library is excellent for inference but doesn't support training. The HuggingFace tokenizers library supports training, but carries significant complexity from years of accumulated tokenizer variants. My minbpe library handles both training and inference, but it is pure Python and not optimized for speed.
rustbpe fills this gap: a simple, efficient BPE training implementation in Rust with Python bindings. Train your tokenizer with rustbpe, then export to tiktoken for fast inference.
## Features
- Fast training with parallel processing (rayon)
- GPT-4 style regex pre-tokenization by default
- Direct export to tiktoken format
- Python bindings via PyO3
- Batch encoding with automatic parallelization
## Installation

### Python

From source:
```bash
git clone https://github.com/karpathy/rustbpe.git
cd rustbpe
uv venv && source .venv/bin/activate
uv pip install maturin
maturin develop --release
```
## Usage

### Training
```python
import rustbpe

# Create tokenizer and train on your data
tokenizer = rustbpe.Tokenizer()
tokenizer.train_from_iterator(
    ["your", "training", "texts", "here"],
    vocab_size=4096
)

# Encode and decode
ids = tokenizer.encode("hello world")
text = tokenizer.decode(ids)  # "hello world"

# Check vocabulary size
print(tokenizer.vocab_size)  # 4096

# Batch encode (parallel)
all_ids = tokenizer.batch_encode(["text one", "text two", "text three"])
```
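Because `train_from_iterator` accepts any iterator of strings, you can also stream a large corpus from disk instead of materializing it as a list. A minimal sketch (the `corpus.txt` path is a placeholder; `buffer_size` controls how many strings are buffered per batch):

```python
import rustbpe

def lines(path):
    # Yield the corpus one line at a time so it never has to fit in memory
    with open(path, encoding="utf-8") as f:
        yield from f

tokenizer = rustbpe.Tokenizer()
tokenizer.train_from_iterator(lines("corpus.txt"), vocab_size=4096, buffer_size=8192)
```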
### Export to tiktoken

The main use case: train with rustbpe, then run fast inference with tiktoken.
```python
import rustbpe
import tiktoken

# Train
tokenizer = rustbpe.Tokenizer()
tokenizer.train_from_iterator(open("corpus.txt"), vocab_size=8192)

# Export to tiktoken
enc = tiktoken.Encoding(
    name="my_tokenizer",
    pat_str=tokenizer.get_pattern(),
    mergeable_ranks={bytes(k): v for k, v in tokenizer.get_mergeable_ranks()},
    special_tokens={},
)

# Fast inference with tiktoken
ids = enc.encode("hello world")
text = enc.decode(ids)
```
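Since the whole point is that the two implementations agree, a quick sanity check after exporting is cheap insurance. A sketch continuing from the snippet above, assuming both `encode` methods return plain lists of token IDs:

```python
sample = "The quick brown fox jumps over the lazy dog"
# rustbpe and the exported tiktoken encoding should tokenize identically
assert enc.encode(sample) == tokenizer.encode(sample)
assert enc.decode(enc.encode(sample)) == sample
```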
### Custom regex pattern

By default, rustbpe uses the GPT-4 tokenization pattern. You can provide your own:
```python
tokenizer.train_from_iterator(
    texts,
    vocab_size=4096,
    pattern=r"[a-zA-Z]+|[0-9]+|\s+"  # custom pattern
)
```
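After training, `get_pattern()` returns whatever pattern was actually used, which makes it easy to confirm an override took effect. A sketch (`texts` is the same placeholder iterable as above):

```python
custom = r"[a-zA-Z]+|[0-9]+|\s+"
tokenizer = rustbpe.Tokenizer()
tokenizer.train_from_iterator(texts, vocab_size=4096, pattern=custom)
assert tokenizer.get_pattern() == custom
```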
## API Reference

### Tokenizer
| Method | Description |
|---|---|
| `Tokenizer()` | Create a new tokenizer |
| `train_from_iterator(texts, vocab_size, buffer_size=8192, pattern=None)` | Train on an iterator of strings |
| `encode(text)` | Encode a string to token IDs |
| `decode(ids)` | Decode token IDs back to a string |
| `batch_encode(texts)` | Encode multiple strings in parallel |
| `vocab_size` | Property: vocabulary size (256 + number of merges) |
| `get_pattern()` | Get the regex pattern used for pre-tokenization |
| `get_mergeable_ranks()` | Get token bytes and ranks for tiktoken export |
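A short end-to-end exercise of this API (a sketch; the training data is a throwaway placeholder):

```python
import rustbpe

tokenizer = rustbpe.Tokenizer()
tokenizer.train_from_iterator(["hello world, hello tokens"] * 1000, vocab_size=300)

# vocab_size is the 256 base byte tokens plus the number of learned merges
print(tokenizer.vocab_size)

# encode/decode should round-trip any input exactly
text = "hello world"
assert tokenizer.decode(tokenizer.encode(text)) == text
```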
## Development

### Prerequisites
- Rust: https://rustup.rs/
- uv: `curl -LsSf https://astral.sh/uv/install.sh | sh`
### Setup
```bash
git clone https://github.com/karpathy/rustbpe.git
cd rustbpe
uv venv && source .venv/bin/activate
uv pip install maturin pytest
maturin develop
```
### Running tests
```bash
# Rust tests (fast, tests core algorithm)
cargo test

# Python tests (requires `maturin develop` first)
pytest tests/python/ -v -s

# Both
cargo test && pytest tests/python/ -v
```
### Project structure
```
rustbpe/
├── Cargo.toml          # Rust package manifest
├── pyproject.toml      # Python package manifest
├── src/
│   └── lib.rs          # Rust implementation + PyO3 bindings + tests
└── tests/
    └── python/
        └── test_tokenizer.py
```
## How BPE works

Byte Pair Encoding (BPE) builds a vocabulary iteratively:
1. Start with 256 byte-level tokens (0x00-0xff)
2. Count all adjacent token pairs in the corpus
3. Merge the most frequent pair into a new token
4. Repeat until the target vocabulary size is reached
The result is a vocabulary that efficiently represents common patterns while being able to encode any input.
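For reference, here is the loop above as a minimal, unoptimized pure-Python sketch. It skips the regex pre-tokenization step (real training first splits the corpus with the pattern and only counts pairs within each chunk) and is for illustration only; the actual implementation lives in `src/lib.rs`:

```python
from collections import Counter

def train_bpe(text: str, vocab_size: int) -> dict[tuple[int, int], int]:
    ids = list(text.encode("utf-8"))        # step 1: raw bytes, tokens 0..255
    merges = {}                             # (left, right) -> new token id
    for new_id in range(256, vocab_size):
        pairs = Counter(zip(ids, ids[1:]))  # step 2: count adjacent pairs
        if not pairs:
            break                           # nothing left to merge
        pair = max(pairs, key=pairs.get)    # step 3: pick the most frequent
        merges[pair] = new_id
        # replace every occurrence of `pair` with the new token
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out                           # step 4: repeat on the merged ids
    return merges
```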
## LLM Assistance note
I wrote the Python reference code personally and from scratch; I am an expert there and understand it fully. I then wrote the Rust code against this reference implementation, with tests for equality between the two. However, I am not a Rust developer by background, so I had significant help from ChatGPT and Claude Code Opus 4.5. All the equality tests pass as far as I am aware, but I apologize if some of the Rust code is not properly arranged, structured, or implemented. Please let me know in Issues/PRs if so, and I am happy to adjust the code.
## License
MIT