modernbert.c
A simple implementation of ModernBERT, in pure C. Inspired by llama2.c.
To keep it minimal, I hard-coded the ModernBERT architecture into one file, supporting inference only. You can load the base model weights or fine-tuned weights. Currently, only token classification is supported as a downstream task.
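For context, token classification here just means putting a small linear head on top of each token's final hidden state. A minimal sketch of that step (illustrative only; the names and layout are assumptions, not the repo's actual code):

// Per-token classification head: logits = hidden * W^T + b.
// hidden: [n_tokens x d_model] encoder outputs (row-major)
// W:      [n_labels x d_model] classifier weights
// b:      [n_labels]           classifier bias
// logits: [n_tokens x n_labels] one score per label, per token
void tokclf_head(const float *hidden, const float *W, const float *b,
                 float *logits, int n_tokens, int d_model, int n_labels) {
    for (int t = 0; t < n_tokens; t++) {
        for (int l = 0; l < n_labels; l++) {
            float acc = b[l];
            for (int d = 0; d < d_model; d++) {
                acc += hidden[t * d_model + d] * W[l * d_model + d];
            }
            logits[t * n_labels + l] = acc;
        }
    }
}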
run
Clone this repository:
git clone https://github.com/hardik-vala/modernbert.c.git
Then, open the repository folder:
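cd modernbert.c  # or whatever directory name you cloned into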
Install Python dependencies:
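pip install -r requirements.txt  # assuming a requirements.txt; use the repo's own install target if it provides one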
Export the tokenizer binary, which downloads + outputs the vocabulary, merges, and metadata into a format that can be easily loaded by the C file:
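make export-tokenizer  # assumed target name, by analogy with the model export targets below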
Similarly, export the model weights:
# answerdotai/ModernBERT-base
make export-model-base
# OR
# ai4privacy/llama-ai4privacy-english-anonymiser-openpii, for a token classification example
make export-model-tokclf
This export will take some time, since it downloads the model weights from Hugging Face and converts them. For ModernBERT-base, expect a ~570MB output file.
Compile the C code:
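make compile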
Run the C code:
./run model/tokclf.bin tokenizer/tokenizer.bin "hello world"
# for token classification
./run --tokclf --n_labels 3 model/tokclf.bin tokenizer/tokenizer.bin "hello world"
models
You can load any Hugging Face model that uses the ModernBERT architecture. See the model/export.py script. (Warning: this repo has not been tested with ModernBERT-large or any of its derivatives.)
tokenizer
The tokenizer implementation tokenizer/tokenizer.c is a crude approximation of Hugging Face's BPE-based PreTrainedTokenizer. It works for ~80% of cases, but misses a lot of edge cases.
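To give a feel for the core of BPE: starting from individual characters, the tokenizer repeatedly merges the adjacent pair with the best rank in the merges table until no known merge applies. A stripped-down sketch of that greedy loop (illustrative only, with a toy merge table; not the actual tokenizer/tokenizer.c code):

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Toy merge table standing in for the real merges file (lower index = higher priority).
static const char *MERGES[][2] = { {"l", "l"}, {"ll", "o"}, {"h", "e"}, {"he", "llo"} };
static const int N_MERGES = 4;

// Rank of merging the pair (a, b); -1 means the pair is not in the merge table.
static int merge_rank(const char *a, const char *b) {
    for (int i = 0; i < N_MERGES; i++)
        if (strcmp(MERGES[i][0], a) == 0 && strcmp(MERGES[i][1], b) == 0)
            return i;
    return -1;
}

// Greedy BPE: repeatedly merge the best-ranked adjacent pair until none applies.
static void bpe_merge(char **tokens, int *n) {
    for (;;) {
        int best = -1, best_rank = INT_MAX;
        for (int i = 0; i + 1 < *n; i++) {
            int r = merge_rank(tokens[i], tokens[i + 1]);
            if (r >= 0 && r < best_rank) { best_rank = r; best = i; }
        }
        if (best < 0) break;  // no known merge left
        char *merged = malloc(strlen(tokens[best]) + strlen(tokens[best + 1]) + 1);
        strcpy(merged, tokens[best]);
        strcat(merged, tokens[best + 1]);
        free(tokens[best]);
        free(tokens[best + 1]);
        tokens[best] = merged;
        for (int i = best + 1; i + 1 < *n; i++) tokens[i] = tokens[i + 1];
        (*n)--;
    }
}

int main(void) {
    const char *chars[] = { "h", "e", "l", "l", "o" };
    int n = 5;
    char **tokens = malloc(n * sizeof(char *));
    for (int i = 0; i < n; i++) tokens[i] = strdup(chars[i]);
    bpe_merge(tokens, &n);                      // merges down to a single token: "hello"
    for (int i = 0; i < n; i++) printf("%s\n", tokens[i]);
    for (int i = 0; i < n; i++) free(tokens[i]);
    free(tokens);
    return 0;
}

The real tokenizer also has to deal with pre-tokenization (whitespace/punctuation splitting), byte-level handling, and special tokens, which is presumably where many of the missed edge cases come from.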
performance (cpu only)
The default make compile command currently applies the -O3 optimization level, which includes optimizations that are expensive in terms of compile time and memory usage. You can expect token throughput of > 1200 tokens/s. I include this rough figure only as a point of reference, because there are caveats:
- Since ModernBERT is an encoder model, it doesn't decode output one token at a time: a single forward pass produces outputs for all input tokens, with no auto-regression. So token throughput here is not apples-to-apples with most LLMs out there, which are decoder-only.
- Runtime still scales with the number of input tokens, so with longer inputs, time-to-first-token is larger.
The bulk of the performance comes from the OpenBLAS library and its highly optimized matrix multiplications.
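To make that concrete, a linear layer's matmul can be handed off to OpenBLAS as a single sgemm call. A sketch (the actual layout and call sites in this repo may differ):

#include <cblas.h>

// Y = X * W^T, all row-major:
//   X [n x k] activations, W [m x k] weights, Y [n x m] output.
void linear_sgemm(const float *X, const float *W, float *Y, int n, int k, int m) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                n, m, k,
                1.0f, X, k,
                      W, k,
                0.0f, Y, m);
}

Link with -lopenblas (or your platform's BLAS) when compiling.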
You can try compiling with make compilefast. This turns on the -Ofast flag, which enables additional optimizations on top of -O3 that may break compliance with the C/IEEE specifications. See the GCC docs for more information. But I didn't see much of a difference.
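For reference, the two build modes differ mainly in the optimization flag passed to the compiler; roughly (the main source file name is assumed here, and the actual Makefile may add other flags):

gcc -O3 -o run run.c -lm -lopenblas      # ~ make compile
gcc -Ofast -o run run.c -lm -lopenblas   # ~ make compilefast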
license
MIT