Run Microsoft BitNet b1.58 ternary LLMs with WebGPU — in browsers and native apps.
Live Chat Demo · TL;DR Widget Demo · Getting Started · API Reference
0xBitNet runs BitNet b1.58 ternary LLMs on WebGPU. Custom WGSL compute kernels handle the ternary matrix operations, with bindings for TypeScript, Rust, and Python. Works in browsers, Node.js, and native apps.
Highlights
- Pure WebGPU — Custom WGSL kernels for ternary matrix operations (no WASM, no server)
- Multi-language — TypeScript (
0xbitnet), Rust (oxbitnet), Python (oxbitnet), Swift (OxBitNet), Java/Android (oxbitnet-java), C (oxbitnet-ffi) - Cross-platform — Browsers, Node.js, Deno, native apps via wgpu
- Chat templates — Built-in LLaMA 3 chat message formatting
- Automatic caching — IndexedDB (browser) / disk cache (native)
- Streaming — Token-by-token output via async generators / streams / callbacks
Quick Start
TypeScript / JavaScript
import { BitNet } from "0xbitnet"; const model = await BitNet.load( "https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf/resolve/main/ggml-model-i2_s.gguf", { onProgress: (p) => console.log(`${p.phase}: ${(p.fraction * 100).toFixed(1)}%`) } ); for await (const token of model.generate("The meaning of life is")) { process.stdout.write(token); } model.dispose();
Rust
use oxbitnet::BitNet; use futures::StreamExt; let mut model = BitNet::load("model.gguf", Default::default()).await?; let mut stream = model.generate("Hello!", Default::default()); while let Some(token) = stream.next().await { print!("{token}"); } model.dispose();
Python
from oxbitnet import BitNet model = BitNet.load_sync("model.gguf") model.chat( [("system", "You are a helpful assistant."), ("user", "Hello!")], on_token=lambda t: print(t, end="", flush=True), temperature=0.7, ) model.dispose()
Swift
import OxBitNet let model = try await BitNet.load("model.gguf") for try await token in model.chat([.user("Hello!")], options: .init(temperature: 0.7)) { print(token, terminator: "") } model.dispose()
Java
import io.github.m96chan.oxbitnet.*; import java.util.List; try (BitNet model = BitNet.loadSync("model.gguf")) { model.chat( List.of(new ChatMessage("user", "Hello!")), token -> { System.out.print(token); return true; }, new GenerateOptions().temperature(0.7f) ); }
C / FFI
#include "oxbitnet.h" static int32_t on_token(const char *token, uintptr_t len, void *userdata) { fwrite(token, 1, len, stdout); return 0; /* 0 = continue, non-zero = stop */ } int main(void) { OxBitNet *model = oxbitnet_load("model.gguf", NULL); OxBitNetChatMessage messages[] = { { .role = "user", .content = "Hello!" }, }; OxBitNetGenerateOptions opts = oxbitnet_default_generate_options(); oxbitnet_chat(model, messages, 1, &opts, on_token, NULL); oxbitnet_free(model); }
Chat Messages (TypeScript)
const messages = [ { role: "system" as const, content: "You are a helpful assistant." }, { role: "user" as const, content: "Explain quantum computing in one sentence." }, ]; for await (const token of model.generate(messages, { maxTokens: 128, temperature: 0.7 })) { process.stdout.write(token); }
Supported Models
| Model | GGUF | Parameters | VRAM |
|---|---|---|---|
| BitNet b1.58 2B-4T | ggml-model-i2_s.gguf | 2B | ~1.5 GB |
| Falcon-E 1B Instruct | ggml-model-i2_s.gguf | 1B | ~666 MB |
| Falcon-E 3B Instruct | ggml-model-i2_s.gguf | 3B | ~1 GB |
Any I2_S GGUF model with a compatible architecture should work — see Model Compatibility for details.
Install
| Language | Package | Install |
|---|---|---|
| TypeScript / JS | 0xbitnet |
npm install 0xbitnet |
| Rust | oxbitnet |
cargo add oxbitnet |
| Python | oxbitnet |
pip install oxbitnet |
| Swift / iOS | OxBitNet |
Swift Package Manager (see oxbitnet-swift) |
| Java / Android | oxbitnet |
implementation("io.github.m96-chan:oxbitnet:0.5.2") |
| C / FFI | oxbitnet-ffi |
cargo build -p oxbitnet-ffi --release |
API Overview
TypeScript
| Method | Description |
|---|---|
BitNet.load(url, options?) |
Load a GGUF model from a URL |
bitnet.generate(prompt, options?) |
Stream tokens as an AsyncGenerator<string> |
bitnet.diagnose(prompt?) |
Run GPU diagnostics on a forward pass |
bitnet.dispose() |
Release all GPU resources |
Rust
| Method | Description |
|---|---|
BitNet::load(source, options).await |
Load a GGUF model |
bitnet.generate(prompt, options) |
Stream tokens as impl Stream<Item = String> |
bitnet.generate_chat(messages, options) |
Chat with template formatting |
bitnet.dispose() |
Release all GPU resources |
Python
| Method | Description |
|---|---|
BitNet.load_sync(source) |
Load a GGUF model |
model.chat(messages, on_token) |
Chat with streaming callback |
model.generate(prompt, on_token) |
Generate with streaming callback |
model.generate_sync(prompt) |
Generate, return full string |
model.dispose() |
Release all GPU resources |
Swift
| Method | Description |
|---|---|
BitNet.load(source, options:) |
Load a GGUF model (async) |
BitNet.loadSync(source, options:) |
Load a GGUF model (blocking) |
model.generate(prompt, options:) |
Stream tokens as AsyncThrowingStream<String, Error> |
model.chat(messages, options:) |
Chat with streaming via AsyncThrowingStream |
model.dispose() |
Release all GPU resources (also called by deinit) |
Java
| Method | Description |
|---|---|
BitNet.loadSync(source, options?) |
Load a GGUF model |
model.chat(messages, callback, options?) |
Chat with streaming callback |
model.generate(prompt, callback, options?) |
Generate with streaming callback |
model.dispose() / model.close() |
Release all GPU resources (AutoCloseable) |
C / FFI
| Function | Description |
|---|---|
oxbitnet_load(source, options) |
Load a GGUF model, returns opaque handle |
oxbitnet_chat(model, messages, n, opts, cb, ud) |
Chat with streaming callback |
oxbitnet_generate(model, prompt, opts, cb, ud) |
Generate with streaming callback |
oxbitnet_free(model) |
Release all GPU resources |
oxbitnet_error_message() |
Get last error (thread-local) |
Platform Support
0xBitNet runs on any platform with a WebGPU implementation:
Browsers:
- Chrome / Edge 113+ (recommended)
- Firefox Nightly (behind flag)
- Safari 18+
Native (Rust / Python):
- Uses wgpu — Vulkan, Metal, DX12 backends automatically
- No browser or WebGPU runtime needed
Native (Node.js / Deno):
- Deno (built-in WebGPU)
- Node.js with
webgpunpm package (Dawn bindings) — see Node.js CLI example - Any runtime exposing the WebGPU API (e.g., wgpu-native, Electron)
A dedicated GPU with sufficient VRAM is required (see Supported Models for estimates).
Examples
Web Chat
A WebGPU-powered chat application. Downloads the model on first visit, then runs LLM chat completely on-device — no backend needed.
TL;DR Widget
An offline-ready summarization widget. Provides LLM-powered TL;DR without any network dependency.
Node.js CLI
Run BitNet from the command line using Node.js and the webgpu npm package (Dawn bindings). Interactive chat with streaming output and tok/s metrics.
cd examples/node-cli npm install && npm start
Rust CLI
Interactive chat using native wgpu.
cd packages/rust
cargo run --example chat --releasePython CLI
Interactive chat via Python bindings.
pip install oxbitnet python packages/rust/crates/oxbitnet-python/examples/chat.py
Swift CLI
Minimal Swift chat example wrapping the C FFI layer.
cd packages/rust cargo build -p oxbitnet-ffi --release cd crates/oxbitnet-swift swift run -Xlinker -L../../../../target/release Chat model.gguf "Hello!"
Java CLI
Minimal Java chat example using JNI bindings.
cd packages/rust cargo build -p oxbitnet-java --release cd crates/oxbitnet-java/examples javac -cp ../java/src/main/java:. Chat.java java -Djava.library.path=../../../../target/release -cp ../java/src/main/java:. Chat model.gguf "Hello!"
C CLI
Minimal C example using the FFI bindings.
cd packages/rust cargo build -p oxbitnet-ffi --release gcc crates/oxbitnet-ffi/examples/chat.c -Icrates/oxbitnet-ffi -Ltarget/release -loxbitnet_ffi -o chat LD_LIBRARY_PATH=target/release ./chat model.gguf "Hello!"
Architecture
0xbitnet/
├── packages/
│ ├── core/ # WGSL kernels + TypeScript API (npm: 0xbitnet)
│ │ └── src/
│ │ ├── gpu/ # WebGPU device init, buffer pool
│ │ ├── model/ # GGUF parser, weight loader, config
│ │ ├── nn/ # Transformer layers, attention, BitLinear
│ │ ├── shaders/ # 12 WGSL compute shaders (shared with Rust)
│ │ └── tokenizer/ # BPE tokenizer, chat templates
│ └── rust/ # Rust + Python bindings
│ └── crates/
│ ├── oxbitnet/ # Rust library (crates.io: oxbitnet)
│ ├── oxbitnet-python/ # Python bindings via PyO3 (PyPI: oxbitnet)
│ ├── oxbitnet-swift/ # Swift bindings via C FFI (SPM package)
│ ├── oxbitnet-java/ # Java/JNI bindings (Android-ready)
│ └── oxbitnet-ffi/ # C FFI bindings (cdylib + staticlib)
├── examples/
│ ├── web-chat/ # Chat app demo (Vite)
│ ├── tl-dr-widget/ # Offline TL;DR widget demo (Vite)
│ └── node-cli/ # Node.js CLI using Dawn WebGPU bindings
└── docs/
See Architecture for data flow and internals.
Prerequisites
- TypeScript/JS: Node.js 18+, a WebGPU-capable environment
- Rust: Rust 1.75+, a Vulkan/Metal/DX12-capable GPU
- Swift: Swift 5.9+, a Vulkan/Metal/DX12-capable GPU
- Java: JDK 17+, a Vulkan/Metal/DX12-capable GPU
- Python: Python 3.9+,
pip install oxbitnet
Contributing
Contributions are welcome! Whether it's a bug report, feature request, or pull request — all input is appreciated.
- Fork the repository
- Create your feature branch (
git checkout -b feature/amazing-feature) - Commit your changes (
git commit -m 'Add amazing feature') - Push to the branch (
git push origin feature/amazing-feature) - Open a Pull Request
Please see CONTRIBUTING.md for detailed guidelines.
