# binfer (Bun + infer)
An experimental, snappy LLM inference engine built on Bun, written in TypeScript and CUDA. Currently supports gpt-oss and Qwen models from Hugging Face.
## Requirements
- CUDA 12.x
- Bun 1.x
- NVIDIA GPU (H100 recommended, A100 supported)
- NCCL (for multi-GPU)
## Build

The CUDA kernels are built automatically. To rebuild them manually:

```bash
bun run build:cuda
```
## Usage

### Single Generation

```bash
bun run src/index.ts <model> [prompt] [options]

# Examples
bun run src/index.ts Qwen/Qwen3-1.7B "Hello world" --max-tokens 50
bun run src/index.ts openai/gpt-oss-20b "Tell me a joke" --tp 2
```
### Interactive Chat

```bash
bun run src/index.ts Qwen/Qwen3-8B --chat
```
### Batch Processing

```bash
bun run src/index.ts <model> -i prompts.txt -o outputs.txt --max-tokens 100
```
### OpenAI-Compatible Server

```bash
bun run src/serve.ts Qwen/Qwen3-1.7B --port 8000
```

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "max_tokens": 20}'
```
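
Since the endpoint is OpenAI-compatible, it can also be called from TypeScript with plain `fetch`. A minimal sketch mirroring the `curl` call above; the `choices[0].text` response shape is assumed from the standard OpenAI completions format:

```ts
// Minimal client for the completions endpoint started above.
// Assumes the response follows the usual OpenAI shape (choices[0].text).
const res = await fetch("http://localhost:8000/v1/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ prompt: "Hello", max_tokens: 20 }),
});
const data = await res.json();
console.log(data.choices?.[0]?.text);
```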
CLI Options
| Flag | Description |
|---|---|
--tp <n|auto> |
Tensor parallel degree (default: auto) |
--max-tokens <n> |
Maximum tokens to generate (default: 100) |
--dtype <fp16|bf16> |
Weight precision (default: model native) |
--chat |
Interactive chat mode |
--bench |
Run benchmark |
--profile |
Show timing breakdown |
--trace <file> |
Write JSON trace (chrome://tracing format) |
--use-template |
Apply chat template to prompt |
--no-cuda-graphs |
Disable CUDA graph capture |
--temperature <f> |
Sampling temperature (default: 0 = greedy) |
--top-k <n> |
Top-k sampling (default: 0 = disabled) |
--top-p <f> |
Nucleus sampling (default: 1.0 = disabled) |
-i, --input <file> |
Input file with prompts (one per line) |
-o, --output <file> |
Output file for batch results |
-q, --quiet |
Suppress loading output |
-v, --verbose |
Show startup timing |
## Supported Models
- Qwen (including Qwen3 with QK-norm)
- GPT-OSS (20B, 120B)
- Possibly other families (untested)

In general, any Hugging Face model with a LLaMA-style architecture should work.
## Features
### Inference

- Paged KV cache (see the sketch after this list)
- Continuous batching
- CUDA graph capture for decode
- Streaming token output
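
To illustrate the paged KV cache bullet: sequences map to fixed-size physical blocks taken from a free list, so memory for many concurrently batched requests can grow and shrink without large contiguous allocations. A rough TypeScript sketch; block size, names, and structure are illustrative assumptions, not the engine's actual code:

```ts
// Conceptual sketch of a paged KV cache block table -- illustrative only.
const BLOCK_SIZE = 16; // tokens per KV block (assumed)

class BlockTable {
  private free: number[];
  private table = new Map<number, number[]>(); // seqId -> physical block ids

  constructor(numBlocks: number) {
    this.free = Array.from({ length: numBlocks }, (_, i) => i);
  }

  // Grow a sequence's block list until it can hold `numTokens` KV entries.
  ensure(seqId: number, numTokens: number): number[] {
    const blocks = this.table.get(seqId) ?? [];
    const needed = Math.ceil(numTokens / BLOCK_SIZE);
    while (blocks.length < needed) {
      const block = this.free.pop();
      if (block === undefined) throw new Error("KV cache is full");
      blocks.push(block);
    }
    this.table.set(seqId, blocks);
    return blocks;
  }

  // Return a finished sequence's blocks to the free list so a newly admitted
  // request (continuous batching) can reuse them immediately.
  release(seqId: number): void {
    this.free.push(...(this.table.get(seqId) ?? []));
    this.table.delete(seqId);
  }
}
```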
### Multi-GPU
- Tensor parallelism with NCCL
- Automatic TP degree detection
- Column/row parallel weight sharding (sketched after this list)
- Pre-sharded weight caching
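
As a rough illustration of column/row parallel sharding (not the actual weight loader): a column-parallel linear splits the output dimension across ranks and concatenates results, while a row-parallel linear splits the input dimension and sums partial outputs with an NCCL all-reduce.

```ts
// Sketch of column- vs row-parallel sharding of a [out, in] weight matrix
// across `tp` ranks. Illustrates the layout only; assumes dimensions divide
// evenly by tp.
type Matrix = number[][]; // [outFeatures][inFeatures]

// Column-parallel: split the output dimension; each rank produces a slice of
// the output, and slices are concatenated (no reduction needed).
function shardColumnParallel(w: Matrix, tp: number, rank: number): Matrix {
  const rowsPerRank = w.length / tp;
  return w.slice(rank * rowsPerRank, (rank + 1) * rowsPerRank);
}

// Row-parallel: split the input dimension; each rank produces a partial
// output, and partials are summed with an NCCL all-reduce.
function shardRowParallel(w: Matrix, tp: number, rank: number): Matrix {
  const colsPerRank = w[0].length / tp;
  return w.map((row) => row.slice(rank * colsPerRank, (rank + 1) * colsPerRank));
}
```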
### Quantization

- MXFP4 (4-bit values with shared E8M0 scales; decode sketched after this list)
- Native FP16/BF16
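
For reference, MXFP4 as defined by the OCP microscaling spec packs blocks of 32 FP4 (E2M1) values with one shared E8M0 power-of-two scale. A decoding sketch; nibble packing order and NaN handling are simplifications, and the real decode lives in the CUDA kernels:

```ts
// Sketch of MXFP4 decoding: blocks of FP4 (E2M1) values sharing one E8M0 scale.
const FP4_MAGNITUDES = [0, 0.5, 1, 1.5, 2, 3, 4, 6]; // E2M1 values

function decodeFp4(nibble: number): number {
  const sign = nibble & 0x8 ? -1 : 1;
  return sign * FP4_MAGNITUDES[nibble & 0x7];
}

// `packed` holds two FP4 values per byte; `scaleE8M0` is the block's shared
// exponent byte, interpreted as 2^(e - 127).
function decodeMxfp4Block(packed: Uint8Array, scaleE8M0: number): Float32Array {
  const scale = Math.pow(2, scaleE8M0 - 127);
  const out = new Float32Array(packed.length * 2);
  for (let i = 0; i < packed.length; i++) {
    out[2 * i] = decodeFp4(packed[i] & 0xf) * scale;    // low nibble first (assumed)
    out[2 * i + 1] = decodeFp4(packed[i] >> 4) * scale; // high nibble second
  }
  return out;
}
```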
### Attention
- Flash Attention 3 (H100)
- Paged attention for variable-length batches
- Grouped Query Attention (GQA; sketched with RoPE after this list)
- RoPE scaling (linear, YaRN)
- Sliding window attention
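
Two of the bullets above are easy to show in a few lines: the GQA query-to-KV head mapping, and plain RoPE with linear position scaling (YaRN omitted). The conventions used, such as the split-half rotation pairing, are assumptions about typical LLaMA-style implementations, not a description of the kernels here.

```ts
// GQA: several query heads share one KV head, shrinking the KV cache.
function kvHeadForQueryHead(qHead: number, numQHeads: number, numKVHeads: number): number {
  const groupSize = numQHeads / numKVHeads; // e.g. 32 query heads, 8 KV heads -> groups of 4
  return Math.floor(qHead / groupSize);
}

// RoPE with linear position scaling: rotate pairs of channels by a
// position-dependent angle before attention.
function applyRope(
  x: Float32Array,   // one head's query or key vector, length = headDim (even)
  pos: number,       // absolute token position
  thetaBase = 10000,
  linearScale = 1,   // > 1 compresses positions to stretch the context window
): Float32Array {
  const out = new Float32Array(x.length);
  const half = x.length / 2;
  const p = pos / linearScale;
  for (let i = 0; i < half; i++) {
    const freq = Math.pow(thetaBase, (-2 * i) / x.length);
    const cos = Math.cos(p * freq);
    const sin = Math.sin(p * freq);
    // Split-half pairing: rotate (x[i], x[i + half]) together.
    out[i] = x[i] * cos - x[i + half] * sin;
    out[i + half] = x[i + half] * cos + x[i] * sin;
  }
  return out;
}
```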
### MoE (Mixture of Experts)
- Expert parallel across GPUs (token routing sketched after this list)
- MXFP4 quantized experts
- Marlin kernel for batched inference
- Up to 128 experts
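
The routing step that feeds these experts is conceptually simple. A sketch of top-k routing; the exact k, router details, expert placement across GPUs, and the quantized Marlin matmuls are engine- and model-specific and not shown here:

```ts
// Sketch of top-k expert routing: keep the k largest router logits per token,
// softmax over just those, and mix the chosen experts' outputs accordingly.
function routeTopK(routerLogits: number[], k: number): { expert: number; weight: number }[] {
  const ranked = routerLogits
    .map((logit, expert) => ({ expert, logit }))
    .sort((a, b) => b.logit - a.logit)
    .slice(0, k);
  const maxLogit = ranked[0].logit; // subtract max for numerical stability
  const exps = ranked.map((r) => Math.exp(r.logit - maxLogit));
  const sum = exps.reduce((a, b) => a + b, 0);
  return ranked.map((r, i) => ({ expert: r.expert, weight: exps[i] / sum }));
}
```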
### Tokenizer
- Native JS tokenizer (@huggingface/tokenizers)
- Chat templates: ChatML, Llama, GPT-OSS formats (ChatML shape sketched below)
- No Python dependency
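
As an example of what a chat template produces, a ChatML-style conversation renders roughly like this; the real templates come from each model's tokenizer config and may differ in details:

```ts
// Rough shape of a ChatML-rendered conversation.
type Message = { role: "system" | "user" | "assistant"; content: string };

function toChatML(messages: Message[]): string {
  const turns = messages
    .map((m) => `<|im_start|>${m.role}\n${m.content}<|im_end|>`)
    .join("\n");
  // Leave the final assistant turn open for the model to complete.
  return `${turns}\n<|im_start|>assistant\n`;
}
```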
## Project Structure

```
src/
  index.ts        # CLI entry point
  serve.ts        # OpenAI-compatible server
  engine/         # Inference engine
  model/          # Config parsing, tokenizer
  parallel/       # Tensor parallelism, NCCL
  backend/cuda/   # CUDA bindings
cuda/
  kernels/        # CUDA kernel implementations
  include/        # C headers
```
## License
MIT
