Pure CUDA + Rust. Zero python dependencies, zero complex recipes.
Inference at
unimaginable speeds
An LLM inference engine written from scratch in Rust and CUDA. No PyTorch. No Python. Just a ~2.5 GB image that runs 3x faster than the status quo.
$ uvx sparkrun setup install
sparkrun pulls & runs the ~2.5 GB Atlas image for you. Run command below.
130
tok/s peak (Qwen3.5-35B)
Faster by Design
Clean architecture beats bloat
vLLM ships 20+ GB of Python, PyTorch, and 200+ dependencies. Atlas ships a single ~2.5 GB binary. That simplicity is the speed.
Atlas
Image size ~2.5 GB
Cold start <2 min
Runtime Rust + CUDA
Dependencies None
vLLM
Image size 20+ GB
Cold start ~10 min
Runtime Python + PyTorch
Dependencies 200+ packages
โก
Pure Rust + CUDA
Compiled from HTTP to kernel dispatch. No interpreter, no GIL, no JIT warm-up.
๐ง
Custom CUDA Kernels
Hand-tuned attention, MoE, GDN, and Mamba-2 kernels for Blackwell SM120/121. NVFP4 and FP8 with native tensor cores.
๐ฎ
MTP Speculative Decoding
Multi-Token Prediction generates multiple tokens per forward pass. Up to 3x throughput over single-token decoding.
Qwen3.5-35B (NVFP4) on DGX Spark
Single GPU, batch=1. Atlas with MTP K=2.
Average (diverse workloads)
Qwen3.5-122B (NVFP4) on a single DGX Spark
122B parameter model, single node. ~54 tok/s with EP=2.
Supported Models
Model matrix
Every model gets hand-tuned CUDA kernels. Pick a vendor, then a model family; every recipe maps to a single sparkrun recipe you can copy and run as-is.
Qwen ยท Qwen3-Coder-Next
1 recipe
Qwen3 Coder Next FP8
Qwen/Qwen3-Coder-Next-FP8
sparkrun run @atlas/qwen3-coder-next-fp8
All recipes are the single source of truth in atlas-recipes. Run any of them with sparkrun. EP=2 = Expert Parallelism across two GB10 nodes.
Try It Yourself
Up and running in one command
Don't take our word for it. One command on your DGX Spark and you're serving. The quickstart script installs sparkrun only if it isn't already present, then runs the recipe. sparkrun pulls & runs the Atlas image for you using your existing Docker/Podman + NVIDIA container runtime.
Run with sparkrun โ Qwen3.6-35B-A3B FP8 + MTP on a single Spark
$ curl -fsSL https://atlasinference.io/quickstart.sh | sh
OpenAI compatible at http://localhost:8888/v1. Works with Claude Code, Cline, OpenCode, Open WebUI, and any OpenAI-compatible client.
Contact
Get in touch
We optimize for your use case. Reach out with model requests, hardware setups, or partnership ideas.
Roadmap
Built for the community
We don't chase every architecture at once. We do each one properly, with kernels that hit the hardware ceiling rather than emulate around it.
๐
Hardware Expansion
Optimized for DGX Spark today. ASUS Ascent GX10 compatibility confirmed by the community. Strix Halo port in exploration. RTX 6000 Pro Blackwell on the horizon. Same kernel philosophy, adapted per chip.
๐ก
Kernel Philosophy
Every model gets its own hand-tuned CUDA kernels. No generic fallbacks. We profile, optimize, and validate at the register level. If a model matters to you, it matters to us.
๐ข
Community-Driven
MiniMax M2.7 just landed. Model support is driven entirely by what the community asks for. We're in Discord every day listening. Tell us what you're running and we'll optimize for your use case.
๐
Open Source
Free and open source release coming soon. We want to make sure what we release is something people can actually build on, not just a dump.
๐จ
Multimodal
Vision support live for Qwen3-VL. Audio and additional modalities on the roadmap. The goal is proper kernel-level support for each modality.
๐ฏ
Agentic-Ready
OpenAI + Anthropic API compatibility on the same port. Tool calling, structured output, multi-turn. Works with Claude Code, Cline, OpenCode, and Open WebUI out of the box.