Atlas Inference Engine

Pure CUDA + Rust. Zero python dependencies, zero complex recipes.

Inference at
unimaginable speeds

An LLM inference engine written from scratch in Rust and CUDA. No PyTorch. No Python. Just a ~2.5 GB image that runs 3x faster than the status quo.

$ uvx sparkrun setup install

sparkrun pulls & runs the ~2.5 GB Atlas image for you. Run command below.

130

tok/s peak (Qwen3.5-35B)

Faster by Design

Clean architecture beats bloat

vLLM ships 20+ GB of Python, PyTorch, and 200+ dependencies. Atlas ships a single ~2.5 GB binary. That simplicity is the speed.

Atlas

Image size ~2.5 GB

Cold start <2 min

Runtime Rust + CUDA

Dependencies None

vLLM

Image size 20+ GB

Cold start ~10 min

Runtime Python + PyTorch

Dependencies 200+ packages

⚡

Pure Rust + CUDA

Compiled from HTTP to kernel dispatch. No interpreter, no GIL, no JIT warm-up.

🔧

Custom CUDA Kernels

Hand-tuned attention, MoE, GDN, and Mamba-2 kernels for Blackwell SM120/121. NVFP4 and FP8 with native tensor cores.

🔮

MTP Speculative Decoding

Multi-Token Prediction generates multiple tokens per forward pass. Up to 3x throughput over single-token decoding.

Qwen3.5-35B (NVFP4) on DGX Spark

Single GPU, batch=1. Atlas with MTP K=2.

Average (diverse workloads)

Qwen3.5-122B (NVFP4) on a single DGX Spark

122B parameter model, single node. ~54 tok/s with EP=2.

Supported Models

Model matrix

Every model gets hand-tuned CUDA kernels. Pick a vendor, then a model family; every recipe maps to a single sparkrun recipe you can copy and run as-is.

Qwen · Qwen3-Coder-Next

1 recipe

Qwen3 Coder Next FP8

Qwen/Qwen3-Coder-Next-FP8

sparkrun run @atlas/qwen3-coder-next-fp8

All recipes are the single source of truth in atlas-recipes. Run any of them with sparkrun. EP=2 = Expert Parallelism across two GB10 nodes.

Try It Yourself

Up and running in one command

Don't take our word for it. One command on your DGX Spark and you're serving. The quickstart script installs sparkrun only if it isn't already present, then runs the recipe. sparkrun pulls & runs the Atlas image for you using your existing Docker/Podman + NVIDIA container runtime.

Run with sparkrun — Qwen3.6-35B-A3B FP8 + MTP on a single Spark

$ curl -fsSL https://atlasinference.io/quickstart.sh | sh

OpenAI compatible at http://localhost:8888/v1. Works with Claude Code, Cline, OpenCode, Open WebUI, and any OpenAI-compatible client.

Contact

Get in touch

We optimize for your use case. Reach out with model requests, hardware setups, or partnership ideas.

Roadmap

Built for the community

We don't chase every architecture at once. We do each one properly, with kernels that hit the hardware ceiling rather than emulate around it.

🌐

Hardware Expansion

Optimized for DGX Spark today. ASUS Ascent GX10 compatibility confirmed by the community. Strix Halo port in exploration. RTX 6000 Pro Blackwell on the horizon. Same kernel philosophy, adapted per chip.

💡

Kernel Philosophy

Every model gets its own hand-tuned CUDA kernels. No generic fallbacks. We profile, optimize, and validate at the register level. If a model matters to you, it matters to us.

📢

Community-Driven

MiniMax M2.7 just landed. Model support is driven entirely by what the community asks for. We're in Discord every day listening. Tell us what you're running and we'll optimize for your use case.

🛠

Open Source

Free and open source release coming soon. We want to make sure what we release is something people can actually build on, not just a dump.

🎨

Multimodal

Vision support live for Qwen3-VL. Audio and additional modalities on the roadmap. The goal is proper kernel-level support for each modality.

🎯

Agentic-Ready

OpenAI + Anthropic API compatibility on the same port. Tool calling, structured output, multi-turn. Works with Claude Code, Cline, OpenCode, and Open WebUI out of the box.

Inference at unimaginable speeds