Show HN: a Rust-based multimodal inference server (github.com)
We built a production-grade multimodal inference server in Rust for serving vision–language models (image + text → streamed text).
The goal was to explore what a Rust-native control plane looks like for modern multimodal inference: continuous batching, KV-aware admission control, predictable behavior under load, and proper streaming semantics.
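To make the admission-control idea concrete, here's a rough sketch of the kind of KV-aware check involved. The block size, struct names, and accounting below are illustrative, not the actual code in the repo:

    // Hypothetical sketch of KV-aware admission control: a request is only
    // admitted to the running batch if the KV cache pool can cover its
    // prompt tokens plus its worst-case decode budget.
    struct KvPool {
        free_blocks: usize,
        block_size: usize, // tokens per KV block
    }

    struct Request {
        prompt_tokens: usize,
        max_new_tokens: usize,
    }

    impl KvPool {
        fn blocks_needed(&self, req: &Request) -> usize {
            let total = req.prompt_tokens + req.max_new_tokens;
            (total + self.block_size - 1) / self.block_size // ceiling division
        }

        // Admit only if the pool can hold the request's worst-case KV footprint;
        // otherwise leave it queued so the server sheds load instead of OOMing.
        fn try_admit(&mut self, req: &Request) -> bool {
            let needed = self.blocks_needed(req);
            if needed <= self.free_blocks {
                self.free_blocks -= needed;
                true
            } else {
                false
            }
        }
    }

    fn main() {
        let mut pool = KvPool { free_blocks: 1024, block_size: 16 };
        let req = Request { prompt_tokens: 4096, max_new_tokens: 512 };
        println!("admitted: {}", pool.try_admit(&req));
    }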
The system exposes an OpenAI-compatible API, supports multi-image inputs, and is designed to degrade gracefully under overload rather than OOMing or stalling. It's organized as a monorepo with a gateway, GPU workers, a scheduler, and pluggable engine adapters.
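For a sense of the request shape, a multi-image chat request in the OpenAI-compatible format looks roughly like this (built here with serde_json as a dependency; the model name and image URLs are placeholders):

    // Sketch of a multi-image request body in the OpenAI-compatible
    // chat completions format (model name and URLs are placeholders).
    use serde_json::json;

    fn main() {
        let body = json!({
            "model": "some-vision-language-model",
            "stream": true,
            "messages": [{
                "role": "user",
                "content": [
                    { "type": "text", "text": "Compare these two screenshots." },
                    { "type": "image_url", "image_url": { "url": "https://example.com/a.png" } },
                    { "type": "image_url", "image_url": { "url": "https://example.com/b.png" } }
                ]
            }]
        });
        // POST this to the gateway's /v1/chat/completions endpoint and read the streamed response.
        println!("{}", serde_json::to_string_pretty(&body).unwrap());
    }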
We’ve also included a benchmark suite focused on real-world scenarios (TTFT, cancellation, overload, fairness) rather than synthetic tokens/sec numbers.
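As a rough illustration of what the TTFT measurement boils down to (the types and function names here are illustrative, not the benchmark suite's API): TTFT is the elapsed time from sending the request to receiving the first content token on the stream.

    use std::time::{Duration, Instant};

    // Stand-in for an SSE/chunked response stream from the server.
    struct TokenStream;

    impl TokenStream {
        fn next_token(&mut self) -> Option<String> {
            // In a real client this would block until the next streamed chunk arrives.
            Some("hello".to_string())
        }
    }

    // TTFT = time from request send until the first token arrives; later
    // tokens only affect inter-token latency and total generation time.
    fn measure_ttft(mut stream: TokenStream, start: Instant) -> Option<Duration> {
        stream.next_token().map(|_| start.elapsed())
    }

    fn main() {
        let start = Instant::now(); // taken when the request is sent
        let stream = TokenStream;
        if let Some(ttft) = measure_ttft(stream, start) {
            println!("TTFT: {:?}", ttft);
        }
    }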
Would love feedback from folks building or operating inference infrastructure.