Training large language models (LLMs) attracts attention for its massive compute demands and headline-making breakthroughs; however, what ultimately determines their real-world practicality and broad adoption is the efficiency, cost, and latency of the inference stage. Inference is the process by which a trained AI model applies what it has learned to new, unseen data to make predictions or generate outputs. For LLMs, this means accepting a user prompt, computing through the model’s vast network of parameters, and ultimately producing a coherent text response.
The core challenge in LLM inference is deploying models with tens to hundreds of billions of parameters under tight constraints on latency, throughput, and cost. It is a complex, cross-stack problem spanning algorithms, software, and hardware. On the one hand, the sheer size of the models and their compute- and memory-intensive operations (e.g., attention) pose fundamental obstacles; on the other, the autoregressive decoding process that underpins text generation is inherently sequential, limiting parallelism. As a result, LLM inference calls for a full-stack solution that accounts for everything from low-level hardware to application design, with the inference engine at the center.
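To make the sequential bottleneck concrete, here is a minimal greedy decoding loop using Hugging Face Transformers (the model choice and generation length are purely illustrative): each new token depends on every token generated before it, so the steps cannot run in parallel.

```python
# Minimal greedy autoregressive decoding loop (illustrative sketch; "gpt2" stands in for a much larger LLM).
# Each iteration feeds the whole sequence back in, so step t cannot start before step t-1 finishes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

input_ids = tokenizer("The key challenge of LLM inference is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(32):                               # generate 32 tokens, one at a time
        logits = model(input_ids).logits              # recomputes the whole prefix every step
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0]))
```

Production engines avoid recomputing the prefix at every step by caching attention keys and values (the KV cache); managing that cache efficiently is precisely what the techniques discussed below address.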
Among open-source inference engines, vLLM and SGLang are two of the most closely watched projects.
From academic innovation to a community-driven open-source standard-bearer
vLLM traces its roots to a 2023 paper centered on the PagedAttention algorithm, “Efficient Memory Management for Large Language Model Serving with PagedAttention.” If you look closely at the paper’s author list, you’ll notice many of those names reappear later.
In the early days of LLM serving, vLLM's breakthrough wasn't a brand-new AI algorithm; instead, it borrowed paging and cache-management ideas from operating systems to manage KV-cache memory at a fine granularity, laying the groundwork for high-throughput request handling via its PagedAttention mechanism. vLLM also embraced and advanced several industry techniques, such as Continuous Batching, first described in the paper "Orca: A Distributed Serving System for Transformer-Based Generative Models."
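The operating-systems analogy is easy to see in a toy allocator (a sketch, not vLLM's actual data structures): KV memory is carved into fixed-size blocks, each request keeps a block table mapping its logical token positions to physical blocks, and blocks are handed out on demand rather than reserving space for a request's maximum length up front.

```python
# Toy paged KV-cache allocator in the spirit of PagedAttention (illustrative, not vLLM's implementation).
BLOCK_SIZE = 16  # tokens per KV block (illustrative value)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))   # physical block IDs not yet in use
        self.block_tables = {}                       # seq_id -> list of physical block IDs
        self.seq_lens = {}                           # seq_id -> number of cached tokens

    def append_token(self, seq_id: str) -> int:
        """Reserve KV space for one new token; return the physical block it lands in."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:                 # current block is full (or none allocated yet)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; request must wait or be preempted")
            table.append(self.free_blocks.pop())     # grab one block, no large up-front reservation
        self.seq_lens[seq_id] = length + 1
        return table[-1]

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(20):                 # a 20-token sequence occupies only 2 blocks
    cache.append_token("req-1")
print(cache.block_tables["req-1"])  # e.g. [7, 6]
```

Because blocks are released the moment a request finishes, fragmentation and over-reservation drop sharply, which is what lets the scheduler keep many more requests in flight.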
In LLM inference, performance and speed are the ultimate trump cards. In practice, vLLM delivered striking gains: according to the official blog at the time, compared with a Hugging Face Transformers–based backend, vLLM handled up to 5× the traffic and boosted throughput by as much as 30×. As a result, vLLM quickly evolved from an academic research project into a community-driven open-source effort: within less than half a year it amassed tens of thousands of stars, and today it has a formidable developer base — over ten thousand contributors have engaged in issue/PR discussions, nearly 2,000 have submitted PRs, and on average at least 10 new issues are filed daily; a large backlog remains, with more than 2,000 issues and PRs still awaiting triage.
SGLang originated from the paper “SGLang: Efficient Execution of Structured Language Model Programs” and opened new ground with a highly optimized backend runtime centered on RadixAttention and an efficient CPU scheduling design. Rather than discarding PagedAttention, RadixAttention extends it: it preserves as much prompt and generation KV cache as possible and attempts to reuse KV cache across requests; when prefixes match, it slashes prefill computation, improving performance — its paper shows significant gains over inference engines without RadixAttention. Beyond RadixAttention, SGLang’s fundamentals are strong; even with RadixAttention disabled in benchmarks, its performance remains excellent.
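The reuse idea can be sketched with a simple prefix lookup (SGLang's real implementation stores KV blocks in a radix tree with LRU eviction; the PrefixCache class below is a toy stand-in): before prefill, the engine finds the longest cached prefix of the incoming token sequence and computes attention only for the uncached suffix.

```python
# Toy prefix-reuse lookup in the spirit of RadixAttention (illustrative, not SGLang's implementation).
from typing import Dict, List, Tuple

class PrefixCache:
    def __init__(self):
        self.cached: Dict[Tuple[int, ...], str] = {}   # token prefix -> handle to its KV blocks

    def match_longest_prefix(self, tokens: List[int]) -> int:
        """Return the length of the longest already-cached prefix of `tokens`."""
        for i in range(len(tokens), 0, -1):
            if tuple(tokens[:i]) in self.cached:
                return i
        return 0

    def insert(self, tokens: List[int], kv_handle: str) -> None:
        self.cached[tuple(tokens)] = kv_handle

cache = PrefixCache()
system_prompt = list(range(100))                       # 100 shared "system prompt" tokens
cache.insert(system_prompt, kv_handle="kv-blocks-0")   # first request populates the cache

new_request = system_prompt + [999, 1000, 1001]        # same system prompt, new user turn
reused = cache.match_longest_prefix(new_request)
print(f"prefill only {len(new_request) - reused} of {len(new_request)} tokens")  # 3 of 103
```

For workloads with long shared prefixes, such as chat sessions or few-shot prompts, skipping the shared portion of prefill is where most of the reported gains come from.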
Community-wise, SGLang is a fast-rising newcomer with a leaner footprint — its total contributor count is less than half of vLLM’s, and while it has over 2,000 users and participants, that’s still under one-fifth of vLLM’s scale. Rapid iteration and an enthusiastic user base have stretched maintainers: both projects have sizable backlogs of open issues/PRs. Over the past three months, most issues in vLLM receive responses within 12 hours to 3 days, whereas in SGLang it typically takes 3 to 5 days.
Origins: a continuous current of innovation
As a leading U.S. public research university, UC Berkeley has produced a remarkable roster of open-source projects. From earlier eras: Postgres in databases, RISC-V in hardware, Spark in big-data processing, and Ray in machine learning. In today’s LLM wave, that innovative DNA endures — Berkeley is again behind a top-tier open-source inference engine in vLLM. SGLang wasn’t created solely at Berkeley, but its origins are closely tied to the university.
vLLM led the way with an open-source release in June 2023; SGLang debuted roughly six months later. Early core initiators of the two projects — Woosuk Kwon (vLLM) and Lianmin Zheng (SGLang) — both hail from Berkeley and studied under Ion Stoica, the luminary who led students to create the flagship open-source projects Spark and Ray.
In 2023, Lianmin, Stanford's Ying Sheng, and several scholars from other universities founded the open research group LMSYS.org, which soon launched popular projects such as FastChat, Chatbot Arena, and Vicuna. The widely used LLM evaluation platform Chatbot Arena had already adopted vLLM and FastChat as backend engines in April 2023, months before vLLM's official open-source release, and traces of that history are still visible in the repositories' early commit logs.
FastChat once aimed to cover the model’s full lifecycle — training, inference, and evaluation — but has since largely fallen out of active maintenance. The later surge of SGLang (whose core idea originated at Stanford with Ying Sheng) and Chatbot Arena (now renamed LMArena) likely built on FastChat’s early practices, branching into robust inference and evaluation ecosystems.
Today, core initiators Woosuk and Lianmin remain actively involved in maintenance and iteration. After a year or two of growth, the core developer cohorts of both projects have shifted to some extent. Recent six-month contributor data show that early-career academic researchers remain a major force — unsurprising given both projects’ deep academic roots. Beyond academia, vLLM’s contribution backbone includes Red Hat, while SGLang’s core contributors come from xAI, Skywork, Oracle, and LinkedIn.
As many as 194 developers have contributed code to both vLLM and SGLang — about 30% of SGLang’s total code contributors to date.
Among them are several notable cross-contributors. Their contribution patterns offer a glimpse into how open-source contributors move between projects — and even invite some informed conjectures:
- comaniac: an engineer at OpenAI. In SGLang's early days last year, he submitted 17 PRs. He's also a major vLLM contributor with 77 PRs overall. His activity has tapered off since March this year. Given that Zhuohan, an early vLLM author, largely stopped contributing after joining OpenAI, one can't help but wonder: is OpenAI building its own internal inference engine?
- ShangmingCai: a researcher at Alibaba Cloud Feitian Lab. From last June to this April he submitted 18 PRs to vLLM; starting in April his focus shifted to SGLang, where he has filed 52 PRs and become a key contributor.
- CatherineSue: an engineer at Oracle. From July to October last year she submitted four bug-fix PRs to vLLM; from last July to now she has filed 76 PRs in SGLang and is a core contributor there.
Development, refactors, and fierce competition
By version and community-momentum timelines, vLLM surged after launch, then slowed noticeably from September to December last year; with V1, momentum returned and growth resumed. By contrast, SGLang has climbed steadily since v0.2. In the first half of this year, possibly buoyed by DeepSeek V3/R1, both entered another phase of rapid growth.
Key milestones from an OpenRank perspective:
- June 2023: vLLM officially launches, introduces PagedAttention, and grows quickly on the strength of leading performance.
- January 2024: As vLLM races ahead, SGLang ships its first release and gains industry attention thanks to RadixAttention.
- July 2024: SGLang releases v0.2, entering its first acceleration phase.
- September 2024: vLLM ships v0.6.0, improving throughput by up to ~2.7× and cutting latency by up to ~5× through CPU-scheduling and other optimizations; the day before, SGLang had released v0.3. Thereafter, SGLang maintained steady growth while vLLM's pace moderated.
- December 2024–January 2025: After months of preparation, vLLM unveils the V1 refactor. With DeepSeek V3/R1 bursting onto the scene, both vLLM and SGLang begin a second wave of explosive growth.
In 2024, as features, model coverage, and hardware support expanded rapidly, vLLM inevitably hit classic software-engineering headwinds: growing code and architectural complexity began to slow performance gains. A third-party performance study published in September showed that in some scenarios vLLM's CPU scheduling overhead could exceed half of total inference time, leaving GPU compute underutilized. The official blog likewise acknowledged that rapid evolution created horizontal-scaling challenges and made it hard to merge independently developed features, prompting a rethink and a foundational refactor: V1 arrived in early 2025, after which growth re-accelerated. By contrast, SGLang over the same period appeared to support fewer features, models, and hardware targets, yet turned in equally strong results thanks to a more extensible architecture, excellent CPU scheduling, and later a "zero-overhead" scheduling design.
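The scheduling problem those findings point to can be pictured in a few lines: if the CPU prepares batch t+1 only after the GPU finishes batch t, scheduling time adds to every decode step, whereas overlapping the two hides most of that cost, which is the essence of the "zero-overhead" idea. The sketch below is only an illustration, using a worker thread and sleep-based stand-ins for the scheduler and the GPU rather than either engine's real code.

```python
# Illustrative overlap of CPU scheduling with GPU execution (a sketch of the "zero-overhead" idea,
# not either engine's actual scheduler). While the "GPU" runs batch t, the CPU already prepares
# batch t+1, so per-step cost approaches max(gpu, cpu) instead of gpu + cpu.
import time
from concurrent.futures import ThreadPoolExecutor

def cpu_schedule(step: int) -> str:
    time.sleep(0.004)                      # pretend: batching, block-table updates, sampling prep
    return f"batch-{step}"

def gpu_execute(batch: str) -> None:
    time.sleep(0.010)                      # pretend: one forward pass on the GPU

def serial(steps: int) -> float:
    start = time.perf_counter()
    for t in range(steps):
        gpu_execute(cpu_schedule(t))       # CPU and GPU take turns: cost is cpu + gpu per step
    return time.perf_counter() - start

def overlapped(steps: int) -> float:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=1) as cpu:
        next_batch = cpu.submit(cpu_schedule, 0)
        for t in range(steps):
            batch = next_batch.result()
            if t + 1 < steps:
                next_batch = cpu.submit(cpu_schedule, t + 1)  # schedule t+1 while t runs
            gpu_execute(batch)
    return time.perf_counter() - start

print(f"serial:     {serial(50):.3f} s")     # roughly 50 * (0.004 + 0.010)
print(f"overlapped: {overlapped(50):.3f} s") # roughly 50 * 0.010 plus one scheduling step
```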
In 2025, the performance race among inference engines heated up: rapid integration of cutting-edge features, day-one support for mainstream open-source models, and constant expansion across hardware platforms had everyone sprinting. With nearly every release, both sides published benchmark results asserting leadership, fueling recurring social-media debates. Recognizing the limits of “number wars,” both gradually deemphasized same-day shootouts and shifted to reproducible methods, end-to-end metrics under real workloads, and encouragement of independent third-party evaluations to help users decide more rationally.
A recent third-party comparison came from Alibaba Cloud in June: benchmarking vLLM vs. SGLang on the Qwen family, the overall single-GPU/dual-GPU results favored SGLang. That said, benchmark outcomes can vary widely across hardware/models/configurations, and performance isn’t the only decision factor — so results should be interpreted case by case.
Trend-wise, model architectures are showing signs of convergence, and mainstream inference engines are increasingly similar in features, algorithms, and operator stacks. Leaders vLLM and SGLang now both support features like Continuous Batching, PagedAttention, RadixAttention, Chunked Prefill, Speculative Decoding, Disaggregated Serving, and CUDA Graphs; operator libraries such as FlashInfer, FlashAttention, and DeepGEMM; plus key capabilities like parallelism and quantization. These advances often deliver step-change speedups, leaving slower movers behind — for example, Hugging Face’s TGI has been falling further behind vLLM, SGLang, and TensorRT-LLM on performance. Meanwhile, good ideas propagate quickly in open source, with new optimizations often adopted across projects in short order. It’s likely that first-tier engines will converge further on raw performance, shifting competition toward factors beyond speed alone.
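Of the techniques listed above, speculative decoding has a core loop simple enough to sketch: a cheap draft model proposes a few tokens, and the target model verifies them, keeping the longest prefix it agrees with. The toy version below uses greedy verification and stand-in callables as "models"; real implementations in vLLM, SGLang, and elsewhere verify all proposals in a single batched forward pass and use probabilistic acceptance rules.

```python
# Toy greedy speculative decoding (a sketch of the technique, not any engine's code).
from typing import Callable, List

def speculative_decode(target: Callable[[List[int]], int],
                       draft: Callable[[List[int]], int],
                       prompt: List[int], k: int, max_new: int) -> List[int]:
    seq = list(prompt)
    while len(seq) < len(prompt) + max_new:
        # 1) Draft model proposes k tokens cheaply (one at a time here for clarity).
        proposal = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # 2) Target model verifies: accept proposals while they match its own greedy choice.
        accepted = 0
        for i in range(k):
            if target(seq + proposal[:i]) == proposal[i]:
                accepted += 1
            else:
                break
        seq += proposal[:accepted]
        # 3) Always emit at least one target-model token so progress is guaranteed.
        if accepted < k:
            seq.append(target(seq))
    return seq[:len(prompt) + max_new]

# Stand-in "models": the target counts by 2, the draft usually agrees but occasionally drifts.
target_model = lambda seq: seq[-1] + 2
draft_model = lambda seq: seq[-1] + (2 if len(seq) % 5 else 3)

print(speculative_decode(target_model, draft_model, prompt=[0], k=4, max_new=10))
```

When the draft model agrees with the target most of the time, several tokens are emitted per expensive target-model call, which is where the speedup comes from; when it drifts, the output is still identical to what the target model alone would produce.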
In one line each, other inference engines to watch:
- TensorRT-LLM: launched by NVIDIA in late 2023 and deeply tuned for its own hardware; historically tightly controlled by NVIDIA, making deep community participation harder.
- OpenVINO: developed by Intel, focused on efficient deployment and optimization across Intel CPUs/GPUs; important for both edge and clustered inference.
- Llama.cpp: written in C++ by community developer Georgi Gerganov in 2023, targets low-barrier edge inference — running LLMs on ordinary PCs and even phones — and is widely adopted by individual developers and small companies.
- LMDeploy: co-developed by the MMDeploy and MMRazor teams (Shanghai AI Lab), with dual backends — TurboMind for high performance and PyTorch for broad hardware coverage. Official data show clear throughput advantages, plus strong quantization support, making it competitive with vLLM/SGLang.
Moving forward in the ecosystem
During their rapid-growth phase, both vLLM and SGLang drew attention from investors and open-source foundations:
- In August 2023, the ever-opportunistic a16z launched the Open Source AI Grant to fund AI-related open-source projects. Among the first recipients were vLLM core developers Woosuk Kwon and Zhuohan Li. In the third cohort announced in June this year, SGLang core developers Ying Sheng and Lianmin Zheng were also funded.
- In July 2024, ZhenFund announced a donation to vLLM. At the same time, the Linux Foundation's LF AI & Data sub-foundation announced vLLM's entry into its incubation and donation process; this year vLLM moved under another LF umbrella, the PyTorch Foundation, with plans for deep collaboration across multiple areas.
- In March 2025, two months before vLLM formally joined the PyTorch Foundation, PyTorch published a blog post welcoming SGLang to "the PyTorch ecosystem" (to be clear, this did not mean the project was donated to the Foundation). Together, these moves rounded out the PyTorch landscape.
Both projects have become go-to inference solutions for tech companies in Silicon Valley and China. Their repos show active participation from engineers at Google, Meta, Microsoft, ByteDance, Alibaba, Tencent, and other top companies.
Beyond that, companies using vLLM include Red Hat, IBM, and Anyscale; adopters of SGLang include xAI, AMD, NVIDIA, Intel, LinkedIn, and Oracle.
Today, both projects have sizable Chinese developer communities. Roughly 33% of vLLM’s contributors are based in China, and the share is about 52% for SGLang.
From the outset, the vLLM community has shown strong convening power, hosting in-person meetups with users and developers roughly every month or two. This year, multiple offline meetups were held in Beijing, Shanghai, and Shenzhen. At the recently concluded GOSIM conference by scenic West Lake, SGLang hosted its first in-person workshop dedicated to Chinese developers.