Motivation.
Today, vLLM's P-D disaggregation architecture using the NIXL KV connector assigns distinct roles to prefill (P) and decode (D) nodes for KV cache handling: KV transfers are unidirectional, with the P node as producer and the D node as consumer. This constraint introduces inefficiencies in two critical scenarios, multi-turn conversations and cache-evicted prefill nodes, both of which require redundant recomputation of previously generated KV blocks during incremental prefill requests on P nodes.
1. Multi-turn conversations
Multi-turn conversational inference refers to sequential dialogue exchanges between end-users and LLM systems that maintain contextual coherence across interaction cycles. Implementation requires persistent conversation history management, wherein each subsequent user query is concatenated with relevant historical context: system prompts, prior user queries, and model-generated responses.
Optimal performance in non-disaggregated deployments is achieved through session persistence on LLM inference servers, enabling KV cache reuse across turns. However, P-D disaggregated architectures present a fundamental challenge: the KV cache corresponding to model-generated responses from previous turns resides exclusively on decode nodes, remaining inaccessible to prefill nodes (assuming absence of external shared KV storage infrastructure). The current architecture compels prefill nodes to recompute KV projections for all response tokens on each conversational turn, resulting in substantial computational waste on AI accelerator hardware. This inefficiency is particularly pronounced for reasoning-intensive models that generate extended response sequences.
Note: The reason for avoiding incremental prefill execution on the decode node is covered in the alternate solutions section below.
2. Cache-evicted prefill node
In P-D disaggregated multi-turn conversational systems, the decode node primarily maintains the session state while prefill nodes do not, creating an elevated probability of cache eviction for active user sessions on prefill nodes. Consequently, prefill nodes are forced to recompute KV projections for the entire conversational context (system prompts, historical user queries, and prior model responses) upon a cache miss.
Both computational-redundancy scenarios can be mitigated by implementing bidirectional KV cache transfer between prefill and decode nodes, enabling mutual KV cache loading and eliminating unnecessary recomputation overhead.
Proposed Change.
This proposal introduces bidirectional KV cache transfer capabilities between prefill and decode nodes to optimize KV cache utilization in P-D disaggregated deployments and eliminate redundant prefill computations.
The implementation goes beyond simple KV connector state replication due to fundamental architectural differences between the node types: prefill nodes compute KV representations for new prompt tokens while reusing cached entries, and can provide complete prompt KV blocks; decode nodes cannot (efficiently) generate complete KV representations and can only serve previously cached KV blocks. The high-level implementation details follow:
Initially I had proposed a new API to query the D node for its KV params so they could be passed to the P node, but during benchmarking I identified that the additional round-trip to the decode node to query KV block metadata introduces latency overhead. To address this, I added support for returning kv_transfer_params as part of the streaming response from the decode node, eliminating the need for a separate query.
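To illustrate, the proxy can capture kv_transfer_params directly from the SSE stream it is already forwarding. The sketch below is illustrative only: the field layout and the assumption that the params ride on a response chunk are mine, not the exact wire format used by the PR.

```python
import json

def extract_kv_transfer_params(sse_lines):
    """Scan an SSE stream from the decode node and capture the
    kv_transfer_params carried on a response chunk (hypothetical field
    layout; actual field names depend on the connector implementation)."""
    params = None
    for line in sse_lines:
        if not line.startswith("data: ") or line == "data: [DONE]":
            continue
        chunk = json.loads(line[len("data: "):])
        # The decode node piggybacks its KV block metadata on the
        # response chunk, so no separate query round-trip is needed.
        if "kv_transfer_params" in chunk:
            params = chunk["kv_transfer_params"]
    return params
```

Because the proxy parses the stream it already relays to the client, this adds no extra network hop.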
1. NIXL KV Connector Enhancement for Remote Block Handling
Extend the NIXL KV connector to handle remote KV blocks on both prefill and decode nodes, updating the external and new-token block counts accordingly. Once the numbers of local and remote blocks are updated, the rest of the KV load sequence follows the same path on both prefill and decode nodes and reuses the existing KV load transfer function.
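The block-accounting step can be sketched as partitioning a request's prompt blocks into local cache hits, blocks to pull from the remote peer, and blocks left to compute. The helper below is a hypothetical illustration (names do not match the vLLM source); the point is that the same partition logic runs on either node type, so the existing KV-load path can be shared.

```python
def plan_block_loads(num_prompt_blocks, num_local_hit_blocks, remote_block_ids):
    """Partition prompt blocks into (local hits, blocks to pull from the
    peer, blocks to compute). Hypothetical helper, not vLLM's actual code.
    remote_block_ids is assumed to be aligned with prompt block positions."""
    # Blocks already resident in this node's cache need no transfer.
    local = min(num_local_hit_blocks, num_prompt_blocks)
    # Of the remaining blocks, pull whatever the remote peer holds...
    pull = remote_block_ids[local:num_prompt_blocks]
    # ...and compute the rest (only a prefill node can actually do this;
    # on a decode node this count must be zero).
    to_compute = num_prompt_blocks - local - len(pull)
    return local, pull, to_compute
```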
2. KV Cache Manager update
Update the KV cache manager to differentiate node-specific KV cache (local + dedicated external storage) from shared KV cache (external storage shared across distributed nodes), avoiding two-hop transfers among the prefill node, decode node, and shared external-storage node. This also helps avoid the two-hop pattern in the current unidirectional KV transfer architecture. This optimization may be deferred to a subsequent implementation phase.
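The source-selection policy implied above can be sketched as a simple preference order that never routes a block through two hops. This is my own illustrative policy, not the actual cache-manager logic:

```python
def pick_kv_source(block_id, local_blocks, shared_storage_blocks, peer_blocks):
    """Decide where to load one KV block from, preferring sources that avoid
    a two-hop transfer (peer -> shared storage -> this node). Hypothetical
    policy sketch; argument names are illustrative."""
    if block_id in local_blocks:
        return "local"    # already resident, nothing to move
    if block_id in shared_storage_blocks:
        return "shared"   # one hop from shared external storage
    if block_id in peer_blocks:
        return "peer"     # one hop via direct P<->D transfer
    return "compute"      # miss everywhere: recompute (prefill node only)
```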
3. Proxy update
In this new design, the proxy serves as a lightweight, stateful router between the client, prefill (P), and decode (D) nodes. It maintains an in-memory cache of kv_transfer_params keyed by conversation history, which it populates by parsing the D node's SSE streaming response inline; no separate cache query API is needed. On the first turn of a conversation (cache miss), the proxy sends the request to P with empty remote_block_ids, causing P to compute the full KV cache from scratch. On subsequent turns (cache hit), the proxy looks up the cached block IDs from the previous turn and passes them to P, enabling P to pull the existing KV cache directly from D via RDMA and compute only the new tokens. This design keeps the proxy's role minimal: it never queries D for cache state, never holds KV data itself, and adds no extra hops to the critical path. The only state it tracks is a mapping from conversation history hashes to D's block metadata, which it captures for free from the token stream that is already flowing through it.
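The proxy's only state, the conversation-history-to-block-metadata map, could look like the minimal sketch below. The class and method names are illustrative, not from vLLM; the assumption is that hashing the serialized message history gives a stable key across turns.

```python
import hashlib
import json

class ProxyKVCacheIndex:
    """Minimal sketch of the proxy's only state: a map from a hash of the
    conversation history to the decode node's kv_transfer_params, as
    captured from its streaming response. Hypothetical names throughout."""

    def __init__(self):
        self._index = {}

    @staticmethod
    def _key(messages):
        # Hash the full conversation history; a follow-up turn looks up
        # the entry recorded for the history it extends.
        blob = json.dumps(messages, sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()

    def lookup(self, prior_messages):
        # Hit -> P can pull existing KV from D; miss -> P computes it all.
        return self._index.get(self._key(prior_messages))

    def record(self, messages, kv_transfer_params):
        self._index[self._key(messages)] = kv_transfer_params
```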
The following diagrams highlight the changes required in the scheduler and KV connector components, the call sequence for the optimized flow, and the proxy caching flow.
Implementation Details
PR: #32553
Alternate solutions considered
Handle incremental prefill requests on decode node
While this approach appears to offer a straightforward mechanism for reusing the existing KV cache on decode nodes and eliminating redundant computation, it fundamentally undermines the architectural rationale for prefill-decode disaggregation. It precludes the system from using specialized prefill-node compute for subsequent prefill operations and introduces significant complexity on decode nodes, which must then accommodate both prefill and decode execution paths concurrently.
Feedback Period.
No response
CC List.
@robertgshaw2-redhat , @markmc
Any Other Things.
No response