[RFC]: [P/D] Prefill compute optimizations with bi-directional KV cache transfers between P and D nodes


Motivation.

Today, vLLM's P-D disaggregation architecture built on the Nixl KV connector assigns distinct KV-cache-handling roles to prefill (P) and decode (D) nodes; KV transfers are uni-directional, with the P node as producer and the D node as consumer. This constraint introduces inefficiencies in two critical scenarios, multi-turn conversations and cache-evicted prefill nodes, both of which force redundant recomputation of previously generated KV blocks during incremental prefill requests on P nodes.

1. Multi-turn conversations

Multi-turn conversational inference refers to sequential dialogue exchanges between end-users and LLM systems that maintain contextual coherence across interaction cycles. Implementation requires persistent conversation history management, wherein each subsequent user query is concatenated with relevant historical context: system prompts, prior user queries, and model-generated responses.

Optimal performance in non-disaggregated deployments is achieved through session persistence on LLM inference servers, enabling KV cache reuse across turns. However, P-D disaggregated architectures present a fundamental challenge: the KV cache corresponding to model-generated responses from previous turns resides exclusively on decode nodes, remaining inaccessible to prefill nodes (assuming absence of external shared KV storage infrastructure). The current architecture compels prefill nodes to recompute KV projections for all response tokens on each conversational turn, resulting in substantial computational waste on AI accelerator hardware. This inefficiency is particularly pronounced for reasoning-intensive models that generate extended response sequences.
Note: The reason for avoiding incremental prefill execution on the decode node is covered in the "Alternate solutions considered" section below.

2. Cache-evicted prefill nodes

In P-D disaggregated multi-turn conversational systems, the decode node primarily maintains the session state while prefill nodes do not, creating an elevated probability of cache eviction for active user sessions on prefill nodes. Consequently, upon a cache miss, prefill nodes are forced to recompute KV projections for the entire conversational context (system prompts, historical user queries, and prior model responses).

Both computational-redundancy scenarios can be mitigated by implementing bidirectional KV cache transfers between prefill and decode nodes, enabling mutual KV cache loading and eliminating unnecessary recomputation overhead.

Proposed Change.

This proposal introduces bidirectional KV cache transfer capabilities between prefill and decode nodes to optimize KV cache utilization in P-D disaggregated deployments and eliminate redundant prefill computations.

The implementation goes beyond simple KV connector state replication because of fundamental architectural differences between the node types: prefill nodes compute KV representations for new prompt tokens while reusing cached entries, and therefore provide complete prompt KV blocks, whereas decode nodes cannot (efficiently) generate complete KV representations and can only serve previously cached KV blocks. The high-level implementation details follow:

1. New API for Cache Query
Implement a lightweight vLLM frontend API for KV cache metadata retrieval, perhaps by adding a new request type such as cache_query_request and reusing the completions endpoint from the current architecture. This special request bypasses the prefill and decode execution paths, returning only KV block identifiers (cached blocks at block-size granularity) without computational overhead; a client-side sketch follows below.
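
To make the request shape concrete, here is a minimal client-side sketch. The cache_query_request flag, its placement inside kv_transfer_params, and the response fields are all assumptions of this proposal, not an existing vLLM API:

```python
# Hypothetical cache query over the existing completions endpoint.
# "cache_query_request" and the response shape are proposal-level
# assumptions, not a shipped vLLM API.
import requests

conversation_so_far = "system prompt + prior turns + new user query"

resp = requests.post(
    "http://prefill-node:8000/v1/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": conversation_so_far,
        "max_tokens": 0,  # metadata only; no tokens are generated
        "kv_transfer_params": {
            "cache_query_request": True,  # bypass prefill/decode execution
        },
    },
)
# Expected reply (illustrative): block ids cached on this node, at
# block-size granularity, e.g.:
# {"cached_block_ids": [1024, 1025], "num_cached_tokens": 32}
print(resp.json())
```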

2. Scheduler Integration for Cache Query Requests
Extend the vLLM scheduler to process cache query requests through the standard request pipeline, completing execution in a single pass immediately after KV cache metadata is retrieved from the KV cache manager. The model runner bypasses these requests because zero tokens are scheduled (num_scheduled_tokens=0), eliminating unnecessary computational graph execution; a sketch follows below.
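
A minimal sketch of that single-pass path; Request, get_cached_block_ids, and the scheduler shape are hypothetical stand-ins for the corresponding vLLM internals:

```python
# Single-pass scheduling for a cache query request (illustrative names).
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt_token_ids: list
    is_cache_query: bool = False
    output: dict = field(default_factory=dict)
    finished: bool = False

def schedule(request: Request, kv_cache_manager, block_size: int = 16) -> int:
    # Standard pipeline step: find how much of the prompt is already cached.
    cached_block_ids = kv_cache_manager.get_cached_block_ids(
        request.prompt_token_ids)
    num_cached_tokens = len(cached_block_ids) * block_size

    if request.is_cache_query:
        # Complete in a single pass: attach block metadata and schedule
        # zero tokens so the model runner bypasses the request entirely.
        request.output = {
            "cached_block_ids": cached_block_ids,
            "num_cached_tokens": num_cached_tokens,
        }
        request.finished = True
        return 0  # num_scheduled_tokens == 0 -> no model execution

    # Normal requests schedule the uncached remainder of the prompt.
    return len(request.prompt_token_ids) - num_cached_tokens
```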

3. Nixl KV Connector Enhancement for Remote Block Handling
Extend the Nixl KV connector to handle remote KV blocks on both prefill and decode nodes, updating the external and new token blocks accordingly. Once the local and remote block counts are updated, the remainder of the KV load sequence follows the same path on both prefill and decode nodes and reuses the existing KV load transfer function; see the sketch below.
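
A sketch of the symmetric reconciliation step; the names (remote_block_ids, schedule_nixl_read) are illustrative, but the logic mirrors the existing uni-directional load path:

```python
# Symmetric remote-block reconciliation, runnable on either node type.
def reconcile_remote_blocks(kv_transfer_params: dict,
                            num_local_computed_tokens: int,
                            block_size: int,
                            schedule_nixl_read) -> None:
    # Block ids the peer node reported via the cache query response.
    remote_block_ids = kv_transfer_params.get("remote_block_ids", [])
    num_remote_tokens = len(remote_block_ids) * block_size

    # Pull only the blocks the local cache does not already cover.
    if num_remote_tokens > num_local_computed_tokens:
        num_skip_blocks = num_local_computed_tokens // block_size
        # Reuses the existing NIXL transfer primitive; the direction
        # (P->D or D->P) depends only on which peer holds the blocks.
        schedule_nixl_read(remote_block_ids[num_skip_blocks:])
```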

4. KV Cache Manager Update
Update the KV cache manager to differentiate node-specific KV cache (local + dedicated external storage) from shared KV cache (external storage shared across distributed nodes), avoiding two-hop transfers among the prefill node, decode node, and the shared external storage node. This optimization also avoids two-hop transfers in the current uni-directional KV transfer architecture, and may be deferred to subsequent implementation phases; a sketch follows below.
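
A sketch of the tier distinction, assuming a hypothetical CacheTier tag on cached blocks:

```python
# Tier-aware path selection in the KV cache manager. CacheTier and the
# routing strings are assumptions introduced for illustration only.
from enum import Enum

class CacheTier(Enum):
    LOCAL = "local"    # node-local cache + dedicated external storage
    SHARED = "shared"  # external storage reachable by every P/D node

def pick_transfer_path(block_tier: CacheTier) -> str:
    if block_tier is CacheTier.SHARED:
        # Either node reads straight from shared storage, avoiding a
        # two-hop relay through the peer node.
        return "direct-read-from-shared-storage"
    # Node-specific blocks still move peer-to-peer over NIXL.
    return "peer-to-peer-nixl-transfer"
```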

The following diagram highlights (in RED) the changes required in the scheduler and KV connector components, along with the call sequence for the optimized flow.

[Diagram: scheduler and KV connector changes (highlighted in red) with the optimized call sequence]

Implementation Details

The cache query request and bi-directional KV transfer features have been implemented in the PR below by overloading the current KV connector states. The next step is to implement them with new, explicit states as a first-class feature.
PR: #32553

Alternate solutions considered

Handle incremental prefill requests on the decode node

While this approach appears to offer a straightforward mechanism for reusing existing KV cache from decode nodes and eliminating redundant computations, it fundamentally undermines the architectural rationale for prefill-decode disaggregation. This design choice precludes the system from utilizing specialized prefill node computational capabilities for subsequent prefill operations and introduces significant architectural complexity on decode nodes, which must then accommodate both prefill and decode execution paths concurrently.

Feedback Period.

No response

CC List.

@robertgshaw2-redhat , @markmc

Any Other Things.

No response
