Effective LLM prompt caching


Sharvanath Pathak

Most LLM APIs offer some form of prompt caching, which can reduce inference cost and latency quite significantly. However, many users do not fully utilize this feature. In this post, we cover some standard techniques that improve your prompt cache hit rate and, as a result, your overall efficiency. For a quick background, refer to the OpenAI prompt caching guide: https://openai.com/index/api-prompt-caching/.

Put the static prompt at the beginning

Since cache hits are based on prefix matches, it's important to place the static part of the prompt at the beginning, as much as possible.
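As a minimal sketch of this idea, the function below builds a message list with the unchanging system prompt first and the per-request content last. The system prompt text and parameter names are illustrative assumptions, not from any specific system:

```python
# Hypothetical static system prompt: identical across all requests.
STATIC_SYSTEM_PROMPT = "You are a knowledge assistant..."

def build_messages(user_query: str, retrieved_context: str) -> list[dict]:
    return [
        # Static part first: this prefix is the same on every request,
        # so it can be served from the prompt cache.
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},
        # Dynamic part last: this changes per request, so only this
        # suffix misses the cache.
        {"role": "user",
         "content": f"Context:\n{retrieved_context}\n\nQuestion: {user_query}"},
    ]

messages = build_messages("What is prompt caching?", "Docs snippet...")
```

The key point is simply ordering: anything that varies per request should come after everything that does not.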

Sort dynamic prompt sections, least recently updated first

It's common to include few-shot examples in the prompt, stored and assembled by the system. To make the best use of caching, sort these examples so the least recently updated ones come first:

System: "You are a knowledge assistant..."
User: "Describe details on example1..."
Assistant: "Response for example1..."
User: "Describe details on example2..."
Assistant: "Response for example2..."

If the examples are updated over time, it's best to keep some kind of recency counter. Sort the examples least recently updated first, so that the frequently changing ones sit toward the end of the prompt. Otherwise, if example1 at the front changes often, every example after it gets a cache miss, even though those examples themselves are stable.
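The sorting step can be sketched as follows. The example records and their `updated_at` timestamps are illustrative assumptions:

```python
# Hypothetical few-shot example store with a last-updated timestamp each.
examples = [
    {"name": "example1", "updated_at": 1_700_000_900},  # edited recently
    {"name": "example2", "updated_at": 1_700_000_100},  # stable
    {"name": "example3", "updated_at": 1_700_000_500},
]

def order_for_caching(examples: list[dict]) -> list[dict]:
    # Least recently updated first: rarely-changing examples form a
    # stable, cacheable prefix, so an edit to a volatile example only
    # invalidates the tail of the prompt.
    return sorted(examples, key=lambda e: e["updated_at"])

ordered = order_for_caching(examples)
```

Here the frequently edited example1 ends up last, so edits to it leave the cached prefix built from example2 and example3 intact.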

Use a consistent LLM shard

Prompt caches live on individual shards, so it helps to route related calls to the same shard. With OpenAI, setting the prompt_cache_key parameter routes requests with the same key to a consistent shard, giving a high cache hit rate. Make sure to use a distinct key for each specific caller; otherwise you risk overflowing one shard's cache.
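As a sketch, the helper below constructs the request parameters with a per-caller prompt_cache_key; the model name, key format, and caller names are illustrative assumptions, and the actual API call is left commented out:

```python
def build_request(caller: str, messages: list[dict]) -> dict:
    return {
        "model": "gpt-4o-mini",  # illustrative model choice
        "messages": messages,
        # One key per use case/caller: requests sharing a key are routed
        # to the same shard, keeping that shard's cache hot for this
        # caller's common prefix.
        "prompt_cache_key": f"kb-assistant-{caller}",
    }

request = build_request("search", [{"role": "user", "content": "hi"}])
# client.chat.completions.create(**request)  # actual network call omitted
```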

Measure the cache hit rate

For instance, with OpenAI every response includes a cached_tokens field under usage.prompt_tokens_details. It's best to aggregate these numbers and track a hit-rate metric for each specific use case/caller.
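A minimal sketch of such a metric, assuming responses expose usage as a dict mirroring OpenAI's field names (the caller labels and sample numbers are made up):

```python
from collections import defaultdict

# Per-caller running totals of cached vs. total prompt tokens.
totals = defaultdict(lambda: {"cached": 0, "prompt": 0})

def record_usage(caller: str, usage: dict) -> None:
    # cached_tokens counts the prompt tokens served from the cache.
    totals[caller]["cached"] += usage["prompt_tokens_details"]["cached_tokens"]
    totals[caller]["prompt"] += usage["prompt_tokens"]

def hit_rate(caller: str) -> float:
    t = totals[caller]
    return t["cached"] / t["prompt"] if t["prompt"] else 0.0

# Illustrative usage payload: 750 of 1000 prompt tokens were cached.
record_usage("search", {"prompt_tokens": 1000,
                        "prompt_tokens_details": {"cached_tokens": 750}})
```

Tracking this per caller makes regressions visible, e.g. when a prompt change accidentally moves dynamic content ahead of the static prefix.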