Show HN: Butter, a muscle memory cache for LLMs
docs.butter.dev

Hi HN, Erik here. Today we launch Butter, an OpenAI-compatible API proxy that caches LLM generations and serves them deterministically on revisit.
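To give a sense of the integration, here's a rough sketch of the client swap with the standard OpenAI Python SDK. The base URL below is a placeholder rather than Butter's actual endpoint (that's in the docs), and how auth is passed through is an assumption:

    from openai import OpenAI

    # Point the standard OpenAI SDK at the Butter proxy instead of api.openai.com.
    # The base URL is a placeholder; use the endpoint from docs.butter.dev.
    client = OpenAI(
        base_url="https://<your-butter-endpoint>/v1",
        api_key="sk-...",  # assumption: your usual provider key is passed through
    )

    # The first call goes upstream and the generation is cached; a structurally
    # similar request later can be served deterministically from that cache.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Click the 'Submit' button on the checkout page"}],
    )
    print(resp.choices[0].message.content)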
Since April, we’ve been working on this concept of “muscle memory,” or deterministic replay, for agent systems performing automations. You may recall our first post in May, launching a Python package called Muscle Mem: https://news.ycombinator.com/item?id=43988381
Since then, the product has evolved entirely, now taking the form of an LLM proxy. For a deep dive into this process, check out: https://blog.butter.dev/muscle-mem-as-a-proxy
The proxy’s killer feature is being template-aware, meaning it can reuse cache entries across structurally similar requests. Inducing variable structure from context windows is no easy task, which we cover in a technical writeup here: https://blog.butter.dev/template-aware-caching
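To illustrate what "template-aware" means, here's a toy sketch of the idea; this is not how Butter actually induces templates (its structure is learned from observed traffic, per the writeup above), just a picture of how structurally similar requests can collapse to one cache entry:

    import hashlib
    import re

    # Toy illustration only: replace the spans that vary between requests with
    # placeholders, then key the cache on the normalized template.
    VARIABLE_PATTERNS = [
        (re.compile(r"\b\d+\b"), "<NUM>"),   # bare numbers
        (re.compile(r'"[^"]*"'), "<STR>"),   # quoted strings
    ]

    def template_key(messages):
        parts = []
        for m in messages:
            text = m["content"]
            for pattern, placeholder in VARIABLE_PATTERNS:
                text = pattern.sub(placeholder, text)
            parts.append(f'{m["role"]}:{text}')
        return hashlib.sha256("\n".join(parts).encode()).hexdigest()

    # Both requests normalize to the same template, so the second one can be
    # answered from the cached generation of the first.
    a = [{"role": "user", "content": 'Fill the field "email" with "a@b.com"'}]
    b = [{"role": "user", "content": 'Fill the field "name" with "Erik"'}]
    assert template_key(a) == template_key(b)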
The proxy is currently open-access and free to use so we can quickly discover and work through a slew of edge cases and template-induction errors. There’s much work to be done before it’s technically sound, but we’d love to see you take Butter for a spin and share how it went, where it breaks, if it’s helpful, if we're going down a dead end, etc.
Cheers!

I've known Erik for a while now, simply an incredible founder. Doing this as a simple API proxy makes it practically effortless to integrate into existing systems: just a URL swap and you're good to go. Then it's just a matter of watching the cache hit rate go up! Exciting to see a product like this launch!

There are obviously a host of "memory" solutions out there that try to integrate in fancy ways to cache knowledge / save tokens, but I think there's a beauty in the simplicity of just having a proxy over the OpenAI endpoint. Interested to see where this goes!

An interesting alternative product to offer would be injecting prompt-cache tokens into requests where they could be helpful; that doesn't bypass generations, but it's at least low-hanging fruit for cost savings.

Are you able to walk through a specific use case or example in detail? I'm not yet totally grokking what exactly Butter is going to do.

I've got a blog post on this from the launch of Muscle Mem, which should paint a better picture:
https://erikdunteman.com/blog/muscle-mem

Computer-use agents (as an RPA alternative) are the easiest example to reach for: UIs change, but not often, so the "trajectory" of click and key-entry tool calls is mostly fixed over time and worth feeding back to the agent as a canned trajectory. I discuss the flaws of computer use and RPA in the blog post above.

A counterexample is coding agents: it's a deeply user-interactive workflow reading from a codebase that's constantly evolving, so the set of things the model is inferencing on is always different and trajectories are never repeated.

Hope this helps.

Still not clear: the tool calls come from the model, so what exactly is being cached by Muscle Memory? Also, it'd be great if you could clarify the value proposition; many folks will be even less patient than I am. Best of luck!

Logged back in to HN to comment on this. Looks really sick. I've been saying for a while that a surprising amount of LLM inference really comes down to repetition down a known path. It's good to see others have noticed this problem and are working to make things more efficient. Excited to see where this goes.

This is awesome, Erik! Excited to see this launch. Definitely fixes some issues we had while building pure CopyCat.

Looks pretty cool! How would you integrate this into production agent stacks like LangChain, AutoGPT, or even closed-loop robotics?

Thanks! For LangChain you can repoint your base_url in the client. AutoGPT I'm not as familiar with. Closed-loop robotics using LLMs may be a stretch for now, especially since vision is a heavy component, but in theory the patterns baked into small language models running on-device, or hosted LLMs in higher-level planning loops, could be emulated by a Butter cache if observed in high enough volume.

For AutoGPT, there is the option to set a llamafile endpoint, which follows the Chat Completions API. So, theoretically, you should be able to use that to point to Butter's LLM proxy.
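To make the base_url repointing concrete, here's a minimal sketch using langchain-openai's ChatOpenAI (parameter names as in recent langchain-openai versions). The proxy URL is a placeholder and the auth handling is an assumption; check Butter's docs for the real endpoint:

    from langchain_openai import ChatOpenAI

    # Same idea as swapping base_url in the plain OpenAI SDK: only the endpoint changes.
    llm = ChatOpenAI(
        model="gpt-4o-mini",
        base_url="https://<your-butter-endpoint>/v1",  # placeholder
        api_key="sk-...",  # assumption: provider key passed through; check the docs
    )

    print(llm.invoke("Open the settings page and enable dark mode").content)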
After my time building computer-use agents, I’m convinced that the hybrid approach of Muscle Memory is the only viable way to offer 100% coverage on an RPA workload.

100% coverage of what?