Ask HN: Devs using LLMs, how are you keeping costs low for LLM calls locally?
My project has a multi-step LLM flow using gpt-4o.
While developing new features and testing locally, the LLM flow runs frequently and burns through a lot of tokens, so my OpenAI bill spikes.
I've made some efforts to stub LLM responses, but it adds a decent bit of complexity and work. I don't want to run a model locally with Ollama because I need the output to be high quality and fast.
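To be concrete, my stubbing so far looks roughly like the sketch below (simplified; the STUB_LLM env var, step names, and canned responses are just placeholders for illustration):

    import os
    from openai import OpenAI

    client = OpenAI()

    # Canned responses keyed by flow step; the contents are placeholders.
    FAKE_RESPONSES = {
        "extract": '{"entities": []}',
        "summarize": "Canned summary for local dev.",
    }

    def call_llm(step: str, messages: list[dict]) -> str:
        """Return a stubbed response locally, or hit gpt-4o otherwise."""
        if os.getenv("STUB_LLM") == "1":
            return FAKE_RESPONSES.get(step, "stubbed response")
        resp = client.chat.completions.create(model="gpt-4o", messages=messages)
        return resp.choices[0].message.content

Keeping the canned responses realistic enough for the downstream steps is where most of the extra work comes from.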
Curious how others are handling similar situations.

Use The Cache, Luke... LangChain examples:
[1] Caching https://python.langchain.com/v0.1/docs/modules/model_io/llms...
[2] Fake LLM https://js.langchain.com/v0.1/docs/integrations/llms/fake/

Options if I'm not using LangChain?

Cloudflare AI Gateway [0]
[0] https://developers.cloudflare.com/ai-gateway/configuration/c...

Options if I'm not using Cloudflare?

There's an open-source AI gateway - https://github.com/Portkey-AI/gateway

You can do this easily with object caching / function memoization patterns in any modern language, which should fit your desired solution (see the sketch at the end of the thread). Best of luck!

Here's a mega guide on keeping costs low with LLMs - https://portkey.ai/blog/implementing-frugalgpt-smarter-llm-u... tl;dr:
- Keep prompts short; either combine prompts, or write more detailed prompts and send them to a smaller model
- Simple and semantic cache lookups
- Classify tasks and route each one to the best-fit LLM using an AI gateway

Portkey.ai could help with a lot of this.

Came across this guide earlier - valuable insights. Thanks for sharing!
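To make the memoization suggestion above concrete, here's a minimal sketch of a disk cache keyed by a hash of the request, with no framework or gateway involved. It assumes the official OpenAI Python SDK; the cache directory and function name are just illustrative:

    import hashlib
    import json
    from pathlib import Path

    from openai import OpenAI

    client = OpenAI()
    CACHE_DIR = Path(".llm_cache")  # illustrative location; add it to .gitignore
    CACHE_DIR.mkdir(exist_ok=True)

    def cached_chat(model: str, messages: list[dict], **kwargs) -> str:
        """Return a cached completion if this exact request was seen before."""
        key = hashlib.sha256(
            json.dumps({"model": model, "messages": messages, **kwargs},
                       sort_keys=True).encode()
        ).hexdigest()
        cache_file = CACHE_DIR / f"{key}.json"
        if cache_file.exists():
            return json.loads(cache_file.read_text())["content"]

        resp = client.chat.completions.create(model=model, messages=messages, **kwargs)
        content = resp.choices[0].message.content
        cache_file.write_text(json.dumps({"content": content}))
        return content

Exact-match caching like this only pays off when the same prompts repeat verbatim across dev runs; for near-duplicate prompts you'd want the semantic caching the gateways above provide.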