I don't know much, so I'm still learning: is tool calling only done server-side?
I also see that I truly don't understand the cache topic.
Here is an explanatory video about LLM harnesses (or "Copilots"), which I think may be helpful to you, since it seems you are building your own version of Copilot.
How does Claude Code actually work?
In summary, a tool call is a text-based agreement established between you and the LLM through the system prompt. When the model "wants" to make a tool call, it emits the specific output pattern you defined, then generates end_of_conversation. Your application parses the LLM's output by detecting that pattern, extracting the name of the tool it intends to call (e.g., with a regex, then switching on the extracted string), invoking that tool on the user's computer, and appending the result to the LLM's original response.
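The parse-and-dispatch step could be sketched like this. The `<tool>…</tool><args>…</args>` pattern and the `read_file` tool are illustrative assumptions, not any real harness's format; the point is just regex extraction followed by switching on the tool name:

```python
import re

# Hypothetical tool-call pattern the system prompt asks the model to emit,
# e.g. <tool>read_file</tool><args>notes.txt</args>
TOOL_CALL_RE = re.compile(r"<tool>(\w+)</tool><args>(.*?)</args>", re.DOTALL)

def read_file(path: str) -> str:
    # Illustrative tool: runs locally, on the user's machine.
    with open(path, encoding="utf-8") as f:
        return f.read()

# The "switch" on the extracted string: tool name -> local function.
TOOLS = {"read_file": read_file}

def parse_and_run(model_output: str):
    """Return (tool_result, matched) for the first tool call found, if any."""
    m = TOOL_CALL_RE.search(model_output)
    if m is None:
        return None, False  # no tool-call pattern: the conversation can end
    name, args = m.group(1), m.group(2)
    return TOOLS[name](args), True
```

If `matched` is False, the model produced a plain answer and the loop stops; otherwise the result gets appended to the context for the next turn.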
The entire context remains intact: all history since the conversation began, including the system prompt and every previous exchange, plus the tool-call result you appended. This updated context is sent back to the LLM's API endpoint. The unchanged prefix triggers caching (billed as cached input tokens), while the newly appended tool result is billed as regular input tokens. The LLM then generates the next segment of text (billed as output tokens), and the process repeats until the model generates an end_of_conversation signal without emitting the tool-call pattern.
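The whole loop described above can be sketched in a few lines. `call_llm` stands in for whatever API client you use (an assumption, not a real SDK call), and the `<tool>…</tool>` / `<tool_result>…</tool_result>` markup is the same hypothetical pattern as before; the key idea is that the full, ever-growing context is resent on every turn:

```python
import re

# Same hypothetical pattern as in the parsing sketch above.
TOOL_RE = re.compile(r"<tool>(\w+)</tool><args>(.*?)</args>", re.DOTALL)

def agent_loop(context: str, call_llm, tools) -> str:
    """Repeat: call model -> parse reply -> run tool -> append result,
    until the model replies without the tool-call pattern.

    Billing per turn (per the explanation above): the unchanged prefix is
    billed as cached input, the newly appended part as fresh input tokens,
    and the model's reply as output tokens.
    """
    while True:
        reply = call_llm(context)   # the whole history goes in every time
        context += reply            # the model's own output stays in context
        m = TOOL_RE.search(reply)
        if m is None:
            return context          # no tool call: the conversation ends
        result = tools[m.group(1)](m.group(2))
        # Append the tool's result so the model sees it on the next turn.
        context += f"\n<tool_result>{result}</tool_result>\n"
```

Note that `context` only ever grows; nothing is rewritten in place, which is exactly what makes the prefix cacheable between turns.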