I have been skeptical about the capabilities of AI. I have always been more willing to point out the limitations and insufficiencies of large language models than to celebrate their impressive features. While I would like to keep looking at any tool with a sober eye, weighing pros against cons, and arguing only once I have a complete picture of what it is I am arguing about, I am writing this post to recognize something good that I have observed about AI tools recently. And that good is the use of external tools by the AI.
AI vs. LLM
I would like to point out that I am using the word “AI” rather than “LLM” in this context, because an AI is a program, or even an entire platform, that uses an LLM as its core component. There are, however, many more features embedded into AI solutions:
- Agents and Subagents - not an LLM feature, but rather a “multiplexing” of LLMs, orchestrated by a process that passes the outputs of LLM subprocesses into each other.
- Optical Character Recognition (OCR) - processing of visual input by a program, which in turn passes its output to the LLM.
- CLI Tool Usage - the LLM generates commands to run, but it does not run the commands itself.
These are just a few features that come to mind that are not part of the predictive model’s behavior. Modern AI solutions are jam-packed with entire suites of tools, programs, and processes that enhance the function of the AI in some way. External tool usage is one of the features that add more value to the AI workflow.
Non-determinism of LLMs
LLMs are by nature non-deterministic prediction models: they look at everything in their context and everything they have been trained on, and output the most likely next token. Then the cycle repeats until the model predicts the end of the output. If you have used an AI tool, you have probably observed that asking it the same question twice will produce two different responses that are similar in meaning.
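To make that token-by-token sampling loop concrete, here is a toy Python sketch. The vocabulary and probabilities are invented for illustration only - a real LLM computes a distribution over tens of thousands of tokens from its weights and the full context:

```python
import random

# Toy next-token distribution; the tokens and probabilities here are
# made up for illustration, not taken from any real model.
VOCAB = {"the": 0.4, "a": 0.3, "this": 0.2, "<end>": 0.1}

def sample_tokens(rng, max_len=5):
    """Sample tokens one at a time until the end-of-output token
    is predicted (or a length cap is hit)."""
    out = []
    for _ in range(max_len):
        token = rng.choices(list(VOCAB), weights=list(VOCAB.values()))[0]
        if token == "<end>":
            break
        out.append(token)
    return out

# Two runs over the same "prompt" can diverge, because each step
# samples from a probability distribution instead of returning a
# single fixed answer.
run1 = sample_tokens(random.Random(1))
run2 = sample_tokens(random.Random(2))
print(run1, run2)
```

The non-determinism lives entirely in the sampling step: the distribution itself is fixed, but which token comes out of it is not.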
In some contexts, non-deterministic output is a feature, not a bug. It is a key reason why AI responses often feel “human-like”: humans do not respond with pure function outputs - the same inquiry does not produce identical output every time it is made.
There are contexts, however, in which more determinism is necessary. Software engineering is one of those domains. Non-deterministic output is often the line between functioning and non-functioning code. We want to make sure that the code compiles and works as expected. That is where deterministic output is desirable.
I wrote a post recently about Anthropic suggesting that developers use Claude as a linter. That is an example of where I do not want to use an LLM: it will not only use more resources than a linter would, it will also have a chance of producing output different from what I expect. I want the output to always be predictable based on the behavior of the program and the provided configuration.
When we need predictable output, we should use the tools that afford us such clarity in execution.
First of all, we can use CLI tools or API calls that behave like a traditional program would - deterministically. grep will only ever match the provided patterns in the processed text. find will only locate files that exist and match the provided criteria. cat will always produce the contents of the provided file. None of these tools will ever produce a different output for the same set of inputs. If I were to train an LLM to do the job of grep or cat, it would hallucinate most of the time trying to predict the contents of the file. With tools like grep and cat, what I’m looking for is a true result, not a prediction of what is most likely to be the true output.
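The determinism of these tools is easy to verify. A minimal Python sketch, assuming a POSIX-like environment with grep on the PATH - the file name and contents are made up for the example:

```python
import subprocess

# Write a small sample file, then run the exact same grep command twice.
# Assumes grep is available on the PATH.
with open("sample.txt", "w") as f:
    f.write("alpha\nbeta\nalphabet\n")

cmd = ["grep", "alpha", "sample.txt"]
first = subprocess.run(cmd, capture_output=True, text=True)
second = subprocess.run(cmd, capture_output=True, text=True)

# Identical inputs produce identical outputs, every single time -
# no prediction involved, just pattern matching.
assert first.stdout == second.stdout == "alpha\nalphabet\n"
```

An LLM asked "which lines of sample.txt contain alpha?" might answer correctly most of the time; grep answers correctly all of the time.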
A second approach is to have the LLM write a script, which it then executes to get an exact answer. While working on several projects with Claude Code, I noticed that from time to time it would generate a quick script to validate something, and then run it. One of those tasks was validating JSON syntax. Whenever it needed to ensure that the files it had modified were syntactically correct, it would generate a script in either Python or Node, and then execute it. The sample Python script looked like this:
⏺ Bash(python3 -c "import json, sys; json.load(open('./.prds/prd-phase-1.json')); print('JSON is
valid')" 2>&1)
⎿ JSON is valid
When Claude felt like using Node instead of Python for validation, it would produce this code:
⏺ Bash(node -e "JSON.parse(require('fs').readFileSync('./.prds/prd-phase-1.json', 'utf8'));
console.log('JSON is valid')" 2>&1)
⎿ JSON is valid
In both instances, after Claude updated the PRD file for the feature, it ensured that the syntax was still correct. I honestly have no idea what determined the choice of language. It seemed to choose Python at one point, and then use Node in the same session to do the same job.
And while even this example shows how differently an LLM responds to the same query, it ends up producing a more reliable result. Generating a script to validate JSON is not a difficult task; I imagine LLMs have been trained on thousands of examples of this exact functionality. What is more important here is that the generated code has to run and produce an output, which is then fed back to the LLM. This approach is far more reliable than handing that JSON file to the LLM and having it predict the syntactically correct output - which would not only risk hallucination, but also consume more tokens in the process.
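That run-and-feed-back loop can be sketched roughly like this in Python. The file names are hypothetical, and a real agent would place the returned tuple back into the model's context rather than printing it:

```python
import json
import subprocess
import sys

def validate_json(path):
    """Run a throwaway validation script in a subprocess - the way an
    agent does - and return what would be fed back to the model."""
    script = f"import json; json.load(open({path!r})); print('JSON is valid')"
    result = subprocess.run([sys.executable, "-c", script],
                            capture_output=True, text=True)
    # Exit code 0 means the file actually parsed; otherwise stderr
    # carries the parser's exact error message, not a guess at one.
    return result.returncode, (result.stdout or result.stderr).strip()

# Hypothetical files, created here just for the demonstration.
with open("good.json", "w") as f:
    json.dump({"phase": 1}, f)
with open("bad.json", "w") as f:
    f.write("{broken")

print(validate_json("good.json"))   # (0, 'JSON is valid')
print(validate_json("bad.json"))    # non-zero code plus the parse error
```

The value is in the ground truth: the model does not have to be right about the JSON, it only has to read the verdict the interpreter produced.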
A good direction for AI
I believe this is a good direction for AI solutions to take. Introducing more deterministic tools creates more predictable output, which can be critical in contexts where precise inputs or outputs are necessary. Tools and scripts greatly enhance the AI workflow by providing more reliable results.
This trend could go further in making AI outputs predictable: having the AI call public APIs to get the latest relevant data, making the LLM more intentional about asking the user whether there are specific tools it could use in its workflows, and, instead of generating script code, giving it access to a verified dataset of commonly used development scripts that it could refer to, fetch, and execute on demand.
There are still many use cases where non-deterministic output is a great thing. For example, I don’t want all of my web or mobile prototypes to look the same; I want variety in design and UX. That is something I don’t mind AI improvising on.
I like that I have started seeing more of this behavior in Claude recently, and I hope it continues down that road - predicting the output where a more “creative” approach is desirable, and relying on tried and tested tools and scripts for the tasks that require a precise true-or-false output.