

Building software on top of an LLM is hard, but not that hard

Concrete advice for the proper care and feeding of your stochastic parrot.


There’s so much noisy discourse about AI chatbots, image generator slop, trillion-dollar server farms, and the impending dawn of superintelligence that I think it’s easy to forget just how neat LLMs already are—at least, to me. We taught a long chain of matrix multiplications to talk like people! Not only can they hold a half-decent conversation, but we can actually use that simulated thought to solve certain classes of simple problems that were previously just “not something computers can do”! That’s completely nuts!

Because LLMs are so fascinating to me as a technology, and because they seem so naturally suited to the problem space of product copy (which is what I work on at Ditto), I had been itching for a reason to build something on top of a language model. Okay, yes, this is the classic software engineer trap of “if only I could find a problem for this solution,” but I’ll spoil the ending by saying this was a really good idea, and we built a pretty cool feature. However, we didn’t get to a feature without learning some meaningful lessons that weren’t totally intuitive to me up front; hopefully, I can save someone else some time.

You need to be doing more manual testing than you think

Actually, more than that. Nope, sorry—even more!

Working with truly non-deterministic outputs is extremely foreign to a traditional software development workflow. Sure, certain tech stacks can make you feel like you’re playing a cursed slot machine, but I promise you no race condition or CSS stacking issue compares to the latent possibilities of human language! Even if you think you know exactly how inconsistent and strange LLM output can be, you’re likely underestimating with respect to your specific problem space.

I went into this project knowing ahead of time that models be crazy, preparing to do a lot of evals; what I didn’t realize was how important it would be to develop an extremely well-honed sense of taste for your system’s outputs across many, many possible inputs. Testing for early feasibility and getting a preliminary idea of good vs. bad was pretty quick, and gave me a false sense of confidence. We ended up needing to do a ton of iteration on our system prompt + default examples in the final week leading up to shipping our product because we’d overestimated output quality based on our very limited set of early test inputs.

Building internal test tools pays huge dividends

One of the best decisions I made was to spin up a little web app very early on that we could use to test the feasibility of our idea. This let our designer, and anyone else who’s not always in the codebase, iterate on things like system prompts and possible inputs, and simply try the same thing over and over to see how much variation we were getting. Throughout the project, I added new options and tools to this app as we needed them, and it was particularly critical for building the sense of taste I describe above; generating a massive grid of 50 responses and scanning them was dramatically quicker than testing from within the product.
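To give a concrete flavor of what that tool does under the hood, here’s a minimal sketch of the “grid of responses” idea, assuming the google-genai Python SDK; the system prompt, test input, and sample count are placeholders, not our real setup.

```python
# Sketch: fire the same prompt N times and dump the outputs side by side,
# so anyone can scan for variation. Prompt contents here are illustrative.
from google import genai
from google.genai import types

client = genai.Client()  # picks up the API key from the environment

SYSTEM_PROMPT = "You are a product-copy assistant..."  # placeholder
TEST_INPUT = "Submit order"                            # placeholder

def generate_grid(n: int = 50) -> list[str]:
    outputs = []
    for _ in range(n):
        response = client.models.generate_content(
            model="gemini-2.5-flash-lite",
            contents=TEST_INPUT,
            config=types.GenerateContentConfig(
                system_instruction=SYSTEM_PROMPT,
                temperature=1.0,  # match whatever sampling settings production will use
            ),
        )
        outputs.append(response.text)
    return outputs

if __name__ == "__main__":
    for i, text in enumerate(generate_grid(), start=1):
        print(f"--- {i} ---\n{text}\n")
```

Wrapping something like this in a tiny web UI is what let non-engineers poke at prompts without touching the codebase.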

Bonus note: you know who’s reeeeally good at building one-off / non-customer-facing software tools astonishingly quickly? I’ll give you one guess!

Early prototypes are even more important than usual

You could write this about any form of software development, I think, but it’s true ten times over with a product that has non-deterministic foundations. Test harnesses and manual evals can take you a long way, but there are myriad edge cases and weird hangups that you simply will not encounter until you’re actually using your text generation service in something that at least comes close to a real product.

LLM APIs are really, really cheap

Two months ago, I would’ve thought this was an actively controversial take; it might still be, honestly. At some point, I’d gotten the idea in my head that running any kind of inference-based product was going to be really, really expensive. You hear about OpenAI melting GPUs and vibe-coders getting surprise $100k invoices, and if you’re me, you start to get a little nervous about the feasibility of your project. I am very unused to having to think about truly marginal costs in software, and the safe engineer tendency is to plan for the worst: what happens if we get 100x—or 10,000x—the usage we’re expecting?

We use Gemini 2.5 Flash Lite, which currently (8/16/25) costs $0.10/1M input tokens and $0.40/1M output tokens. In two months developing and testing our feature, plus a month of the feature being widely available, our GCP bill came to a grand total of $25.46.
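For a back-of-envelope sense of what those rates mean per request, here’s a quick sketch; the 10,000-input / 500-output token counts are illustrative round numbers in the ballpark of our larger payloads, not measured figures.

```python
# Rough per-request cost at Gemini 2.5 Flash Lite pricing (as of 8/16/25).
# Token counts below are illustrative, not measured.
INPUT_PRICE_PER_M = 0.10   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 0.40  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

print(request_cost(10_000, 500))            # 0.0012 -> about an eighth of a cent
print(100_000 * request_cost(10_000, 500))  # 120.0 -> ~$120 for 100k such requests
```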

I don’t wanna be totally careless with money, but let’s be honest: that is literally nothing relative to the costs of running even a small software company. Frankly, it’s probably too little spend; as I mentioned above, I wish we had been running way more, and larger, tests earlier in the development process.

Now, there are some caveats. Our use case means sending fairly small payloads—p95 less than 10,000 tokens, p50 under 1000. It’s also important to note that we’re a company with a relatively small userbase, compared to large B2C products, and there’s less of a possibility of a viral moment; if I were truly anticipating the possibility of 100x users tomorrow, I would probably have had to rethink the foundations here, or at least sit my CEO down for a quick chat about finances.

LLM-as-judge is a quick win in many cases

Influenced by both fears of exploding costs and worries about multiplying the non-deterministic nature of our feature, I initially dismissed the idea of submitting the LLM-generated output back to another LLM to assess it for quality.

This was dumb! As it turns out, if you have the prompt + contextual data required to get anything close to good output from your generations, you almost certainly have all the context you need to have an LLM constructively critique those generations.

The route I found most productive was giving the judge LLM a long, explicit prompt about how to judge properly, including a scoring guide plus instructions to return a “quality” score between 0 and 100.

rest of prompt

...

QUALITY SCORING GUIDANCE:
- 90-100: Essential correction that clearly improves the text
- 70-89: Helpful improvement that aligns well with rules
- 50-69: Minor improvement with questionable necessity
- 30-49: Unnecessary change that doesn't add value
- 0-29: Harmful change that actively worsens the text

Then, we just reject any suggestions with a quality below 50. This is not a perfect filter by any means; it probably removes some outputs we’d actually like to see. But for our particular product needs, it’s worse to return a suggestion that sucks than to return nothing at all.
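Mechanically, this judge-then-filter step is just a second generation call plus a threshold check. Here’s a minimal sketch assuming the google-genai SDK; the judge prompt placeholder, the JSON shape, and the helper names are illustrative rather than our production code.

```python
# Sketch of the judge-then-filter step: hand each generated suggestion to a
# second LLM call with a scoring rubric, then drop anything scored below 50.
import json
from google import genai
from google.genai import types

client = genai.Client()

JUDGE_PROMPT = """You are reviewing a suggested edit to product copy.
...rest of judging instructions, including the quality scoring guidance above...
Respond with a JSON object: {"quality": <0-100 integer>, "reason": "<string>"}"""

QUALITY_THRESHOLD = 50

def judge_quality(original: str, suggestion: str) -> int:
    response = client.models.generate_content(
        model="gemini-2.5-flash-lite",
        contents=f"ORIGINAL:\n{original}\n\nSUGGESTION:\n{suggestion}",
        config=types.GenerateContentConfig(
            system_instruction=JUDGE_PROMPT,
            response_mime_type="application/json",
        ),
    )
    return int(json.loads(response.text)["quality"])

def filter_suggestions(original: str, suggestions: list[str]) -> list[str]:
    # Worse to return a bad suggestion than nothing at all, so err toward dropping.
    return [s for s in suggestions if judge_quality(original, s) >= QUALITY_THRESHOLD]
```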

Many people will tell you—or at least, various blog posts told me—not to do this, because “LLMs don’t know math” or something like that. For one, that’s not nearly as true these days as it once was; ask a frontier model what 22+53 is and you’ll get the right answer. For another, the judging task here isn’t really math at all: language models are great with associative relationships, and the numeric score is just a convenient way for them to express that judgement.

I didn’t end up giving the judge any explicit examples of judgement. I tried a few, but found that early results were too easily polluted by specific details from the examples I provided. That leads me to…

Be careful with your examples

One of the earliest pieces of conventional LLM-prompting wisdom I remember seeing was “give examples.” This is, in fact, good advice, but it’s important to keep in mind the generality of both your examples and your overall prompt. LLMs will index heavily on any examples you provide them, and it’s easy to poison your results with a single bad example. In our case, we found that about half of our examples had typographic quotes (’) and half had standard quotes (')—leading to suggestions that constantly flip-flopped back and forth between the two.
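One cheap guardrail against this kind of self-inflicted inconsistency is to normalize formatting details like quote characters in your examples before they ever reach the prompt. Here’s a small, purely illustrative sketch of that idea; which convention you normalize toward matters less than every example agreeing.

```python
# Sketch: normalize quote characters in few-shot examples before building the
# prompt, so the model can't learn a spurious preference from inconsistent
# formatting. Mapping everything to straight quotes is an arbitrary choice.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})

def normalize_example(text: str) -> str:
    return text.translate(QUOTE_MAP)

examples = ["Don’t miss out!", "Don't miss out!"]
print([normalize_example(e) for e in examples])  # both end up as "Don't miss out!"
```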

Epilogue

One of the infuriating things about building software is that by the end of a project, you wish you could do the whole thing over again with all the new stuff you learned. One of the amazing things about building software is that it’s malleable enough to actually do that.

We’ve already iterated on our system prompt and default context a number of times since launch, and we’re gathering tons of great data on usage, which should let us refine the product as we go—and that’s before we get into anything like fine-tuning a model for our own use cases.

It’s a very exciting time to be building software that has anything to do with words—while LLMs are not a cure-all, they’re pretty damn neat. Get out there and play around!

P.S.: if you’re struggling to manage product copy at scale, I highly recommend you check out our product at Ditto!

P.P.S.: if the process of building this sounds interesting to you, you might be interested in working with me—we’re hiring!