Show HN: Litmus – Specification testing for structured LLM outputs

github.com

1 point by lukecarr 18 days ago · 0 comments

Over the holidays, I've been working on a small side-project that involves some LLM prompting from the end user. Admittedly, I struggle to keep track of the latest and greatest models, and I've also never bothered to read up on "prompt engineering," so I built a little testing utility to solve both of these problems at once.

Enter Litmus. I'm pitching it as "specification testing" for LLMs. You define test cases (input prompt -> expected JSON output), along with your system prompt and a structured output schema (JSON Schema). All of this gets chucked at OpenRouter, and you get some nice terminal output summarising the test results (with a per-field breakdown for any failing cases) so you can see how well the model performed.
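
To give a rough idea of the shape of a spec (the field names below are hypothetical — the actual file format is documented in the repo), a minimal one might look something like:

```json
{
  "system_prompt": "Extract the event details from the user's message.",
  "schema": {
    "type": "object",
    "properties": {
      "title": { "type": "string" },
      "date": { "type": "string", "format": "date" }
    },
    "required": ["title", "date"]
  },
  "cases": [
    {
      "input": "Dentist appointment on 14 Jan",
      "expected": { "title": "Dentist appointment", "date": "2025-01-14" }
    }
  ]
}
```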

Although it's framed as an LLM testing tool, it also serves as a model comparator. You can pass the `--model` CLI argument multiple times to run the test cases against multiple models; a comparison table at the end of the output lets you evaluate latency, throughput, token usage, and accuracy (tests passing vs. failing).
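
For example (the model IDs and the spec-file argument here are illustrative — see the README for the exact invocation):

```sh
# Compare two OpenRouter models against the same spec
litmus --model openai/gpt-4o-mini --model anthropic/claude-3.5-haiku spec.json
```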

The GitHub README contains a full example output of what a test report from Litmus looks like.

With this, I've managed to whittle my side-project's system prompt down to the point where accuracy is acceptable and it doesn't burn an exorbitant number of tokens. I've also found out, through model comparison, that I didn't need anywhere near as large a model as I had originally envisioned.

You can grab it on GitHub as a single-file, zero-dependency executable (written in Go). Admittedly, I've not tested the pre-built binaries that are created via GitHub Actions, but there's no reason why they shouldn't work.
