Eval maxim: To grade outputs, people need to externalize and define their evaluation criteria; however, the process of grading outputs helps them define those very criteria.
It’s almost a year since Shreya Shankar, J.D. Zamfirescu-Pereira, and I dropped our paper “Who Validates the Validators?” about generating LLM evals aligned to human preferences. In that time, it’s garnered a lot of attention online. It has directly influenced new features in LLMOps platforms, including LangSmith, Autoblocks.ai, and Chroma’s new generative benchmarking feature, as well as Eugene Yan’s more home-grown solution. As developers continue to wrestle with LLM evals, people keep finding our insights relevant, especially the notion of “criteria drift.”
Yet our EvalGen prototype wasn’t released. (OK, technically it was, but you had to be hardcore enough to pull ChainForge’s repo and run a branch in dev mode... good luck with that!)
Today, I’m proud to say the wait is over—EvalGen is finally released to ChainForge’s main branch.
This version of EvalGen in ChainForge is better than the one in our paper, and arguably better than any other interface out there for this task. It addresses several points of feedback from our user study:
- Developers wanted to grade on a per-criterion basis (more than just an overall thumbs up/down);
- Developers wanted to move back and forth between different screens, changing criteria definitions at any point, improvisationally;
- Developers benefited from being forced to give feedback on a few responses first (a cognitive forcing function to engage with the data) before proceeding to define their evaluative criteria;
- Developers wanted to see exactly how many LLM calls were being sent off.
EvalGen makes it easy to get started writing evals. Instead of racking my brain or fiddling in code, EvalGen gives me a set of aligned evals in minutes, along with a structured, step-by-step way to think through how the evals are generated.
After you use EvalGen to get a set of starter evaluators, you can run them and see results in a table view, or plot results per metric.
EvalGen also knows about ChainForge-specific features, like the format of ResponseObjects (to access metadata) and the {#template} variables that let you insert upstream metadata into the eval prompt.
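To give a concrete sense of what that metadata access looks like, here’s a rough sketch of a hand-written true/false assertion in the style of ChainForge’s Python evaluator nodes, where an evaluate(response) function receives the output text along with its prompt variables and metadata. The field names (text, var, meta) follow ChainForge’s docs, but the criterion itself is an illustrative assumption, not something EvalGen actually generated:

```python
# Illustrative only: a true/false assertion in the style of ChainForge's
# Python evaluator nodes. The response object exposes the LLM's output text
# plus the prompt variables and metadata attached by upstream nodes.
def evaluate(response):
    answer = response.text                 # the LLM's output text
    topic = response.var.get("topic")      # an upstream prompt variable (hypothetical)
    # Example criterion: the answer should mention the topic it was asked about
    # and stay under a rough length budget. response.meta works the same way
    # for metadata columns.
    mentions_topic = topic is None or topic.lower() in answer.lower()
    return mentions_topic and len(answer) < 1000
```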
EvalGen is finally available, but it’s still in beta
I wanted to finally get EvalGen out there so people can try it. But that means this release still needs some polish. Here are the current limitations:
- EvalGen only generates assertions (true/false), not numeric or categorical scoring functions. (If you’d like to help with this, the code is open-source… take a stab at it.)
- EvalGen is powerful, but it isn’t the cheapest. The longer you keep grading and the more LLM outputs in your example set, the more LLM calls it sends. (Luckily, you can see this directly.)
- EvalGen doesn’t use reasoning for its LLM evaluators (this would be a relatively minor change, although it would affect latency and cost).
- The particular prompts that EvalGen uses could be further optimized (we tested and iterated on them within ChainForge, but it’s unclear whether they break down in particular circumstances).
- EvalGen takes a random sample of at most 16 responses from your inputs, which limits the number of calls it needs to make per cycle. In the future, we’ll further optimize the way EvalGen sends calls and give developers control over this maximum.
- EvalGen is currently locked to OpenAI’s GPT-4o series for its backend, which means you need an OpenAI key. If you’d like support for other providers, consider making a PR.
- The Report Card screen still needs more work. Currently it surfaces F1 scores as the “alignment” metric, but it also computes other metrics internally, like Cohen’s kappa, that we could show (see the sketch just below for what that comparison measures).
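To make that “alignment” number concrete, here is a small, self-contained sketch of how an evaluator’s pass/fail verdicts can be scored against a developer’s thumbs-up/down grades using scikit-learn. The grade and verdict arrays are made up for illustration; this is not EvalGen’s internal data or code:

```python
# Illustrative alignment check: compare an evaluator's pass/fail verdicts
# against human thumbs-up/down grades on the same responses.
# Requires scikit-learn (pip install scikit-learn). Data below is made up.
from sklearn.metrics import f1_score, cohen_kappa_score

human_grades  = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]  # 1 = thumbs up, 0 = thumbs down
eval_verdicts = [1, 1, 0, 0, 0, 1, 1, 1, 0, 1]  # 1 = assertion passed, 0 = failed

print("F1:   ", f1_score(human_grades, eval_verdicts))
print("Kappa:", cohen_kappa_score(human_grades, eval_verdicts))
```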
If you derive value from EvalGen and you’re interested in helping with any of these features—or any new ones you cook up—raise an Issue or make a Pull Request on the ChainForge repo.
Conclusion
EvalGen, though a simple concept, ended up being tens of thousands of lines of code, far from a minor change. I hope it continues to inspire the community toward better, more humane workflows for generating aligned, trusted LLM evals for custom tasks.
Try it out with pip install chainforge in a Python 3.10+ environment, and let us know what you think.
Best,
~Ian Arawjo