Show HN: Beval – Simple evaluations for your AI product
I have been working on a web app called Beval - simple evaluations for your AI product.
In my day-to-day as a Product Manager on a team that ships AI products, I often found myself wanting to run 'quick and dirty' LLM-based evaluations on conversation transcripts and traces. I didn't need anything fancy, just 'did the agent answer the question', 'did the agent cover the 5 things it needed to' - that type of thing.
My workflow at the time was 'Gemini in Google Sheets', but it was slow and cumbersome, and it didn't handle changes to eval definitions well - particularly when trying to associate evals with ground truth. And because I was exploring new and experimental features, it didn't make sense to set up something more robust with the team.
To fix the problem I eventually learned to call the OpenAI API from Python, but I really wanted a 'product' to help me - and others who need answers fast - without building infrastructure and pipelines.
So over the last few weeks I built: https://beval.space
It has:

- LLM-as-judge evals: boolean checks (yes/no), scores (1-5), categories, and freeform comments

- Reusable eval definitions you can run across different datasets

- Ground truth labelling so you can compare eval versions against human judgments

- Per-trace reasoning so you can see why the judge scored something the way it did

- An example dataset so you can try it without having your own traces ready
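For anyone unfamiliar with the pattern, a boolean LLM-as-judge check is roughly this: a reusable eval definition, a prompt built per trace, and a parser that maps the judge's free-text reply to yes/no. The sketch below is illustrative, not Beval's implementation - the class names, prompt wording, and model name are my assumptions; the commented-out API call uses the OpenAI Python SDK.

```python
# Minimal sketch of a boolean LLM-as-judge eval.
# All names here (BooleanEval, parse_verdict) are illustrative.
from dataclasses import dataclass


@dataclass
class BooleanEval:
    """A reusable yes/no eval definition you can run across datasets."""
    name: str
    question: str  # e.g. "Did the agent answer the user's question?"

    def build_prompt(self, transcript: str) -> str:
        # Asking for a reason alongside the verdict gives you
        # per-trace reasoning you can inspect later.
        return (
            f"{self.question}\n\n"
            f"Transcript:\n{transcript}\n\n"
            "Answer with exactly YES or NO, then a one-line reason."
        )


def parse_verdict(raw: str) -> bool:
    """Map the judge's free-text reply to a boolean."""
    first = raw.strip().split()[0].upper().rstrip(".,:;")
    if first not in ("YES", "NO"):
        raise ValueError(f"Unparseable verdict: {raw!r}")
    return first == "YES"


# The actual judge call would look roughly like this (needs an API key,
# and the model name is an assumption):
#
# client = openai.OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=[{"role": "user",
#                "content": ev.build_prompt(trace_text)}],
# )
# verdict = parse_verdict(resp.choices[0].message.content)
```

Scores (1-5) and categories work the same way, with the parser mapping to an int or an enum instead of a bool.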
One of our early users described it as 'quick n dirty evals when you don't want to touch a shit load of infra.' I'm trying to figure out if that's a common need or just a niche thing.
Free during beta. Would love HN's take: what's missing, and would you actually use something like this?

[Top comment] LLM-as-judge drifts in weird ways if you don't have ground truth to calibrate against, so it's good that you've got that built in. I'd love to see eval stability tracking over time though - the same prompt on a different day sometimes gives different scores.
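The calibration the comment describes reduces to a simple agreement metric: run the judge over a labelled set and compare against the human labels; re-run the same eval later and compare the two rates to get a crude stability signal. A minimal sketch (the function name is mine, not Beval's):

```python
# Hedged sketch: judge-vs-ground-truth agreement, usable both for
# calibrating an eval version and for tracking stability across re-runs.
def agreement_rate(judge_labels: list, truth_labels: list) -> float:
    """Fraction of traces where the judge matched the human label."""
    if len(judge_labels) != len(truth_labels) or not judge_labels:
        raise ValueError("Need two equal-length, non-empty label lists")
    matches = sum(j == t for j, t in zip(judge_labels, truth_labels))
    return matches / len(judge_labels)


# Stability over time: run the same eval on the same labelled set on
# two different days and compare the two agreement rates.
# day1 = agreement_rate(run_monday, ground_truth)
# day2 = agreement_rate(run_friday, ground_truth)
# drift = abs(day1 - day2)
```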