Three steps. Five minutes.
Upload traces, describe what to check, and get results with charts and reasoning.
Upload a CSV
session_id, conversation
1, "User: Plan a trip to Porto..."
2, "User: I need a Morocco itinerary..."
3, "User: Anniversary trip to Japan..."
Each row is one conversation. That's it.
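If your traces live in code rather than a spreadsheet, a file in this shape could be produced with a few lines of Python. This is a minimal sketch, assuming only the two columns shown above (`session_id` and `conversation`). The helper name `build_traces_csv` is hypothetical, and the conversation text stays truncated exactly as in the sample.

```python
import csv
import io

# Sample rows copied from the example above; the "..." truncation is kept as-is.
rows = [
    (1, "User: Plan a trip to Porto..."),
    (2, "User: I need a Morocco itinerary..."),
    (3, "User: Anniversary trip to Japan..."),
]

def build_traces_csv(rows):
    """Return CSV text with a header line and one conversation per row.

    Hypothetical helper: csv.writer adds quoting automatically when a
    conversation contains commas, quotes, or newlines.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["session_id", "conversation"])
    writer.writerows(rows)
    return buf.getvalue()

csv_text = build_traces_csv(rows)
print(csv_text)
```

Writing through `csv.writer` instead of joining strings by hand keeps multi-line conversations inside a single row.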
Describe what to check
"Did the assistant address the user's budget constraints?"
Boolean, Score, Category, Comment
Pick a type and write a prompt in plain English. That's your eval.
Get results
True: 60%
"User set a $2K budget and assistant stayed within it."
Charts, per-trace reasoning, and LLM explanations. In minutes.
See it in action
Real results from evaluating 10 trip-planning assistant conversations
Budget responsiveness
Did the assistant address the user's budget?
True: 6, False: 4
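The summary for a Boolean eval is simple arithmetic: 6 True verdicts out of 10 conversations is 60%. The sketch below shows that tally under stated assumptions. The function name `summarize` and the order of the verdicts are hypothetical; only the counts (6 True, 4 False) come from the example above, and this is not the product's actual implementation.

```python
from collections import Counter

def summarize(verdicts):
    """Tally Boolean verdicts and compute the share that are True."""
    counts = Counter(verdicts)
    total = len(verdicts)
    return {
        "true": counts[True],
        "false": counts[False],
        # Guard against an empty list to avoid dividing by zero.
        "true_share": counts[True] / total if total else 0.0,
    }

# Hypothetical ordering; only the 6/4 split is taken from the example.
verdicts = [True] * 6 + [False] * 4
summary = summarize(verdicts)
print(f"True: {summary['true']}, False: {summary['false']}, "
      f"share True: {summary['true_share']:.0%}")
```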
Ready to try it?
Sign up in seconds. Your first eval is five minutes away.
Get started in minutes
Free during beta. No credit card. No SDK. Just a CSV.
Such a clean interface and exactly the kind of quick n dirty evals I want when I don't want to touch a shit load of infra. Miles better than Langsmith tbh.
Sashank Pisupati, PhD
MTS @ Reflection | post-training, alignment, RL