Ask HN: Who is honestly evaluating AI outputs and how?

2 points by toddmorey 3 days ago · 1 comment

Evaluating and benchmarking these models is an increasingly complex topic, especially with multimodal AI conversations, yet a frustrating interaction with AI can really leave customers feeling sour about your whole product / service.

For an in-product AI assistant (with grounding, doc retrieval, and tool calling) I'm having a hard time wrapping my head around how to evaluate and monitor its success with customer interactions, prompt adherence, correctness and appropriateness, etc.

Any tips or resources that have been helpful to folks investigating this challenge? Would love to learn. What does your stack / process look like?

helain 2 days ago

Before everything, full disclosure: I am working on a RAG project, https://www.ailog.fr, and our app https://app.ailog.fr/. Check it out if you want a production-ready RAG (we have an API and can scale to enterprise level if necessary).

Now, for the evaluation part:

Evaluate LLM systems as three separate layers: model, retrieval or grounding, and tools. Measure each with automated tests plus continuous human sampling. A single accuracy metric hides user frustration. Instrument failures, not just averages.

Practical framework you can implement quickly:

Human in the loop: Review 1 to 5 percent of production sessions for correctness, safety, and helpfulness. Train a lightweight risk flagger.
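
A minimal sketch of that sampling plus flagger step in Python; the session shape, keyword list, and 3 percent rate are placeholders for whatever you actually log and train:

```python
# Sketch only: assumes sessions come from your own log store as dicts with a
# "turns" list; the keyword flagger is a stand-in for whatever classifier you train.
import random

RISK_KEYWORDS = {"refund", "cancel", "angry", "broken", "wrong", "complaint"}

def flag_risky(session: dict) -> bool:
    """Cheap heuristic: surface sessions that mention risk keywords."""
    text = " ".join(turn["text"].lower() for turn in session["turns"])
    return any(kw in text for kw in RISK_KEYWORDS)

def sample_for_review(sessions: list[dict], rate: float = 0.03) -> list[dict]:
    """Send every flagged session plus a random 1-5 percent slice to human reviewers."""
    flagged = [s for s in sessions if flag_risky(s)]
    rest = [s for s in sessions if not flag_risky(s)]
    return flagged + random.sample(rest, k=int(len(rest) * rate))
```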

Synthetic tests: 100 to 500 canned conversations covering happy paths, edge cases, adversarial prompts, and multimodal failures. Run on every change.
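
For the synthetic suite, a pytest-style harness is enough to start with. Rough sketch below; run_assistant and the scenario file format are placeholders for your own setup:

```python
# Sketch only: one JSON file per canned conversation under scenarios/,
# each with "messages", "must_contain", and "must_not_contain" fields (my convention).
import json
import pathlib

import pytest

SCENARIOS = sorted(pathlib.Path("scenarios").glob("*.json"))

def run_assistant(messages: list[dict]) -> str:
    """Placeholder: call your in-product assistant here and return its final reply."""
    raise NotImplementedError

@pytest.mark.parametrize("path", SCENARIOS, ids=lambda p: p.stem)
def test_scenario(path: pathlib.Path) -> None:
    case = json.loads(path.read_text())
    reply = run_assistant(case["messages"])
    for must in case.get("must_contain", []):
        assert must.lower() in reply.lower()
    for banned in case.get("must_not_contain", []):
        assert banned.lower() not in reply.lower()
```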

Retrieval and hallucinations: Track precision at k, MRR, and grounding coverage. Use entailment checks against retrieved documents.
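
Precision at k and MRR are a few lines once you have a labelled set of queries with their relevant document ids; the entailment check is a separate model call I won't sketch here:

```python
# Sketch only: retrieved ids per query, plus the set of ids judged relevant.
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    top = retrieved[:k]
    return sum(1 for doc_id in top if doc_id in relevant) / k

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)
```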

Tools and integrations: Validate schemas, assert idempotency, run end to end failure simulations. Track tool call and rollback rates.
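
For the schema checks, jsonschema gets you a long way; the ticket schema below is just an illustrative example, not a real API:

```python
# Sketch only: validate tool-call arguments before executing them,
# and count validation failures as tool-call failures in your metrics.
from jsonschema import ValidationError, validate

CREATE_TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string", "minLength": 1},
        "priority": {"enum": ["low", "medium", "high"]},
    },
    "required": ["title", "priority"],
    "additionalProperties": False,
}

def check_tool_call(args: dict) -> bool:
    try:
        validate(instance=args, schema=CREATE_TICKET_SCHEMA)
        return True
    except ValidationError:
        return False  # log this and feed it into the tool failure rate
```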

Telemetry and drift: Log embeddings, latency, feedback, and escalations. Alert on drift, hallucination spikes, and tool failures.
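
The simplest drift signal I know is comparing the centroid of recent query embeddings against a baseline window; the 0.15 threshold below is a placeholder you'd tune:

```python
# Sketch only: baseline and recent are (n, d) arrays of query embeddings
# pulled from telemetry; alert when the centroids drift apart.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def drift_alert(baseline: np.ndarray, recent: np.ndarray, threshold: float = 0.15) -> bool:
    return 1.0 - cosine(baseline.mean(axis=0), recent.mean(axis=0)) > threshold
```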

Weekly metrics: correctness, hallucination rate, retrieval precision at 5 and MRR, tool success rate, CSAT, latency, escalation rate.

Pilot plan: one week to wire logging, two weeks to build a 100 scenario suite, then nightly synthetic tests and daily human review.
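
The weekly rollup itself can be a single pandas resample over your session log; the column names here are assumptions about what you log, not a fixed format:

```python
# Sketch only: one row per session in sessions.csv with boolean outcome
# columns and a latency_ms column; adjust names to whatever you actually log.
import pandas as pd

df = pd.read_csv("sessions.csv", parse_dates=["timestamp"])
weekly = df.set_index("timestamp").resample("W").agg({
    "correct": "mean",
    "hallucinated": "mean",
    "tool_ok": "mean",
    "escalated": "mean",
    "latency_ms": lambda s: s.quantile(0.95),
})
weekly.columns = [
    "correctness", "hallucination_rate", "tool_success_rate",
    "escalation_rate", "p95_latency_ms",
]
print(weekly.tail())
```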

You can check out https://app.ailog.fr/en/tools to get some insight into ways to evaluate your RAG; we have free tools there for you to check out and use :)
