Ask HN: How do you monitor LLM response quality?
For those of you running systems with LLM interactions in production: how do you continuously monitor the quality of LLM outputs?
Do you use a second LLM to check whether a response is hallucinated and to grade it across a set of metrics? Quoting Unstract’s documentation on LLMEval for reducing hallucinations: “it is always a good idea to choose an LLM from a completely different provenance or origin as the evaluator LLM as compared to the LLM that is used to structure documents.”
https://docs.unstract.com/unstract_platform/setup_accounts/s...

What We’ve Learned From A Year of Building with LLMs: https://applied-llms.org
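
For context, here’s the rough shape of the LLM-as-judge check I’ve been sketching, with the evaluator from a different provider than the generator, following the Unstract advice above. The model names, rubric, and metrics below are placeholders I picked for illustration, not a recommendation:

  # Sketch: generate with one provider, grade with another (LLM-as-judge).
  # Model names, rubric, and metrics are placeholders.
  import json
  import anthropic
  from openai import OpenAI

  generator = OpenAI()               # reads OPENAI_API_KEY
  evaluator = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

  RUBRIC = (
      "Grade the ANSWER against the SOURCE on a 1-5 scale for each metric: "
      "faithfulness (no claims absent from the source), relevance, completeness. "
      'Reply with JSON only, e.g. {"faithfulness": 4, "relevance": 5, "completeness": 3}'
  )

  def generate(question: str, source: str) -> str:
      resp = generator.chat.completions.create(
          model="gpt-4o-mini",  # placeholder generator model
          messages=[
              {"role": "system", "content": "Answer using only the provided source."},
              {"role": "user", "content": f"SOURCE:\n{source}\n\nQUESTION: {question}"},
          ],
      )
      return resp.choices[0].message.content

  def judge(question: str, source: str, answer: str) -> dict:
      msg = evaluator.messages.create(
          model="claude-3-5-sonnet-latest",  # placeholder evaluator model, different provider
          max_tokens=300,
          messages=[{
              "role": "user",
              "content": f"{RUBRIC}\n\nSOURCE:\n{source}\n\nQUESTION: {question}\n\nANSWER:\n{answer}",
          }],
      )
      return json.loads(msg.content[0].text)

  if __name__ == "__main__":
      src = "The invoice total is $1,240, due on 2024-07-01."
      q = "What is the invoice total and due date?"
      ans = generate(q, src)
      print(ans, judge(q, src, ans))

In production I’d imagine logging these scores per request and alerting when a rolling average for any metric drops, but I’m curious what people actually do at scale (sampling a percentage of traffic, human review queues, etc.).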