Ask HN: How do you evaluate an LLM these days?

1 point by pseudony 4 months ago · 0 comments · 1 min read

Hello HN. Recent events, combined with my being Danish (EU), strongly encourage me to reconsider US services like Anthropic's Claude. I mention this to explain why evaluating LLMs has suddenly become a pressing problem for me. While I don't doubt that Claude is nearly ideal for my corner of software development, I would like a better sense of how much I would be giving up by switching.

With that in mind, how do you best go about evaluating LLMs these days, short of going with "gut feel"? My best idea so far is to write various small "design a program/library" tasks with clear functional requirements and let each model try to implement them, probably using OpenCode and OpenRouter as the common components throughout the evaluation.
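
To make the idea concrete, here is a rough sketch of the scoring half of such a harness, under some loud assumptions: `Task`, `score`, `evaluate`, the `slugify` task and the `model-a`/`model-b` names are all made up for illustration, and the submissions are hard-coded strings standing in for whatever each model would actually return. It makes no calls to OpenCode or OpenRouter.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Task:
    """One small task with clear functional requirements (hypothetical shape)."""
    name: str
    entry_point: str                  # function the model is asked to define
    cases: list[tuple[tuple, Any]]    # (positional args, expected result)


def score(source: str, task: Task) -> float:
    """Return the fraction of the task's cases that the generated source passes."""
    namespace: dict[str, Any] = {}
    try:
        # WARNING: exec runs arbitrary code. Real model output should be run
        # in a sandbox (container, subprocess with limits), not like this.
        exec(source, namespace)
        func = namespace[task.entry_point]
    except Exception:
        return 0.0  # did not compile, or did not define the entry point
    passed = 0
    for args, expected in task.cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash on one case counts as a failed case
    return passed / len(task.cases)


def evaluate(submissions: dict[str, str], task: Task) -> dict[str, float]:
    """Score every model's submission for a single task."""
    return {model: score(src, task) for model, src in submissions.items()}


# Hypothetical task: turn a title into a URL slug.
task = Task(
    name="slugify",
    entry_point="slugify",
    cases=[
        (("Hello World",), "hello-world"),
        (("  A  B ",), "a-b"),
        (("x",), "x"),
    ],
)

# Stand-ins for model output; in a real run these would come from each model.
submissions = {
    "model-a": "def slugify(s):\n    return '-'.join(s.lower().split())\n",
    "model-b": "def slugify(s):\n    return s.lower().replace(' ', '-')\n",
}

results = evaluate(submissions, task)
for model, pass_rate in results.items():
    print(f"{model}: {pass_rate:.2f}")
```

The point of keeping the test cases separate from the prompt is that every model is graded against the same functional requirements, so the numbers are at least comparable across models even if the task set itself is small.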

But this field moves fast and I may well have missed many better or easier approaches. What would you do?

No comments yet.
