HellaSwag: 36% of this popular large language model benchmark contains errors (surgehq.ai)
I'm not entirely clear on what ActivityNet is (one of the primary sources for HellaSwag), but it looks like amateurish descriptions of videos, like the audio descriptions you would write for the blind, except written very badly.
I'm guessing it's just Mechanical Turk content which wasn't even spellchecked.
Agreed, but I am concerned that many of these models are being used in critical scenarios without much validation of the dataset.
I'm not so sure the input "errors" called out in this post qualify as errors in the dataset. I wouldn't necessarily call an input prompt with errors a dataset problem. It's important to be robust to minor input errors, rather than requiring perfection on the part of the user.
I'm thinking here about inputs like "People is around the field watching the game" and other input errors, not output errors. Maybe if I thought about it a little more I could make similar arguments for accepting weirder outputs, but I'm not as confident about that. For inputs, the hopeful effect of training/validating against such examples would be to make the model somewhat able to deal with imperfect inputs when the overall meaning is clear.
We are certainly at the “throw money at the buzzwords” stage of ML, especially LLMs. And while this is largely driven by the gold-rush mentality of the hype cycle, there is also the issue of people in this field wildly over-promising what this tech can do.
The scary thing about this hype cycle is that AI and ML are both being deployed in life-and-death scenarios like automated driving and health-care settings. This isn’t the normal web hype of “Uber for X” that we are used to.
This article is written such that you have to read it twice to understand what it's conveying. It could benefit from a two-sentence introduction establishing the context.
Here you go:
The HellaSwag benchmark is an example of a large language model (LLM) benchmark that is popular among researchers. However, it has been found to be inaccurate and unhelpful in measuring progress in LLM research. Researchers analysed the validation set of HellaSwag and found errors in 36% of its rows, with the ActivityNet rows being particularly problematic. Real-world human evaluation is important in order to make good launch decisions on LLMs.
(summarised by ChatGPT, naturally)
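As an aside, a headline figure like "errors in 36% of rows" from a manual audit comes with sampling uncertainty if it was measured on a sample rather than the full split. A quick back-of-the-envelope sketch (the sample size here is purely hypothetical, not from the article) using a Wilson score interval:

```python
import math

def wilson_interval(errors: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for an observed error proportion."""
    p = errors / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# Hypothetical audit: 36 flagged rows out of a 100-row sample.
lo, hi = wilson_interval(36, 100)
print(f"error rate 36%, 95% CI ~ [{lo:.1%}, {hi:.1%}]")  # roughly 27% to 46%
```

With only 100 audited rows the interval is wide; the more of the validation set the researchers actually checked, the tighter the claim.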
>More and more researchers are starting to see the importance of good data.
Let me just leave this here and not comment any further on this great progress within the research community.
Perhaps the 36% errors help it beat human evaluation ;)