Ask HN: Is the ongoing AI research driving LLM models to be better?

3 points by thiago_fm 10 days ago · 4 comments


I'm just a curious hobbyist who has run LLM models locally and follows a lot of content about them. Hope we have a few AI researchers here on HN to clarify this.

When using Opus or Codex vs. a Chinese or open-source model, it feels like their reasoning capabilities are basically the same.

The difference is typically in coding. It looks like OpenAI and Anthropic invest a lot in pre-training (paying Mercor and the like).

They also invest a lot in creating synthetic data; I believe this involves more AI research and more sophisticated techniques.

Of course, there's also the RLHF loop from developers using Anthropic/OpenAI products, which probably yields very good data.

This ends up creating the impression that the model is smart: after all, it has been trained on what you want to do, so it can do it for you.

But overall, is there really much AI research being done at those companies, or are the AI researchers mostly fine-tuning small aspects of the model, akin to what Google engineers used to do for Google Search?

I ask this because it looks like somebody with money could throw it at the problem and end up with a better model, provided they do what I outlined above better -- with AI research not really being that important.

It still often feels like talking to GPT-4 with just better data.

Even the big upgrade of Claude Code being able to work autonomously looks to be mainly due to it knowing how to grab context and do tool calls (not saying that this is easy), rather than the model's raw performance being better.
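That "grab context and do tool calls" loop is conceptually simple. A minimal sketch of such an agent loop (hypothetical message and tool formats, not Anthropic's or OpenAI's actual API):

```python
def agent_loop(model, tools, task, max_steps=10):
    """Minimal agent loop: the model alternates between requesting
    tool calls (read a file, run tests, ...) and giving a final answer."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = model(history)  # hypothetical model interface
        if reply.get("tool") is None:
            return reply["content"]  # no tool requested: model is done
        # Run the requested tool and feed the result back as context.
        result = tools[reply["tool"]](**reply["args"])
        history.append({"role": "tool", "name": reply["tool"], "content": result})
    return "(step budget exhausted)"
```

The "autonomy" lives mostly in how well the model decides which tool to call next and what context to pull in, which matches the observation above.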

Or am I wrong: is there something extremely good in those models that AI researchers discovered and the others don't have? Or is it really mostly data?

curioussquirrel 10 days ago

There are architectural changes (such as reasoning or mixture of experts) that measurably improve how well models perform. So the improvements are definitely not just from data.

I can speak for my area of expertise - multilingual capabilities. Some SOTA models are making huge strides in their support of various languages, and increasingly they understand and can produce text in languages where GPT-4 era models were absolutely lost. These gains probably come from a combination of richer training datasets and architectural improvements (more parameters?).

I posted about this here if you're interested: https://news.ycombinator.com/item?id=47847282

Now that doesn't necessarily mean that models are also getting substantially better at English or other major languages. They likely are to some degree, but we've reached a point with major languages where core linguistic proficiencies are covered, and what's left is the more squishy part: style, tone of voice, ability to use different registers naturally, or what some people would call linguistic taste. But that's much harder to measure and therefore trickier to provide evidence for.

Hope this helps.

Edit: typo, clarification

  • thiago_fmOP 10 days ago

    Thanks for your perspective as somebody working in the field. I still wonder, though: what would the results be if we just used a richer dataset + more parameters? Would the results really be that different (except for cost, as MoE definitely helps with that)?

    MoE: I assume some people just specialize in working on routing, since by activating only a subset of the parameters you make inference less costly. So are AI researchers mostly working on optimizations to make this better?

    Same question on reasoning: are AI researchers mostly working on optimizations on top of it, like CoT and so on? Mini-optimizations, basically.

    So basically, they work on those micro-optimizations, put them together, and see a % improvement on a benchmark?

    I'm sure this is probably awesome for languages, which, if I'm not mistaken, was the initial use case in "Attention Is All You Need" and the entire LLM revolution.

    But this seems to be a very clear path to be "taking the car to the carwash by foot" for a long time, isn't it?

    It feels like we'll keep "taking the car to the carwash by foot" until somebody optimizes for that prompt, or some pre-training is done, and then there'll be another prompt showing that the AI has real trouble with very basic real-world reasoning and imagination.

    Is that the case, or do you see any kind of research that could take us off that plateau of micro-optimizations that only get us a few cm closer to the peak?

    • curioussquirrel 10 days ago

      MoE is mostly an optimization that reduces the number of active parameters, and therefore the compute requirements, but it can also provide some performance improvements over dense models in some cases.
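      A toy sketch of what top-k routing in an MoE layer looks like (made-up shapes and names, plain NumPy, not any real model's code): only k of the N experts run per token, so compute scales with k while total parameters scale with N.

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_layer(x, gate_w, experts, k=2):
    # The router scores one logit per expert; only the
    # top-k experts actually run on this token.
    logits = x @ gate_w                    # shape (num_experts,)
    top = np.argsort(logits)[-k:]          # indices of the k best experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                           # softmax over the selected experts only
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

# Toy setup: 4 experts, each a random 8x8 linear map on an 8-dim token.
experts = [lambda x, W=rng.normal(size=(8, 8)): x @ W for _ in range(4)]
gate_w = rng.normal(size=(8, 4))
y = moe_layer(rng.normal(size=8), gate_w, experts, k=2)
```

      With k=2 of 4 experts, roughly half the expert compute runs per token, while all four experts' parameters still contribute capacity to the model as a whole.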

      I would not describe reasoning as an optimization: in fact, it's typically the opposite, as models spend way more tokens (and therefore compute) responding to the same prompt. Some of the smartest models these days use ridiculous amounts of reasoning before they ever respond. Try Deep Research in Gemini or Claude and you'll see what I'm talking about.
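      A back-of-envelope way to see why reasoning is the opposite of a compute optimization: decode cost grows with every thinking token generated before the answer (illustrative numbers, not measured from any real model).

```python
# Decode cost scales roughly linearly with tokens generated.
def decode_cost(answer_tokens, thinking_tokens=0):
    return answer_tokens + thinking_tokens

direct = decode_cost(answer_tokens=200)
with_reasoning = decode_cost(answer_tokens=200, thinking_tokens=5000)
ratio = with_reasoning / direct  # 26x the decode compute for the same answer length
```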

      >> But this seems to be a very clear path to be "taking the car to the carwash by foot" for a long time, isn't it?

      I thought the progress was plateauing sometime last year too, but then some new models got released and we saw that the multilingual capabilities improvements are real. And if you want something more tangible and reported on, consider the Opus 4.5/4.6 coding revolution (Claude Code explosion) a few months back.

      LLMs being stochastic, statistical machines, there will always be trick questions people come up with that stump them, be it the R's in strawberry or the carwash by foot. At the same time, I can tell you from my experience that a lot of the Misguided Attention ( https://github.com/cpldcpu/MisguidedAttention ) type of stump questions work at a much lower rate with newer models. Progress is being made, it's just not always in visible areas.

      BTW, you can come up with many trick questions that will stump even humans with PhDs. They'll be of a different kind than the ones for LLMs, but this is not a flaw unique to LLMs.

      If you're asking whether the progress to AGI isn't taking too long, then I personally think LLMs, at least with their current architecture, are not the foundation of AGI, and will always have inherent limitations. But we're fully in the "that's just like, your opinion, man" territory now :)

      • thiago_fmOP 10 days ago

        Beautiful answer. Priceless!

        LLMs for language definitely feel like the way to go. I feel that just improving them further can reach perfection, or at least get very close.

        My concern is mostly with all the adjacent fields, like systems thinking, spatial reasoning, "real" human-like reasoning, etc., or, as you put it, "AGI".

        It doesn't seem this will take us there at all. I don't feel we're any closer to AGI than we were with the earliest versions of ChatGPT.
