There are no new ideas in AI, only new datasets

blog.jxmo.io

472 points by bilsbie 2 days ago


voxleone - a day ago

I'd say with confidence: we're living in the early days. AI has made jaw-dropping progress in two major domains: language and vision. With large language models (LLMs) like GPT-4 and Claude, and vision models like CLIP and DALL·E, we've seen machines that can generate poetry, write code, describe photos, and even hold eerily humanlike conversations.

But as impressive as this is, it’s easy to lose sight of the bigger picture: we’ve only scratched the surface of what artificial intelligence could be — because we’ve only scaled two modalities: text and images.

That’s like saying we’ve modeled human intelligence by mastering reading and eyesight, while ignoring touch, taste, smell, motion, memory, emotion, and everything else that makes our cognition rich, embodied, and contextual.

Human intelligence is multimodal. We make sense of the world through:

- Touch (the texture of a surface, the feedback of pressure, the warmth of skin);
- Smell and taste (deeply tied to memory, danger, pleasure, and even creativity);
- Proprioception (the sense of where your body is in space — how you move and balance);
- Emotional and internal states (hunger, pain, comfort, fear, motivation).

None of these are captured by current LLMs or vision transformers. Not even close. And yet, our cognitive lives depend on them.

Language and vision are just the beginning — the parts we were able to digitize first, not necessarily the ones most central to intelligence.

The real frontier of AI lies in the messy, rich, sensory world where people live. We’ll need new hardware (sensors), new data representations (beyond tokens), and new ways to train models that grow understanding from experience, not just patterns.

tippytippytango - 2 days ago

Sometimes we get confused by the difference between technological and scientific progress. When science makes progress it unlocks new S-curves that advance at an incredible pace until you reach the diminishing-returns region. People complain of slowing progress, but it was always slow; you just didn’t notice that nothing new was happening during the exponential take-off of the S-curve, only furious optimization.

EternalFury - a day ago

What John Carmack is exploring is pretty revealing. Train models to play 2D video games to a superhuman level, then ask them to play a level they have not seen before, or another 2D video game entirely. The transfer function is negative. So, by my definition, no intelligence has been developed, only expertise in a narrow set of tasks.

It’s apparently much easier to scare the masses with visions of ASI than to build a general intelligence that can pick up a new 2D video game faster than a human being.

jschveibinz - 2 days ago

I will respectfully disagree. All "new" ideas come from old ideas. AI is a tool to access old ideas with speed and with new perspectives that haven't been available until now.

Innovation is in the cracks: recognition of holes, intersections, tangents, etc. in old ideas. It has been said that innovation is done on the shoulders of giants.

So AI can be an express elevator up to an army of giants' shoulders? It all depends on how you use the tools.

kogus - 2 days ago

To be fair, if you imagine a system that successfully reproduced human intelligence, then 'changing datasets' would probably be a fair summary of what it would take to produce different models. After all, our own memories, training, education, background, etc. are a very large component of our own problem-solving abilities.

strangescript - a day ago

If you work with model architecture and read papers, how could you not know there is a flood of new ideas? Only a few yield interesting results, though.

I kind of wonder if libraries like pytorch have hurt experimental development. So many basic concepts no one thinks about anymore because they just use the out-of-the-box solutions. And maybe those solutions are great and those parts are "solved", but I am not sure. How many models are using someone else's tokenizer, or someone else's strapped-on vision model, just to check a box in the model card?
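
To make the "someone else's tokenizer" point concrete, here is a minimal sketch, assuming the Hugging Face transformers package is installed; the choice of GPT-2 is arbitrary and purely illustrative:

    from transformers import AutoTokenizer

    # Pull a pretrained byte-level BPE tokenizer off the shelf instead of training one.
    tok = AutoTokenizer.from_pretrained("gpt2")
    ids = tok("There are no new ideas in AI, only new datasets.")["input_ids"]
    print(ids)              # token ids under GPT-2's vocabulary
    print(tok.decode(ids))  # round-trips back to the original string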

LarsDu88 - a day ago

If datasets are what we are talking about, I'd like to bring attention to the biological datasets out there that have yet to be fully harnessed.

The ability to collect gene expression data at a tissue specific level has only been invented and automated in the last 4-5 years (see 10X Genomics Xenium, MERFISH). We've only recently figured out how to collect this data at the scale of millions of cells. A breakthrough on this front may be the next big area of advancement.

ctoth - 2 days ago

Reinforcement learning from self-play/AlphaWhatever? Nah must just be datasets. :)

cadamsdotcom - a day ago

What about actively obtained data: models seeking data rather than being fed it? Human babies put things in their mouths; they try to stand and fall over. They “do stuff” to learn what works. Right now we’re just telling models what works.

What about simulation: models can make 3D objects so why not give them a physics simulator? We have amazing high fidelity (and low cost!) game engines that would be a great building block.

What about rumination: behind every Cursor rule, for example, is a whole story of why a user added it. Why not take the rule, ask a reasoning model to hypothesize about why that rule was created, and add that rumination (along with the rule) to the training data? Providing opportunities to reflect on the choices made by their users might deepen any insights, squeezing more juice out of the data.
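
For illustration, a minimal sketch of that rumination loop in Python, where client.chat() is a hypothetical placeholder standing in for whatever reasoning-model API is actually used:

    def ruminate(rules, client):
        """Turn each rule into a (rule, hypothesized motivation) training example."""
        examples = []
        for rule in rules:
            prompt = (
                "A user added this rule to their coding assistant:\n"
                f"{rule}\n"
                "Hypothesize, step by step, what problem or preference "
                "likely motivated this rule."
            )
            rumination = client.chat(prompt)  # hypothetical API call
            examples.append({"rule": rule, "rumination": rumination})
        return examples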

piinbinary - a day ago

AI training is currently a process of making the AI remember the dataset. It doesn't involve the AI thinking about the dataset and drawing (and remembering) conclusions.

It can probably remember more facts about a topic than a PhD in that topic, but the PhD will be better at thinking about that topic.

scrubs - 4 hours ago

True or false? Once an LLM is constructed, it mutates to include data from prompt-response interactions?

NetRunnerSu - a day ago

Because an externally injected loss function will empty the model's brain.

Models need to decide for themselves what they should learn.

Eventually, once models enter the open world, reinforcement learning or genetic algorithms will still be the only solution for perpetual training.

https://github.com/dmf-archive/PILF

JKCalhoun - 15 hours ago

> It’s not crazy to argue that all the underlying mechanisms of these breakthroughs existed in the 1990s, if not before.

That's not super relevant in my mind. What matters is that they're bearing fruit now, which will allow research to move forward. And the success, as we know, draws a lot of eyeballs, dollars, and resources.

If this path is going to hit a wall, we will hit it more quickly now. If another way is required to move forward, we are more likely to find it now.

ahmedhawas123 - 15 hours ago

I think this is reflective of the current state, but it does not mean this will be the future. I think there is a lot of innovation to come from revisiting some of the 1990s principles of backpropagation and optimization. Imagine if you could train current models to optimal weights in 1 day or 1 hour instead of weeks or months.

Just a hypothesis of mine

somebodythere - a day ago

I don't know if it matters. Even if the best we can do is get really good at interpolating between solutions to cognitive tasks on the data manifold, the only economically useful human labor left asymptotes toward frontier work; work that only a single-digit percentage of people can actually perform.

seydor - a day ago

There are new ideas: people are finding new ways to build vision models, which are then applied to language models and vice versa (like diffusion).

The original idea of connectionism is that neural networks can represent any function, which is a fundamental mathematical fact. So we should be optimistic: neural nets will be able to do anything. Which neural nets? So far people have stumbled on a few productive architectures, but it appears to be more alchemy than science. There is no reason to think there won't be both new ideas and new data. Biology did it; humans will do it too.
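
For reference, the "represent any function" claim is the universal approximation theorem (Cybenko 1989; Hornik 1991); in its one-hidden-layer form, for a suitable activation sigma (e.g. a sigmoid), it says roughly:

    % For any continuous f on a compact set K in R^n and any eps > 0,
    % there exist N and weights v_i, w_i, b_i such that
    \left| f(x) - \sum_{i=1}^{N} v_i \, \sigma(w_i^\top x + b_i) \right| < \varepsilon
    \quad \text{for all } x \in K.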

> we’re engaged in a decentralized globalized exercise of Science, where findings are shared openly

Maybe the findings are shared, if they make the Company look good. But the methods no longer are.

1vuio0pswjnm7 - 9 hours ago

What about hardware?

Ideas are not new, according to the author.

But hardware is new, and the author never mentions the impact of hardware improvements.

sakex - a day ago

There are new things being tested and yielding results monthly in modelling. We've deviated quite a bit from the original multi-head attention.
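
For context, the "original" multi-head attention from the 2017 Transformer paper boils down to a few projections plus scaled dot-product attention; here is a minimal, self-contained PyTorch sketch (names and shapes are illustrative, not from any particular codebase):

    import torch
    import torch.nn.functional as F

    def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
        # x: (batch, seq, d_model); each w_*: (d_model, d_model) projection matrix.
        batch, seq, d_model = x.shape
        d_head = d_model // num_heads

        def split(t):  # (batch, seq, d_model) -> (batch, heads, seq, d_head)
            return t.view(batch, seq, num_heads, d_head).transpose(1, 2)

        q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
        scores = q @ k.transpose(-2, -1) / d_head ** 0.5   # (batch, heads, seq, seq)
        out = F.softmax(scores, dim=-1) @ v                # (batch, heads, seq, d_head)
        out = out.transpose(1, 2).reshape(batch, seq, d_model)
        return out @ w_o

    x = torch.randn(2, 8, 64)
    w_q, w_k, w_v, w_o = (torch.randn(64, 64) * 0.1 for _ in range(4))
    print(multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads=4).shape)  # (2, 8, 64)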

Daisywh - a day ago

If we’re serious about data being more important than models, then where are the ISO-style standards for dataset quality? We have so many model metrics, but almost nothing standardized for data integrity or reproducibility.

Leon_25 - a day ago

At Axon, we see the same pattern: data quality and diversity make a bigger difference than architecture tweaks. Whether it's AI for logistics or enterprise automation, real progress comes when we unlock new, structured datasets, not when we chase “smarter” models on stale inputs.

tim333 - a day ago

An interesting step forward, although an old idea, that we seem close to is recursive self-improvement: get the AI to make a modified version of itself to try to think better.

tantalor - a day ago

> If data is the only thing that matters, why are 95% of people working on new methods?

Because new methods unlock access to new datasets.

Edit: Oh I see this was a rhetorical question answered in the next paragraph. D'oh

mikewarot - a day ago

Hardware isn't even close to being out of steam. There are some breathtakingly obvious premature optimizations that we can undo to get at least 99% power reduction for the same amount of compute.

For example, FPGAs use a lot of area and power routing signals across the chip. Those long lines have a large capacitance, and thus cause a large amount of dynamic power loss. So does moving parameters around to/from RAM instead of just loading up a vast array of LUTs with the values once.
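
A back-of-envelope illustration of that point, with made-up example numbers (not measurements of any real FPGA): dynamic power of a switched line scales as activity factor times capacitance times voltage squared times frequency, so long high-capacitance routes dominate.

    def dynamic_power(alpha, capacitance_f, voltage_v, freq_hz):
        # P = alpha * C * V^2 * f, the standard dynamic switching-power estimate.
        return alpha * capacitance_f * voltage_v ** 2 * freq_hz

    long_route = dynamic_power(0.2, 2e-12, 0.8, 500e6)     # ~1.3e-4 W per line
    short_route = dynamic_power(0.2, 0.2e-12, 0.8, 500e6)  # ~1.3e-5 W per line
    print(long_route, short_route)  # capacitance dominates the difference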

lossolo - a day ago

I wrote about it around a year ago here:

"There weren't really any advancements from around 2018. The majority of the 'advancements' were in the amount of parameters, training data, and its applications. What was the GPT-3 to ChatGPT transition? It involved fine-tuning, using specifically crafted training data. What changed from GPT-3 to GPT-4? It was the increase in the number of parameters, improved training data, and the addition of another modality. From GPT-4 to GPT-40? There was more optimization and the introduction of a new modality. The only thing left that could further improve models is to add one more modality, which could be video or other sensory inputs, along with some optimization and more parameters. We are approaching diminishing returns." [1]

10 months ago around o1 release:

"It's because there is nothing novel here from an architectural point of view. Again, the secret sauce is only in the training data. O1 seems like a variant of RLRF https://arxiv.org/abs/2403.14238

Soon you will see similar models from competitors." [2]

Winter is coming.

1. https://news.ycombinator.com/item?id=40624112

2. https://news.ycombinator.com/item?id=41526039

rar00 - a day ago

Disagree; there are a few organisations exploring novel paths. It's just that throwing new data at an "old" algorithm is much easier and has been a winning strategy. Also, there's no incentive for a private org to advertise a new idea that seems to be working (mine's a notable exception :D).

lsy - a day ago

This seems simplistic; tech and infrastructure play a huge part here. A short and incomplete list of things that contributed:

- Moore's law petering out, steering hardware advancements towards parallelism

- Fast-enough internet creating shift to processing and storage in large server farms, enabling both high-cost training and remote storage of large models

- Social media + search both enlisting consumers as data producers, and necessitating the creation of armies of Mturkers for content moderation + evaluation, later becoming available for tagging and rlhf

- A long-term shift to a text-oriented society, beginning with print capitalism and continuing through the rise of "knowledge work" through to the migration of daily tasks (work, bill paying, shopping) online, that allows a program that only produces text to appear capable of doing many of the things a person does

We may have previously had the technical ideas in the 1990s but we certainly didn't have the ripened infrastructure to put them into practice. If we had the dataset to create an LLM in the 90s, it still would have been astronomically cost-prohibitive to train, both in CPU and human labor, and it wouldn't have as much of an effect on society because you wouldn't be able to hook it up to commerce or day-to-day activities (far fewer texts, emails, ecommerce).

SamaraMichi - a day ago

This brings us to the problem AI companies are facing: the lack of data. They have already hoovered up as much as they can from the internet and desperately need more.

Which makes it blatantly obvious why we're beginning to see products marketed under the guise of assistants or tools to aid you, whose actual purpose is to gather real-world image and audio data: think Meta glasses and what Ive and Altman are cooking up with their partnership.

russellbeattie - a day ago

Paradigm shifts are often just a conglomeration of previous ideas with one little tweak that suddenly propels a technology ahead 10x, which opens up a whole new era.

The iPhone is a perfect example. There were smartphones with cameras and web browsers before. But when the iPhone launched, it added a capacitive touch screen that was so responsive there was no need for a keyboard. The importance of that one technical innovation can't be overstated.

Then the "new new thing" is followed by a period of years where the innovation is refined, distributed, applied to different contexts, and incrementally improved.

The iPhone launched in 2007 is not really that much different from the one you have in your pocket today. The last 20 years have been about improvements. The web browser before that is also pretty much the same as the one you use today.

We've seen the same pattern happen with LLMs. The author of the article points out that many of AI's breakthroughs have been around since the 1990s. Sure! And the Internet was created in the 1970s and mobile phones were invented in the 1980s. That doesn't mean the web and smartphones weren't monumental technological events. And it doesn't mean LLMs and AI innovation is somehow not proceeding apace.

It's just how this stuff works.

Kapura - a day ago

Here's an idea: make the AIs consistent at doing things computers are good at. Here's an anecdote from a friend who's living in Japan:

> i used chatgpt for the first time today and have some lite rage if you wanna hear it. tldr it wasnt correct. i thought of one simple task that it should be good at and it couldnt do that.

> (The kangxi radicals are neatly in order in unicode so you can just ++ thru em. The cjks are not. I couldnt see any clear mapping so i asked gpt to do it. Big mess i had to untangle manually anyway it woulda been faster to look them up by hand (theres 214))

> The big kicker was like, it gave me 213. And i was like, "why is one missing?" Then i put it back in and said count how many numbers are here and it said 214, and there just werent. Like come on you SHOULD be able to count.

If you can make the language models actually interface with what we've been able to do with computers for decades, i imagine many paths open up.
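
For what it's worth, the counting part of that task is trivial in plain code; a minimal Python sketch, relying only on the fact that the Kangxi Radicals block runs from U+2F00 through U+2FD5 (the radical-to-CJK mapping the friend wanted is a separate, non-sequential problem):

    # Enumerate the 214 Kangxi radicals directly from their Unicode block.
    radicals = [chr(cp) for cp in range(0x2F00, 0x2FD5 + 1)]
    print(len(radicals))               # 214 -- no model needed to count them
    print(radicals[0], radicals[-1])   # first and last radicals in the block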

krunck - a day ago

Until these "AI" systems become always-on, always-thinking, always-processing, progress is stuck. The current push button AI - meaning it only processes when we prompt it - is not how the kind of AI that everyone is dreaming of needs to function.

nyrulez - a day ago

Things haven't changed much in terms of truly new ideas since electricity was invented. Everything else is just applications on top of that. Make the electrons flow in a different way and you get a different outcome.

blobbers - a day ago

Why is DeepSeek specifically called out?

TimByte - a day ago

What happens when we really run out of fresh, high-quality data? YouTube and robotics make sense as next frontiers, but they come with serious scaling, labeling, and privacy headaches.

AbstractH24 - 19 hours ago

Imagine if the original Moore's law tracked how often CPUs doubled their number of semiconductors while still functioning properly 50% of the time.

I don’t think it would have had the same impact

ks2048 - a day ago

The latest LLMs are simply multiplying and adding various numbers together... Babylonians were doing that 4000 years ago.

anon291 - a day ago

I mean, there are no new ideas for SaaS, just new applications, and that worked out pretty well.

Night_Thastus - a day ago

Man I can't wait for this '''''AI''''' stuff to blow over. The back and forth gets a bit exhausting.

luppy47474 - a day ago

Hmmm

alganet - a day ago

Dataset? That's so 2000s.

Each crawl on the internet is actually a discrete chunk of a more abstractly defined, constant influx of information streams. Let's call them rivers (it's a big stream).

These rivers can dry up, present seasonal shifts, be poisoned, be barraged.

It will never "get there" and gather enough data to "be done".

--

Regarding "new ideas in AI", I think there could be. But this whole thing is not about AI anymore.