Do Machine Learning Models Memorize or Generalize?

pair.withgoogle.com

454 points by 1wheel 2 years ago · 216 comments

mostertoaster 2 years ago

Sometimes I think the reason human memory is, in some sense, so amazing is that what we lack in the storage capacity machines have, we make up for in our ability to create patterns that dramatically compress the amount of information stored; then it is like we compress those patterns together with other patterns and are able to extract things from them. It is an incredibly lossy compression, but it gets the job done.

  • ComputerGuru 2 years ago

    That’s not exactly true, there doesn’t seem to be an upper bound (that we can reach) on storage capacity in the brain [0]. Instead, the brain actually works to actively distill knowledge that doesn’t need to be memorized verbatim into its essential components, in order to achieve exactly this “generalized intuition and understanding” and avoid overfitting.

    [0]: https://www.scientificamerican.com/article/new-estimate-boos...

    • halflings 2 years ago

      > That’s not exactly true [...] Instead, the brain actually works to actively distill knowledge that doesn’t need to be memorized verbatim into its essential components

      ...but that's exactly what OP said, no?

      I remember attending an ML presentation where the speaker shared a quote I can't find anymore (speaking of memory and generalization :)), which said something like: "To learn is to forget"

      If we memorized everything perfectly, we would not learn anything: instead of remembering the concept of a "chair", we would remember thousands of separate instances of things we've seen that have a certain combination of colors and shapes, etc.

      It's the fact that we forget certain details (small differences between all these chairs) that makes us learn what a "chair" is.

      Likewise, if you remembered every single word in a book, you would not understand its meaning; understanding its meaning = being able to "summarize" (compress) this long list of words into something more essential: storyline, characters, feelings, etc.

      • cmpalmer52 2 years ago

        There’s a story by Jorge Luis Borges called “Funes the Memorious” about a man who remembers everything, but can’t generalize. There’s a line about him not knowing if a dog on the square glimpsed at noon from the side is the same dog as the one seen from the back at 12:01 or something like that. Swirls of smoke from a cigarette are memorized forever. He mostly sits in a dark room.

      • mikewarot 2 years ago

        Long ago, I was introduced to the theory of Mappers and Packers[1], which are polar opposites in the ways that people can learn things. Mappers (like me) have a mental model of the universe which represents facts and knowledge as puzzle pieces that have to fit together into a coherent whole. Any inconsistencies in the fit between those pieces drive us nuts. When we encounter a new set of facts, we have a background process that tries to make them fit. Then all the new connections arise over time as we realize new ways we can combine old facts.

        At the other extreme are packers. They have optimized for packing facts in bulk, with little regard for how they fit together. If you give this type of person a set of instructions that requires a wider knowledge of how things fit, they will get lost, frustrated, and/or need support. If you anticipate this, and spend a bit of extra time to show how to handle all of the possible contingencies (and give them a document of this), they're good, and will be quite happy with your support.

        I think that mappers take more time figuring out the model, compressing the facts to save space and increasing applicability in general.

        [1] https://wiki.c2.com/?MappersVsPackers

      • ComputerGuru 2 years ago

        > but that's exactly what OP said, no?

        Not precisely. We don’t know if verbatim capacity is limited (and it doesn’t seem to be) but the brain operates in a space-efficient manner all the same. So there isn’t necessarily a causative relationship between “memory capacity” and “means of storage”.

        > Likewise, if you remembered every single word in a book, you would not understand its meaning

        I understand your meaning but I want to clarify for the sake of the discussion that unlike with ML, the human brain can both memorize verbatim and understand the meaning because there is no mechanism for memorizing something but not processing it (i.e. purely storage). The first pass(es) are stripped to their essentials but subsequent passes provide the ability to memorize the same input.

      • WanderPanda 2 years ago

        Compression = Intelligence

        http://prize.hutter1.net/

        • briansm 2 years ago

          Nope, more specifically:

          Lossy compression = Intelligence

          That's where the Hutter Prize falls down, it's based on lossless compression, which is nothing like how the brain works.

    • jjk166 2 years ago

      Distilling knowledge is data compression.

      • w10-1 2 years ago

        You're conflating memorization with generalization, no?

        • jjk166 2 years ago

          Memorization is storing data. Generalization is developing the heuristics by which you compress stored data. To distill knowledge is to apply heuristics to lossily-compress a large amount of data to a much smaller amount of data from which you nevertheless can recover enough information to be useful in the future.

          • staunton 2 years ago

            > Generalization is developing the heuristics by which you compress stored data

            such that

            > you nevertheless can recover enough information to be useful in the future.

            I disagree (in case you meant to imply it) that compression implies generalization.

            • jjk166 2 years ago

              I did not mean to imply compression implies generalization, if anything the reverse. Compression is the act of cutting, generalization is the whetstone by which you may sharpen a blade, which is the compression heuristic. A more general heuristic is to compression what a sharper blade is to cutting.

    • nonameiguess 2 years ago

      I've thought about this a lot in the context of the desire people seem to have to try and achieve human immortality, or at least indefinite lifespans. If SciAm is correct here and the upper bound is a quadrillion bytes, we may not be able to hit that given the bound on possible human experiences, but someone who lived long enough would eventually hit it. After a hundred million years of life, or whatever the real number is, you'd either lose the ability to form new memories or you'd have to overwrite old ones to do so.

      Aside from having to eventually experience the death of all stars and light and the decay of most of the universe's baryonic matter and then face an eternity of darkness with nothing to touch, it's yet another reason I don't think immortality (as opposed to just a very long lifespan) is actually desirable.

      • mewpmewp2 2 years ago

        I imagine there would be tech or techniques letting you choose which memories to compress, and countless other techniques like extra storage you can instantly access, so I don't see these as real arguments against becoming immortal. If I have to choose between being dead and memoryless or losing some of my memories but still being alive, why should I choose being dead and memoryless?

        And when losing memories you would first just discard some details, like you do now anyway, but you would start compressing centuries into rough ideas of what happened; it's just the details that would be a bit lacking.

        I don't see it being a problem at all. And if really something happens with the Universe, sure I can die then, but why would I want to die before?

        I want to know what happens, what gets discovered, what happens with humanity, how far we reach in terms of understanding of what is going on in this place. Why are we here? Imagine dying and not even knowing why you were here.

      • p1necone 2 years ago

        My naive assumption would be that it would be a fairly gradual process. You'd just always have a sliding window of the last N years of memories, with the older ones being progressively more fuzzy and unreliable.

      • imtringued 2 years ago

        Longtermists argue that we will be harvesting Hawking radiation from black holes trillions of years after the heat death of the universe.

    • oneTbrain23 2 years ago

      You're hand-waving away Alzheimer's and dementia. Humans don't know exactly how the brain works. The computational storage figure is just an estimate based on how we understand a von Neumann computer storing data as 1s and 0s. In every psychological test conducted on the human mind, there is clearly a limit.

    • downboots 2 years ago

      Can "distill knowledge" be made precise ?

      • ComputerGuru 2 years ago

        As best as I’ve been able to research, it’s still under active exploration and there are hypotheses but no real answers. I believe research has basically been circling around the recent understanding that in addition to being part of how the brain is wired, it is also an active, deliberate (if unconscious) mechanism that takes place in the background and is run “at a higher priority” during sleep (sort of like an indexing daemon running at low priority during waking hours then getting the bulk of system resources devoted to it during idle).

        There are also studies that show “data” in the brain isn’t stored read-only and the process of accessing that memory involves remapping the neurons (which is how fake memories are possible) - so my take is that if you access a memory or datum sequentially, start to finish, each time, the brain knows this is to be stored verbatim for as-is retrieval, but if you access snapshots of it or actively seek to and replay a certain part while trying to relate that memory to a process or a new task, the brain rewires the neural pathways accordingly. Which implies that there is an unconscious part that takes place globally plus an active, modifying process where how we use a stored memory affects how it is stored and indexed (so data isn’t accessed by simple fields but rather by complex properties or getters, in programming parlance).

        I guess the key difference from how machine learning works (and I believe an integral part of AGI, if it is even possible) is that inference is constant, even when you’re only “looking up” data and you don’t know the right answer (i.e. not training stage). The brain recognizes how the new query differs from queries it has been trained on and can modify its own records to take into account the new data. For example, let’s say you’re trying to classify animals into groups and you’ve “been trained” on a dataset that doesn’t include monotremes or marsupials. The first time you come across a platypus in the wild (with its mammaries but no nipples, warm-blooded but lays eggs, and a single duct for waste and reproduction) you wouldn’t just mistakenly classify it as a bird or mammal - you would actively trigger a (delayed/background) reclassification of all your existing inferences to account for this new phenomenon, even though you don’t know what the answer to the platypus classification question is.

      • clord 2 years ago

        Imo, it amounts to revisiting concepts once more general principles are found and needed. For instance, you learn the alphabet, and it's hard: the order is tricky, the sounds are tricky, etc. But eventually it gets distilled to a pattern. You still have to start from A to remember what letter 6 is, until you encounter that problem many times, and then the brain creates a 6=F mapping. I think of it in economic terms: when the brain realizes it's cheaper to create a generalization, it does so on the fly, and that generalization takes over the task.

        Sometimes it's almost like creating a specialist shard to take over the task. Driving is hard at first, with very high task overload and lots to pay attention to. With practice, a little automated part of yourself takes care of those tasks while your main general intelligence can do whatever it likes, even as the "driver" deals with seriously difficult tasks.

        • pyinstallwoes 2 years ago

          It’s almost like a tuning fork. When the mapping becomes 6=F, the two separate tuning forks are now sympathetic to a frequency.

      • __loam 2 years ago

        Unless you know something the neuroscientists don't, it cannot.

    • gattilorenz 2 years ago

      Is there a “realistic upper bound” on things that should be memorized verbatim? Ancient Greeks probably memorized the Iliad and other poems (rhyming and metre might work as a substitute for data compression, in this case), and many medieval preachers apparently memorized the whole Bible…

    • firecall 2 years ago

      Does the brain require more energy to store more information?

      Or is it always running at the same pace regardless of if it’s empty or not?

      I guess the brain doesn’t really work like that…. But I’m curious :-)

      • ReactiveJelly 2 years ago

        The brain doesn't seem to ever "clock down" when it's idle, which is interesting to me

        https://en.wikipedia.org/wiki/Human_brain#Metabolism

        > The energy consumption of the brain does not vary greatly over time

        • gmadsen 2 years ago

          Maybe between sleep and normal waking idle, because there is actually quite a bit going on during sleep. There has been quite a bit of research, though, regarding higher "clock up" states consuming far more energy, such as grandmasters playing a chess tournament.

          • lobocinza 2 years ago

            My lazy research pointed to 6K calories, which is a lot more than a typical person can burn through physical exercise.

    • TheRealSteel 2 years ago

      You seem to have just re-stated what the other person said.

  • bufferoverflow 2 years ago

    There are rare people who remember everything

    https://youtu.be/hpTCZ-hO6iI

    • svachalek 2 years ago

      It's pretty fascinating to me how "normal" Marilu Henner seems to be. I'm getting older and my memory is not what it was, but when I was younger it was pretty extraordinary. I did really well in school and college but over time I've realized it was mostly due to being able to remember most things pretty effortlessly, over being truly "smart" in a classic sense.

      But having so much of the past being so accessible is tough. There are lots of memories I'd rather not have, that are vivid and easily called up. And still, I think it's only a fraction of what her memory seems to be like.

      • 93po 2 years ago

        As someone on the other end of the spectrum, I have an awful memory, and don't remember most of my life aside from really wide, sweeping generalizations and maybe a couple hundred very specific memories. My way of existence is also very sad, and it makes me feel like I've not really lived.

        • obscurette 2 years ago

          It's likely that you actually do have memories of details, but don't have a way to recall them. I always wondered how the heck people write memoirs until I saw someone do it. He used a lot of triggers – photos, newspapers, letters, etc. Later I had a chance to visit a museum where a typical home environment of my childhood was exhibited (yes, I'm that old) and realized how many memories small things can trigger in my brain.

          • 93po 2 years ago

            I agree. There are definitely triggers for old memories and I have heard cannabis also adds flexibility to that recall

      • TheRealSteel 2 years ago

        " I did really well in school and college but over time I've realized it was mostly due to being able to remember most things pretty effortlessly"

        Same! They thought I was a genius in primary school but I ended up a loser adult with a dead end job. Turns out I just liked technology and was good at remembering facts and names for things.

    • hgsgm 2 years ago

      Is there scientific evidence of that or just claims?

  • tbalsam 2 years ago

    For more information and the related math behind associative memories, please see Hopfield Neural Networks.

    While the upper bound is technically "infinity", there is a tradeoff between the number of concepts stored and the fundamental amount of information storable per concept, similar to how other tradeoff principles, like the uncertainty principle, work.
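
    For a concrete feel of that storage/recall tradeoff, here is a minimal Hebbian Hopfield sketch in NumPy (a toy, not tuned; the sizes are arbitrary, and the classic result is that roughly 0.14*N random patterns are recoverable with N units):

      import numpy as np

      rng = np.random.default_rng(0)
      N, P = 100, 10                              # 100 units, 10 stored patterns (well under ~0.14*N)
      patterns = rng.choice([-1, 1], size=(P, N))

      # Hebbian storage: sum of outer products, no self-connections.
      W = sum(np.outer(p, p) for p in patterns) / N
      np.fill_diagonal(W, 0)

      # Recall from a corrupted copy of pattern 0 (10 of its 100 bits flipped).
      state = patterns[0].copy()
      flipped = rng.choice(N, size=10, replace=False)
      state[flipped] *= -1

      for _ in range(20):                         # iterate updates toward a fixed point
          state = np.sign(W @ state)
          state[state == 0] = 1

      print(np.mean(state == patterns[0]))        # close to 1.0 when well under capacity

    Pushing P far past that capacity makes recall degrade sharply, which is the stored-concepts-versus-fidelity tradeoff being referred to.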

  • mr_toad 2 years ago

    Artificial neural networks work a lot like compression algorithms in their ability to predict the future. The trained network is a compression algorithm - it does not store compressed data.

    We don’t know if the animal brain works the same way, but I suspect it is mostly compression algorithms designed to predict things, and doesn’t store much data at all.

  • bobboies 2 years ago

    Good example: in my math and physics classes I found it really helpful to understand the general concepts; then, instead of memorizing formulas, I could actually derive them from other known (perhaps easier-to-remember) facts.

    Geometry is good for training in this way—and often very helpful for physics proofs too!

    • lacrimacida 2 years ago

      Too bad this method is penalized most on timed tests, where memorization is favored. But deriving results reinforces knowledge, understanding, and patterns best, in my opinion.

  • pillefitz 2 years ago

    That is essentially what embeddings do

    • nightski 2 years ago

      Maybe, except from my understanding an embedding vector tends to be much larger than the source token (due to the high dimensionality of the embedding space). So it's almost like a reverse compression in a way. That said, I know vector DBs have much more efficient ways of storing those vector embeddings.
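
      A minimal sketch of that size blow-up, assuming a PyTorch-style embedding table (the vocabulary size and dimension below are made up for illustration):

        import torch
        import torch.nn as nn

        vocab_size, embed_dim = 50_000, 768      # hypothetical sizes
        embedding = nn.Embedding(vocab_size, embed_dim)

        token_id = torch.tensor([42])            # one token: a single integer id
        vector = embedding(token_id)             # becomes a 768-dimensional float vector
        print(vector.shape)                      # torch.Size([1, 768])
        # A few bytes of token id expand into 768 floats (~3 KB), hence the "reverse compression" feeling.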

  • BSEdlMMldESB 2 years ago

    Yes, when we do this to history, it becomes filled with conspiracies. But it is merely a process to 'understand' history by projecting intentionalities.

    This 'compression' is what 'understanding' something really entails; at first... but then there's more.

    When knowledge becomes understood it enables perception (e.g. we perceive meaning in words once we learn to read).

    When we get really good at this understanding-perception we may start to 'manipulate' the abstractions we 'perceive'. An example would be to 'understand a cube' and then be able to rotate it around so as to predict what would happen without really needing the cube. But this is an overly simplistic example.

    • NovaDudely 2 years ago

      This was the direction my thinking was taking. It is a useful tool at first, but taken too far it can be a bad thing in some situations.

  • pyinstallwoes 2 years ago

    Maxwell’s demon to entropy

greenflag 2 years ago

It seems the take-home is that weight decay induces sparsity, which helps learn the "true" representation rather than an overfit one. It's interesting that the human brain has a comparable mechanism prevalent in development [1]. I would love to know from someone in the field if this was the inspiration for weight decay (or presumably for the more nearly equivalent NN pruning [2]).

[1] https://en.wikipedia.org/wiki/Synaptic_pruning [2] https://en.wikipedia.org/wiki/Pruning_(artificial_neural_net...

  • tbalsam 2 years ago

    ML researcher here wanting to offer a clarification.

    L1 induces sparsity. Weight decay explicitly _does not_, as it is L2. This is a common misconception.
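
    A quick way to see that difference empirically is a toy scikit-learn regression (the data and penalty strengths here are arbitrary, purely for illustration):

      import numpy as np
      from sklearn.linear_model import Lasso, Ridge

      rng = np.random.default_rng(0)
      X = rng.normal(size=(200, 50))
      true_w = np.zeros(50)
      true_w[:5] = [3.0, -2.0, 1.5, 4.0, -1.0]   # only 5 of 50 features matter
      y = X @ true_w + 0.1 * rng.normal(size=200)

      lasso = Lasso(alpha=0.1).fit(X, y)         # L1 penalty
      ridge = Ridge(alpha=0.1).fit(X, y)         # L2 penalty, the analogue of weight decay

      print((lasso.coef_ == 0).sum())            # many coefficients driven exactly to zero (sparse)
      print((ridge.coef_ == 0).sum())            # typically 0: weights shrink but stay nonzero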

    Something a lot of people don't know is that weight decay works because when applied as regularization it causes the network to approach the MDL, which reduces regret during training.

    Pruning in the brain is somewhat related, but because the brain uses sparsity to (fundamentally, IIRC) induce representations instead of compression, it's basically a different motif entirely.

    If you need a hint here on this one, think about the implicit biases of different representations and the downstream impacts that they can have on the learned (or learnable) representations of whatever system is in question.

    I hope this answers your question.

    • mmmmpancakes 2 years ago

      can you please spell out what MDL is an acronym for?

    • naasking 2 years ago

      > because the brain uses sparsity to (fundamentally, IIRC) induce representations instead of compression

      What's the evidence for this?

      • heyitsguay 2 years ago

        https://bernstein-network.de/wp-content/uploads/2021/03/Lect... this has an awesome overview of the current understanding of neural encoding mechanisms.

        • tbalsam 2 years ago

          I enjoyed this presentation, thank you for sharing it. Good stuff in here.

          I think things are a bit off about the reasoning behind the basis functions, but as I noted elsewhere here that's work I'm not entirely able to talk about as I'm actively working on developing it right now, and will release it when I can.

          However, you can see some of the empirical consequences of an updated understanding on my end of encoding and compression in a release of hlb-CIFAR10 that's coming up soon that should cut out another decent chunk of training time. As a part of it, we reduce the network from a ResNet8 architecture to a ResNet7, and we additionally remove one of the (potentially less necessary) residuals. It is all 'just' empirical, of course, but long-term, as they say, the proof is in the pudding, since things are already so incredibly tightened down.

    • joaogui1 2 years ago

      That looks interesting, do you know what paper talks about the connection between MDL, regret, and weight decay?

      • tbalsam 2 years ago

        I would start with Shannon's information theory and the Wikipedia page on L2/the MDL as a decent starting point.

        For the first, there are a few good papers that simplify the concepts even further.

        • joaogui1 2 years ago

          Sorry, I know what MDL and L2 regularization are, I would like the paper that connects them in the way you mentioned

  • visarga 2 years ago

    The inspiration for weight decay was to reduce the model's capacity to memorize until it perfectly fits the complexity of the task, no more, no less. A model more complex than the task over-fits; a less complex one under-fits. Got to balance them out.

    But the best cure for over-fitting is to make the dataset larger and ensure data diversity. LLMs have datasets so large they usually train for one epoch.

    • nightski 2 years ago

      It sounds nice in theory, but the data itself could be problematic. There is no temporal nature to it. You can have duplicate data points, or many data points that are closely related but describe the same thing/event/etc. So while showing the model each data point only once ensures you do not put any extra weight on a data point, if the dataset itself is skewed that doesn't help you at all.

      Just by trying to make the dataset diverse you could skew things to not reflect reality. I just don't think enough attention has been paid to the data, and too much to the model. But I could be very wrong.

      There is a natural temporality to the data humans receive. You can't relive the same moment twice. That said, human intelligence is on a scale too and may be affected in the same way.

      • visarga 2 years ago

        > I just don't think enough attention has been paid to the data, and too much to the model.

        I wholly agree. Everyone is blinded by models - GPT4 this, LLaMA2 that - but the real source of the smarts is in the dataset. Why would every model, no matter how its architecture is tweaked, learn the same abilities from the same data? Why would humans all be able to learn the same skills when every brain is quite different? It was the data, not the model.

        And since we are exhausting all the available quality text online we need to start engineering new data with LLMs and validation systems. AIs need to introspect more into their training sets, not just train to reproduce them, but analyse, summarise and comment on them. We reflect on our information, AIs should do more reflection before learning.

        More fundamentally, how are AIs going to evolve past human level unless they make their own data or they collect data from external systems?

        • ben_w 2 years ago

          > It was the data, not the model

          It's both.

          It's clearly impossible to learn how to translate Linear A into modern English using only content written in pure Japanese that never references either.

          Yet also, none of the algorithms before Transformers were able to first ingest the web, then answer a random natural language question in any domain — closest was Google etc. matching on indexed keywords.

          > how are AIs going to evolve past human level unless they make their own data?

          Who says they can't make their own data?

          Both a priori (by development of "new" mathematical and logical tautological deductions), and a posteriori by devising, and observing the results of, various experiments.

          Same as us, really.

          • riversflow 2 years ago

            I see this brought up consistently on the topic of AI take-off/X-risk.

            How does an AI language model devise an experiment and observe the results? The language model is only trained on what’s already known, I’m extremely incredulous that this language model technique can actually reason a genuinely novel hypothesis.

            An LLM is a series of weights sitting in the RAM of a GPU cluster; it’s really just a fancy prediction function. It doesn’t have the sort of biological imperatives (a result of being completely independent beings) or entropy that drive living systems.

            Moreover, if we consider how it works for humans, people have to _think_ about problems. Do we even have a model or even an idea about what “thinking” is? Meanwhile science is a looping process that mostly requires a physical element (testing/verification) to it. So unless we make some radical breakthroughs in general purpose robotics, as well as overcome the thinking problem, I don’t see how AI can do some sort of tech breakout/runaway.

            • ben_w 2 years ago

              Starting with the end so we're on the same page about framing the situation:

              > I don’t see how AI can do some sort of tech breakout/runaway.

              I'm expecting (in the mode, but with a wide and shallow distribution) a roughly 10x increase in GDP growth, from increased automation etc., not a singularity/foom.

              I think the main danger is bugs and misuse (both malicious and short-sighted).

              -

              > How does an AI language model devise an experiment and observe the results?

              Same way as Helen Keller.

              Same way scientists with normal senses do for data outside human sense organs, be that the LHC or nm/s^2 acceleration of binary stars or gravity waves (or the confusingly similarly named but very different gravitational waves).

              > The language model is only trained on what’s already known, I’m extremely incredulous that this language model technique can actually reason a genuinely novel hypothesis.

              Were you, or any other human, trained on things unknown?

              If so, how?

              > An LLM is a series of weights sitting in the RAM of a GPU cluster; it’s really just a fancy prediction function. It doesn’t have the sort of biological imperatives (a result of being completely independent beings) or entropy that drive living systems.

              Why do you believe that biological imperatives are in any way important?

              I can’t see how any of a desire to eat, shag, fight, run away, or freeze up… helps with either the scientific method or pure maths.

              Even the "special sauce" that humans have over other animals didn't lead to any us doing the scientific method until very recently, and most of us still don't.

              > Do we even have a model or even an idea about what “thinking” is?

              AFAIK, only in terms of output, not qualia or anything like that.

              Does it matter if the thing a submarine does is swimming, if it gets to the destination? LLMs, for all their mistakes and their… utterly inhuman minds and transhuman training experience… can do many things which would've been considered "implausible" even in a sci-fi setting a decade ago.

              > So unless we make some radical breakthroughs in general purpose robotics

              I don't think it needs to be general, as labs are increasingly automated even without general robotics.

            • kaba0 2 years ago

              > Do we even have a model or even an idea about what “thinking” is

              At the least, it is a computable function (as we don’t have any physical system that would be more general than that, though some religions might disagree). Which already puts human brains ahead of LLM systems, as we are Turing-complete, while LLMs are not, at least in their naive application (their output can be fed back into subsequent invocations, and that way it can be).

              • swid 2 years ago

                I googled whether or not universal function approximators, which neural nets are considered to be, are also considered Turing complete. It seems the general consensus is kind of not, since they are continuous and can’t do discrete operations in the same way.

                But also, that isn’t quite the whole story, since they can be arbitrarily precise in their approximation. Here[0] is a white paper addressing this issue which concludes attention networks are Turing complete.

                0: https://jmlr.org/papers/volume22/20-302/20-302.pdf

                • kaba0 2 years ago

                  If I’m not mistaken that’s only for arbitrary precision, which is not realistic.

              • ben_w 2 years ago

                Is it provably not Turing complete? That property pops up everywhere even when not intended, like in Magic: The Gathering card interactions.

                Technically you may not want to call it Turing complete given the limited context window, but I'd say that's like insisting a Commodore 64 isn't Turing complete for the same reason.

                Likewise the default settings may be a bit too random to be a Turing machine, but that criticism would also apply to a human.

                • kaba0 2 years ago

                  It is basically a single huge matrix multiplication — you need some form of loop/repetition/recursion to be Turing complete.

                  Sure, it is not a hard property to have; Excel, CSS with mouse movements, and the Game of Life are all that, but they need a “possibly forever running” part.

                  • ben_w 2 years ago

                    ChatGPT does have a loop, that's why it produces more than one token.
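
                    A schematic sketch of that outer loop (the model and tokenizer here are placeholders, not any real API):

                      def generate(model, tokenizer, prompt, max_new_tokens=100):
                          # The "loop": feed the growing sequence back in, one token per iteration.
                          tokens = tokenizer.encode(prompt)
                          for _ in range(max_new_tokens):
                              logits = model(tokens)                 # forward pass over the whole context
                              next_token = int(logits[-1].argmax())  # greedy pick; real systems sample
                              tokens.append(next_token)
                              if next_token == tokenizer.eos_id:     # a stop token ends the loop early
                                  break
                          return tokenizer.decode(tokens)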

                    In this context, that the possibility of running "forever" would also exclude the humans (to which it is being compared) is relevant — even if we spend all day thinking in words at the rate of 160wpm and .75 words per token, we fall asleep around every 200k tokens, and some models (not from OpenAI) exceed that in their input windows.

                    • kaba0 2 years ago

                      Yet I can solve many sudoku problems in a single wake cycle.

                      Also, its output is language and it can’t change an earlier part of it, only append to it. When “thinking” about what to say next, it can’t “loop” over that, only decide whether to append some more text to it. Its looping is strictly within a “static context”.

            • imtringued 2 years ago

              It's not just a series of weights. It is an unchanging series of weights. This isn't necessarily artificial intelligence. It is the intelligence of the dead.

          • whimsicalism 2 years ago

            > Yet also, none of the algorithms before Transformers were able to first ingest the web, then answer a random natural language question in any domain — closest was Google etc. matching on indexed keywords.

            Wrong, recurrent models were able to do this, just not as well.

        • Salgat 2 years ago

          This is definitely current models' biggest issue. You're training a model against millions of books worth of data (which would take a human tens of thousands of lifetimes) to achieve a superficial level of conversational ability to match a human, which can consume at most 3 novels a day without compromising comprehension. Current models are terribly inefficient when it comes to learning from data.

          • famouswaffles 2 years ago

            Modern LLMs are nowhere near the scale of the human brain however you want to slice things, so "terribly inefficient" is very arguable. Also, language skills seemingly take much less data and scale when you aren't trying to have the model learn the sum total of human knowledge. https://arxiv.org/abs/2305.07759

            • Salgat 2 years ago

              Scale is a very subjective thing since one is analog (86B neurons) and one is digital (175B parameters). Additionally, consider how many compute hours GPT 3 took to train (10,000 V100s were set aside for exclusive training of GPT 3). I'd say that GPT 3 scale vastly dwarfs the human brain, which runs at a paltry 12 watts.

              • kaba0 2 years ago

                Von Neumann’s The Computer and the Brain is way out of date in terms of today’s hardware, but funnily it is still relevant on this metric. Biological systems are more analogous to a distributed system of small, very slow CPUs. Even GPUs, which somewhat close the gap between the few crazy-fast CPUs and the aforementioned many slow ones, are still much faster than any one neuron at calculations, but remain overly serial. It is not the number of CPUs, but the number of their connections, that makes biological systems so powerful.

                • Salgat 2 years ago

                  Parameters have many connections too though. If the next layer is 1000 parameters wide, you have potentially 1000 connections from a single parameter.

          • whimsicalism 2 years ago

            You have to count the training process from the origin of the human brain imo, not from the birth of any individual human.

            Neural nets look much more competitive by that standard.

            • Salgat 2 years ago

              Yet humans designed the models, so the training process for ChatGPT etc. includes human evolution, by your logic.

              • whimsicalism 2 years ago

                This is a good point, and the level of so-called task-specific "inductive bias" in models is an active point of discussion. But I don't think it is fair to add all of our evolution to the model's inductive bias, because most of evolution was not directed towards giving the model a better understanding of language; it was directed towards a better understanding of language in humans.

          • imtringued 2 years ago

            They are inefficient by design. Gradient descent and backpropagation scale poorly, but they work and GPUs are cheap, so here we are.

    • crdrost 2 years ago

      And there have been a lot of approaches to doing this, my favorite one being the idea that maybe, if we just randomly zap out some of the neurons while we train the rest, forcing the network to acquire that redundancy might privilege structured representations over memorization. It has always seemed like some fraternity prank: “if you REALLY know the tenets of Delta Mu Beta you can recite them when drunk after we spin you around in a circle twelve times fast!”
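
      That idea is dropout; a minimal sketch of the training-time behaviour (the "inverted" form most frameworks use, in plain NumPy):

        import numpy as np

        def dropout(activations, p=0.5, training=True, seed=None):
            # Zero out a random fraction p of units during training,
            # scaling the survivors so the expected activation stays the same.
            if not training or p == 0.0:
                return activations
            rng = np.random.default_rng(seed)
            mask = rng.random(activations.shape) >= p
            return activations * mask / (1.0 - p)

        h = np.ones(8)
        print(dropout(h, p=0.5, seed=0))   # roughly half the entries zeroed, the rest scaled up to 2.0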

    • kaibee 2 years ago

      > But the best cure for over-fitting is to make the dataset larger and ensure data diversity.

      This is also good life advice.

  • BaseballPhysics 2 years ago

    The human brain has synaptic pruning. The exact purpose of it is theorized but not actually understood, and it's a gigantic leap to assume some sort of analogous mechanism between LLMs and the human brain.

  • pcwelder 2 years ago

    AFAIK weight decay is inspired by L2 regularisation, which goes back to linear regression, where L2 regularisation is equivalent to having a Gaussian prior on the weights with zero mean.

    Note that L1 regularisation produces much more sparsity but it doesn't perform as well.

    • nonameiguess 2 years ago

      This. Weight decay is just a method of dropping most weights to zero, which is a standard technique used by statisticians for regularization purposes for decades. As far as I understand, it goes back at least to Tikhonov, and from 1970 it was mostly called ridge regression in the regression context. Ordinary least squares attempts to minimize the squared L2 norm of the residuals. When a system is underdetermined, adding a penalty term (usually just a scalar multiple of an identity matrix) and also minimizing the L2 norm of that biases the model to produce mostly near-zero weights. This helps with underdetermined systems and gives a better conditioned model matrix that is actually possible to solve numerically without underflow.
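
      In the closed form, that penalty is just the extra lambda * I term in the normal equations; a small sketch with made-up data:

        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.normal(size=(30, 100))   # underdetermined: more unknowns than equations
        y = rng.normal(size=30)
        lam = 1.0

        # The plain normal equations, (X^T X) w = X^T y, are singular here;
        # adding lam * I makes the system well conditioned and solvable.
        w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(100), X.T @ y)
        print(w_ridge.shape, np.abs(w_ridge).mean())   # 100 small, mostly near-zero weights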

      It's kind of amazing to watch this from the sidelines, a process of engineers getting ridiculously impressive results from some combo of sheer hackery and ingenuity, great data pipelining and engineering, extremely large datasets, extremely fast hardware, and computational methods that scale very well, but at the same time, gradually relearning lessons and re-inventing techniques that were perfected by statisticians over half a century ago.

      • tbalsam 2 years ago

        L1 drops weights to zero; L2 biases towards Gaussianity.

        It's not always relearning lessons or people entirely blindly trying things either, many researchers use the underlying math to inform decisions for network optimization. If you're seeing that, then that's probably a side of the field where people are newer to some of the math behind it, and that will change as things get more established.

        The underlying mathematics behind these kinds of systems are what has motivated a lot of the improvements in hlb-CIFAR10, for example. I don't think I would have been able to get there without sitting down with the fundamentals, planning, thinking, and working a lot, and then executing. There is a good place for blind empirical research too, but it loses its utility past a certain point of overuse.

      • whimsicalism 2 years ago

        This comment is so off base. First off, no, L2 does not encourage near-zero weights; second off, they are not relearning, everyone already knew what L1/L2 penalties are.

gorjusborg 2 years ago

Grr, the AI folks are ruining the term 'grok'.

It means roughly 'to understand completely, fully'.

To use the same term to describe generalization... just shows you didn't grok grokking.

  • erwald 2 years ago

    "Grok" in AI doesn't quite describe generalization, it's more specific that that. It's more like "delayed and fairly sudden generalization" or something like that. There was some discussion of this in the comments of this post[1], which proposes calling the phenomenon "eventual recovery from overfitting" instead.

    [1] https://www.lesswrong.com/posts/GpSzShaaf8po4rcmA/qapr-5-gro...

    • tbalsam 2 years ago

      Part of the issue here is posting a LessWrong post. There is some good in there, but much of that site is like a Flat Earth conspiracy theory for neural networks.

      Neural network training [edit: on a fixed point task, as is often the case {such as image->label}] is always (always) biphasic necessarily, so there is no "eventual recovery from overfitting". In my experience, it is just people newer to the field or just noodling around fundamentally misunderstanding what is happening, as their network goes through a very delayed phase change. Unfortunately there is a significant amplification to these kinds of posts and such, as people like chasing the new shiny of some fad-or-another-that-does-not-actually-exist instead of the much more 'boring' (which I find fascinating) math underneath it all.

      To me, as someone who specializes in optimizing network training speeds, it just indicates poor engineering to the problem on the part of the person running the experiments. It is not a new or strange phenomenon, it is a literal consequence of the information theory underlying neural network training.

      • PoignardAzur 2 years ago

        > Part of the issue here is posting a LessWrong post

        I mean, this whole line of analysis comes from the LessWrong community. You may disagree with them on whether AI is an existential threat, but the fact that people take that threat seriously is what gave us this whole "memorize-or-generalize" analysis, and glitch tokens before that, and RLHF before that.

        • tbalsam 2 years ago

          I think you may be missing the extensive lines of research covering those topics. Memorization vs Generalization has been a debate before LW even existed in the public eye, and inputs that networks have unusual sensitivity to have been well studied as well (re:chaotic vs linear regimes in neural networks). Especially the memorization vs generalization bit -- that has been around for...decades. It's considered a fundamental part of the field, and has had a ton of research dedicated to it.

          I don't know much either way about RLHF in terms of its direct lineage, but I highly doubt that is actually what happened, since DeepMind is actually responsible for the bulk of the historical research supporting those methods.

          It's possible, à la the broken clock hypothesis (and LessWrong is obviously not the "primate at a typewriter" situation), that some people score meaningful contributions, but the signal-to-noise ratio is awful. I want to get something out of some of the posts I've tried to read there, but there are so many bad takes written in bombastic language that it's really quite hard indeed.

          Right now, it's an active detractor to the field because it pulls attention away from things that are much more deserving of energy and time. I honestly wish the vibe was back to people even just making variations of Char-RNN repos based on Karpathy's blog posts. That was a much more innocent time.

          • PoignardAzur 2 years ago

            > I think you may be missing the extensive lines of research covering those topics. Memorization vs Generalization

            I meant this specific analysis, that neural networks that are over-parameterized will at first memorize but, if they keep training on the same dataset with weight decay, will eventually generalize.

            Then again, maybe there have been analyses done on this subject I wasn't aware of.

            • tbalsam 2 years ago

              Gotcha. I'm happy to do the trace as it likely would be fruitful for me.

              Do you have a link to a specific post you're thinking of? It's likely going to be a Tishby-like (the classic paper from 2015 {with much more work going back into the early aughts, just outside of the NN regime IIRC}: https://arxiv.org/abs/1503.02406) lineage, but I'm happy to look to see if it's novel.

              • PoignardAzur 2 years ago

                The specific post I'm thinking of is A Mechanistic Interpretability Analysis of Grokking - https://www.alignmentforum.org/posts/N6WM6hs7RQMKDhYjB/a-mec...

                I originally thought the PAIR article was another presentation by the same authors, but upon closer reading, I think they just independently discovered similar results. Though the PAIR article quotes Progress measures for grokking via mechanistic interpretability, the Arxiv paper by the authors of the alignmentforum article.

                (In researching this I found another paper about grokking finding similar results a few months earlier; again, I suspect these are all parallel discoveries.)

                You could say that all of these avenues of research are all re-statements of well-known properties, eg deep double-descent, but I think that's a stretch. Double descent feels related, but I don't think a 2018 AI researcher who knew about double descent would spontaneously predict "if you train your model past the point it starts overfitting, it will start generalizing again if you train it for long enough with weight decay".

                But anyway, in retrospect, I agree that saying "the LessWrong community is where this line of analysis comes from" is false; it's more like they were among the people working on it and reaching similar conclusions.

        • woopwoop 2 years ago

          I don't think that is true? As far as I know the grokking phenomenon was first observed (and the name coined) in this paper, not in any blog post:

          https://arxiv.org/abs/2201.02177

          • tbalsam 2 years ago

            That's true, and I probably should have done some better backing up, sorting out, and clarification. I remember when that paper came out, it rubbed me the wrong way too then, because it is people rediscovering double descent from a different perspective, and not recognizing it as such.

            What it would be better defined as is "a sudden change in phase state after a long period of metastability". Even then it ignores that those sharp inflections indicate a poor KL between some of the inductive priors and the data at hand.

            You can think about it as the loss signal from the support of two gaussians extremely far apart with narrow standard deviations. Sure, they technically have support, but in a noisy regime you're going to have nothing.... nothing.... nothing....and then suddenly something as you hit that point of support.
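
              A quick numerical illustration of that support argument (one unit-variance Gaussian evaluated far from its mean; the distances are arbitrary):

                from scipy.stats import norm

                # The log-density is technically nonzero everywhere, but numerically "nothing"
                # until you get close, and then it jumps up sharply.
                for x in [0, 10, 20, 30, 39, 40]:
                    print(x, norm.logpdf(x, loc=40, scale=1.0))
                # roughly -801, -451, -201, -51, -1.4, -0.9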

            Little of the literature, definitions around the word, or anything like that really takes this into account generally, leading to this mass illusion that this is not a double descent phenomenon, when in fact it is.

            Hopefully this is a more appropriate elaboration, I appreciate your comment pointing out my mistake.

            • rileyphone 2 years ago

              Singular learning theory explains the sudden phase changes of generalization in terms of resolution of singularities. Alas it's still associated with the LW crowd.

              https://www.lesswrong.com/s/mqwA5FcL6SrHEQzox/p/fovfuFdpuEwQ...

              • tbalsam 2 years ago

                If it's any consolation, that post is...hot word salad garbage. It's like they learned the words on Wikipedia and then proceeded to try to make a post that used as many of them as possible. It's a good litmus test for experience vs armchair observers -- certainly scanning the article without decoding the phrasing to see how silly the argument is would seem impressive because "oooooh, fancy math". It's sort of why LW is more popular, because it is basically white collar flat-earthery, and many of the relevant topics discussed have already been discussed ad infinitum in the academic world and are accepted as general fact. We're generally not dwelling on silly arguments like that.

                One of the most common things I see is people oftentimes assuming something that came from LW is novel and "was discovered through research published there", and that's because oftentimes it's really incentivized to make a lot of noise and sound plausible over there. Whereas arxiv papers, while there is some battle for popularity, are inherently more "boring" and formal.

                For example, the LW post as I understand it completely ignores existing work and just... doesn't cite things which are rigorously reviewed and prepared. How about this paper from five years ago in a long string of research about generalization loss basins, for example? https://papers.nips.cc/paper_files/paper/2018/hash/be3087e74...

                If someone earnestly tried to share the post you linked at a workshop at a conference, they would not be laughed out of the room, but instead have to deal with the long, draining, and muffling silence of walking to the back of the room without any applause when it was over. It's not going to fly with academics/professionals who are academia-adjacent.

                This whole thing is not too terribly complicated, either, I personally feel -- a little information theory and the basics, and time studying and working on it, and someone is 50% of the way there, in my personal opinion. I feel frustrated that this kind of low quality content is parasitically supplanting actual research with meaning and a well-documented history. This is flashy nonsense that goes nowhere, and while I hesitate to call it drivel, is nigh-worthless. This barely passes muster for a college essay on the subject, if even that. If I was their professor, I would pull them aside to see if there is a more productive way for them to channel their interests in the Deep Learning space, and how we could better accomplish that.

      • ShamelessC 2 years ago

        > Part of the issue here is posting a LessWrong post. There is some good in there, but much of that site is like a Flat Earth conspiracy theory for neural networks.

        Indeed! It’s very frustrating that so many people here are such staunch defenders of LessWrong. Some/much of the behavior there is honestly concerning.

        • tbalsam 2 years ago

          100% agreed. I'm pretty sure today was the first time I learned that the site was founded by Yudkowsky, which honestly explains quite a bit (polite 'lol' added here for lightheartedness)

      • tbalsam 2 years ago

          To further clarify things, the reason there is no mystical 'eventual recovery from overfitting' is that overfitting is a stable bound that is approached. Applying this false label implies a non-biphasic nature to neural network training, and adds false information that wasn't there before.

          Thankfully things are pretty stable in the over/underfitting regime. I feel sad when I see ML misinformation propagated on a forum that requires little experience but has high leverage, due to the rampant misuse of existing terms and the complete invention of an in-group language that has little contact with the mathematical foundations of what's happening behind the scenes. I've done this for 7-8 years at this point at a pretty deep level and have a strong pocket of expertise, so I'm not swinging at this one blindly.

      • Noumenon72 2 years ago

        What are the two phases? What determines when you switch?

        • tbalsam 2 years ago

            Memorization of individual examples -> generalization. I can't speak about the determinant of switching, as that is (partially, to some degree) work I'm working on, and I have a personal rule not to share work in progress until it's completed (and then to be very open and explicit about it). My apologies on that front.

            However, I can point you to one comment I made earlier in this particular comment section about the MDL and how that relates to the L2 norm. Obviously this is not the only thing that induces a phase change, but it is one of the more blatant ones that's been covered a little more publicly by different people.

    • gorjusborg 2 years ago

      Whoever suggested 'eventual recovery from overfitting' is a kindred spirit.

      Why throw away the context and nuance?

      That decision only further leans into the 'AI is magic' attitude.

      • jeremyjh 2 years ago

        No, actually this is just how language evolves. I'm glad we have the word "car" instead of "carriage powered by internal combustion engine" even if it confused some people 100 years ago when the term became used exclusively to mean something a bit more specific.

        Of course the jargon used in a specific sub-field evolves much more quickly than common usage because the intended audience of paper like this is expected to be well-read and current in the field already.

        • smolder 2 years ago

          Language devolves just as it evolves. We (the grand we) regularly introduce ambiguity: words and meanings with no useful purpose, or that are worse than useless.

          I'm not really weighing in on the appropriateness of the use of "grok" in this case. It's just a pet peeve of mine that people bring out "language evolves" as an excuse for why any arbitrary change is natural and therefore acceptable and we should go with the flow. Some changes are strictly bad ones.

          A go-to example is when "literally" no longer means "literally", but its opposite, or nothing at all. We don't have a replacement word, so now in some contexts people have to explain that they "literally mean literally".

          • krapp 2 years ago

            Language only evolves, "devolving" isn't a thing. All changes are arbitrary. Language is always messy, fluid and ambiguous. You should go with the flow because being a prescriptivist about the way other people speak is obnoxious and pointless.

            And "literally" has been used to mean "figuratively" for as long as the word has existed[0].

            [0]https://blogs.illinois.edu/view/25/96439

            • smolder 2 years ago

              I'm going to take a rosier view of prescriptivists and say they are a necessary part of the speaking/writing public, doing the valuable work of fighting entropic forces to prevent making our language dumb. They don't always need to win or be right.

              That's the first time I've seen literally-as-figuratively defended from a historical perspective. I still think we'd all be better off if people didn't mindlessly use it as a filler word or for emphasis, which is generally what people are doing these days that is the source of controversy, not reviving an archaic usage.

              Also, it's kind of ironic you corrected my use of "devolves", where many would accept it. :)

            • mdp2021 2 years ago

              > devolving isn't a thing

              Incompetent use is devolution.

              • gorjusborg 2 years ago

                Also being overlooked is that the nuances in what we accept are in large part how we define group culture.

                If you want to use the word 'irregardless' unironically there are people who will accept that. Then there are the rest of us.

                • kaba0 2 years ago

                  Just as an added data point, some languages (e.g. Hungarian) do use double negatives “natively”, and I have definitely caught myself having to fight some native expressions seeping into my English, including ‘irregardless’. For example, a Hungarian would say “I have never done nothing bad” over “anything bad”, but it is used not in a logical sense but more for emphasis, perhaps.

                  (!)Regardless, what I’m trying to say is that due to the unique position of English as the de facto world language, it has to “suffer” some non-idiomatic uses seeping in from non-natives. Actually, I would go even further and say that most smaller languages will slowly stop evolving and only English will have that property going forward (most new inventions no longer get a native name in most languages; the English one is used).

        • gorjusborg 2 years ago

          > No, actually this is just how language evolves

          Stop making 'fetch' happen, it's not going to happen.

  • benreesman 2 years ago

    Sci-Fi Nerd Alert:

    “Grok” was Valentine Michael Smith’s rendering, for human ears and vocal cords, of a Martian word with a precise denotational meaning of “to drink”. The connotational semantics range from literally or figuratively “drinking deeply” all the way up to consuming the carcass of a departed cherished one.

    I highly recommend Stranger in A Strange Land (and make sure to get the unabridged re-issue, 1990 IIRC).

  • mxwsn 2 years ago

    They're just defining grokking in a different way. It's reasonable to me though - grokking suggests elements of intuitive understanding, and a sudden, large increase in understanding. These mirror what happens to the loss.

  • whimsicalism 2 years ago

    I literally do not see the difference between the two uses that you are trying to make

  • jjk166 2 years ago

    I've always considered the important part of grokking something to be the intuitiveness of the understanding, rather than the completeness.

  • paulddraper 2 years ago

    What's the difference between understanding and generalizing?

    And what is the indicator for a machine understanding something?

  • NikkiA 2 years ago

    I've always taken 'grok' to be in the same sense as 'to be one with'

    • gorjusborg 2 years ago

      Yeah, there is definitely irony in my trying to push my own definition of an extra-terrestrial word while complaining that someone is ruining it.

      If anyone wants to come up with their own definition, read Robert Heinlein's 'Stranger in a Strange Land'. There is no definition in there, but you build an intuition of the meaning by its use.

      One of the issues I have w/ the use in AI is that using the word 'grok' suggests that the machine understands (that's a common interpretation of the word grok, that it is an understanding greater than normal understanding).

      By using an alien word, we are both suggesting something that probably isn't technically true, while simultaneously giving ourselves a slimy out. If you are going to suggest that AI understands, just have the courage to say it in plain English, and be ready for argument.

      Redefining a word that already exists to make the argument technical feels dishonest.

      • snewman 2 years ago

        Actually the definition of 'grok' is discussed in the book; you can find some relevant snippets at https://en.m.wikipedia.org/wiki/Grok. My recollection is that the book says the original / literal meaning is "drink", but this isn't supported by the Wikipedia quotes and perhaps I am misremembering, it has been a long time.

        • NikkiA 2 years ago

          The book also points out that it is much more than just 'drink', and 'drink' would by no means cover 99% of the way it is used in the book.

          That said, I've only ever read the full unabridged re-issue from the mid-90s, it's possible the earlier, edited, releases had many of the uses elided.

  • dogcomplex 2 years ago

    Same thing. To grok is to fully incorporate the new into your intuitive view of the world - changing your view of both in the process. An AI is training its model with the new data, incorporating it into its existing world view in a way that may subtly change every variable it knows. A human is doing the same. We integrate it deeper the more we can connect it to existing metaphor and understanding - and it becomes one less thing we need to "remember" precisely, because we can then recreate it from "base principles" once we fully understand it. We've grokked it.

  • thuuuomas 2 years ago

    “Grok” is more about in-group signaling, like “LaTeX credibility” or publishing blog posts on arXiv.

  • mr_toad 2 years ago

    In programming circles ‘grok’ has long been used to describe that moment when you finally understand the piece of code you’ve been staring at all day.

    So the AI folks are just borrowing something that had already been co-opted 30+ years ago.

  • 93po 2 years ago

    I have heard grok used tremendously more frequently in the past year or two and I find it annoying because they're using it as a replacement for the word "understand" for reasons I don't "grok"

  • momirlan 2 years ago

    grok, implying a mystical union, is not applicable to AI

jimwhite42 2 years ago

I'm not sure if I'm remembering it right, but I think it was on a Raphaël Millière interview on Mindscape, where Raphaël said something along the lines of when there are many dimensions in a machine learning model, the distinction between interpolation and extrapolation is not clear like it is in our usual areas of reasoning. I can't work out if this could be something similar to what the article is talking about.

_ache_ 2 years ago

Does anyone know how those charts are created? I bet they're half generated by some sort of library and then manually improved, but the generated animated SVGs are beautiful.

ComputerGuru 2 years ago

PSA: if you’re interested in the details of this topic, it’s probably best to view TFA on a computer as there is data in the visualizations that you can’t explore on mobile.

SimplyUnknown 2 years ago

First of all, great blog post with great examples. Reminds me of what distill.pub used to be.

Second, the article correctly states that typically L2 weight decay is used, leading to a lot of weights with small magnitudes. To get models that generalize better, would it then be better to always use L1 weight decay, which promotes sparsity, in combination with longer training?
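
(A minimal sketch of that L1-vs-L2 question, not something from the article; the architecture, optimizer, and penalty strength below are invented for illustration. L2-style decay shrinks weights toward zero but rarely makes them exactly zero, while an explicit L1 term pushes many of them to zero outright.)

    # Hypothetical PyTorch sketch: decoupled (L2-style) weight decay via AdamW
    # versus an explicit L1 penalty added to the loss to promote sparsity.
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

    l1_lambda = 1e-4  # strength of the sparsity-promoting term (made up)

    def loss_fn(logits, targets):
        ce = nn.functional.cross_entropy(logits, targets)
        l1 = sum(p.abs().sum() for p in model.parameters())
        return ce + l1_lambda * l1  # the L1 term drives many weights to exactly zero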

I wonder whether deep learning models that only use sparse Fourier features rather than dense linear layers would work better...

  • medium_spicy 2 years ago

    Short answer: if the inputs can be represented well on the Fourier basis, yes. I have a patent in process on this, fingers crossed.

    Longer answer: deep learning models are usually trying to find the best nonlinear basis in which to represent inputs; if the inputs are well-represented (read that as: can be sparsely represented) in some basis known a priori, it usually helps to just put them in that basis, e.g., by FFT’ing RF signals.

    The challenge is that the overall-optimal basis might not be the same as the one at any given local minimum, so you’ve got to do some tricks to nudge the network closer.
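
    (A toy numpy sketch of that last point; the signal and the threshold are made up. A waveform that looks dense sample-by-sample becomes sparse after a change to the Fourier basis, so feeding the transformed input to a model skips the work of learning the transform.)

        import numpy as np

        t = np.linspace(0, 1, 256, endpoint=False)
        # Dense in the time domain, but only two active frequencies:
        x = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)

        X = np.fft.rfft(x)  # change of basis
        frac = np.mean(np.abs(X) > 1e-6 * np.abs(X).max())
        print(f"non-negligible Fourier coefficients: {frac:.1%} of {X.size}")
        # ~1.6% -- a model could be fed np.abs(X) (or real/imag parts)
        # directly instead of the raw samples.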

  • qumpis 2 years ago

    Slightly related, but the sparsity-inducing activation function ReLU is often used in neural networks.

taeric 2 years ago

I'm curious how representative the target function is? I get that it is common for you to want a model to learn the important pieces of an input, but a string of bits, and only caring about the first three, feels particularly contrived. Literally a truth table on relevant parameters of size 8? And trained with 4.8 million samples? Or am I misunderstanding something there? (I fully expect I'm misunderstanding something.)

  • jaggirs 2 years ago

    I have observed this pattern before in computer vision tasks (train accuracy flatlining for a while before test acc starts to go up). The point of the simple tasks is to be able to interpret what could be going on behind the scenes when this happens.

    • taeric 2 years ago

      No doubt. But I have also seen what people thought were generalized models failing on outlier, but valid, data. Quite often.

      Put another way, it isn't just that the task seems simple in the number of terms that matter; isn't it also a rather dense function?

      Probably the better question to ask is how sensitive models that are looking at less dense (or more dense) functions are to this. I'm not trying to disavow the ideas.

      • visarga 2 years ago

        Maybe humans are also failing a lot in out of distribution settings. It might be inherent.

        • taeric 2 years ago

          We have names for that. :D. Stereotypes being a large one. Racism being motivated interpretation on the same ideas. Right?

          • lucubratory 2 years ago

            Yes, although there are less political examples. PTSD, the difficulty of learning higher dimensional mathematics in a way you can genuinely understand, substance abuse, mass killings.

superkuh 2 years ago

There were no auto-discovery RSS/Atom feeds in the HTML, no links to the RSS feed anywhere, but by guessing at possible feed names and locations I was able to find the "Explorables" RSS feed at: https://pair.withgoogle.com/explorables/rss.xml

lachlan_gray 2 years ago

It looks like grid cells!

https://en.wikipedia.org/wiki/Grid_cell

If you plot a heat map of a neuron in the hidden layer on a 2D chart where one axis is $a$ and the other is $b$, I think you might get a triangular lattice. If it's doing what I think it is, then looking at another hidden neuron would give a different lattice with another orientation + scale.
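
(A toy sketch of how you'd make that plot; a hand-written periodic function stands in for a real trained neuron, which isn't available here, so the lattice shape itself isn't being demonstrated, only the plotting recipe.)

    import numpy as np
    import matplotlib.pyplot as plt

    P = 67  # modulus used in the article's task
    a, b = np.meshgrid(np.arange(P), np.arange(P), indexing="ij")
    k = 5   # a frequency such a neuron might latch onto (made up)
    act = np.cos(2 * np.pi * k * (a + b) / P)  # stand-in for a hidden activation

    plt.imshow(act, origin="lower", extent=[0, P, 0, P])
    plt.xlabel("b"); plt.ylabel("a"); plt.title("hidden-neuron activation (toy)")
    plt.show()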

Also you could make a base 67 adding machine by chaining these together.

I also can't help the gut feeling that the relationship between W_in-proj's neurons compared to the relationship between W_out-proj's neurons looks like the same mapping as the one between the semitone circle and the circle of fifths

https://upload.wikimedia.org/wikipedia/commons/thumb/6/6f/Pi...

flyer_go 2 years ago

I don't think I have seen an answer here that actually challenges this question - from my experience, I have yet to see a neural network actually learn representations outside the range in which it was trained. Some papers have tried to use things like sinusoidal activation functions that can force a neural network to fit a repeating function, but on its own I would call it pure coincidence.

On generalization - it's still memorization. I think there has been some proof that ChatGPT does 'try' to perform some higher level thinking but still has problems due to the dictionary-type lookup table it uses. The higher level thinking or AGI that people are excited about is a form of generalization so impressive that we don't really think of it as memorization. But I actually question whether the original thought we want to credit ourselves with is really separate from what we are currently seeing.

  • smaddox 2 years ago

    > I have yet to see a neural network actually learn representations outside the range in which it was trained

    Generalization doesn't require learning representations outside of the training set. It requires learning reusable representations that compose in ways that enable solving unseen problems.

    > On generalization - its still memorization

    Not sure what you mean by this. The statement sounds self-contradictory to me. Generalization requires abstraction / compression. Not sure if that's what you mean by memorization.

    Overparameterized models are able to generalize (and tend to, when trained appropriately) because there are far more parameterizations that minimize loss by compressing knowledge than there are parameterizations that minimize loss without compression.

    This is fairly easy to see. Imagine a dataset and model such that the model has barely enough capacity to learn the dataset without compression. The only degrees of freedom would be through changes in basis. In contrast, if the model uses compression, that would increase the degrees of freedom. The more compression, the more degrees of freedom, and the more parameterizations that would minimize the loss.

    If stochastic gradient descent is roughly as likely to find any given compressed minimum as any given uncompressed minimum, then the fact that there are exponentially many more compressed minima than uncompressed ones means it will tend to find a compressed minimum.

    Of course this is only a probabilistic argument, and doesn't guarantee compression / generalization. And in fact we know that there are ways to train a model such that it will not generalize, such as training for many epochs on a small dataset without augmentation.

  • jhaenchen 2 years ago

    The issue is that we are prone to inflate the complexity of our own processing logic. Ultimately we are pattern recognition machines in combination with abstract representation. This allows us to connect the dots between events in the world and apply principles in one domain to another.

    But, like all complexity, it is reducible to component parts.

    (In fact, we know this because we evolved to have this ability. )

    • agalunar 2 years ago

      Calling us "pattern recognition machines capable of abstract representation" I think is correct, but is a (rather) broad description of what we can do and not really a comment on how our minds work. Sure, from personal observation, it seems like we sometimes overcomplicate self-analysis ("I'm feeling bad – why? oh, there are these other things that happened and related problems I have and maybe they're all manifestations of one or two deeper problems, &c" when in reality I'm just tired or hungry), but that seems like evidence we're both simpler than we think and also more complex than you'd expect (so much mental machinery for such straightforward problems!).

      I read Language in Our Brain [1] recently and I was amazed by what we've learned about the neurological basis of language, but I was even more astounded at how profoundly little we know.

      > But, like all complexity, it is reduceable to component parts.

      This is just false, no? Sometimes horrendously complicated systems are made of simple parts that interact in ways that are intractable to predict or that defy reduction.

      [1] https://mitpress.mit.edu/9780262036924/language-in-our-brain

mjburgess 2 years ago

Statistical learning can typically be phrased in terms of k nearest neighbours

In the case of NNs we have a "modal knn" (memorising) going to a "mean knn" ('generalising') under the right sort of training.

I'd call both of these memorising, but the latter is a kind of weighted recall.

Generalisation as a property of statistical models (ie., models of conditional freqs) is not the same property as generalisation in the case of scientific models.

In the latter a scientific model is general because it models causally necessary effects from causes -- so, necessarily if X then Y.

Whereas generalisation in associative stats is just about whether you're drawing data from the empirical freq. distribution or whether you've modelled first. In all automated stats the only diff between the "model" and "the data" is some sort of weighted averaging operation.

So in automated stats (ie., ML,AI) it's really just whether the model uses a mean.
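
(One way to read that distinction, as a toy numpy sketch with made-up data rather than the parent's formalism: verbatim recall of the single closest stored example versus a distance-weighted average over several of them.)

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(50, 1))           # stored training inputs
    y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=50)

    def predict(x, k=5, weighted=True):
        d = np.abs(X[:, 0] - x)
        idx = np.argsort(d)[:k]
        if not weighted:                           # verbatim recall: nearest lookup
            return y[idx[0]]
        w = 1.0 / (d[idx] + 1e-9)                  # weighted recall: smoothed average
        return np.sum(w * y[idx]) / np.sum(w)

    x0 = 0.37                                      # a query not in the stored data
    print(predict(x0, weighted=False), predict(x0))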

  • autokad 2 years ago

    I disagree; it feels like you are just fussing over words and not what's happening in the real world. If you were right, a human wouldn't learn anything either, they'd just memorize.

    You can look at it by results: I give these models inputs they've never seen before, and they give me outputs that are correct / acceptable.

    You can look at it in terms of data: we took petabytes of data, and with an 8 GB model (Stable Diffusion) we can output an image of anything. That's an unheard-of level of compression, only possible if it's generalizing - not memorizing.

  • ActivePattern 2 years ago

    I'd be curious how much of the link you read.

    What they demonstrate is a neural network learning an algorithm that approximates modular addition. The exact workings of this algorithm are explained in the footnotes. The learned algorithm is general -- it is just as valid on unseen inputs as seen inputs.

    There's no memorization going on in this case. It's actually approximating the process used to generate the data, which just isn't possible using k nearest neighbors.

  • visarga 2 years ago

    > Statistical learning can typically be phrased in terms of k nearest neighbours

    We have suspected that neural nets are a kind of kNN. Here's a paper:

    Every Model Learned by Gradient Descent Is Approximately a Kernel Machine

    https://arxiv.org/abs/2012.00152

  • bippihippi1 2 years ago

    It's been proven that all models learned by gradient descent are approximately equivalent to kernel machines. Interpolation isn't generalization: if there's a new input sufficiently different from the training data, the behaviour is unknown.

    • drdeca 2 years ago

      Can you say what that says about the behavior described with the modular arithmetic in the article?

      And, in particular, how to interpret the fact that different hyperparameters determined whether runs, obtaining equally high accuracy on the training data, got good or bad scores on the test data, in terms of the "view it as a kernel machine/interpolation" lens?

      My understanding is that the behavior in at least one of those "models learned by gradient descent are equivalent to [some other model]" papers, works by constructing something which is based on the entire training history of the network. Is that the kernel machines one, or some other one?

      • bippihippi1 2 years ago

        If you train a model on modular arithmetic, it can only learn what's in the training data. If all of the examples are of the form a + b mod 10, it isn't likely to generalize to be able to solve a + b mod 12. A human can learn the rule and figure it out; a model can't. That's why a diverse training set is so important. It's possible to train a model to approximate any function, but whether the approximation is accurate outside of the datapoints you trained on is not reliable, as far as I understand.

        Different hyperparameters can give a model that is over- or underfit, but this helps the model interpolate, not generalize. It can know all the answers similar to the training data, not answers different from it.

    • xapata 2 years ago

      One weird trick ...

      There's some fox and hedgehog analogy I've never understood.

    • visarga 2 years ago

      but when the model trains on 13T tokens it is hard to be OOD

esafak 2 years ago

I haven't read the latest literature but my understanding is that "grokking" is the phase transition that occurs during the coalescing of islands of understanding (increasingly abstract features) that eventually form a pathway to generalization. And that this is something associated with over-parameterized models, which have the potential to learn multiple paths (explanations).

https://en.wikipedia.org/wiki/Percolation_theory

A relevant, recent paper I found from a quick search: The semantic landscape paradigm for neural networks (https://arxiv.org/abs/2307.09550)

ajuc 2 years ago

I was trying to make an AI for my 2d sidescrolling game with asteroid-like steering learn from recorded player input + surroundings.

It generalized splendidly - its conclusion was that you always need to press "forward" and do nothing else, no matter what happens :)

huijzer 2 years ago

A bit of both, but it does certainly generalize. Just look into the sentiment neuron from OpenAI in 2017, or come up with a unique question for ChatGPT.

davidguetta 2 years ago

hierarchize would be a better term than generalize

  • version_five 2 years ago

    Anything would be better than "grokking".

    From what I gather they're talking about double descent, which afaik is the consequence of overparameterization leading to a smooth interpolation between the training data points, as opposed to what happens in traditional overfitting. Imagine a polynomial fit with the same degree as the number of data points (swinging up and down wildly away from the data) compared with a much higher degree fit that could smoothly interpolate between the points while still landing right on them.

    None of this is what I would call generalization, it's good interpolation, which is what deep learning does in a very high dimensional space. It's notoriously awful at extrapolating, ie generalizing to anything without support in the training data.
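
    (A minimal sketch of that picture, using minimum-norm regression on random ReLU features as a stand-in for the polynomial example; the data, noise level, and feature counts are all invented. Near the interpolation threshold the fit typically swings wildly, while heavy over-parameterization tends to interpolate more smoothly.)

        import numpy as np

        rng = np.random.default_rng(0)
        f = lambda x: np.sin(2 * np.pi * x)
        x_tr = np.sort(rng.uniform(-1, 1, 20))
        y_tr = f(x_tr) + 0.1 * rng.normal(size=20)
        x_te = np.linspace(-1, 1, 200)

        def relu_feats(x, W, b):
            return np.maximum(0.0, np.outer(x, W) + b)

        for k in (10, 20, 1000):  # under-parameterized, interpolation threshold, over-parameterized
            W, b = rng.normal(size=k), rng.normal(size=k)
            coef = np.linalg.pinv(relu_feats(x_tr, W, b)) @ y_tr  # minimum-norm least squares
            mse = np.mean((relu_feats(x_te, W, b) @ coef - f(x_te)) ** 2)
            print(f"{k:5d} features: test MSE {mse:.3f}")
        # Typically: moderate error, a spike near k = n_train, then low error again
        # (double descent) -- though the exact numbers depend on the random seed.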

    • Jack000 2 years ago

      double descent is a different phenomenon from grokking

      • tysam_and 2 years ago

        Nope, they are the same, just that grokking is when the KL between the representable information of the implicit biases and the data is extremely high (i.e. the network is poorly-designed or oriented for the task).

        It's an informal term that not everyone accepts. Double-descent is acceptable as it describes a general phenomenon that is a natural consequence of a phase transition during neural network training. Grokking is like, to me, the 'fetch' of neural network terms. It's not new, it adds a seeming layer of separation from double-descent (which it is -- just very delayed), and it's not really accepted by everyone.

        I personally do not like it at all. Especially because language affects _our_ implicit biases about what neural networks can and cannot do. We've already seen that their capacities and performance can be pushed way beyond what we traditionally expect of them.

        But to summarize, they are the same. And this is why we need good terminology, as well, because poor adoption and boosting of improper terminology induces excess regret in the information exchange surface between agents in a game-theoretic sense in this lovely landscape of the ML world.

    • visarga 2 years ago

      > It's notoriously awful at extrapolating, ie generalizing to anything without support in the training data.

      Scientists are also pretty lousy at making new discoveries without labs. They just need training data.

  • 3cats-in-a-coat 2 years ago

    Generalize is seeing common principles, patterns, between disparate instances of a phenomenon. It's a proper word for this.

    • Chabsff 2 years ago

      That's a common mechanism to achieve generalization, but the term is a little more general (heh) than that. It specifically refers to correctly handling data that lives outside the distribution presented by the training data.

      It's a description of a behavior, not a mechanism. Which may or may not be appropriate depending on whether you are talking about *what* the model does or *how* it achieves it.

      • 3cats-in-a-coat 2 years ago

        Kinda fuzzy what's "in the distribution", because it depends on how deeply the model interprets it. If it understands examples outside the distribution... that kinda puts them in the distribution.

        General understanding makes the information in the distribution very wide. Shallow understanding makes it very narrow. Like say recognizing only specific combinations of pixels verbatim.

        • Chabsff 2 years ago

          I think you are misinterpreting. The distribution present in the training set in isolation (the one I'm referring to, and is not fuzzy in the slightest) is not the same thing as the distribution understood by the trained model (the one you are referring to, and is definitely more conceptual and hard to characterize in non-trivial cases).

          "Generalization" is simply the theoretical measure of how much the later extends beyond the former, regardless of how that's achieved.

          • 3cats-in-a-coat 2 years ago

            I'm saying how you determine the distribution in the training set depends on what the model understands and what the people who selected the dataset understand.

            There's no distribution of meaning in the training set that's independent of interpretation and understanding. Aside from maybe the literal series of bits (and words and pixels) in it, as encoded.

            In statistics that is not as severe a problem because you can plot how the data distribution lies in a commonly agreed upon position in one or more clearly defined and agreed upon dimensions. And you can look at the chart and talk about this shared interpretation objectively, and its distribution.

            Although, as a matter of fact, just as often it matters what questions you asked, and how and when and whom you asked, for the distribution of answers you got. Lying with statistics is easy, as it's full of hidden variables. This is why statistics is great when the data is simple and the analysis is simple, mathematical, objective, but social studies tend to yield whatever you want them to yield.

            So. What dimensions are we talking about with a self-evolved model? You have some understanding of what the data is, subjective to you. Maybe your team has some shared understanding of what the data covers, you have overlap. But the model has its own understanding, evolved independently. How much does it overlap with you? Not as much as you think.

            It's a problem decades old, that people give to the model data that contains things they didn't realize it contains. They themselves didn't see that. And then get surprised by the results.

            Say when an apple falls on your head, did you realize this contains the data required to describe classical mechanics? For centuries, billions of people didn't realize. To Newton it was there as clear as daylight, in the apple's fall. I know the example is a myth, but the principle stands.

            Another example: a video of the changing light patterns reflected on the floor around the corner of a room where a person, out of frame, is typing on a computer. What does this data contain? You think nothing much. Maybe it contains how a floor looks. To a model, it can easily also contain what the person who is not in frame wrote on their keyboard.

            So given all this... what IS in the distribution? Depends with whose eyes you're looking. Your eyes are not the most objective eyes, nor the most intelligent eyes. You have no anchor to point to as the ultimate arbiter of what complex data contains or does not.

    • davidguetta 2 years ago

      Generalize has a tendency to imply you can extrapolate. And in most cases it's actually the opposite that happens: neural nets tend to COMPRESS the data (which in turn is a good thing in many cases because the data is noisy).

      • 3cats-in-a-coat 2 years ago

        The point of compression is to decompress after. That's what happens during inference, and when the extrapolation occurs.

        Let's say I tell GPT "write 8 times foobar". Will it? Well then it understands me and can extrapolate from the request to the proper response, without having specifically "write 8 times foobar" in its model.

        Most modern compression algorithms focus on predicting the next token (byte, term, etc.), believe it or not. The more accurately they predict the next token, the less information you need to store to correct the mispredictions.
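
        (A toy illustration of that claim; the string and the bigram predictor are made up, and estimating the predictor from the text itself is cheating, but an arithmetic coder can get within a couple of bits of a model's total -log2 probability.)

            import math
            from collections import Counter

            text = "abababababababababx"
            alphabet = sorted(set(text))
            uniform_bits = len(text) * math.log2(len(alphabet))   # no prediction at all

            # A bigram predictor: P(next | prev) estimated from the text itself.
            bigrams = Counter(zip(text, text[1:]))
            prev_counts = Counter(text[:-1])

            bits = math.log2(len(alphabet))                        # first symbol, no context
            for prev, nxt in zip(text, text[1:]):
                p = bigrams[(prev, nxt)] / prev_counts[prev]
                bits += -math.log2(p)

            print(f"uniform: {uniform_bits:.1f} bits, predictive: {bits:.1f} bits")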

  • ot 2 years ago

    "hierarchize" only describes your own mental model of how knowledge organization and reasoning may work in the model, not the actual phenomenon being observed here.

    "generalize" means going from specific examples to general cases not seen before, which is a perfectly good description of the phenomenon. Why try to invent a new word?

    • davidguetta 2 years ago

      > hierarchize" only describes your own mental model of how knowledge organization and reasoning may work in the model, not the actual phenomenon being observed here

      That's not true: if you look at a deep CNN, the lower layers show lines and the higher layers show complex stuff like eyes or football players, etc. Hierarchisation of information actually emerges naturally in NNs.

      Generalization often implies extrapolation on new data, which is just not the case most of the time with NNs, and that's why I didn't like the word.

westurner 2 years ago

If you omit the training data points where the baseball hits the ground, what will a machine learning model predict?

You can train a classical ML model on the known orbits of the planets in the past, but it can presumably never predict orbits given unseen n-body gravity events like another dense mass moving through the solar system because of classical insufficiency to model quantum problems, for example.

Church-Turing-Deutsch doesn't say there could not exist a Classical / Quantum correspondence; but a classical model on a classical computer cannot be sufficient for quantum-hard problems. (e.g. Quantum Discord says that there are entanglement and non-entanglement nonlocal relations in the data.)

Regardless of whether they sufficiently generalize, [LLMs, ML Models, and AutoMLs] don't yet Critically Think and it's dangerous to take action without critical thought.

Critical Thinking; Logic, Rationality: https://en.wikipedia.org/wiki/Critical_thinking#Logic_and_ra...

tehjoker 2 years ago

Well, they memorize points and lines (or tanhs) between different parts of the space, right? So it depends on whether a useful generalization can be extracted from the line estimation and how dense the points on the landscape are, no?

djha-skin 2 years ago

How is this even a shock.

Anyone who has so much as taken a class on this knows that even the simplest of perceptron networks, decision trees, or any form of machine learning model generalizes. That's why we use them. If they don't, it's called overfitting[1], where the model is so accurate on the training data that its inferential ability on new data suffers.

I know that the article might be talking about a higher form of generalization with LLMs or whatever, but I don't see why the same principle of "don't overfit the data" wouldn't apply to that situation.

No, really: what part of their base argument is novel?

1: https://en.wikipedia.org/wiki/Overfitting

  • halflings 2 years ago

    The interesting part is the sudden generalization.

    Simple models predicting simple things will generally slowly overfit, and regularization keeps that overfitting in check.

    This "grokking" phenomenon is when a model first starts by aggressively overfitting, then gradually prunes unnecessary weights until it suddenly converges on the one generalizable combination of weights (as it's the only one that both solves the training data and minimizes weights).

    Why is this interesting? Because you could argue that this justifies using overparametrized models with high levels of regularization; e.g. models that will tend to aggressively overfit, but over time might converge to a better solution by gradual pruning of weights. The traditional approach is not to do this, but rather to use a simpler model (which would initially generalize better, but due to its simplicity might not be able to learn the underlying mechanism and reach higher accuracy).
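
    (A minimal sketch of that setup: a one-hidden-layer MLP on modular addition, with all hyperparameters guessed rather than taken from the article; whether and when the test accuracy jumps depends heavily on the weight decay and the train/test split.)

        import torch
        import torch.nn as nn

        P = 67
        pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))   # all (a, b)
        labels = (pairs[:, 0] + pairs[:, 1]) % P
        perm = torch.randperm(len(pairs))
        train_idx, test_idx = perm[: len(pairs) // 2], perm[len(pairs) // 2:]

        def one_hot(ab):
            return torch.cat([nn.functional.one_hot(ab[:, 0], P),
                              nn.functional.one_hot(ab[:, 1], P)], dim=1).float()

        model = nn.Sequential(nn.Linear(2 * P, 128), nn.ReLU(), nn.Linear(128, P))
        # Strong decoupled weight decay, in the spirit of the "pruning" described above:
        opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

        for step in range(50_000):
            loss = nn.functional.cross_entropy(model(one_hot(pairs[train_idx])),
                                               labels[train_idx])
            opt.zero_grad(); loss.backward(); opt.step()
            if step % 1000 == 0:
                with torch.no_grad():
                    acc = (model(one_hot(pairs[test_idx])).argmax(1)
                           == labels[test_idx]).float().mean()
                print(step, float(loss), float(acc))  # watch for a late jump in test acc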

    • timy2shoes 2 years ago

      It's interesting that the researchers chose example problems where the minimum norm solution is the best at generalization. What if that's not the case?

      • mashygpig 2 years ago

        Yea, this is what’s really going on here and feels like it’s been shrouded in language to make it seem more grandiose. That being said, I would believe generalization to occur from minimum norm solutions in some sense, but whether that corresponds to minimum norm weights or not is a different question, and one you probably won’t know a priori (not to mention even knowing which norm to choose).

  • rosenjcb 2 years ago

    There's so many idiots in the AI space that are completely ignorant of how Machine Learning works. The worst are the grifters that fearmonger about AI safety by regurgitating singularity memes.

  • godelski 2 years ago

    It's because you over-generalized your simple understanding. There is a lot more nuance to that thing you are calling overfitting (and underfitting). We do not know why it happens or when it happens, in all cases. We do know cases where it does happen and why it happens, but that doesn't mean we know them all. There is still a lot of interpretation needed. How much was overfit? How much underfit? Can these happen at the same time? (yes) What layers do this, what causes this, and how can we avoid it? Reading the article shows you that this is far from a trivial task. This is all before we even introduce the concept of sudden generalization. Once we do that, all these things start again, but now under a completely different context that is even more surprising. We also need to talk about new aspects like the rate of generalization and the rate of memorization, and what affects these.

    tldr: don't oversimplify things: you underfit

    P.S. please don't fucking review. Your complaints aren't critiques.

MagicMoonlight 2 years ago

Memorise because there is no decision component. It attempts to just brute force a pattern rather than thinking through the information and making a conclusion.

blueyes 2 years ago

If your data set is too small, they memorize. If you train them well on a large dataset, they learn to generalize.

wwarner 2 years ago

This is such a good explainer

lsh123 2 years ago

Current ML models neither memorize nor generalize, but instead approximate.

tipsytoad 2 years ago

Seriously, are they only talking about weight decay? Why so complicated?

agumonkey 2 years ago

They ponderize.

lewhoo 2 years ago

So, the TLDR could be: they memorize at first and then generalize?

  • drdeca 2 years ago

    depends on the hyperparameters, and the architecture (and probably the task)

aappleby 2 years ago

They digest.

xaellison 2 years ago

what's the TLDR: memorize, or generalize?
