Neural Turing Machines

arxiv.org

60 points by willwill100 11 years ago · 7 comments


teraflop 11 years ago

Previous discussion: https://news.ycombinator.com/item?id=8487807

iandanforth 11 years ago

I'm going to be stupid in public in the hope that someone will correct me.

1. I'm not clear on the point of this paper.

There are a lot of buzzwords and an extremely diverse set of references. The heart of the paper seems to be a comparison between Long Short-Term Memory (LSTM) recurrent nets and their NTM nets. But they don't expose the network to very long sequences, or sequences broken by arbitrarily long delays, which are what LSTM nets are particularly good at. They seem to make the jump from "LSTM nets are theoretically Turing complete" to "LSTM nets are a good benchmark for any computational task."

2. The number of training examples seems huge

For many of the tasks they trained over hundreds of thousands of sequences. This seems like very, very slow learning. If I'm meant to interpret these results as a network learning a computational rule (copying, sorting, etc.), is it really that impressive if it takes 200k examples before it gets it right? (Not sarcasm, I really don't know.)

  • dave_sullivan 11 years ago

    Re: point of the paper, I think it's addressing a current need within representation learning research where there's this question of "Ok, we can teach really large neural networks stuff, but how do we compress that knowledge efficiently?" How can we learn more compact/efficient/reliable/discrete representations? I've only just finished reading it through and this seems to me to be a promising direction and one I'd like to see more research on.

    Re: number of training examples, I'm taking the chart on pg. 11 to show the number of training examples seen. Based on that, it looks like the NTM is learning a lot faster than the LSTM. As far as I can tell, it's getting near 0 loss about 20,000 examples in? Whether learning w/ 20k examples is impressive depends on the domain; personally I think it's comparatively impressive.

    Re: cherry picking of tasks to highlight perceived strengths of NTM, fair enough. Although this is one I'll be playing around with a bit to find out where that starts and stops...

    Any thoughts on how this compares to the approach of HTMs?

  • botman 11 years ago

    I think your criticisms are mostly misplaced.

    - Re: "buzzwords...references": I don't see any buzzwords; in fact the word "deep" doesn't even appear in the text. Regarding references, a typical conference paper cites a bunch of related papers written by people who might be reviewing it. This paper, on the other hand, cites some seminal work from other fields, which is more interesting and enriching for most readers.

    - Re: point of the paper: how to design a learning machine that can access a large-capacity long-term memory store and still be optimized by gradient descent, i.e., everything is differentiable. (A rough sketch of the soft-read idea follows after the last point.)

    - Re: number of training examples is huge. Training neural networks often takes a huge number of iterations, and the problems considered in the paper are numerically challenging, so the iteration count is not surprising. Also, just like the regular Turing machine, the "neural Turing machine" isn't the most efficient architecture, but it's conceptually the simplest one that has the desired properties.
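
    To make the "everything is differentiable" part concrete, here's a rough numpy sketch of a content-based soft read (my own simplification, not the paper's exact equations: it skips the interpolation, shift, and sharpening steps, and the sizes are made up). The controller emits a key, the address weights are a softmax over cosine similarities, and the read is a weighted sum over all memory slots, so gradients flow through the memory access:

      import numpy as np

      def content_read(memory, key, beta=1.0):
          # memory: (N, M) array of N slots of width M; key: (M,) query from the controller.
          # Cosine similarity between the key and every memory slot.
          sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
          # Soft addressing: a softmax over similarities, sharpened by beta.
          w = np.exp(beta * (sims - sims.max()))
          w /= w.sum()
          # The read vector is a weighted sum over every slot, so it is
          # differentiable w.r.t. the memory, the key, and beta.
          return w @ memory, w

      mem = np.random.randn(128, 20)              # 128 slots of width 20 (arbitrary sizes)
      read, weights = content_read(mem, mem[3], beta=5.0)

    Writes in the paper work the same way: erase and add vectors applied with soft weights, which is what lets gradient descent train the addressing.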

macrael 11 years ago

Does a "typical" neural network not have any storage to speak of? When I've seen examples of neural networks working, it's seemed like they work in cycles in some way, with the states of each "neuron" affecting the state of others. Is that not potentially storage?

  • robert_tweed 11 years ago

    Recurrent neural networks work that way, but typically when anyone says "neural network" they mean a feed-forward network, which is as dumb as a lead pipe. It has no memory other than the trained synapse weights. After training, all it does is transform some input data into output data. It's a pure function. It's also a lot like a matrix transform.
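
    To illustrate the "pure function" point, here's a toy numpy sketch (layer sizes made up, random weights standing in for trained ones). After training, the whole network is just fixed matrices applied to the input:

      import numpy as np

      rng = np.random.default_rng(0)
      W1, b1 = rng.standard_normal((16, 8)), np.zeros(16)   # stand-ins for trained weights
      W2, b2 = rng.standard_normal((4, 16)), np.zeros(4)

      def feedforward(x):
          # A pure function of its input: the same x always gives the same output,
          # and nothing is remembered between calls.
          h = np.tanh(W1 @ x + b1)    # each layer is a matrix transform plus a nonlinearity
          return W2 @ h + b2

      y = feedforward(rng.standard_normal(8))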

    I haven't read this paper fully yet, but it seems to be an attempt at simplifying RNNs by replacing some of the magic internal state, which tends to make them hard to reason about, with a more direct memory architecture.

  • albertzeyer 11 years ago

    Yes, but it's finite storage. (You could argue that with infinite float precision you could get infinite storage. Some old proofs that RNNs are as powerful as a Turing machine work that way. Of course, in practice it's very much limited.)

    Also, it's somewhat hard to train a network so that it can remember much information. (Not sure how much has been done to measure this, but from my gut feeling I would say an RNN layer with about 500 neurons can be trained with the standard methods to effectively store maybe 10 bits of information.) The problem is that it can easily become very unstable.

    The LSTM cell is already somewhat better in this regard (a bare-bones cell-update sketch is at the end of this comment). But of course this is still finite memory, and you cannot have much more than 500-1000 LSTM cells in one layer, because training becomes too computationally expensive. (You could introduce some bottleneck projection layer, as Google did recently, and maybe get it up to 2000 LSTM cells.) Maybe count one LSTM cell as one bit of information (again, this is very much pulled out of thin air).

    This is much less memory than what your PC has (which of course is also finite, but it's so large that you can treat it as effectively infinite, i.e., as powerful as a Turing machine).
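
    For what it's worth, here is a bare-bones numpy sketch of one step of a standard LSTM cell (gate layout and sizes are my own arbitrary choices). The point is that all the memory carried forward is the fixed-size pair (h, c), so capacity is bounded by the number of cells:

      import numpy as np

      def sigmoid(x):
          return 1.0 / (1.0 + np.exp(-x))

      def lstm_step(x, h, c, W, b):
          # One step of a standard LSTM cell. The only state carried to the
          # next step is (h, c), so the memory capacity is fixed in advance.
          z = W @ np.concatenate([x, h]) + b
          i, f, o, g = np.split(z, 4)          # input, forget, output gates and candidate
          c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
          h_new = sigmoid(o) * np.tanh(c_new)
          return h_new, c_new

      n_in, n_cells = 10, 500
      W = 0.1 * np.random.randn(4 * n_cells, n_in + n_cells)
      b = np.zeros(4 * n_cells)
      h = c = np.zeros(n_cells)
      h, c = lstm_step(np.random.randn(n_in), h, c, W, b)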
