How Google Translate squeezes deep learning onto a phone

googleresearch.blogspot.com

403 points by xwintermutex 10 years ago · 98 comments

liabru 10 years ago

This is great. I particularly like that they also automatically generated dirty versions for their training set, because that's exactly what I ended up doing for my dissertation project (a computer vision system [1] that automatically referees Scrabble boards). I also used dictionary analysis and the classifier's own confusion matrix to boost its accuracy.

If you're also interested in real time OCR like this, I did a write up [2] of the approach that worked well for my project. It only needed to recognize Scrabble fonts, but it could be extended to more fonts by using more training examples.

[1] http://brm.io/kwyjibo/

[2] http://brm.io/real-time-ocr/
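
The dictionary/confusion-matrix trick is easy to sketch (illustrative only, not the code from [2]; the confusion probabilities and word list below are made up): re-score candidate words by combining the classifier's per-character confusions with membership in a word list.

  import math

  # confusion[predicted][actual]: roughly, how often `actual` is misread as `predicted`.
  confusion = {
      "O": {"O": 0.90, "Q": 0.06, "D": 0.04},
      "I": {"I": 0.85, "L": 0.10, "T": 0.05},
      "E": {"E": 0.95, "F": 0.05},
  }

  dictionary = {"OIL", "DIE", "OLE", "ODE"}

  def word_score(predicted, candidate):
      """Log-likelihood that `candidate` was misread as `predicted`."""
      if len(candidate) != len(predicted):
          return float("-inf")
      return sum(math.log(confusion.get(p, {}).get(c, 1e-6))
                 for p, c in zip(predicted, candidate))

  def correct(predicted):
      # Prefer the dictionary word the confusion model finds most plausible.
      return max(dictionary, key=lambda w: word_score(predicted, w))

  print(correct("OIE"))   # -> "OLE": the word list overrides the raw character read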

motoboi 10 years ago

I am 15 years into this computers thing and this blog post made me feel like "those guys are doing black magic".

Neural networks and deep learning are truly awesome technologies.

  • dr_zoidberg 10 years ago

    They are, but once you start learning about them, you realize the "black magic" part comes mostly from their mathematical nature and very little from them being "intelligent computers".

    A neural net is a graph, in which a subset of nodes are "inputs" (that's where the net gets information), some are outputs, and there are other nodes which are called "hidden neurons".

    The nodes are interconnected in a particular fashion, which is called the "topology" or sometimes the "architecture" of the net. For example, I-H-O is a typical feedforward net, in which I (inputs) is the input layer, H is the hidden layer and O the output layer. All the hidden neurons connect to all the input neurons' outputs, and all the output neurons connect to the hidden neurons' outputs. The connections are called "weights", and training adjusts the weights of all the neurons using lots of cases until the desired output is achieved. There are also algorithms and criteria to stop before the net "learns too much" and loses the ability to generalize (this is called overfitting). In particular, a net with one hidden layer and one output layer is a universal function approximator -- that is, an approximator that can model any mathematical function of the form f(x1, x2, x3, ..., xn) = y.

    Deep learning means you're using a feedforward net with lots of hidden layers (I think it's usually between 5 and 15 now), which apply convolution operators (hence the "convolutional" in the name), and lots of neurons (on the order of thousands). All this was nearly impossible until GPGPUs came along, because of the time it took to train even a modest network (minutes to hours for a net with between 50 and 150 neurons in one hidden layer).

    This is a very shortened explanation -- if you want to read more, I recommend this link[1], which gives some simple Python code to illustrate and implement the innards of a basic neural network so you can learn it from the inside. Once you get that, you should move to more mature implementations, like Theano or Torch, to get the full potential of neural nets without worrying about implementation.

    [1] http://iamtrask.github.io/2015/07/12/basic-python-network/
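
    For the feedforward part, the whole forward pass fits in a few lines of numpy. A toy sketch in the spirit of [1] (layer sizes are arbitrary and the net is untrained, just to show the structure):

      import numpy as np

      rng = np.random.default_rng(0)

      # I-H-O topology: 3 inputs, 4 hidden neurons, 1 output (sizes are made up).
      W1 = rng.normal(size=(3, 4))   # input -> hidden weights
      W2 = rng.normal(size=(4, 1))   # hidden -> output weights

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def forward(x):
          h = sigmoid(x @ W1)        # hidden layer activations
          return sigmoid(h @ W2)     # output layer activation

      x = np.array([[0.0, 1.0, 1.0]])
      print(forward(x))              # an untrained guess; training nudges W1 and W2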

    • frozenport 10 years ago

      >>They are, but once you start learning about them, you realize the "black magic" part comes mostly from their mathematical nature and very little from them being "intelligent computers".

      Oh humbug! The black magic comes from the vast resources Google drew on to obtain perfect training datasets. Each step in the process took years to tune, demonstrating that data is indeed for those who don't have enough priors.

    • beambot 10 years ago

      You could say very much the same about the brain...

      > [...] the "black magic" part comes mostly from their mathematical nature and very little from them being "intelligent computers". A brain is a graph, in which a subset of neurons are "inputs", some are outputs, and others are "hidden". The nodes are interconnected in a particular fashion, which is called the "topology" or sometimes the "architecture" of the net.

      The deep question about deep learning is "Why is it so bloody effective?"

      • dr_zoidberg 10 years ago

        I work in the field, and while some models are based on biological structures/systems, there's a lot of fuss about them being "based on biological foundations" that is now best avoided. Yes, it is true the model is based on them, but it's a model that only covers very little of the real complexity. So in a sense, it's naive to say "put a billion neurons in there and you'll get a rat brain" (as was publicized at one point).

        The effectiveness comes from their non-linear nature and their ability to "learn" (store knowledge in the weights, derived from the training process). And black magic, of course!

      • Lawtonfogle 10 years ago

        If there is magic to be found, it may be in that question. What is it about graphs (namely the subset that are deep neural networks) that allows them not only to contain such powerful heuristics, but also to be created from scratch with barely any knowledge of the problem domain?

        As a side note, I was playing a board game last night (Terra Mystica, I believe) and wondering if you could get 5 different neural networks to play the game and then train them against each other (and once they are good enough, against players). I wonder how quickly one could train a network that is unbeatable by humans? Maybe even scale it up by training it to play multiple board games until it is really good at all of them before setting it loose on a brand new one (of a similar genre). Maybe Google could use this to make a Go bot.

        But what happens if this is used for evil instead? Say a neural network that reads a person's body language and determines how easily they can be intimidated by either a criminal or the government. Or one that is used to hunt down political dissidents. Imagine the first warrant to be signed by a judge for no reason other than a neural network saying the target is probably committing a crime...

        • thaumasiotes 10 years ago

          The best Go bot approach (as of some years ago, but it's not like neural networks are a new idea) uses a very different strategy. Specifically, the strategy of "identify a few possible moves, simulate the game for several steps after each move using a very stupid move-making heuristic instead of using this actual strategy recursively, and then pick the move that yielded the best simulated board state".
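
          Roughly, the move-selection loop looks like this (a sketch only; legal_moves, apply_move and evaluate are hypothetical placeholders for a real game engine, not anyone's actual Go code):

            import random

            def choose_move(state, candidates, legal_moves, apply_move, evaluate,
                            playouts=100, depth=50):
                """Pick the candidate whose random playouts end in the best position."""
                def playout(s):
                    # The "very stupid move-making heuristic": play random legal moves.
                    for _ in range(depth):
                        moves = legal_moves(s)
                        if not moves:
                            break
                        s = apply_move(s, random.choice(moves))
                    return evaluate(s)  # cheap board score, not the smart strategy

                def average_score(move):
                    start = apply_move(state, move)
                    return sum(playout(start) for _ in range(playouts)) / playouts

                return max(candidates, key=average_score)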

        • zwegner 10 years ago

          > Maybe Google could use this to make a Go bot.

          There was in fact a group within Google that worked on this: http://www.cs.toronto.edu/~cmaddis/pubs/deepgo.pdf

          • deepnet 10 years ago

            and the follow-up from Google's DeepMind group:

            Move Evaluation in Go Using Deep Convolutional Neural Networks Chris J. Maddison, Aja Huang, Ilya Sutskever, David Silver

            http://arxiv.org/abs/1412.6564

            • z92 10 years ago

              Before clicking I was assuming it would fail. Then read this in the summary: "When the trained convolutional network was used directly to play games of Go, without any search, it beat the traditional search program GnuGo in 97% of games, and matched the performance of a state-of-the-art Monte-Carlo tree search that simulates a million positions per move."

      • sushirain 10 years ago

        They are effective because:

        - They use more parameters (and fewer computations per parameter.)

        - They are hierarchical (convolutions are apparently useful at different levels of abstraction of data).

        - They are distributed (word2vec, thought-vectors). Not restricted to a small set of artificial classes such as parts-of-speech or parts of visual objects.

        - They are recurrent (RNN).

        etc.

        • kylebgorman 10 years ago

          word2vec isn't "deep" in the relevant sense. Both the skipgram and CBOW forms have a single hidden layer.

      • hyperbovine 10 years ago

        It's not really that deep, imo: a typical deep net these days has O(10^8) parameters (e.g. http://stackoverflow.com/questions/28232235/how-to-calculate...). You can store a hell of a lot of patterns in that many parameters, making them the best pattern matchers the world has ever seen. (Un)fortunately, pattern matching != intelligence. More interesting deep questions for which there is precious little theory revolve around the design of the networks themselves.
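
        The counting itself is just arithmetic. For example (layer shapes here are illustrative, roughly AlexNet-like, not from any specific production network):

          # Rough parameter counts for one conv layer and one fully connected layer.
          def conv_params(in_channels, out_channels, k):
              return out_channels * (in_channels * k * k + 1)   # +1 for the bias

          def fc_params(in_features, out_features):
              return out_features * (in_features + 1)

          print(conv_params(96, 256, 5))   # 614,656 weights in one 5x5 conv layer
          print(fc_params(9216, 4096))     # 37,752,832 in one fully connected layer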

        • shostack 10 years ago

          Is "pattern matching != intelligence" what occurred when the Google image recognition stuff in the news recently was shown to recognize the pattern of a "dumbbell" as always having a large muscular arm attached to it?

          Seemed like a great way to highlight the limitations of patterns.

          • hyperbovine 10 years ago

            I hadn't heard about that, but it sounds like what I'm talking about. With their ever-expanding training corpus, Google's net will eventually learn that dumbbells and arms are separate entities, but it will never deduce that on its own. And if it did, it would not be able to generalize that to the fact that wedding rings and fingers are different (I hypothesize). Basically there is a whole other component of "intelligence" that feels absent from neural nets, which is why visions of AI lording over humanity don't exactly keep me up at night. (Autonomous weapons otoh...)

    • mistercow 10 years ago

      > Deep learning means ... which apply convolution operators

      Convolutional networks are only one kind of deep learning. In particular, they generally apply only to image processing.

  • dchichkov 10 years ago

    They are doing matrix multiplications. Passing input a single time through even a very large neural network is a relatively fast operation (compared to training such a network, that is). Training requires data centers and arrays of GPUs. For passing input through the network, you can usually get away with a single core and vectorized operations. Unless you are doing high-resolution computer vision in real time... and even then you can still get away with a single core, but that requires some very smart sublinear processing.

    • ibrahima 10 years ago

      Completely right. Applying a neural network is much faster than training one. The main trick here is fitting the trained model into cache (or smaller) so that the matrix multiplies are fast.
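
      One common trick for shrinking the model so it fits in cache (a guess at the general kind of technique, not necessarily what Google shipped) is to store the weights as 8-bit integers and dequantize on the fly, cutting memory traffic by 4x versus float32:

        import numpy as np

        rng = np.random.default_rng(0)
        W = rng.normal(size=(256, 256)).astype(np.float32)   # a "trained" weight matrix

        # Symmetric linear quantization to int8: 4x smaller than float32.
        scale = np.abs(W).max() / 127.0
        W_q = np.round(W / scale).astype(np.int8)

        def matvec_dequant(W_q, scale, x):
            # Dequantize on the fly; a real kernel would use SIMD integer math.
            return (W_q.astype(np.float32) * scale) @ x

        x = rng.normal(size=256).astype(np.float32)
        print(np.max(np.abs(matvec_dequant(W_q, scale, x) - W @ x)))   # small error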

  • amelius 10 years ago

    > this blog post made me feel like "those guys are doing black magic".

    Two remarks. First, these guys probably don't know very well why what they are doing works so well ;) It requires a lot of trial and error, and a lot of patience and a lot of compute power (the latter being the reason why we are seeing breakthroughs only now).

    Second, training a neural net requires different computing power from deploying the net. The neural network that is installed on your phone has been trained using a lot of time and/or a very large cluster. Your phone is merely "running" the network, and this requires much less compute power.

  • jchomali 10 years ago

    Of course they are

  • ytdht 10 years ago

    they are awesome, but not that difficult to implement

sytelus 10 years ago

The most awesome and surprising thing about this is that the whole thing runs locally on your smartphone! You don't need a network connection. All dictionaries, grammar processing, image processing, DNN - the whole stack runs on the phone. I used this on my trip to Moscow and it was truly a godsend because it didn't need expensive international data plans (assuming you have connectivity at all!). English usage is fairly rare in Russia and it was just fun to learn Russian this way by pointing at interesting things.

eosrei 10 years ago

I used this in Brazil this last March to read menus. It works extremely well. The mistranslations make it even more fun. Much faster than learning Portuguese!

I took a few screenshots. Aligning the phone, focus, light, and shadows on the small menu font was difficult. You must keep it steady. Sadly, I ended up hitting the volume control on this best example. Tasty cockroaches! Ha! http://imgur.com/j9iRaY0

  • shkkmo 10 years ago

    I had some Brazilian roommates who didn't speak English (and I don't speak Portuguese). We used a combination of my poor Spanish and Google Translate off my phone to communicate.

    It worked OK (much better than nothing). However, there were a number of times when there were very large issues in the translations that created some pretty big misunderstandings. Luckily we had a friend who was fluent in English and Portuguese who would translate when things got too confused.

    To reduce errors, you do need to be really careful to use short, complete sentences with simple and correct grammar. It's also better to use sentences that contain words that aren't ambiguous. (Those two sentences would probably not translate well.)

    e.g. Please write simple words, short phrases and simple phrases. Please write words with just one meaning. Those phrases and words are easier to translate.

    • thaumasiotes 10 years ago

      > Please write words with just one meaning.

      Those words are very rare and tend to only be useful in very technical contexts.

      • shkkmo 10 years ago

        Fair enough. The idea I intended to express is 'unambiguous'. I tend to try to avoid more obscure words when writing text for automatic translation, often at the expense of explicit accuracy.

  • raverbashing 10 years ago

    Interesting

    It seems it can't really handle context, so 'cockroaches' may have been a mistranslation of 'cheap' in some contexts, and the 'it had stopped chestnut' may have simply been 'Brazil nuts'.

Animats 10 years ago

Word Lens is impressive. It came from a small startup. Google didn't develop it; it was a product before Google bought it. I saw an early version being shown around TechShop years ago, before Google Glass, even. It was quite fast even then, translating signs and keeping the translation positioned over the sign as the phone was moved in real time. But the initial version was English/Spanish only.

murbard2 10 years ago

I see no mention of it, but I'd be surprised if they didn't use some form of knowledge distilling [1] (which Hinton came up with, so really no excuse), to condense a large neural network into a much smaller one.

[1] http://arxiv.org/abs/1503.02531
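
The core of [1] is a modified training objective: fit the small network to the big network's softened output distribution as well as to the hard labels. A rough numpy sketch of that loss (toy code, not the paper's implementation; the T and alpha values are arbitrary):

  import numpy as np

  def softmax(z, T=1.0):
      z = z / T
      z = z - z.max(axis=-1, keepdims=True)
      e = np.exp(z)
      return e / e.sum(axis=-1, keepdims=True)

  def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
      """Cross-entropy vs. the teacher's softened outputs, mixed with the usual
      hard-label cross-entropy. Logits have shape (batch, classes)."""
      soft_teacher = softmax(teacher_logits, T)
      soft_student = softmax(student_logits, T)
      soft_loss = -np.mean(np.sum(soft_teacher * np.log(soft_student + 1e-12),
                                  axis=-1)) * T * T
      probs = softmax(student_logits)
      hard_loss = -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))
      return alpha * soft_loss + (1 - alpha) * hard_loss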

josu 10 years ago

WordLens/Google Translate is the most futuristic thing that my phone is able to do. It's especially useful in countries that don't use the Latin alphabet.

api 10 years ago

"Squeezes" is very relative. These phones are equal to or larger than most desktops 10-15 years ago, back when I was doing AI research with evolutionary computing and genetic algorithms. We did some pretty mean stuff on those machines, and now we have them in our pockets.

afsina 10 years ago

They did this even more impressively when squeezing their speech recognition engine onto mobile devices.

http://static.googleusercontent.com/media/research.google.co...

teraflop 10 years ago

A possibly relevant research paper that they didn't mention: "Distilling the Knowledge in a Neural Network" http://arxiv.org/abs/1503.02531

cossatot 10 years ago

International travel now has a new source of entertainment: On-the-spot generation of humorous mistranslations.

  • joosters 10 years ago

    The oddest result I ever got from WordLens was when using it to translate a page of poetry on a plaque. The output was wonderful :)

    WordLens was awesome for translating fragments of foreign languages - stuff like signs, menus and so on. But its offline translation seemed to be little more than a word->word translation, so there is a huge scope for improvement there. Very difficult when working offline!

  • chipgap98 10 years ago

    Reddit is going to have a field day

  • rasz_pl 10 years ago

    Chinese restaurants did it first.

  • cjslep 10 years ago

    Just capture the screenshot and you have a meme generator as well!

zippzom 10 years ago

What are the advantages of using a neural network over generating classification trees or using other machine learning methods? I'm not too familiar with how neural nets work, but it seems like they require more creator input than other methods, which could be good or bad I suppose.

  • boomzilla 10 years ago

    Neural networks, and plain old trusted logistic regression :), handle raw, continuous data better than other learning algorithms. For example, if your inputs are images or audio recordings, it's really hard to do classification with decision trees or random forests as you'd need to construct the features manually. What would be a feature: color densities, color histograms, edges, corners, Haar-like features, etc.? The promise of a multilayer neural network is that, given a lot of data, the right network structure, an appropriate learning strategy, and a huge farm of GPUs, the network can automatically learn the right features from raw data in the first layers, and utilize those features in later layers. The big advantage of this approach is that you abstract away the domain problems (hopefully), and focus on picking the right network design, the right learning strategy, collecting a good data set, etc. Neural network training is also easy to parallelize, so Google and the like can leverage their huge infrastructures.

    Now if the features in the domain problem are better defined, like credit ratings, and data is sparse, and domain expertise is available, decision trees are perfectly valid options.
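
    A toy illustration of the raw-data point (scikit-learn's tiny digits set; exact scores will vary, but the MLP on raw pixels usually wins by a wide margin):

      from sklearn.datasets import load_digits
      from sklearn.model_selection import train_test_split
      from sklearn.tree import DecisionTreeClassifier
      from sklearn.neural_network import MLPClassifier

      X, y = load_digits(return_X_y=True)            # raw 8x8 pixel intensities
      Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

      tree = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
      mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                          random_state=0).fit(Xtr, ytr)

      print("decision tree on raw pixels:", tree.score(Xte, yte))
      print("small MLP on raw pixels:    ", mlp.score(Xte, yte))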

    • microtonal 10 years ago

      > For example, if your inputs are images or audio recordings,

      Just wanted to add: and word/character/phrase embeddings.

poslathian 10 years ago

The article mentions algorithmically generating the training set. See here for some earlier research in this area: http://bheisele.com/heisele_research.html#3D_models

modfodder 10 years ago

Here's a short video about Google Translate that was just released.

https://www.youtube.com/watch?v=0zKU7jDA2nc&index=1&list=PLe...

up_and_up 10 years ago

This technology has been around since 2010 and was developed by Word Lens, which was acquired by Google in 2014:

https://en.wikipedia.org/wiki/Word_Lens

mrigor 10 years ago

For those unfamiliar with google's deep learning, this talk covers their recent efforts pretty well https://youtu.be/kO-Iw9xlxy4 (not technical)

dharma1 10 years ago

Would be great to see a more in depth article about this, and maybe even some open source code?

pschanely 10 years ago

Doesn't this article seem to say that the size of the training set is related to the size of the resulting network? It should be proportional to the number of nodes/layers that the network is configured for, not proportional to the number of training instances. Am I missing something?

  • alok-g 10 years ago

    The network is sized to be able to learn the training data reasonably well (e.g. via hyper-parameter optimization). If there is too much variation in the data that is not seen in the real application (like the rotation of letters mentioned in the article), an appropriately sized network will still learn it, but it would be overkill for the application at hand.

megalodon 10 years ago

I generated training sets for an OCR project in JavaScript [1] a while ago using a modified version of a captcha generator [2] (practically the same technique mentioned in this article).

[1] https://github.com/mateogianolio/mlp-character-recognition

[2] https://github.com/mateogianolio/mlp-character-recognition/b...
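
A minimal version of that kind of generator looks something like this (a sketch using Pillow and numpy; the font, distortion ranges and noise level are arbitrary placeholders):

  import numpy as np
  from PIL import Image, ImageDraw, ImageFilter

  def dirty_char(ch, size=32, seed=0):
      """Render a character, then rotate, blur and add noise to 'dirty' it."""
      rng = np.random.default_rng(seed)
      img = Image.new("L", (size, size), color=255)
      ImageDraw.Draw(img).text((8, 8), ch, fill=0)           # default bitmap font
      img = img.rotate(rng.uniform(-15, 15), fillcolor=255)  # random small rotation
      img = img.filter(ImageFilter.GaussianBlur(rng.uniform(0, 1.5)))
      arr = np.asarray(img, dtype=np.float32)
      arr += rng.normal(0, 10, arr.shape)                    # sensor-style noise
      return np.clip(arr, 0, 255).astype(np.uint8)

  sample = dirty_char("A", seed=42)   # one noisy training example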

hellrich 10 years ago

I wonder if they use some kind of (neural) language model for their translations. Using only a dictionary (as in the text) would be about 60 years behind the state of the art...

tdaltonc 10 years ago

Anyone want to do a $1 bet on an over/under for how long until word lens can handle Chinese?

  • agazso 10 years ago

    There is an app called Waygo that's already capable of handling Chinese, so I guess it's not too far off.

birdsbolt 10 years ago

Why do they need a deep learning model for this? They are obviously targeting signs, product names, menus and similar. The model will obviously fail at translating large texts.

Was there any advantage of using a deep learning model instead of something more computationally simple?

Uhhrrr 10 years ago

I don't get it. They say they use a dictionary, and they say it works without an Internet connection. How can both things be true? I'm pretty sure there's not, say, a Quechua dictionary on my phone.

  • mattmanser 10 years ago

    It doesn't come with all the languages, you have to download them.

    • bduerst 10 years ago

      I think it's Android only that you can download the language packs, FYI. The language packs + offline maps are super helpful when travelling abroad.

    • rndn 10 years ago

      You can download the language packs on Android and iOS, and each one is about 4 MB in size.

  • ori_b 10 years ago

    Are you sure?

    On my desktop, the English dictionary is ~1 megabyte uncompressed, and compresses to ~250k with gzip. The download for Google Translate is somewhere around 30 megabytes.

  • josu 10 years ago

    You have to download them beforehand, and offline translating is limited to just a few languages.

xigency 10 years ago

Given the reliability of closed captions on YouTube and the frequency of errors in plaintext Google translate, I wouldn't be surprised if this service fails often, and often when you need it most.

joosters 10 years ago

WordLens was an awesome app and it's good to see that Google is continuing the development.

The new fad for using the 'deep' learning buzzword annoys me though. It seems so meaningless. What makes one kind of neural net 'deep' and are all the other ones suddenly 'shallow' ?

  • teraflop 10 years ago

    > What makes one kind of neural net 'deep' and are all the other ones suddenly 'shallow' ?

    If this is a serious question, then googling "what is a deep neural network" would take you to any number of explanations. But to summarize very briefly, it's not a buzzword; it's a technical term referring to a network with multiple nonlinear layers that are chained together in sequence. Deep networks have been talked about for as long as neural networks have been a research subject, but it's only in the last few years that the mathematical techniques and computational power have been available to do really interesting things with them.

    The "fad" (as you call it) is not mainly because the word "deep" sounds cool, but because companies like Google have been seeing breakthrough results that are being used in production as we speak. For example:

    http://papers.nips.cc/paper/4687-large-scale-distributed-dee...

    http://static.googleusercontent.com/media/research.google.co...

    http://static.googleusercontent.com/media/research.google.co...

    • joosters 10 years ago

      I honestly didn't realise that it had any definition - I see now that calling it a 'fad' is unfair. However, the boundary between deep learning and (representational) machine learning still seems murky.

      • strebler 10 years ago

        Considering the very significant accuracy gains deep learning has achieved over previous approaches (and across a number of fields), it's certainly not a simple fad. Having worked in computer vision for a good 8+ years, deep learning is basically amazing.

        Deep learning is a form of representation/feature learning.

      • ssalazar 10 years ago

        Machine learning proper encompasses a swath of applied statistical techniques, of which deep learning is only one. Machine learning could refer to linear regression, SVMs, hidden Markov models, dimensionality reduction, neural nets, or any number of other loosely related methods. Intro ML classes often don't even get to deep learning because there's so much more fundamental stuff to cover.

    • ocdtrekkie 10 years ago

      So was Word Lens doing this before Google even bought them? Because Word Lens worked fine, locally on a phone, long before Google was doing its whole deep learning thing.

      • JeremyBanks 10 years ago

        It's not entirely clear to me, but this sentence from the article:

        > In the end, we were able to get our networks to give us significantly better results while running about as fast as our old system—great for translating what you see around you on the fly.

        suggests that they were previously not using neural networks, or were using less powerful ones.

  • raverbashing 10 years ago

    > What makes one kind of neural net 'deep' and are all the other ones suddenly 'shallow'

    Number of layers

    It's that simple

    • discardorama 10 years ago

      To expand on this some more: for a long time, thanks to Cybenko's theorem[1], people just used 1 hidden layer in their neural networks (also because computing was sloowww..). So, your typical NN architecture was input_layer --> hidden_layer --> output_layer.

      Eventually, people realized that you could improve performance by adding more hidden layers. So while theoretically Cybenko was correct, practically stacking a bunch of hidden layers made more sense. These network architectures with stacks of hidden layers were then labelled as "deep" neural networks.

      [1] https://en.wikipedia.org/wiki/Universal_approximation_theore...
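
      In code the difference really is just how many weight matrices you chain together (numpy sketch with arbitrary, untrained layer sizes):

        import numpy as np

        def relu(z):
            return np.maximum(z, 0)

        def forward(x, weights):
            # Chain the layers; "deep" just means this list is long.
            for W in weights[:-1]:
                x = relu(x @ W)
            return x @ weights[-1]

        rng = np.random.default_rng(0)
        shallow = [rng.normal(size=s) for s in [(64, 128), (128, 10)]]     # 1 hidden layer
        deep = [rng.normal(size=s) for s in [(64, 256), (256, 256),
                                             (256, 256), (256, 10)]]       # 3 hidden layers

        x = rng.normal(size=(1, 64))
        print(forward(x, shallow).shape, forward(x, deep).shape)   # both (1, 10)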

    • dnautics 10 years ago

      It is that simple, but the more complex story is that when the number of hidden layers exceeds 2, training becomes difficult. Also, convnets, for example, cheat by having the connections between layers be incomplete bipartite graphs (not every node is connected to every node in the next layer), usually chosen because of some physical property of the domain (e.g., nearest-neighbor pixels for computer vision).
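
      That "incomplete bipartite" structure is just local connectivity with shared weights. A 1-D toy sketch (numpy, made-up data):

        import numpy as np

        def conv1d(x, kernel):
            """Each output connects only to len(kernel) neighboring inputs,
            and the same small set of weights is reused at every position."""
            k = len(kernel)
            return np.array([np.dot(x[i:i + k], kernel)
                             for i in range(len(x) - k + 1)])

        x = np.arange(8, dtype=float)
        print(conv1d(x, np.array([1.0, 0.0, -1.0])))   # a simple edge-detector kernel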

      • lisivka 10 years ago

        Use another deep learning network to supervise the training of your DLN. You can also use it to supervise itself. It is a simple idea invented about a decade ago (at least I heard about it about a decade ago here, in Ukraine).

  • StavrosK 10 years ago

    Well, if all it cares about is looks...

anantzoid 10 years ago

Just waiting for the paper to come out that'll detail all the transformations that were done on the training data specifically for the phone, and how they decided to use them.

> To achieve real-time, we also heavily optimized and hand-tuned the math operations. That meant using the mobile processor’s SIMD instructions and tuning things like matrix multiplies to fit processing into all levels of cache memory.

Let's see how this turns out. I'm still skeptical; I wonder if other apps might crash because of this.
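
"Tuning things like matrix multiplies to fit processing into all levels of cache memory" usually means blocking/tiling: operate on small sub-blocks that stay resident in cache. A toy illustration of the idea in numpy (real code would be hand-written SIMD, quite possibly fixed-point; the block size here is arbitrary):

  import numpy as np

  def blocked_matmul(A, B, block=64):
      """Multiply in block-sized tiles so each tile stays hot in cache."""
      n, k = A.shape
      k2, m = B.shape
      assert k == k2
      C = np.zeros((n, m), dtype=A.dtype)
      for i in range(0, n, block):
          for j in range(0, m, block):
              for p in range(0, k, block):
                  C[i:i+block, j:j+block] += A[i:i+block, p:p+block] @ B[p:p+block, j:j+block]
      return C

  A = np.random.rand(256, 256)
  B = np.random.rand(256, 256)
  print(np.allclose(blocked_matmul(A, B), A @ B))   # same result, cache-friendlier loop order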
