How Google Translate squeezes deep learning onto a phone

googleresearch.blogspot.com

403 points by xwintermutex 10 years ago · 98 comments

liabru 10 years ago

This is great. I particularly like that they also automatically generated dirty versions for their training set, because that's exactly what I ended up doing for my dissertation project (a computer vision system [1] that automatically referees Scrabble boards). I also used dictionary analysis and the classifier's own confusion matrix to boost its accuracy.

If you're also interested in real time OCR like this, I did a write up [2] of the approach that worked well for my project. It only needed to recognize Scrabble fonts, but it could be extended to more fonts by using more training examples.

[1] http://brm.io/kwyjibo/

[2] http://brm.io/real-time-ocr/
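
The dictionary/confusion-matrix trick is easy to sketch (illustrative only, not the code from [2]; the confusion probabilities and word list below are made up): re-score candidate words by combining the classifier's per-character confusions with membership in a word list.

  import math

  # confusion[predicted][actual]: roughly, how often `actual` is misread as `predicted`.
  confusion = {
      "O": {"O": 0.90, "Q": 0.06, "D": 0.04},
      "I": {"I": 0.85, "L": 0.10, "T": 0.05},
      "E": {"E": 0.95, "F": 0.05},
  }

  dictionary = {"OIL", "DIE", "OLE", "ODE"}

  def word_score(predicted, candidate):
      """Log-likelihood that `candidate` was misread as `predicted`."""
      if len(candidate) != len(predicted):
          return float("-inf")
      return sum(math.log(confusion.get(p, {}).get(c, 1e-6))
                 for p, c in zip(predicted, candidate))

  def correct(predicted):
      # Prefer the dictionary word the confusion model finds most plausible.
      return max(dictionary, key=lambda w: word_score(predicted, w))

  print(correct("OIE"))   # -> "OLE": the word list overrides the raw character read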

motoboi 10 years ago

I am 15 years into this computers thing and this blog post made me feel like "those guys are doing black magic".

Neural networks and deep learning are truly awesome technologies.

  • dr_zoidberg 10 years ago

    They are, but once you start learning about them, you realize the "black magic" part comes mostly from their mathematical nature and very little from them being "intelligent computers".

    A neural net is a graph, in which a subset of nodes are "inputs" (that's where the net gets information), some are outputs, and there are other nodes which are called "hidden neurons".

    The nodes are interconnected in a particular fashion, which is called the "topology" or sometimes the "architecture" of the net. For example, I-H-O is a typical feedforward net, in which I (inputs) is the input layer, H is the hidden layer and O the output layer. All the hidden neurons connect to all the input neurons' outputs, and all the output neurons connect to the hidden neurons' outputs. The connections are called "weights", and training adjusts the weights of all the neurons using lots of cases until the desired output is achieved. There are also algorithms and criteria to stop before the net "learns too much" and loses the ability to generalize (this is called overfitting). In particular, a net with one hidden layer and one output layer is a universal function approximator -- that is, an approximator that can model any mathematical function of the form f(x1, x2, x3, ..., xn) = y.

    Deep learning means you're using a feedforward net with lots of hidden layers (I think it's usually between 5 and 15 now), which apply convolution operators (hence the "convolutional" in the name), and lots of neurons (on the order of thousands). All this was nearly impossible until GPGPUs came along, because of the time it took to train even a modest network (minutes to hours for a net with between 50 and 150 neurons in one hidden layer).

    This is a very shortened explanation -- if you want to read more, I recommend this link[1], which gives some simple Python code to illustrate and implement the innards of a basic neural network so you can learn it from the inside. Once you get that, you should move to more mature implementations, like Theano or Torch, to get the full potential of neural nets without worrying about implementation.

    [1] http://iamtrask.github.io/2015/07/12/basic-python-network/
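
    For the feedforward part, the whole forward pass fits in a few lines of numpy. A toy sketch in the spirit of [1] (layer sizes are arbitrary and the net is untrained, just to show the structure):

      import numpy as np

      rng = np.random.default_rng(0)

      # I-H-O topology: 3 inputs, 4 hidden neurons, 1 output (sizes are made up).
      W1 = rng.normal(size=(3, 4))   # input -> hidden weights
      W2 = rng.normal(size=(4, 1))   # hidden -> output weights

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def forward(x):
          h = sigmoid(x @ W1)        # hidden layer activations
          return sigmoid(h @ W2)     # output layer activation

      x = np.array([[0.0, 1.0, 1.0]])
      print(forward(x))              # an untrained guess; training nudges W1 and W2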

    • frozenport 10 years ago

      >>They are, but once you start learning about them, you realize the "black magic" part comes mostly from their mathematical nature and very little from them being "intelligent computers".

      Oh humbug! The black magic comes from the vast resources Google drew on to obtain perfect training datasets. Each step in the process took years to tune, demonstrating that data is indeed for those who don't have enough priors.

    • beambot 10 years ago

      You could say very much the same about the brain...

      > [...] the "black magic" part comes mostly from their mathematical nature and very little from them being "intelligent computers". A brain is a graph, in which a subset of neurons are "inputs", some are outputs, and others are "hidden". The nodes are interconnected in a particular fashion, which is called the "topology" or sometimes the "architecture" of the net.

      The deep question about deep learning is "Why is it so bloody effective?"

      • dr_zoidberg 10 years ago

        I work in the field, and while some models are based on biological structures/systems, there's a lot of fuss about them being "based on biological foundations" that is now best avoided. Yes, it is true the model is based on them, but it's a model that only covers very little of the real complexity. So in a sense, it's naive to say "put a billion neurons in there and you'll get a rat brain" (as was publicized at one point).

        The effectiveness comes from their non-linear nature and their ability to "learn" (store knowledge in the weights, derived from the training process). And black magic, of course!

      • Lawtonfogle 10 years ago

        If there is magic to be found, it may be in that question. What is it about graphs (namely the subset that are deep neural networks) that allows them not only to contain such powerful heuristics, but also to be created from scratch with barely any knowledge of the problem domain?

        As a side note, I was playing a board game last night (Terra Mystica, I believe) and wondering if you could get 5 different neural networks to play the game and then train them against each other (and once they are good enough, against players). I wonder how quickly one could train a network that is unbeatable by humans? Maybe even scale it up by training it to play multiple board games until it is really good at all of them before setting it loose on a brand new one (of a similar genre). Maybe Google could use this to make a Go bot.

        But what happens if this is used for evil instead? Say a neural network that reads a person's body language and determines how easily they can be intimidated by either a criminal or the government. Or one that is used to hunt down political dissidents. Imagine the first warrant to be signed by a judge for no reason other than a neural network saying the target is probably committing a crime...

        • thaumasiotes 10 years ago

          The best Go bot approach (as of some years ago, but it's not like neural networks are a new idea) uses a very different strategy. Specifically, the strategy of "identify a few possible moves, simulate the game for several steps after each move using a very stupid move-making heuristic instead of using this actual strategy recursively, and then pick the move that yielded the best simulated board state".
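
          Roughly, the move-selection loop looks like this (a sketch only; legal_moves, apply_move and evaluate are hypothetical placeholders for a real game engine, not anyone's actual Go code):

            import random

            def choose_move(state, candidates, legal_moves, apply_move, evaluate,
                            playouts=100, depth=50):
                """Pick the candidate whose random playouts end in the best position."""
                def playout(s):
                    # The "very stupid move-making heuristic": play random legal moves.
                    for _ in range(depth):
                        moves = legal_moves(s)
                        if not moves:
                            break
                        s = apply_move(s, random.choice(moves))
                    return evaluate(s)  # cheap board score, not the smart strategy

                def average_score(move):
                    start = apply_move(state, move)
                    return sum(playout(start) for _ in range(playouts)) / playouts

                return max(candidates, key=average_score)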

        • zwegner 10 years ago

          > Maybe Google could use this to make a Go bot.

          There was in fact a group within Google that worked on this: http://www.cs.toronto.edu/~cmaddis/pubs/deepgo.pdf

          • deepnet 10 years ago

            and the follow-up from Google's DeepMind group:

            Move Evaluation in Go Using Deep Convolutional Neural Networks Chris J. Maddison, Aja Huang, Ilya Sutskever, David Silver

            http://arxiv.org/abs/1412.6564

            • z92 10 years ago

              Before clicking I was assuming it would fail. Then read this in the summary: "When the trained convolutional network was used directly to play games of Go, without any search, it beat the traditional search program GnuGo in 97% of games, and matched the performance of a state-of-the-art Monte-Carlo tree search that simulates a million positions per move."

      • sushirain 10 years ago

        They are effective because:

        - They use more parameters (and fewer computations per parameter.)

        - They are hierarchical (convolutions are apparently useful at different levels of abstraction of data).

        - They are distributed (word2vec, thought-vectors). Not restricted to a small set of artificial classes such as parts-of-speech or parts of visual objects.

        - They are recurrent (RNN).

        etc.

        • kylebgorman 10 years ago

          word2vec isn't "deep" in the relevant sense. Both the skipgram and CBOW forms have a single hidden layer.

      • hyperbovine 10 years ago

        It's not really that deep, imo: a typical deep net these days has O(10^8) parameters (e.g. http://stackoverflow.com/questions/28232235/how-to-calculate...). You can store a hell of a lot of patterns in that many parameters, making them the best pattern matchers the world has ever seen. (Un)fortunately, pattern matching != intelligence. More interesting deep questions for which there is precious little theory revolve around the design of the networks themselves.
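
        The counting itself is just arithmetic. For example (layer shapes here are illustrative, roughly AlexNet-like, not from any specific production network):

          # Rough parameter counts for one conv layer and one fully connected layer.
          def conv_params(in_channels, out_channels, k):
              return out_channels * (in_channels * k * k + 1)   # +1 for the bias

          def fc_params(in_features, out_features):
              return out_features * (in_features + 1)

          print(conv_params(96, 256, 5))   # 614,656 weights in one 5x5 conv layer
          print(fc_params(9216, 4096))     # 37,752,832 in one fully connected layer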

        • shostack 10 years ago

          Is "pattern matching != intelligence" what occurred when the Google image recognition stuff in the news recently was shown to recognize the pattern of a "dumbbell" as always having a large muscular arm attached to it?

          Seemed like a great way to highlight the limitations of patterns.

          • hyperbovine 10 years ago

            I hadn't heard about that, but it sounds like what I'm talking about. With their ever-expanding training corpus, Google's net will eventually learn that dumbbells and arms are separate entities, but it will never deduce that on its own. And if it did, it would not be able to generalize that to the fact that wedding rings and fingers are different (I hypothesize). Basically there is a whole other component of "intelligence" that feels absent from neural nets, which is why visions of AI lording over humanity don't exactly keep me up at night. (Autonomous weapons otoh...)

    • mistercow 10 years ago

      > Deep learning means ... which apply convolution operators

      Convolutional networks are only one kind of deep learning. In particular, they generally apply only to image processing.

  • dchichkov 10 years ago

    They are doing matrix multiplications. Passing input a single time through even a very large neural network is a relatively fast operation (compared to training such a network, that is). Training requires data centers and arrays of GPUs. For passing input through the network, you can usually get away with a single core and vectorized operations. Unless you are doing high-resolution computer vision in real time... and even then you can still get away with a single core, but that requires some very smart sublinear processing.

    • ibrahima 10 years ago

      Completely right. Applying a neural network is much faster than training one. The main trick here is fitting the trained model into cache (or smaller) so that the matrix multiplies are fast.
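
      One common trick for shrinking the model so it fits in cache (a guess at the general kind of technique, not necessarily what Google shipped) is to store the weights as 8-bit integers and dequantize on the fly, cutting memory traffic by 4x versus float32:

        import numpy as np

        rng = np.random.default_rng(0)
        W = rng.normal(size=(256, 256)).astype(np.float32)   # a "trained" weight matrix

        # Symmetric linear quantization to int8: 4x smaller than float32.
        scale = np.abs(W).max() / 127.0
        W_q = np.round(W / scale).astype(np.int8)

        def matvec_dequant(W_q, scale, x):
            # Dequantize on the fly; a real kernel would use SIMD integer math.
            return (W_q.astype(np.float32) * scale) @ x

        x = rng.normal(size=256).astype(np.float32)
        print(np.max(np.abs(matvec_dequant(W_q, scale, x) - W @ x)))   # small error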

  • amelius 10 years ago

    > this blog post made me feel like "those guys are doing black magic".

    Two remarks. First, these guys probably don't know very well why what they are doing works so well ;) It requires a lot of trial and error, and a lot of patience and a lot of compute power (the latter being the reason why we are seeing breakthroughs only now).

    Second, training a neural net requires different computing power from deploying the net. The neural network that is installed on your phone has been trained using a lot of time and/or a very large cluster. Your phone is merely "running" the network, and this requires much less compute power.

  • jchomali 10 years ago

    Of course they are

  • ytdht 10 years ago

    they are awesome, but not that difficult to implement

sytelus 10 years ago

The most awesome and surprising thing about this is that the whole thing runs locally on your smartphone! You don't need a network connection. All dictionaries, grammar processing, image processing, DNN - the whole stack runs on the phone. I used this on my trip to Moscow and it was truly a godsend because it didn't need expensive international data plans (assuming you have connectivity at all!). English usage is fairly rare in Russia and it was just fun to learn Russian this way by pointing at interesting things.

eosrei 10 years ago

I used this in Brazil this last March to read menus. It works extremely well. The mistranslations make it even more fun. Much faster than learning Portuguese!

I took a few screenshots. Aligning the phone, focus, light, and shadows on the small menu font was difficult. You must keep it steady. Sadly, I ended up hitting the volume control on this best example. Tasty cockroaches! Ha! http://imgur.com/j9iRaY0

  • shkkmo 10 years ago

    I had some Brazilian roommates who didn't speak English (and I don't speak Portuguese). We used a combination of my poor Spanish and Google Translate off my phone to communicate.

    It worked OK (much better than nothing). However, there were a number of times when there were very large issues in the translations that created some pretty big misunderstandings. Luckily we had a friend who was fluent in English and Portuguese who would translate when things got too confused.

    To reduce errors, you do need to be really careful to use short, complete sentences with simple and correct grammar. It's also better to use sentences that contain words that aren't ambiguous. (Those two sentences would probably not translate well.)

    e.g. Please write simple words, short phrases and simple phrases. Please write words with just one meaning. Those phrases and words are easier to translate.

    • thaumasiotes 10 years ago

      > Please write words with just one meaning.

      Those words are very rare and tend to only be useful in very technical contexts.

      • shkkmo 10 years ago

        Fair enough. The idea I intended to express is 'unambiguous'. I tend to try to avoid more obscure words when writing text for automatic translation, often at the expense of explicit accuracy.

  • raverbashing 10 years ago

    Interesting

    It seems it can't really handle context, so 'cockroaches' may have been a mistranslation of 'cheap' in some contexts, and the 'it had stopped chestnut' may have simply been 'Brazil nuts'.

Animats 10 years ago

Word Lens is impressive. It came from a small startup. Google didn't develop it; it was a product before Google bought it. I saw an early version being shown around TechShop years ago, before Google Glass, even. It was quite fast even then, translating signs and keeping the translation positioned over the sign as the phone was moved in real time. But the initial version was English/Spanish only.

murbard2 10 years ago

I see no mention of it, but I'd be surprised if they didn't use some form of knowledge distilling [1] (which Hinton came up with, so really no excuse), to condense a large neural network into a much smaller one.

[1] http://arxiv.org/abs/1503.02531
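
The core of [1] is a modified training objective: fit the small network to the big network's softened output distribution as well as to the hard labels. A rough numpy sketch of that loss (toy code, not the paper's implementation; the T and alpha values are arbitrary):

  import numpy as np

  def softmax(z, T=1.0):
      z = z / T
      z = z - z.max(axis=-1, keepdims=True)
      e = np.exp(z)
      return e / e.sum(axis=-1, keepdims=True)

  def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
      """Cross-entropy vs. the teacher's softened outputs, mixed with the usual
      hard-label cross-entropy. Logits have shape (batch, classes)."""
      soft_teacher = softmax(teacher_logits, T)
      soft_student = softmax(student_logits, T)
      soft_loss = -np.mean(np.sum(soft_teacher * np.log(soft_student + 1e-12),
                                  axis=-1)) * T * T
      probs = softmax(student_logits)
      hard_loss = -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))
      return alpha * soft_loss + (1 - alpha) * hard_loss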

josu 10 years ago

WordLens/Google Translate is the most futuristic thing that my phone is able to do. It's especially useful in countries that don't use the Latin alphabet.

api 10 years ago

"Squeezes" is very relative. These phones are equal to or larger than most desktops 10-15 years ago, back when I was doing AI research with evolutionary computing and genetic algorithms. We did some pretty mean stuff on those machines, and now we have them in our pockets.

afsina 10 years ago

They did this even more impressively when squeezing their speech recognition engine onto mobile devices.

http://static.googleusercontent.com/media/research.google.co...

teraflop 10 years ago

A possibly relevant research paper that they didn't mention: "Distilling the Knowledge in a Neural Network" http://arxiv.org/abs/1503.02531

cossatot 10 years ago

International travel now has a new source of entertainment: On-the-spot generation of humorous mistranslations.

  • joosters 10 years ago

    The oddest result I ever got from WordLens was when using it to translate a page of poetry on a plaque. The output was wonderful :)

    WordLens was awesome for translating fragments of foreign languages - stuff like signs, menus and so on. But its offline translation seemed to be little more than a word->word translation, so there is a huge scope for improvement there. Very difficult when working offline!

  • chipgap98 10 years ago

    Reddit is going to have a field day

  • rasz_pl 10 years ago

    Chinese restaurants did it first.

  • cjslep 10 years ago

    Just capture the screenshot and you have a meme generator as well!

zippzom 10 years ago

What are the advantages of using a neural network over generating classification trees or using other machine learning methods? I'm not too familiar with how neural nets work, but it seems like they require more creator input than other methods, which could be good or bad I suppose.

  • boomzilla 10 years ago

    Neural networks, and plain old trusted logistic regression :), handle raw, continuous data better than other learning algorithms. For example, if your inputs are images or audio recordings, it's really hard to do classification with decision trees or random forests as you'd need to construct the features manually. What would be a feature: color densities, color histograms, edges, corners, Haar-like features, etc.? The promise of a multilayer neural network is that, given a lot of data, the right network structure, an appropriate learning strategy, and a huge farm of GPUs, the network can automatically learn the right features from raw data in the first layers, and utilize those features in later layers. The big advantage of this approach is that you abstract away the domain problems (hopefully), and focus on picking the right network design, the right learning strategy, collecting a good data set, etc. Neural network training is also easy to parallelize, so Google and the like can leverage their huge infrastructures.

    Now if the features in the domain problem are better defined, like credit ratings, and data is sparse, and domain expertise is available, decision trees are perfectly valid options.
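
    A toy illustration of the raw-data point (scikit-learn's tiny digits set; exact scores will vary, but the MLP on raw pixels usually wins by a wide margin):

      from sklearn.datasets import load_digits
      from sklearn.model_selection import train_test_split
      from sklearn.tree import DecisionTreeClassifier
      from sklearn.neural_network import MLPClassifier

      X, y = load_digits(return_X_y=True)            # raw 8x8 pixel intensities
      Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

      tree = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
      mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                          random_state=0).fit(Xtr, ytr)

      print("decision tree on raw pixels:", tree.score(Xte, yte))
      print("small MLP on raw pixels:    ", mlp.score(Xte, yte))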

    • microtonal 10 years ago

      > For example, if your inputs are images or audio recordings,

      Just wanted to add: and word/character/phrase embeddings.

poslathian 10 years ago

The article mentions algorithmically generating the training set. See here for some earlier research in this area: http://bheisele.com/heisele_research.html#3D_models

modfodder 10 years ago

Here's a short video about Google Translate that was just released.

https://www.youtube.com/watch?v=0zKU7jDA2nc&index=1&list=PLe...

up_and_up 10 years ago

This technology has been around since 2010 and was developed by Word Lens, which was acquired by Google in 2014:

https://en.wikipedia.org/wiki/Word_Lens

mrigor 10 years ago

For those unfamiliar with google's deep learning, this talk covers their recent efforts pretty well https://youtu.be/kO-Iw9xlxy4 (not technical)

dharma1 10 years ago

Would be great to see a more in depth article about this, and maybe even some open source code?

pschanely 10 years ago

Doesn't this article seem to say that the size of the training set is related to the size of the resulting network? It should be proportional to the number of nodes/layers that the network is configured for, not proportional to the number of training instances. Am I missing something?

  • alok-g 10 years ago

    The network is sized to be able to learn the training data reasonably well (e.g. via hyper-parameter optimization). If there is too much variation in the data that is not seen in the real application (like the rotation of letters mentioned in the article), an appropriately sized network will still learn it, but it would be overkill for the application at hand.

megalodon 10 years ago

I generated training sets for an OCR project in JavaScript [1] a while ago using a modified version of a captcha generator [2] (practically the same technique mentioned in this article).

[1] https://github.com/mateogianolio/mlp-character-recognition

[2] https://github.com/mateogianolio/mlp-character-recognition/b...
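
A minimal version of that kind of generator looks something like this (a sketch using Pillow and numpy; the font, distortion ranges and noise level are arbitrary placeholders):

  import numpy as np
  from PIL import Image, ImageDraw, ImageFilter

  def dirty_char(ch, size=32, seed=0):
      """Render a character, then rotate, blur and add noise to 'dirty' it."""
      rng = np.random.default_rng(seed)
      img = Image.new("L", (size, size), color=255)
      ImageDraw.Draw(img).text((8, 8), ch, fill=0)           # default bitmap font
      img = img.rotate(rng.uniform(-15, 15), fillcolor=255)  # random small rotation
      img = img.filter(ImageFilter.GaussianBlur(rng.uniform(0, 1.5)))
      arr = np.asarray(img, dtype=np.float32)
      arr += rng.normal(0, 10, arr.shape)                    # sensor-style noise
      return np.clip(arr, 0, 255).astype(np.uint8)

  sample = dirty_char("A", seed=42)   # one noisy training example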

hellrich 10 years ago

I wonder if they use some kind of (neural) language model for their translations. Using only a dictionary (as in the text) would be about 60 years behind the state of the art...

tdaltonc 10 years ago

Anyone want to do a $1 bet on an over/under for how long until word lens can handle Chinese?

  • agazso 10 years ago

    There is an app called Waygo that's already capable of handling Chinese, so I guess it's not too far off.

birdsbolt 10 years ago

Why do they need a deep learning model for this? They are obviously targeting signs, product names, menus and similar. The model will obviously fail at translating large texts.

Was there any advantage of using a deep learning model instead of something more computationally simple?

Uhhrrr 10 years ago

I don't get it. They say they use a dictionary, and they say it works without an Internet connection. How can both things be true? I'm pretty sure there's not, say, a Quechua dictionary on my phone.

  • mattmanser 10 years ago

    It doesn't come with all the languages, you have to download them.

    • bduerst 10 years ago

      I think it's Android only that you can download the language packs, FYI. The language packs + offline maps are super helpful when travelling abroad.

    • rndn 10 years ago

      You can download the language packs on Android and iOS, and each one is about 4 MB in size.

  • ori_b 10 years ago

    Are you sure?

    On my desktop, the English dictionary is ~1 megabyte uncompressed, and compresses to ~250k with gzip. The download for Google Translate is somewhere around 30 megabytes.

  • josu 10 years ago

    You have to download them beforehand, and offline translating is limited to just a few languages.

xigency 10 years ago

Given the reliability of closed captions on YouTube and the frequency of errors in plaintext Google translate, I wouldn't be surprised if this service fails often, and often when you need it most.

joosters 10 years ago

WordLens was an awesome app and it's good to see that Google is continuing the development.

The new fad for using the 'deep' learning buzzword annoys me though. It seems so meaningless. What makes one kind of neural net 'deep' and are all the other ones suddenly 'shallow' ?

  • teraflop 10 years ago

    > What makes one kind of neural net 'deep' and are all the other ones suddenly 'shallow' ?

    If this is a serious question, then googling "what is a deep neural network" would take you to any number of explanations. But to summarize very briefly, it's not a buzzword; it's a technical term referring to a network with multiple nonlinear layers that are chained together in sequence. Deep networks have been talked about for as long as neural networks have been a research subject, but it's only in the last few years that the mathematical techniques and computational power have been available to do really interesting things with them.

    The "fad" (as you call it) is not mainly because the word "deep" sounds cool, but because companies like Google have been seeing breakthrough results that are being used in production as we speak. For example:

    http://papers.nips.cc/paper/4687-large-scale-distributed-dee...

    http://static.googleusercontent.com/media/research.google.co...

    http://static.googleusercontent.com/media/research.google.co...

    • joosters 10 years ago

      I honestly didn't realise that it had any definition - I see now that calling it a 'fad' is unfair. However, the boundary between deep learning and (representational) machine learning still seems murky.

      • strebler 10 years ago

        Considering the very significant accuracy gains deep learning has achieved over previous approaches (and across a number of fields), it's certainly not a simple fad. Having worked in computer vision for a good 8+ years, deep learning is basically amazing.

        Deep learning is a form of representation/feature learning.

      • ssalazar 10 years ago

        Machine learning proper encompasses a swath of applied statistical techniques, of which deep learning is only one. Machine learning could refer to linear regression, SVMs, hidden Markov models, dimensionality reduction, neural nets, or any number of other loosely related methods. Intro ML classes often don't even get to deep learning because there's so much more fundamental stuff to cover.

    • ocdtrekkie 10 years ago

      So was Word Lens doing this before Google even bought them? Because Word Lens worked fine, locally on a phone, long before Google was doing its whole deep learning thing.

      • JeremyBanks 10 years ago

        It's not entirely clear to me, but this sentence from the article:

        > In the end, we were able to get our networks to give us significantly better results while running about as fast as our old system—great for translating what you see around you on the fly.

        suggests that they were previously not using neural networks, or were using less powerful ones.

  • raverbashing 10 years ago

    > What makes one kind of neural net 'deep' and are all the other ones suddenly 'shallow'

    Number of layers

    It's that simple

    • discardorama 10 years ago

      To expand on this some more: for a long time, thanks to Cybenko's theorem[1], people just used 1 hidden layer in their neural networks (also because computing was sloowww..). So, your typical NN architecture was input_layer --> hidden_layer --> output_layer.

      Eventually, people realized that you could improve performance by adding more hidden layers. So while theoretically Cybenko was correct, practically stacking a bunch of hidden layers made more sense. These network architectures with stacks of hidden layers were then labelled as "deep" neural networks.

      [1] https://en.wikipedia.org/wiki/Universal_approximation_theore...
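
      In code the difference really is just how many weight matrices you chain together (numpy sketch with arbitrary, untrained layer sizes):

        import numpy as np

        def relu(z):
            return np.maximum(z, 0)

        def forward(x, weights):
            # Chain the layers; "deep" just means this list is long.
            for W in weights[:-1]:
                x = relu(x @ W)
            return x @ weights[-1]

        rng = np.random.default_rng(0)
        shallow = [rng.normal(size=s) for s in [(64, 128), (128, 10)]]     # 1 hidden layer
        deep = [rng.normal(size=s) for s in [(64, 256), (256, 256),
                                             (256, 256), (256, 10)]]       # 3 hidden layers

        x = rng.normal(size=(1, 64))
        print(forward(x, shallow).shape, forward(x, deep).shape)   # both (1, 10)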

    • dnautics 10 years ago

      It is that simple, but the more complex story is that when the number of hidden layers exceeds 2, training becomes difficult. Also, convnets, for example, cheat by having the connections between layers be incomplete bipartite graphs (not every node is connected to every node in the next layer), usually chosen because of some physical property of the domain (e.g., nearest-neighbor pixels for computer vision).
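
      That "incomplete bipartite" structure is just local connectivity with shared weights. A 1-D toy sketch (numpy, made-up data):

        import numpy as np

        def conv1d(x, kernel):
            """Each output connects only to len(kernel) neighboring inputs,
            and the same small set of weights is reused at every position."""
            k = len(kernel)
            return np.array([np.dot(x[i:i + k], kernel)
                             for i in range(len(x) - k + 1)])

        x = np.arange(8, dtype=float)
        print(conv1d(x, np.array([1.0, 0.0, -1.0])))   # a simple edge-detector kernel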

      • lisivka 10 years ago

        Use another deep learning network to supervise the training of your DLN. You can also use it to supervise itself. It is a simple idea invented about a decade ago (at least I heard about it about a decade ago here, in Ukraine).

  • StavrosK 10 years ago

    Well, if all it cares about is looks...

anantzoid 10 years ago

Just waiting for the paper to come out that'll detail all the transformations that were done on the training data specifically for the phone, and how they decided to use them.

> To achieve real-time, we also heavily optimized and hand-tuned the math operations. That meant using the mobile processor’s SIMD instructions and tuning things like matrix multiplies to fit processing into all levels of cache memory.

Let's see how this turns out. I'm still skeptical; I wonder if other apps might crash because of this.
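
"Tuning things like matrix multiplies to fit processing into all levels of cache memory" usually means blocking/tiling: operate on small sub-blocks that stay resident in cache. A toy illustration of the idea in numpy (real code would be hand-written SIMD, quite possibly fixed-point; the block size here is arbitrary):

  import numpy as np

  def blocked_matmul(A, B, block=64):
      """Multiply in block-sized tiles so each tile stays hot in cache."""
      n, k = A.shape
      k2, m = B.shape
      assert k == k2
      C = np.zeros((n, m), dtype=A.dtype)
      for i in range(0, n, block):
          for j in range(0, m, block):
              for p in range(0, k, block):
                  C[i:i+block, j:j+block] += A[i:i+block, p:p+block] @ B[p:p+block, j:j+block]
      return C

  A = np.random.rand(256, 256)
  B = np.random.rand(256, 256)
  print(np.allclose(blocked_matmul(A, B), A @ B))   # same result, cache-friendlier loop order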
