Why is Siri still so clueless?


Joe Hootman

AI is progressing at a fast and furious pace. Over twenty years ago, IBM’s Deep Blue beat Garry Kasparov at chess. Eight years ago, IBM’s Watson beat Ken Jennings at Jeopardy. Two years ago, DeepMind’s AlphaGo beat world champion Ke Jie at Go. And just two days ago, moving to the very apex of civilization’s ladder, DeepMind’s AlphaStar walloped two of the best human StarCraft II players.

An inside-the-mind peek at AlphaStar’s defeat of one of the best human players of StarCraft II. Are Ninja’s days numbered?

Despite all this progress, your day-to-day attempts to interact with Siri, Alexa, or Cortana often result in forehead-smacking frustration. There’s this classic:


Several iOS versions later, you’ll still get results like:


You: “When is Black Friday?” Siri: “It will happen Thursday.”

If you harbor any doubt that maybe it’s just you and the way you phrase your questions, the existence of an entire subreddit devoted to these failures should settle it: an AI may be able to beat you at chess, but it can’t yet map its way to your daughter’s ballet recital.

Jonathan Mugan of DeepGrammar spoke at Data Day Texas today about the history and current state of attempts to move natural language processing toward a more generalized AI that can answer the kinds of questions a four-year-old can, like: “why can you pull a wagon with a string, but not push it?”

Jonathan Mugan

I enjoyed Jonathan’s talk because it threw historical illumination upon some of my graduate work, and because it raised questions about what we mean when we talk about the goal of getting AI to “understand language.” Here’s a Medium-sized summary of his talk, suffixed with a keen observation about the history of the field that he offered during Q&A. (Any errors are mine, and corrections are welcome.)

Historically, the NLP quest for understanding, he noted, has moved along two paths: symbolic and sub-symbolic.


The symbolic path transposes words into symbols and attempts to map relationships between those symbols. In its earliest phase, this approach chopped all the words of a book into tokens and threw them into a bag (like a salad), shook them into vectors for calculations based on their frequency, and, if feeling creative, seasoned them with sentiment drawn from external word lists.
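To make the bag-of-words idea concrete, here’s a minimal Python sketch; the toy documents and the tiny sentiment word list are invented for illustration.

```python
from collections import Counter

# Toy "books": each document is just a string (invented for illustration).
docs = [
    "the movie was a delight and the cast was superb",
    "the plot was a mess and the ending was a disappointment",
]

# Hypothetical external word list used to "season" the counts with sentiment.
sentiment = {"delight": 1, "superb": 1, "mess": -1, "disappointment": -1}

for text in docs:
    bag = Counter(text.split())  # word order is tossed; only frequencies remain
    score = sum(sentiment.get(word, 0) * count for word, count in bag.items())
    print(dict(bag), "sentiment:", score)
```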

Following it came attempts at manual representations, which struck me as the left brain’s Spock-like hunger to know the right entirely on its own terms. The meaning and relationships of words, the thinking went, could be buttoned down into a comprehensive taxonomy of parts and wholes, parents and children. This was the fecund Age of Ontology, bringing forth the ambitious Semantic Web, WordNet (where I did some work), FrameNet, ConceptNet, the pragmatic Wikipedia-based YAGO (Yet Another Great Ontology), SUMO (Suggested Upper Merged Ontology), and image schemas.


Semantic Web Layer Cake Spaghetti Monster. This’ll make sense out of language!
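For a taste of those “parents and children, parts and wholes” relations, here’s a short sketch against NLTK’s WordNet interface (assuming the WordNet corpus is available locally or can be downloaded).

```python
import nltk
nltk.download("wordnet", quiet=True)  # fetch WordNet data if it isn't already local
from nltk.corpus import wordnet as wn

car = wn.synsets("car")[0]        # first sense: the automobile
print(car.hypernyms())            # parents, e.g. motor_vehicle.n.01
print(car.part_meronyms()[:5])    # parts, e.g. accelerator.n.01, air_bag.n.01
print(car.hyponyms()[:5])         # children: more specific kinds of car
```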

Taking a step toward simplicity, the world-models approach recognized that people communicate against the backdrop of an assumed, shared model of the world, and only bubble up pertinent changes in state when needed. So the focus turned to creating those models and the dimensions along which they change: probabilistic, relational, concurrent, temporal.
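As a rough illustration of “only bubble up the pertinent changes,” here’s a toy sketch; the base world and its keys are made up, and real world models are probabilistic and relational rather than a flat dictionary.

```python
# A shared base model of the world that both speaker and listener assume.
# (The keys and values here are made up for the sketch.)
base_world = {"sky": "clear", "ground": "dry", "time": "afternoon"}

def utterance_for(new_state, shared=base_world):
    """Report only the state changes that differ from the assumed shared model."""
    return [f"the {key} is now {value}"
            for key, value in new_state.items() if shared.get(key) != value]

# It rained: everything else goes without saying.
print(utterance_for({"sky": "clear", "ground": "wet", "time": "afternoon"}))
# -> ['the ground is now wet']
```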


Having put world models through their paces, the field then enhanced them with merged representations. A word alone could denote a thick representation of an idea like coverage (“a roof covers the house”), but then be supplemented with inferences from the world to answer questions like “why does it only rain outside?”, leveraging world models about how rain falls and the functional boundaries of “outside.”

A representation like “chicken” can be supplemented with meaning inferred from the broader world model about what birds are like, how farms operate, or how to complement waffles.

The second path forward in NLP is the sub-symbolic path, which forgoes the utopian vision of translating linguistic meaning into discrete, manipulable symbols. Instead, it takes a much more pragmatic approach of working with words as they function in actual language by running them through ML techniques, most notably neural networks. word2vec works from the assumption that a word’s meaning is revealed by the company it keeps; it initializes two random, dense vectors per word and adjusts them until they are adequately differentiated and produce a coherent internal structure. These vectors have resulted in the discovery of concepts like “capital city,” where the vector offset between Italy and Rome can be applied to France to produce Paris.
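You can poke at that vector arithmetic with gensim; the sketch below uses pretrained GloVe vectors (dense vectors of the same kind, trained by a related method), and it assumes the gensim downloader and the “glove-wiki-gigaword-100” model are available.

```python
import gensim.downloader as api

# Pretrained vectors stand in for the dense word vectors described above.
# The model name comes from gensim's downloader catalog; the first call
# downloads the vectors, so this assumes network access.
vectors = api.load("glove-wiki-gigaword-100")

# The "capital city" direction: Rome - Italy + France ≈ Paris
print(vectors.most_similar(positive=["rome", "france"], negative=["italy"], topn=3))
```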

seq2seq expands beyond single words to encode sequences of them, like sentences, into vectors which can then be decoded into other sequences; it is used heavily in machine translation. Question answering uses recurrent neural networks that learn which previously stated facts merit attention in answering a given question. Fed earlier statements about spatial location, this approach can produce results like:

The office is north of the yard. The bath is north of the office. The yard is west of the kitchen. How do you go from the office to the kitchen? A: south, east.
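To make the encode-then-decode shape concrete, here’s a bare-bones, untrained PyTorch sketch (not the actual models Jonathan described; the vocabulary, sizes, and token IDs are placeholders).

```python
import torch
import torch.nn as nn

# A minimal encoder-decoder: the encoder GRU squeezes a token sequence into
# one hidden vector, and the decoder GRU unrolls another sequence from it.
VOCAB, EMB, HID = 50, 16, 32

class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.encoder = nn.GRU(EMB, HID, batch_first=True)
        self.decoder = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, src, tgt):
        _, state = self.encoder(self.embed(src))           # sentence -> one hidden vector
        dec_out, _ = self.decoder(self.embed(tgt), state)  # unroll the answer sequence
        return self.out(dec_out)                           # per-step scores over the vocabulary

model = Seq2Seq()
src = torch.randint(0, VOCAB, (1, 7))  # stand-in for "the office is north of the yard"
tgt = torch.randint(0, VOCAB, (1, 2))  # stand-in for "south east"
print(model(src, tgt).shape)           # torch.Size([1, 2, 50])
```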

Finally, the sub-symbolic approach discovers meaning by recognizing that we are not disembodied brains, but that we exist and learn within the physical framework of our bodies and our given environment.

Sorry, Descartes. But no dice.

Therefore, in order to learn the meaning of language about touching, holding, or grabbing an object, we need to capture the sensations experienced by touching, holding, or grabbing. The low-hanging fruit here is DNNs trained on interactions with the external world that are easy for a computer to undertake, like playing a text-based adventure game or ingesting input from the interface of a computer game (one of the key approaches AlphaStar used to triumph at StarCraft II this week). The likely gains will first come in domains of the same species as the data ingested. For example, Jonathan talked about efforts to use video game data to create a “Grand Theft Auto world (with hopefully less violence).”
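For a sense of what “learning from interactions with a game” looks like at its very simplest, here’s a toy sketch; the two-room text world, the random stand-in policy, and the reward are all invented for illustration.

```python
import random

# A two-room toy "text adventure" (invented for illustration). The agent sees
# a text observation, picks an action, and collects experience tuples -- the
# same interaction loop a game-playing network would be trained on.
ROOMS = {"hall": {"east": "kitchen"}, "kitchen": {"west": "hall"}}
GOAL = "kitchen"

def step(room, action):
    new_room = ROOMS[room].get(action, room)  # invalid moves leave you where you are
    reward = 1 if new_room == GOAL else 0
    return f"You are in the {new_room}.", new_room, reward

experience, room = [], "hall"
for _ in range(5):
    action = random.choice(["east", "west"])  # placeholder for a learned policy
    observation, room, reward = step(room, action)
    experience.append((observation, action, reward))
print(experience)
```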

Physical reality presents a much more complex challenge for ingestion. To address it, the EU has funded an open-source project called iCub, which builds a humanoid robot and captures the data from its interactions with physical reality.

iCub, meet “plant.” Plant, meet iCub.

The external-world training approach looks to me like the one most pregnant with possibility for the growth of virtual reality, and probably necessary to it.

In conclusion, Jonathan reiterated that the symbolic and sub-symbolic categories of approaches are designed to help AI journey from language to understanding.


This raised a nagging question for me about epistemology: what exactly are we reaching for when we talk about a linguistic “understanding”? Theologians and philosophers have batted that question around for centuries. At one point in the talk, Jonathan suggested that the field’s current end goal was a grasp of meaning from language that is richly grounded in the human experience gained from sensation and action, which sounded like a pretty empiricist definition to me.

When I asked him how varying definitions of deriving “understanding” from language affect the approaches to the quest for its attainment, Jonathan offered a very pragmatic, realistic answer: whichever one makes sense and is the most commercially viable at any given moment.

That fits with most of the history of science, which does not “progress” in a linear abstract vacuum but is often characterized by the interplay of varying approaches led by personalities and funding opportunities. I thought about how a recent article from MIT Technology Review confirmed that truth. It examined the history of recent scholarly approaches to “artificial intelligence” in order to predict the future of its development, and found no grand, inevitable progression from one step to the next. Instead, a particular paradigm seems to reign for about a decade before it is superseded.

I wonder what approach the market and idea-creators like Jonathan will settle into for 2020 and beyond. Maybe they’ll discover something that Siri will finally make note of.