researchers, the open-domain QA problem is attractive as it is one of the most challenging in the realm of computer science and artificial intelligence, requiring a synthesis of information retrieval, natural language processing, knowledge representation and reasoning, machine learning, and computer-human interfaces. It has had a long history (Simmons 1970) and saw rapid advancement spurred by system building, experimentation, and government funding in the past decade (Maybury 2004, Strzalkowski and Harabagiu 2006). With QA in mind, we settled on a challenge to build a computer system, called Watson,1 which could compete at the human champion level in real time on the American TV quiz show, Jeopardy. The extent of the challenge includes fielding a real-time automatic contestant on the show, not merely a laboratory exercise.
Jeopardy! is a well-known TV quiz show that has been airing on television in the United States for more than 25 years (see the Jeopardy! Quiz Show sidebar for more information on the show). It pits three human contestants against one another in a competition that requires answering rich natural language questions over a very broad domain of topics, with penalties for wrong answers. The nature of the three-person competition is such that confidence, precision, and answering speed are of critical importance, with roughly 3 seconds to answer each question. A computer system that could compete at human champion levels at this game would need to produce exact answers to often complex natural language questions with high precision and speed and have a reliable confidence in its answers, such that it could answer roughly 70 percent of the questions asked with greater than 80 percent precision in 3 seconds or less.

Finally, the Jeopardy Challenge represents a unique and compelling AI question similar to the one underlying Deep Blue (Hsu 2002): can a computer system be designed to compete against the best humans at a task thought to require high levels of human intelligence, and if so, what kind of technology, algorithms, and engineering is required? While we believe the Jeopardy Challenge is an extraordinarily demanding task that will greatly advance the field, we appreciate that this challenge alone does not address all aspects of QA and does not by any means close the book on the QA challenge the way that Deep Blue may have for playing chess.
The Jeopardy Challenge
Meeting the Jeopardy Challenge requires advancing and incorporating a variety of QA technologies including parsing, question classification, question decomposition, automatic source acquisition and evaluation, entity and relation detection, logical form generation, and knowledge representation and reasoning.

Winning at Jeopardy requires accurately computing confidence in your answers. The questions and content are ambiguous and noisy, and none of the individual algorithms are perfect. Therefore, each component must produce a confidence in its output, and individual component confidences must be combined to compute the overall confidence of the final answer. The final confidence is used to determine whether the computer system should risk choosing to answer at all. In Jeopardy parlance, this confidence is used to determine whether the computer will “ring in” or “buzz in” for a question. The confidence must be computed during the time the question is read and before the opportunity to buzz in. This is roughly between 1 and 6 seconds, with an average around 3 seconds.

Confidence estimation was very critical to shaping our overall approach in DeepQA. There is no expectation that any component in the system does a perfect job: all components post features of the computation and associated confidences, and we use a hierarchical machine-learning method to combine all these features and decide whether or not there is enough confidence in the final answer to attempt to buzz in and risk getting the question wrong.

In this section we elaborate on the various aspects of the Jeopardy Challenge.
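To make the confidence-combination idea concrete, the following is a minimal sketch of combining component-posted feature values with a logistic model and thresholding the result to decide whether to buzz in. The feature names, weights, bias, and threshold here are invented for illustration; DeepQA's actual hierarchical machine-learning method is far more elaborate.

```python
import math

def combined_confidence(features, weights, bias):
    """Combine component-posted feature values into one overall
    answer confidence with a logistic (sigmoid) model."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def should_buzz(confidence, threshold):
    """Ring in only when the combined confidence clears the risk threshold."""
    return confidence >= threshold

# Hypothetical feature values posted by individual components.
features = {"passage_support": 0.9, "type_match": 1.0, "popularity": 0.2}
# Hypothetical weights; in a trained system these would be learned,
# not set by hand.
weights = {"passage_support": 2.0, "type_match": 1.5, "popularity": 0.3}

conf = combined_confidence(features, weights, bias=-2.0)
print(f"confidence={conf:.3f}, buzz={should_buzz(conf, threshold=0.7)}")
```

The threshold encodes the risk trade-off the text describes: a wrong answer carries a penalty, so the system answers only when the learned combination of evidence is strong enough.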
The Categories
A 30-clue Jeopardy board is organized into six columns. Each column contains five clues and is associated with a category. Categories range from broad subject headings like “history,” “science,” or “politics” to less informative puns like “tutu much,” in which the clues are about ballet, to actual parts of the clue, like “who appointed me to the Supreme Court?” where the clue is the name of a judge, to “anything goes” categories like “potpourri.” Clearly some categories are essential to understanding the clue, some are helpful but not necessary, and some may be useless, if not misleading, for a computer.

A recurring theme in our approach is the requirement to try many alternate hypotheses in varying contexts to see which produces the most confident answers given a broad range of loosely coupled scoring algorithms. Leveraging category information is another clear area requiring this approach.
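The many-hypotheses, many-scorers theme above can be sketched in miniature as follows. This is a deliberately simplified illustration, not the DeepQA pipeline: the two scorers, the candidate records, and the simple score-averaging are all invented for the example, whereas a real system would use many retrieval-, type-, and category-based scoring algorithms and a learned combination.

```python
def best_answer(candidates, scorers):
    """Score every candidate answer with every loosely coupled scorer
    and keep the candidate with the highest average score."""
    def avg_score(cand):
        scores = [scorer(cand) for scorer in scorers]
        return sum(scores) / len(scores)
    return max(candidates, key=avg_score)

# Placeholder scorers for a clue in a ballet-pun category like "tutu much".
scorers = [
    lambda c: 1.0 if c["category_fit"] else 0.0,  # does the answer fit the category?
    lambda c: c["retrieval_score"],               # passage-support evidence
]

candidates = [
    {"answer": "Swan Lake", "category_fit": True, "retrieval_score": 0.7},
    {"answer": "Tchaikovsky", "category_fit": False, "retrieval_score": 0.9},
]
print(best_answer(candidates, scorers)["answer"])
```

Note how the category scorer overrides the raw retrieval evidence here: a candidate that fits the ballet reading of the pun category wins even though another candidate has stronger passage support, which is exactly why category information rewards the try-many-hypotheses approach.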
The Questions
There are a wide variety of ways one can attempt to characterize the Jeopardy clues: for example, by topic, by difficulty, by grammatical construction, by answer type, and so on. A type of classification that turned out to be useful for us was based on the primary method deployed to solve the clue. The
Articles
60 AI MAGAZINE