Show HN: A Natural Language Query Engine Without Machine Learning

blog.ayoungprogrammer.com

115 points by youngprogrammer 9 years ago · 21 comments

charlieegan3 9 years ago

I think you might get better results in the first stage using the dependency parse from CoreNLP, rather than the phrasal parse. Online demo at http://corenlp.run

If you're willing to drop CoreNLP there's also https://demos.explosion.ai/displacy/ that's worth checking out.
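For readers who haven't worked with one: a dependency parse gives labeled head→dependent edges instead of nested phrases, so pulling out a subject or object becomes a simple edge lookup. A minimal sketch in pure Python, with a hand-built edge list standing in for real CoreNLP/spaCy output (the relation labels follow common dependency conventions and are my assumption, not anything from the post's code):

```python
# Dependency parse of "IBM announced a new computer today",
# hand-built to stand in for CoreNLP/spaCy output.
# Each edge is (head, relation, dependent).
edges = [
    ("announced", "nsubj", "IBM"),
    ("announced", "dobj", "computer"),
    ("computer", "det", "a"),
    ("computer", "amod", "new"),
    ("announced", "npadvmod", "today"),
]

def find(edges, head, rel):
    """Return the first dependent of `head` with relation `rel`, or None."""
    for h, r, d in edges:
        if h == head and r == rel:
            return d
    return None

root = "announced"
subject = find(edges, root, "nsubj")   # "IBM"
obj = find(edges, root, "dobj")        # "computer"
print(subject, obj)
```

With a phrasal (constituency) parse, the same extraction requires walking nested NP/VP subtrees, which is why the dependency representation tends to be the easier starting point for question answering.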

  • alanlit 9 years ago

    Nice work by the O.P.

    Amusingly, a year or so ago I took the Stanford dependency parser and fed its output tree into a Prolog system to try to pull out the semantics. It was used to analyze business news (getting at the who's, what's, and why's).

    The easiest approach was to wrap a very simple DSL around Prolog (which, BTW, Prolog is great at). Then in the DSL (which still retained logical variables and backtracking) you could write things like:

        %% Simple statements -- root is an announcement word whose subject and object tell the story.
        %% 'IBM announced a new computer today'
        announce(Who, About, What) ==>
            s+root(['announc', 'releas', 'introduc', 'launch', 'unveil', 'reveal', 'agre']),
            #Dep1,
            subject(Who),
            Dep1 >> object(About, What).

        %% 'IBM has announced a partnership ...' is caught by the above. But 'IBM has entered into a partnership ...' needs
        %% a little more work
        announce(Who, About, announcement) ==>
            s+root(['enter']),
            #Obj,
            subject(Who),
            Obj >> prep_pobj_chain(PPC),
            {PPC = [Prep|About]}.

    I think a Prolog-based query planner as a front end to Sparql on Wikidata could be quite interesting.

    Alanl

    • alanlit 9 years ago

      Bah -- try again so it is readable !!

          %% Simple statements -- root is an announcement word whose subject and object tell the story.
          %% 'IBM announced a new computer today'
          announce(Who, About, What) ==> 
              s+root(['announc', 'releas', 'introduc', 'launch', 'unveil', 'reveal', 'agre']),
              #Dep1,
              subject(Who), 
              Dep1 >> object(About, What).
      	
          %% 'IBM has announced a partnership ...' is caught by the above. But 'IBM has entered into a partnership ...' needs
          %% a little more work
          announce(Who, About, announcement) ==> 
              s+root(['enter']),
              #Obj,
              subject(Who),
              Obj >> prep_pobj_chain(PPC),
              {PPC = [Prep|About]}.
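For readers without Prolog, the shape of the first rule can be mimicked in plain Python over (head, relation, dependent) triples. This is only an illustrative sketch; the relation names and stem matching are my assumptions, not the commenter's actual DSL or the parser's real output:

```python
# Sketch of the announce/3 rule: fire when the root's stem is an
# announcement verb, then pull subject and object from the parse edges.
ANNOUNCE_STEMS = ("announc", "releas", "introduc", "launch",
                  "unveil", "reveal", "agre")

def announce(root, edges):
    """Return (who, what) if the parse matches the announce pattern, else None."""
    if not root.startswith(ANNOUNCE_STEMS):
        return None
    deps = {rel: dep for head, rel, dep in edges if head == root}
    who, what = deps.get("nsubj"), deps.get("dobj")
    if who and what:
        return (who, what)
    return None

# "IBM announced a new computer today"
edges = [("announced", "nsubj", "IBM"), ("announced", "dobj", "computer")]
print(announce("announced", edges))
```

What Python can't easily replicate is the part the commenter highlights: logical variables and backtracking, which let several such rules be tried declaratively until one unifies.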

steinsgate 9 years ago

Nice work! You said that you avoided machine learning because labeled data is hard to find. What about unsupervised approaches?

Frankly speaking, I am a bit skeptical about pattern matching algorithms for answering questions. It would help if you showed some kind of stats about your algorithm's performance on a diverse question set. For example, you can scrape simple quiz questions (and answers) from quiz sites [1] and report back on the performance.

[1] http://www.quiz-zone.co.uk/questionsbydifficulty/1/0/answers...

drdeca 9 years ago

In addition to the questions it does answer well, it also has these answers:

Q: "What is purpose" A: "Justin Bieber album"

Q: "What is a car?" A: "country in Africa"

Q: "What is a male?" A: "capital of Maldives"

Q: "What is a female?" A: "human who is female (use with Property:P21 sex or gender). For groups of females use with ''subclass of (P279)''"

My point in this comment is just that when it does give an odd answer, the result can be funny, not that it sometimes gives odd answers.

mrob 9 years ago

This seems almost completely useless. I tried ten questions, and only one got an answer, which was incorrect (it misunderstood the Moby Dick question, answering "novel by Herman Melville"). I think even Ask Jeeves back in the 90s had better performance than this. Questions tried:

how many lines of resolution are there in an ntsc television signal?

what is the melting point of tin/lead eutectic solder?

what species of whale was moby dick?

what grain is most often used to make beer?

what is the boiling point of water?

how many chromosomes does a normal human have?

what animal is known as "man's best friend"?

what fps did id software release in 1993?

what is the largest known prime number?

what is the clock rate of the arduino uno?

As a comparison, Google gives 8 correct answers directly (either as a special info box, or as a highlighted part of a web page), 1 correct answer as the 2nd search result (Doom), and 1 incorrect answer (largest known prime).

  • azpoliak1 9 years ago

    "This seems almost completely useless" seems pretty harsh. Someone making a cool project, open-sourcing it, and documenting it really well is something that should be praised. Of course Google is going to do much better; it's a company focused on search.

  • charlieegan3 9 years ago

    Bear in mind that it's only as good as the data in the LOD source.

imh 9 years ago

These things are always so interesting in their totally inhuman failure cases. It can tell me George Washington was born in 1732, but doesn't know which planet America is on (much less which planet George Washington was born on).

Also, it seems to have issues formatting dates before 1900 (for the birthday question, the answer it returns is more of an error message than an answer: "year=1732 is before 1900; the datetime strftime() methods require year >= 1900").
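That message matches Python 2's strftime(), which rejects years before 1900. A workaround sketch, assuming the value being formatted is a datetime.date (which is my assumption about the project's internals), is simply to avoid strftime() for old dates:

```python
from datetime import date

# George Washington's birthday; under Python 2, strftime() raises
# ValueError for any year before 1900.
birthday = date(1732, 2, 22)

# isoformat() and manual formatting handle any supported year.
print(birthday.isoformat())                              # 1732-02-22
print("{:d}-{:02d}-{:02d}".format(birthday.year,
                                  birthday.month,
                                  birthday.day))         # 1732-2-22 zero-padded
```

Python 3 later lifted the year >= 1900 restriction on strftime() itself, so upgrading the interpreter is the other fix.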

ecesena 9 years ago

Partially related: has anyone worked on natural language queries with time expressions in them? Imagine analytics queries where you want to count the number of events/unique users, given certain conditions and in a certain time window. I'm particularly interested in the time aspect of it.
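Short of full temporal normalization (CoreNLP ships SUTime for exactly this), a lightweight starting point is mapping a few relative expressions to concrete date windows. A toy sketch; the supported phrases and the anchor date are assumptions for illustration:

```python
import re
from datetime import date, timedelta

def time_window(text, today):
    """Map a relative time expression in `text` to a (start, end) date window."""
    m = re.search(r"last (\d+) days", text)
    if m:
        n = int(m.group(1))
        return (today - timedelta(days=n), today)
    if "yesterday" in text:
        y = today - timedelta(days=1)
        return (y, y)
    return None  # expression not recognized

today = date(2016, 5, 1)  # anchor date; real systems use the query time
print(time_window("count unique users in the last 7 days", today))
```

The hard part in practice is that the anchor is context-dependent ("last quarter" means something different mid-quarter) and expressions compose ("the week before last"), which is why SUTime-style rule cascades exist.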

fspeech 9 years ago

Have you studied Prolog? Its matching (logical unification) capability may give you some more ideas.
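For readers who haven't seen it, the heart of Prolog's matching is a small recursive algorithm. A minimal unification sketch in Python (variables are capitalized strings by convention; no occurs check and only shallow dereferencing, so purely illustrative):

```python
def is_var(t):
    """Treat capitalized strings as logic variables, Prolog-style."""
    return isinstance(t, str) and t[:1].isupper()

def unify(a, b, bindings):
    """Unify two terms, returning extended bindings, or None on failure."""
    a = bindings.get(a, a) if is_var(a) else a
    b = bindings.get(b, b) if is_var(b) else b
    if a == b:
        return bindings
    if is_var(a):
        return {**bindings, a: b}
    if is_var(b):
        return {**bindings, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            bindings = unify(x, y, bindings)
            if bindings is None:
                return None
        return bindings
    return None

# announce(Who, About) against announce(ibm, computer)
print(unify(("announce", "Who", "About"),
            ("announce", "ibm", "computer"), {}))
```

Add backtracking over a database of such facts and you have the skeleton of the rule-matching the other commenter built their DSL on.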

greglindahl 9 years ago

Very interesting! Nice to see how little code it is. I wonder how much work it would be to get it to answer questions like "What is the biggest planet?", or to fix the way "Who was Prime Minister of Canada in 1945" drops "of Canada"?

atoko 9 years ago

This is cool! I like how you've iterated on a central concept (NLP) with different codebases.

Tip: The link to the source is pointing to github pages, which hasn't been set up.

mrcabada 9 years ago

This is nice! Would it be possible to run the code with other language models? (Spanish, German, and any other CoreNLP language model)

  • youngprogrammerOP 9 years ago

    Yes it should be possible! You would need to add the grammar matching rules for those languages though.

youngprogrammerOP 9 years ago

Demo should be working now. The Stanford parser kept dying from running out of memory, so I moved it to another box.

billconan 9 years ago

This is cool! Is it easy to convert a MediaWiki to the graph store your system reads?

  • smsm42 9 years ago

    While the note the other commenter left is correct (Wikidata is not your regular MediaWiki), you can also look at DBpedia, which does pretty much what you suggested. The TL;DR answer would be: "possible, but harder than it seems".

  • Tpt 9 years ago

    Wikidata (the wiki used) is not a regular MediaWiki but hosts special pages with structured data. See https://www.wikidata.org

alexcaps 9 years ago

Couldn't tell me who the CEO of Apple is... :(
