It's always the same pattern: some well-known figure comes up with some idea and everybody jumps on the bandwagon. I strongly suspect that the media keep falling for one of the famous logical fallacies,
appeal to authority. Or perhaps it's just clever marketing. This time the hype is around a new project by the physics guy Stephen Wolfram of Mathematica fame:
Wolfram Alpha is coming and could be as important as Google. To cut a long, boring blog entry short: I wouldn't hold my breath.
Question answering is a fairly old subdiscipline of computational linguistics which has nevertheless seen a lot of progress in recent years. Still, it remains a pretty hard task even in closed domains or for a given training set (see the results of the various TREC conferences; TREC stands for Text REtrieval Conference). Question answering in the open domain, which Wolfram Alpha seems to address, is not one but several orders of magnitude harder: all of a sudden you no longer have a controlled terminology for queries, and the amount of information you have to index and search is enormous.
There have been attempts in the past to deal with this problem. One particularly well-known approach was, or is,
Cyc, which tried to build a huge knowledge base of everyday knowledge. There have been several attempts to use
WordNet and, more recently,
Wikipedia as a source of answers to questions. Even Microsoft tried to build a knowledge base from the data of its Encarta product. So why don't we already have a well-functioning open-domain question answering system if people have been trying to build one for something like fifty years? See above: because it's really hard. Think about the parts involved. Information extraction in itself is not easy. Query parsing, as easy as it sounds, isn't trivial either. Matching a query to an extracted piece of information usually requires a sophisticated system of knowledge representation or a similarly sophisticated statistical system. And text generation isn't a piece of cake either.
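To make those four parts a bit more concrete, here is a deliberately naive sketch in Python. Everything in it is made up for illustration: the two-sentence corpus, the regular expressions and the function names (`extract_facts`, `parse_query`, `answer`) are my own assumptions and have nothing to do with how Wolfram Alpha, Cyc or any real QA system works. It only handles sentences of the form "X is the R of Y." and questions of the form "What is the R of Y?".

```python
import re

# Hypothetical toy corpus; a real system would have to index vastly more text.
CORPUS = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
]

# Assumed sentence and question shapes; anything else is simply not understood.
FACT_PATTERN = re.compile(
    r"^(?P<subject>[\w ]+?) is the (?P<relation>[\w ]+?) of (?P<object>[\w ]+?)\.$"
)
QUERY_PATTERN = re.compile(
    r"^What is the (?P<relation>[\w ]+?) of (?P<object>[\w ]+?)\?$",
    re.IGNORECASE,
)


def extract_facts(sentences):
    """Information extraction: turn matching sentences into (relation, object) -> subject."""
    facts = {}
    for sentence in sentences:
        match = FACT_PATTERN.match(sentence.strip())
        if match:
            key = (match["relation"].lower(), match["object"].lower())
            facts[key] = match["subject"]
    return facts


def parse_query(question):
    """Query parsing: map a question onto the same (relation, object) key, or None."""
    match = QUERY_PATTERN.match(question.strip())
    if match is None:
        return None
    return (match["relation"].lower(), match["object"].lower())


def answer(question, facts):
    """Matching plus text generation: look up the key and fill a fixed template."""
    key = parse_query(question)
    if key is None:
        return "Sorry, I could not parse that question."
    subject = facts.get(key)
    if subject is None:
        return "Sorry, I do not know."
    relation, obj = key
    return f"{subject} is the {relation} of {obj.title()}."


if __name__ == "__main__":
    facts = extract_facts(CORPUS)
    print(answer("What is the capital of France?", facts))
    print(answer("Which city is France's capital?", facts))
```

The first question gets an answer; the second one, which asks for exactly the same fact in slightly different words, already fails at the parsing step. That gap between one fixed phrasing and the way people actually ask questions is the open-domain problem in miniature.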
So what is it that Wolfram Alpha does differently? From the very vague information available so far it's really hard to judge, but I have very serious doubts that they've found the holy grail of QA. Unfortunately, we haven't seen any tests of their system, but I guess there's a reason for that.
As can be seen, for instance, from the often ridiculous search results that the previous Google-killer hype,
Cuil, offers, overcoming real-world trouble like filtering out irrelevant or false data can be a major obstacle. Wolfram Alpha of course doesn't have to filter out irrelevant web pages, but it has a problem that is probably even greater: filtering out false claims in its input data (assuming that they're operating on publicly available data), because otherwise they end up with wrong answers, which would be disastrous. But on what grounds could you automatically filter out "wrong facts"? You would already have to know the "correct" ones.
So this leaves us with a lot of handcrafting, say to build a knowledge base of facts from which they can answer questions. However, we've seen in the past that any handcrafted knowledge base requires vast resources, and constantly so, which is the major reason why Wikipedia is a real problem for traditional encyclopedia publishers. Now remember that Cyc was already an attempt to build up such a knowledge base; they've been working on it for roughly twenty-five years now, and it's still not a system that is useful in practice (see, for instance, the list of criticisms of the Cyc project on the
Wikipedia page on Cyc).
But let's go back to that sensational article linked to above: in all seriousness, even if they can build a system that answers a lot of questions, it's quite unlikely that they're gonna become as important as Google is. For one thing, Google has enormous resources, including a lot of guys who know a lot about computational linguistics, and is hence likely to come up with a similar system if necessary. For another, Google has been more than a simple search engine for quite some time now. Google nowadays is in no way comparable to what it was ten years ago: the major service it provides is information access in a large variety of ways, including multiple media sources and the integration of social interaction services. Fact retrieval and question answering (mainly based on texts) are certainly important, but information access encompasses a lot more.