<?xml version="1.0" encoding="utf-8" ?>

<rss version="2.0" 
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:admin="http://webns.net/mvcb/"
   xmlns:dc="http://purl.org/dc/elements/1.1/"
   xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
   xmlns:wfw="http://wellformedweb.org/CommentAPI/"
   xmlns:content="http://purl.org/rss/1.0/modules/content/"
   >
<channel>
    <title>A view from the hill - Knowledge processing</title>
    <link>http://hillview.1on.de/</link>
    <description>Blogging Holgers little world</description>
    <dc:language>en</dc:language>
    <generator>Serendipity 1.3.1 - http://www.s9y.org/</generator>
    <pubDate>Wed, 11 Mar 2009 08:36:14 GMT</pubDate>

    <image>
        <url>http://hillview.1on.de/templates/default/img/s9y_banner_small.png</url>
        <title>RSS: A view from the hill - Knowledge processing - Blogging Holgers little world</title>
        <link>http://hillview.1on.de/</link>
        <width>100</width>
        <height>21</height>
    </image>

<item>
    <title>Wolfram alpha is going to continue to be alpha </title>
    <link>http://hillview.1on.de/archives/138-Wolfram-alpha-is-going-to-continue-to-be-alpha.html</link>
            <category>Knowledge processing</category>
    
    <comments>http://hillview.1on.de/archives/138-Wolfram-alpha-is-going-to-continue-to-be-alpha.html#comments</comments>
    <wfw:comment>http://hillview.1on.de/wfwcomment.php?cid=138</wfw:comment>

    <slash:comments>0</slash:comments>
    <wfw:commentRss>http://hillview.1on.de/rss.php?version=2.0&amp;type=comments&amp;cid=138</wfw:commentRss>
    

    <author>nospam@example.com (Holger Schauer)</author>
    <content:encoded>
    It&#039;s always the same pattern: some well-known figure comes up with some idea and everybody jumps onto the bandwagon. I have the strong suspicion that media continues to fall into one of the famous logical fallacies, &lt;a href=&quot;http://hillview.1on.de/exit.php?url_id=560&amp;amp;entry_id=138&quot;  onmouseover=&quot;window.status=&#039;http://www.nizkor.org/features/fallacies/appeal-to-authority.html&#039;;return true;&quot; onmouseout=&quot;window.status=&#039;&#039;;return true;&quot; title=&quot;http://www.nizkor.org/features/fallacies/appeal-to-authority.html&quot;&gt;appeal to authority&lt;/a&gt;. Or perhaps it&#039;s just clever marketing. This time the hype is around a new project by the physics guy Stephen Wolfram of Mathematica fame:  &lt;a href=&quot;http://hillview.1on.de/exit.php?url_id=561&amp;amp;entry_id=138&quot;  onmouseover=&quot;window.status=&#039;http://www.twine.com/item/122mz8lz9-4c/wolfram-alpha-is-coming-and-it-could-be-as-important-as-google&#039;;return true;&quot; onmouseout=&quot;window.status=&#039;&#039;;return true;&quot; title=&quot;http://www.twine.com/item/122mz8lz9-4c/wolfram-alpha-is-coming-and-it-could-be-as-important-as-google&quot;&gt;Wolfram alpha is coming and could be as important as Google&lt;/a&gt;. To cut a long boring blog entry short, I wouldn&#039;t hold my breath.&lt;br /&gt;
&lt;br /&gt;
&lt;a href=&quot;http://hillview.1on.de/exit.php?url_id=562&amp;amp;entry_id=138&quot;  onmouseover=&quot;window.status=&#039;http://en.wikipedia.org/wiki/Question_answering&#039;;return true;&quot; onmouseout=&quot;window.status=&#039;&#039;;return true;&quot; title=&quot;http://en.wikipedia.org/wiki/Question_answering&quot;&gt;Question answering&lt;/a&gt; is quite an old subdiscipline of computational linguistics, which has nevertheless seen a lot of progress in recent years. Still, it remains a pretty hard task even in closed domains or for a given training set (see the results of the various TREC conferences, where TREC stands for Text REtrieval Conference). Question answering in the open domain, which Wolfram alpha seems to address, is not one but several orders of magnitude harder: all of a sudden you no longer have a controlled terminology for queries, and the amount of information you have to index and search is enormous.&lt;br /&gt;
&lt;br /&gt;
There have been attempts in the past to deal with this problem. One particularly well-known approach was, or is, &lt;a href=&quot;http://hillview.1on.de/exit.php?url_id=563&amp;amp;entry_id=138&quot;  onmouseover=&quot;window.status=&#039;http://www.cyc.com/&#039;;return true;&quot; onmouseout=&quot;window.status=&#039;&#039;;return true;&quot; title=&quot;http://www.cyc.com/&quot;&gt;Cyc&lt;/a&gt;, which tried to build a huge knowledge base of everyday knowledge. There have been several attempts to use &lt;a href=&quot;http://hillview.1on.de/exit.php?url_id=564&amp;amp;entry_id=138&quot;  onmouseover=&quot;window.status=&#039;http://wordnet.princeton.edu/&#039;;return true;&quot; onmouseout=&quot;window.status=&#039;&#039;;return true;&quot; title=&quot;http://wordnet.princeton.edu/&quot;&gt;Wordnet&lt;/a&gt; and more recently &lt;a href=&quot;http://hillview.1on.de/exit.php?url_id=565&amp;amp;entry_id=138&quot; title=&quot;http://en.wikipedia.org&quot;  onmouseover=&quot;window.status=&#039;http://en.wikipedia.org&#039;;return true;&quot; onmouseout=&quot;window.status=&#039;&#039;;return true;&quot;&gt;Wikipedia&lt;/a&gt; as a source of answers to questions. Even Microsoft tried to build a knowledge base from the data of its Encarta product. So why don&#039;t we already have a well-functioning open-domain question answering system if people have been trying to build one for something like fifty years? See above: because it&#039;s really hard. Think about the parts involved: information extraction in itself is not easy. Query parsing, as easy as it sounds, isn&#039;t trivial either. Matching a query to an extracted piece of information usually requires a sophisticated system of knowledge representation or a similarly sophisticated statistical system. And text generation isn&#039;t a piece of cake either.&lt;br /&gt;
&lt;br /&gt;
So what is it that Wolfram alpha does differently? Given how little and how fuzzy the available information is, it&#039;s really hard to judge, but I have very serious doubts that they&#039;ve found the holy grail of QA. Unfortunately, we haven&#039;t seen any tests of their system, but I guess there&#039;s a reason for that. &lt;br /&gt;
As can be seen, for instance, from the often ridiculous search results offered by the previous hyped Google killer, &lt;a href=&quot;http://hillview.1on.de/exit.php?url_id=566&amp;amp;entry_id=138&quot;  onmouseover=&quot;window.status=&#039;http://cuil.com/&#039;;return true;&quot; onmouseout=&quot;window.status=&#039;&#039;;return true;&quot; title=&quot;http://cuil.com/&quot;&gt;cuil&lt;/a&gt;, overcoming real-world trouble like filtering out irrelevant or false data can be a major obstacle. Wolfram alpha of course doesn&#039;t have to filter out irrelevant web pages, but it has a problem that is probably even greater: filtering out false claims in its input data (assuming that they&#039;re operating on publicly available data), because otherwise they end up with wrong answers, which would be disastrous. But on what grounds could you automatically filter out &quot;wrong facts&quot;? You would already have to know the &quot;correct&quot; ones.&lt;br /&gt;
So this leaves us with a lot of handcrafting, say to build a knowledge base of facts from which they can answer questions. However, we&#039;ve seen in the past that any handcrafted knowledge base requires vast resources, and constantly so -- which is the major reason why Wikipedia is a real problem for traditional encyclopedia publishers. Now remember that with Cyc there has already been an attempt to build up such a knowledge base; they&#039;ve been working on it for roughly twenty-five years now, and it&#039;s still not a system that is useful in practice (see, for instance, the list of criticisms of the Cyc project on the &lt;a href=&quot;http://hillview.1on.de/exit.php?url_id=567&amp;amp;entry_id=138&quot;  onmouseover=&quot;window.status=&#039;http://en.wikipedia.org/wiki/Cyc&#039;;return true;&quot; onmouseout=&quot;window.status=&#039;&#039;;return true;&quot; title=&quot;http://en.wikipedia.org/wiki/Cyc&quot;&gt;Wikipedia page on Cyc&lt;/a&gt;).&lt;br /&gt;
&lt;br /&gt;
But let&#039;s go back to that sensational article linked to above: in all seriousness, even if they can build a system that can answer a lot of questions, it&#039;s quite unlikely that they&#039;re going to become as important as Google is. Besides the fact that Google has enormous resources, including a lot of people who know a lot about computational linguistics, and is hence likely to come up with a similar system if necessary, Google has not been just a simple search engine for quite some time now. Google nowadays is in no way comparable to what it was ten years ago: the major service it provides is information access in a large variety of ways, including multiple media sources and the integration of social interaction services. Fact retrieval and question answering (mainly based on texts) is certainly important, but information access encompasses a lot more.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 
    </content:encoded>

    <pubDate>Tue, 10 Mar 2009 21:36:00 +0100</pubDate>
    <guid isPermaLink="false">http://hillview.1on.de/archives/138-guid.html</guid>
    <category>linguistics</category>
<category>semantic web</category>

</item>
<item>
    <title>A cloud of words: yes, they can</title>
    <link>http://hillview.1on.de/archives/131-A-cloud-of-words-yes,-they-can.html</link>
            <category>Knowledge processing</category>
            <category>Politics</category>
    
    <comments>http://hillview.1on.de/archives/131-A-cloud-of-words-yes,-they-can.html#comments</comments>
    <wfw:comment>http://hillview.1on.de/wfwcomment.php?cid=131</wfw:comment>

    <slash:comments>0</slash:comments>
    <wfw:commentRss>http://hillview.1on.de/rss.php?version=2.0&amp;type=comments&amp;cid=131</wfw:commentRss>
    

    <author>nospam@example.com (Holger Schauer)</author>
    <content:encoded>
    I&#039;m not one of the people to get over-excited by the new US presidency, hence I normally wouldn&#039;t have any reason to say anything about it here. As a computational linguist, however, I can&#039;t let &lt;a href=&quot;http://hillview.1on.de/exit.php?url_id=528&amp;amp;entry_id=131&quot;  onmouseover=&quot;window.status=&#039;http://www.readwriteweb.com/archives/tag_clouds_of_obamas_inaugural_speech_compared_to_bushs.php&#039;;return true;&quot; onmouseout=&quot;window.status=&#039;&#039;;return true;&quot; title=&quot;http://www.readwriteweb.com/archives/tag_clouds_of_obamas_inaugural_speech_compared_to_bushs.php&quot;&gt;ReadWriteWeb&#039;s word cloud comparison of Obama&#039;s inaugural speech to former speeches&lt;/a&gt; go unnoticed. From a casual look, it seems as if Bush communicated a lot more clearly what his presidency would be about, at least looking back on the last years. Of course, all that talk about liberty and freedom was probably just advance justification for the aggressive actions to come. The word cloud analysis of Obama looks much broader but also much less specific to me -- which matches the impression I got from the media coverage of his previous election speeches, too. It will be very interesting to see if Obama can fulfill all the wishful thinking people approach his presidency with (I wouldn&#039;t hold my breath, though) -- and, at some future time, how one might look back on that word cloud and which interpretation one is going to associate with all these terms. 
    </content:encoded>

    <pubDate>Wed, 21 Jan 2009 09:26:03 +0100</pubDate>
    <guid isPermaLink="false">http://hillview.1on.de/archives/131-guid.html</guid>
    <category>linguistics</category>
<category>politics</category>
<category>semantic web</category>

</item>
<item>
    <title>The semantic web hype strikes again</title>
    <link>http://hillview.1on.de/archives/126-The-semantic-web-hype-strikes-again.html</link>
            <category>Knowledge processing</category>
    
    <comments>http://hillview.1on.de/archives/126-The-semantic-web-hype-strikes-again.html#comments</comments>
    <wfw:comment>http://hillview.1on.de/wfwcomment.php?cid=126</wfw:comment>

    <slash:comments>1</slash:comments>
    <wfw:commentRss>http://hillview.1on.de/rss.php?version=2.0&amp;type=comments&amp;cid=126</wfw:commentRss>
    

    <author>nospam@example.com (Holger Schauer)</author>
    <content:encoded>
    Today I stumbled upon a list of &lt;a href=&quot;http://hillview.1on.de/exit.php?url_id=510&amp;amp;entry_id=126&quot;  onmouseover=&quot;window.status=&#039;http://www.readwriteweb.com/archives/top_10_semantic_web_products_2008.php&#039;;return true;&quot; onmouseout=&quot;window.status=&#039;&#039;;return true;&quot; title=&quot;http://www.readwriteweb.com/archives/top_10_semantic_web_products_2008.php&quot;&gt;top ten semantic web products of 2008 (according to ReadWriteWeb)&lt;/a&gt;, which is interesting if only because of the wide range of products listed. There is, for instance, a browser extension that helps with blogging (&lt;a href=&quot;http://hillview.1on.de/exit.php?url_id=515&amp;amp;entry_id=126&quot;  onmouseover=&quot;window.status=&#039;http://www.zemanta.com/&#039;;return true;&quot; onmouseout=&quot;window.status=&#039;&#039;;return true;&quot; title=&quot;http://www.zemanta.com/&quot;&gt;Zemanta&lt;/a&gt;) right next to an API providing semantic analysis of texts (&lt;a href=&quot;http://hillview.1on.de/exit.php?url_id=511&amp;amp;entry_id=126&quot;  onmouseover=&quot;window.status=&#039;http://www.opencalais.com/&#039;;return true;&quot; onmouseout=&quot;window.status=&#039;&#039;;return true;&quot; title=&quot;http://www.opencalais.com/&quot;&gt;OpenCalais&lt;/a&gt;). Another interesting point to me is that there seem to be somewhat contradictory directions: while some, e.g. Yahoo with its &lt;a href=&quot;http://hillview.1on.de/exit.php?url_id=512&amp;amp;entry_id=126&quot;  onmouseover=&quot;window.status=&#039;http://developer.yahoo.com/searchmonkey/&#039;;return true;&quot; onmouseout=&quot;window.status=&#039;&#039;;return true;&quot; title=&quot;http://developer.yahoo.com/searchmonkey/&quot;&gt;SearchMonkey&lt;/a&gt;, seem to go in the direction of open access, others like &lt;a href=&quot;http://hillview.1on.de/exit.php?url_id=513&amp;amp;entry_id=126&quot;  onmouseover=&quot;window.status=&#039;http://www.hakia.com/&#039;;return true;&quot; onmouseout=&quot;window.status=&#039;&#039;;return true;&quot; title=&quot;http://www.hakia.com/&quot;&gt;Hakia&lt;/a&gt; are strictly closed-shop approaches. It will be very interesting to see where future development is headed, given that knowledge is not necessarily a unique thing -- as can easily be seen in the many different ontologies that have been built or in the edit wars going on in Wikipedia. These differences may well go hand in hand with company size or profit -- the odd start-up is on the list alongside large corporations. I don&#039;t like to sound like Cassandra, but given the recent financial crisis, I sincerely hope that the precious and small commercial flowers won&#039;t starve in another AI winter 2.0.&lt;br /&gt;
&lt;br /&gt;
But perhaps there is a chance that open source approaches bring some movement -- today I also happened to learn about &lt;a href=&quot;http://hillview.1on.de/exit.php?url_id=514&amp;amp;entry_id=126&quot;  onmouseover=&quot;window.status=&#039;http://nepomuk.semanticdesktop.org/xwiki/bin/view/Main1/&#039;;return true;&quot; onmouseout=&quot;window.status=&#039;&#039;;return true;&quot; title=&quot;http://nepomuk.semanticdesktop.org/xwiki/bin/view/Main1/&quot;&gt;Nepomuk&lt;/a&gt;, a &quot;Networked Environment for Personalized, Ontology-based Management of Unified Knowledge&quot; -- that happens to be integrated in the latest KDE version 4.0. From what I hear from my KDE-using friends, though, none of them seems to love that particular integration, but I can&#039;t judge whether that has more to do with the backend or the integration.&lt;br /&gt;
&lt;br /&gt;
 
    </content:encoded>

    <pubDate>Tue, 13 Jan 2009 20:28:08 +0100</pubDate>
    <guid isPermaLink="false">http://hillview.1on.de/archives/126-guid.html</guid>
    <category>linguistics</category>
<category>semantic web</category>

</item>
<item>
    <title>Spellchecking isn't exactly trivial today, either</title>
    <link>http://hillview.1on.de/archives/113-Spellchecking-isnt-exactly-trivial-today,-either.html</link>
            <category>Knowledge processing</category>
    
    <comments>http://hillview.1on.de/archives/113-Spellchecking-isnt-exactly-trivial-today,-either.html#comments</comments>
    <wfw:comment>http://hillview.1on.de/wfwcomment.php?cid=113</wfw:comment>

    <slash:comments>0</slash:comments>
    <wfw:commentRss>http://hillview.1on.de/rss.php?version=2.0&amp;type=comments&amp;cid=113</wfw:commentRss>
    

    <author>nospam@example.com (Holger Schauer)</author>
    <content:encoded>
    &lt;a href=&quot;http://hillview.1on.de/exit.php?url_id=469&amp;amp;entry_id=113&quot;  onmouseover=&quot;window.status=&#039;http://prog21.dadgum.com/29.html&#039;;return true;&quot; onmouseout=&quot;window.status=&#039;&#039;;return true;&quot;  title=&quot;http://prog21.dadgum.com/29.html&quot;&gt;James Hague blogs about how writing a spellchecker used to be a major feat of software engineering&lt;/a&gt;, the main point being that memory used to be the main problem and is no longer an issue today. While that is certainly true, the two issues (memory and the spellchecking problem) are not as closely related as James claims.&lt;br /&gt;
&lt;br /&gt;
Without making it explicit, he assumes that for spell checking to work you have to use a dictionary consisting of entire (correctly spelled) words. That would have been a very unwise way to go in the 80s and it still is nowadays. /usr/share/dict contains roughly 240,000 words and you find that impressive? I don&#039;t. The real number of words in a language is much larger, which is easier to see in languages like German where you can easily build compound words. Hence, linguists typically try to capture as much information as they can in word-building rules. &lt;br /&gt;
&lt;br /&gt;
This will leave you with three things: the word-building rules themselves, a dictionary of words which are not built according to the generally applicable rules (necessary, for instance, for foreign words prominent in your language of choice), and a set of stems, the basic building blocks of words. Contrary to James&#039; claim that 3-5 lines of Perl are everything you need, the thing to keep in mind here is that this is the only way to solve the &quot;spellchecking problem&quot;: any full-form approach to spellchecking is doomed to fail because a language&#039;s vocabulary does not consist of a closed set of words. How else could publishers of dictionaries go on selling new versions of their dictionaries every year? Another point is that in some languages (e.g. German) some aspects of spelling may depend on context (grammatical and semantic). That is, in order to decide whether it&#039;s correct to write &quot;Fahren&quot; you need a basic understanding of the grammatical composition of the sentence it appears in: it might be a noun derived from &quot;fahren&quot; (to drive) or it might stand at the beginning of a sentence (and sentence boundary detection is not always trivial either). Now, if your spellchecker just checks a dumb list of words, it can&#039;t decide whether &quot;Fahren&quot; is right or wrong.&lt;br /&gt;
&lt;br /&gt;
With regard to the implementation issue, using a non-full-form approach also helps: the nice thing is that you don&#039;t have as much redundancy and you can spellcheck a lot of words you&#039;ve never heard of just by following the rules. Unfortunately, the number of stems is still huge if you want to achieve good coverage, so of course clever tricks to deal with the list of stems are still required. There are of course still different routes you might take to implement spellchecking: what ispell or aspell are doing, for example, is mostly a full-form approach with only minor modifications to recognize some simple word production rules. This is not what you will find in state-of-the-art spell checkers.&lt;br /&gt;
&lt;br /&gt;
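A minimal sketch of the rule-based idea described above, in Python, under loudly stated assumptions: the stems, the exception list and the linking elements are invented toy data, and a single compounding rule stands in for real word-building rules. It is not how ispell, aspell or any production spellchecker works.&lt;br /&gt;
&lt;pre&gt;
```python
# Toy lexicon: every entry below is a hypothetical example, not real linguistic data.
STEMS = {"auto", "bahn", "hof", "tor", "haus"}   # basic building blocks
EXCEPTIONS = {"computer", "software"}            # words not covered by the rules
LINKERS = ("", "s", "n")                         # simplified linking elements


def is_known(word):
    """Return True if word is an exception, a stem, or a compound of stems."""
    word = word.lower()
    if word in EXCEPTIONS or word in STEMS:
        return True
    # Compounding rule: stem + optional linker + remainder that is itself known.
    for i in range(1, len(word)):
        head, rest = word[:i], word[i:]
        if head not in STEMS:
            continue
        for linker in LINKERS:
            # The remainder after the linker must be non-empty.
            if rest.startswith(linker) and rest != linker:
                if is_known(rest[len(linker):]):
                    return True
    return False
```
&lt;/pre&gt;
With this toy lexicon, is_known(&quot;Autobahn&quot;) and is_known(&quot;Bahnhofstor&quot;) succeed although neither compound is listed anywhere, while the misspelling &quot;Autobhan&quot; is rejected. Context-dependent cases such as &quot;Fahren&quot; are out of reach for a word-by-word check like this one.&lt;br /&gt;
&lt;br /&gt;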
 
    </content:encoded>

    <pubDate>Mon, 07 Jul 2008 21:27:00 +0200</pubDate>
    <guid isPermaLink="false">http://hillview.1on.de/archives/113-guid.html</guid>
    <category>linguistics</category>

</item>
<item>
    <title>Semantic search: a hard task and a piece of cake</title>
    <link>http://hillview.1on.de/archives/112-Semantic-search-a-hard-task-and-a-piece-of-cake.html</link>
            <category>Knowledge processing</category>
    
    <comments>http://hillview.1on.de/archives/112-Semantic-search-a-hard-task-and-a-piece-of-cake.html#comments</comments>
    <wfw:comment>http://hillview.1on.de/wfwcomment.php?cid=112</wfw:comment>

    <slash:comments>0</slash:comments>
    <wfw:commentRss>http://hillview.1on.de/rss.php?version=2.0&amp;type=comments&amp;cid=112</wfw:commentRss>
    

    <author>nospam@example.com (Holger Schauer)</author>
    <content:encoded>
    This started out as a reddit reply to &lt;a href=&quot;http://hillview.1on.de/exit.php?url_id=462&amp;amp;entry_id=112&quot;  onmouseover=&quot;window.status=&#039;http://www.readwriteweb.com/archives/semantic_search_the_myth_and_reality.php&#039;;return true;&quot; onmouseout=&quot;window.status=&#039;&#039;;return true;&quot;  title=&quot;http://www.readwriteweb.com/archives/semantic_search_the_myth_and_reality.php&quot;&gt;Alex Iskold&#039;s article about semantic search engine technology&lt;/a&gt;. Alex suggests that there are two reasons why we haven&#039;t seen the semantic web take over as the search technology of today: that semantic search engines won&#039;t provide better search results today and that the current ones have UI problems. At the same time, he claims (with regard to the search results) that &quot;semantic search is going to be big and it is going to help us answer questions that we simply cannot answer today - complex, inferencing queries asked over the entire web as if it was a database.&quot; I think that the UI point is irrelevant, or perhaps only relevant at this point in time.&lt;br /&gt;
&lt;br /&gt;
I agree with the search result point, but I&#039;m much more sceptical about the future results to expect.  &lt;br /&gt;&lt;a href=&quot;http://hillview.1on.de/archives/112-Semantic-search-a-hard-task-and-a-piece-of-cake.html#extended&quot;&gt;Continue reading &quot;Semantic search: a hard task and a piece of cake&quot;&lt;/a&gt;
    </content:encoded>

    <pubDate>Sat, 31 May 2008 12:14:58 +0200</pubDate>
    <guid isPermaLink="false">http://hillview.1on.de/archives/112-guid.html</guid>
    <category>linguistics</category>
<category>semantic web</category>

</item>

</channel>
</rss>