{"id":317,"date":"2014-01-27T01:25:00","date_gmt":"2014-01-26T23:25:00","guid":{"rendered":"http:\/\/log.or.cz\/?p=317"},"modified":"2014-01-28T21:41:46","modified_gmt":"2014-01-28T19:41:46","slug":"brmson-blanqa","status":"publish","type":"post","link":"https:\/\/log.or.cz\/?p=317","title":{"rendered":"Brmson \/ BlanQA"},"content":{"rendered":"<p>I have recently been dabbling in Natural Language Processing, in particular <b>Question Answering<\/b>. I have been fascinated by the success of <a href=\"http:\/\/www-03.ibm.com\/innovation\/us\/watson\/\">IBM Watson<\/a> and have gradually come to believe that this technology can serve as a great basis for autonomous agents operating in the complex world of human knowledge. (I later came across <a href=\"http:\/\/www.allenai.org\/TemplateGeneric.aspx?contentId=8\">Project Aristo<\/a> &#8211; I&#8217;m not alone.) This approach, compared to projects like <a href=\"http:\/\/opencog.org\/\">OpenCog<\/a> that aim to create autonomous agents understanding and operating in the physical world, seems to offer many advantages &#8211; but let&#8217;s talk about that some other time.<\/p>\n<p>Let&#8217;s say we wanted to take a stab at approximating IBM Watson with easily available technology, in &#8220;at home&#8221; conditions (or rather, &#8220;at hackerspace&#8221; &#8211; I gave this aim a temporary callsign &#8220;Project Brmson&#8221;). What&#8217;s the best we can do?<\/p>\n<p>So I took a look at the current <b>open source question-answering technologies<\/b> and found &#8211; well, just <i>one<\/i>, and none that would be immediately usable by anyone. 
I have put together a <b><a href=\"http:\/\/brmlab.cz\/project\/brmson#knowledge_base\">short survey<\/a><\/b> of the current landscape.<\/p>\n<p>The only OSS framework I found that (i) could be used without too many modifications to produce something functional, and (ii) would be a good base to build a truly good system on, is <b><a href=\"http:\/\/oaqa.github.io\/\">OAQA \/ OpenQA<\/a><\/b>. It seems appealing from multiple viewpoints &#8211; it builds on the UIMA unstructured data processing platform, which is also the basis of IBM Watson; it originates at CMU, which collaborated with IBM in this area; and, well, it&#8217;s the only platform that already exists anyway, so it&#8217;s a good starting point for someone who has no prior clue about the field. An honorable mention goes to <b><a href=\"http:\/\/www.ephyra.info\/\">OpenEphyra<\/a><\/b>, basically a non-UIMA OAQA predecessor by the same institution; it&#8217;s not a good base to use for new systems, but can be mined for a lot of NLP functionality.<\/p>\n<p>In my first stab, I checked whether there actually is a working QA system built on top of OAQA, and the answer was non-obvious. There is a <b><a href=\"https:\/\/github.com\/oaqa\/helloqa\/\">helloqa<\/a><\/b> project, but its <i>master<\/i> branch can currently do nothing useful. However, there is also a <i>prototype<\/i> branch that can actually answer some terrorism-related questions! It doesn&#8217;t work out of the box, but <b><a href=\"https:\/\/github.com\/brmson\/helloqa\">our fork<\/a><\/b> does if you follow the <a href=\"http:\/\/brmlab.cz\/project\/brmson\/helloqa-prototype-howto\">instructions<\/a>. But overall, the project seems to be a bit of a hack and not a good base for a universal system usable by anyone but the original author.<\/p>\n<hr>\n<p>So I set out to rewrite the helloqa-prototype from scratch on top of OAQA and build a different, clean and extensible QA pipeline (that shares bits of the original code and is much simpler). 
Thus, behold the project <b><a href=\"https:\/\/github.com\/brmson\/blanqa\">BlanQA<\/a><\/b>! :-)<\/p>\n<p>BlanQA is focused on universality, practicality and user-friendliness. That means there is relatively detailed documentation and easy-to-follow <a href=\"https:\/\/github.com\/brmson\/blanqa\/blob\/master\/README.md#installation-instructions\">installation instructions<\/a> (try BlanQA out yourself!). By default, BlanQA offers an interactive mode and will answer on top of the Project Gutenberg corpus; but you can also connect it to IRC (<a href=\"http:\/\/webchat.freenode.net\/?channels=#brmson\">#brmson<\/a> @ freenode) or run it on top of Wikipedia.<\/p>\n<p>BlanQA is still a <i>very stupid program<\/i> at this point. It gets the answer right about 10-30% of the time, depending on how nicely you ask. But it&#8217;s more important as a base on top of which you can add clever algorithms (the smartest parts of BlanQA are currently outsourced from the OpenEphyra project, mainly guessing the type of the answer &#8211; <i>is it a person? location? amount of something?<\/i>). And if you want an OSS question-answering engine now, BlanQA is where to turn!<\/p>\n<hr>\n<p>I want to develop this further, but the way ahead remains a little unclear. The thing is, OAQA appears to have <b>significant architectural problems<\/b>, as I realized while I continued hacking on BlanQA and learning more about both OAQA and the UIMA framework it builds on top of. The rest of this section is a bit technical; see also <a href=\"https:\/\/github.com\/brmson\/blanqa\/blob\/master\/README.md#a-brief-walkthrough\">a quick intro to BlanQA architecture<\/a>.<\/p>\n<p>The basic UIMA principle is that each artifact (in this case: question, document\/passage, answer) should have its own CAS (&#8220;piece of data&#8221; with a set of annotations and other feature sets derived from it) with a dedicated type system and appropriate Sofa (view of this piece of data). 
This would enable easy creation of stand-off annotations of e.g. fetched documents.<\/p>\n<p>However, the OAQA model works with just a single CAS that has just the question text set as a Sofa and then a variety of types mashed together, partitioned only into phase-based views. This seems to me a substantially less appealing option &#8211; it doesn&#8217;t allow using third-party UIMA annotators that expect their subject to be the Sofa, it might hurt scale-out and it seems generally awkward to use; I actually have a hard time seeing what advantages using UIMA brings to the table in this model.<\/p>\n<p>So it seems the way forward for BlanQA (or likely a differently-named successor) is to break away from OAQA and build directly on top of UIMA (possibly with a hacked version of uima-ecd that supports multiple CASes, but that seems a rather intimidating proposition).<\/p>\n<hr>\n<p><b>Tue Jan 28 2014 update:<\/b> Note that we have started work on a new Question Answering engine <a href=\"http:\/\/github.com\/brmson\/yodaqa\">YodaQA<\/a> built on UIMA from scratch.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I have recently been dabbling in Natural Language Processing, in particular Question Answering. I have been fascinated by the success of IBM Watson and have gradually come to believe that this technology can serve as a great basis for autonomous agents operating in the complex world of human knowledge. 
(I later came across Project Aristo [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6],"tags":[123,60,126,127,124,125,9],"class_list":["post-317","post","type-post","status-publish","format-standard","hentry","category-software","tag-ai","tag-brmlab","tag-brmson","tag-java","tag-nlp","tag-qa","tag-research"],"_links":{"self":[{"href":"https:\/\/log.or.cz\/index.php?rest_route=\/wp\/v2\/posts\/317","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/log.or.cz\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/log.or.cz\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/log.or.cz\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/log.or.cz\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=317"}],"version-history":[{"count":3,"href":"https:\/\/log.or.cz\/index.php?rest_route=\/wp\/v2\/posts\/317\/revisions"}],"predecessor-version":[{"id":319,"href":"https:\/\/log.or.cz\/index.php?rest_route=\/wp\/v2\/posts\/317\/revisions\/319"}],"wp:attachment":[{"href":"https:\/\/log.or.cz\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=317"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/log.or.cz\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=317"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/log.or.cz\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=317"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}