I have recently been dabbling in Natural Language Processing, in particular Question Answering. I have been fascinated by the success of IBM Watson and have gradually came to believe that this technology can serve as a great basis of autonomous agents operating in the complex world of human knowledge. (I later came across Project Aristo – I’m not alone.) This approach, compared to projects like OpenCog that aim to create autonomous agents understanding and operating in the physical world, seems to offer many advantages – but let’s talk about that some other time.
Let’s say we wanted to take a stab on approximating IBM Watson with easily available technology, in “at home” conditions (or rather, “at hackerspace” – I gave this aim a temporary callsign “Project Brmson”). What’s the best we can do?
So I took a look at the current open source question-answering technologies and found – well, just one, and none that would be immediately usable by anyone. I have put together a short survey of the current landscape.
The only OSS framework I found that (i) could be used with not-so-many modifications to produce something functional, and (ii) would be a good base to build a truly good system on, is OAQA / OpenQA. It seems appealing from multiple viewpoints – it builds on the UIMA unstructured data processing platform which is also at the basis of IBM Watson, it originates at CMU which collaborated with IBM in this area; and, well, it’s the only platform that already exists anyway, so it’s a good starting point for someone who has no prior clue about the field. A honorable mention goes to OpenEphyra, basically a non-UIMA OAQA predecessor by the same institution; it’s not a good base to use for new systems, but can be sourced for a lot of NLP functionality.
In my first stab, I looked if there is actually a working QA system built on top of OAQA, and the answer was non-obvious. There is a helloqa project, but its master branch can currently do nothing useful. However, there is also a prototype branch that can actually answer some terrorism-related questions! It doesn’t work out of the box, but our fork does if you follow the instructions. But overally the project seems to be a bit of a hack and not a good base for a universal system usable by anyone but the original author.
So I set out to rewrite the helloqa-prototype from scratch on top of OAQA and build a different, clean and extendable QA pipeline (that shares bits of the original code and is much simpler). Thus, behold the project BlanQA! :-)
BlanQA is focused on universality, practicality and user-friendliness. That means there is a relatively detailed documentation and easy to follow installation instructions (try BlanQA out yourself!). By default, BlanQA offers interactive mode and will answer on top of Project Gutenberg corpus; but you can also connect it to IRC (#brmson @ freenode) or run on top of Wikipedia.
BlanQA is still a very stupid program at this point. It gets the answer right about 10-30% of the time, depending on how nicely you ask. But it’s more important as a base on top of which you can add clever algorithms (the smartest parts of BlanQA are currently outsourced from the OpenEphyra project, mainly guessing the type of the answer – is it a person? location? amount of something?). And if you want an OSS question-answering engine now, BlanQA is where to turn!
I want to develop this further, but the way ahead remains a little unclear. The thing is, OAQA appears to have significant architectural problems, as I realized while I continued hacking BlanQA and learning more about both OAQA and the UIMA framework it builds on top of. The rest of this section is a bit technical, c.f. also a quick intro to BlanQA architecture.
The basic UIMA principle is that each artifact (in this case: question, document/passage, answer) should have its own CAS (“piece of data” with a set of annotations and other featuresets derived from it) with a dedicated type system and appropriate Sofa (view of this piece of data). This would enable easy creation of stand-off annotations of e.g. fetched documents.
However, the OAQA model works with just a single CAS that has just the question text set as a Sofa and then a variety of types mashed together, partitioned only into phase-based views. This seems to me as a substantially less appealing option – it doesn’t allow to use third-party UIMA annotators that expect their subject to be the Sofa, it might be harmful for scaleout and it seems generally awkward to use; I actually have hard time seeing what advantages does using UIMA bring on the table in this model.
So it seems the way forward for BlanQA (or likely a differently-named successor) is to break away of OAQA and build directly on top of UIMA (possibly with a hacked version of uima-ecd that supports multiple CAS, but that seems as a bit intimidating proposition).
Tue Jan 28 2014 update: Note that we have started work on a new Question Answering engine YodaQA built on UIMA from scratch.