June 6th, 2012
Over the weekend, we stumbled on a paper called Towards Building Robust Natural Language Interfaces to Databases, which looks like it was published back in 2007 or 2008.
In section 6, the authors say "The PRECISE group reported a side experiment in which a student took over 15 hours to build a Geoquery 250 NLI using Microsoft’s EnglishQuery tool. The resulting system achieved rather poor results for such an expensive effort – approximately 80% precision and 55% recall, yielding a correctness of approximately 45%. Our limited experience with the Microsoft English query tool was rather frustrating as well, but it would be interesting to repeat this experiment and also see how other commercial systems such as Progress Software’s EasyAsk and Elfsoft’s ELF fare."
So on Monday we built a complete interface. Tuesday we checked it over and polished a few rough edges.
The interface would have 100% accuracy, except that their data is inconsistent, which makes it impossible to definitively answer #163, "what is the population of the major cities in wisconsin?" For example, if we asked this question about Hawaii, would the answer be the 4 cities listed in the State table's City1, City2, City3 and City4 fields? Or would it be only the 3 cities listed for Hawaii in the City table?
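To make the ambiguity concrete, here are the two readings as Access SQL (a sketch; the field names are our guess at the imported schema):

SELECT city_name, population
FROM City
WHERE state_name = "hawaii";

SELECT City1, City2, City3, City4
FROM State
WHERE state_name = "hawaii";

The first reading returns 3 cities, the second 4, and nothing in the data says which list "major cities" should mean.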
Besides this (which could be fixed in minutes, if we got an answer one way or the other), constructing the interface was pretty straightforward. The first step was to import their tables into Microsoft Access, and because some of their relationships were expressed as lists, we wrote a few queries to render them as recordsets. For example, the Border table looks like this:
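A representative row (illustrative contents, in the quoted-list format the query below matches against):

state      states_that_border_it
alabama    'mississippi','tennessee','georgia','florida'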
so we wrote a query called [Names of Bordering States], defined as follows:
SELECT Border.state, State.state_name AS BorderState
FROM State INNER JOIN Border
    ON InStr(Border.states_that_border_it, "'" & State.state_name & "'") > 0;
We did the same with rivers, and also added a query, Capitals, to list information about just the capital cities:
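In sketch form, Capitals just joins each state to the city named as its capital, something like this (the capital and city_name field names are our guess at the imported schema):

SELECT City.*
FROM City INNER JOIN State
    ON City.state_name = State.state_name
    AND City.city_name = State.capital;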
Then we ran the interface generator in automatic mode, and found that pretty much every style of question was handled.
What exactly does that mean? It doesn't mean that every question was answered. It means that, considering what information the question was expected to retrieve, we could quickly formulate a question to get just that information.
For example, take the question "what is the highest point in the US?"
Because the HighLow table has a field called highest_point, a naive answer to this question might list all the highest points in the US, just as if we'd asked "what are the highest points in the US?" This isn't what we want at all. So we need to add a rule in the "When I Say, What I Mean" section:
what is the highest point in the us ==> show highest_point that has largest highest_elevation
This is the step at which people usually ask, "Shouldn't the system figure that out by itself? Isn't that the whole point of a natural language system?"
Maybe in the future. Today, the best NL system in the world (which is this one) lets you create a robust, easily customizable NL interface, if you're willing to address the unavoidable ambiguities in language and tell the system just what it is you want to see. Once you've done that, you can simply type "what is the highest point in the us" -- the system will first apply these substitution rules to get a more targeted query, i.e., "show highest_point that has largest highest_elevation" -- and finally, it will process that query, navigating all the requirements of relational database systems (SELECTs, JOINs, etc.) to produce:
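Mount McKinley

In Access SQL terms, that final query is roughly equivalent to the following (a sketch; the SQL ELF actually generates may be shaped differently):

SELECT TOP 1 highest_point, highest_elevation
FROM HighLow
ORDER BY highest_elevation DESC;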
OK, but how many of these substitutions did we have to write for these 250 questions? The answer: 130
This really isn't as much work as it seems. First of all, several of the rules are only required because the questions contain errors, probably because they were designed for a system running Prolog, which handles apostrophes oddly. Normally you wouldn't expect errors like "whats the largest city?" -- but OK, if it happens, we have a simple rule:
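whats ==> what is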
Also, many of the rules are generalizations, added to answer questions that don't even appear in the 250 questions. For example, since we needed a rule for "most populous state" (question 161), we added these six just in case:
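(the exact six aren't reproduced here; these are illustrative variants in the same pattern as the rule for question 161)

most populous city ==> city with largest population
least populous state ==> state with smallest population
least populous city ==> city with smallest population
most populated state ==> state with largest population
most populated city ==> city with largest population
least populated state ==> state with smallest population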
Here's the complete set of 130. As mentioned, this was created in a single day. That's possible because adding a rule very rarely changes the behavior of another rule. At worst, we might have to add an ordering label like 1/2 or a/b to ensure that the rules are applied in the right order. Seriously, do any of these rules look too fearsome? How about:
The problem was to distinguish between questions like "which rivers run through new mexico?" and "which rivers run through states bordering new mexico?" There might be many ways to solve this issue, for example using substitutions like those above. But in the ELF system there's a very simple solution.
Each database can have as many different interfaces as you like; we call them Views. We can also write scripts that decide which View should be used, based on the user, the question, etc. In this case, we can see immediately that if the word "border" doesn't appear in the question, we can't be interested in rivers flowing through border states. Ergo:

Step 1: make a copy of the Relationship map with the Border relationships taken out.
Step 2: rerun the analysis done for GeoQuery using this modified Relationship map (we called the View NoBorders).
Step 3: add a View selector script into the Settings window's script panel. Enter the function's name, eg. ViewPicker, in the Name panel.
Step 4: enter ViewPicker in the slot for Question Script on the View tab of Settings, and check off the Question Script box to activate it.
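Here's a sketch of what a ViewPicker script could look like, assuming a VBScript-style function that receives the question text and returns the name of the View to use -- the signature and return convention are our assumptions, not ELF's documented interface:

Function ViewPicker(Question)
    ' No mention of "border" means rivers through bordering
    ' states can't be intended, so use the stripped-down View.
    If InStr(LCase(Question), "border") > 0 Then
        ViewPicker = "GeoQuery"   ' full View, Border relationships intact
    Else
        ViewPicker = "NoBorders"  ' the View built in Step 2
    End If
End Function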