For search engine developers, the comedy bit "Who's on first" is an occupational nightmare. Without understanding the context of a phrase - that Who's on first, What's on second, and I Don't Know's on third - search engines are as confused about the intended meaning of a word as Costello was by Abbott.
But a company called InXight Software claims it has come up with a solution to the problem of determining context in a query. The innovation behind its context-sensitive searches lies in improving on the finite state machine, a kind of software program designed to recognize patterns in a sequence of data. Finite state machines have a long history in computer science and are used with particular success in voice-recognition technology.
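To make the idea concrete, here is a minimal sketch of a finite state machine in Python. It is an illustration only, not InXight's code: the state names and the pattern it recognizes (one or more repetitions of "ab") are invented for the example.

```python
# A tiny deterministic finite state machine (illustrative only).
# It accepts strings made of one or more repetitions of "ab".
# Each entry maps (current_state, input_character) -> next_state.
TRANSITIONS = {
    ("start", "a"): "saw_a",
    ("saw_a", "b"): "saw_ab",
    ("saw_ab", "a"): "saw_a",
}
ACCEPTING = {"saw_ab"}


def accepts(text: str) -> bool:
    """Feed the text through the machine one character at a time."""
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:  # no arc for this character: reject
            return False
    return state in ACCEPTING


print(accepts("ababab"))  # True
print(accepts("abba"))    # False
```

The machine keeps no memory beyond its current state, which is why such recognizers can be both small and fast.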
"It's been the leading methodology for the past 20 years. What would distinguish an innovation is the knowledge base built into the finite state [machine]," said Jim Baker, CEO of Dragon Systems, a voice-recognition software-maker in Cambridge, Massachusetts.
InXight is a subsidiary of Xerox's famed Palo Alto Research Center, an organization as well known for missing out on the commercial possibilities of its research as it is for its inventions. In this case, InXight quickly encapsulated its new technology in a toolset, which has since been licensed by Microsoft, Oracle, Infoseek, Verity, and SPSS Inc., a statistical software developer.
The latest version of InXight's software, called LinguistX, puts the finishing touches on a knowledge base built into a finite state machine. Designed by two researchers, one trained in artificial intelligence and the other in computational linguistics, LinguistX improves on traditional finite state machines with a technology called finite state transducers.
Finite state transducers go beyond recognizing word patterns to understanding the meanings of different lexical forms. For example, to a search engine not using finite state transducers, the phrase "the white house" contains an article, "the," an adjective, "white," and a noun, "house." But a technology in the transducers, called a linguistic morphological tool, looks for clues to put a group of words in context. In the case of "the white house," the linguistic morphological tool identifies "the" before "white" as a meaningful combination. An embedded dictionary then looks up the phrase, and the search engine is instructed to find other words associated with "the white house." Up come government URLs, not sites dedicated to home improvement.
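The difference between a recognizer and a transducer is that a transducer emits output as it reads input. The sketch below is a hand-built toy, not LinguistX's design: the arc table, the tag names ("+V", "+3sg", "+Past") and the single verb it covers are all assumptions made for illustration. It maps inflected forms of "walk" to a base form plus grammatical tags.

```python
# A toy finite state transducer (illustrative; not LinguistX's format).
# Each arc maps (state, input_char) -> (next_state, output_string).
ARCS = {
    (0, "w"): (1, "w"),
    (1, "a"): (2, "a"),
    (2, "l"): (3, "l"),
    (3, "k"): (4, "k"),
    (4, "s"): (5, ""),   # suffix "s" is consumed, nothing is copied out
    (4, "e"): (6, ""),
    (6, "d"): (7, ""),
}
# Tags emitted when the input ends in a final state (hypothetical tag set).
FINAL_OUTPUT = {4: "+V", 5: "+V+3sg", 7: "+V+Past"}


def analyze(word):
    """Return the base form plus tags, or None if the word is not covered."""
    state, out = 0, []
    for ch in word:
        arc = ARCS.get((state, ch))
        if arc is None:
            return None
        state, emitted = arc
        out.append(emitted)
    if state not in FINAL_OUTPUT:
        return None
    return "".join(out) + FINAL_OUTPUT[state]


print(analyze("walked"))  # walk+V+Past
print(analyze("walks"))   # walk+V+3sg
print(analyze("walke"))   # None
```

A production system would compile arcs like these from a full lexicon and rule set rather than writing them by hand.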
Beyond contextual search, the other advantage of finite state transducers is speed, says Ian Hersey, advanced product planning manager at InXight. Finite state transducers operate in a compressed environment. This means that, unlike conventional software, the program behaves like a data set, and a search can be run against it while it is still compressed. LinguistX's French dictionary, for example, offers some 5 million words but takes up only 300 KB of disk space.
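The article does not describe InXight's storage format, but one well-known way a word list can be both compact and directly searchable is to store it as an automaton in which identical word endings are shared. The sketch below assumes that approach; the word list, the "$" end marker and the function names are invented for the example.

```python
# Sketch: store a word list as a trie, then merge identical subtrees so
# shared endings ("-ing", "-ed", "-s") are kept only once. Lookups run
# directly on the merged structure; nothing is ever expanded.
END = "$"  # end-of-word marker (assumes "$" never appears in a word)


def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def minimize(node, registry):
    """Return a canonical node; identical subtrees share one object."""
    canon = {}
    for key, child in node.items():
        canon[key] = True if key == END else minimize(child, registry)
    signature = tuple(sorted((k, id(v)) for k, v in canon.items()))
    return registry.setdefault(signature, canon)


def count_nodes(root):
    """Count distinct node objects reachable from the root."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        stack.extend(child for key, child in node.items() if key != END)
    return len(seen)


def contains(root, word):
    node = root
    for ch in word:
        node = node.get(ch) if ch != END else None
        if node is None:
            return False
    return END in node


WORDS = ["walking", "talking", "walked", "talked", "walks", "talks"]
trie = build_trie(WORDS)
compact = minimize(trie, {})

print(count_nodes(trie), count_nodes(compact))           # 21 9
print(contains(compact, "talked"), contains(compact, "talker"))  # True False
```

On a real lexicon, where millions of inflected forms share a few thousand endings, the savings from this kind of sharing grow far larger than in this six-word example.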
"What this means is Infoseek doesn't have to buy more hardware to conduct thousands of searches a second. For end users, they don't know why their searches are in context and fast, but they understand that Infoseek is providing extremely good performance," said Hersey. Rather than chase mindshare among end users, InXight hopes to become a de facto standard among software companies.
Alongside LinguistX, InXight is releasing the Summarizer, which uses finite state transducers to create summaries of articles at speeds approaching 1 GB of data per hour. The software supports 13 languages, including Japanese, a language considered extremely difficult for linguistic software because its written form doesn't separate individual words with spaces. In kanji, for example, the phrase "Tokyo Metropolitan Area" can be read as completely different cities simply by dividing the phrase in different ways. Other languages are expected to be added soon, Hersey says.
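The segmentation problem can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Summarizer's method: it assumes the phrase in question is written 東京都, and the five-entry lexicon is a hypothetical stand-in for a real dictionary. The function simply lists every way the unspaced text can be divided into known words.

```python
# Sketch: enumerate every way to split unspaced text into dictionary words.
# The lexicon is a made-up toy; a real system would also have to rank the
# candidate readings, which this example does not attempt.
def segmentations(text, lexicon):
    if not text:
        return [[]]  # one way to split the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


# 東京 = Tokyo, 都 = metropolis, 東 = east, 京都 = Kyoto
LEXICON = {"東", "京都", "東京", "都", "東京都"}

for split in segmentations("東京都", LEXICON):
    print(" / ".join(split))
# 東 / 京都
# 東京 / 都
# 東京都
```

The first reading points at Kyoto and the second at Tokyo, which is the kind of ambiguity a linguistic tool for Japanese has to resolve before it can index or summarize a text.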