A research-focused search engine founded by Human Genome Project scientists is claiming to go where even Google doesn't tread: the deep web.
DeepDyve is designed to search the 99 percent of content (the company says, citing a UC Berkeley study) that other search engines miss. Those engines rank pages largely on measures of popularity, and they work only if a page can be found and crawled in the first place. Content hidden behind paywalls, or not linked from enough sites to earn page rank, remains obscure, yet it often contains the source material required for serious research.
It's the classic "needle in a haystack" problem: you know it's there, you know you can get to it, but ... how? DeepDyve attempts to bridge the gap with techniques borrowed from genomics, such as the pattern and symbol matching used to identify DNA strands.
The company's technology uses an algorithm called "KeyPhrases," which indexes passages up to 20 words in length, not just single keywords. Because the technology was originally conceived to identify long, complex strings of DNA, it has no need for semantics; sequencing the human genome required only character-level pattern recognition.
"It's really doing pattern matching; it's not at all language dependent," CEO William Park told Wired.com. "In fact it's actually language agnostic."
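In effect, the engine treats text the way a sequencer treats base pairs: as raw strings to be matched. A minimal sketch of that kind of phrase-level indexing is below; the function names, the five-word window and the toy corpus are illustrative assumptions, not DeepDyve's actual code.

```python
from collections import defaultdict

def phrases(text, max_len=5):
    """Yield overlapping word sequences ("key phrases") up to max_len words.

    Purely string-based: no stemming and no semantics, which is what makes
    the approach language agnostic.
    """
    words = text.lower().split()
    for length in range(1, max_len + 1):
        for i in range(len(words) - length + 1):
            yield " ".join(words[i:i + length])

def build_index(docs, max_len=5):
    """Map each phrase to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for phrase in phrases(text, max_len):
            index[phrase].add(doc_id)
    return index

# Hypothetical two-document corpus.
docs = {
    "a": "mutations in the gene associated with blue eyes",
    "b": "blue eyes are associated with a single mutation",
}
index = build_index(docs)
print(index["associated with"])  # both documents match the raw phrase
```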
DeepDyve's most interesting feature, and the one that further distinguishes it from the likes of Google Scholar, is the ability to base a search on a large chunk of text, or even a whole article, up to 25,000 characters. Google limits queries to 32 words.
“If you were trying to look for the sequence for blue eyes, it could be massive in length,” said Park. “The query so to speak has to be very large.”
The engine scans whole strings of text to find familiar segments, ranks and orders them, and finally surfaces the most relevant articles in which they appear.
“It’s purely statistical – just like genomics,” said Park.
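In that spirit, here is a rough, self-contained sketch of how a very long query could be scored: break the query into the same kind of overlapping segments, count how many segments each indexed document shares, and rank by that count. The segment length, the toy index and the ranking rule are assumptions for illustration, not DeepDyve's actual ranking method.

```python
from collections import Counter

def segments(text, max_len=5):
    """Break text into overlapping word phrases up to max_len words long."""
    words = text.lower().split()
    for length in range(1, max_len + 1):
        for i in range(len(words) - length + 1):
            yield " ".join(words[i:i + length])

def rank(query_text, index, max_len=5):
    """Rank documents by how many query segments they share with the index.

    Pure overlap counting: a longer query simply contributes more segments
    to match, with no interpretation of meaning.
    """
    scores = Counter()
    for seg in segments(query_text, max_len):
        for doc_id in index.get(seg, ()):
            scores[doc_id] += 1
    return scores.most_common()

# Toy phrase-to-documents index (in practice built from the indexed corpus).
index = {
    "associated with blue eyes": {"article-17"},
    "blue eyes": {"article-17", "article-42"},
}
print(rank("the gene most often associated with blue eyes", index))
# [('article-17', 2), ('article-42', 1)]
```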
The 2003 UC Berkeley study of the deep web cited by the company, "How Much Information," was conducted by Hal Varian, now Google's chief economist. Varian found about 91,000 terabytes of information in the deep web and only 167 terabytes on the surface web, a ratio of more than 500 to one.
But Chris Sherman, executive editor of Search Engine Land, says it's difficult to pin down exactly how much is going unfound.
“It’s one of these cases where they know the information’s out there, but because they can't access it, it’s almost impossible to accurately quantify,” he said, noting that databases and content management systems are not like typical web pages.
Sherman did his own investigation into the deep web six years ago while working on his book, "The Invisible Web," and concluded there was anywhere from two to 50 times as much untapped information.
He thinks DeepDyve, with its free service, has great potential to explore this uncharted territory compared with competitors like LexisNexis.
A subscription-based service debuted at the DEMO conference a few months ago, but on Tuesday the company launched a free, ad-supported version. It is also actively courting new publishers to open their content to the public through its search.
“We’re going to publishers and we’re saying let us be your iTunes partner. Let’s build a platform together where we can re-market your content in a very IP/copyright friendly way and we’re going to make your information much more findable,” Park said.
DeepDyve currently indexes about 500 million pages and partners with a number of publications for free access to their content. This quarter the company, which so far focuses on health, life sciences and patents, plans to expand into the physical sciences, including information technology, clean technology and energy.