Web search engines like Lycos and AltaVista are great for indexing static HTML pages. Unfortunately, the idea of searching the Web with a "spider" and then building up a massive database is beginning to break down.
Search-engine companies are having technical problems keeping their massive databases current. Consider news-oriented sites like CNN, MSNBC, and Wired News: How can a Web spider find individual stories on these sites, when the pages are changing every hour?
Although spiders have been useful until now, there's a lot of information on the Web that simply can't be found with the traditional approach. The bad news: there are more of these index-resistant sites all the time.
Consider the Nasdaq site, which now offers nearly real-time stock quotes for every company listed on its exchange. You can look up the current price of Sun Microsystems stock there. But searching any Web search engine for the words "Sun Microsystems," "Nasdaq," and "stock price" will never turn up that page, because the Sun quote page is dynamically generated.
AltaVista's director of engineering, Barry Rubinson, isn't interested in solving this problem: there are simply too many monstrous databases out there to worry about. AltaVista has 50 million pages of text right now, says Rubinson; Dow Jones News Retrieval has 400 million. "I know of 10 other databases out there" that are just as big, he says. "We would be 100 times as large as we are now if we tried to index every page out there."
Right now, being 100 times larger is a physical impossibility. AltaVista already has four terabytes of online storage. It's just not possible with today's computer technology to build a database that has 400 terabytes of spinning storage, Rubinson said.
Besides, Scooter (AltaVista's Web spider) is having a hard time just keeping up with the Internet's growth. I scanned my Web server's log files and found that Scooter visits each page on our site less than once a month. For the most part, Scooter isn't even following links; all it's doing is fetching pages that have been individually registered through AltaVista's "register-a-URL" feature. Perhaps that explains why AltaVista returns so many dead links, and why it does such a poor job of finding new stuff.
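You can make the same measurement against your own logs. Here's one rough way to do it, assuming your server writes the common "combined" log format and that Scooter announces itself in the user-agent field; the details of your own logs may differ.

```python
# Count how often AltaVista's Scooter fetches each page, from a Web server
# access log named on the command line. Assumes the "combined" log format,
# where the request and the user-agent are quoted fields; matching the
# string "Scooter" as the spider's signature is an assumption.
import sys
from collections import Counter

hits = Counter()
with open(sys.argv[1]) as log:
    for line in log:
        parts = line.split('"')
        if len(parts) < 6 or "Scooter" not in parts[5]:
            continue                      # malformed line, or not the spider
        request = parts[1].split()        # e.g. "GET /index.html HTTP/1.0"
        if len(request) >= 2:
            hits[request[1]] += 1

for path, count in hits.most_common():
    print(f"{count:6d}  {path}")
```

Compare the counts against the age of your log file and you get a fetch rate per page.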
InfoSeek lead developer Chris Lindblad and chairman Steven T. Kirsch are looking for something better: a way to quickly figure out what's new on a Web site, as well as ways to discover what information might be hidden inside a database on a remote server. Rather than develop their own proprietary solution, they're working with Hector Garcia-Molina at Stanford's digital libraries project on a standard based on the Web's HTTP protocol. The standard would both let spiders scan Web sites more intelligently and allow for distributed searches throughout the Net.
Kirsch is advocating a special file in the root directory of every Web server. This file, called sitelist.txt, would list all the files on the server and the times they were last modified. Such a file would make it easy for a spider to keep tabs on even the most complicated sites. It would eliminate the need for spiders to follow links (because the file would name every page on the site), and it would eliminate the need to pull down pages that hadn't changed (because the spider could compare modification dates and fetch only the pages that had).
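The exact format of sitelist.txt is still being worked out, so the sketch below invents one: a tab-separated line per page, holding the path and its last-modified time. The format, the field order, and the placeholder address are assumptions made purely for illustration.

```python
# Incremental crawl driven by a hypothetical sitelist.txt: read the list,
# compare each page's last-modified time with what we saw on the previous
# run, and fetch only the pages that are new or have changed.
from urllib.request import urlopen

def read_sitelist(site):
    """Return {path: last_modified} parsed from the site's sitelist.txt."""
    listing = {}
    with urlopen(site + "/sitelist.txt") as resp:
        for line in resp.read().decode("utf-8", "replace").splitlines():
            path, sep, modified = line.partition("\t")
            if sep and not path.startswith("#"):
                listing[path] = modified.strip()
    return listing

def crawl(site, last_seen, index_page):
    """Fetch pages whose modification time changed; return the new listing."""
    current = read_sitelist(site)
    for path, modified in current.items():
        if last_seen.get(path) == modified:
            continue                        # unchanged since the last visit
        with urlopen(site + path) as page:
            index_page(path, page.read())   # hand the page to the indexer
    return current                          # save this as last_seen next time

# Typical use (the address is a placeholder), with last_seen persisted
# between runs:
#   seen = crawl("http://www.example.com", seen, my_indexer)
```

One fetch of the list replaces thousands of speculative page fetches, which is the whole point of the proposal.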
Distributed searching is more complicated. Basically, InfoSeek is working on a way to hand off a search from one search engine to another. That way, when you searched for Sun Microsystems on InfoSeek, it could simply hand the search off to the Nasdaq Web site, and you'd get back a link to the database-generated HTML page. Unfortunately, distributed searching solves only half the problem. The other half is distributing meta-information about the distributed search engines themselves. Otherwise, every Web search on InfoSeek would have to be handed off to every single database on the Internet. Not only would that be terribly inefficient, it would be stupid: There's clearly no reason to search the Library of Congress card catalog when you're looking for a current stock quote.
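Here's a rough sketch of what that meta-information buys you. Nothing in the proposal says how an engine would describe its holdings; the engine addresses and one-line coverage descriptions below are invented purely to show the routing decision.

```python
# Route a query only to the remote engines whose advertised coverage looks
# relevant, instead of broadcasting it to every database on the Internet.
# The engine addresses and coverage descriptions are invented examples.
ENGINES = {
    "http://quotes.nasdaq.example/search": "nasdaq stock quotes ticker price companies",
    "http://catalog.loc.example/search":   "library catalog books authors subjects",
    "http://news.example/search":          "news stories headlines breaking articles",
}

def relevant_engines(query, engines=ENGINES):
    """Return engines whose coverage shares at least one term with the query."""
    terms = set(query.lower().split())
    return [url for url, coverage in engines.items()
            if terms & set(coverage.split())]

print(relevant_engines("sun microsystems stock price"))
# prints only the Nasdaq engine; the library catalog never sees the query
```

A real system would describe each collection far more richly than a bag of words, but even this crude matching keeps the card catalog out of your stock-quote search.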
Lindblad is building support for this narrowed distributed search into the InfoSeek enterprise search engine. Consider, says Lindblad, "a big company spread out across the world: Lots of groups have set up Web servers. The company has a private network. They don't want this network all used up by a [spider] trying to index every Web site in the company, and they don't want it used up by people doing searches all over the country. The happy medium is to have people doing indexing on each one of the local Web servers and to build meta-indexes."
When you run a search, the meta-search engine first figures out which of the company's search engines should get the request, then sends the query to each one. Finally, it assembles the answers and gives them to you in a digestible form.
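A minimal sketch of the last two steps, querying each selected engine and assembling the answers. The assumption that every engine takes a q= parameter and returns one plain-text result per line is mine; real engines would each need their own adapter.

```python
# Steps two and three: send the query to each selected engine in parallel,
# then merge the answers into one list tagged with the engine that found them.
# Assumes each engine accepts a q= parameter and returns one result per line.
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import quote
from urllib.request import urlopen

def ask_engine(url, query):
    """Send the query to one engine and return its result lines."""
    with urlopen(f"{url}?q={quote(query)}", timeout=10) as resp:
        return resp.read().decode("utf-8", "replace").splitlines()

def meta_search(query, engines):
    """Fan the query out to the chosen engines and assemble the answers."""
    with ThreadPoolExecutor() as pool:
        answers = list(pool.map(lambda url: ask_engine(url, query), engines))
    merged = []
    for engine, lines in zip(engines, answers):
        merged.extend((engine, line) for line in lines)
    return merged

# `engines` would come from the routing step sketched earlier, for example:
#   meta_search("quarterly sales", relevant_engines("quarterly sales"))
```

Ranking the merged results sensibly is its own hard problem, but the plumbing really is this simple.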
None of the proposed systems will work on spider traps, like the one I've developed. Basically, a spider trap is a deliberately constructed set of HTML pages that seems to go on forever, because it really does.
To build a spider trap, you simply set up a CGI script that creates page after page after page. You've got to be a little clever, though: make sure there are no question marks and no "cgi" in the URLs, or the spiders will guess that the pages are dynamically generated and won't bother indexing them.
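Here's roughly what such a trap looks like. My trap is a CGI script; this sketch uses a small stand-alone HTTP server instead, but the effect is the same: every page under /maze/ links to ten more pages under /maze/, and nothing in the URLs gives the game away. The /maze/ name and the port number are arbitrary.

```python
# An endless maze: every request under /maze/ returns a page of links to
# ten more pages under /maze/, with no "?" and no "cgi" anywhere in the
# URLs, so a spider has no obvious hint that the pages are machine-made.
# (As explained below, actually unleashing this on a spider is unwise.)
from http.server import BaseHTTPRequestHandler, HTTPServer

class TrapHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if not self.path.startswith("/maze"):
            self.send_error(404)
            return
        links = "".join(
            f'<li><a href="{self.path.rstrip("/")}/room{i}/">room {i}</a></li>'
            for i in range(10)
        )
        body = f"<html><body><h1>{self.path}</h1><ul>{links}</ul></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body.encode("utf-8"))

if __name__ == "__main__":
    HTTPServer(("", 8000), TrapHandler).serve_forever()
```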
I built the spider trap because I thought it would be fun. But it didn't take long for Chris Lindblad to convince me the idea was really stupid. If you're really successful with your trap and you really convince the spider to take a good swipe at your site, you're dead. Sites like InfoSeek, AltaVista, and Lycos have huge connections to the Internet. You don't, so they can easily fill up your bandwidth before you fill theirs. Furthermore, these sites have got really big disks: Your log files are sure to overflow before they run out of space to store the index.
In other words, building a good spider trap is a great way to mount a denial-of-service attack against yourself.
Few people realize how important search engines have become to any sense of order on the Web. Today they're all but indispensable. That's why solving these problems is critical to the future of the Web.