Why the hell is the Web so slow? What exactly is causing the problems? And how are we ever going to fix them? (Chill out and read on. It's going to be OK.)
On November 9, 1965, the Great Northeast Blackout left 30 million people in the United States in darkness. It started as an unexpected power surge through the relatively new regional electric grid, but soon cities across the region were going dark as power plants in New York, Connecticut, Massachusetts, Vermont, New Hampshire, and Maine toppled like dominoes. Over the coming six to twelve months, computer users around the planet are likely to experience the Internet equivalent of the Great Blackout, or at least frequent brownouts, as our information infrastructure staggers and struggles under the onslaught of new users and new demands.
These slowdowns will be more than just a minor annoyance: they will challenge the very future of the network. Businesses that depend on the Internet will find themselves cut off from their branch offices, their suppliers, and their customers. Web sites that charge advertisers by the hit will come up short of funds, since network congestion will keep people away. And press reports of the slowdowns will keep new users away, further undermining business models that depend on the Internet's continued growth.
Fear not: We fixed the power grid after 1965, and we'll fix the information grid, too. Back then, the Great Blackout sent consumers, businesses, government officials, and the media into a panic. Some predicted doom from a society and technology that was growing too complex. A month later, the Federal Power Commission issued a report that raised the standards of service for the industry by calling for more careful operating procedures and more investment in interconnections between regional power companies. The influential report concluded: "The utility industry must strive not merely for good, but for virtually perfect service." And not many years later, we got it.
Today, we're watching not the end of the information network, but the beginning. "There will be a collapse, then operations will resume. There will be another collapse, and operations will resume," says Bob Metcalfe, inventor of the Ethernet network technology, and one of the loudest voices proclaiming the coming doom. After each collapse, Metcalfe says, people will try to figure out what happened, then fix the network so that it won't happen again. Eventually, he predicts, a new industrial-strength Internet will emerge that will be solid and reliable, and will be different from today's Internet in one very important respect: people will pay for what they use.
It's easy to spot the symptoms of Internet slowdown: Web pages crawl onto your screen, images drip-drip-drip into existence. But diagnosing the problem can be far more complex - as complex as the network's topology itself. Slowdowns fall into two broad categories: delays caused by the network, and delays caused by network servers. There is no single organization to blame, because everybody's system needs improvement. Fortunately, all of the problems are solvable.
The typical act of viewing a Web page involves sending at least a dozen packets of information over at least five separate networks: 1) From your computer, over your modem and phone line, to your Internet service provider. 2) From your ISP to one of the national backbone providers - usually MCI, BBN, or Sprint. 3) From the backbone provider through one of the national peering locations to another backbone provider. 4) From the second backbone provider to the Web site's ISP. 5) From the Web site's ISP to the Web server itself. And then, of course, the whole process has to run in reverse across the same five networks. Repeatedly.
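To make that volley of packets concrete, here is a minimal sketch, in Python, of what a single page view sets in motion: a DNS lookup, a TCP handshake, an HTTP request, and a stream of reply packets, every one of which crosses those same networks. (The host name is a stand-in; any Web server will do.)

    # A minimal sketch of one page view: resolve the name, open a TCP
    # connection, send an HTTP request, and collect the reply packets.
    # The host name below is a placeholder, not a site from the article.
    import socket

    host = "www.example.com"                  # stand-in for any Web site
    address = socket.gethostbyname(host)      # DNS lookup: one more round trip

    sock = socket.create_connection((address, 80), timeout=10)   # TCP handshake
    sock.sendall(b"GET / HTTP/1.0\r\nHost: " + host.encode() + b"\r\n\r\n")

    reply = b""
    while True:                               # the page arrives as many packets
        chunk = sock.recv(4096)
        if not chunk:
            break
        reply += chunk
    sock.close()

    print(reply.split(b"\r\n")[0])            # the status line, e.g. HTTP/1.0 200 OK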
A fault in any of these paths can lead to a slowdown or an outright info blackout. That's because even though people claim that the Internet was designed to withstand an atomic blast, today's Net isn't being run that way. While the nation's big network providers mostly have redundant paths between their own locations around the nation, the links between your modem and the backbone aren't redundant. Neither are the links between the backbones and many commercial Web sites.
Like the power industry in 1965, the nation's Internet providers are now scrambling to make the network more dependable. (An increasing number of sites, for example, are installing second links to insure a reliable Web presence.) But that push is being complicated by tremendous growth in almost all uses of the Net. At any given time, the companies developing the hardware and software running the Net are about six to twelve months ahead of the curve. In other words, today's top-of-the-line Internet routers, data circuits, and servers won't be up to the task of satisfying network traffic demands in six to twelve months. Fortunately, by then a new generation of hardware and software will have been released - a generation that itself will be obsolete in another six to twelve months.
Modem malaise
The first clue of impending network logjams usually comes when you click on a link - and the Web page takes three minutes to download. What's wrong?
The problem could be your modem. You can't slurp pages off the Internet faster than the speed of your connection. To make matters worse, many modems are configured so that they run even slower than they need to - though how to solve this is a subject of great debate. Most modems are equipped with circuitry that automatically corrects data transmission errors as they occur. But if your computer is connected to noisy telephone lines, you may see better performance by turning off your modem's error correction. That's because the Internet's TCP/IP protocols are better at dealing with line noise than the modem's own protocols are. TCP will retransmit packets that are lost or arrive damaged, while your modem will keep trying to send the same garbled data again and again, until it finally succeeds or hangs up the phone in frustration.
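A toy simulation makes the point about end-to-end retransmission (this is a sketch with an invented loss rate, not a model of any particular modem or line): when noise garbles a packet, the sender simply waits out a timeout and resends that one packet, instead of stalling the whole link on a single bad frame.

    # Toy stop-and-wait retransmission over a simulated noisy line.
    # The 30 percent loss rate is an arbitrary assumption for illustration.
    import random

    random.seed(1)
    LOSS_RATE = 0.3                 # fraction of packets garbled by line noise
    MAX_TRIES = 10                  # give up on a packet after this many attempts

    def deliver(packets):
        """Resend each packet until it gets through; count total transmissions."""
        transmissions = 0
        for packet in packets:
            for _ in range(MAX_TRIES):
                transmissions += 1
                if random.random() > LOSS_RATE:   # survived the noisy line
                    break                         # move on to the next packet
                # otherwise the timeout expires and we resend just this packet
        return transmissions

    packets = ["packet %d" % i for i in range(20)]
    print("20 packets delivered in", deliver(packets), "transmissions")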
On the other side of the telephone line is another modem - one of many stacked up at your ISP. In all likelihood, that modem is connected to a box called a terminal concentrator, which takes the data you sent and puts it on an Ethernet local area network. From there, the data probably travels to a router, down a high-speed line, and ends up on a national backbone.
Until your packets reach the backbone, they are vulnerable. Turn off one of the modems, break the Ethernet connection, or crash one of the routers, and your link to the Net disappears. Until you reach the backbone, there's almost no redundancy.
Backbone bottleneck
The Internet's various backbones can cause their own bottlenecks and delays. Almost always, these delays are caused by too many people trying to send data down the same link at the same time.
Unlike the phone system, which reserves channels end to end for every active phone call, the long-haul links of the Internet are shared moment to moment by all of the packets that are trying to cross them at that time. When the links get filled up, users don't get busy signals as they do on telephone lines. Instead, they get increasingly poor performance. Dropped packets. Delayed responses. Data death.
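A back-of-the-envelope queueing calculation shows why performance falls off a cliff instead of a busy signal appearing. The sketch below uses the simplest textbook model (a single shared link treated as an M/M/1 queue) with an assumed T3 circuit and packet size; the numbers are illustrative, not measurements of any real backbone.

    # Delay on a single shared link as offered load approaches capacity,
    # using the textbook M/M/1 formula. All figures are assumptions.
    LINK_MBPS = 45.0                          # assume one T3 backbone circuit
    PACKET_BITS = 8000.0                      # assume 1,000-byte packets

    def average_delay_ms(offered_mbps):
        service_rate = LINK_MBPS * 1e6 / PACKET_BITS     # packets per second the link can send
        arrival_rate = offered_mbps * 1e6 / PACKET_BITS  # packets per second offered to it
        if arrival_rate >= service_rate:
            return float("inf")               # demand exceeds capacity: the queue never drains
        return 1000.0 / (service_rate - arrival_rate)

    for load in (10, 30, 40, 44, 44.9):
        print("%5.1f Mbps offered -> %8.2f ms average delay" % (load, average_delay_ms(load)))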
But keeping data flowing smoothly requires that Internet providers walk an economic tightrope. That's because it costs providers substantial amounts of money to increase their network capacity, but these costs can't be billed directly to their customers. Thus providers want to install very fast backbones so customers won't see delays, but they don't want to install too many because that expense eats into the bottom line.
What complicates the equation even more is today's widespread practice of flat-rate pricing - charging a single monthly fee for a Net connection, be it a dialup modem or a T1. With flat-rate pricing, the network provider does not want the customer to use the connection, because any use increases the demand on the provider's backbone. That's backward, says Metcalfe. "If you are a network supplier, you should benefit from your customers using the network, not say, 'Gee, I hope they don't use it too much.'"
Flat-rate pricing, at least for T1 connections, may be getting phased out. Already some providers, such as BBN and Alternet, have discovered that they can lower prices for many T1 customers who are relatively light users by raising the prices for customers who are heavy users. The result: metered T1 connections.
Fortunately, the Internet's backbone providers aren't simply waiting for the economics to be fixed: they are busy installing new equipment and adding capacity to keep up with anticipated demand. This task is substantially complicated by the 90 to 120 days it can take to order the necessary hardware, wait for its delivery, lay the fiber, and arrange for the complex configuration that high-speed connections require.
Consider MCI, which dramatically improved the speed of its backbone in the spring of 1996 by installing high-speed ATM switches and connecting them together with OC-3 links delivered by fiber optics. Each optical-carrier link moves data at 155 Mbps between the company's switches - the equivalent of more than 100 T1s. That's fast enough to let more than 5,000 people simultaneously download Netscape's homepage at the top speed of their 28.8-Kbps modems.
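The arithmetic is easy to check with nominal line rates (155 Mbps per OC-3, 1.544 Mbps per T1, 28.8 Kbps per modem):

    # Rough capacity arithmetic behind the OC-3 comparison.
    OC3_MBPS = 155.0
    T1_MBPS = 1.544
    MODEM_KBPS = 28.8

    print("T1 circuits per OC-3:  about %d" % (OC3_MBPS / T1_MBPS))            # roughly 100
    print("28.8 modems per OC-3:  about %d" % (OC3_MBPS * 1000 / MODEM_KBPS))  # well over 5,000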
But OC-3 is just the beginning. Simply by changing a card in the ATM switch, MCI can quadruple the speed of the circuits, moving them from OC-3 to OC-12. "We have plans to upgrade to OC-12s this year," says Rob Hagens, MCI's director of Internet engineering.
Even now, information flows relatively well along the backbones. "Within the individual networks, we see very few problems," says John Curran, chief technical officer at BBN Planet. But all is not roses. That's because there is more than one company with an Internet backbone, and, as Curran points out, "right now the Internet is being stressed at the interconnection points."
Malfunctioning MAEs
The Internet's biggest problems today are at the MAEs - the metropolitan area exchanges where the country's big Internet providers trade packets with one another. Imagine a nation of six-lane highways that all converge on a few clogged cloverleafs. That's the MAEs. Two of the largest and best-known of these are MAE East, in northern Virginia, and MAE West, in the San Francisco Bay area.
The solution? Build more capacity, and build it smarter. To build more, MFS Communications Company, the organization that runs the MAEs, has been installing its own high-speed ATM networks that can handle significantly more traffic. To build smarter, MCI, Sprint, and BBN are establishing "private peering" locations, where two companies will interconnect their networks and evenly split the costs of doing so. Private peering "should pull a significant amount of traffic out of these exchange points and leave more bandwidth available for everyone else," says Benham Malcom, manager of SprintLink engineering.
Of course, private links like this complicate the overall structure of the Internet. Computers at both ends need to know if the direct link is up or down, so they can decide whether to send packets across the link or to the public exchange. Making that decision is the quintessential problem of routing. And it's a problem that's getting harder and harder to solve every day.
Roundabout routing
Every packet that travels the Internet is labeled with its final destination, but not with the route that the packet needs to take. If a link between two parts of the network is damaged - for example, if a long distance circuit is accidentally cut by a utility crew - the network is supposed to automatically route the packets around the point of failure. To accomplish this feat, routers on network backbones need to have a complete map of the network's current structure. The map lets the routers decide where to send each packet on a packet-by-packet basis.
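In practice, "deciding where to send each packet" means a longest-prefix match against that map. Here is a toy sketch of destination-based forwarding; the prefixes and next-hop names are invented for illustration, and real backbone routers do this in specialized hardware and software, not Python.

    # Toy longest-prefix-match forwarding table. Routes and next hops are invented.
    import ipaddress

    ROUTING_TABLE = [
        (ipaddress.ip_network("0.0.0.0/0"),       "default route to the peering point"),
        (ipaddress.ip_network("192.0.2.0/24"),    "link to customer A"),
        (ipaddress.ip_network("198.51.100.0/24"), "link to customer B"),
    ]

    def next_hop(destination):
        """Pick the most specific route that covers the destination address."""
        address = ipaddress.ip_address(destination)
        matches = [(net, hop) for net, hop in ROUTING_TABLE if address in net]
        best_net, best_hop = max(matches, key=lambda entry: entry[0].prefixlen)
        return best_hop

    print(next_hop("192.0.2.17"))     # -> link to customer A
    print(next_hop("203.0.113.9"))    # -> default route to the peering point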
The size of routing maps has been growing steadily ever since the Internet's birth. Three years ago, the gurus on the Internet Engineering Task Force worried that routing tables were growing too fast - in particular, they were doubling in size every nine to ten months, whereas the density of the RAM chips inside the routers (the chips that hold the routing tables) was doubling only every 11 to 24 months. If nothing was fixed, then at some point the routing tables would become too large to fit in the routers, and the network would melt down.
Fortunately, something was fixed. The solution was to change the way the maps are stored in the routers and transmitted around the network. Engineers developed a new system called CIDR (classless interdomain routing), which allowed individual networks on the Internet to be automatically aggregated into larger networks. The immediate result was smaller routing tables. The long-range payoff was routing tables that grow at a slower rate. Once again, the imminent meltdown of the Internet was pushed a few years further into the future.
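What CIDR buys is easy to see with a small example (the prefixes below are illustrative, not real customer networks): four adjacent networks that used to occupy four routing-table entries collapse into one.

    # CIDR aggregation in miniature: four adjacent /24 networks become one /22 route.
    import ipaddress

    customer_networks = [ipaddress.ip_network("192.168.%d.0/24" % i) for i in range(4)]
    aggregated = list(ipaddress.collapse_addresses(customer_networks))

    print("before CIDR:", [str(net) for net in customer_networks])   # four table entries
    print("after CIDR: ", [str(net) for net in aggregated])          # ['192.168.0.0/22']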
One problem that remains is route flapping. Every time a route on the Internet gets turned on or cut off, that information has to be sent to hundreds or thousands of other routers. When routes go up and down repeatedly, they are said to "flap." A flap can happen because a router is rebooting, or because a long-haul link between two routers starts generating errors, or because a route is improperly configured.
Most route flaps are harmless, affecting only a few customers. Sometimes, though, flaps can trigger long-dormant bugs in the routers' computer programs, causing widespread failure. (In 1980, for example, the Arpanet collapsed because of such a bug - each router crashed, but only after it sent out packets to neighboring routers telling them to do the same thing.) Flaps can dramatically change traffic patterns on the Net, causing moments of congestion followed by periods of relative calm. To the untrained observer, the Internet seems to suddenly stop working, start working again a few minutes later, then stop working again.
Right now, Internet providers don't seem to have a technical solution to stop route flapping. That's because the information that a link has gone up or down has to be carried to other routers on the network so they can route around the failure. Instead, Cisco Systems Inc., which manufactures the lion's share of routers that run the Internet's backbones, recently added an anti-flapping feature to its router software to deal with organizations that are flapping their routes too much. Anti-flapping software lets the backbone routers detect when a connected network is flapping its routes. The backbone router can then drop any packets that are destined for the "problem" network for about half an hour. Think of it as a form of electronic time-out.
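The time-out works roughly like the sketch below: every flap adds to a penalty score that decays during good behavior, and a route whose penalty crosses a threshold is ignored for a while. The penalty values, threshold, and decay rate here are invented for illustration; they are not Cisco's actual parameters.

    # A sketch of the "electronic time-out" idea. All parameter values are invented.
    import time

    FLAP_PENALTY = 1000        # added each time the route goes down and comes back up
    SUPPRESS_LIMIT = 3000      # penalties above this put the route in the penalty box
    HALF_LIFE_SECONDS = 900    # penalty halves after 15 minutes of good behavior
    SUPPRESS_SECONDS = 1800    # "about half an hour" of dropped packets

    class DampenedRoute:
        def __init__(self):
            self.penalty = 0.0
            self.last_update = time.time()
            self.suppressed_until = 0.0

        def record_flap(self):
            elapsed = time.time() - self.last_update
            self.penalty *= 0.5 ** (elapsed / HALF_LIFE_SECONDS)   # decay old penalties
            self.penalty += FLAP_PENALTY
            self.last_update = time.time()
            if self.penalty >= SUPPRESS_LIMIT:
                self.suppressed_until = time.time() + SUPPRESS_SECONDS

        def usable(self):
            """The backbone drops traffic for this route while it is suppressed."""
            return time.time() >= self.suppressed_until

    route = DampenedRoute()
    for _ in range(3):            # three quick flaps push the penalty past the limit
        route.record_flap()
    print("route usable?", route.usable())    # False, for roughly the next half hour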
Swamped sites
If your modem is working, and your connection to your Internet service provider is up, and the ISP's link to the Internet backbone is in place, and the Internet backbones aren't congested, and the MAEs aren't overloaded, then eventually your packets end up at the destination Web site. Here's the final place where connections can bog - or break - down: at the server.
If there's a Web site that is ready for the massive loads that many sites will see in the future, it's Netscape's. Right now the Netscape site is getting 80 million hits a day. Thanks to the millions of users who have left home.netscape.com/ as their browser's default homepage, and to the people who frequently click Navigator's Net Search button, Netscape's Web site has become the busiest site of all.
Robert Andrews, director of the Netscape site, can almost literally watch the world turn through his Silicon Graphics Challenge server. The Challenge crunches the log files from the 50 computers that make up Netscape's presence on the Internet. As the business day starts in Japan, then Europe, then New York, Chicago, and San Francisco, a series of surges rolls across Netscape's site as millions of office workers sit down at their computers and start their browsers. A million hits here, a million hits there. The screen shows the pulse of the global network.
Satisfying these huge demands requires three things: a connection that is fast enough to pump out the data, computers that are fast enough to keep up with the demand, and memory that is large enough to support thousands of simultaneous connections.
Building the connection to the outside world is relatively easy: Andrews has arranged for three fiber-optic T3 connections, each capable of sending data at 45 Mbps. One goes out the front of the building to Sprint, while the other two go out the back of the building to MCI. The physical diversity ensures that if one line is accidentally severed, the others will in all likelihood continue operating and carry the load. At Netscape's headquarters in Mountain View, California, the fiber, encased in an armored steel pipe, is visible as it comes out of the ground and enters Netscape's machine room. Employees joke, "That's where the money goes in."
But building the servers themselves is a bit more complicated. Today no single computer is large enough to handle the onslaught. (See "The Domain Name System," page 84.) Instead, Andrews has built a system that distributes the load across many machines. There are actually more than 30 computers pretending to be home.netscape.com, all of them holding identical Web pages. The Domain Name System hands your browser one of them, more or less at random, when you click the big N on Navigator or press one of the directory buttons.
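A few lines of Python show the trick from the client's side: the Domain Name System returns a whole list of addresses for the one name, and each visitor ends up on one of them. (Whether home.netscape.com still answers this way is not guaranteed; substitute any large site.)

    # One name, many machines: ask DNS for every address behind the name,
    # then connect to any one of them.
    import random
    import socket

    name = "home.netscape.com"
    canonical, aliases, addresses = socket.gethostbyname_ex(name)

    print("addresses advertised for %s:" % name, addresses)
    print("this page view goes to", random.choice(addresses))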
Each server is also equipped with a lot of memory. That's because each user who downloads a page from a Netscape server ends up reserving a tiny piece of the computer's memory. There are limits within the Unix and Windows NT operating systems on how many open network connections can exist on the same machine. Most workstations are hard-pressed to handle more than 20 to 40 simultaneous connections. Crammed with 128 Mbytes of RAM apiece, each Netscape server can handle more than 4,000 connections.
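The implied memory budget is simple division (a rough figure drawn from the article's numbers, not a measured one): 128 Mbytes spread over 4,000 connections leaves roughly 32 Kbytes of RAM for each open connection.

    # Back-of-the-envelope memory budget per connection, from the figures above.
    RAM_KBYTES = 128 * 1024
    CONNECTIONS = 4000

    print("about %d Kbytes of RAM per open connection" % (RAM_KBYTES / CONNECTIONS))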
Eventually, though, Netscape will have to spread the load not merely across more servers, but across the planet itself. That's why Andrews plans to build server farms in Paris, Stockholm, Sydney, Tokyo, and Hong Kong. It makes no economic or technical sense to install high-speed transpacific circuits just so that Netscape's users in Japan can click the big N and see the Japanese version of Netscape's homepage arrive from a Web server in California. Moving the Japanese content to Japan would simultaneously give users better performance and cost less to operate.
Better, stronger, faster
It seems that the key to stopping Web slowdown - and building a new, industrial-strength Internet - is for everyone working within the five network spheres to do their part.
In the days following the Great Blackout of '65, businesses and government officials took stock of the disaster and tried to figure out what to do next. The Federal Aviation Administration ordered large power generators for 50 airports across the country so that they could continue to operate during future blackouts. Likewise, hospitals, tunnels, drawbridges, even gas stations around the Northeast were told to make sure that they developed alternative sources for electricity. Yet few of them did.
Instead, the power industry made good on the Federal Power Commission's challenge to turn good service into "virtually perfect service." Part of the formula was that consumers paid both for what they used and for the peak demands they placed on the system. The resulting reliable power system has substantially benefited all of society - even residential users, who might not see the value in paying for uninterruptible service until they lose a few refrigerators' worth of groceries.
If the Internet is to become a true information utility, the same commitment to "virtually perfect service" will have to be made. As that happens, companies and individuals will find that it makes more and more economic sense to depend on the Internet's infrastructure rather than to attempt to duplicate it with their own private networks. In the end, they undoubtedly will come around to the idea of paying for what they use.