What really went down when the internet went down

Fastly’s outage took out a chunk of the internet. The next one could be even bigger
WIRED

Some people noticed the problem when they couldn’t access The Guardian. Others struggled with the New York Times or the UK government’s websites. Others couldn’t buy things on Amazon. People started to panic as another global outage struck internet users.

The latest outage, which began just before 11am UK time, appears to have hit Fastly, a content distribution network, or CDN, and knocked out every company that used its services to support their websites. Across the internet, “Error 503 service unavailable” appeared on people’s screens.

Fastly identified the issue within 45 minutes and told the world that “a fix is being implemented”. The sites began to trickle back online soon after.

In less than an hour, a company that most people have never heard of showed how vulnerable our global internet infrastructure is. The vestiges of the internet not affected went into overdrive with speculation. While #cyberattack was trending on Twitter, the reality was more prosaic: someone misconfigured a server. “We identified a service configuration that triggered disruptions across our POPs globally and have disabled that configuration,” Fastly explained in a tweet. “Our global network is coming back online.”

“When you configure your servers, you give them commands. If you tell them something wrong, then it affects potentially all servers at once,” says Christian Kaufmann of Akamai, a competitor – who didn’t want to give specifics about Fastly. This isn’t the first time a configuration error has caused havoc. A 2020 outage at Cloudflare, another key CDN, occurred because of a configuration error, the company confirmed.

The fault that brought down the many websites that use Fastly highlights one of the problems of our interconnected internet. Power in the CDN space is concentrated in the hands of the big three providers: Akamai is the biggest, then Fastly and Cloudflare compete in terms of the volume of internet traffic they serve. Amazon itself also has its own CDN platform, Amazon CloudFront.

Fastly is mostly aimed at enterprise customers, such as the UK government, The Guardian and Amazon – the last of which began moving its content delivery over to Fastly in earnest in mid-2020, according to reports. “Because it’s used by high-profile sites, when they have an issue, we notice it straight away,” says Andy Davies, an independent web performance consultant. “It’s widely-used under the hood and it’s a piece of the internet most people don’t think about.”

Like any CDN, Fastly acts as a service provider, bringing content hosted in one place closer to internet users in another. Take, for instance, a website with servers located in New York. People near New York will reach those servers quickly and the content will load fast. People thousands of miles away – in the UK, for example – will find the page takes longer to load. This is a problem, especially for video streaming.

CDNs solve this through physical infrastructure called edge servers: powerful computers that sit on the “edge” of networks, in countries close to users, where data computation needs to happen. “The edge is the bit of the internet that sits between the cloud and users,” says David Grunwald, a digital strategy consultant. “It’s the notion of storing processing and content data towards where users are, as opposed to in very remote data centres. CDNs will typically pick up the most used content and cache it closer to population centres.”
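The caching Grunwald describes can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Fastly or any real CDN is built: the `EdgeCache` class, the `new_york_origin` function and the tiny capacity are all invented for illustration, and a real edge server would also handle expiry, invalidation and much else.

```python
from collections import OrderedDict
from typing import Callable


class EdgeCache:
    """Toy edge server: serve a cached copy if present, else ask the origin."""

    def __init__(self, fetch_from_origin: Callable[[str], str], capacity: int = 2):
        # fetch_from_origin is a hypothetical stand-in for the slow,
        # long-distance request to the website's own servers.
        self.fetch_from_origin = fetch_from_origin
        self.capacity = capacity
        self.store = OrderedDict()  # path -> content, oldest entry first
        self.origin_requests = 0    # how often we had to go back to the origin

    def get(self, path: str) -> str:
        if path in self.store:
            # Cache hit: answer locally and mark the entry as recently used.
            self.store.move_to_end(path)
            return self.store[path]
        # Cache miss: make the slow trip once, then keep a copy nearby.
        self.origin_requests += 1
        body = self.fetch_from_origin(path)
        self.store[path] = body
        if len(self.store) > self.capacity:
            # Out of room: drop the least recently used entry.
            self.store.popitem(last=False)
        return body


def new_york_origin(path: str) -> str:
    """Hypothetical origin server located far from the user."""
    return f"content of {path}"


london_edge = EdgeCache(new_york_origin, capacity=2)
first = london_edge.get("/home")   # miss: fetched from New York
second = london_edge.get("/home")  # hit: served from the edge
london_edge.get("/news")           # miss: a second trip to the origin
```

In this sketch the second request for `/home` never leaves the edge, which is the whole point of caching popular content near users. It also hints at the flip side: if the edge layer itself stops answering, users cannot reach the content even though the origin is healthy.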

But the issue here wasn’t that data was simply delayed. It wasn’t being transferred at all. Our growing demand for fast-loading websites and smooth-running access to the internet has meant relying on third-party infrastructure providers, who can take down an entire chain if they go offline. This time, the scope of the outage was huge: Reddit, Stack Overflow, Twitch, GitHub, Amazon, PayPal, Shopify, HMRC, eBay and most news organisations were down. Next time, it could be worse. “The internet wasn’t designed in its earliest incarnation to deliver huge loads of data, and therefore the need to store it on the edge has risen in a big way in the last few years,” says Grunwald.

Ninety-nine per cent of the time, CDNs work without a hiccup, Davies points out. “They serve billions and billions of page views without a problem,” he says. “It’s only when it does go wrong, we wake up and realise there’s a problem.” Concentration of infrastructure is often done with the aim of making it more resilient, he says, “but by having more resilience and failsafe options we make it more complex, which can make it more likely to fail”.

Yet Davies says that centralisation is, most of the time, no bad thing. He compares it to a chicken and egg situation: if CDNs are too small, they struggle to deliver the level of service that has become expected. “And concentration allows them to invest and to work on emerging web standards like HTTP/3, [a new version of the protocol used to transfer information online] which will make people’s experiences better.”

But it does mean that the barrier to entry in the internet infrastructure market gets higher as centralisation increases and technical capabilities become more concentrated in the hands of the incumbents. The more sites that are hosted on, for instance, Amazon Web Services, the more sites a failure at that single point will affect. There are ways to avoid that, however: websites can host mirrors of their sites in more than one region – for instance, in Dublin to serve western Europe, and Frankfurt as a failsafe or West Virginia at a pinch. It is also possible to spread a website’s hosting and content across multiple CDNs, so if one fails, another can pick up the slack. But few companies want to spend money doing this.
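The multi-CDN idea can be illustrated with a short Python sketch. Everything named here is hypothetical: `cdn_a` and `cdn_b` are stand-in functions rather than real providers, and `ServiceUnavailable` merely mimics an “Error 503” response. In practice this kind of switching is usually handled at the DNS or load-balancer level rather than in application code, so treat it as an outline of the logic only.

```python
from typing import Callable, List, Tuple


class ServiceUnavailable(Exception):
    """Stands in for an HTTP 503 response from a provider."""


def fetch_with_failover(
    path: str,
    providers: List[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each provider in order; return (provider name, content) from the first that answers."""
    errors = []
    for name, fetch in providers:
        try:
            return name, fetch(path)
        except ServiceUnavailable as exc:
            # This provider is down: note the failure and try the next one.
            errors.append(f"{name}: {exc}")
    # Every provider failed, so the outage reaches the user after all.
    raise ServiceUnavailable("all providers failed: " + "; ".join(errors))


def cdn_a(path: str) -> str:
    """Hypothetical primary CDN, currently suffering an outage."""
    raise ServiceUnavailable("503 service unavailable")


def cdn_b(path: str) -> str:
    """Hypothetical secondary CDN holding a copy of the same content."""
    return f"content of {path}"


served_by, body = fetch_with_failover("/index.html", [("cdn_a", cdn_a), ("cdn_b", cdn_b)])
```

Here the request falls through to the second provider when the first returns an error. The cost the article mentions is visible even in the toy version: the site has to pay for, and keep content in sync across, a second provider that sits idle most of the time.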

When it comes to the bigger issue, there’s no fix. “Part of this is on the demand side,” says Corinne Cath-Speth from the Oxford Internet Institute. “We’ve grown so accustomed to having content at our fingertips at any given time, we’ve also lost the patience from the early internet.”

Others see it differently and say the time is ripe to make a change. “We are at a fantastic moment to address this,” says Niels Ten Oever, a postdoctoral researcher in internet infrastructure at the University of Amsterdam. “It turns out engineers only thought about technology, but didn’t take the economy into account,” he says. Consolidation is occurring at every level of the internet’s infrastructure, with users increasingly dependent on a small number of firms. “It’s a prime example of consolidating unchecked power,” he says.

We should expect further outages – but things could soon change. Over the last decade, private companies have started to control more and more of the internet and what we do on it. “Increasingly, civil society and government have shown unease with that,” says Ten Oever. Whether we’re willing to sacrifice convenience to change the infrastructure that causes occasional but terrifying mass outages is another question.


This article was originally published by WIRED UK