Want to build a perfect website? Don't trust the designers

This article was taken from the June 2012 issue of Wired magazine.

Dan Siroker helps companies discover tiny truths, but his story begins with a lie.

It was November 2007 and Barack Obama, then a Democratic candidate for president of the US, was at Google's headquarters in Mountain View, California, to speak. Siroker -- who today is CEO of the web-testing firm Optimizely, but then was a product manager on Google's browser team -- tried to cut the enormous queue by sneaking in a back entrance. "I walked up to the security guard and said, 'I have to get to a meeting in there,'" Siroker recalls. There was no meeting, but his bluff got him inside.

At the talk, Obama fielded a facetious question from then-CEO Eric Schmidt: "What is the most efficient way to sort a million 32-bit integers?" Schmidt was having a bit of fun, but before he could move on to a real question, Obama stopped him. "Well, I think the bubble sort would be the wrong way to go," he said -- correctly.
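
Obama's one-liner points at a real piece of computer science: bubble sort takes on the order of n² comparisons, while fixed-width keys such as 32-bit integers can be sorted in linear time. As a rough illustration (one textbook answer among several, not anything discussed at the talk), a least-significant-digit radix sort handles a million integers in four byte-wise passes:

```python
import random

def radix_sort_u32(values):
    """Sort unsigned 32-bit integers with four byte-wise passes (LSD radix sort)."""
    for shift in (0, 8, 16, 24):              # one pass per byte, least significant first
        buckets = [[] for _ in range(256)]
        for v in values:
            buckets[(v >> shift) & 0xFF].append(v)
        values = [v for bucket in buckets for v in bucket]
    return values

data = [random.getrandbits(32) for _ in range(1_000_000)]
assert radix_sort_u32(data) == sorted(data)   # linear passes, no O(n^2) bubble sort needed
```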

Schmidt put his hand to his forehead in disbelief, and the room erupted in raucous applause. Siroker was instantly smitten. "He had me at 'bubble sort'," he says. Two weeks later he had taken a leave of absence from Google, moved to Chicago, and joined up with Obama's campaign as a digital adviser.

At first he wasn't sure how he could help. But he recalled something else Obama had said to the Googlers: "I am a big believer in reason and facts and evidence and science and feedback -- everything that allows you to do what you do. That's what we should be doing in our government." And so Siroker decided he would introduce Obama's campaign to a crucial technique -- almost a governing ethos -- on which Google relies in developing and refining its products. He showed them how to A/B test.

Over the past decade, the power of A/B testing has become an open secret of high-stakes web development. It's now the standard (but seldom advertised) means through which Silicon Valley improves its online products. Using A/B, new ideas can be essentially focus-group tested in real time: without being told, a fraction of users are diverted to a slightly different version of a given web page and their behaviour compared against the mass of users on the standard site. If the new version proves superior -- gaining more clicks, longer visits, more purchases -- it will displace the original; if the new version is inferior, it's quietly phased out without most users ever seeing it. A/B allows seemingly subjective questions of design -- colour, layout, image selection, text -- to become incontrovertible matters of data-driven social science.
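
The "diverted fraction" is typically implemented by hashing a stable user identifier into a bucket, so a given visitor sees the same variant on every visit while the split across all visitors stays at the chosen percentage. A minimal sketch of the idea (the scheme here is a generic illustration, not any particular vendor's implementation):

```python
import hashlib

def assign_variant(user_id: str, experiment: str, b_fraction: float = 0.1) -> str:
    """Deterministically bucket a user: the same visitor always sees the same page."""
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000         # stable value in [0, 10000)
    return "B" if bucket < b_fraction * 10_000 else "A"

# Example: quietly send 10 percent of traffic to the candidate page.
print(assign_variant("visitor-42", "signup-button-copy"))   # "A" or "B", stable per user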

After joining the Obama campaign, Siroker used A/B to rethink the basic elements of the campaign website. The new-media team already knew that their greatest challenge was turning the site's visitors into subscribers -- scoring an email address so that a drumbeat of campaign emails might eventually convert them into donors. Their visit would start with a splash page -- a luminous turquoise photo of Obama and a bright red "Sign Up" button. But too few people clicked the button.

Under Siroker's tutelage, the team approached the problem with a new precision. They broke the page into its component parts and prepared a handful of alternatives for each. For the button, an A/B test of three new word choices -- "Learn More", "Join Us Now", and "Sign Up Now" -- revealed that "Learn More" garnered 18.6 percent more signups per visitor than the default of "Sign Up". Similarly, a black-and-white photo of the Obama family outperformed the default turquoise image by 13.1 percent. Using both the family image and "Learn More", signups increased by a thundering 40 percent.

Most shocking of all to Obama's team was just how poorly their instincts served them during the test.

Almost unanimously, staffers expected that a video of Obama speaking at a rally would handily outperform any still photo. But in fact the video fared 30.3 percent worse than even the turquoise image. Had the team listened to instinct -- if it had kept "Sign Up" as the button text and swapped out the photo for the video -- the sign-up rate would have slipped to 70 percent of the baseline. ("Assumptions tend to be wrong," as Siroker succinctly puts it.)

And without the rigorous data collection and controls of A/B testing, the team might not even have known why their numbers had fallen, chalking it up perhaps to some decline in enthusiasm for the candidate rather than to the inferior site revamp. Instead, when the rate jumped to 140 percent of baseline, the team knew exactly what, and whom, to thank. By the end of the campaign, it was estimated that a full four million of the 13 million addresses in the campaign's email list -- and some $75 million (£50 million) in money raised -- resulted from Siroker's careful experiments.
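
The figures fit together with straightforward arithmetic: a 30.3 percent drop leaves roughly 70 percent of the baseline sign-up rate, and the winning combination's 40 percent gain is the 140 percent figure. Notably, if the two individual improvements had simply multiplied -- an independence assumption of mine, not something the campaign claimed -- you would predict only about a 34 percent combined lift, so the measured 40 percent was a little better than the sum of its parts:

```python
copy_lift  = 1.186    # "Learn More" beat "Sign Up" by 18.6 percent
image_lift = 1.131    # family photo beat the turquoise image by 13.1 percent
video_drop = 0.303    # the rally video fared 30.3 percent worse

print(round(1 - video_drop, 3))             # 0.697 -> roughly 70 percent of baseline
print(round(copy_lift * image_lift, 3))     # 1.341 -> naive prediction if effects multiplied
print(1 + 0.40)                             # 1.4   -> the combined lift actually measured
```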

A/B testing was a new insight in the realm of politics, but its use on the web dates back at least to the turn of the millennium. At Google -- whose rise as a Silicon Valley powerhouse has done more than anything else to spread the A/B gospel over the past decade -- engineers ran their first A/B test on February 27, 2000. They had often wondered whether the number of results the search engine displayed per page, which then (as now) defaulted to ten, was optimal for users. So they ran an experiment. To 0.1 percent of the search engine's traffic, they presented 20 results per page; another 0.1 percent saw 25 results, and another, 30.

Due to a technical glitch, the experiment was a disaster. The pages viewed by the experimental groups loaded significantly slower than the control did, causing the relevant metrics to tank. But that in itself yielded a critical insight -- tenths of a second could make or break user satisfaction in a precisely quantifiable way. Soon Google tweaked its response times and allowed real A/B testing to blossom. In 2011 the company ran more than 7,000 A/B tests on its search algorithm. Amazon.com, Netflix, and eBay are also A/B addicts, constantly testing potential site changes on live (and unsuspecting) users.

Today, A/B is ubiquitous, and one of the strange consequences of that ubiquity is that the way we think about the web has become increasingly outdated. We talk about the Google home page or the Amazon checkout screen, but it's now more accurate to say that you visited a Google home page, an Amazon checkout screen. What percentage of Google users are getting some kind of "experimental" page or results when they initiate a search? Google employees contacted by Wired wouldn't give a precise answer -- "decent," chuckles Scott Huffman, who oversees testing on Google Search. Use of a technique called multivariate testing, in which myriad A/B tests essentially run simultaneously in as many combinations as possible, means that the percentage of users getting some kind of tweak may well approach 100 percent, making "the Google search experience" a sort of Platonic ideal: never encountered directly but glimpsed only through derivations and variations.
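
Multivariate testing is why the fraction of users in some experiment approaches 100 percent: cross a handful of page elements, each with a few candidate treatments, and the number of distinct page variants is the product of the option counts. A sketch of the combinatorics (the elements and options here are hypothetical):

```python
from itertools import product

# Hypothetical page elements, each with a few candidate treatments.
options = {
    "headline":   ["A", "B", "C"],
    "hero_image": ["photo", "video"],
    "button":     ["Sign Up", "Learn More", "Join Us Now"],
    "colour":     ["red", "blue"],
}

variants = list(product(*options.values()))
print(len(variants))            # 3 * 2 * 3 * 2 = 36 distinct page combinations
for combo in variants[:3]:      # a peek at the first few cells of the grid
    print(dict(zip(options, combo)))
```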


Still, despite its widening prevalence, the technique is not simple. It takes some fancy technological footwork to divert user traffic and rearrange a site on the fly; segmenting users and making sense of the results requires deep knowledge of statistics. This is a barrier for any firm that lacks the resources to create and adjudicate its own tests. In 2006 Google released its Website Optimizer, which provided a free tool for anyone who wanted to run A/B tests. But the tool required site designers to create full sets of code for both A and B -- meaning that non-programmers (marketing, editorial or product people) couldn't run tests without first taxing their engineers to write multiple versions of everything. Consequently there was a huge delay in getting results as companies waited for the code to be written and go live.

In 2009 this remained a problem in need of a solution. After the Obama campaign ended, Siroker was left amazed at the efficacy of A/B testing but dismayed by the paucity of tools that made it easily accessible. "The thought of using the tools we used then made me grimace," he says. By the end of the year, Siroker had joined forces with another ex-Googler, Pete Koomen, and the pair launched a startup with the goal of bringing A/B tools to the corporate masses, dubbing it Optimizely. They signed up their first customer by accident. "Before we even spent much time working on the product," Siroker explains, "I called up one of the guys from the Obama campaign, who had started up a digital marketing firm. I told him what I was up to, and about 20 minutes in, he suddenly said, 'Well, that sounds great. Send me an invoice.' He thought it was a sales call."

The pair had made a sale, but they still didn't have a product. So Siroker and Koomen started coding. Unlike the earlier A/B tools, they designed Optimizely to be usable by non-programmers, with a powerful graphical interface that lets clients drag, resize, retype, replace, insert and delete on the fly. Then it tracks user behaviour and delivers results. It's an intuitive platform that offers the A/B experience, previously the sole province of web giants such as Google and Amazon, to small and midsize companies -- even ones without a hardcore engineering or testing team. What this means goes way beyond just a nimbler approach to site design. By subjecting all these decisions to the rule of data, A/B tends to shift the whole operating philosophy -- even the power structure -- of companies that adopt it. A/B is revolutionising the way that firms develop websites and, in the process, rewriting some of the fundamental rules of business.

Here are some of these new principles:

Choose everything

The online payment platform WePay designed its home page through a testing process. "We did it as a contest," CEO Bill Clerico says. "A few of our engineers built different home pages, and we put them in rotation." For two months, every user that came to WePay.com was randomly assigned a home page, and at the end the numbers made the decision.

In the past, that exercise would have been impossible -- and because it was impossible, the design would have emerged in a completely different way. Someone in the company, perhaps Clerico himself, would have wound up choosing a design. But with A/B testing, WePay didn't have to make a decision.

After all, if you can test everything, then simply choose all of the above and let the customers sort it out.

For that same reason, A/B increasingly makes meetings irrelevant. Where editors at a news site, for example, might have sat around a table for 15 minutes trying to decide on the best phrasing for an important headline, they can simply run all the proposed headlines and let the testing decide. Consensus, even democracy, has been replaced by pluralism -- resolved by data.

The mantra of "choose everything" also becomes a way for companies to test out relationships with other companies -- and in so doing becomes a powerful way for them to win new business and take on larger rivals. In 2011 a fund-raising site called GoFundMe was talking to WePay about the possibility of switching to its service from payment giant PayPal. GoFundMe CEO Brad Damphousse was open about his dissatisfaction with PayPal's service; WePay responded, as startups usually do, by claiming that its product solved all the problems that plagued its larger competitor. "Of course we were sceptical and didn't really believe them," Damphousse recalls with a laugh.

But harnessing the power of A/B testing, WePay could present Damphousse with an irresistible proposition: give us ten percent of your traffic and test the results against PayPal in real time. It was an almost entirely risk-free way for the startup to prove itself, and it paid off.

After Damphousse saw the data on the first morning, he switched half his traffic by the afternoon -- and all of it by the next day.

Data makes the call

Google insiders, and A/B enthusiasts more generally, have a derisive term to describe a decision-making system that fails to put data at its heart: HiPPO -- "highest-paid person's opinion". As Google analytics expert Avinash Kaushik declares, "most websites suck because HiPPOs create them".

Tech circles are rife with stories of the clueless boss. In Amazon's early days, developer Greg Linden came up with the idea of giving personalised "impulse buy" recommendations to customers as they checked out, based on what was in their shopping cart. But his feature was shot down at the demo. "I was told I was forbidden to work on this any further. It should have stopped there."

Instead, Linden worked up an A/B test. It showed that Amazon stood to gain so much revenue from the feature that all arguments against it were rendered null by the data. "In some organisations, challenging a senior vice-president would be a fatal mistake, right or wrong," Linden wrote in a blog post on the subject. But once he'd put the idea in front of real customers, the higher-ups had to bend.

Siroker recalls similar shifts during his time with the Obama campaign. "It started as a pretty political environment -- where, as you can imagine, HiPPO syndrome reigned supreme. I think over time people started to see the value in taking a step back and saying, 'Here's three things we should try. Let's run an experiment.'"

This was the culture that he had come from at Google, what you might call a democracy of data. "Very early in Google's inception," Siroker explains, "if an engineer had an idea and had the data to back it up, it didn't matter that they weren't the VP of some business unit. They could make a case. And that's the culture that Google believed in from the beginning."

Once adopted, that approach will beat the HiPPOs every time, he says. "A/B will empower a whole class of businesses to say, 'We want to do it the way Google does it. We want to do it the way Amazon does it.' "

Says WePay's Bill Clerico: "On Facebook, under 'Religious Views', my profile says: 'In God we trust. All others, bring data.'"

The risk is making only tiny improvements

One consequence of this data-driven revolution is that the whole attitude towards writing software, or even imagining it, becomes subtly constrained. A number of developers explained that A/B has probably reduced the number of big, dramatic changes to their products. They now think of wholesale revisions as simply too risky -- instead, they want to break every idea up into smaller pieces, with each piece tested and then gradually, tentatively phased into the traffic.

But this approach, and the mindset that comes with it, has its own dangers. Companies may protect themselves against major gaffes but risk a kind of plodding incrementalism. They may find themselves chasing "local maxima" -- places where the A/B tests might create the best possible outcome within narrow constraints -- instead of pursuing real breakthroughs. Google's Scott Huffman cites this as one of the greatest dangers of a testing-oriented mentality: "One thing we spend a lot of time talking about is how we can guard against incrementalism when bigger changes are needed. It's tough, because these testing tools can really motivate the engineering team, but they also can wind up giving them huge incentives to try only small changes. We do want those little improvements, but we also want the jumps outside the box."

Paraphrasing a famous Henry Ford maxim -- "If I'd asked my customers what they wanted, they'd have said a faster horse" -- Huffman adds, "If you rely too much on the data, you never branch out. You just keep making better buggy whips."

Data can make the very idea of lessons obsolete

The biggest evolution in A/B testing is not how pervasive it has become, but how fast it has become. In the early '00s, test results typically took 24 hours. This might explain why testing began in marketing teams before it moved to product teams: ads generally stick around for days and weeks at a time.

That's all different today. "Ten years ago you did not have data. Five years ago the best reporting tools were a day behind," says Yulie Kim, VP of product at the furniture e-tailer One Kings Lane. "But we're in a world where you can't wait a whole day to get your data." "Big data is not enough," adds Kim's boss, CEO Doug Mack. "It has to be real-time data that we can act on."

The difference with live testing is not just that there is no time to learn and apply lessons. It's that there are no clear lessons or rules.

At the gaming network IGN, for example, executives found that crisp, clear prose was outperforming hyped-up buzzwords (such as "free" and "exclusive") on certain parts of the home page. But in previous years, the opposite had been true.

Why? No one could figure it out. Then they realised that it simply didn't matter.

If you find that last implication somewhat troubling, you're not alone. Even if we accept that testing is useful for learning how to run a business, it's harder to accept the idea that we might stop learning anything at all. One of the burgeoning trends in A/B is to automate the whole process of adjudicating the test, so that the software, when it finds statistical significance, simply diverts all traffic to the better-performing option -- no humans required.
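
The automated adjudication described here can be as simple as repeatedly running a two-proportion z-test and diverting all traffic once the difference clears a significance threshold. A hedged sketch (the 1.96 cutoff and the stopping rule are my assumptions; in practice, naively "peeking" at a running test like this inflates false positives):

```python
from math import sqrt

def z_score(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test for conversion counts conv_* out of n_* visitors."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

def adjudicate(conv_a, n_a, conv_b, n_b, threshold=1.96):
    """Divert all traffic once the difference is significant at roughly 95 percent."""
    z = z_score(conv_a, n_a, conv_b, n_b)
    if z > threshold:
        return "send all traffic to B"
    if z < -threshold:
        return "send all traffic to A"
    return "keep testing"

# Example: B converts 16 percent of 1,000 visitors against A's 12 percent (z ~ 2.6).
print(adjudicate(conv_a=120, n_a=1000, conv_b=160, n_b=1000))   # "send all traffic to B"
```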

The culture of A/B cuts against our common-sense ideas about how innovation happens. Startups, we imagine, largely succeed or fail by long-term strategic decisions that are impossible to test.

Yes, Google built its empire by listening to data, but we reserve our awe for the sort of vision that Steve Jobs brought to Apple. When asked how much market testing he did for the iPad, he said, "None... It's not the consumers' job to know what they want."

It's a false dichotomy, of course, to pose vision against data, lofty genius against experimentation, as if companies are forced to choose between the two. Google doesn't test at random but relies on intuition and vision to narrow down the infinite possible changes to a finite group of testable candidates.

But it's also true that the A/B culture, in part by shaming its HiPPOs into submission, can sometimes lead companies down dead-end paths. Testing allows you to constantly react to user preferences, but 10,000 ongoing tweaks don't add up to a fundamental change of direction when one is needed. And it can make it hard to stop sweating the small stuff. "I had a recent debate over whether a border should be three, four or five pixels wide, and was asked to prove my case," wrote ex-Google designer Douglas Bowman on his blog the day he left the company. "I can't operate in an environment like that."

So, could the A/B ethos start to make waves in the offline world? Some major retailers are embracing the experimental method. Chains will test out store floor plans in a few locations and then implement them nationwide if they boost revenues.

But the constraints of physical reality make it hard to experiment as often, or to control one's experiments so that the outcomes aren't maddeningly ambiguous. Only in the digital realm is it possible to be two different things at the same place and time.

Many web workers now look with pity on the offline world. At one Silicon Valley office, I overheard an employee complain that dating can't be A/B tested; an online profile can, to be sure, but once you're in a relationship with a specific person, 100 percent of the "traffic" is on the line with every decision.

The testable web is so much safer.

No choices are hard, and no introspection is necessary. Why is B better than A? Who can say? We can only shrug: we went with B. We don't know why. It just works.

Brian Christian is author of The Most Human Human: What Artificial Intelligence Teaches Us About Being Alive (Viking).


This article was originally published by WIRED UK