This article was taken from the August 2012 issue of Wired magazine.
The 2012 Strata Data Science Conference, held in February in Santa Clara, California, is the kind of place where conversations start with, "What's your index time?" Cocktail hour features custom-made drinks called Hadoopery Hooch, Alcohol-Stat and Numbers Numb-er. The conference's hottest swag is a black button with white letters that reads, "My, what big data you have." Those attending are glad to be among their own kind. "There's only so long you can talk to your spouse about data before you end up on the couch," one of them says. "It reminds me of the 90s," says Mike Bowles, a former MIT professor of aeronautical engineering who, under the rubric Hacker Dojo, now teaches courses in data mining to working professionals. "That was an exciting time for the internet, and this is an exciting time for big data. The enthusiasm [here] is palpable."
The second annual Strata -- known to many attendees as "Datastock" -- marks the moment when data scientists are emerging into the sunlight as members of tech's hottest profession; pioneers in what's being described as "the age of big data".
A 2011 study called Extracting Value From Chaos by John Gantz and David Reinsel reported that the volume of the world's information more than doubles every two years. Parsing meaning from these vast mountains of data has become tech's new obsession: business now views data as a raw material -- an economic input on a par with capital and labour. International Data Corporation estimates that a billion connected devices will ship this year, with that number set to double by 2016 -- but all the data flowing from them, rich with indications of users' preferences, location and behaviour, is worthless unless it can be interpreted. What's really valuable isn't the data, of course: it's the ability to extract meaning from it. "The notion of a professional data wrangler or data manipulator, the other half of a machine-learning system as a full-time job, has emerged very recently," says Max Levchin, a Silicon Valley investor and entrepreneur, and the cofounder of PayPal, on the phone from his office in San Francisco. "In the past, if you were a good coder, you delved into machine learning a little bit, or you were a good modeller; it was enough. Now it's not any more, and the entire driver of that has been the availability of data."
Six years ago, Clive Humby, creator of Tesco's Clubcard, was quoted as saying: "Data is the new oil" -- implying that its value lies in refining the crude source. At Strata, the phrase is almost a cliché. Futurist and Silicon Valley stalwart Tim O'Reilly, whose O'Reilly Media puts on Strata, buzzes about the exhibition floor dipping in and out of conversations. He says the data universe continues to surprise him. "When I first saw data starting to dominate," he says, "I didn't think about the companies that could emerge. Or how mobile was going to explode. All these data subsystems are starting to coalesce into this operating system that we all work with. There's more data than I thought would ever be possible."
"Last year, I theorised that data would be the foundation for Web 3.0," Reid Hoffman, the founder of LinkedIn and an investor in many high-profile tech companies, tells WIRED. "Essentially, new services will build [systems] for navigating our lives through aggregate data: from explicit data we input to social networks, from implicit data from mobile phones and activity, and from analytic data created from explicit and implicit data. These services will help us navigate our lives better: from the physical world (examples: driving and walking), to the entertainment world (examples: which books and movies), to the career world (examples: which information and which opportunities). New Web 3.0 products will come both from existing companies such as LinkedIn and Twitter and from new companies."
On the morning of March 1, dozens of data scientists, most under 40, gather in the Strata speakers' lounge at the conference centre, finalising talks on topics such as "Democratisation of Data Platforms", "Decoding the Great American ZIP myth" and "Embrace the Chaos".
Jeremy Howard, when not working on the presentation he is to give that morning ("From Predictive Modelling to Optimisation: The Next Frontier"), walks about in his orange Vans and a hoodie that reads "Data Science" on the back. He grins impishly, engaging everyone, introducing people and enjoying the world he's helped create. Howard is the president and chief scientist of Kaggle, the "leading platform for predictive modelling competitions", where users compete to solve data problems.
"Data scientists are people who have been hacking away for years," he says. "Now we're coming together under a banner. It hasn't grown much. We've just found one another."
Howard and his fellow data scientists sit around a table. Hal Varian approaches with his breakfast plate. A respectful hush falls. Varian, professor emeritus at Berkeley and chief economist at Google, has been working with large data sets since before many of this crop of data scientists were born. Howard doesn't seem intimidated. "Hey, Hal," he asks. "Can you give a demonstration of Google Correlate?"
Correlate is a relatively new public feature from Google that, in the company's words, "finds search patterns which correspond with real-world trends". Search terms vary in popularity over time, and Correlate matches a term with other terms whose popularity follows a similar pattern. Data scientists are delighted by this kind of pattern tracking; it helped Google build its popular and useful Google Flu Trends app, which uses search trends to help doctors and researchers track influenza outbreaks quickly and accurately.
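For the curious, the idea behind Correlate can be sketched in a few lines. This is not Google's actual implementation, and the numbers below are invented purely for illustration: it is simply a Pearson correlation between two weekly search-volume series.

```python
import numpy as np

# Hypothetical weekly search-volume series (normalised counts) for two terms.
# In a Correlate-style system these would come from aggregated query logs.
eric_schmidt = np.array([12, 15, 14, 22, 30, 28, 25, 40, 38, 35, 30, 27], dtype=float)
starbucks_size = np.array([10, 13, 12, 20, 27, 26, 22, 37, 36, 33, 28, 25], dtype=float)

# Pearson correlation: how closely the two popularity curves move together.
r = np.corrcoef(eric_schmidt, starbucks_size)[0, 1]
print(f"correlation: {r:.2f}")  # values near 1.0 mean the trends are nearly identical
```

A system like Correlate computes this coefficient between the query you type and millions of other terms, then returns the terms whose curves match best, which is why an unrelated phrase with a coincidentally similar popularity curve can surface near the top.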
Varian seems glad to have a receptive audience that understands how Correlate works. Some of the top minds in the industry gather around. He opens his laptop, and types "Eric Schmidt" into the Google Correlate search bar.
The first result, with a correlation of more than 0.89, is "Schmidt Google". Not surprising, since Schmidt is Google's executive chairman. The next terms are equally mundane: "Eric Schmidt Google", "Google CEO", "Google CEO Eric Schmidt" and so on. Then, towards the bottom, with a 0.61 correlation, is "Starbucks size", followed by "male yeast infection". The table explodes in laughter. "I don't know what that could mean," Varian says.
The group starts speculating. As it turns out, there's a Dr Eric Schmidt -- no relation -- in California who specialises in treating male yeast infections. But Google Correlate didn't know that. It could only track trends in the data.
At the bottom of the stack, with a 0.60 correlation number, is "disco fries". How in the world does it relate to Eric Schmidt? It's up to data science to determine. Howard beams with pleasure. He loves this sort of question. To him, data science contains a multitude of possibilities. It's continually revealing things about the world that we didn't know before. "It's at the heart of so much that we do today," he says. The US insurance company Allstate, for instance, used a Kaggle competition to improve its actuarial model by 340 percent, and Google used data science to help to develop its self-driving car, which, Howard says, "is just a whole bunch of predictive models working in parallel".
Kaggle is running the Heritage Health Prize, sponsored by California healthcare company Heritage Provider Network. Heritage wants to be able to identify the US patients most likely to be admitted to a hospital within the next year, and the length of their stay, and is offering $3 million (£1.86 million) to the winning entrant. The healthcare industry in the US wastes up to $30 billion a year on unnecessary hospitalisations. Data analysis, says Jonathan Gluck, senior executive and counsel at Heritage, will help to tease out certain factors that may previously have been overlooked. "Doctors' intuition is great, but there's too much information," he says. "We're trying to take those people who are helping Google to figure out which restaurants are good in your area, or those people who are helping Netflix to recommend which movies you should be watching, and doing something that will benefit society."
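As a rough illustration of the kind of model such a competition invites (not Heritage's actual data format or Kaggle's scoring rules; every column, coefficient and number below is hypothetical), a minimal sketch might train a classifier on synthetic patient features and check how well it ranks patients by admission risk.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000

# Toy patient features: age, claims filed last year, number of chronic conditions.
X = np.column_stack([
    rng.integers(18, 90, n),   # age in years
    rng.poisson(3, n),         # claims in the previous year
    rng.integers(0, 5, n),     # chronic conditions
])

# Synthetic label: admitted to hospital within the next year (1) or not (0),
# generated so that older, sicker, higher-claim patients are more likely to be admitted.
logits = 0.03 * X[:, 0] + 0.4 * X[:, 1] + 0.6 * X[:, 2] - 5
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```

The real competition hinges on far richer claims histories and careful feature engineering, but the shape of the task is the same: learn from past records, predict who ends up in hospital.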
Another Kaggle user is Pete Warden. He started a company called Jetpac, which allows people to share their travel photos. Yet many users were finding that its discovery process led them down useless rabbit-holes. So Warden deployed a $5,000 Kaggle competition to try to data-determine, through analysing words in captions, which photos were "inspiring" people to travel. He received more than 400 entries. "The reason we know it's working is that no one asks us about bad photos any more," Warden says. "It's amazing just how far this tool kit can take us."
Data, Howard says, "doesn't leave any room for bullshit. Most of the data is telling us stuff that we already know. But once in a while, it'll reveal something new to us. People will argue, but you can show them the data. It doesn't lie."
In the beginning, there was data, but it was hard to understand. A statistician named John Tukey -- who, among other things, is credited with coming up with the term "bit" while working with John von Neumann on early computer designs in the late 40s -- promoted a system called Exploratory Data Analysis. This argued that large, complex data sets could be summarised simply, using explanatory charts and graphs. In other words, statistics weren't just numbers games that existed for their own sake. They had potential real-world applications and should be evaluated based on the stories they could tell. In 1972, Tukey developed a computer program called PRIM-9 (an acronym for picturing, rotation, isolation and masking of data in "up to nine dimensions"). It was well ahead of its time, allowing users to rotate and project multidimensional data and explore it in up to nine dimensions at once.
For decades, that was all data scientists had to go on as they wandered in the numbers wilderness.
Howard got his start more than 20 years ago, first as an analytical specialist at the management consultancy McKinsey, and later at big retail banks and insurance companies in his native Australia. "There was big data going on," he says. "They had tens of millions of customers, filled warehouses with data and spent shitloads of money." But he found the profession lonely: "When I started at McKinsey, I was it. I invented the position."
Gradually, the industry began to shift. When Howard started, he needed a room full of machines to analyse data. As the price of computing dropped, programmers began to develop open-source data-storage software. The arrival of the internet meant that companies had more and more data that they needed to store and analyse. Big financial institutions were no longer alone in their need to crunch numbers -- from retail corporations to public-health nonprofits, data belonged to and mattered to everyone.
A new field was created in 2001, when the term "data science" was first used in a paper by statistician William Cleveland, Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics. Cleveland decided to rename the field, he said, because "the plan is ambitious and requires substantial change". Not long after, Tim O'Reilly spotted the data trend. More than a decade ago, he was already evangelising data, calling it the "Intel Inside" of Web 2.0. "One VC said to me, 'Will you stop talking about this?'" O'Reilly says. "But I couldn't help myself. It's almost like talking about your kids. I was doing it by pattern recognition. You just saw more and more people playing with data."
Data science, O'Reilly realised, was becoming the essential new field in every industry. "It's really the currency of the future," he says. "We're just at the beginning of what the data economy will look like."
O'Reilly wanted to get his media company into the data business as soon as he could. He hired Roger Magoulas as his director of research. Magoulas had designed and implemented data-warehouse projects in the 90s, long before they were popular. When he started at O'Reilly, he was handed a database of hundreds of millions of jobs, ranging from Best Buy salesperson to barista. His task was to mine the data to tease technology trends out of these job descriptions. O'Reilly Media had all the data stored on two standard CPUs ("the kind that look like a pizza box," Magoulas says) using the MySQL database-management system. "It was slow," he adds.
Companies had begun to spring up that allowed people to crunch data more quickly, but they were too expensive for small companies such as O'Reilly Media to use. To fill the gap, free open-source systems such as Hadoop emerged, in Magoulas's words, to "democratise" the data world. O'Reilly switched to Greenplum, whose system spread the data across 12 servers working in parallel. Queries that had taken Magoulas and his team ten hours were suddenly being processed within six minutes.
Reality shifted almost overnight.
Magoulas and his colleagues started calling this their "big data" project. "The feedback we've received suggests that we were the first people that used that term extensively," he says. "But I doubt we were the people who actually coined it." They published a paper in 2009 called Big Data: Technologies and Techniques for Large-Scale Data and the term entered the public sphere.
As part of his research, Magoulas visited LinkedIn, where a young chief scientist called DJ Patil was working. At the time, he was an eccentric chaos theoretician running a data-research team at a company still establishing itself. Three years later, in January 2012, Patil would appear on the cover of Fast Company as part of "Generation Flux" -- industry leaders best equipped to survive a business environment that's "pure chaos".
But data was starting to mean more than it once had. "The data guy [used to be] relegated to the back," Patil says. "He wasn't allowed in on the real conversation. It's like having Spock on the bridge -- nobody would have thought that made sense. Now it's like, 'Oh, why isn't Spock in on the conversation?'"
Web 2.0 companies were generating data in petabytes, and they needed specialists to process and understand it. Terms were getting thrown around: "analyst", "business intelligence", "research technician", but none of them made much sense. The lack of clarity was driving personnel departments crazy when it came to hiring people.
Patil says he came up with the term "data scientist" over lunch one day with Jeff Hammerbacher, who was running Facebook's data team. "It was the least offending term," he says. "It's more of a broad thing rather than a niche thing, and it's the term that people would most like when they were looking for jobs. People looked at 'data scientist' and said, 'These guys are on to something. That describes me.'"
But it didn't describe enough people. The ability of the digital universe to produce data far outpaced the ability of experts to mine, analyse and explain it. Essentially, the universe suffered from a shortage of data scientists. Studies began to appear warning of a "looming data-science talent shortage", and that great volumes of essential data stood to be lost forever.
Kaggle stepped into the void.
Anthony Goldbloom is a mild-mannered young econometrician who built models for the Australian treasury and the Reserve Bank of Australia. In 2008, he entered an essay contest sponsored by The Economist, writing about how the subprime-mortgage crisis wasn't as big a deal as people thought.
That was the winning entry, and it gained him an internship at the magazine. As no one else was interested, he chose the big-data beat. "I started interviewing people who were doing the same things I was at the treasury and the bank, but they were using more real-world applications," he says. "I called up and said I was from The Economist. Everybody likes to speak to a journalist."
The executives that Goldbloom talked to told him that data science was one of their top priorities, "but it was clear their application didn't match their ambition," he says. There was a shortage of available talent, and it was hard to know whom to hire. Goldbloom suddenly had his big idea: "To join companies that have data to people who want to muck around with data."
He drew inspiration from the Netflix Prize, a million-dollar competition run by Netflix from 2006 to 2009 that sought an algorithm to improve its movie-recommendation software's accuracy by ten percent. Everyone came away a winner: data junkies got to work on a tough problem in a lively, competitive environment with the possibility of a huge payout, and Netflix got nearly unlimited R&D from some of the world's top minds at a cost of less than £6 an hour. Goldbloom bet that if it worked for Netflix, it would work for everyone.
Goldbloom launched Kaggle in April 2010 with a $1,000 contest for the algorithm that could best predict the winner of the Eurovision Song Contest, a test run whose predictions beat the betting markets and proved that Kaggle's software worked. The first serious Kaggle competition, to predict how genetic markers might affect the viral load of HIV-infected people, followed.
Dozens of people and teams from around the world, none of whom had experience with HIV research, wrote algorithms. "In a week and a half," Goldbloom says, "the best scientific research had been blown out of the water."
Goldbloom was overwhelmed with work, so, in November 2010, Jeremy Howard -- one of the earliest Kaggle users, who had won competitions himself -- joined the firm. "The people who win Kaggle competitions have this amazing mix of tenacity, creativity, open-mindedness, coding skills, software-engineering skills and data-analytical skills," Howard says. "They're these amazing Renaissance people. You can imagine that working with people like that is a real pleasure. When we talk at work together we have a deep respect for one another, because we know that what we do is based on actual results."
Howard had found his people at last.
Here's an example of how data science actually works: dark matter makes up some 83 percent of the matter in the universe. We have no idea what it actually is; it could be an undiscovered type of particle or something else entirely. Regardless, the mystery of dark matter represents the greatest puzzle in cosmology.
One of the ways cosmologists map dark matter is through "gravitational lensing", or measuring the change in the ellipticity of a galaxy's image caused by dark matter's gravitational effects.
For years, researchers at the University of Edinburgh had been trying to use ellipticity measurements to map dark matter, but none of their algorithms worked. They had access to a huge data set. They opened a terabyte of data to a public contest, asking for help. They got a lot of press, but only 20 entries, none successful. Then Kaggle gave them a call. According to Thomas Kitching, a cosmologist and postdoctoral research fellow at the University of Edinburgh, "they said, 'What you're doing is good, but the challenge is too big. It's too hard.'"
Goldbloom and Howard told him that they could make the dark-matter-mapping competition more straightforward and human-scaled, turning it into a forum where people would churn ideas around. "They said they could make it into a sport," Kitching says. He got the British Royal Astronomical Society, Nasa and the European Space Agency on board. Kitching, Howard and Goldbloom then spent months boiling down the huge data set into something more digestible. Competitors would be given 100,000 images of galaxies and then asked to measure the ellipticity of 40,000 of them. "In retrospect, it all sounds kind of obvious," Kitching says. "But because it was the first time, we took months." More than 1,000 people or teams signed up for Kaggle's dark-matter competition.
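For a sense of what the entrants were actually computing, here is a minimal sketch, assuming an idealised, noise-free galaxy image (the real competition data included blurring and noise that entrants had to correct for): ellipticity can be estimated from an image's second moments.

```python
import numpy as np

def ellipticity(image):
    """Estimate the ellipticity components (e1, e2) from unweighted second moments."""
    ny, nx = image.shape
    y, x = np.mgrid[0:ny, 0:nx]
    total = image.sum()
    xc, yc = (image * x).sum() / total, (image * y).sum() / total  # centroid
    # Quadrupole moments about the centroid.
    qxx = (image * (x - xc) ** 2).sum() / total
    qyy = (image * (y - yc) ** 2).sum() / total
    qxy = (image * (x - xc) * (y - yc)).sum() / total
    denom = qxx + qyy
    return (qxx - qyy) / denom, 2 * qxy / denom

# Toy example: an elliptical Gaussian blob, stretched along the x axis.
y, x = np.mgrid[0:64, 0:64]
blob = np.exp(-(((x - 32) / 8.0) ** 2 + ((y - 32) / 5.0) ** 2))
print(ellipticity(blob))  # e1 > 0: the blob is elongated along x
```

Dark matter between us and a distant galaxy bends its light and nudges these numbers slightly; average the nudges over many thousands of galaxies and a map of the invisible mass emerges.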
Martin O'Leary, a PhD candidate in glaciology at Cambridge, was one of the first competitors. The fact that he had no experience in astronomy, nor even in data, didn't intimidate him. He'd been examining amorphous satellite images for years, so he felt like a plausible candidate. "I took a mathematically simple form and just deconvolved them with a bit of algebra," he says. "Science is science. If it doesn't have data in it, it's not science."
His algorithm shot to the top of the Kaggle leaderboard. Two days later, someone surpassed him. He tweaked his equation and moved back to the top. But coming up on the rail was David Kirkby, a professor of physics at the University of California at Irvine. "I had to turn it into a data-science problem rather than an obscure astrophysics problem," says Kirkby, "because the essence of the problem really is data science."
Using brain-mapping software as a basis, Kirkby and his research assistant Daniel Margala designed a program that pushed them to the top of the leaderboard with three weeks left. They were overtaken a few times, but they kept tweaking. "You get to the point where you can figure out what time zone people are in, and when you expect them to try and overtake you," Kirkby says. They spent the last nine days at the top.
The competitors had increased the accuracy of dark-matter mapping by a factor of three. The top three won a trip to Nasa's Jet Propulsion Laboratory in Pasadena. Martin O'Leary came fourth, but had the thrill of seeing his accomplishment written up on the White House website. He tweeted: "Not braggin' or nothin' but the White House just compared me to Newton and Einstein."
At the first Strata conference, the Kaggle team was essentially just part of the crowd. But by the second year, they'd raised £6.8 million in series A funding, and were among the biggest stars at the convention centre.
Over drinks in the hotel bar at Strata, Goldbloom says that the real work in data science is going on in the academic hinterlands. "These are the cool kids doing data here," he says. "But our guys are the ones doing [it] not for Silicon Valley reasons, but for the love of the game. Our people go to work, come home, have dinner with their family and then from 10pm to 2am compete in competitions before going to bed."
Howard, who enjoys the spotlight more than his partner, walks around the convention centre, basking in his good fortune. "What was it Arthur C Clarke said, that advanced technology is indistinguishable from magic?" he says. "I feel like that's what we do as data scientists." A guy stops him. "Oh Jeremy Howard!" he says. "Hi!" says Howard. "I've been following your career on Kaggle for some time." "Oh, wow. Really?"
It's Howard's first fan. Later he says, "It's so strange. That's something someone would say to a tennis star. Not a data scientist."
Neal Pollack wrote about Troy Carter in 06.12