Apr 10, 2012 8:25 AM

Amazon Takes Genomics Research to the Clouds


What do you do with a 200-terabyte instruction manual that tells you how to build a human?

You put it on a cloud.

That's what Amazon and the National Institutes of Health (NIH) have done with the 1000 Genomes project, using Amazon's S3 storage service to offer over 1,700 human genomes to genetics researchers across the globe. "This is what allows us to drive more complex maps of how genes interact with each other and their environment and zoom in on areas that may have a role to play in human health and disease," says Matt Wood, who oversees Amazon's side of the project and holds a PhD in bioinformatics. "This is the seed to create a tree of data."

Amazon and the NIH made a big splash last month when they announced that anyone with an S3 account could now access this data. But the move is only part of a much larger effort to reinvent genetics using the proverbial cloud. Researchers are tapping into public services from the likes of Amazon, Google, and Microsoft, and they are also building their own cloud services using tools such as Hadoop, the open source platform for crunching large amounts of data across a sea of ordinary servers.

"The genomics revolution people talked about 10 years ago? It is happening now," Misha Kapushesky, CEO of genomics startup Genestack, tells Wired. "This is just the tip of the iceberg."

Biology researchers need DNA data so they can get a better handle on how proteins and other downstream biological molecules are structured -- and get closer to solving the mysteries of the human body. In the past, this information was saved on disks and mailed around the country, a highly inefficient process. We're getting to the point where these datasets are too large to store on individual machines, and very often, purchasing suitable hardware is beyond the tight budgets of public research institutions. So research operations are turning to the cloud.

Stephen Sherry, section chief for the National Center for Biotechnology Information (NCBI) at the NIH, calls the relationship with Amazon "priming a virtuous cycle" between researchers and various cloud outfits. Research operations aren't just storing their genetic data on services such as Amazon S3. They're using cloud services to run applications that seek to make sense of this data. According to Don Preuss, head of the NCBI systems group, many researchers are using Google's App Engine service to parse genome sequences. And Microsoft recently moved the NIH's Basic Local Alignment Search Tool (BLAST) -- a query tool for specific genomic sequences -- to its Azure cloud service.

In other cases, research organizations are building their own computer clusters capable of storing and analyzing this data. For instance, Crossbow and Bowtie, two programs from Johns Hopkins' school of public health that process short genetic reads, use a local Hadoop cluster.

But there's a big benefit to moving large research data sets onto public services where anyone can access them. "I think we were in this progression where the data was only accessible to a select few, but now the cloud opens it up to a greater number of people for a lot more innovation," Kapushesky says.

Yes, there are still hurdles to overcome. The 1000 Genomes project is considered public data, but it can be more difficult to move private medical research data into the cloud, due to the US Health Insurance Portability and Accountability Act (HIPAA) and other similar laws. And though space and cost are less of an issue in the cloud, these databases are still rather unwieldy. The 200 terabytes of data stored on Amazon cover genomes for only about 1,700 people, and the project expects to add another 900 shortly.
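To put those figures in perspective, here is a rough back-of-the-envelope sketch. It assumes decimal terabytes and an even split of storage across individuals, neither of which the project has confirmed, and the function names are purely illustrative.

```python
# Back-of-the-envelope estimate from the article's figures.
# Assumptions (not from the project): decimal units, and every person's
# data takes up roughly the same amount of space.
TB = 10**12
GB = 10**9


def per_person_bytes(total_bytes, people):
    """Average storage per person, assuming an even split."""
    return total_bytes / people


def projected_total_bytes(total_bytes, people, added_people):
    """Projected size if new people need the same average space."""
    return per_person_bytes(total_bytes, people) * (people + added_people)


current = 200 * TB
per_person = per_person_bytes(current, 1700)
projected = projected_total_bytes(current, 1700, 900)

print(f"about {per_person / GB:.0f} GB per person")
print(f"about {projected / TB:.0f} TB after adding 900 people")
```

Under those assumptions, each person accounts for a little under 120 gigabytes, and the collection would pass 300 terabytes once the additional 900 genomes arrive. The real numbers will depend on sequencing depth and file formats.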

An outfit called the Pistoia Alliance is running Sequence Squeeze, a competition to see how to best compress a particular sequence of DNA, and this sort of work will make it easier to move data to and fro. Meanwhile, companies such as Oxford Nanopore are working to further reduce the cost of actually sequencing genomes. The end result is an exponential increase in the speed of genetics research.
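To give a sense of what compressing DNA can mean at its simplest, the sketch below packs each of the four bases into two bits, so four bases fit in one byte instead of four. This is a naive baseline for illustration only, not the method used by any Sequence Squeeze entrant, and the function names are hypothetical. Real sequencing files also carry quality scores and ambiguous bases such as N, which this sketch does not handle.

```python
# Naive 2-bit packing of a DNA string (illustrative baseline only).
# Handles just the four unambiguous bases A, C, G and T.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack four bases per byte, most significant bits first."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so unpacking stays uniform.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the original string; length trims the padding bases."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


seq = "GATTACA"
packed = pack(seq)
restored = unpack(packed, len(seq))
print(len(seq), "bases stored in", len(packed), "bytes")
```

The original length has to be stored alongside the packed bytes, because the final byte may contain padding. Competition-grade compressors go well beyond this, for example by exploiting how similar one human genome is to another.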

"The cost of sequencing is just plummeting, way more than Moore's Law can keep. As the price continues to fall, we'll see more and more institutes that can afford sequencers," Amazon's Wood says. "Anybody can take advantage of the data because it's sitting in S3 and recreate the data pipelines in their own sandboxes. I see this as a wider democratization across genomics research."

Update: This article has been updated to correctly identify the sponsor of Sequence Squeeze: the Pistoia Alliance.