Mar 20, 2007 12:00 PM

Google's Next-Gen of Sneakernet

How do you get 120 terabytes of data -- the equivalent of 123,000 iPod shuffles (roughly 30 million songs) -- from A to B? For the most part, the old-fashioned way: via a sneakernet. It's not glamorous, but Google engineers hope to at least end the arduous process of transferring massive quantities of data -- which can literally take weeks to upload onto the internet -- with something affectionately called "FedExNet" by the scientists who use it.

Chris DiBona, the open-source program manager at Google, just returned late last week from Washington, D.C., where he met with Hubble researchers at the Space Telescope Science Institute to set the stage for what will be the largest data transfer for the project ever: The near totality of all the astronomical data and images that Hubble has ever collected -- about 120 terabytes.

Chris DiBona (Photo: Julian Cash)

The project comes out of DiBona's efforts last fall to put together an informal system in which Google acts as both a repository and courier for large data sets between teams of scientists. Now, he leads a team that sets up small form-factor PCs, hooked up to drive arrays that can store up to 3 terabytes of data.

The process lightens the load, but it isn't simple: DiBona ships both the PC and array to teams of scientists at various research institutions, which then connect their local servers to the array via an eSATA connection. Once the data transfer is complete, the drives get sent straight back to Mountain View, where DiBona and others copy the data to Google's servers for archival purposes. The idea then is that if other scientists around the world needed access to such a large quantity of data, Google would simply reverse the process.

"Right now, we're just acting as a conduit," DiBona says. "We make a copy of it, and then we can use the hard drives for something else. They'll get banged around a little bit too much (to store the data directly on the drives). They're not intended to be a long-term storage medium -- they're like envelopes to us."

For now, the program is only working in one direction -- data being sent from the field straight back to Google. But that should change later this year. Also, for the time being, the data is largely limited to astronomical data, such as Arizona State University's nearly 6 terabytes of thermal infrared images of the surface of Mars.

Noel Gorelick, a member of the research faculty in the School of Earth and Space Exploration at Arizona State University, says that a complete electronic transfer of its Mars data to the outside world normally takes more than a month of constant, painful uploading.

"We stopped doing it because that's not pleasant," Gorelick says.

With a set of Google drives, Gorelick (who came up with the FedExNet moniker) can copy his team's data in about 24 hours or less, something that can make a big difference when the time comes to collaborate with other research groups.
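To put those two figures side by side, here is a rough back-of-the-envelope sketch. It assumes decimal terabytes, treats "more than a month" as exactly 30 days and "about 24 hours or less" as exactly 24 hours, and ignores shipping time for the drives. None of these assumptions come from Gorelick or Google, so the upload rate is an upper bound and the copy rate a lower bound.

```python
# Back-of-the-envelope comparison of the two transfer times quoted above.
# Assumptions (not from the article): decimal terabytes, a 30-day month,
# a full 24-hour copy, and no allowance for shipping the drives.

TB = 10**12          # bytes in one decimal terabyte
DAY = 24 * 60 * 60   # seconds in one day


def effective_mbps(num_bytes, seconds):
    """Average throughput in megabits per second."""
    return num_bytes * 8 / seconds / 1e6


mars_bytes = 6 * TB  # ASU's "nearly 6 terabytes" of Mars imagery

upload_mbps = effective_mbps(mars_bytes, 30 * DAY)  # over the public internet
copy_mbps = effective_mbps(mars_bytes, 1 * DAY)     # onto the drive array
speedup = copy_mbps / upload_mbps

print(f"Month-long upload: {upload_mbps:.1f} Mbit/s")  # 18.5 Mbit/s
print(f"24-hour disk copy: {copy_mbps:.1f} Mbit/s")    # 555.6 Mbit/s
print(f"Speedup:           {speedup:.0f}x")            # 30x
```

Under these assumptions the month-long upload averages roughly 18.5 megabits per second, while the one-day copy sustains about 555.6, a thirtyfold difference before the courier's delivery time is added back in.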

"Faster is better," he says. "The sooner you get your data, the sooner you can start processing it and start finding out what it is that you don't know."

ASU's data, like that of the STSI, is already available online to the public. But both entities are limited in how much they can transfer over the public internet. In theory, they both could send their own hard-drive arrays out without the help of Google, but that takes time and money -- two things that the science community is typically short on.

"We can't afford (to send) a huge number of disks to people," says Carol Christian, deputy of the Community Missions Office at STSI. "We're not in a position to just mail out a terabyte disk to anyone who wants it."

Beyond simply letting Google handle the data transfer, Christian says she believes that helping the company make Hubble data more easily available to the public may profoundly alter the way astronomical science is conducted.

"The more people that look at the data, and the more people that have large amounts of the data, then there is a change of thinking: 'Wow, I could have almost all the Hubble data attached to my laptop,'" she says.

Christian also said she has been working with Google to help the company create a new way to access the astronomical data -- typing a star's name into a traditional search field simply won't do. This raises the question of what Google intends to do with such a large amount of data, beyond just lending a helping hand. While the company remains cagey about its future plans, it's conceivable that it may be working on a more science-oriented search engine, along the lines of Google Scholar.
