Google BigQuery Ratchets Up Evolution of New-Age Data Analysis

The latest incarnation of Google BigQuery is yet another example of the way today's "Big Data" tools -- tools designed to process mega amounts of information -- are evolving to behave more and more like traditional databases.
Image: Flickr/amortize

Google was sitting on two massive collections of data describing its App Engine, a web service where software developers can build and deploy online applications.

One data set described the way people used the service, and it spanned 2 terabytes of information, or roughly 2,000 gigabytes. The second showed how these customers were billed for using the service, and this was about 10 gigabytes. Google wanted to examine the relationship between these two enormous collections of information, so it shuttled both into a service it calls BigQuery. With BigQuery, the company merged the data in about 60 seconds, according to Google man Ju-kay Kwek, and it could then zero in on the results for each individual App Engine user.

When you're dealing with such large data sets, 60 seconds is pretty darn quick. And this didn't require any specialized programming. Google was using standard tools built into BigQuery, and as the company announced late last week, these tools are now available to the world at large.

The tools mimic the sort of rapid queries that have long been possible on ordinary databases via the Structured Query Language, or SQL. The difference is that Google is doing this on far larger amounts of data. The latest incarnation of Google BigQuery is yet another example of the way today's "Big Data" tools -- tools designed to process mega amounts of information -- are evolving to behave more and more like traditional databases.

In October, Silicon Valley startup Cloudera uncloaked a tool called Impala that's designed to run rapid queries on massive data sets, and this month, tech giant EMC followed with a similar tool. Based on an internal Google software platform called Dremel, BigQuery predates both these tools, and Google continues to fine-tune it.

Last week, the company unveiled two new tools atop BigQuery. "Big JOIN" lets you combine data in much the same way Google merged its two App Engine data sets, while "Big Group Aggregations" lets you divide such data into specific segments, as Google did in setting up separate App Engine data sets for each user.

Join is a common SQL operation. Basically, it lets you combine two different datasets so that they can be analyzed together. BigQuery could do joins in the past, but according to Ju-kay Kwek, who oversees BigQuery as project manager, it was better suited to other types of queries. "We had a lot of people request the ability to do joins on very large tables," Kwek tells Wired. "It's not to say BigQuery couldn't do that before...but doing a join on such a large dataset is a non-trivial problem, and in terms of performance, BigQuery wasn't ideally suited to it."
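
For readers who have not written one, a join followed by a per-user grouping can be sketched in a few lines. The example below is only an illustration: it uses Python's built-in sqlite3 module as a small stand-in for BigQuery, and the table names, column names, and figures are invented rather than taken from Google's App Engine data.

```python
import sqlite3


def join_and_aggregate():
    """Join a usage table to a billing table, then total usage per user."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database

    # Hypothetical schemas: one row per usage event, one billing row per user.
    conn.executescript(
        """
        CREATE TABLE usage_log (user_id TEXT, requests INTEGER);
        CREATE TABLE billing (user_id TEXT, amount_usd REAL);
        """
    )
    conn.executemany(
        "INSERT INTO usage_log VALUES (?, ?)",
        [("alice", 120), ("alice", 80), ("bob", 50)],
    )
    conn.executemany(
        "INSERT INTO billing VALUES (?, ?)",
        [("alice", 10.0), ("bob", 2.5)],
    )

    # JOIN pairs each usage row with the billing row for the same user;
    # GROUP BY then collapses the result into one segment per user.
    rows = conn.execute(
        """
        SELECT u.user_id, SUM(u.requests) AS total_requests, b.amount_usd
        FROM usage_log AS u
        JOIN billing AS b ON u.user_id = b.user_id
        GROUP BY u.user_id, b.amount_usd
        ORDER BY u.user_id
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in join_and_aggregate():
        print(row)
```

On three rows this is instantaneous on any database. The hard part Kwek describes is doing the same thing when the tables run to terabytes.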

Various tools have long offered the ability to run SQL queries atop Big Data platforms such as Hadoop, but this often requires a fair amount of time -- if not some specialized programming skills. Tools like Dremel and BigQuery aim to change this.

In 2010, Google released a research paper describing Dremel -- a software platform that pools the power of hundreds of computer servers -- and it caused a bit of a stir in the academic community. According to Google’s paper, the tool could run queries on multiple petabytes of data -- millions of gigabytes -- in a matter of seconds. "If you had told me beforehand what Dremel claims to do, I wouldn’t have believed you could build it," Armando Fox, a professor of computer science at the University of California, Berkeley, once told us.

Google never released the software behind Dremel, but with BigQuery, it lets anyone use this software atop its own infrastructure. In order to use the service, you must format your data using the CSV or JSON standard and upload it onto Google's machines. You can stream your data straight into BigQuery proper, or you have the option of grabbing and analyzing data housed on Google Cloud Storage, a general storage service for housing massive data sets online. Google has also teamed with companies such as Informatica and Talend to offer tools that can more easily move data into BigQuery from local software applications.
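
As a rough idea of what the formatting step involves, the sketch below turns a handful of records into CSV text and into JSON text using only Python's standard library. The record fields are invented for illustration, and writing one JSON object per line is an assumption about the layout a bulk-loading service would want. The upload itself is left out, since the details of Google's interface are not covered here.

```python
import csv
import io
import json


def to_csv(records, fieldnames):
    """Serialize a list of dicts as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def to_json_lines(records):
    """Serialize a list of dicts as JSON, one object per line (assumed layout)."""
    return "".join(json.dumps(record, sort_keys=True) + "\n" for record in records)


if __name__ == "__main__":
    # Hypothetical per-user usage records.
    sample = [
        {"user_id": "alice", "requests": 200},
        {"user_id": "bob", "requests": 50},
    ]
    print(to_csv(sample, ["user_id", "requests"]))
    print(to_json_lines(sample))
```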

Separately, a Silicon Valley outfit called MapR is working to build an open source version of Dremel. This is known as Drill, and you would have the option of running this on your own servers.