Olympic Data Crunching

In weeding through a trillion bytes of data from the Nagano Winter Olympics, IBM engineers have found some curious results. Now they must analyze the data and prepare for the 2 billion hits expected in Sydney 2000. By Stewart Taggart.

What could be more exciting at the Winter Olympics than freestyle skiing, the luge, or the bobsled event? Here's one: curling. Yes, curling, the sport of sliding gizmos that look like metal teapots across the ice.

As IBM data-mines a massive 1 terabyte database of user information gathered from the Nagano Winter Olympics Web site -- probably the biggest such collection ever made -- curling's unexpected popularity is confounding the notion that the sport has few fanatical adherents.

The first challenge in going through the data was the sheer size of the database, which was so large that existing data-mining software couldn't process it effectively, said Jose Luis Iribarren, IBM's technical manager for the Olympics. As a result, IBM had to develop new algorithms and other techniques to get meaningful results.

"If you have billions of records and you are interested in the big picture only, then you can do sampling, without processing the overall pool," he said. "But as you start to study things like the precise paths users followed, and start to slice the data down to individual athletes or slots of time, then you really need to process a much bigger body of data."

So far, IBM's data-mining efforts on the Nagano site have concentrated on two lines of inquiry: a "verification mode" and a "discovery mode."

In "verification mode," the database is queried to test a hypothesized relationship between variables. An example would be examining the correlation between viewers looking at an overall medals table and viewers who then click through to the breakdown of medals won by an individual country.
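To make the idea concrete, here is a minimal sketch of a "verification mode" query. It is not IBM's method. It assumes each visit has already been reduced to an ordered list of page paths, and the paths "/medals" and "/medals/country/" are hypothetical stand-ins for the medals table and the per-country breakdowns. The sample visits are invented for illustration.

```python
from typing import Dict, List


def click_through_rate(sessions: Dict[str, List[str]],
                       source: str,
                       target_prefix: str) -> float:
    """Share of sessions that viewed `source` and later viewed a page
    whose path starts with `target_prefix`."""
    viewed = 0
    clicked = 0
    for pages in sessions.values():
        if source not in pages:
            continue
        viewed += 1
        first_view = pages.index(source)
        # Only count target pages reached after the first view of the source.
        if any(p.startswith(target_prefix) for p in pages[first_view + 1:]):
            clicked += 1
    return clicked / viewed if viewed else 0.0


# Invented sample visits (hypothetical paths, not Nagano data).
sample_sessions = {
    "a": ["/", "/medals", "/medals/country/jpn"],
    "b": ["/", "/medals"],
    "c": ["/", "/curling"],
    "d": ["/medals/country/usa", "/medals"],
}

rate = click_through_rate(sample_sessions, "/medals", "/medals/country/")
print(f"{rate:.2f}")
```

A hypothesis such as "most people who see the medals table go on to a country page" would then be checked against the resulting rate.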

In "discovery mode," more general questions are posed, such as what were the 10 most common paths of navigation. IBM has thus far been able only to pose "discovery mode" lines of inquiry, at least in part because of the amount of pre-processing necessary to prepare the data for analysis.
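A "discovery mode" question such as the most common navigation paths can be sketched the same way. This is an illustration under assumptions, not a description of IBM's software: it supposes the pre-processing has already grouped raw hits into per-visit page sequences, and the page paths and sample visits below are invented.

```python
from collections import Counter
from typing import Dict, List, Tuple


def most_common_paths(sessions: Dict[str, List[str]],
                      n: int = 10) -> List[Tuple[Tuple[str, ...], int]]:
    """Return the `n` most frequent full navigation paths with their counts."""
    # Lists are not hashable, so each path is converted to a tuple first.
    counts = Counter(tuple(pages) for pages in sessions.values())
    return counts.most_common(n)


# Invented sample visits (hypothetical paths, not Nagano data).
sample_sessions = {
    "a": ["/", "/hockey"],
    "b": ["/", "/hockey"],
    "c": ["/", "/hockey"],
    "d": ["/", "/curling"],
    "e": ["/", "/curling"],
    "f": ["/"],
}

for path, count in most_common_paths(sample_sessions, n=2):
    print(" > ".join(path), count)
```

At a terabyte scale a single in-memory counter like this would not be enough, which is consistent with the article's point that existing tools struggled with the volume.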

One of the more confounding data-mining findings to date: About 40 percent of visitors to the Nagano site's English-language homepage went no deeper than that page, a behavior shared by only about 10 percent of visitors to the Japanese language homepage.

"We're wondering why that was," said Iribarren. "Since the English page had many snippets of information right there -- news, tables, results of the day -- maybe the homepage was rich enough people didn't need to go further."

Either that, or they hated it and left.

So stands data mining's great contradiction. While it can offer valuable details of visitor behavior with clinical precision, the answers often pose more questions.

As Iribarren sifts through a mounting pile of data-mining results from Nagano -- ice hockey was the most popular in terms of page views, followed by figure skating, speed skating, alpine skiing, ski jumping, snowboarding, cross-country skiing, and, of course, curling -- he and others have already started looking ahead to the Sydney 2000 Summer Olympics.

IBM believes the Sydney site could receive as many as 2 billion hits during the games, scheduled from 15 September through 1 October 2000. By comparison, the Nagano Web site recorded 635 million hits, earning it a place in the Guinness Book of World Records for the highest number of hits recorded for any sporting event up to that time. A major goal for the 2000 Olympics will be to provide any piece of Olympic data within four mouse clicks, Iribarren said.

"After four clicks, if someone hasn't found what they are looking for, the drop-off rate increases dramatically, and people are likely to leave the site," Iribarren said. "We are doing everything we can to streamline and reduce the hits and navigation paths to the information."

Another big challenge facing Iribarren and his team in assembling the Sydney 2000 site will be satisfying both fanatic data mavens and the more casually interested.

"For the hard-core sports fans who love tables and data, all that will still be there in Sydney," he said. "But what we want to do is add another layer, in which information is presented in a more attractive way."

"Using new publishing tools, we may be able to make creative graphics on the fly," he said. "While at this point we just don't know how this may be done, the aim will be to make results more understandable and easier to grasp."