Living Data

Despite all the hype about faster and better and cheaper and friendlier, it's amazing how little the foundations of computing have changed. From the 1940s to today, the raw material of computation has been something called "data." Data is made of bits; it's the stuff that's read in as "input," stored in computer memories, and named with variables in programming languages. But data isn't just numbers - it's also a way of thinking about the relationship between the abstract territory inside computers and the concrete territory outside them. Data has meaning - it represents the world. It can represent things inside computers, of course, but it also represents things outside them: your age, the price of eggs, the number of cases of AIDS in New York, the predicted temperature in Brazil 10 years from now, how likely you are to buy a cubic zirconia ring. The basic function of a computer is to crunch this data: it grabs data from somewhere in its memory, shuffles and recombines the bits in some meaningful way, and stuffs the results somewhere else. It does this several million times a second, of course, but that's what it does.

We're so accustomed to data that hardly anyone questions it.

But data is obsolete. It's an archaic leftover that causes boundless mayhem and will inexorably be replaced - either quickly, if everyone wises up now, or slowly and painfully, if we continue to think of the difficulties caused by data in shallow and fragmented ways.

Data is missing at least five things, all of which become both necessary and possible in a world of globally distributed computing:

  • Ownership. Where did this data come from? What are we allowed to do with it? To speak of "owning" data glosses a variety of things, from trade secrets to research ethics to contractual constraints on its use. So long as the data stays within a single program on a single system, these rules can be embedded implicitly in the software. But as data starts migrating to other machines and getting merged with other data, the rules need to migrate as well. And as new uses arise, the rules will have to be renegotiated. Your software agent needs to get my software agent's permission - through a firewall of anonymity enforced by encryption if necessary - before you mess around with my data.
  • Error bars. How reliable is this data? Real scientists put error bars on their numbers so they can tell whether they've gotten an answer or just a number. For example, 50±3 means it's probably between 47 and 53, but 50±83 means it's probably between -33 and 133. Big difference. When you add, subtract, multiply, or divide numbers, you need to calculate the error bars for the result. Two numbers with big potential errors add up to a result with an even bigger potential error (the first sketch after this list shows the arithmetic). But few computers make it easy to maintain this information, and few programmers ever bother. As a result, nobody knows whether most of the numbers that come out of computers are meaningful or not.
  • Sensitivity. This is similar to error bars. If someone makes a model of expenses under some health care proposal, the original input data will include a bunch of numbers that someone had to measure or estimate. How fast does the final answer change as you start modifying the input by plausible amounts? Answering that question is what spreadsheets are for, but only if someone checks each possibility by hand and bothers to save the answer. It should happen automatically (the second sketch after this list shows one way).
  • Dependency. What data was used to compute this data? If something is screwy, can we trace the calculation back and figure out which input it depends on? And what data was this data used to compute? If we discover an error in our data, do we have a way of informing everyone who believed us before? As we all know, errors propagate a lot faster than they can be repaired. This would change if the data could stay connected both upstream and downstream.
  • Semantics. Now that computers are going on networks, thousands of databases are being connected to one another. The problem is that most of those databases have arisen independently of one another - in different organizations, different departments, and different professions. As a result, it's very common for two databases to contain columns of data named by the same word - such as "price" or "name" or "approved" - even though that word means subtly different things to the people who created the databases. We probably can't explain the complete semantics of our words to our databases, but at least we can record simple things like units of measurement (is it "gallons" or "gallons per second"?), so that the numbers themselves can check whether it makes sense to compare them (the third sketch after this list shows the idea).
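
Here is a minimal sketch of what error-bar arithmetic might look like, using the 50±3 example above. The Measurement class is invented for illustration, and worst-case propagation (the errors simply add) is assumed for simplicity; independent errors are often combined as root-sum-of-squares instead.

```python
# A tiny value-with-error-bars type. Worst-case propagation (errors add)
# is assumed here for simplicity.

class Measurement:
    def __init__(self, value, error):
        self.value = value
        self.error = abs(error)

    def __add__(self, other):
        # Worst case: the uncertainties simply add.
        return Measurement(self.value + other.value, self.error + other.error)

    def __sub__(self, other):
        # Subtraction doesn't cancel uncertainty; it still adds.
        return Measurement(self.value - other.value, self.error + other.error)

    def __repr__(self):
        return f"{self.value}±{self.error}"

a = Measurement(50, 3)   # probably between 47 and 53
b = Measurement(50, 83)  # probably between -33 and 133
print(a + b)             # 100±86 - less certain than either input
```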
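
And a sketch of the automatic sensitivity analysis the second item asks for: nudge each input by a plausible amount, re-run the model, and report how much the answer moves. The expense model and its parameter names here are made-up stand-ins.

```python
# Automatic sensitivity analysis: perturb each input, recompute,
# and report how far the answer moves.

def model(inputs):
    # Stand-in for a real health care expense model.
    return inputs["enrollees"] * inputs["cost_per_enrollee"]

def sensitivity(model, inputs, nudge=0.10):
    baseline = model(inputs)
    report = {}
    for name, value in inputs.items():
        perturbed = dict(inputs, **{name: value * (1 + nudge)})
        report[name] = model(perturbed) - baseline
    return baseline, report

inputs = {"enrollees": 2_000_000, "cost_per_enrollee": 3_400.0}
baseline, report = sensitivity(model, inputs)
for name, delta in report.items():
    print(f"+10% {name}: answer moves by {delta:,.0f}")
```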
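
Finally, the simplest semantic check from the last item: numbers that carry their units and refuse to be combined with mismatched ones. The Quantity class is likewise invented for illustration.

```python
# Numbers tagged with units of measurement. Combining quantities with
# mismatched units fails loudly instead of silently producing nonsense.

class Quantity:
    def __init__(self, value, unit):
        self.value = value
        self.unit = unit

    def _check(self, other):
        if self.unit != other.unit:
            raise ValueError(f"can't combine {self.unit!r} with {other.unit!r}")

    def __add__(self, other):
        self._check(other)
        return Quantity(self.value + other.value, self.unit)

    def __lt__(self, other):
        self._check(other)
        return self.value < other.value

tank = Quantity(500, "gallons")
flow = Quantity(500, "gallons per second")
tank + flow  # raises ValueError: can't combine 'gallons' with 'gallons per second'
```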

The problem with data is that it's dead. We should bring it to life by thinking through all its relationships - both with other data and with the circumstances in the world that it's supposed to represent. One proposal is to make every last hunk of computerized data its own intelligent software agent, storing information about itself and exchanging a stream of messages with all other relevant data. Having done that, we'd then have to redefine the other basic concepts of computing so that those millions of operations per second compute something meaningful - not just something that looks good. Sounds inefficient, doesn't it? But basic processor speeds will keep on accelerating, and the computers of the world will keep on getting connected through networks. Let's spend some of that exponential growth on the production of useful answers and the prevention of computerized hassles.
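What might that look like in code? Here's a speculative sketch, with every name invented for illustration: a datum that carries its owner, its error bar, and its upstream and downstream connections, and that pushes corrections to everyone who believed it.

```python
# A speculative sketch of "living data": a datum that knows where it
# came from, who owns it, how uncertain it is, and which results depend
# on it - and that can push a correction downstream.

class LivingDatum:
    def __init__(self, value, owner, error=0.0, sources=()):
        self.value = value
        self.owner = owner            # ownership: who may use this, on what terms
        self.error = error            # error bars
        self.sources = list(sources)  # upstream: what this was computed from
        self.dependents = []          # downstream: what was computed from this
        for s in self.sources:
            s.dependents.append(self)

    def correct(self, new_value):
        # If we discover an error, everyone who believed us hears about it.
        self.value = new_value
        for d in self.dependents:
            d.recompute()

    def recompute(self):
        # Illustrative: recombine the (possibly corrected) sources.
        self.value = sum(s.value for s in self.sources)
        for d in self.dependents:
            d.recompute()

price = LivingDatum(2.15, owner="Bureau of Labor Statistics", error=0.05)
tax   = LivingDatum(0.10, owner="State of California")
total = LivingDatum(price.value + tax.value, owner="me", sources=(price, tax))
price.correct(2.35)   # the correction propagates: total.value is now 2.45
```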

Why aren't these things happening? They are, in small ways. But not in big ones. Unfortunately, a lot of the major data movers benefit from not knowing how meaningful their numbers are. A credit bureau just reports the numbers it got from somewhere else; if it were easy to find out how those numbers were collected, then demands for quality control would increase. A whole industry produces high-tech simulations for lobbyists and talk-show hosts to quote, and it wouldn't be good for business if everyone could find out how sensitive those numbers were to long lists of hidden assumptions. The people who sell mailing lists don't have to weed out so many marginal prospects because it's hard to tell exactly where the names and addresses came from or how current they are. In general, managers everywhere mostly use computers to justify the actions they've already decided on, and dead data can't call them on their games.

Another problem is that the old way of doing things is embedded in heavily entrenched standards. The Pentium chip doesn't help you build living data. Neither does any widely used programming language. Of course, you can build your own software abstractions on top of these things, effectively simulating living data. But that's no help until standards are established for all of the automatic interactions that living data requires. The introduction of intelligent agent languages like Telescript, though, provides an opening - a chance to do it right. Timing here is everything. If you want computers to be built right - not just fast and cool - then go visit some standard-setting meetings. Find corners of the technological world that are moving from one generation of computing to the next. And insist that "efficiency" be measured in terms of people, not in terms of machines.

Phil Agre (pagre@ucsd.edu) teaches in the Department of Communication at the University of California, San Diego.