Why we can't trust 'blind big data' to cure the world's diseases

Today's data sets, though bigger than ever, still afford us an impoverished view of living things

Once upon a time, a former editor of WIRED, Chris Anderson, wrote a provocative article entitled 'The End of Theory: The Data Deluge Makes the Scientific Method Obsolete'.

He envisaged how scientists would take the ever-expanding ocean of data, send a torrent of bits and bytes into a great hopper, then crank the handles of huge computers running powerful statistical algorithms to discern patterns where science cannot.

In short, Anderson dreamt of the day when scientists no longer had to think.

Eight years later, the deluge is truly upon us. Some 90 per cent of the data currently in the world was created in the last two years. In the biological sciences, an ocean of -omes is being generated, and there are high hopes that big data will pave the way for a revolution in medicine.

But we need big thinking more than ever before.

In an article written with two coauthors in the journal Philosophical Transactions of the Royal Society A, I outlined why biology is too complex to rely on data that have been blindly harvested.

And, conversely, when it comes to using big data to make the likes of Einstein redundant, my coauthor Ed Dougherty of Texas A&M has asked: "Does anyone really believe that data mining could produce the general theory of relativity?"

Today's data sets, though bigger than ever, still afford us an impoverished view of living things. Our largest land animal contains around 1,000 trillion cells, a genome of some 3,000 million letters of genetic code, roughly 30,000 proteins, countless microbial passengers and so on. If aliens from Planet Caecus Data Magna were presented with these data, would they deduce that they add up to an elephant?


Think of the old story of the people shut in a pitch-dark room with a pachyderm. They grope around the great beast: the one who touches the tail thinks he has found a rope; another, feeling a leg, thinks it is a tree; the one who caresses an ear takes it for a fan; and so on. It takes a bewildering amount of data to capture the complexities of life.

The usual response is to put faith in machine learning, such as artificial neural networks. But no matter their ‘depth’ and sophistication, these methods merely fit curves to available data. Trained to recognise the trunk of an elephant, they would struggle when presented for the first time with an ear, let alone an embryo.
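To see the curve-fitting point concretely, here is a minimal sketch of my own (not from the article, and using made-up data) showing how a flexible model fitted to one region of data can go badly wrong the moment it is asked about inputs it has never seen:

```python
# A toy illustration: fit a flexible curve to data from one range of inputs,
# then ask it about inputs just outside that range.
import numpy as np

rng = np.random.default_rng(0)

# "Training" data: noisy samples of a smooth process, but only for x in [0, 3]
x_train = np.linspace(0.0, 3.0, 40)
y_train = np.sin(x_train) + 0.05 * rng.normal(size=x_train.size)

# A high-degree polynomial stands in for any over-parameterised learner:
# it matches the training region closely...
fit = np.poly1d(np.polyfit(x_train, y_train, deg=9))
print("max error inside training range: ",
      np.max(np.abs(fit(x_train) - np.sin(x_train))))

# ...but extrapolating only a little way beyond it, the predictions diverge
# from the true curve by orders of magnitude.
x_new = np.linspace(3.0, 5.0, 40)
print("max error outside training range:",
      np.max(np.abs(fit(x_new) - np.sin(x_new))))
```

The model has learned the shape of the data it was shown, not the process that generated it; that is the trunk-versus-ear problem in miniature.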

Two decades ago, my coauthor Peter Coveney of University College London used big data methods to predict the thickening times of complex slurries from the spectra of cement powders. Even though this became a successful commercial offering, we still do not understand what is going on at the molecular level, the kind of understanding that would help us develop novel materials.

Blind data dredging is all too likely to produce false leads. Spurious correlations are a familiar problem for those who use machine learning to find promising drugs. The same goes for linking genes to disease: a recent study of 61,000 exomes (the protein-coding parts of the genome) found that only 9 of 192 supposedly 'pathogenic' variants had a strong association with disease. The overestimate of peak influenza levels by Google Flu Trends reminds us that past success in describing epidemics is no guarantee of future performance: we have to take particular care when extrapolating from existing data.
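The arithmetic behind those false leads is easy to reproduce. The sketch below (my own illustration, not the exome study cited above) screens entirely random "variants" against an entirely random "disease" outcome; a predictable fraction still look significant by chance alone:

```python
# Simulated data dredging: no real signal exists, yet many tests "succeed".
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)

n_people, n_variants = 500, 2000
variants = rng.integers(0, 2, size=(n_variants, n_people))  # random genotypes
disease = rng.integers(0, 2, size=n_people)                 # random outcome

hits = sum(
    1 for v in variants
    if pearsonr(v, disease)[1] < 0.05   # nominally "significant" association
)

# With no real signal at all, roughly 5% of variants (about 100 here) pass the
# usual p < 0.05 test: false leads, unless corrected for multiple testing.
print(f"{hits} of {n_variants} random variants look 'associated' with disease")
```

Screen enough hypotheses without a guiding theory and "discoveries" are guaranteed, whether or not anything real is there.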


There are other limitations, not least that data are not always reliable ("most published research findings are false," as John Ioannidis famously reported in PLOS Medicine). Bodies are dynamic and ever-changing, while datasets often give only snapshots and are always retrospective.

Researchers still need to ask the right questions to create a hypothesis, design a test and use the data to determine whether that hypothesis is true. We have seen the power of this approach at the Large Hadron Collider at CERN, which generates around one petabyte of data every day, the equivalent of around 210,000 DVDs. Although the discovery of the Higgs boson required a deluge of data, physicists used theory to initiate and guide their search.

In the same way, we do not predict tomorrow's weather by averaging historic records of that day's weather: mathematical models fed with real-time satellite data do a much better job. In the same spirit, a team at Los Alamos is using theory to guide, enhance and refine the development of new materials. They direct data collection using Bayesian probabilistic methods, with experimental design driven by theoretical insight.
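What "Bayesian-guided data collection" means in practice can be sketched very simply. The example below is an assumption of mine for illustration (not the Los Alamos team's actual code): keep a probabilistic belief about each candidate material and spend the next experiment where it is expected to be most promising or most uncertain, rather than measuring at random.

```python
# Toy Bayesian experimental design over a handful of candidate materials.
import numpy as np

rng = np.random.default_rng(2)

true_strength = np.array([3.1, 4.7, 2.2, 5.0, 4.1])  # unknown to the algorithm
noise = 0.5                                           # measurement noise (std dev)

# Gaussian belief per candidate: prior mean 0, large prior variance
# (theory-driven priors would replace these uninformative values).
mean = np.zeros_like(true_strength)
var = np.full_like(true_strength, 100.0)

for step in range(20):
    # Upper-confidence rule: test candidates believed good or still uncertain.
    pick = np.argmax(mean + 2.0 * np.sqrt(var))

    measurement = true_strength[pick] + noise * rng.normal()

    # Standard Bayesian update for a Gaussian prior with Gaussian noise
    precision = 1.0 / var[pick] + 1.0 / noise**2
    mean[pick] = (mean[pick] / var[pick] + measurement / noise**2) / precision
    var[pick] = 1.0 / precision

print("believed best material:", int(np.argmax(mean)))
print("posterior means:", mean.round(2))
```

Each measurement updates the belief, and the belief in turn decides the next measurement: data collection steered by a model, not harvested blindly.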

A blend of theory and measurement is how to make progress in medicine too. For example, Peter Coveney's team has shown how to design a drug around a person's genetic makeup within a matter of hours, using Newtonian dynamics and heavyweight computation to explore how candidate drug molecules interact with a target protein in the body.
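At the heart of such simulations is nothing more exotic than Newton's equations of motion applied to interacting atoms. The following is a deliberately tiny sketch (a toy of mine, not Coveney's production code): two atoms feel a Lennard-Jones force and are stepped forward with the velocity Verlet scheme used in molecular dynamics.

```python
# Minimal molecular-dynamics step: Lennard-Jones pair, velocity Verlet integration.
import numpy as np

epsilon, sigma, mass, dt = 1.0, 1.0, 1.0, 0.002  # reduced (dimensionless) units

def lj_force(r):
    """Force on atom 1 along the separation vector r (atom 1 minus atom 2)."""
    d = np.linalg.norm(r)
    # F(d) = 24*eps*(2*(sigma/d)^12 - (sigma/d)^6)/d, directed along r
    return 24 * epsilon * (2 * (sigma / d)**12 - (sigma / d)**6) / d**2 * r

pos = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]])  # starting positions
vel = np.zeros_like(pos)
f = lj_force(pos[0] - pos[1])
forces = np.array([f, -f])

for step in range(5000):
    # velocity Verlet: half-kick, drift, recompute forces, half-kick
    vel += 0.5 * dt * forces / mass
    pos += dt * vel
    f = lj_force(pos[0] - pos[1])
    forces = np.array([f, -f])
    vel += 0.5 * dt * forces / mass

print("final separation:", np.linalg.norm(pos[0] - pos[1]))
```

A real drug-protein simulation does this for hundreds of thousands of atoms over millions of time steps, which is why the heavyweight computation is needed; the physics, though, is theory doing the guiding.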

In the long term, we need mathematical models of the whole human body (we already have a pretty good virtual beating heart). Then, in a few decades, a doctor will be able to create a virtual model of you, customised with your own data. She will be able to treat, dissect and explore your digital doppelgänger before she experiments on you. When that day dawns, we will have true personalised medicine.

This article was originally published by WIRED UK