It is indisputable that the volume of data being captured is on the rise. What remains controversial is whether this often superfluous capturing will actually lead to more valid discoveries, or merely to a flood of false, random occurrences mistaken for new discoveries.
Data-driven discoveries are commonly backed by a probabilistic or statistical relationship between two or more variables. For example, measuring the correlation between outside temperature and ice cream sales would likely indicate that as one goes up, so does the other. I haven’t done the math, but for the sake of argument, let’s assume they are related.
Our world is full of chaos and uncertainty. Since data is just recorded snapshots of our world at a given point in time, data also inherits these varying degrees of chaos and uncertainty. This is where you have to be careful. What if we added beach tourism data and shark attack counts to the ice cream and outside temperature data set mentioned above? I’d be willing to bet that a strong statistical relationship between shark attacks and ice cream sales would bubble up from the data. But is that relationship real, or is it just an artifact of other underlying relationships? A more reasonable inference is that people flock to the beaches in warm weather, and more people in the water means more shark attacks. Those attacks just happen to rise alongside ice cream sales because both are driven by the warmer weather.
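To make the ice cream and shark attack example concrete, here is a minimal Python sketch on simulated data. Every number in it (the temperature range, the coefficients, the noise levels) is made up for illustration, and the function names are hypothetical. It lets temperature drive both ice cream sales and beach visitors, lets shark attacks depend only on visitors, and then compares the raw correlation with the correlation that remains once the linear effect of temperature is removed from both series.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """What is left of ys after subtracting a least-squares line fitted on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return [y - (mean_y + slope * (x - mean_x)) for x, y in zip(xs, ys)]


def simulate_beach_data(days=2000, seed=0):
    """Return (raw, partial) correlation of ice cream sales vs. shark attacks.

    All coefficients below are invented for illustration only.
    """
    rng = random.Random(seed)
    temperature = [rng.uniform(10, 35) for _ in range(days)]
    # Temperature drives ice cream sales and beach visitors independently.
    ice_cream = [20 * t + rng.gauss(0, 60) for t in temperature]
    visitors = [100 * t + rng.gauss(0, 500) for t in temperature]
    # Shark attacks depend on visitors only, never on ice cream.
    # (A stylized continuous "attack index", not real counts.)
    sharks = [0.002 * v + rng.gauss(0, 1) for v in visitors]

    raw = pearson(ice_cream, sharks)
    # Partial correlation: correlate what temperature cannot explain.
    partial = pearson(residuals(ice_cream, temperature),
                      residuals(sharks, temperature))
    return raw, partial


if __name__ == "__main__":
    raw, partial = simulate_beach_data()
    print(f"raw correlation:             {raw:.2f}")
    print(f"controlling for temperature: {partial:.2f}")
```

In this toy setup the raw correlation comes out strongly positive, while the version that controls for temperature sits close to zero, which is what you would expect when a third variable is doing all the work. Real data is rarely this tidy, so treat it as an illustration of the idea rather than a recipe.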
So how do we keep from being fooled by these false relationships? First of all, don’t believe everything you see and read. Use your own judgment and expertise to validate claims. Scientific journals generally do a great job vetting research submissions, but a lot of the vetting still falls on the original investigators and whether or not they applied proper research design and experimental techniques.
Several months ago, a group of German researchers proclaimed that people on a low-carb diet would lose weight 10% faster if they ate chocolate every day. This immediately made headlines that swept across Europe and eventually the US. Unfortunately for believers, the real aim wasn’t a diet study, but a documentary demonstrating how easy it is to turn bad science into big headlines.
So how did they find a relationship? They captured lots of data – using big data – and hoped that, with so many variables, something would pop out due to chance; then the researchers could capitalize on that relationship. And this is exactly what they did. The authors found a relationship, submitted the paper to 20 journals, and had it accepted by several of them within the first 24 hours.
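The "something would pop out due to chance" effect can be put into numbers. If each of k unrelated measures is tested at a 5% significance level, and the tests are independent, the chance that at least one of them looks significant is 1 - 0.95^k. The Python sketch below checks that formula against a simulation on pure noise. The 20 measures and 8 people per group are hypothetical values picked for illustration, not figures from the chocolate study, and the z-test assumes a known unit variance, which holds here only because the data is generated that way.

```python
import math
import random


def chance_of_false_positive(num_measures, alpha=0.05):
    """Probability that at least one of several independent null tests 'hits'."""
    return 1 - (1 - alpha) ** num_measures


def two_sided_p(group_a, group_b):
    """Two-sample z-test p-value, assuming both groups have known variance 1."""
    diff = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    std_err = math.sqrt(1 / len(group_a) + 1 / len(group_b))
    return math.erfc(abs(diff / std_err) / math.sqrt(2))


def simulate_null_studies(num_studies=2000, num_measures=20,
                          group_size=8, alpha=0.05, seed=1):
    """Fraction of pure-noise studies with at least one 'significant' measure."""
    rng = random.Random(seed)

    def noise_group():
        # No real effect exists: both groups come from the same distribution.
        return [rng.gauss(0, 1) for _ in range(group_size)]

    hits = 0
    for _ in range(num_studies):
        # A study "finds something" if any single measure crosses the threshold.
        if any(two_sided_p(noise_group(), noise_group()) < alpha
               for _ in range(num_measures)):
            hits += 1
    return hits / num_studies


if __name__ == "__main__":
    print(f"formula:    {chance_of_false_positive(20):.2f}")
    print(f"simulation: {simulate_null_studies():.2f}")
```

With 20 measures the formula gives roughly 0.64, so under these assumptions a study with no real effect at all would still turn up a "discovery" about two times out of three. That is why looking at more variables, without correcting for how many you looked at, makes false relationships more likely rather than less.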
Of course, this example is an outlier – not all studies and papers are bad science. Most are great science and the backbone of innovation, but we still have to be careful. All it takes is accepting one false relationship and incorporating it into your decisions or daily life to cause adverse effects. Sometimes you just need to do a little research and think on your own. Play the role of the journal editor and decide if you would accept or reject the theory.
The evolution of big data has certainly opened up the world to great opportunities for discovery. However, with those opportunities comes the responsibility to present great science and eliminate the fallacies hiding within the data. Part of that responsibility falls on you – the reader and consumer – to judge whether to accept or reject the claims based on the evidence presented to you.