Big (biased) Data. What can go wrong?

Diana, October 16.

The question I want to ask today is: what can go wrong when we use Big Data? With this post I move from theoretical posts to a more explanatory one, showing with a particular example how Big Data can become a risk for “development professionals” trying to provide aid.


In my previous posts, I talked about the risks and hidden biases in Big Data, and I mentioned that people are the ones responsible for constructing data.

Google, and the Internet more broadly, are generally regarded as groundbreaking inventions that have changed the way millions of people live their lives, and yet researchers and practitioners in the field of ICT and development often struggle to demonstrate clear impacts of the technology to “development professionals”. There are definite reasons why certain projects fail, and there are even some generalisable patterns of failure.

One example comes from Google – the most used search engine in the world, where millions of people find all kinds of information that affects their daily lives. In 2008, Google launched what it thought was a brilliant application – Google Flu Trends – to track the flu and its spread around the world, partly so that “development professionals” could provide aid to affected areas. Google claimed it could see flu outbreaks advancing based on people’s searches. The essential idea was that when people are sick with the flu, they search for flu-related information on Google, providing almost instant signals of overall flu prevalence.
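To make that idea concrete, here is a toy sketch of the logic, assuming nothing about Google’s actual (and far more elaborate) model: officially reported flu activity is regressed on the weekly volume of a few flu-related search queries, and fresh query volumes are then used to “nowcast” flu levels before official statistics arrive. The flu series and query weights below are simulated purely for illustration.

```python
# Toy "nowcasting" sketch (not Google's actual model): regress reported flu activity
# on the weekly volume of a few flu-related queries, then estimate flu levels for
# new weeks from the query volumes alone.
import numpy as np

rng = np.random.default_rng(1)
weeks = np.arange(156)                                  # three years of weekly data
flu = np.sin(2 * np.pi * weeks / 52) + 1.5              # simulated seasonal flu activity

# simulated volumes for three flu-related queries, each loosely tracking flu activity
queries = np.column_stack(
    [flu * w + rng.normal(0, 0.3, weeks.size) for w in (1.0, 0.6, 0.3)]
)

X = np.column_stack([queries, np.ones(weeks.size)])     # query volumes plus an intercept
coef, *_ = np.linalg.lstsq(X[:104], flu[:104], rcond=None)  # "train" on the first two years

nowcast = X[104:] @ coef                                # estimate the third year from searches alone
print("mean absolute nowcast error:", np.abs(nowcast - flu[104:]).mean().round(3))
```

In this idealised setting the nowcast works almost perfectly, because the simulated queries really are driven by the flu. The trouble, as we will see, is that real search data do not behave so obligingly.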

But this concept didn’t work. Why? Let’s examine.

David Lazer and Ryan Kennedy write in Science that Google relied too much on simple search data, and that this led to the spectacular failure of Google Flu Trends: at the peak of the 2013 flu season, the application missed by 140 percent.

As I mentioned in my previous post, it is hard to know what is really happening in an affected area if you are not actually present there.

That is what happened here – Google didn’t take into account that many people with the flu never use the search engine to look for flu-related information. Furthermore, Google didn’t research how many people actually rely on the internet to find information about the flu. Nor did it account for all those people who use Yahoo or Bing instead of Google.

David Lazer and Ryan Kennedy – professors at Northeastern University’s Department of Political Science and College of Computer and Information Science, and at the University of Houston, respectively – continue that Google’s algorithm was prone to overfitting on seasonal terms unrelated to the flu. With millions of search terms being fit to the data, some searches were strongly correlated with flu prevalence by pure chance.

These terms were unlikely to be driven by actual flu cases or to be predictive of future trends. Moreover, Google did not take into account changes in search behaviour over time. The resulting errors are not randomly distributed: an old error predicts a new one, and the scale of the error varies with the time of year (seasonality). These patterns mean that Google Flu Trends overlooked significant information that could have been extracted by traditional statistical methods.
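A small simulation can illustrate both failure modes, using made-up data rather than anything from Google: a seasonal query that has nothing to do with the flu still tracks the flu curve closely, and screening tens of thousands of pure-noise queries always turns up a few that correlate with the flu by chance.

```python
# Minimal simulation (not Google's data) of the problems described above:
# seasonal-but-unrelated queries mimic the flu curve, and screening huge numbers
# of candidate queries surfaces correlations that exist only by chance.
import numpy as np

rng = np.random.default_rng(0)
weeks = np.arange(260)                                  # five years of weekly data
season = np.sin(2 * np.pi * weeks / 52)                 # shared winter/summer cycle
flu = season + rng.normal(0, 0.3, weeks.size)           # simulated flu prevalence

# 1) a seasonal query with no causal link to the flu (think "high school basketball")
basketball = season + rng.normal(0, 0.3, weeks.size)
print("seasonal-but-unrelated query:", np.corrcoef(flu, basketball)[0, 1].round(2))

# 2) screen 50,000 pure-noise queries and keep the one that looks best
noise = rng.normal(size=(50_000, weeks.size))
flu_z = (flu - flu.mean()) / flu.std()
noise_z = (noise - noise.mean(axis=1, keepdims=True)) / noise.std(axis=1, keepdims=True)
chance_corr = noise_z @ flu_z / weeks.size
print("strongest chance correlation among noise queries:", chance_corr.max().round(2))
```

The basketball-style query correlates strongly with the flu simply because both are seasonal, and the best of the noise queries looks far better than it should – exactly the kind of terms that will fail as soon as the season or search behaviour shifts.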


Google, like the rest of the Internet, is continuously changing because of the activities of millions of engineers and consumers. Researchers need a better understanding of how these changes occur over time. Scientists need to replicate findings using these data sources across time, and using other data sources, to ensure that they are observing robust patterns and not temporary trends. It is, for instance, entirely feasible to run controlled experiments with Google, e.g. observing how search results differ depending on location and past searches.

More generally, the evolution of the socio-technical systems embedded in our societies is fundamentally important and worthy of study. The algorithms underlying Google help determine what we find out about our health, politics, and friends.

It’s Not Just About the Size of the Data. There is a tendency for big data research and more traditional applied statistics to live in two different realms – aware of each other’s existence but generally not very trusting of each other (Science).

Big data offers massive potential for understanding human interactions at a societal scale, with rich spatial and temporal dynamics, and for detecting complex interactions and nonlinearities among variables. These are some of the most exciting frontiers in the study of human behaviour.

Instead of focusing on a “big data revolution,” perhaps it is time to talk about an “all data revolution,” recognising that the critical change in the world has been innovative analytics, using data from all traditional and new sources, to provide a deeper, clearer understanding of our world.

 


2 comments

  1. Hi Diana, interesting article indeed. It really links to the previous post about data2x, and the data inputs should come from everyone so we are able to provide solutions for everyone. One question also came to my mind while reading your post.
    Companies like Google, or researchers, depend on technology to collect their data, right? But what about the data that is not recorded in any database? Or behaviours that do not follow trends? In other words: can we turn everything into measurable data?
    Can technology be the solution?

    • Diana Uljanova Sigfusson

      Hi, Ali!

      Thank you for your comment!

      You are asking a really good question! I was wondering the same thing myself. I think that technology could be the solution. It could be a research topic for a bigger paper.