Oct 17

Big (biased) Data. What can go wrong?

Diana, October 16.

Today’s question is: what can go wrong when using Big Data? With that, I move from theoretical posts to a more explanatory one, to show with a particular example how Big Data can be a risk for “development professionals” trying to provide aid.

In my previous posts, I talked about risks and hidden bias in Big Data. I mentioned that people are the ones responsible for constructing data.

Google, and the Internet more broadly, are generally regarded as groundbreaking innovations that have changed the way millions of people live their lives, and yet researchers and practitioners in the field of ICT and development often struggle to demonstrate the technology’s concrete impact to “development professionals”. There are definite reasons why certain projects fail, and there are even some generalisable patterns of failure.

One example is Google, the most used search engine in the world, where millions of people find all kinds of information that affects their daily lives. In 2008, Google launched what it thought was a brilliant application, Google Flu Trends, to track flu and its spread around the world. The aim was to help “development professionals” provide aid to affected areas. Google claimed that it could see the advance of flu from people’s searches. The essential idea was that when people are sick with the flu, they search for flu-related information on Google, providing almost instant signals of overall flu prevalence.

But this concept didn’t work. Why? Let’s examine.

David Lazer and Ryan Kennedy write in SCIENCE that Google relied too much on simple search. That led to the spectacular failure of Google Flu Trends: at the peak of the 2013 flu season, the application missed by 140 percent.

As I mentioned in my previous post, it is hard to know what is really happening in an affected area if you are not actually present there.

That is what happened here: Google didn’t take into account that many people with the flu don’t actually use a search engine to look for flu-related information. Furthermore, Google didn’t research how many people rely on the internet to find information about the flu. Nor did Google take into account all the people who use Yahoo or Bing instead of Google.

David Lazer and Ryan Kennedy (professors in the Department of Political Science and the College of Computer and Information Sciences at Northeastern University and at the University of Houston, respectively) continue that Google’s algorithm was quite vulnerable to overfitting to seasonal terms unrelated to the flu. With millions of search terms being fit to the data, some searches were strongly correlated with flu cases by pure chance.
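To make the “correlated by pure chance” point concrete, here is a minimal sketch in Python. It is not Google’s algorithm: the function names, the 2,000 candidate terms and the 52-week window are my own assumptions, and every series is pure random noise. The sketch picks the noise series that best matches a fake “flu” signal in the first year, then checks the same series against the second year.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_chance_match(n_terms=2000, n_weeks=52, seed=0):
    """Hypothetical illustration: select the best of many noise series.

    Returns (correlation in the selection year, correlation in the next year)
    for the single "search term" that looked best in the selection year.
    """
    rng = random.Random(seed)
    # Two years of a made-up "flu" signal; it is noise, like everything here.
    flu = [rng.gauss(0, 1) for _ in range(2 * n_weeks)]
    best_r, best_term = 0.0, None
    for _ in range(n_terms):
        term = [rng.gauss(0, 1) for _ in range(2 * n_weeks)]
        # Selection only ever sees the first year.
        r = pearson(term[:n_weeks], flu[:n_weeks])
        if abs(r) > abs(best_r):
            best_r, best_term = r, term
    # Evaluate the winner on the second year, which it was not selected on.
    holdout_r = pearson(best_term[n_weeks:], flu[n_weeks:])
    return best_r, holdout_r


in_sample_r, holdout_r = best_chance_match()
print(f"selection year: r = {in_sample_r:.2f}")
print(f"following year: r = {holdout_r:.2f}")
```

With this many candidates, the winning series usually shows a sizeable correlation in the selection year even though nothing real connects it to the target, and there is no reason for that correlation to carry over to the following year. That is the trap of fitting millions of terms to a short record.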

These terms were unlikely to be driven by actual flu cases or to be predictive of future trends. Moreover, Google did not take into account variations in search activity over time. The resulting errors are not randomly distributed: an earlier error predicts a later one (autocorrelation), and the scale of the error varies with the time of year (seasonality). These patterns mean that Google Flu Trends overlooked significant information that could have been extracted by traditional statistical methods.
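One of the traditional checks alluded to here is simply asking whether last week’s error says anything about this week’s. Below is a small Python sketch of that check, the lag-1 autocorrelation of a series of forecast errors. The error series are simulated (I don’t have the Google Flu Trends errors), and `simulate_errors` and its `persistence` parameter are names I made up for the illustration.

```python
import random


def lag1_autocorrelation(errors):
    """Correlation between each error and the one immediately before it."""
    n = len(errors)
    mean = sum(errors) / n
    num = sum((errors[t] - mean) * (errors[t - 1] - mean) for t in range(1, n))
    den = sum((e - mean) ** 2 for e in errors)
    return num / den


def simulate_errors(n, persistence, seed=0):
    """Hypothetical error series: each error keeps a share of the previous one.

    persistence=0 gives independent errors; values near 1 give errors that
    linger from week to week.
    """
    rng = random.Random(seed)
    errors = [0.0]
    for _ in range(n - 1):
        errors.append(persistence * errors[-1] + rng.gauss(0, 1))
    return errors


independent = simulate_errors(500, persistence=0.0)
lingering = simulate_errors(500, persistence=0.8)
print(f"independent errors: {lag1_autocorrelation(independent):.2f}")
print(f"lingering errors:   {lag1_autocorrelation(lingering):.2f}")
```

A value near zero means the errors look random. A value well above zero means the model keeps being wrong in the same direction, which is information a forecaster could use to correct the next estimate.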

Google, like the Internet as a whole, is continuously changing because of the activities of millions of engineers and consumers. Researchers need a better understanding of how these changes unfold over time. Scientists need to replicate findings from these data sources across time and with other data sources, to make sure that they are observing robust patterns and not temporary trends. For instance, it is entirely feasible to run controlled experiments with Google, e.g., observing how Google search results differ based on location and past searches.

More generally, the evolution of the socio-technical systems embedded in our societies is fundamentally important and worthy of study. The algorithms underlying Google help determine what we find out about our health, politics, and friends.

It’s not just about the size of the data. There is a tendency for big data research and more traditional applied statistics to live in two different realms, aware of each other’s existence but generally not very trusting of each other (SCIENCE).

Big data offer enormous potential for understanding human interactions at a societal scale, with rich spatial and temporal dynamics, and for detecting complex interactions and nonlinearities among variables. These are the most exciting frontiers in studying human behaviour.

Instead of focusing on a “big data revolution,” perhaps it is time to concentrate on an “all data revolution,” where we recognise that the critical change in the world has been innovative analytics, using data from all traditional and new sources, and providing a deeper, clearer understanding of our world.


Oct 17

Biased #data

Diana, October 5.

To continue my previous post, I’ll talk more about biased data in this one.

As I mentioned before, Big Data is, unfortunately, not objective but a human creation. Taylor and Schroeder emphasise that even having the whole information on a matter can make it harder to understand and can make people unwilling to share it. Also, if we are not critical enough of the data we receive, we may accept false information at face value, without evidence.

Big Data is everywhere. Big companies and “development professionals” such as the United Nations (UN) or the Organisation for Economic Co-operation and Development (OECD) are using these types of data for research and exploration. They meet a lot of technical concerns along the way, and risks and issues of bias have tended to dominate the discussion so far.

Taylor and Schroeder point out the role of biased data in development politics. One example is how data is politicised: even correct data may not be accepted, because all information has to be agreed upon in order to be useful to country authorities as support for policy decisions. Many developing countries have this problem, where reliable information is hard to acquire. Officials censor information that comes from sectors of the population who feel underrepresented.

Kate Crawford is a Principal Researcher at Microsoft Research New York City, a Visiting Professor at MIT’s Center for Civic Media and a Senior Fellow at NYU’s Information Law Institute. Her research addresses the social impacts of big data, and she is currently writing a new book on data and power with Yale University Press. She published an article in Harvard Business Review: “The Hidden Biases in Big Data”.

Hidden biases in both the collection and analysis stages present considerable risks and are as important to the big-data equation as the numbers themselves. — Kate Crawford.

Kate uses an example to explain the hidden bias in data. There were a lot of tweets about Hurricane Sandy, more than 20 million, between October 27 and November 1. A study shows that these data don’t represent the whole picture. The highest number of tweets about Sandy came from Manhattan, which has a high level of smartphone ownership and Twitter use. This creates the illusion that Manhattan was the hub of the disaster. Far fewer messages originated from more severely affected locations such as Breezy Point, Coney Island and Rockaway, and even fewer tweets came from the worst-hit areas.

Here we can ask ourselves: how do the people outside of affected areas know about what is really happening there?

We rely more and more on Big Data’s numbers to speak for themselves, but the risks of misunderstanding the results, and in turn misdirecting important public resources, are as big as the data itself. “Development professionals” make that mistake too: they rely on information without questioning it. Such misinformation can send the wrong type of help to the wrong place or become an obstacle to relief.

Taylor and Schroeder give a similar example of biased data: the mobile phone data used by “development professionals” to track population movement in disaster relief. The problem with this data is that it is incomplete: not everyone uses a mobile phone, and usage is particularly low amongst vulnerable and ‘hidden’ populations such as children, the elderly, the poorest and women.
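A rough way to see how much this matters is to compare raw phone counts with counts adjusted for how many people in each group actually carry a phone. The Python sketch below does that with a simple inverse-coverage weighting. All group names and numbers are invented purely for illustration, and `reweight_by_coverage` is a hypothetical helper, not something taken from Taylor and Schroeder.

```python
def reweight_by_coverage(observed_counts, coverage_rates):
    """Scale each group's observed phone count by 1 / (share owning a phone).

    Both arguments are dicts keyed by group name. Coverage rates must lie in
    (0, 1]; a rate of 0 would mean the group is invisible in the data and
    cannot be recovered by reweighting at all.
    """
    estimates = {}
    for group, count in observed_counts.items():
        rate = coverage_rates[group]
        if not 0 < rate <= 1:
            raise ValueError(f"coverage rate for {group!r} must be in (0, 1]")
        estimates[group] = count / rate
    return estimates


# Invented numbers: phones seen leaving an area, and assumed ownership rates.
observed = {"working-age adults": 800, "elderly": 50, "children": 10}
coverage = {"working-age adults": 0.8, "elderly": 0.25, "children": 0.05}

estimated = reweight_by_coverage(observed, coverage)
raw_total = sum(observed.values())
est_total = sum(estimated.values())
for group in observed:
    raw_share = observed[group] / raw_total
    est_share = estimated[group] / est_total
    print(f"{group}: raw share {raw_share:.0%}, adjusted share {est_share:.0%}")
```

The adjustment only works if the ownership rates are known, and for ‘hidden’ populations they often are not. That is exactly the problem: the groups most likely to be missed are also the ones we know least about.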

As we move into an era in which personal devices are seen as proxies for public needs, we run the risk that already existing inequities will be further entrenched. Thus, with every big data set, we need to ask which people are excluded. Which places are less visible? What happens if you live in the shadow of big data sets? — Kate Crawford.