Charlotta Duse on
Google Flu Trends
By analyzing the big data from our Google searches, Google Flu Trends (GFT) claims to provide "near real-time estimates of flu activity for a number of countries and regions around the world".
The idea behind GFT is that if people feel sick, they will search for medication, expected symptoms and so on on Google, and the company will be able to analyze this user data to see how and where the flu is spreading. Kenneth Cukier and Viktor Mayer-Schonberger call the system a "more useful and [a more] timely indicator than government statistics with their natural reporting lags. Public health officials were armed with valuable information" (2013:pos 45).
But voices have also been raised to the contrary, voices saying that GFT has failed its purpose on various occasions. Some even call it "a prime example of what can go wrong when you read too much into your Big Data".
It has been reported that GFT, during some time intervals, predicted more than double the proportion of doctor visits for influenza-like illness (ILI) reported by the Centers for Disease Control and Prevention (CDC). It is also said to have missed the nonseasonal 2009 influenza A/H1N1 pandemic, as well as to have consistently overestimated flu prevalence in 2011–2013.
Questions about privacy have also been raised, although Google says the searches do not contain any information about the persons who made them (Nature, 2009:2).
The data that Google uses for its predictions is based on 45 search queries – the ones that, when the model was built, matched existing CDC data most accurately among billions of search terms. It might be interesting to note that among the 100 highest-scoring queries, terms such as "oscar nominations" and "high school basketball" appear.
But let us focus on the 45 search terms that GFT is actually based on. What data is this? To start with, it is a selected sample, including some queries while ignoring others. Here the manipulation of the data has already begun.
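The selection procedure described above can be illustrated with a minimal sketch. Everything here is hypothetical stand-in data – GFT's real inputs were billions of query time series screened against CDC ILI rates – but the shape of the method (score each query by how well it tracks the official series, keep the top 45) is the same, and the sketch also shows why a spurious query can slip into the high-scoring set by chance alone:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 104 weeks of CDC ILI rates and the weekly
# share of each of 1,000 candidate queries (GFT screened billions).
n_weeks, n_queries = 104, 1000
cdc_ili = rng.random(n_weeks)
query_share = rng.random((n_weeks, n_queries))

# Plant five queries that genuinely track the CDC signal.
query_share[:, :5] += 3 * cdc_ili[:, None]

# Score every query by its correlation with the CDC series...
scores = np.array([np.corrcoef(query_share[:, j], cdc_ili)[0, 1]
                   for j in range(n_queries)])

# ...and keep the 45 best-scoring ones, as GFT reportedly did.
top45 = np.argsort(scores)[::-1][:45]

# The remaining 40 slots are filled by queries that correlate purely
# by chance - the "oscar nominations" problem mentioned above.
```

With enough candidate queries, some unrelated ones will always score well on the training period, which is exactly the overfitting risk behind the odd entries in the top 100.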
The sample might contain searches made by persons who have the flu, by people who have some other flu-like illness, by people who are curious, who know someone who has the flu, who heard about it on the news and want to know more, and so on. Even Google's own service for finding possibly related health conditions might increase this "surfing for symptoms" behavior. The media's positioning of the flu in the limelight might also increase the searches included in the sample, and media attention has indeed been cited as one explanation for GFT's mispredictions. One must also keep in mind the ability of companies or other entities to manipulate social networks for their own interests.
We surely also have a group of people who use other words to look for flu-related material. If they do not use one of the 45 search queries that Google includes in its sample for GFT, they are not counted in the result. One must also take into consideration the group that does not go online to search for flu-related terms at all, whether they have the flu or not.
But Google reached the conclusion that the 45 search queries were all the data needed for an accurate forecast. Cathy O'Neil and Rachel Schutt call this a way of "excluding the voices of people who don't have the time, energy, or access to cast their vote in all sorts of informal, possibly unannounced, elections" (2013:pos 638). But what this group of people does, "Google's system doesn't know, and it doesn't care", as Kenneth Cukier and Viktor Mayer-Schonberger put it in an article published in Foreign Affairs. GFT simply draws a correlation between Google searches and flu outbreaks.
Applying big data to estimate the lifetime of car parts or manholes, as shown in Big Data: A Revolution That Will Transform How We Live, Work and Think (2013), might be reasonable. If the parts are correctly made, they presumably have a certain lifetime before breaking. But people are not that predictable.
We know that people do not always do what is expected of them – the media, or a phenomenon like the swine flu, can make them change their search patterns, and hence change the patterns in the big data. But big data doesn't care why; big data only cares about what: that people are searching online for terms they consider flu-related. This does not necessarily mean that more people have or will get the flu, only that more people are interested. That could create a false alarm, and hence lead to unnecessary measures.
Imagine if the same method were applied to Ebola – what would the results be? Probably a panicked world with potential cases reported everywhere, and a digital divide complicating things further.
Big data hubris?
The authors of the paper The Parable of Google Flu: Traps in Big Data Analysis are strongly critical of what they call "big data hubris": the assumption that big data is a substitute for, rather than a complement to, traditional data collection and analysis (2014:2). Is this what happened in the case of GFT?
The writers of Doing Data Science: Straight Talk from the Frontline say that "Even if we have access to all of FB's or Google's or Twitter's data corpus, any inferences we make from that data should not be extended to draw conclusions about humans beyond those sets of users for any particular day" (Schutt and O'Neil, 2013:pos 574). The authors of Big Data claim that we have to give up our striving for accuracy on the micro level in exchange for the insights we gain on the macro level – in this case, the spread of the flu across countries and over time (Cukier and Mayer-Schonberger, 2013).
Hence we can see different degrees of belief in big data, and in how its results should be applied.
Looking at the GFT example, we can see that the use of big data has been, and is, useful as a real-time estimate of the spread of the flu, compared to the CDC figures, which are reported on a weekly basis. But GFT can also fail, and hence predict too many cases of flu, or miss cases entirely. Lazer, Kennedy, King and Vespignani argue that trusting GFT alone, with its flaws, can lead to erroneous conclusions and actions, but that GFT is still useful, and that combining it with other data might improve its accuracy (2014:2).
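The combination Lazer and colleagues suggest can be sketched with entirely synthetic data. The sketch below assumes an invented weekly ILI series, a GFT-style nowcast that overestimates prevalence (as in 2011–2013), and CDC figures that arrive one week late; it then fits a simple linear blend of the two sources. None of the numbers come from the real systems – the point is only that a lagged but accurate source can correct a timely but biased one:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented series: true weekly ILI rates over three years, a noisy
# GFT-style nowcast biased upward, and CDC data delayed by one week.
n = 156
ili = 2 + np.sin(np.arange(n) * 2 * np.pi / 52) + 0.1 * rng.standard_normal(n)
gft = 1.5 * ili + 0.3 * rng.standard_normal(n)  # overestimates prevalence
cdc_lagged = np.roll(ili, 1)                    # last week's CDC report

# Fit a linear blend of nowcast + lagged report on the first two years,
# then evaluate on the third year.
X = np.column_stack([gft, cdc_lagged, np.ones(n)])
train, test = slice(1, 104), slice(104, n)
coef, *_ = np.linalg.lstsq(X[train], ili[train], rcond=None)

blend_err = np.abs(X[test] @ coef - ili[test]).mean()
gft_err = np.abs(gft[test] - ili[test]).mean()
# The blended estimate tracks the true series more closely than
# the biased nowcast alone.
```

On this toy data the blend's error is far below that of the raw nowcast, which is the intuition behind treating GFT as a complement to, not a substitute for, the CDC's surveillance.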
With this in mind, we might conclude that a number or a search term may be just a number or a search term, and that predictions are only probabilities – ones that sometimes come true and sometimes do not.