Monday, April 20, 2009

How To Improve Econometric Analysis Using Data from Google Trends - They Can Predict The Flu

In the current edition of the Economist, there is an article on how data from Google Trends can help predict economic statistics before they become available. For example, using data on searches for trucks and SUVs to predict the monthly sales of motor vehicles reduces the average error by up to 18% compared with the predictions from a model that did not incorporate the search data. These findings are from a new economics paper written by Hal Varian, the Chief Economist at Google, with Hyunyoung Choi, also at Google. (There is a link to the Google working paper here on the Google Research Blog.)

The authors argue that fluctuations in the frequency with which people search for certain words or phrases online can improve the accuracy of the econometric models used to predict, for example, retail-sales figures or house sales. "Actual numbers for such things are usually available only with a lag. But Google’s search data are updated every day, so they can in theory capture shifts in consumer behaviour before official numbers are released."
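To make the comparison concrete, here is a minimal sketch of the general idea: fit one model on lagged sales only, fit a second that also uses a same-period search index, and compare out-of-sample errors. Everything in it is an assumption for illustration. The data are synthetic, the function names (`ols`, `make_synthetic`, `compare`) are my own, and the one-lag specification is a stand-in rather than the model used in the Varian and Choi paper.

```python
import random


def ols(X, y):
    """Least-squares coefficients via the normal equations (small problems only)."""
    k = len(X[0])
    # Build X'X and X'y.
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        b[c], b[p] = b[p], b[c]
        for r in range(c + 1, k):
            f = A[r][c] / A[c][c]
            for j in range(c, k):
                A[r][j] -= f * A[c][j]
            b[r] -= f * b[c]
    # Back-substitution.
    beta = [0.0] * k
    for c in reversed(range(k)):
        tail = sum(A[c][j] * beta[j] for j in range(c + 1, k))
        beta[c] = (b[c] - tail) / A[c][c]
    return beta


def mean_abs_error(X, y, beta):
    """Average absolute gap between actual values and fitted values."""
    preds = [sum(bi * xi for bi, xi in zip(beta, row)) for row in X]
    return sum(abs(yi - pi) for yi, pi in zip(y, preds)) / len(y)


def make_synthetic(n=120, seed=0):
    """Synthetic (not real) data: sales depend on last period's sales and on current searches."""
    rng = random.Random(seed)
    search = [rng.gauss(0, 1) for _ in range(n)]
    sales = [0.0]
    for t in range(1, n):
        sales.append(0.5 * sales[t - 1] + 0.8 * search[t] + rng.gauss(0, 0.3))
    return sales, search


def compare(sales, search, split=80):
    """Return (baseline error, search-augmented error) on the held-out periods."""
    y = sales[1:]
    base_X = [[1.0, sales[t - 1]] for t in range(1, len(sales))]
    aug_X = [[1.0, sales[t - 1], search[t]] for t in range(1, len(sales))]
    b_base = ols(base_X[:split], y[:split])
    b_aug = ols(aug_X[:split], y[:split])
    return (mean_abs_error(base_X[split:], y[split:], b_base),
            mean_abs_error(aug_X[split:], y[split:], b_aug))


if __name__ == "__main__":
    base_err, aug_err = compare(*make_synthetic())
    print("baseline error:", base_err)
    print("with search data:", aug_err)
```

Because the synthetic sales series is built to depend on the search index, the augmented model wins here by construction. With real data, the size of any improvement is an empirical question.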

These data are available through a site called Google Trends. This software has been discussed on the blog quite a few times, for example here in relation to predicting economic sentiment from search engine behaviour.

I mentioned Gord Hotchkiss, who asked in the middle of 2008 "what if our mood turns to anxiety about the future? We still search, but we search for different things. We search for information needed to help us weather the storm. Or, we search out of a desperate desire need to know just how bad things are." To illustrate, Hotchkiss presents the following Google Trend graph which shows the relative search volume and news coverage volume of "house plans" (blue line) and "foreclosures" (red line) in America over the last few years:

The Varian and Choi paper discusses how for some things, like retail sales, the categories into which Google classifies its search-trend data correspond closely to what people may want to predict, such as the sales of a particular brand of car. For others, like sales of houses, things are less clear. It appears that searches for estate agents work better than those for home financing.

Some experimentation that I have done with the Trends software has convinced me that the selection of the keyword is a crucial consideration when trying to analyse search volume. For example, the use of "Bush", "George Bush" and "George Bush Jr" produces very different results. So how can this issue be addressed? The answer may be to find the most popular keywords related to a core question, and to aggregate these for analysis. I have yet to find an aggregation function for keywords in Google Trends, but I have discovered a website that provides information about the most popular keywords used in web searches:

A list of the top 200 search terms that people use, week by week or month by month, is available for free from Sitepsych. A casual inspection of the top 200 list over a 90-day period quickly tells you that the most popular things that people are looking for on the web are sex, music, games, dogs, golf, the weather and map-directions. Sex and music dominate.
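Returning to the aggregation idea: in the absence of a built-in function, one rough way to combine several related keywords is to rescale each exported series to a common peak and then average them period by period. The sketch below assumes each keyword's series has already been downloaded as an equal-length list of relative values. The function names and the choice of a simple unweighted average are my own assumptions, not anything Google Trends provides.

```python
def rescale(series, peak=100.0):
    """Rescale a series so that its largest value equals `peak`."""
    top = max(series)
    if top == 0:
        return [0.0 for _ in series]
    return [peak * value / top for value in series]


def aggregate_keywords(series_by_keyword):
    """Average several keyword series, period by period, after rescaling each one.

    `series_by_keyword` maps a keyword to a list of relative search volumes;
    all lists must cover the same periods in the same order.
    """
    lengths = {len(s) for s in series_by_keyword.values()}
    if len(lengths) != 1:
        raise ValueError("all keyword series must have the same length")
    scaled = [rescale(s) for s in series_by_keyword.values()]
    return [sum(column) / len(column) for column in zip(*scaled)]


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    example = {
        "Bush": [40, 60, 80],
        "George Bush": [10, 30, 20],
    }
    print(aggregate_keywords(example))
```

Rescaling first stops the highest-volume keyword from swamping the others, at the cost of treating every keyword as equally informative. Weighting by volume would be the obvious alternative.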

Getting back to the Google Trends software, I noted before that Google lets users get their hands dirty with the secondary data. In fact, Varian and Choi write on the Google Research Blog that they want forecasting wannabes to download some Google Trends data and try to relate it to other economic time series. If you find an interesting pattern, they invite you to post your findings on a website and send them a link. They will report on the most interesting results in a later blog post.

I'm thinking of putting together something on when the recession entered the public consciousness, with particular reference to Ireland. Was this a slow-burning process or were there shocks? I suspect it was largely the former but with a preliminary shock in August 2007, a subsequent shock in August 2008 and a critical threshold in November 2008. Did it come through media reference first or through search volume? Again, I suspect that it was largely the former but that there was convergence over time. If the temporal evolution is distinct, can I show that one affected the other? This seems tricky. Should I expect non-stationarity in both series? I definitely think so.
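As a first exploratory step on the lead-lag question, one could difference both series to remove trending behaviour and then check at which lag the changes in one line up best with the changes in the other. The sketch below is only that: it is not a formal stationarity test or a causality test, and the function names are hypothetical. It assumes two equal-length series sampled at the same dates.

```python
def difference(series):
    """First differences: the change from each period to the next."""
    return [b - a for a, b in zip(series, series[1:])]


def correlation(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def lagged_correlation(leader, follower, lag):
    """Correlation between leader[t] and follower[t + lag], for lag >= 0."""
    if lag == 0:
        return correlation(leader, follower)
    return correlation(leader[:-lag], follower[lag:])


def best_lag(leader, follower, max_lag):
    """Lag (0..max_lag) at which the absolute correlation is largest."""
    return max(range(max_lag + 1),
               key=lambda k: abs(lagged_correlation(leader, follower, k)))


if __name__ == "__main__":
    # Made-up series: "search" simply repeats "media" two periods later.
    media = [0, 1, 3, 2, 5, 4, 7, 9, 8, 12, 11, 15]
    search = [0, 0] + media[:-2]
    print(best_lag(difference(media), difference(search), max_lag=4))
```

Running it both ways (media as leader, then search as leader) gives a rough sense of direction. A high correlation at some lag is only suggestive, since both series could be responding to the same underlying events.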

For a list of links to all the software mentioned above, and a discussion of how online search statistics may help drive Irish economic recovery, see this post from earlier on the blog: Web-based Technology and the Recovery - What Do Irish Consumers Want?

Finally, below is a video which shows that certain search terms are good indicators of flu activity. Google Flu Trends uses aggregated Google search data to estimate flu activity up to two weeks faster than traditional flu surveillance systems. There was an article published about this in Nature during February: Detecting influenza epidemics using search engine query data.

