Thursday, November 11, 2010

Google Refine: A power tool for working with messy data

A story on Mashable tells us that "if you live for data, slave over spreadsheets and constantly find yourself sifting through endless rows and columns of facts and figures, Google’s got a lovely new product just for you — and it’s free and open-source, too." The product in question is Google Refine: a tool for cleaning up data, and more. I watched the introductory video on this new piece of kit; it seems to be developed to a very high standard.

However, I would not recommend Refine to graduate students conducting empirical research. As we have mentioned before on the blog, in academic research it is crucial to keep a record of every change made to a data-set, and the syntax-editors available in most software packages are the usual way to do this. Nonetheless, Google Refine could be extremely useful for undergraduate projects or for data-analysts who only use Excel. One appealing feature for security-conscious users is that the program is downloaded and run on your own machine, which means that the data never has to go on the web.
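To make the point about recording changes concrete, here is a minimal sketch in Python of what a scripted clean-up might look like. Everything in it is an assumption made for illustration: the tiny data-set, the column names, the cleaning rules and the `clean` function are all hypothetical, and the same idea applies equally to a Stata do-file or an SPSS syntax file.

```python
import csv
import io

# Hypothetical raw data. In practice this would be a file on disk
# that is read by the script and never edited by hand.
RAW = """name,income
 Alice ,52000
BOB,n/a
carol,61000
"""


def clean(rows):
    """Apply the cleaning rules in order and log every change made."""
    log = []
    cleaned = []
    for i, row in enumerate(rows, start=1):
        # Rule 1 (assumed): trim whitespace and standardise capitalisation.
        name = row["name"].strip().title()
        if name != row["name"]:
            log.append(f"row {i}: name {row['name']!r} -> {name!r}")

        # Rule 2 (assumed): non-numeric incomes are treated as missing.
        income = row["income"].strip()
        if income.isdigit():
            income_value = int(income)
        else:
            income_value = None
            log.append(f"row {i}: income {row['income']!r} set to missing")

        cleaned.append({"name": name, "income": income_value})
    return cleaned, log


rows = list(csv.DictReader(io.StringIO(RAW)))
cleaned, log = clean(rows)

# The log is the record of changes; the raw rows are left untouched.
for entry in log:
    print(entry)
```

Because the raw data is only ever read, re-running the script reproduces the cleaned data-set exactly, and the log documents each change for anyone who wants to replicate the work.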

For more sophisticated data-handling, the first thing that comes to mind is Scott Long's recent book on "The Workflow of Data Analysis". This is an excellent starting-point for anyone looking to follow best practice in their research. Also, Daniel Hamermesh has a paper on replication in economics that is arguably a must-read for graduate students beginning a program in empirical economics. An IZA WP version of the Hamermesh article is available here: Replication in Economics. The most recent discussion of data-issues on this blog (including workflow, publication bias, replication, retractions and empirical controversies) is available here.
