Wednesday, October 29, 2008

Working With Missing Data in Survey Analysis

A new working paper from NUIM Economics addresses missing data in probit estimation: "An Efficient Estimator for Dealing with Missing Data on Explanatory Variables in a Probit Choice Model" (Denis Conniffe and Donal O’Neill). According to the authors, a common approach to missing data in econometrics is to estimate the model only on the observations with complete data (complete case analysis), thereby throwing away potentially useful information. To avoid this case deletion in the particular context of a probit model with missing data on the explanatory variables, the authors develop a new estimator. Their simulation results show that the new estimator performs well when compared to popular alternatives, such as complete case analysis and multiple imputation.
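To make the cost of case deletion concrete, here is a minimal sketch of complete case analysis. It is not the authors' estimator; the function name `complete_cases` and the toy rows (an outcome followed by two explanatory variables, with `None` marking a missing value) are made up for illustration.

```python
def complete_cases(rows):
    """Keep only the rows in which every variable is observed (listwise deletion)."""
    return [row for row in rows if all(value is not None for value in row)]


# Hypothetical data: (outcome, x1, x2); None marks a missing value.
rows = [
    (1, 0.5, 2.0),
    (0, None, 1.5),
    (1, 1.2, None),
    (0, 0.3, 0.7),
    (1, None, None),
    (0, 0.9, 1.1),
]

kept = complete_cases(rows)
print(f"{len(kept)} of {len(rows)} rows survive case deletion")  # 3 of 6
```

Note that the second and third rows are dropped even though each still carries an observed outcome and one observed explanatory variable. That discarded information is what the alternatives to case deletion try to use.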

A few of us have recently been discussing missing data and how to address it (mostly with multiple imputation). Below is a list of some resources we have found. If anyone is aware of other lecture notes on missing data, multiple imputation software packages or relevant econometric estimators, I suggest that we build up a list in the comments on this post.

(i) The NBER econometrics video (and lecture-notes) on missing values, presented by Wooldridge: http://www.nber.org/WNE/lect_12_missing.pdf

(ii) The Gary King lecture-notes on missing values: http://gking.harvard.edu/g2001syl/files/eviltlkP.pdf These notes mention the software package developed by Gary King to implement multiple imputation of missing values. The package is called Amelia and there is a comprehensive guide to it made available by King here: http://gking.harvard.edu/amelia/

(In general, the King site has some great notes - available here)

(iii) A political science lecturer from UCD called Jos Elkink has some lecture-notes on missing values: http://jaeweb.cantr.net/aqm_2008_lecture_missing.pdf

(iv) The multiple imputation FAQ page: http://www.stat.psu.edu/~jls/mifaq.html#ref

(v) http://www.multiple-imputation.com/

(vi) Stephen Soldz's resources for missing data: http://www.soldzresearch.com/statisticsresources.htm#MissingData

(vii) The Southampton CASS course on missing values: http://www.s3ri.soton.ac.uk/cass/showcourse.php?id=71

(viii) The course from the Cambridge Biostatistics Unit (Patrick Royston is one of the lecturers here): http://www.mrc-bsu.cam.ac.uk/MIcourse/index.shtml

(ix) The ICE software package in Stata: http://www.ats.ucla.edu/stat/Stata/library/ice.htm

(x) The Hotdeck module in Stata: http://ideas.repec.org/c/boc/bocode/s366901.html

(xi) David Howell's notes on working with missing data:
http://www.uvm.edu/~dhowell/StatPages/More_Stuff/Missing_Data/Missing.html

(xii) Joe Schafer's notes on missing data in longitudinal studies:
http://www.stat.psu.edu/~jls/aaps_schafer.pdf

(xiii) Richard Williams' notes on missing data (including traditional approaches in Stata): http://www.nd.edu/~rwilliam/stats2/l12.pdf

(xiv) A book on missing data by Patrick E McKnight et al., made partially available by Googlebooks here
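
Since most of the resources above deal with multiple imputation, here is a rough sketch of the pooling step that follows it, usually known as Rubin's rules: the analysis is run on each of the m imputed datasets, the point estimates are averaged, and the total variance adds the average within-imputation variance to the between-imputation variance inflated by (1 + 1/m). The function name `pool_rubin` and the numbers are illustrative only and do not come from any of the packages listed.

```python
import statistics


def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors with Rubin's rules."""
    m = len(estimates)
    pooled_estimate = statistics.fmean(estimates)
    within = statistics.fmean(variances)       # average within-imputation variance
    between = statistics.variance(estimates)   # sample variance across imputations (divides by m - 1)
    total_variance = within + (1 + 1 / m) * between
    return pooled_estimate, total_variance


# Hypothetical coefficient estimates and variances from m = 5 imputed datasets.
estimates = [0.50, 0.54, 0.46, 0.52, 0.48]
variances = [0.010, 0.010, 0.010, 0.010, 0.010]

pooled, total = pool_rubin(estimates, variances)
print(pooled, total ** 0.5)  # pooled estimate and its standard error
```

The between-imputation term is what single imputation leaves out: it carries the extra uncertainty that comes from not knowing the missing values.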

7 comments:

Martin Ryan said...

There is a recent discussion on the IQSS blog about multiple imputation of categorical data. Most standard multiple imputation packages assume a multivariate normal (MVN) distribution, which may not hold for certain types of categorical and binary data. The standard shortcut for overcoming this problem is simply to impute under the MVN assumption and then round the imputed values to the nearest valid category. A more refined approach is suggested by Recai Yucel, Yulei He, and Alan Zaslavsky in their May 2008 article in 'The American Statistician'.

http://www.iq.harvard.edu/blog/sss/archives/2008/09/a_handy_trick_f.shtml
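
Here is a rough sketch of the rounding shortcut described above, and of why it can go wrong for a rare binary variable. It is not the method from the article. As simplifying assumptions, it uses a single univariate normal fitted to the observed values as a stand-in for the MVN model, imputes only once, and rounds at a fixed cutoff of 0.5. The function name and the simulated data are made up.

```python
import random
import statistics


def impute_normal_then_round(values, rng, cutoff=0.5):
    """Replace None with a normal draw (observed mean and sd), rounded to 0 or 1."""
    observed = [v for v in values if v is not None]
    mu = statistics.fmean(observed)
    sigma = statistics.pstdev(observed)
    filled = []
    for v in values:
        if v is None:
            draw = rng.gauss(mu, sigma)
            filled.append(1 if draw >= cutoff else 0)
        else:
            filled.append(v)
    return filled


rng = random.Random(0)
# Simulated binary variable with a true share of ones of about 5%,
# with half of the values deleted completely at random.
truth = [1 if rng.random() < 0.05 else 0 for _ in range(20000)]
data = [None if rng.random() < 0.5 else x for x in truth]

filled = impute_normal_then_round(data, rng)
observed_share = statistics.fmean(v for v in data if v is not None)
imputed_share = statistics.fmean(f for f, d in zip(filled, data) if d is None)
print(observed_share, imputed_share)
```

With a share this far from 0.5, few normal draws clear the cutoff, so the imputed values contain noticeably fewer ones than the observed values. This is the kind of distortion that a better rule for choosing the cutoff is meant to reduce.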

Martin Ryan said...

/09/a_handy_trick_f.shtml

Matt said...

Since no one has discussed this yet, let me suggest using partial identification techniques. For example see:

Partial identification with missing data: concepts and findings by Charles Manski

http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6V07-4DW3JW4-2&_user=10&_rdoc=1&_fmt=&_orig=search&_sort=d&view=c&_version=1&_urlVersion=0&_userid=10&md5=f82010d25ca18fbc781271ceb3204f1a

or Manski's overview text which has a chapter on missing data in general situations, including surveys:

http://www.amazon.com/Identification-Prediction-Decision-Charles-Manski/dp/0674026535
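
For anyone who has not seen the idea before, the simplest case is the worst-case bound on a mean: with no assumptions about why values are missing, each missing outcome could lie anywhere in its logical range, so the mean is only known to lie in an interval. Here is a minimal sketch for a bounded outcome. The function name and the toy responses are made up, and this covers only the simplest bound.

```python
def worst_case_bounds(outcomes, y_min=0.0, y_max=1.0):
    """Bounds on the mean when each missing outcome (None) may be anywhere in [y_min, y_max]."""
    n = len(outcomes)
    observed = [y for y in outcomes if y is not None]
    n_missing = n - len(observed)
    total_observed = sum(observed)
    lower = (total_observed + n_missing * y_min) / n  # every missing value at the minimum
    upper = (total_observed + n_missing * y_max) / n  # every missing value at the maximum
    return lower, upper


# Hypothetical yes/no survey item: 4 yes, 3 no, 3 non-responses.
responses = [1, 0, None, 1, None, 0, 1, 1, None, 0]
print(worst_case_bounds(responses))  # (0.4, 0.7)
```

The width of the interval equals the share of missing responses times the range of the outcome. Assumptions such as missing at random are what shrink it to a point.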

Martin Ryan said...

Thanks Matt.

I just realised that the Wooldridge lecture on 'Partial Identification' (P.I.) in the NBER video-series discusses a partial identification example related to missing values. He cites Manski throughout. I'm going to look at this and the Manski links you sent on.

The NBER video and lecture notes on P.I. are available here:

http://www.nber.org/WNE/lect_9_bounds_fig.pdf

Martin Ryan said...

There is a useful list of techniques put together here as well:

http://en.wikipedia.org/wiki/Missing_values

Martin Ryan said...

This is an interesting discussion about multiple imputation and multilevel models:

http://www.lshtm.ac.uk/msu/missingdata/papers/newsletterdec04.pdf

Martin Ryan said...

/papers/newsletterdec04.pdf