Monday, March 23, 2009

Missing Values

Further to a thread started a few months ago about working with missing values in survey data, some recent insights on whether or not to use multiple imputation are summarised below. The answer depends on an important distinction: whether the data are missing completely at random (MCAR) or missing not at random (MNAR).

According to Howell (2007), using dummy variables to code for missing observations was popularized in the behavioral sciences by Cohen and Cohen (1983). However, the approach does not produce unbiased parameter estimates (Jones, 1996), and it is no longer recommended now that software for multiple imputation is widely available.
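For readers who have not met dummy-variable coding before, here is a minimal sketch of how the coding is usually set up. It assumes the common variant in which each missing value is replaced by the mean of the observed values and a 0/1 indicator records where the replacement happened; the function name and the income figures are made up for illustration and do not come from Howell or from Cohen and Cohen.

```python
from statistics import mean

def missing_indicator_coding(values):
    """Return (filled, indicator) for a list in which None marks a missing value."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)  # assumption: plug in the mean of the observed values
    filled = [fill if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]  # 1 = value was missing
    return filled, indicator

income = [32.0, None, 48.0, 40.0, None]  # hypothetical incomes, in thousands
filled, indicator = missing_indicator_coding(income)
print(filled)
print(indicator)
```

Both columns would then be entered as predictors in the regression. The sketch only shows the coding step; it does nothing to address the bias that Jones (1996) describes.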

However, there is a further complication: data may be missing for several different reasons. Data may be missing completely at random (MCAR) because equipment malfunctioned, the weather was terrible, people got sick, or the data were not entered correctly. When data are MCAR, the probability that an observation (Xi) is missing is unrelated to the value of Xi or to the value of any other variable. Thus data on family income would not be considered MCAR if people with low incomes were less likely to report their family income than people with higher incomes.
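The family income example can be illustrated with a small simulation. This is only a sketch under assumed numbers: the lognormal income distribution and the non-response rates (30% for everyone in the MCAR case; 50% for below-median incomes and 10% for the rest in the other case) are invented for illustration and are not taken from Howell.

```python
import random
import statistics

def simulate_income_missingness(n=20000, seed=0):
    rng = random.Random(seed)
    # Hypothetical population of family incomes (lognormal, made-up parameters).
    incomes = [rng.lognormvariate(10.5, 0.6) for _ in range(n)]
    median = statistics.median(incomes)

    # MCAR: every respondent has the same 30% chance of not reporting.
    mcar_observed = [x for x in incomes if rng.random() > 0.30]

    # Not MCAR: low incomes are less likely to be reported than high incomes.
    dependent_observed = [
        x for x in incomes
        if rng.random() > (0.50 if x < median else 0.10)
    ]

    return {
        "full": statistics.mean(incomes),
        "mcar": statistics.mean(mcar_observed),
        "income_dependent": statistics.mean(dependent_observed),
    }

means = simulate_income_missingness()
for label, value in means.items():
    print(f"{label:>17}: {value:,.0f}")
```

Under these assumptions the mean of the MCAR sample should stay close to the full-sample mean, while the mean of the income-dependent sample should be noticeably too high, because low incomes are under-represented among those who report.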

According to Howell (2007), the only way to obtain an unbiased estimate of parameters when data are missing not at random (MNAR) is to model the missingness. Essentially, one needs to write a model that accounts for the missing data. That model could then be incorporated into a more complex model for estimating missing values. Howell notes that this is not a task to take on lightly, but he references Dunning and Freedman (2008) for an example. (Sadly, Professor Freedman passed away on 17 October 2008. His webpage is worth looking at.)

If data are MNAR, it is better to code missing observations with dummy variables than to use multiple imputation. However, it should be remembered that this will not produce unbiased parameter estimates (Jones, 1996).

1 comment:

Martin Ryan said...

Also worth looking at is Paul Allison's book on missing values, partly available on the web thanks to Google Books:

http://books.google.ie/books?id=ZtYArHXjpB8C&pg=PA11&lpg=PA11&dq=jones+1996+missing+values&source=bl&ots=ziSDzDLqV2&sig=chbVCPYi6ZDYsARvN3-_RKsc074&hl=en&ei=_LXHSZi3HYOv-Abi_rToBg&sa=X&oi=book_result&resnum=3&ct=result#PPP1,M1

And a paper by Alan Acock on working with missing values:

http://oregonstate.edu/~acock/growth-curves/working%20with%20missing%20values.pdf