Friday, February 11, 2011

How not to do Instrumental variables

Instrumental Variable estimation, Generalized Method of Moments and related techniques are part of the standard toolkit for applied economists. They are also increasingly used in other fields such as health.
What everyone knows, or should know, is that while one can think of these models as a two-stage process, this is not actually how you estimate them. But this paper, which looks at how systolic blood pressure depends on anti-hypertensive drugs in Japan and was published in the Bulletin of the World Health Organization in 2008, gets it badly wrong. As the authors note, a simple regression of blood pressure on medication is likely to give a positive slope, so you need to instrument or do something else.
They estimate a logit and then stick the predicted values into an OLS model. Aside from the identifying assumption (which isn't discussed and looks pretty dodgy to me), this is not IV as usually defined, and it is not clear that the estimate is consistent or that the standard errors are correct. The model also includes controls for exercise, which are likely to be endogenous too, but this is ignored.
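
To see why the simple regression can mislead, here is a minimal simulation sketch in Python. Everything in it is an assumption made for illustration and none of it comes from the paper: the data-generating process (sicker patients are more likely to be medicated), the binary instrument z, the true treatment effect of -10, and the helper names simulate and cov.

```python
import random

def simulate(n=50_000, seed=0):
    """Hypothetical data: unobserved severity u raises both treatment and blood pressure."""
    rng = random.Random(seed)
    z, d, y = [], [], []
    for _ in range(n):
        zi = 1.0 if rng.random() < 0.5 else 0.0      # instrument, independent of u
        u = rng.gauss(0, 1)                          # unobserved severity
        di = 1.0 if zi + u + rng.gauss(0, 1) > 0.5 else 0.0    # medication dummy
        yi = 130.0 - 10.0 * di + 15.0 * u + rng.gauss(0, 5)    # assumed true effect: -10
        z.append(zi)
        d.append(di)
        y.append(yi)
    return z, d, y

def cov(a, b):
    """Sample covariance (divides by n)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

z, d, y = simulate()
ols_slope = cov(d, y) / cov(d, d)   # simple regression of y on d
iv_slope = cov(z, y) / cov(z, d)    # IV estimator using z as the instrument
print(f"OLS slope: {ols_slope:.2f}   IV slope: {iv_slope:.2f}")
```

In this made-up setting the OLS slope comes out positive even though the drug lowers blood pressure, while the IV estimate lands near the assumed effect. That only holds because z is valid by construction here, which is exactly what cannot be taken for granted in the paper.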

7 comments:

Vincent O Sullivan said...

I will admit to not reading the paper in depth but is it because the first stage is non-linear?

Mostly Harmless talks about "forbidden regressions" and their dangers but doesn't really offer solutions.

With an endogenous binary variable, should you just run an LPM in the first stage and plug the fitted values into the second stage? LPM has a serious heteroskedasticity problem, but if you bootstrapped would you be OK?

Kevin Denny said...

Well, two reasons. Yes, using a logit is mad; regular IV implicitly uses an LPM in the first stage. The second reason is that the second equation isn't actually OLS in IV, in that the estimated covariance matrix (from OLS) will be wrong. My reading of the paper is that this is what they did, effectively treating the instrumented treatment variable as non-stochastic.
Regular IV with an endogenous dummy is fine, or you could do an ML model using a probit first stage ("treatreg" in Stata).
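
A rough sketch of the covariance-matrix point, in Python on simulated data. The data-generating process, the single binary instrument and all names are assumptions for illustration. With a linear first stage, the manual second stage reproduces the 2SLS point estimate, but the standard error that OLS would report is built from residuals using the fitted treatment rather than the actual one.

```python
import math
import random

def simulate(n=50_000, seed=0):
    """Hypothetical data with an endogenous treatment dummy d and instrument z."""
    rng = random.Random(seed)
    z, d, y = [], [], []
    for _ in range(n):
        zi = 1.0 if rng.random() < 0.5 else 0.0
        u = rng.gauss(0, 1)                          # unobserved confounder
        di = 1.0 if zi + u + rng.gauss(0, 1) > 0.5 else 0.0
        yi = 130.0 - 10.0 * di + 15.0 * u + rng.gauss(0, 5)
        z.append(zi)
        d.append(di)
        y.append(yi)
    return z, d, y

def mean(a):
    return sum(a) / len(a)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

z, d, y = simulate()
n = len(y)

# First stage: linear probability model of d on z, giving fitted values d_hat.
a1 = cov(z, d) / cov(z, z)
a0 = mean(d) - a1 * mean(z)
d_hat = [a0 + a1 * zi for zi in z]

# Second stage: OLS of y on d_hat. The slope equals the 2SLS point estimate.
b = cov(d_hat, y) / cov(d_hat, d_hat)
a = mean(y) - b * mean(d_hat)

def slope_se(regressor):
    """Slope standard error using residuals y - a - b * regressor."""
    rss = sum((yi - a - b * ri) ** 2 for yi, ri in zip(y, regressor))
    return math.sqrt(rss / (n - 2) / (n * cov(d_hat, d_hat)))

se_plugin = slope_se(d_hat)   # what the manual second-stage OLS would report
se_2sls = slope_se(d)         # 2SLS formula: residuals use the actual treatment d
print(f"slope {b:.2f}   plug-in SE {se_plugin:.3f}   2SLS SE {se_2sls:.3f}")
```

The two standard errors differ even though the slope is identical. The direction and size of the gap depend on the data, so nothing general should be read into the particular numbers from this simulation.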

Liam Delaney said...

The Angrist and Krueger 2001 Journal of Economic Perspectives paper also talks briefly about the problems with using logit in the first stage. I think what they are doing is instrumental variable modelling though, Kevin. I think your critique would be more accurately rendered as: you have problems with the type of instrumental variable strategy they use.

Liam Delaney said...

Just saw your comment. So your problem is mostly with the use of the predicted values from the first stage in a non-stochastic way in the second equation. I thought your problem was the use of the logit as a way of generating the first-stage predictions (something people are arguing against in various ways). This is still IV in a broad sense, surely? In another sense, whether we call it IV or not is not really important. What is the main problem with this as a methodology? If the instrument were valid, what would be the consistency problems with using this procedure?

PLW said...

I don't think logit is any worse than probit, but you definitely want to run IV (2SLS, for instance) with the predicted values from the original LDV regression as instruments. Just running OLS in the second stage is consistent only if your original LDV model is exactly correct, and even then the standard errors are wrong. Wooldridge (2002), pp. 623-625, is the best reference here.

Liam Delaney said...

Ok, my motivation in commenting here is to clarify what exactly Kevin is saying about the paper so that we are not being unfair on the authors of the paper. If the critique is that the two-stage estimation as outlined in the paper is inconsistent then fine. This is also covered in the Angrist and Krueger review.

So when Kevin says "this is not IV" I am going to read him as saying "this is a poor way of executing an IV strategy" or even that "this way requires a lot more unrealistic assumptions to be consistent".

Stephen O'Neill said...

I think Wooldridge's procedure 18.1 in "Econometric Analysis of Cross Section and Panel Data" applies here.

(a) Estimate the binary response model by maximum likelihood, e.g. by probit. Obtain the fitted probabilities, Ĝ.
(b) Estimate the original regression by IV using instruments 1, Ĝ and x.
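
The two steps can be sketched in Python as follows. This is a minimal illustration, not Wooldridge's code: the simulated data, the continuous instrument z, the Fisher-scoring probit fit and every function name are assumptions, and there are no exogenous covariates x, so the instruments reduce to a constant and the fitted probabilities (g_hat in the code).

```python
import math
import random

def simulate(n=20_000, seed=1):
    """Hypothetical data: P(d = 1 | z) is a probit in z; assumed true effect of d is -10."""
    rng = random.Random(seed)
    z, d, y = [], [], []
    for _ in range(n):
        zi = rng.gauss(0, 1)                         # continuous instrument
        u = rng.gauss(0, 1)                          # unobserved confounder
        di = 1.0 if zi + u + rng.gauss(0, 1) > 0.5 else 0.0
        yi = 130.0 - 10.0 * di + 15.0 * u + rng.gauss(0, 5)
        z.append(zi)
        d.append(di)
        y.append(yi)
    return z, d, y

def norm_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def norm_pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def fit_probit(z, d, iters=15):
    """Probit of d on (1, z) by maximum likelihood, using Fisher scoring."""
    g0, g1 = 0.0, 0.0
    for _ in range(iters):
        s0 = s1 = h00 = h01 = h11 = 0.0
        for zi, di in zip(z, d):
            t = g0 + g1 * zi
            p = min(max(norm_cdf(t), 1e-10), 1.0 - 1e-10)
            f = norm_pdf(t)
            w = f * (di - p) / (p * (1.0 - p))       # score contribution
            k = f * f / (p * (1.0 - p))              # expected-information weight
            s0 += w
            s1 += w * zi
            h00 += k
            h01 += k * zi
            h11 += k * zi * zi
        det = h00 * h11 - h01 * h01
        g0 += (h11 * s0 - h01 * s1) / det            # scoring step: solve H * delta = s
        g1 += (h00 * s1 - h01 * s0) / det
    return g0, g1

def cov(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

z, d, y = simulate()

# Step (a): binary response model by ML, then fitted probabilities.
g0, g1 = fit_probit(z, d)
g_hat = [norm_cdf(g0 + g1 * zi) for zi in z]

# Step (b): IV with g_hat as the instrument for d (just-identified case).
iv_slope = cov(g_hat, y) / cov(g_hat, d)
print(f"probit slope: {g1:.3f}   IV estimate: {iv_slope:.2f}")
```

The point of the design is that g_hat enters only as an instrument. The plug-in approach criticised in the post would instead regress y directly on g_hat, i.e. cov(g_hat, y) / cov(g_hat, g_hat), which relies on the first-stage model being exactly right.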