Control for x

March 15, 2012 at 5:29 am 8 comments

| Gabriel |

An extremely common estimation strategy, which Roland Fryer calls “name that residual,” is to throw controls at an effect and then say that whatever effect remains net of the controls is the effect. Typically, as you introduce controls the effect goes down, but not all the way down to zero. Here’s an example using simulated data where we regress y (continuous) on x (dummy) with and without a control (continuous and negatively associated with x).

                      (1)             (2)   
x                  -0.474***       -0.257***
                  (0.073)         (0.065)   

control                             0.492***

_cons               0.577***        0.319***
                  (0.054)         (0.048)   
N                    1500            1500   

So, as is typical, we see that even if you allow that x=1 tends to be associated with low values of the control, you still see an x penalty. However, this is a spurious finding: by construction of the simulation there is no net effect of x on y, only an effect mediated through the control.

This raises the question of what it means to have controlled for something. Typically we’re not really controlling for something perfectly, but only for a single part of a bundle of related concepts (or if you prefer, a noisy indicator of a latent variable). For instance when we say we’ve controlled for “human capital” the model specification might only really have self-reported highest degree attained. This leaves out both other aspects of formal education (eg, GPA, major, institution quality) and other forms of HC (eg, g and time preference). These related concepts will be correlated with the observed form of the control, but not perfectly. Indeed, the same problem arises even if we don’t have “omitted variable bias” but just measurement error on a single variable, as is the assumption of this simulation.

To get back to the simulation, let’s appreciate that the “control” is really the control as observed. If we could perfectly specify the control variable, the main effect might go down all the way to zero. In fact in the simulation that’s exactly what happens.

                      (1)             (2)             (3)   
x                  -0.474***       -0.257***       -0.005   
                  (0.073)         (0.065)         (0.053)   

control                             0.492***                

control_good                                        0.980***

_cons               0.577***        0.319***        0.538***
                  (0.054)         (0.048)         (0.038)   
N                    1500            1500            1500   

That is, when we specify the control with error much of the x penalty persists. However when we specify the control without error the net effect of x disappears entirely. Unfortunately in reality we don’t have the option of measuring something perfectly and so all we can do is be cautious about whether a better specification would further cram down the main effect we’re trying to measure.
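Why the coefficients in column (2) land where they do can be sketched with the standard errors-in-variables algebra. The notation is introduced here for illustration and is not in the original post: c stands for control_good, c* for the observed control, and λ for the reliability ratio. The means and variances are the ones set in the simulation code.

```latex
\begin{aligned}
c &= 0.05 - 0.5\,x + v, & v &\sim N(0,1) \\
y &= c + u, & u &\sim N(0.5,1) \\
c^{*} &= c + e, & e &\sim N(0.5,1) \\[4pt]
\lambda &= \frac{\operatorname{Var}(v)}{\operatorname{Var}(v)+\operatorname{Var}(e)}
         = \frac{1}{1+1} = 0.5 \\[4pt]
\operatorname{plim}\,\hat\beta_{c^{*}} &= \lambda \cdot 1 = 0.5 \\
\operatorname{plim}\,\hat\beta_{x} &= -0.5\,(1-\lambda) = -0.25
\end{aligned}
```

Under these assumptions the large-sample values are close to the 0.492 and -0.257 reported in column (2). The nonzero mean of the error e only shifts the constant; it is the variance of e that leaves part of the x penalty in place.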

Here’s the code

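* requires the estout package (ssc install estout) for eststo/esttab
clear
set obs 1500
* x is a dummy
gen x=round(runiform())
* the true control, negatively associated with x
gen control_good=rnormal(.05,1) - x/2
* y depends only on the true control, not directly on x
gen y=control_good+rnormal(0.5,1)
* the observed control is the true control plus measurement error
gen control=control_good+rnormal(0.5,1)
eststo clear
eststo: reg y x
eststo: reg y x control
* first table: models (1) and (2)
esttab, se b(3) se(3) nodepvars nomtitles
eststo: reg y x control_good
* second table: adds model (3)
esttab, se b(3) se(3) nodepvars nomtitles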

*have a nice day
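To see how much the x coefficient depends on the amount of measurement error, here is a minimal sketch that reruns the second model while varying the standard deviation of the noise on the control. The seed, the variable name control_noisy, and the particular sd values are assumptions made for illustration, not part of the original simulation.

```stata
* sketch: how the x coefficient changes with the sd of the measurement error
* (seed, sd values, and the name control_noisy are illustrative assumptions)
clear
set seed 20120315
set obs 1500
gen x = round(runiform())
gen control_good = rnormal(.05,1) - x/2
gen y = control_good + rnormal(0.5,1)
foreach s of numlist 0.25 0.5 1 2 {
    * observed control = true control + mean-zero noise with the current sd
    quietly gen control_noisy = control_good + rnormal(0,`s')
    quietly reg y x control_noisy
    display "error sd = `s'" "   b[x] = " %6.3f _b[x] "   b[control_noisy] = " %6.3f _b[control_noisy]
    drop control_noisy
}
```

If the errors-in-variables logic holds, the coefficient on x should drift from close to zero toward the uncontrolled estimate as the error sd grows, while the coefficient on the noisy control shrinks toward zero.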



  • 1. Michael Bishop  |  March 15, 2012 at 1:36 pm

    +1 I might use this for teaching

  • 2. Ethan Fosse  |  March 15, 2012 at 3:19 pm

    Great example. Thanks for posting this.

  • 3. Marc F. Bellemare  |  March 15, 2012 at 8:07 pm

    This is great. I’m working on something in which my coauthor and I explain the persistence of some behavior, and show that this “persistence” coefficient gets chipped away at as we include controls (namely, layers of increasingly precise fixed effects…) This is very helpful. Thanks!

  • 4. jeff  |  March 16, 2012 at 8:09 pm

    Great website! Lots of great info!!

    I have one question, I am a beginner using stata, we are using it for an economics research project (running a regression). I know that panel data would probably be more difficult than just cross-sectional, but I am just wondering how much more complex it would be.

    Should I steer clear of trying to tackle a panel data oriented research project, since I am brand new to stata?

    Thanks for any input!

    • 5. gabrielrossman  |  March 16, 2012 at 8:18 pm

      panel data is very easy to specify in Stata. basically, instead of

      reg y x

      you type

      xtreg y x, i(i) re

      so the specification isn’t very hard although the interpretation can get more complicated, especially with fixed-effects (where you have to interpret all effects as relative to the case-specific mean). so i would say it’s more an issue of what you’re up to interpreting than what you’re up to coding. if you have previous experience w econometrics in other languages go ahead and jump into panel data w Stata

  • 6. Thomas Hubbard  |  May 13, 2012 at 6:19 pm

    Murphy and Topel have a paper on the econometrics of this effect, and how one can sometimes infer the “true” estimate based on what happens to the main effect and the R-squared as you add variables…

  • 7. Jon Hersh  |  May 14, 2012 at 11:24 am

    This is a great discussion, and I appreciate the thought experiment. However, I think you’re biasing the experiment a bit in favor of your conclusion. The “bad” control you specified is really bad in terms of signal relative to noise; it’s also biased as an estimator.

    If you specify an “okay” control that is unbiased, but with a large noise component of sd=0.5 you don’t find a significant control. (see code below).

    Frequentists–which we implicitly are when we run OLS regression models–expect all data to be measured with error, but expect that the error is classical. Other assumptions on the data need different models. Of course, few people pay attention to this when they run “reg y x” so that’s why this is a good discussion.

    I’ve never looked into structural equation modelling but know it’s popular with some sociologists. I’d be interested to see how these models fare against the same setup.

    set seed 462
    set obs 1500
    gen x = round(runiform())
    gen control_good = rnormal(.05,1) - x/2
    gen y = control_good + rnormal(0.5,1)
    gen control_bad = control_good + rnormal(0.5,1)
    gen control_okay = control_good + rnormal(0,.5)
    eststo clear
    eststo: reg y x
    eststo: reg y x control_bad
    eststo: reg y x control_good
    eststo: reg y x control_okay
    esttab, se b(3) se(3) nodepvars nomtitles

    [GHR: formatted the code, otherwise as submitted]

    • 8. gabrielrossman  |  May 14, 2012 at 11:45 am

      There are two differences between the error terms on “bad” and “okay,” the mean and the sd. The real issue is the sd. If you do

      gen control_okay = control_good + rnormal(0,1)

      then you get the same bias on the beta for x as with “bad”.

      Also, good point about SEM. The logic is that you average out the error across several measures of the same underlying concept so this should help things.
