Did you control for this?

April 27, 2009 at 5:29 am 9 comments

| Gabriel |

Probably because we deal with observational data (as compared to randomized experiments), sociologists tend to believe that the secret to a good model is throwing so many controls into it that the journal would need to have a centerfold to print the table. This is partly based on the idea that the goal of analysis is to strive for an r-squared of 1.0 or that it’s fun to treat control variables like a gift bag of theoretically irrelevant but substantively cute micro-findings.

The real problem, though, is based on a misconception of how control variables relate to hypothesis testing. To avoid spurious findings you do not need to control for everything that is plausibly related to the dependent variable. What you do need to control for is everything that is plausibly related to both the dependent variable and the independent variable. If a variable is uncorrelated with your independent variable then you don’t need to control for it even if it’s very good at explaining the dependent variable.

You can demonstrate this with a very simple simulation (code is below) by generating an independent variable and three controls.

  • Goodcontrol is related to both X and Y
  • Stupidcontrol1 is related to Y but not X
  • Stupidcontrol2 is related to X but not Y

Compared to a simple bivariate regression, both goodcontrol and stupidcontrol1 greatly improve the overall fit of the model. (Stupidcontrol2 doesn’t, so most people would drop it.) But controls aren’t about improving fit, they’re about avoiding spurious results. So what we really should care about is not how r-squared changes or the size of beta-stupidcontrol1, but how beta-x changes. Here the results do differ. Introducing stupidcontrol1 leaves beta-x essentially unchanged but introducing goodcontrol changes beta-x appreciably. The moral of the story: it’s only worthwhile to control for variables that are correlated with both the independent variable and the dependent variable.

Note that in this simulation we know by assumption which variables are good and which are stupid. In real life we often don’t know this, and just as importantly peer reviewers don’t know either. Whether it’s worth it to leave stupidcontrol1 and stupidcontrol2 in the model purely for “cover your ass” effect is an exercise left to the reader. However, if a reviewer asks you to include “stupidcontrol1” and it proves unfeasible to collect this data, then you can always try telling the reviewer that you agree it is probably a good predictor of the outcome, but that your results would nonetheless be robust to its inclusion because there is no reason to believe that it is also correlated with your independent variable. I’m not guaranteeing that the reviewers would believe this, only that they should believe this so long as they agree with your assumption that stupidcontrol1 will have a low correlation with X.

*requires the estout package for eststo/esttab (ssc install estout)
clear
set obs 10000
gen x=rnormal()
gen goodcontrol=rnormal()/2+(x/2)
gen stupidcontrol1=rnormal()
gen stupidcontrol2=rnormal()/2+(x/2)
gen error=rnormal()
gen y=x+goodcontrol+stupidcontrol1+error
eststo clear
quietly eststo: reg y x
quietly eststo: reg y x goodcontrol
quietly eststo: reg y x stupidcontrol1
quietly eststo: reg y x stupidcontrol2
quietly eststo: reg y x goodcontrol stupidcontrol1 stupidcontrol2
esttab, scalars (r2) nodepvars nomtitles
*have a nice day
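
The shift in beta-x from adding goodcontrol can also be checked against the standard omitted variable bias identity: the beta-x from the short regression (goodcontrol omitted) equals the beta-x from the long regression plus beta-goodcontrol times the slope from regressing goodcontrol on x. What follows is only a minimal sketch that reuses the same data-generating lines. The seed value and the scalar names (b_long, b_good, delta) are arbitrary choices made for illustration.

```stata
clear
*arbitrary seed, only so that reruns give the same numbers
set seed 12345
set obs 10000
gen x=rnormal()
gen goodcontrol=rnormal()/2+(x/2)
gen stupidcontrol1=rnormal()
gen error=rnormal()
gen y=x+goodcontrol+stupidcontrol1+error
*long regression: beta-x and beta-goodcontrol with the control included
quietly reg y x goodcontrol
scalar b_long=_b[x]
scalar b_good=_b[goodcontrol]
*auxiliary regression: how strongly goodcontrol tracks x
quietly reg goodcontrol x
scalar delta=_b[x]
*short regression: goodcontrol omitted
quietly reg y x
display "short beta_x = " _b[x]
display "long beta_x + beta_good*delta = " b_long+b_good*delta
*have a nice day
```

The two displayed numbers should agree, and since goodcontrol is built as half of x plus noise, delta should come out near 0.5. Swapping stupidcontrol1 in for goodcontrol should give a delta near zero, which is why omitting it leaves beta-x essentially alone.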


9 Comments

  • 1. Mike3550  |  April 27, 2009 at 12:01 pm

    I don’t think that I have ever had a program tell me to have a nice day before. I like it =)

    BTW, the new design/layout is great.

  • 2. gabrielrossman  |  April 27, 2009 at 12:11 pm

    thanks, i’m glad you like the new layout. i wasn’t sure if the new header is too busy so i’m glad to have confirmation on it.

    the thing about “have a nice day” is actually functional. Stata ignores the last line in a do-file so I always put “*have a nice day” at the end of my code to ensure that the last line isn’t something important.

  • 3. Trey  |  April 27, 2009 at 12:25 pm

    Great point. So what happens when your IV of interest is highly correlated with other variables that reviewers would like to see controlled for? Does one take the true empirical argument and say that they are “interchangeable” since they load highly on the same factor?

  • 4. gabrielrossman  |  April 27, 2009 at 12:54 pm

    trey,
    that’s a very good question and i’m actually facing something similar with a peer reviewer right now who suggested a variable that has a lot of empirical power but i think doesn’t belong in the model for conceptual reasons. i don’t want to say what the variables are because it’s still in peer review, but it’s roughly the equivalent of if i were using single-parenthood and bad neighborhood to predict the odds that a juvenile will get arrested and the reviewer suggests controlling for bad grades and drug abuse.

    snarkiness aside, i’m always in favor of first trying things empirically and seeing if it works because that ends the argument much faster than debating on theory. in the situation you’re describing, specifying both variables will probably produce one of three things:
    * your effects get smaller because it’s splitting the statistical power with another variable
    * one of the betas goes to +infinity and the other goes to -infinity because of collinearity
    * the model drops one of the variables

  • 5. Trey  |  April 27, 2009 at 9:33 pm

    The third is actually what happens, but this can be manipulated if the two variables are entered step-wise. And the order of entry determines which variable gets dropped. Good times!

  • 6. Misc links « Code and Culture  |  January 19, 2010 at 5:22 am

    […] In oral arguments at the SCOTUS, a lawyer used the word “orthogonal.” Roberts and Scalia were fascinated by the word and seemed to want to make it the secret word of the day. I myself am fond of the word as it’s a pretty clear way to get across concepts like “lack of interaction effects,” which is a more subtle concept than merely “uncorrelated.” For instance, see my discussion of weighting and when it’s ok to omit controls. […]

  • 7. Jason Kerwin  |  January 22, 2013 at 6:36 pm

    Gabriel –

    I was just linked here by Dan Hirschman, and wanted to point out that you omitted an additional specification that demonstrates why stupidcontrol1 may in fact be something you want to control for. That is controlling for both goodcontrol and stupidcontrol1 in addition to the regressor of interest, but not stupidcontrol2. This reduces the residual variance and gives a lower standard error on the point estimate for Betahat_X.

    Since stupidcontrol1 is uncorrelated with x in expectation, in large samples, this will give you tighter confidence intervals at no cost in terms of biased estimates. An alternate way of thinking about this is that stupidcontrol1 is simply measurement error in y. Measurement error on the left-hand side doesn’t bias our estimates – your second specification already gives an unbiased estimate of Betahat_x – but it does decrease their precision. In a world where we don’t know the true point estimate, better precision is important. It’s also nice for reaching the arbitrary but important criterion of triple-star significance.

    Below is your original code with extra lines showing this other specification:

    clear
    set obs 10000
    gen x=rnormal()
    gen goodcontrol=rnormal()/2+(x/2)
    gen stupidcontrol1=rnormal()
    gen stupidcontrol2=rnormal()/2+(x/2)
    gen error=rnormal()
    gen y=x+goodcontrol+stupidcontrol1+error
    eststo clear
    quietly eststo: reg y x
    quietly eststo: reg y x goodcontrol
    quietly eststo: reg y x stupidcontrol1
    quietly eststo: reg y x stupidcontrol2
    quietly eststo: reg y x goodcontrol stupidcontrol1 stupidcontrol2
    quietly eststo: reg y x goodcontrol stupidcontrol1
    esttab, scalars (r2) nodepvars nomtitles
    *show the standard errors instead of t-statistics for better clarity
    *compare columns (2) and (6)
    esttab, scalars (r2) nodepvars nomtitles se
    *have a nice day

    – Jason

    PS: I have been driven mad by the ignore-the-last-line thing that Stata does, and you’re the only other person I’ve seen acknowledge it. Why does it do that?

    • 8. gabrielrossman  |  January 25, 2013 at 12:15 pm

      Interesting. I follow your logic but when I tried it the standard error for beta_x was about the same in model 2 (y=x+good) as in model 5 (y=x+good+stupid1+stupid2). I ran it several times and each time the se on beta_x went down about 10-20%. So I think you’re right about direction but it’s fairly trivial in magnitude. I think that’s basically consistent with my original verdict that stupidcontrol1 is innocuous but gratuitous and probably doesn’t justify an extensive data collection effort on the merits (although it may be justified for advantage w reviewers since it’s always more rhetorically effective to demonstrate than to argue).

      As for the last line thing, you’d have to ask Statacorp but my hunch is Stata needs some way to delimit the end of a command and it relies on an EOL character to do it.

  • 9. Jason Kerwin  |  January 25, 2013 at 5:11 pm

    I agree that in this case it’s a judgment call: running it without re-seeding the random number generator, I get a decrease in the standard error of about 20%, which translates to a tightening of the 95% CI from [.96,1.04] to [.97,1.03]. To me that looks like a relatively small improvement. Whether it’s worth controlling for stupidcontrol1 depends on whether you already have data on it, and on the cost of collecting that data.

    But it’s feasible for stupidcontrol1 to matter a whole lot for the precision of estimates of Betahat_X, and, in small samples, for the accuracy as well. What matters is the contribution of stupidcontrol1 to the value of y: if it’s big, you have a really bad signal-to-noise ratio unless you control for it. For example, the code below cranks up the contribution of stupidcontrol1 by ten times. Somewhat intuitively, controlling for it reduces the standard error on Betahat_X by roughly an order of magnitude.

    clear all
    set seed 998332483
    set obs 10000
    gen x=rnormal()
    gen goodcontrol=rnormal()/2+(x/2)
    gen stupidcontrol1=rnormal()
    gen stupidcontrol2=rnormal()/2+(x/2)
    gen error=rnormal()
    gen y=x+goodcontrol+10*stupidcontrol1+error

    quietly eststo: reg y x
    quietly eststo: reg y x goodcontrol
    quietly eststo: reg y x goodcontrol stupidcontrol1
    esttab, scalars (r2) nodepvars nomtitles
    *show the standard errors instead of t-statistics for better clarity
    *compare columns (2) and (3)
    esttab, scalars (r2) nodepvars nomtitles se

    If you run the same code but set the seed to 12345, you actually get a point estimate for Betahat_X of 1.3, which is way off. This is reasonable: since the standard error is so large, sometimes your estimate will be off by a meaningful degree.

    In reality, of course, it’s more likely that the contribution of any given control that is uncorrelated with X will be trivial, and your reasoning definitely goes through. If your intuition says that a given prospective control is per se very important in determining your dependent variable, you should try to control for it in your regressions. However, if a reviewer argues that you need to control for variable Z that is in principle uncorrelated with X, it is on them to demonstrate that Z is of overwhelming importance in determining the value of Y.

