Did you control for this?
Gabriel
Probably because we deal with observational data (as compared to randomized experiments), sociologists tend to believe that the secret to a good model is throwing so many controls into it that the journal would need to have a centerfold to print the table. This is partly based on the idea that the goal of analysis is to strive for an r-squared of 1.0 or that it’s fun to treat control variables like a gift bag of theoretically irrelevant but substantively cute micro-findings.
The real problem, though, is a misconception about how control variables relate to hypothesis testing. To avoid spurious findings you do not need to control for everything that is plausibly related to the dependent variable. What you do need to control for is everything that is plausibly related to both the dependent variable and the independent variable. If a variable is uncorrelated with your independent variable, then you don’t need to control for it, even if it’s very good at explaining the dependent variable.
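One way to make this precise is the textbook omitted-variable-bias formula, sketched here for a single omitted control. The symbols (z for the candidate control, and the coefficients beta, gamma, and delta) are notation introduced for this illustration, and the sketch assumes a linear model whose error term is uncorrelated with both x and z.

```latex
\begin{align*}
y &= \beta x + \gamma z + \varepsilon, \qquad z = \delta x + u, \qquad \operatorname{Cov}(x,u) = 0,\\
\operatorname{plim}\,\tilde{\beta} &= \frac{\operatorname{Cov}(x,y)}{\operatorname{Var}(x)} = \beta + \gamma\,\delta,
\qquad \delta = \frac{\operatorname{Cov}(x,z)}{\operatorname{Var}(x)}.
\end{align*}
```

Here \tilde{\beta} is the slope from regressing y on x alone. The bias term \gamma\delta vanishes if either \gamma = 0 (z does not predict y) or \delta = 0 (z is uncorrelated with x), so only a variable related to both can make the bivariate estimate spurious.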
You can demonstrate this with a very simple simulation (code is below) by generating an independent variable and three controls.
- Goodcontrol is related to both X and Y
- Stupidcontrol1 is related to Y but not X
- Stupidcontrol2 is related to X but not Y
Compared to a simple bivariate regression, both goodcontrol and stupidcontrol1 improve the overall fit of the model, stupidcontrol1 especially. (Stupidcontrol2 doesn’t, so most people would drop it.) But controls aren’t about improving fit, they’re about avoiding spurious results. So what we really should care about is not how r-squared changes or the size of beta-stupidcontrol1, but how beta-x changes. Here the results do differ. Introducing stupidcontrol1 leaves beta-x essentially unchanged, but introducing goodcontrol changes beta-x appreciably. The moral of the story: it’s only worthwhile to control for variables that are correlated with both the independent variable and the dependent variable.
Note that in this simulation we know by assumption which variables are good and which are stupid. In real life we often don’t know this, and just as importantly, peer reviewers don’t know either. Whether it’s worth it to leave stupidcontrol1 and stupidcontrol2 in the model purely for “cover your ass” effect is an exercise left to the reader. However, if a reviewer asks you to include “stupidcontrol1” and it proves infeasible to collect this data, then you can always try reminding the reviewer that you agree it is probably a good predictor of the outcome, but that your results would nonetheless be robust to its inclusion because there is no reason to believe it is also correlated with your independent variable. I’m not guaranteeing that the reviewers would believe this, only that they should believe it so long as they agree with your assumption that stupidcontrol1 will have a low correlation with X.
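clear
set obs 10000
* x is the independent variable of interest
gen x=rnormal()
* goodcontrol is related to both x and y
gen goodcontrol=rnormal()/2+(x/2)
* stupidcontrol1 is related to y but not x
gen stupidcontrol1=rnormal()
* stupidcontrol2 is related to x but not y
gen stupidcontrol2=rnormal()/2+(x/2)
gen error=rnormal()
gen y=x+goodcontrol+stupidcontrol1+error
* store each specification, then print them side by side with r-squared
eststo clear
quietly eststo: reg y x
quietly eststo: reg y x goodcontrol
quietly eststo: reg y x stupidcontrol1
quietly eststo: reg y x stupidcontrol2
quietly eststo: reg y x goodcontrol stupidcontrol1 stupidcontrol2
esttab, scalars(r2) nodepvars nomtitles
*have a nice day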
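As a hedged follow-up to the simulation, the sketch below repeats the same data-generating steps, prints the correlations among the variables, and then shows beta-x from each specification side by side. The seed value and the scalar names (b_bivariate, b_good, b_stupid1, b_stupid2) are assumptions added for illustration; they are not part of the original code.

```stata
* Sketch: compare beta-x across specifications (seed and scalar names are illustrative assumptions)
clear
set seed 12345
set obs 10000
gen x = rnormal()
gen goodcontrol = rnormal()/2 + (x/2)
gen stupidcontrol1 = rnormal()
gen stupidcontrol2 = rnormal()/2 + (x/2)
gen error = rnormal()
gen y = x + goodcontrol + stupidcontrol1 + error

* Which candidate controls are correlated with x?
correlate x goodcontrol stupidcontrol1 stupidcontrol2

* Save the coefficient on x from each model
quietly reg y x
scalar b_bivariate = _b[x]
quietly reg y x goodcontrol
scalar b_good = _b[x]
quietly reg y x stupidcontrol1
scalar b_stupid1 = _b[x]
quietly reg y x stupidcontrol2
scalar b_stupid2 = _b[x]

* Print beta-x for each specification
display "bivariate:          " %6.3f b_bivariate
display "with goodcontrol:   " %6.3f b_good
display "with stupidcontrol1:" %6.3f b_stupid1
display "with stupidcontrol2:" %6.3f b_stupid2
```

Given how the variables are generated, beta-x should land near 1.5 in the bivariate model and in the models with either stupid control, and near 1.0 once goodcontrol is included, up to sampling noise. Only the control that is correlated with both x and y moves the coefficient of interest.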