Weighting, omitted variable bias, and interaction effects

Gabriel

Everyone agrees that weights can be necessary for cross-tabs, but you sometimes hear the argument that they are unnecessary for regression if you just control for whatever variables were used to calculate the weights. This is true if you are concerned about different intercepts, but it does nothing if you are concerned about different slopes. Of course you could specify these different slopes as interaction effects, but:

• these interactions may not occur to you
• you’re talking about a majorly cluttered model

Here’s a simple illustration of how it works. Whites make more than blacks, and men make more than women. However, the income difference between black women and black men is much smaller than the gender gap among whites. One way to put this is that “black” and “female” have negative main effects but a positive interaction. Another way to put this is that the effect of gender on income has a different slope for whites than for blacks.
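In regression terms, this pattern can be sketched roughly as follows. The β notation is an assumption introduced here for illustration; it does not come from the original discussion.

$$
\text{income} = \beta_0 + \beta_1\,\text{black} + \beta_2\,\text{female} + \beta_3\,(\text{black} \times \text{female}) + \varepsilon,
\qquad \beta_1 < 0,\quad \beta_2 < 0,\quad \beta_3 > 0
$$

Under this sketch the gender slope is $\beta_2$ among whites and $\beta_2 + \beta_3$ among blacks, so a positive $\beta_3$ is what makes the black slope shallower.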

Nonetheless, it may be meaningful to think of a grand slope for the whole population, either by default because it hasn’t occurred to you to check for an interaction or because you think it’s more theoretically parsimonious to omit it. The grand slope should be a compromise between the (steep) slope for whites and the (shallow) slope for blacks, but since there are more whites than blacks it should be closer to the white slope. However, if you have an oversample of blacks, your unweighted estimate of the grand slope will be too close to the slope for blacks, even if you control for race. That is, if you’re worried about intercepts then controls are similar to weighting, but if you’re worried about slopes then (additive) controls don’t do you any good.
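A rough way to see the arithmetic, assuming gender is balanced (about 50% female) within each race: in that case the coefficient on female in the additive model is approximately a share-weighted average of the two within-race slopes. The symbols below are illustrative notation, not the author's.

$$
\hat\beta_{\text{female}} \approx p_W\,\beta_W + p_B\,\beta_B
$$

Here $p_W$ and $p_B$ are the white and black shares of the data actually being analyzed, and $\beta_W$ and $\beta_B$ are the within-race gender slopes. Taking, for example, $\beta_W = -1$ and $\beta_B = 0$:

$$
\begin{aligned}
\text{population (90/10):} \quad & 0.9(-1) + 0.1(0) = -0.9 \\
\text{unweighted oversample (50/50):} \quad & 0.5(-1) + 0.5(0) = -0.5
\end{aligned}
$$

Probability weights restore the 90/10 shares, which is why the weighted estimate should land back near $-0.9$ while an additive control for race leaves the unweighted estimate near $-0.5$.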

Here’s a demonstration:

*create a population that is 10% black and (orthogonally) 50% female
clear
eststo clear
set obs 10000
gen black=0
replace black=1 in 1/1000
gen female=mod(_n,2)
gen whiteman=0
replace whiteman=1 if black==0 & female==0
*create "income" where white women, black men, and black women make the same amount of money
*  and white men make more
*should be poisson, but for simplicity make it normal
*set a seed (arbitrary value) so the random draws are reproducible
set seed 12345
gen income=rnormal()
replace income=income+whiteman
*white men are +1 sigma on income relative to the other three categories (which are equal to each other)
*confirmed in this table, where most groups are at about (standardized) zero and white men at about 1
*(the table syntax changed in Stata 17; there, use: table black female, statistic(mean income))
table black female, c(m income)
eststo: regress income black female
eststo: regress income black female whiteman
*these are the true effects for the population (90% white, 10% black)

*imagine doing a survey n=1000 with a stratified random sample oversampling blacks
gen sample=0
gen stratrandomsample=runiform()
sort black stratrandomsample
replace sample=1 in 1/500 /*sample 500 whites (out of 9000 in pop)*/
gsort -black stratrandomsample
replace sample=1 in 1/500 /*sample 500 blacks (out of 1000 in pop)*/
*create pweights
gen pweight=.
replace pweight=2  if black==1 & sample==1 /*  2=1000/500 or black population: black sample */
replace pweight=18 if black==0 & sample==1 /* 18=9000/500 or white population: white sample */
eststo: regress income black female if sample==1
eststo: regress income black female whiteman if sample==1
eststo: regress income black female [pweight=pweight] if sample==1
esttab , se  mtitles("Population" "Population" "Unweighted" "Unweighted" "Weighted")
* 1. Population, main effects only (black, female)
* 2. Population, fully specified (adds whiteman)
* 3. Unweighted sample, main effects only
* 4. Unweighted sample, fully specified
* 5. Weighted sample, main effects only
* Note the similarity of #1 and #5 and of #2 and #4. Contrast with #3.

*have a nice day