## Regression to the mean

Gabriel

Imagine having all your undergrads write practice essays. You read them all and find the five worst essays, then send these kids to the writing center, or even (martyr that you are) tutor them personally. At the end of the term you see that they are no longer the bottom five but are still in the bottom half. Conversely, imagine noticing that most of the faculty brats you know are much smarter than average kids, but not as smart as their parents.

In these cases the issue is not necessarily the efficacy of the writing center or the stupefaction of growing up in a college town, but regression to the mean. In the first case it’s adverse selection, in the second it’s advantageous selection, but the issue is the same.

Regression to the mean occurs whenever you have three conditions:

1. a pre-treatment and post-treatment measure of the key variable (or something similar like two indicators loading on the same latent variable)
2. assignment to the treatment is non-random with respect to the pre-treatment measure
3. the key variable has moderate to low reliability
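To see the three conditions at work, here is a minimal sketch in Python (not the Stata used for the simulation in this post). The treatment does nothing at all, so any change the "treated" group shows is an artifact. The function name, sample size, seed, and the 20% random-assignment rate are illustrative assumptions; the "-1" cutoff mirrors the one in the simulation's comments.

```python
import random
import statistics

def mean_change_of_treated(noise_sd, select_on_score, n=20000, seed=1):
    """Mean observed change (post - pre) among treated cases when the
    treatment has zero true effect. Hypothetical helper for illustration."""
    rng = random.Random(seed)
    changes = []
    for _ in range(n):
        true_score = rng.gauss(0, 1)                # latent value, never changes
        pre = true_score + rng.gauss(0, noise_sd)   # condition 1: pre measure
        post = true_score + rng.gauss(0, noise_sd)  # post measure, fresh error
        if select_on_score:
            treated = pre < -1                      # condition 2: picked for a low pre score
        else:
            treated = rng.random() < 0.2            # random assignment, for comparison
        if treated:
            changes.append(post - pre)
    return statistics.mean(changes)

# condition 3 is controlled by noise_sd (0 = perfectly reliable measure)
selected_noisy = mean_change_of_treated(1.0, select_on_score=True)
random_noisy = mean_change_of_treated(1.0, select_on_score=False)
selected_reliable = mean_change_of_treated(0.0, select_on_score=True)

print("selected on a noisy measure:   ", round(selected_noisy, 3))
print("random assignment, noisy:      ", round(random_noisy, 3))
print("selected on a reliable measure:", round(selected_reliable, 3))
```

Only the first case, where all three conditions hold, should show a sizable "improvement". Drop either the selection or the noise and the spurious change goes away.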

The reason is that you operationalize the effect of the treatment as $(Y_{i1}+e_{i1})-(Y_{i0}+e_{i0})$, whereas the actual treatment effect is $Y_{i1}-Y_{i0}$. But note that $e_{i0}$ is uncorrelated with $e_{i1}$. Therefore, to the extent that $e_{i0}$ was important to assigning cases to the treatment, a lot of what you think is an effect is really just that the latent values of the cases you selected for treatment weren’t as severe as you thought they were: their bad luck at the first measurement doesn’t carry over to the second. I wrote a simulation that demonstrates all of this.
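Written out, the argument looks like this. The symbols $\hat{\delta}_i$ (the observed change) and $T_i = 1$ (case $i$ was treated) are labels added here, and the derivation assumes the errors have mean zero and that the second-wave error is independent of who got selected.

```latex
\begin{aligned}
\hat{\delta}_i &= (Y_{i1} + e_{i1}) - (Y_{i0} + e_{i0})
               = (Y_{i1} - Y_{i0}) + (e_{i1} - e_{i0}) \\
E[\hat{\delta}_i \mid T_i = 1]
  &= E[Y_{i1} - Y_{i0} \mid T_i = 1] + E[e_{i1} \mid T_i = 1] - E[e_{i0} \mid T_i = 1] \\
  &= E[Y_{i1} - Y_{i0} \mid T_i = 1] - E[e_{i0} \mid T_i = 1]
\end{aligned}
```

Under adverse selection the treated cases were picked partly because $e_{i0}$ happened to be low, so $E[e_{i0} \mid T_i = 1] < 0$ and the observed change overstates the true effect. Under advantageous selection the sign flips.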

capture log close
log using reg2mean.log, replace

*the model assumes a true level of Y, which is susceptible to treatment, but
*  which can only be measured with error
*a population of agents is generated with y_0true distributed as a standard
*  normal
*y_0observed is defined as y_0true + random-normal*noisiness
*  where "noisiness" is a scaling factor.
*"Adverse selection" occurs when the treatment is applied to all agents
*  for whom y_0observed < -1 sigma
*In the second wave, y is measured again for all agents
*  y_1true = y_0true + beta_treatment*treatment
*  y_1observed = y_1true + random-normal*noisiness
*  delta_observed = y_1observed - y_0observed

*If there were random assignment, delta_observed should equal zero for the
* control group and beta_treatment for the treatment group
*However because noise_0 is uncorrelated with noise_1, with adverse selection
*  they can diverge.
*As such we can measure
*   bias=delta_observed(for treatment group) - beta_treatment

*simulation will vary "noisiness" and "beta_treatment" to show effects on
*  "bias"

global nagents=10000
*each condition gets $nagents
capture program drop reg2mean
program define reg2mean
set more off
capture macro drop noisiness beta_treatment
global noisiness `1'
* how bad is our measure of Y, should range 0 (good measure) to
* 1 (1 signal: 1 noise), though theoretically it could be even higher
global beta_treatment `2'
* how effective is the treatment. should range from -.5 (counter-productive)
* to .5 (pretty good), where 0 means no effect
disp "noise " float(`1') " -- efficacy " float(`2')
clear
quietly set obs $nagents
gen y_0true=rnormal()
gen y_0observed=y_0true + (rnormal()*$noisiness)
gen treatment=0
*this code defines recruitment
* for adverse selection use "<-1"
* for advantageous selection use ">1"
quietly replace treatment=1 if y_0observed<-1
gen y_1true=y_0true+ (treatment*$beta_treatment)
gen y_1observed=y_1true+ (rnormal()*$noisiness)
gen delta_observed=y_1observed-y_0observed
gen bias=delta_observed - (treatment*$beta_treatment)
collapse (mean) bias delta_observed, by (treatment)
quietly keep if treatment==1
drop treatment
gen noisiness=$noisiness
gen beta_treatment=$beta_treatment
append using reg2mean
quietly save reg2mean, replace
end

clear
set obs 1
gen x=.
save reg2mean.dta, replace

forvalues noi=0(.1)1 {
forvalues beta=-.5(.1).5 {
reg2mean `noi' `beta'
}
}
drop x
drop if noisiness==.
lab var delta_observed "apparent efficacy of treatment"
lab var bias "measurement error of delta_obs"
lab var noisiness "measurement error of Y"
lab var beta_treatment "true efficacy of treatment"
recode beta_treatment -1.001/-.999=-1 -.001/.001=0 .999/1.001=1 1.999/2.001=2
compress
save reg2mean.dta, replace

table noisiness, c(m bias sd bias)
table beta_treatment, c(m bias sd bias)
*have a nice day

As you can see, bias is robust to the size of the true effect but grows with noisiness: a perfectly reliable measure shows essentially no bias, and the bias climbs as the measure gets noisier. The practical implication is to be very skeptical of claims about effects where the measurement has low reliability and selectivity is built into the system.
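If you don't have Stata handy, the following Python sketch reruns a small version of the same design and compares it to a closed-form expectation. It assumes what the simulation's comments describe: a standard normal true score, normal measurement error scaled by noisiness, and treatment for everyone observed below -1. The names `simulated_bias` and `expected_bias`, the sample size, and the seeds are illustrative choices, not part of the original code.

```python
import random
from statistics import NormalDist, mean

def simulated_bias(noisiness, beta, n=20000, seed=0):
    """Mean of (observed change - true effect) among adversely selected cases."""
    rng = random.Random(seed)
    biases = []
    for _ in range(n):
        y0_true = rng.gauss(0, 1)
        y0_obs = y0_true + rng.gauss(0, 1) * noisiness
        if y0_obs < -1:                        # adverse selection on the noisy measure
            y1_true = y0_true + beta           # treatment shifts the true value
            y1_obs = y1_true + rng.gauss(0, 1) * noisiness
            biases.append((y1_obs - y0_obs) - beta)
    return mean(biases)

def expected_bias(noisiness, cutoff=-1.0):
    """Closed form for -E[e0 | y0_obs < cutoff] under the normal assumptions above."""
    total_sd = (1 + noisiness ** 2) ** 0.5     # sd of the observed first-wave score
    z = cutoff / total_sd
    std = NormalDist()
    return (noisiness ** 2 / total_sd) * std.pdf(z) / std.cdf(z)

run = 0
for noisiness in (0.0, 0.5, 1.0):
    for beta in (-0.5, 0.0, 0.5):
        run += 1                               # fresh seed so each cell is its own draw
        sim = simulated_bias(noisiness, beta, seed=run)
        print(f"noisiness={noisiness:.1f} beta={beta:+.1f} "
              f"simulated={sim:.3f} expected={expected_bias(noisiness):.3f}")
```

Within each level of noisiness the simulated bias should hover around the same value whatever beta is, since beta cancels out of the bias by construction. Under these assumptions the closed form comes to roughly 0.9 when noisiness is 1.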

If you like, you can use some of my other code to graph the simulation as a contour plot, either crudely but natively or more elegantly with gnuplot. Here’s the code with those two commands:

crudecontour noisiness beta_treatment bias
gnuplotpm3d noisiness beta_treatment bias, title(Regression to the Mean Simulation) xlabel(noisiness) ylabel(efficacy) using(reg2mean)