## Regression to the mean

*August 6, 2009 at 5:56 am, by gabrielrossman*


Imagine having all your undergrads write practice essays. You read them all and find the five worst essays, then send these kids to the writing center, or even (martyr that you are) tutor them personally. At the end of the term you see that they were no longer the bottom five but were still in the bottom half. Conversely, imagine noticing that most of the faculty brats you know are much smarter than average kids, but not as smart as their parents.

In these cases the issue is not necessarily the efficacy of the writing center or the stupefaction of growing up in a college town, but regression to the mean. In the first case it’s adverse selection, in the second it’s advantageous selection, but the issue is the same.

Regression to the mean occurs whenever you have three conditions:

- a pre-treatment and post-treatment measure of the key variable (or something similar like two indicators loading on the same latent variable)
- assignment to the treatment is non-random with respect to the pre-treatment measure
- the key variable has moderate to low reliability

The reason is that you operationalize the effect of the treatment as (Y_{i1}+e_{i1})-(Y_{i0}+e_{i0}), whereas the actual treatment effect is Y_{i1}-Y_{i0}. Note that e_{i0} is uncorrelated with e_{i1}. Therefore, to the extent that e_{i0} was important in assigning cases to the treatment, a lot of what you think is an effect is really just that the latent values of the cases you selected for treatment weren't as extreme as you thought they were. I wrote a simulation that demonstrates all of this.
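For readers without Stata, here's a minimal NumPy sketch of that decomposition (a hypothetical translation, with variable names of my own choosing, not the script below): the true treatment effect is set to zero, yet the adversely-selected group shows an apparent "improvement" purely because e_{i0} and e_{i1} are independent draws.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
noisiness = 0.5          # scaling factor on measurement error
beta_treatment = 0.0     # true treatment effect: none at all

y_true = rng.standard_normal(n)           # latent value, stable across waves
e0 = rng.standard_normal(n) * noisiness   # wave-1 measurement error
e1 = rng.standard_normal(n) * noisiness   # wave-2 error, independent of e0

y0_obs = y_true + e0
treated = y0_obs < -1                     # adverse selection on the noisy score
y1_obs = y_true + beta_treatment * treated + e1

delta = y1_obs - y0_obs
print(delta[treated].mean())   # positive despite beta_treatment = 0
print(delta[~treated].mean())  # slightly negative: the rest of the pool is
                               # mildly advantaged-selected
```

The selected cases "improve" only because their wave-1 scores were dragged down by e0, which does not recur in wave 2.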

```stata
capture log close
log using reg2mean.log, replace

*the model assumes a true level of Y, which is susceptible to treatment, but
* which can only be measured with error
*a population of agents is generated with y_0true distributed as a standard
* normal
*y_0observed is defined as y_0true + random-normal*noisiness
* where "noisiness" is a scaling factor.
*"Adverse selection" occurs when the treatment is applied to all agents
* for whom y_0observed < -1 sigma
*In the second wave, y is measured again for all agents
* y_1true = y_0true + beta_treatment*treatment
* y_1observed = y_1true + random-normal*noisiness
* delta_observed = y_1observed - y_0observed
*If there were random assignment, delta_observed should equal zero for the
* control group and beta_treatment for the treatment group
*However because noise_0 is uncorrelated with noise_1, with adverse selection
* they can diverge.
*As such we can measure
* bias = delta_observed (for treatment group) - beta_treatment
*simulation will vary "noisiness" and "beta_treatment" to show effects on
* "bias"

global nagents=10000
*each condition gets $nagents

capture program drop reg2mean
program define reg2mean
	set more off
	capture macro drop noisiness beta_treatment
	global noisiness `1'
	* how bad is our measure of Y; should range from 0 (good measure) to
	* 1 (1 signal : 1 noise), though theoretically it could be even higher
	global beta_treatment `2'
	* how effective is the treatment; should range from -.5 (counter-productive)
	* to .5 (pretty good), where 0 means no effect
	disp "noise " float(`1') " -- efficacy " float(`2')
	clear
	quietly set obs $nagents
	gen y_0true=rnormal()
	gen y_0observed=y_0true + (rnormal()*$noisiness)
	gen treatment=0
	*this code defines recruitment
	* for adverse selection use "<-1"
	* for advantageous selection use ">1"
	quietly replace treatment=1 if y_0observed<-1
	gen y_1true=y_0true + (treatment*$beta_treatment)
	gen y_1observed=y_1true + (rnormal()*$noisiness)
	gen delta_observed=y_1observed-y_0observed
	gen bias=delta_observed - (treatment*$beta_treatment)
	collapse (mean) bias delta_observed, by(treatment)
	quietly keep if treatment==1
	drop treatment
	gen noisiness=$noisiness
	gen beta_treatment=$beta_treatment
	append using reg2mean
	quietly save reg2mean, replace
end

clear
set obs 1
gen x=.
save reg2mean.dta, replace
forvalues noi=0(.1)1 {
	forvalues beta=-.5(.1).5 {
		reg2mean `noi' `beta'
	}
}
drop x
drop if noisiness==.
lab var delta_observed "apparent efficacy of treatment"
lab var bias "measurement error of delta_obs"
lab var noisiness "measurement error of Y"
lab var beta_treatment "true efficacy of treatment"
recode beta_treatment -1.001/-.999=-1 -.001/.001=0 .999/1.001=1 1.999/2.001=2
compress
save reg2mean.dta, replace
table noisiness, c(m bias sd bias)
table beta_treatment, c(m bias sd bias)
*have a nice day
```

As you can see, bias is insensitive to the size of the true effect but grows steadily with noisiness. The practical implication is to be *very* skeptical of claims about effects where the measurement has low reliability and selectivity is built into the system.
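That pattern is easy to check outside Stata as well; this hypothetical NumPy sweep (my own helper, mirroring the simulation's logic) holds the true treatment effect at zero and varies only the noise scale:

```python
import numpy as np

def selection_bias(noisiness, n=200_000, seed=1):
    """Mean apparent 'improvement' for cases adversely selected on a noisy score."""
    rng = np.random.default_rng(seed)
    y = rng.standard_normal(n)
    e0 = rng.standard_normal(n) * noisiness   # wave-1 error
    e1 = rng.standard_normal(n) * noisiness   # wave-2 error, independent of e0
    treated = (y + e0) < -1                   # adverse selection
    # observed change for the treated, with a true effect of exactly zero
    return ((y + e1) - (y + e0))[treated].mean()

biases = [selection_bias(s) for s in (0.0, 0.25, 0.5, 1.0)]
print([round(b, 2) for b in biases])  # climbs from 0 as noisiness grows
```

With a perfectly reliable measure the bias is exactly zero; each increment of noise buys you more spurious "improvement."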

If you like, you can use some of my other code to graph the simulation as a contour plot, either crudely but natively or more elegantly with gnuplot. Here’s the code with those two commands:

```stata
crudecontour noisiness beta_treatment bias
gnuplotpm3d noisiness beta_treatment bias, title(Regression to the Mean Simulation) ///
	xlabel(noisiness) ylabel(efficacy) using(reg2mean)
```
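If gnuplot isn't handy, a similar surface can be sketched in Python; this hypothetical matplotlib version (function and file names are mine) re-simulates a coarse grid rather than reading reg2mean.dta:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, writes straight to file
import matplotlib.pyplot as plt

def bias(noisiness, beta, n=50_000, seed=2):
    rng = np.random.default_rng(seed)
    y = rng.standard_normal(n)
    e0 = rng.standard_normal(n) * noisiness
    e1 = rng.standard_normal(n) * noisiness
    treated = (y + e0) < -1
    delta = (y + beta * treated + e1) - (y + e0)
    return delta[treated].mean() - beta   # apparent minus true effect

noise_grid = np.linspace(0, 1, 11)
beta_grid = np.linspace(-0.5, 0.5, 11)
Z = np.array([[bias(s, b) for s in noise_grid] for b in beta_grid])

fig, ax = plt.subplots()
cs = ax.contourf(noise_grid, beta_grid, Z)
fig.colorbar(cs, label="bias")
ax.set_xlabel("noisiness")
ax.set_ylabel("efficacy")
ax.set_title("Regression to the Mean Simulation")
fig.savefig("reg2mean_contour.png")
```

The contours run nearly parallel to the efficacy axis, which is the point: bias is a function of noise, not of how well the treatment works.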

Entry filed under: Uncategorized. Tags: simulation, Stata.
