## Posts tagged ‘simulation’

### Climbing the Charts, ch 4

| Gabriel |

A few months ago Stanford’s sociology department was nice enough to invite me up to give a talk on chapter four of Climbing the Charts. This chapter argues that the opinion leadership hypothesis cannot be supported in radio and in the talk I show a simulation of why we should be skeptical of this hypothesis in general. There’s no video, but here’s an enhanced audio file with slideshow. Also a separate PDF of the slides in case you have problems with the integrated version. (A caveat, I knew I was speaking to a technically sophisticated audience so I let the jargon flow freely, the chapter itself is much easier to follow for people without a networks background).

Also in shameless plugging news, Fabio’s review at OrgTheory.

### Control for x

| Gabriel |

An extremely common estimation strategy, which Roland Fryer calls “name that residual,” is to throw controls at an effect then say whatever effect remains net of the controls is the effect. Typically as you introduce controls the effect goes down, but not all the way down to zero. Here’s an example using simulated data where we do a regression of y (continuous) on x (dummy) with and without control (continuous and negatively associated with x).

```--------------------------------------------
(1)             (2)
--------------------------------------------
x                  -0.474***       -0.257***
(0.073)         (0.065)

control                             0.492***
(0.023)

_cons               0.577***        0.319***
(0.054)         (0.048)
--------------------------------------------
N                    1500            1500
--------------------------------------------
```

So as is typical, we see that even if you allow that x=1 tends to be associated with low values for control, you still see an x penalty. However this is a spurious finding since by assumption of the simulation there is no actual net effect of x on y, but only an effect mediated through the control.

This raises the question of what it means to have controlled for something. Typically we’re not really controlling for something perfectly, but only for a single part of a bundle of related concepts (or if you prefer, a noisy indicator of a latent variable). For instance when we say we’ve controlled for “human capital” the model specification might only really have self-reported highest degree attained. This leaves out both other aspects of formal education (eg, GPA, major, institution quality) and other forms of HC (eg, g and time preference). These related concepts will be correlated with the observed form of the control, but not perfectly. Indeed it can even work if we don’t have “omitted variable bias” but just measurement error on a single variable, as is the assumption of this simulation.

To get back to the simulation, let’s appreciate that the “control” is really the control as observed. If we could perfectly specify the control variable, the main effect might go down all the way to zero. In fact in the simulation that’s exactly what happens.

```------------------------------------------------------------
(1)             (2)             (3)
------------------------------------------------------------
x                  -0.474***       -0.257***       -0.005
(0.073)         (0.065)         (0.053)

control                             0.492***
(0.023)

control_good                                        0.980***
(0.025)

_cons               0.577***        0.319***        0.538***
(0.054)         (0.048)         (0.038)
------------------------------------------------------------
N                    1500            1500            1500
------------------------------------------------------------
```

That is, when we specify the control with error much of the x penalty persists. However when we specify the control without error the net effect of x disappears entirely. Unfortunately in reality we don’t have the option of measuring something perfectly and so all we can do is be cautious about whether a better specification would further cram down the main effect we’re trying to measure.

Here’s the code

```clear
set obs 1500
gen x=round(runiform())
gen control_good=rnormal(.05,1) - x/2
gen y=control_good+rnormal(0.5,1)
gen control=control_good+rnormal(0.5,1)
eststo clear
eststo: reg y x
eststo: reg y x control
esttab, se b(3) se(3) nodepvars nomtitles
eststo: reg y x control_good
esttab, se b(3) se(3) nodepvars nomtitles

*have a nice day```

### Christmas in July

| Gabriel |

Has it been two years already? Holy moly, Stata 12 looks awesome.

The headline feature is structural equation modeling. It comes with a graphic model builder, which even an “only scripting is replicable” zealot like me can appreciate as it helps you learn complicated command syntax. (I feel the same way about graphs). I had actually been thinking of working SEM into my next paper and was thinking through the logistics of getting a copy of M+, learning the syntax (again), etc. Now I can do it within Stata. I look forward to reading more papers that use SEM without really understanding the assumptions.

Probably the most satisfying new feature to me though is contour plots. Ever since I got interested in writing simulations a few years ago, I have been wanting to make heat maps in Stata. I’ve spent many hours writing code that can pipe to gnuplot and, not being satisfied with that, I (with some help from Lisa) have spent yet more time working on another script that can pipe to the wireframe function in R’s lattice library. Now I’m very happy to say that I will not finish writing this ado-file and submitting it to SSC as Stata 12 contains what looks to be really good native heat plots.

I’m thinking the set of commands I will feel most guilty about not using more often, is margin plots, which extends the margin command from Stata 11. In addition to the headline new features there’s a bunch of little stuff, including fixes to get more compatibility between the estimation commands and the ancillary commands (e.g., better “predict” support for count models and “svy” support for “xtmixed”). Also, Windows users should be pleased to hear that they can now do PDFs natively.

[Update: Also see Jeremy’s post on Stata 12. He closes with a pretty funny metaphor of stats packages to cell phone brands.]

### Simulations, numlist, and order of operations

| Gabriel |

I’ve been programming another simulation and as is typical am batching it through various combinations of parameter values, recording the results each time. In making such heavy (and recursive) use of the forvalues loop I noticed some issues with numlist and orders of operation in algorithms.

First, Stata’s numlist expression (as in the “forvalues” syntax) introduces weird rounding errors, especially if specified as fractions. Thus it is preferable to count by integers then scale down to the fractional value within the loop. This is also useful if you want to save each run of the simulation as a file as it lets you avoid fractional filenames.

```forvalues i=0(.01)1 {
replace x=sin(`i')
save sin`i'.dta, replace
}
```

Do this:

```forvalues i=0/100 {
local i_scaled=`i'/100
replace x=sin(`i_scaled')
save sin`i'.dta, replace
}
```

Another issue with numlist is that it can introduce infintessimal errors so that evaluating “1==1” comes back false. If you have a situation like this you need to make the comparison operator fuzzy. So instead of just writing the expression “if y==x” you would use the expression

```if y>x-.0001 & y<x+.0001
```

Finally, I’ve noticed that when you are running nested loops the number of operations grows exponentially and so it makes a big difference in what order you do things. In particular, you want to arrange operations so they are repeated the least numbers of times. For instance, suppose you have batched a simulation over three parameters (x, y, and z) and saved each combination in its own dataset with the convention “results_x_y_z” and you wish to append the results in such a way that the parameter values are variables in the new appended dataset. The simple (but slow) way to run the append is like this:

```clear
gen x=.
gen y=.
gen z=.
forvalues x=1/100 {
forvalues y=1/100 {
forvalues z=1/100 {
append using results_`x'_`y'_`z'
recode x .=`x'
recode y .=`y'
recode z .=`z'
}
}
}
```

Unfortunately this is really slow. The following code has the same number of lines but it involves about half as many operations for the computer to do. In the first version there are four commands that are each run 100^3 times. The second version has two commands that run 100^3 times, one command that runs 100^2 times, and one command that runs 100 times.

```clear
gen x=.
gen y=.
gen z=.
forvalues x=1/100 {
forvalues y=1/100 {
forvalues z=1/100 {
append using results_`x'_`y'_`z'
recode z .=`z'
}
recode y .=`y'
}
recode x .=`x'
}
```

### Conditioning on a Collider Between a Dummy and a Continuous Variable

| Gabriel |

In a post last year, I described a logical fallacy of sample truncation that helpful commenters explained to me is known in the literature as conditioning on a collider. As is common, I illustrated the issue with two continuous variables, where censorship is a function of the sum. (Specifically, I used the example of physical attractiveness and acting ability for a latent population of aspiring actresses and an observed population of working actresses to explain the paradox that Megan Fox was considered both “sexiest” and “worst” actress in a reader poll).

In revising my notes for grad stats this year, I generalized the problem to cases where at least one of the variables is categorical. For instance, college admissions is a censorship process (only especially attractive applicants become matriculants) and attractiveness to admissions officers is a function of both categorical (legacy, athlete, artist or musician, underrepresented ethnic group, in-state for public schools or out-of-state for private schools, etc) and continuous distinctions (mostly SAT and grades).

For simplicity, we can restrict the issue just to SAT and legacy. (See various empirical studies and counterfactual extrapolations by Espenshade and his collaborators for how it works with the various other things that determine admissions.) Among college applicant pools, the children of alumni to prestigious schools tend to score about a hundred points higher on the SAT than do other high school students. Thus the applicant pool looks something like this.

However, many prestigious colleges have policies of preferring legacy applicants. In practice this mean that the child of an alum can still be admitted with an SAT score about 150 points below non-legacy students. Thus admission is a function of both SAT (a continuous variable) and legacy (a dummy variable). This implies the paradox that the SAT scores of legacies are about half a sigma above average for the applicant pool but about a full sigma below average in the freshman class, as seen in this graph.

Here’s the code.

```clear
set obs 1000
gen legacy=0
replace legacy=1 in 1/500
lab def legacy 0 "Non-legacy" 1 "Legacy"
lab val legacy legacy
gen sat=0
replace sat=round(rnormal(1100,250)) if legacy==1
replace sat=round(rnormal(1000,250)) if legacy==0
lab var sat "SAT score"
recode sat -1000/0=0 1600/20000=1600 /*top code and bottom code*/
graph box sat, over(legacy) ylabel(0(200)1600) title(Applicants)
graph export collider_collegeapplicants.png, replace
graph export collider_collegeapplicants.eps, replace
ttest sat, by (legacy)
keep if (sat>1400 & legacy==0) | (sat>1250 & legacy==1)
graph box sat, over(legacy) ylabel(0(200)1600) title(Admits)
ttest sat, by (legacy)
*have a nice day```

### Status, Sorting, and Meritocracy

| Gabriel |

Over at OrgTheory, Fabio asked about how much turnover we expect to see in the NRC rankings. In the comments, myself and a few other people discussed the analysis of the rankings in Burris 2004 ASR. Kieran mentioned the interpretation of the data that it could all be sorting.

To see how plausible this is I wrote a simulation with 500 grad students, each of whom has a latent amount of talent that can only be observed with some noise. The students are admitted in cohorts of 15 each to 34 PhD granting departments and are strictly sorted so the (apparently) best students go to the best schools. There they work on their dissertations, the quality of which is a function of their talent, luck, and (to represent the possibility that top departments teach you more) a parameter proportional to the inverse root of the department’s rank. There is then a job market, with one job line per PhD granting department, and again, strict sorting (without even an exception for the incest taboo). I then summarize the amount of reproduction as the proportion of top 10 jobs that are taken by grad students from the top ten schools.

So how plausible is the meritocracy explanation? It turns out it’s pretty plausible. This table shows the average closure for the top 10 jobs averaged over 100 runs each for several combinations of assumptions. Each cell shows, on average, what proportion of the top 10 jobs we expect to be taken by students from the top 10 schools if we take as assumptions the row and column parameters. The rows represent different assumptions about how noisy is our observation of talent when we read an application to grad school or a job search. The columns represent a scaling parameter for how much you learn at different ranked schools. For instance, if we assume a learning parameter of “1.5,” a student at the 4th highest-ranked school would learn 1.5/(4^0.5), or .75. It turns out that unless you assume noise to be very high (something like a unit signal:noise ratio or worse), meritocracy is pretty plausible. Furthermore, if you assume that the top schools actually educate grad students better then meritocracy looks very plausible even if there’s a lot of noise.

```P of top 10 jobs taken by students from top 10 schools
----------------------------------------
Noisiness |
of        |
s and     |
Diss /    |How Much More Do You Learn at
Job       |         Top Schools
Market    |    0    .5     1   1.5     2
----------+-----------------------------
0 |    1     1     1     1     1
.1 |    1     1     1     1     1
.2 |    1     1     1     1     1
.3 | .999     1     1     1     1
.4 | .997     1     1     1     1
.5 | .983  .995  .999     1     1
.6 | .966   .99  .991  .999  .999
.7 | .915   .96  .982  .991  .995
.8 | .867  .932  .963  .975  .986
.9 | .817  .887  .904  .957  .977
1 | .788  .853  .873  .919   .95
----------------------------------------```

Of course, keep in mind this is all in a world of frictionless planes and perfectly spherical cows. If we assume that lots of people are choosing on other margins, or that there’s not a strict dual queue of positions and occupants (e.g., because searches are focused rather than “open”), then it gets a bit looser. Furthermore, I’m still not sure that the meritocracy model has a good explanation for the fact that academic productivity figures (citation counts, etc) have only a loose correlation with ranking.

Here’s the code, knock yourself out using different metrics of reproduction, inputting different assumptions, etc.

[Update: also see Jim Moody’s much more elaborate/realistic simulation, which gives similar results].

```capture program drop socmeritocracy
program define socmeritocracy
local gre_noise=round(`1',.001) /* size of error term, relative to standard normal, for apparenttalent=f(talent) */
local diss_noise=round(`2',.001) /* size of error term, relative to standard normal, for dissquality=f(talent) */
local quality=round(`3',.001) /* scaling parameter for valueadded (by quality grad school) */
local cohortsize=round(`4',.001) /* size of annual graduate cohort (for each programs) */
local facultylines=round(`5',.001) /* number of faculty lines (for each program)*/
local batch `6'

clear
quietly set obs 500 /*create 500 BAs applying to grad school*/
quietly gen talent=rnormal() /* draw talent from normal */
quietly gen apparenttalent=talent + rnormal(0,`gre_noise') /*observe talent w error */
*grad school admissions follows strict dual queue by apparent talent and dept rank
gsort -apparenttalent
*how much more do you actually learn at prestigious schools
*how good is dissertation, as f(talent, gschool value added, noise)
*grad school admissions follows strict dual queue of diss quality and dept rank (no incest taboo/preference)
gsort -dissquality
quietly gen placement=1 + floor(([_n]-1)/`facultylines')
lab var placement "dept rank of 1st job"
quietly replace placement=. if placement>`r(max)' /*those not placed in PhD granting departments do not have research jobs (and may not even have finished PhD)*/
*recode outcomes in a few ways for convenience of presentation
quietly gen researchjob=placement
quietly recode researchjob 0/999=1 .=0
lab var researchjob "finished PhD and has research job"
quietly recode gschool_type 1/10=1 11/999=2 .=3
quietly gen job_type= placement
quietly recode job_type 1/10=1 11/999=2 .=3
quietly gen job_top10= placement
quietly recode job_top10 1/10=1 11/999=0
lab def typology 1 "top 10" 2 "lower ranked" 3 "non-research"
lab val gschool_type job_type typology
if "`batch'"=="1" {
quietly tab gschool_type job_type, matcell(xtab)
local p_reproduction=xtab[1,1]/(xtab[1,1]+xtab[2,1])
shell echo "`gre_noise' `diss_noise' `quality' `cohortsize' `facultylines' `p_reproduction'" >> socmeritocracyresults.txt
}
else {
tab gschool_type job_type, chi2
}
end

shell echo "gre_noise diss_noise quality cohortsize facultylines p_reproduction" > socmeritocracyresults.txt

forvalues gnoise=0(.1)1 {
local dnoise=`gnoise'
forvalues qualitylearning=0(.5)2 {
forvalues i=1/100 {
disp "`gnoise' `dnoise' `qualitylearning' 15 1 1 tick `i'"
socmeritocracy `gnoise' `dnoise' `qualitylearning' 15 1 1
}
}
}

insheet using socmeritocracyresults.txt, clear delim(" ")
lab var gre_noise "Noisiness of Admissions and Diss / Job Market"
lab var quality "How Much More Do You Learn at Top Schools"
table gre_noise quality, c(m p_reproduction)
```

### Regression to the mean [updated]

| Gabriel |

I updated* my old old script for simulating regression to the mean.

Regression to the mean is the phenomena that when you have a condition measured before and after a treatment, where recruitment into a treatment is conditional on the condition at time zero, you can get artifactual results. For instance, people tend to go into rehab when they hit bottom (i.e., are especially screwed up) so even if rehab were useless you’d expect some people to sober up after a stint in rehab. Likewise, the placebo effect is often understood as something like the “magic feather” in Dumbo but another component is regression to the mean, which is why you can get a placebo effect with plants. A special case of regression to the mean is the “sophomore slump” which occurs when you select cases that were high rather than low for treatment.

The code simulates the process for a population of 100,000 agents (a number chosen to be large enough that sampling error is asymptotically zero). Each agent has a latent tendency drawn from a standard normal that is measured at any given time with (specifiable) noise and is sensitive to (specifiable) treatment effects. The program takes the following arguments in order:

1. Noisiness defined as noise:signal ratio for any given observation. Can take any non-negative value but 0-1 is a reasonable range to play with. Low values indicate a reliable variable (like height) whereas high values indicate an unreliable variable (like mood). At “zero” there is no measurement error and at “one” any given observation is equal parts latent tendency and random instantaneous error.
2. True effect of the treatment -1 to +1 is a reasonable range but can take any value: positive, negative, or zero. For raw regression to the mean choose “zero.”
3. Selection of the cases for treatment. Cases are selected for treatment on the basis of  initial measured condition. The parameter defines how far out into the left tail (negative values) or right tail (positive values) the cases are selected. Negative values are “adverse selection” and positive values are “advantageous selection.” Largish absolute values (i.e., +/- 2 sigmas or higher) indicate that the treatment is applied only to a few extreme cases whereas low values indicate that the treatment is applied to a large number of moderate cases.

After a run the program has in memory the parameters it started with and two output measures. “bias1” is the classic regression to the mean effect and “bias0” is the change in non-treatment group (which is usually much smaller than bias1). The program gives text output summarizing the data for those parameters. I designed this mode for use in lab pedagogy — let students play with different parameters to see how much bias they get and try to figure out what’s responsible for it.

Alternately, you can batch it and see the big picture. Doing so shows that the true effect doesn’t matter much for the size of the regression to the mean effect (though of course they might be conflated with each other, which is the whole point). What really drives regression to the mean is primarily the noisiness (i.e., low reliability) of the condition measurement and secondarily how intensive the selection is. This is shown below in a surface graph (which is based on simulations where there is no true effect). In this graph width is noisiness, depth is where in the tail agents get recruited, and height/color is the magnitude of the regression to the mean effect.

The first thing to note is that for very reliably measurable conditions (the left side of the graph) there is no regression to the mean effect. No noise, no regression to the mean. So if you take your shortest students (as measured standing up straight with their shoes off) and have them do jumping jacks for a week to stretch them out you’ll find that they are still your shortest students after the exercise. This is true regardless of whether you impose this on the single shortest student or the shorter half of the class.

As you increase the noise (the right side of the graph) you get more regression to the mean, especially as you have more intensive selection (the front and back of the graph). So if you read your students’ midterms and send the low scorers for tutoring you’ll see improvement even if the tutoring is useless, but the effect will be bigger if you do this only for the very worst student than for the whole bottom half of the class. When you have high noise and intense selection (the front right and back right corners of the graph) you get huge regression to the mean effects, on the order of +/- 1.3 standard deviations. The really scary thing is that this is not some simulation fantasy but a realistic scenario. Lots of the outcomes we care about for policy purposes show intense day-to-day variation such that, if anything assuming that error is of equal magnitude to latent tendency is a conservative assumption. Likewise, lots of policy interventions are targeted at extreme cases (whether it be a positive “rookie of the year” or negative “hitting bottom” extreme). This is one reason to expect that programs developed with hard cases will be less effective when applied to a more representative population.

```capture log close
log using reg2mean.log, replace

*DEPENDENCY
*full do-file (but not the core reg2mean program) depends on gnuplot and gnuplotpm3d.ado
*can get similar results with surface.ado, tddens.ado, by piping to R, or even MS Excel

capture program drop reg2mean
program define reg2mean
set more off
if `1'>=0 {
local noisiness `1'
/* how bad is our measure of Y, should range 0 (perfect measure) to 1 (1 signal: 1 noise), >1 indicates noise>signal */
}
else {
disp "NOTE: Noisiness must be non-negative. Set to zero for now"
local noisiness = 0
}
local beta_treatment `2'
/* how effective is the treatment. should range from -.5 (counter-productive) to .5 (pretty good), where 0 means no effect */
local recruitment `3'
/* as measured in sigmas for adverse selection use "<-1" , for advantageous selection use ">1" -- note, the program assumes that the median is in the control */
clear
quietly set obs 100000  /*note large number is hard-coded to avoid conflating sampling error with reg2mean effects. */
gen y_0true=rnormal()
gen y_0observed=y_0true + (rnormal()*`noisiness')
gen treatment=0
*this code defines recruitment
if `recruitment'<0 {
quietly replace treatment=1 if y_0observed<`recruitment'
}
else {
quietly replace treatment=1 if y_0observed>`recruitment'
}
quietly gen y_1true=y_0true+ (treatment*`beta_treatment')
quietly gen y_1observed=y_1true+ (rnormal()*`noisiness')
quietly gen delta_observed=y_1observed-y_0observed
quietly gen bias=delta_observed - (treatment*`beta_treatment')
collapse (mean) bias , by (treatment)
quietly gen noisiness=round(`noisiness',.001)
quietly gen beta_treatment=round(`beta_treatment',.001)
quietly gen recruitment=round(`recruitment',.001)
quietly reshape wide bias, i(noisiness beta_treatment recruitment) j(treatment)
local treatmentbias = bias1 in 1
local controlbias = bias0 in 1
if `recruitment'<0 {
disp "You have simulated regression to the mean where the signal:noise ratio is " _newline "1:" float(`noisiness') ", the true effect of the treatment is " float(`2')  ", and there is adverse " _newline "selection such that the treatment is allocated if and only if the " _newline "the pre-treatment measure of the condition is below " float(`3') " standard deviations."
}
else {
disp "You have simulated regression to the mean where the signal:noise ratio is" _newline "1:" float(`noisiness') ", the true effect of the treatment is " float(`2')  ", and there is advantageous" _newline "selection such that the treatment is allocated if and only if the " _newline "pre-treatment measure of the condition is above " float(`3') " standard deviations."
}
disp "Net of the true treatment effect, the regression to the mean artifactual " _newline "effect on those exposed to the treatment is about " round(`treatmentbias',.001) ". Furthermore, " _newline "the non-treated group will experience an average change of " round(`controlbias',.001) "."
end

tempname results
tempfile resultsfile
postfile `results' bias0 bias1 noisiness beta_treatment recruitment using "`resultsfile'"

forvalues noi=0(.1)1 {
forvalues beta=-.5(.25).5 {
disp "noise    beta      recruitment"
forvalues recr=-2(.25)2 {
disp round(`noi',.01) _column(10) round(`beta',.01) _column(20) round(`recr',.01)
quietly reg2mean `noi' `beta' `recr'
local bias0 = bias0 in 1
local bias1 = bias1 in 1
post `results' (`bias0') (`bias1') (`noi') (`beta') (`recr')
}
}
}

postclose `results'
use `resultsfile', clear
foreach var in bias0 bias1 noisiness beta_treatment recruitment {
replace `var'=round(`var',.0001)
}
lab var bias0 "artifactual change - nontreatment group"
lab var bias1 "artifactual change - treatment group"
lab var noisiness "measurement error of Y"
lab var beta_treatment "true efficacy of treatment"
lab var recruitment "sigmas out in tail that treatment is recruited"
compress
save reg2mean.dta, replace

keep if beta_treatment==0
gnuplotpm3d noisiness recruitment bias1, title (Regression to the Mean with No True Effect) xlabel(Noisiness) ylabel(Who Gets Treatment) using(r2m_0)
shell open r2m_0.eps

*have a nice day```

*The changes are a larger set of agents, integration of postfile, improved handling of macros, specification of selection, interactive mode, and surface plotting (dependent on my Gnuplot pipe).