## Posts tagged ‘simulation’

### Climbing the Charts, ch 4

| Gabriel |

A few months ago Stanford’s sociology department was nice enough to invite me up to give a talk on chapter four of Climbing the Charts. This chapter argues that the opinion leadership hypothesis cannot be supported in radio and in the talk I show a simulation of why we should be skeptical of this hypothesis in general. There’s no video, but here’s an enhanced audio file with slideshow. Also a separate PDF of the slides in case you have problems with the integrated version. (A caveat, I knew I was speaking to a technically sophisticated audience so I let the jargon flow freely, the chapter itself is much easier to follow for people without a networks background).

Also in shameless plugging news, Fabio’s review at OrgTheory.

### Control for x

| Gabriel |

An extremely common estimation strategy, which Roland Fryer calls “name that residual,” is to throw controls at an effect then say whatever effect remains net of the controls is the effect. Typically as you introduce controls the effect goes down, but not all the way down to zero. Here’s an example using simulated data where we do a regression of y (continuous) on x (dummy) with and without control (continuous and negatively associated with x).

```--------------------------------------------
(1)             (2)
--------------------------------------------
x                  -0.474***       -0.257***
(0.073)         (0.065)

control                             0.492***
(0.023)

_cons               0.577***        0.319***
(0.054)         (0.048)
--------------------------------------------
N                    1500            1500
--------------------------------------------
```

So as is typical, we see that even if you allow that x=1 tends to be associated with low values for control, you still see an x penalty. However this is a spurious finding since by assumption of the simulation there is no actual net effect of x on y, but only an effect mediated through the control.

This raises the question of what it means to have controlled for something. Typically we’re not really controlling for something perfectly, but only for a single part of a bundle of related concepts (or if you prefer, a noisy indicator of a latent variable). For instance when we say we’ve controlled for “human capital” the model specification might only really have self-reported highest degree attained. This leaves out both other aspects of formal education (eg, GPA, major, institution quality) and other forms of HC (eg, g and time preference). These related concepts will be correlated with the observed form of the control, but not perfectly. Indeed it can even work if we don’t have “omitted variable bias” but just measurement error on a single variable, as is the assumption of this simulation.

To get back to the simulation, let’s appreciate that the “control” is really the control as observed. If we could perfectly specify the control variable, the main effect might go down all the way to zero. In fact in the simulation that’s exactly what happens.

```------------------------------------------------------------
(1)             (2)             (3)
------------------------------------------------------------
x                  -0.474***       -0.257***       -0.005
(0.073)         (0.065)         (0.053)

control                             0.492***
(0.023)

control_good                                        0.980***
(0.025)

_cons               0.577***        0.319***        0.538***
(0.054)         (0.048)         (0.038)
------------------------------------------------------------
N                    1500            1500            1500
------------------------------------------------------------
```

That is, when we specify the control with error much of the x penalty persists. However when we specify the control without error the net effect of x disappears entirely. Unfortunately in reality we don’t have the option of measuring something perfectly and so all we can do is be cautious about whether a better specification would further cram down the main effect we’re trying to measure.

Here’s the code

```clear
set obs 1500
gen x=round(runiform())
gen control_good=rnormal(.05,1) - x/2
gen y=control_good+rnormal(0.5,1)
gen control=control_good+rnormal(0.5,1)
eststo clear
eststo: reg y x
eststo: reg y x control
esttab, se b(3) se(3) nodepvars nomtitles
eststo: reg y x control_good
esttab, se b(3) se(3) nodepvars nomtitles

*have a nice day```

### Christmas in July

| Gabriel |

Has it been two years already? Holy moly, Stata 12 looks awesome.

The headline feature is structural equation modeling. It comes with a graphic model builder, which even an “only scripting is replicable” zealot like me can appreciate as it helps you learn complicated command syntax. (I feel the same way about graphs). I had actually been thinking of working SEM into my next paper and was thinking through the logistics of getting a copy of M+, learning the syntax (again), etc. Now I can do it within Stata. I look forward to reading more papers that use SEM without really understanding the assumptions.

Probably the most satisfying new feature to me though is contour plots. Ever since I got interested in writing simulations a few years ago, I have been wanting to make heat maps in Stata. I’ve spent many hours writing code that can pipe to gnuplot and, not being satisfied with that, I (with some help from Lisa) have spent yet more time working on another script that can pipe to the wireframe function in R’s lattice library. Now I’m very happy to say that I will not finish writing this ado-file and submitting it to SSC as Stata 12 contains what looks to be really good native heat plots.

I’m thinking the set of commands I will feel most guilty about not using more often, is margin plots, which extends the margin command from Stata 11. In addition to the headline new features there’s a bunch of little stuff, including fixes to get more compatibility between the estimation commands and the ancillary commands (e.g., better “predict” support for count models and “svy” support for “xtmixed”). Also, Windows users should be pleased to hear that they can now do PDFs natively.

[Update: Also see Jeremy’s post on Stata 12. He closes with a pretty funny metaphor of stats packages to cell phone brands.]

### Simulations, numlist, and order of operations

| Gabriel |

I’ve been programming another simulation and as is typical am batching it through various combinations of parameter values, recording the results each time. In making such heavy (and recursive) use of the forvalues loop I noticed some issues with numlist and orders of operation in algorithms.

First, Stata’s numlist expression (as in the “forvalues” syntax) introduces weird rounding errors, especially if specified as fractions. Thus it is preferable to count by integers then scale down to the fractional value within the loop. This is also useful if you want to save each run of the simulation as a file as it lets you avoid fractional filenames.

```forvalues i=0(.01)1 {
replace x=sin(`i')
save sin`i'.dta, replace
}
```

Do this:

```forvalues i=0/100 {
local i_scaled=`i'/100
replace x=sin(`i_scaled')
save sin`i'.dta, replace
}
```

Another issue with numlist is that it can introduce infintessimal errors so that evaluating “1==1” comes back false. If you have a situation like this you need to make the comparison operator fuzzy. So instead of just writing the expression “if y==x” you would use the expression

```if y>x-.0001 & y<x+.0001
```

Finally, I’ve noticed that when you are running nested loops the number of operations grows exponentially and so it makes a big difference in what order you do things. In particular, you want to arrange operations so they are repeated the least numbers of times. For instance, suppose you have batched a simulation over three parameters (x, y, and z) and saved each combination in its own dataset with the convention “results_x_y_z” and you wish to append the results in such a way that the parameter values are variables in the new appended dataset. The simple (but slow) way to run the append is like this:

```clear
gen x=.
gen y=.
gen z=.
forvalues x=1/100 {
forvalues y=1/100 {
forvalues z=1/100 {
append using results_`x'_`y'_`z'
recode x .=`x'
recode y .=`y'
recode z .=`z'
}
}
}
```

Unfortunately this is really slow. The following code has the same number of lines but it involves about half as many operations for the computer to do. In the first version there are four commands that are each run 100^3 times. The second version has two commands that run 100^3 times, one command that runs 100^2 times, and one command that runs 100 times.

```clear
gen x=.
gen y=.
gen z=.
forvalues x=1/100 {
forvalues y=1/100 {
forvalues z=1/100 {
append using results_`x'_`y'_`z'
recode z .=`z'
}
recode y .=`y'
}
recode x .=`x'
}
```

### Conditioning on a Collider Between a Dummy and a Continuous Variable

| Gabriel |

In a post last year, I described a logical fallacy of sample truncation that helpful commenters explained to me is known in the literature as conditioning on a collider. As is common, I illustrated the issue with two continuous variables, where censorship is a function of the sum. (Specifically, I used the example of physical attractiveness and acting ability for a latent population of aspiring actresses and an observed population of working actresses to explain the paradox that Megan Fox was considered both “sexiest” and “worst” actress in a reader poll).

In revising my notes for grad stats this year, I generalized the problem to cases where at least one of the variables is categorical. For instance, college admissions is a censorship process (only especially attractive applicants become matriculants) and attractiveness to admissions officers is a function of both categorical (legacy, athlete, artist or musician, underrepresented ethnic group, in-state for public schools or out-of-state for private schools, etc) and continuous distinctions (mostly SAT and grades).

For simplicity, we can restrict the issue just to SAT and legacy. (See various empirical studies and counterfactual extrapolations by Espenshade and his collaborators for how it works with the various other things that determine admissions.) Among college applicant pools, the children of alumni to prestigious schools tend to score about a hundred points higher on the SAT than do other high school students. Thus the applicant pool looks something like this.

However, many prestigious colleges have policies of preferring legacy applicants. In practice this mean that the child of an alum can still be admitted with an SAT score about 150 points below non-legacy students. Thus admission is a function of both SAT (a continuous variable) and legacy (a dummy variable). This implies the paradox that the SAT scores of legacies are about half a sigma above average for the applicant pool but about a full sigma below average in the freshman class, as seen in this graph.

Here’s the code.

```clear
set obs 1000
gen legacy=0
replace legacy=1 in 1/500
lab def legacy 0 "Non-legacy" 1 "Legacy"
lab val legacy legacy
gen sat=0
replace sat=round(rnormal(1100,250)) if legacy==1
replace sat=round(rnormal(1000,250)) if legacy==0
lab var sat "SAT score"
recode sat -1000/0=0 1600/20000=1600 /*top code and bottom code*/
graph box sat, over(legacy) ylabel(0(200)1600) title(Applicants)
graph export collider_collegeapplicants.png, replace
graph export collider_collegeapplicants.eps, replace
ttest sat, by (legacy)
keep if (sat>1400 & legacy==0) | (sat>1250 & legacy==1)
graph box sat, over(legacy) ylabel(0(200)1600) title(Admits)
ttest sat, by (legacy)
*have a nice day```

### Status, Sorting, and Meritocracy

| Gabriel |

Over at OrgTheory, Fabio asked about how much turnover we expect to see in the NRC rankings. In the comments, myself and a few other people discussed the analysis of the rankings in Burris 2004 ASR. Kieran mentioned the interpretation of the data that it could all be sorting.

To see how plausible this is I wrote a simulation with 500 grad students, each of whom has a latent amount of talent that can only be observed with some noise. The students are admitted in cohorts of 15 each to 34 PhD granting departments and are strictly sorted so the (apparently) best students go to the best schools. There they work on their dissertations, the quality of which is a function of their talent, luck, and (to represent the possibility that top departments teach you more) a parameter proportional to the inverse root of the department’s rank. There is then a job market, with one job line per PhD granting department, and again, strict sorting (without even an exception for the incest taboo). I then summarize the amount of reproduction as the proportion of top 10 jobs that are taken by grad students from the top ten schools.

So how plausible is the meritocracy explanation? It turns out it’s pretty plausible. This table shows the average closure for the top 10 jobs averaged over 100 runs each for several combinations of assumptions. Each cell shows, on average, what proportion of the top 10 jobs we expect to be taken by students from the top 10 schools if we take as assumptions the row and column parameters. The rows represent different assumptions about how noisy is our observation of talent when we read an application to grad school or a job search. The columns represent a scaling parameter for how much you learn at different ranked schools. For instance, if we assume a learning parameter of “1.5,” a student at the 4th highest-ranked school would learn 1.5/(4^0.5), or .75. It turns out that unless you assume noise to be very high (something like a unit signal:noise ratio or worse), meritocracy is pretty plausible. Furthermore, if you assume that the top schools actually educate grad students better then meritocracy looks very plausible even if there’s a lot of noise.

```P of top 10 jobs taken by students from top 10 schools
----------------------------------------
Noisiness |
of        |
s and     |
Diss /    |How Much More Do You Learn at
Job       |         Top Schools
Market    |    0    .5     1   1.5     2
----------+-----------------------------
0 |    1     1     1     1     1
.1 |    1     1     1     1     1
.2 |    1     1     1     1     1
.3 | .999     1     1     1     1
.4 | .997     1     1     1     1
.5 | .983  .995  .999     1     1
.6 | .966   .99  .991  .999  .999
.7 | .915   .96  .982  .991  .995
.8 | .867  .932  .963  .975  .986
.9 | .817  .887  .904  .957  .977
1 | .788  .853  .873  .919   .95
----------------------------------------```

Of course, keep in mind this is all in a world of frictionless planes and perfectly spherical cows. If we assume that lots of people are choosing on other margins, or that there’s not a strict dual queue of positions and occupants (e.g., because searches are focused rather than “open”), then it gets a bit looser. Furthermore, I’m still not sure that the meritocracy model has a good explanation for the fact that academic productivity figures (citation counts, etc) have only a loose correlation with ranking.

Here’s the code, knock yourself out using different metrics of reproduction, inputting different assumptions, etc.

[Update: also see Jim Moody’s much more elaborate/realistic simulation, which gives similar results].

```capture program drop socmeritocracy
program define socmeritocracy
local gre_noise=round(`1',.001) /* size of error term, relative to standard normal, for apparenttalent=f(talent) */
local diss_noise=round(`2',.001) /* size of error term, relative to standard normal, for dissquality=f(talent) */
local quality=round(`3',.001) /* scaling parameter for valueadded (by quality grad school) */
local cohortsize=round(`4',.001) /* size of annual graduate cohort (for each programs) */
local facultylines=round(`5',.001) /* number of faculty lines (for each program)*/
local batch `6'

clear
quietly set obs 500 /*create 500 BAs applying to grad school*/
quietly gen talent=rnormal() /* draw talent from normal */
quietly gen apparenttalent=talent + rnormal(0,`gre_noise') /*observe talent w error */
*grad school admissions follows strict dual queue by apparent talent and dept rank
gsort -apparenttalent
*how much more do you actually learn at prestigious schools
*how good is dissertation, as f(talent, gschool value added, noise)
*grad school admissions follows strict dual queue of diss quality and dept rank (no incest taboo/preference)
gsort -dissquality
quietly gen placement=1 + floor(([_n]-1)/`facultylines')
lab var placement "dept rank of 1st job"
quietly replace placement=. if placement>`r(max)' /*those not placed in PhD granting departments do not have research jobs (and may not even have finished PhD)*/
*recode outcomes in a few ways for convenience of presentation
quietly gen researchjob=placement
quietly recode researchjob 0/999=1 .=0
lab var researchjob "finished PhD and has research job"
quietly recode gschool_type 1/10=1 11/999=2 .=3
quietly gen job_type= placement
quietly recode job_type 1/10=1 11/999=2 .=3
quietly gen job_top10= placement
quietly recode job_top10 1/10=1 11/999=0
lab def typology 1 "top 10" 2 "lower ranked" 3 "non-research"
lab val gschool_type job_type typology
if "`batch'"=="1" {
quietly tab gschool_type job_type, matcell(xtab)
local p_reproduction=xtab[1,1]/(xtab[1,1]+xtab[2,1])
shell echo "`gre_noise' `diss_noise' `quality' `cohortsize' `facultylines' `p_reproduction'" >> socmeritocracyresults.txt
}
else {
tab gschool_type job_type, chi2
}
end

shell echo "gre_noise diss_noise quality cohortsize facultylines p_reproduction" > socmeritocracyresults.txt

forvalues gnoise=0(.1)1 {
local dnoise=`gnoise'
forvalues qualitylearning=0(.5)2 {
forvalues i=1/100 {
disp "`gnoise' `dnoise' `qualitylearning' 15 1 1 tick `i'"
socmeritocracy `gnoise' `dnoise' `qualitylearning' 15 1 1
}
}
}

insheet using socmeritocracyresults.txt, clear delim(" ")
lab var gre_noise "Noisiness of Admissions and Diss / Job Market"
lab var quality "How Much More Do You Learn at Top Schools"
table gre_noise quality, c(m p_reproduction)
```