Conditioning on a Collider Between a Dummy and a Continuous Variable
| Gabriel |
In a post last year, I described a logical fallacy of sample truncation that helpful commenters explained to me is known in the literature as conditioning on a collider. As is common, I illustrated the issue with two continuous variables, where censorship is a function of the sum. (Specifically, I used the example of physical attractiveness and acting ability for a latent population of aspiring actresses and an observed population of working actresses to explain the paradox that Megan Fox was considered both “sexiest” and “worst” actress in a reader poll.)
In revising my notes for grad stats this year, I generalized the problem to cases where at least one of the variables is categorical. For instance, college admissions is a censorship process (only especially attractive applicants become matriculants), and attractiveness to admissions officers is a function of both categorical distinctions (legacy, athlete, artist or musician, underrepresented ethnic group, in-state for public schools or out-of-state for private schools, etc.) and continuous ones (mostly SAT and grades).
For simplicity, we can restrict the issue just to SAT and legacy. (See various empirical studies and counterfactual extrapolations by Espenshade and his collaborators for how it works with the various other things that determine admissions.) Within college applicant pools, the children of alumni of prestigious schools tend to score about a hundred points higher on the SAT than do other high school students. Thus the applicant pool looks something like this.
However, many prestigious colleges have policies of preferring legacy applicants. In practice this means that the child of an alum can still be admitted with an SAT score about 150 points below that of non-legacy students. Thus admission is a function of both SAT (a continuous variable) and legacy (a dummy variable). This implies the paradox that the SAT scores of legacies are about half a sigma above average for the applicant pool but about a full sigma below average in the freshman class, as seen in this graph.
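As a rough analytic check on those magnitudes, here is a sketch using the standard formulas for a normal distribution truncated from below. It assumes the parameters used in the simulation code (normally distributed SAT scores with a standard deviation of 250, means of 1100 for legacies and 1000 for non-legacies, and admission cutoffs of 1250 and 1400), and it ignores rounding and the top-coding at 1600. The symbols (c for the cutoff, λ for the inverse Mills ratio) are notation introduced here for illustration, not the post's own.

```latex
\begin{align*}
\alpha &= \frac{c-\mu}{\sigma}, \qquad
\lambda(\alpha) = \frac{\phi(\alpha)}{1-\Phi(\alpha)} \\
E[X \mid X > c] &= \mu + \sigma\,\lambda(\alpha) \\
\mathrm{SD}[X \mid X > c] &= \sigma\sqrt{1 + \alpha\,\lambda(\alpha) - \lambda(\alpha)^{2}} \\[4pt]
\text{Legacy: } & \alpha = \tfrac{1250-1100}{250} = 0.6,\quad \lambda \approx 1.215,\quad
E \approx 1404,\quad \mathrm{SD} \approx 126 \\
\text{Non-legacy: } & \alpha = \tfrac{1400-1000}{250} = 1.6,\quad \lambda \approx 2.024,\quad
E \approx 1506,\quad \mathrm{SD} \approx 94
\end{align*}
```

Under these assumptions, admitted legacies average roughly 100 points below admitted non-legacies, which is on the order of one within-group standard deviation among admits. In the applicant pool they sit 100/250 = 0.4 sigma above. Simulated values will differ somewhat because of sampling noise and because the top-coding pulls both truncated means down a little.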
Here’s the code.
clear
set obs 1000
gen legacy=0
replace legacy=1 in 1/500
lab def legacy 0 "Non-legacy" 1 "Legacy"
lab val legacy legacy
gen sat=0
replace sat=round(rnormal(1100,250)) if legacy==1
replace sat=round(rnormal(1000,250)) if legacy==0
lab var sat "SAT score"
recode sat -1000/0=0 1600/20000=1600 /*top code and bottom code*/
graph box sat, over(legacy) ylabel(0(200)1600) title(Applicants)
graph export collider_collegeapplicants.png, replace
graph export collider_collegeapplicants.eps, replace
ttest sat, by(legacy)
keep if (sat>1400 & legacy==0) | (sat>1250 & legacy==1)
graph box sat, over(legacy) ylabel(0(200)1600) title(Admits)
graph export collider_collegeadmits.png, replace
graph export collider_collegeadmits.eps, replace
ttest sat, by(legacy)
*have a nice day