Conditioning on a Collider Between a Dummy and a Continuous Variable

November 30, 2010 at 4:40 am 5 comments

| Gabriel |

In a post last year, I described a logical fallacy of sample truncation that helpful commenters explained to me is known in the literature as conditioning on a collider. As is common, I illustrated the issue with two continuous variables, where censorship is a function of their sum. (Specifically, I used the example of physical attractiveness and acting ability for a latent population of aspiring actresses and an observed population of working actresses to explain the paradox that Megan Fox was considered both “sexiest” and “worst” actress in a reader poll.)

In revising my notes for grad stats this year, I generalized the problem to cases where at least one of the variables is categorical. For instance, college admissions is a censorship process (only especially attractive applicants become matriculants), and attractiveness to admissions officers is a function of both categorical distinctions (legacy, athlete, artist or musician, underrepresented ethnic group, in-state for public schools or out-of-state for private schools, etc.) and continuous ones (mostly SAT and grades).

For simplicity, we can restrict the issue just to SAT and legacy. (See various empirical studies and counterfactual extrapolations by Espenshade and his collaborators for how it works with the various other things that determine admissions.) Within applicant pools, the children of alumni of prestigious schools tend to score about a hundred points higher on the SAT than do other high school students. Thus the applicant pool looks something like this.

[Figure: box plot of SAT score by legacy status, titled "Applicants"]

However, many prestigious colleges have policies of preferring legacy applicants. In practice this means that the child of an alum can still be admitted with an SAT score about 150 points below what is required of non-legacy students. Thus admission is a function of both SAT (a continuous variable) and legacy (a dummy variable). This implies the paradox that the SAT scores of legacies are about half a sigma above average for the applicant pool but about a full sigma below average in the freshman class, as seen in this graph.

[Figure: box plot of SAT score by legacy status, titled "Admits"]
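To see why the ordering flips, it helps to work out the mean of a normal distribution truncated from below. This is a back-of-the-envelope sketch that assumes the numbers used in the simulation (a standard deviation of 250, applicant means of 1000 and 1100, and admission floors of 1400 and 1250) and ignores rounding and the top-coding at 1600, so the simulated figures will differ somewhat. Here $\phi$ and $\Phi$ are the standard normal density and CDF.

```latex
\begin{align*}
E[X \mid X > c] &= \mu + \sigma\,\frac{\phi(\alpha)}{1-\Phi(\alpha)},
  \qquad \alpha = \frac{c-\mu}{\sigma} \\[4pt]
\text{Non-legacy: } \alpha &= \frac{1400-1000}{250} = 1.6,
  \qquad E[X \mid X > 1400] \approx 1000 + 250(2.024) \approx 1506 \\[4pt]
\text{Legacy: } \alpha &= \frac{1250-1100}{250} = 0.6,
  \qquad E[X \mid X > 1250] \approx 1100 + 250(1.215) \approx 1404
\end{align*}
```

Under these assumptions the legacies start out 100 points ahead among applicants and end up roughly 100 points behind among admits, because the lower floor lets in a much thicker slice of their distribution.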

Here’s the code.

clear
set seed 20101130 /*arbitrary seed, added so the draws are reproducible*/
set obs 1000
*half of the simulated applicants are legacies
gen legacy=0
replace legacy=1 in 1/500
lab def legacy 0 "Non-legacy" 1 "Legacy"
lab val legacy legacy
*legacies average 100 points higher in the applicant pool
gen sat=0
replace sat=round(rnormal(1100,250)) if legacy==1
replace sat=round(rnormal(1000,250)) if legacy==0
lab var sat "SAT score"
recode sat -1000/0=0 1600/20000=1600 /*top code and bottom code*/
graph box sat, over(legacy) ylabel(0(200)1600) title(Applicants)
graph export collider_collegeapplicants.png, replace
graph export collider_collegeapplicants.eps, replace
ttest sat, by(legacy)
*admit on a strict SAT floor that is 150 points lower for legacies
keep if (sat>1400 & legacy==0) | (sat>1250 & legacy==1)
graph box sat, over(legacy) ylabel(0(200)1600) title(Admits)
graph export collider_collegeadmits.png, replace
graph export collider_collegeadmits.eps, replace
ttest sat, by(legacy)
*have a nice day
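
The strict floor in the `keep if` line is a simplification. In some contexts, including this one, it is more realistic to treat admission as a probabilistic function of the collider. Here is a minimal sketch of that variant, assuming a logistic admission curve centered on the same two floors. The 50-point scale, the seed, and the variable names `p_admit` and `admit` are illustrative choices of mine, not estimates of anything.

```stata
* Sketch only: fuzzy admission instead of a strict SAT floor.
* The logistic form and the 50-point scale are illustrative assumptions.
clear
set seed 20101130
set obs 1000
gen legacy = _n <= 500
gen sat = round(rnormal(1000 + 100*legacy, 250))
* top code and bottom code
replace sat = min(max(sat, 0), 1600)
* admission probability rises smoothly around 1400 (non-legacy) or 1250 (legacy)
gen p_admit = invlogit((sat - (1400 - 150*legacy)) / 50)
gen admit = runiform() < p_admit
* compare legacies and non-legacies among applicants, then among admits
ttest sat, by(legacy)
ttest sat if admit, by(legacy)
```

Because nothing is dropped, the applicant and admit comparisons can be run on the same data, and shrinking the scale toward zero approaches the step function used above.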



  • 1. Nick Cox  |  November 30, 2010 at 4:43 pm

    Good post, but why use a box plot here?

    A quantile plot would show the effect of truncation much more explicitly.

    This is illustrated with a quite different problem in Stata Journal 10(3): 482-495 (2010).

  • 2. Nick Cox  |  December 6, 2010 at 7:53 am

    To push the point further: Clearly 25% of the values lie in each tail of a boxplot and 50% in the central box. Even in your legacy case, which from the boxplot appears to have the simpler distribution, the lower quarter is crammed into a smaller space than any of the other three quarters. This is evident from close inspection but a quantile plot would make it even clearer.

    • 3. gabrielrossman  |  December 22, 2010 at 3:50 pm

      i was simplifying the example by choosing a strict floor. in some contexts, including this one, it’s more realistic to describe censorship as a probabilistic function of the collider rather than a step function of the collider. if you do this the truncation is then fuzzy and the boxplot isn’t losing anything.
      or were you thinking something different?

  • 4. Nick Cox  |  January 3, 2011 at 2:58 pm

    Impasse here, as I don’t understand (a) what you find unclear about my suggestions or (b) why the way the distribution was produced has implications for how it should be displayed.

    I don’t think most box plot readers interpret them correctly when any tail is compressed. That’s all.

