Sampling on the independent variables

January 4, 2010 at 4:51 am

| Gabriel |

At Scatterplot, Jeremy notes that in a reader poll, Megan Fox was voted both “worst” and “sexiest” actress. Personally, I’ve always found Megan Fox to be less sexy than a painfully deliberate simulacrum of sexy. The interesting question Jeremy asks is whether this negative association between talent and sexiness is correlation or causation. My answer is neither: it’s truncation.

What you have to understand is that the question is implicitly about famous actresses. It is quite likely that somewhere in Glendale there is some barista with a headshot by the register who both is fugly and reads lines like a robot. However, this person is not famous (and probably not even Taft-Hartleyed). If there is any meritocracy at all in Hollywood, the famous are, on average, going to be desirable in at least one dimension. They may become famous because they are hot or because they are talented, but our friend at the Starbucks on Colorado is staying at the Starbucks on Colorado.

This means that when we ask about the association of acting talent and sexiness amongst the famous, we have truncated data: people who are low on both dimensions never make it into the sample at all. Within the truncated sample there may be a robust negative association, but the causal relationship is very indirect, and it’s not as if having perky breasts directly obstructs the ability to convincingly express emotions (a botoxed face on the other hand …).

You can see this clearly in simulation (code is at the end of the post). I’ve modeled a population of ten thousand aspiring actresses as having two dimensions, body and mind, each of which is drawn independently from a standard normal distribution. As built in by assumption, there is no correlation between body and mind.

Stars are a subsample of aspirants. Star power is defined as a Poisson draw whose mean is the sum of body and mind (shifted up by 3 and floored at zero to avoid negative values). That is, star power is a combination of body, mind, and luck. Only the 10% of aspirants with the most star power become famous. If we now look at the correlation of body and mind among stars, it’s negative.

This is a silly example, but it reflects a serious methodological problem that I’ve seen in the literature and I propose to call “sampling on the independent variable.” You sometimes see this directly in the sample construction when a researcher takes several overlapping datasets and combines them. If the researcher then uses membership in one of the constituent datasets (or something closely associated with it) to predict membership in another of the constituent datasets (or something closely associated with it), the beta is inevitably negative. (I recently reviewed a paper that did this and treated the negative associations as substantive findings rather than methodological artifacts.)
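The combined-dataset version of the problem can be sketched in a few lines of Stata. This is only an illustration under made-up assumptions: the variable names memb_a and memb_b, the 30% membership rates, and the seed are all hypothetical and do not come from any real study. Membership in the two source datasets is independent by construction, yet once the sample is restricted to cases that appear in at least one source, anyone missing from the first is necessarily in the second.

```stata
* hypothetical illustration: two independent source datasets, A and B
clear
set seed 12345
set obs 10000
* membership in each source is an independent coin flip (assumed 30% each)
gen byte memb_a = runiform() < 0.3
gen byte memb_b = runiform() < 0.3
* corr in the population: close to zero by construction
corr memb_a memb_b
* the combined sample keeps only cases found in at least one source
keep if memb_a | memb_b
* corr in the combined sample: negative
corr memb_a memb_b
* the same artifact appears as a negative beta
regress memb_b memb_a
```

The negative coefficient here says nothing about how the two sources relate in the population; it is produced entirely by the rule for who gets into the combined sample.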

Likewise, it is very common for a researcher to rely on prepackaged composite data rather than explicitly creating original composite data. For instance, consider that favorite population of econ soc, the Fortune 500. Fortune defines this population as the top 500 firms ranked by sales. Now imagine decomposing sales by industry. Inevitably, sales in manufacturing will be negatively correlated with sales in retail. However, this is an artifact of sample truncation. In the broader population the two types of sales will be positively correlated (at least among multi-dimensional firms).
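A rough Stata sketch of the Fortune 500 case follows. Everything in it is an assumption made for illustration: the variable names, the shared firmsize factor that makes the two kinds of sales positively correlated in the population, the normal distributions, and the arbitrary sales units. It is not a model of real firms. It only shows that ranking on total sales and keeping the top 500 can flip the sign of the correlation between the components.

```stata
* hypothetical population of 10,000 firms
clear
set seed 2010
set obs 10000
* assumed firm-size factor shared by both lines of business
gen firmsize = rnormal(0, 0.7)
* sales by industry in arbitrary units (normality is a simplifying assumption)
gen manufacturing = 10 + firmsize + rnormal()
gen retail = 10 + firmsize + rnormal()
gen sales = manufacturing + retail
* corr in the population of firms: positive
corr manufacturing retail
* keep only the top 500 firms ranked by total sales
gsort -sales
keep in 1/500
* corr within the top 500: negative
corr manufacturing retail
```

The selection variable is the sum of the two components, so within the selected firms a shortfall in one component has to be made up by the other.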

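* simulation of body, mind, and stardom
* population of 10,000 aspiring actresses
clear
set obs 10000
* body and mind are independent standard normal draws
gen body=rnormal()
gen mind=rnormal()
* corr in the population
corr body mind
scatter body mind
graph export bodymind_everybody.png, replace
* talent is body plus mind, shifted up by 3 and floored at zero
gen talent=body+mind+3
recode talent -100/0=0
* star power is a Poisson draw with mean talent, so luck matters too
gen stardom=rpoisson(talent)
* keep only the stars: the top 10% by star power
gsort -stardom
keep in 1/1000
* corr amongst stars
corr body mind
scatter body mind
graph export bodymind_stars.png, replace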


10 Comments

  • 1. Elizabeth  |  January 4, 2010 at 11:11 am

    I propose to call “sampling on the independent variable.”

    Judea Pearl calls this general class of problems “conditioning on a collider.” Your terminology has the benefit of being transparent to people who know minimal stats, but not his apparatus. But I kind of like the image of acting and sex appeal COLLIDING and exploding into star power.

    • 2. gabrielrossman  |  January 4, 2010 at 1:00 pm

      thanks, i wish i had thought of that, i’ll definitely have to read _Causality_, mostly because it sounds interesting but in part because Pearl is at UCLA and we have several mutual friends. i agree that various aspects of talent colliding and exploding into star power is a good way to put it. the best theoretical and empirical work (ex the Columbia music lab experiments by Salganik et al) indicates that talent is a subtle, subjective, and multi-dimensional but nonetheless real thing that seeds an inherently stochastic process of cumulative advantage of stardom.

  • 3. SMorgan  |  January 4, 2010 at 6:00 pm

    Pearl’s Causality is well worth a read for discussions of these issues. But the general class of puzzles has been around a long time, often cited in discussions of Simpson’s Paradox (which is a less extreme version of the same, where it is about magnitude rather than a flip in sign).

  • 4. Michael Bishop  |  January 5, 2010 at 1:57 pm

    this would be a great classroom example

    • 5. gabrielrossman  |  January 5, 2010 at 2:36 pm

      thanks, but if and when i do this in the classroom i’ll probably tweak the subject matter slightly so as to avoid the risk of getting distracted by the issue of whether it’s PC to talk about sexiness. for instance, you could make the two dimensions math and verbal SAT or something.
      if anyone does try it as a classroom exercise let me know how it goes

  • 6. SMorgan  |  January 6, 2010 at 11:20 am

    I teach this point when I teach my book on causality (Morgan and Winship 2007, CUP). Pretty much the exact same figure is placed on our page 67 in our discussion of colliders. It is simple and students immediately get it, but I don’t know of a great hypothetical example that doesn’t feel forced. In fact, when we distributed a draft of our book to about 25 colleagues about 5 years ago, and asked for a better real-world example than what we put in our book draft, no one could think of one. The problem seems to be that there are no clear cases where the result appears, although there are lots of Simpson’s paradox examples that show the similar point. The wikipedia page on Simpson’s paradox gives the standard ones.

    • 7. gabrielrossman  |  January 6, 2010 at 1:53 pm

      no kidding, there it is.
      i’m afraid i haven’t read your book til now but it will be in my next amazon order. college admission seems like a pretty good example but you have my permission to substitute sexy actresses in the second edition.

  • […] fallacy of sample truncation that helpful commenters explained to me is known in the literature as conditioning on a collider. As is common, I illustrated the issue with two continuous variables, where censorship is a […]

  • 9. PaulWalsh  |  December 5, 2010 at 3:46 pm

    Can I use this example and code for teaching my residents? It brings up a great point in EM studies

    • 10. gabrielrossman  |  December 5, 2010 at 4:06 pm

      of course, i’d be very happy for you to get some pedagogical use out of it. click on the “simulation” tag for other similar things, the “regression to the mean” simulation might be especially relevant for any kind of clinical field. also, see these class notes where i cover this and similar issues.

