Sampling on the independent variables
| Gabriel |
At Scatterplot, Jeremy notes that in a reader poll, Megan Fox was voted both “worst” and “sexiest” actress. Personally, I’ve always found Megan Fox to be less sexy than a painfully deliberate simulacra of sexy. The interesting question Jeremy asks is whether this negative association is correlation or causation. My answer is neither, it’s truncation.
What you have to understand is that the question is implicitly about famous actresses. It is quite likely that somewhere in Glendale there is some barista with a headshot by the register who is both fugly and reads lines like a robot. However this person is not famous (and probably not even Taft-Hartleyed). If there is any meritocracy at all in Hollywood, the famous are — on average — going to be desirable in at least one dimension. They may become famous because they are hot or because they are talented, but our friend at the Starbucks on Colorado is staying at the Starbucks on Colorado.
This means that when we ask about the association of acting talent and sexiness amongst the famous, we have censored data where people who are low on both dimensions are censored out. Within the truncated sample there may be a robust negative association, but the causal relationship is very indirect, and it’s not as if having perky breasts directly obstructs the ability to convincingly express emotions (a botoxed face on the other hand …).
You can see this clearly in simulation (code is at the end of the post). I’ve modeled a population of ten thousand aspiring actresses as having two dimensions, body and mind, each of which is drawn from a random normal. As built in by assumption, there is no correlation between body and mind.
Stars are a subsample of aspirants. Star power is defined as a Poisson centered on the sum of body and mind (and re-centered to avoid negative values). That is, star power is a combination of body, mind, and luck. Only the 10% of aspirants with the most star power become famous. If we now look at the correlation of body and mind among stars, it’s negative.
This is a silly example, but it reflects a serious methodological problem that I’ve seen in the literature and I propose to call “sampling on the independent variable.” You sometimes see this directly in the sample construction when a researcher takes several overlapping datasets and combines them. If the researcher then uses membership in one of the constituent datasets (or something closely associated with it) to predict membership in another of a constituent datasets (or something closely associated with it), the beta is inevitably negative. (I recently reviewed a paper that did this and treated the negative associations as substantive findings rather than methodological artifacts).
Likewise, it is very common for a researcher to rely on prepackaged composite data rather than explicitly creating original composite data. For instance, consider that favorite population of econ soc, the Fortune 500. Fortune defines this population as the top 500 firms ranked by sales. Now imagine decomposing sales by industry. Inevitably, sales in manufacturing will be negatively correlated with sales in retail. However this is an artifact of sample truncation. In the broader population the two types of sales will be positively correlated (at least among multi-dimensional firms).
clear set obs 10000 gen body=rnormal() gen mind=rnormal() *corr in the population corr body mind scatter body mind graph export bodymind_everybody.png, replace *keep only the stars gen talent=body+mind+3 recode talent -100/0=0 gen stardom=rpoisson(talent) gsort -stardom keep in 1/1000 *corr amongst stars corr body mind scatter body mind graph export bodymind_stars.png, replace