Over at Volokh, Dale Carpenter reproduces an email from Gary Gates (whom I unfortunately don’t know personally, even though we’re both faculty affiliates of CCPR). In the email, Gates disputes a Census report on gay couples that Carpenter had previously discussed, arguing that many of the “gay” couples were actually straight couples with coding errors for gender. This struck me as pretty funny, in no small part because in grad school my advisor used to warn me that no variable is reliable, even self-reported gender. (Paul, you were right.) More broadly, this points to the problems of studying small groups. (Gays and lesbians are about 3% of the population; the famous 10% figure is a myth based on Kinsey’s use of convenience/purposive sampling.)

Of course, the usual problem with studying minorities is how to recruit a decent sample size in a way that still approximates a random sample drawn from the (minority) population. If you take a random sample of the general population and then ask a screening question (“do you consider yourself gay”), you face a lot of expense. If the screener involves stigma you also face problems of refusal, because refusal and social desirability bias will be higher on a screener than if the same question is asked later in the interview. On the other hand, if you just direct your sample recruitment to areas where your minority is concentrated, you’ll save a lot of time, but you will also get only members of the minority who experience segregation. That is unfortunate, as gays who live in West Hollywood are very different from those who live in Northridge, American Indians who live on reservations are very different from those who live in Phoenix, etc. Both premature screeners involving stigma and recruitment by concentrated area are likely to recruit members of the group who are unrepresentative on such dimensions as the salience of the group identity.

These problems are familiar nightmares to anyone who knows survey methods. However, the issue Gates describes in response to Carpenter (and the underlying Census study) presents a wholly new issue: when you are dealing with a small class, you can have problems even if sampling is not a problem and even if measurement error in defining the class is minimal. Really this is the familiar Bayesian problem that when you are dealing with low baseline probability events, even reasonably accurate measures can lead to false positives outnumbering true positives. The usual example given in statistics/probability textbooks is that if few enough people actually have a disease, then even with a very accurate test the large majority of people who initially test positive can ultimately turn out to be healthy. Similarly, if straight marriages are much more common than gay marriages, then it can still be that most so-called gay marriages are actually coding errors of straight marriages, even if the odds of a miscoded household roster for any given straight marriage are very low.
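To make the arithmetic concrete, here is a minimal sketch in Python. The numbers are made up purely for illustration and are not taken from the Census report or from Gates's email: I assume that 1% of couples are truly same-sex, that a true same-sex couple is recorded correctly 99% of the time, and that 2% of straight couples get a gender miscoded somewhere on the roster. The function name `positive_predictive_value` and its parameters are my own labels, not anything from the source.

```python
def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    """Share of recorded positives that are true positives (Bayes' rule).

    prevalence          -- true share of the rare class in the population
    sensitivity         -- P(recorded as rare class | truly rare class)
    false_positive_rate -- P(recorded as rare class | truly common class)
    """
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)


# Hypothetical, illustrative inputs (assumptions, not real Census figures):
share_same_sex = 0.01      # assumed true share of same-sex couples
correctly_recorded = 0.99  # assumed chance a same-sex couple is coded right
miscoded_straight = 0.02   # assumed chance a straight couple is miscoded

ppv = positive_predictive_value(
    share_same_sex, correctly_recorded, miscoded_straight
)
print(f"Share of recorded same-sex couples that are real: {ppv:.1%}")
```

Under these assumed inputs only about a third of the couples recorded as same-sex really are, even though 98% of straight couples are coded correctly. Shrinking the error rate helps, but so long as it is of the same order as the true prevalence, false positives remain a large share of the recorded class.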

