Posts tagged ‘bayesian’

Control for x

| Gabriel |

An extremely common estimation strategy, which Roland Fryer calls “name that residual,” is to throw controls at an effect and then say that whatever effect remains net of the controls is the effect. Typically as you introduce controls the effect goes down, but not all the way down to zero. Here’s an example using simulated data where we do a regression of y (continuous) on x (dummy) with and without a control (continuous and negatively associated with x).

                      (1)             (2)   
x                  -0.474***       -0.257***
                  (0.073)         (0.065)   

control                             0.492***

_cons               0.577***        0.319***
                  (0.054)         (0.048)   
N                    1500            1500   

So as is typical, we see that even if you allow that x=1 tends to be associated with low values for control, you still see an x penalty. However this is a spurious finding since by assumption of the simulation there is no actual net effect of x on y, but only an effect mediated through the control.

This raises the question of what it means to have controlled for something. Typically we’re not really controlling for something perfectly, but only for a single part of a bundle of related concepts (or if you prefer, a noisy indicator of a latent variable). For instance when we say we’ve controlled for “human capital” the model specification might only really have self-reported highest degree attained. This leaves out both other aspects of formal education (eg, GPA, major, institution quality) and other forms of HC (eg, g and time preference). These related concepts will be correlated with the observed form of the control, but not perfectly. Indeed it can even work if we don’t have “omitted variable bias” but just measurement error on a single variable, as is the assumption of this simulation.

To get back to the simulation, let’s appreciate that the “control” is really the control as observed. If we could perfectly specify the control variable, the main effect might go down all the way to zero. In fact in the simulation that’s exactly what happens.

                      (1)             (2)             (3)   
x                  -0.474***       -0.257***       -0.005   
                  (0.073)         (0.065)         (0.053)   

control                             0.492***                

control_good                                        0.980***

_cons               0.577***        0.319***        0.538***
                  (0.054)         (0.048)         (0.038)   
N                    1500            1500            1500   

That is, when we specify the control with error much of the x penalty persists. However when we specify the control without error the net effect of x disappears entirely. Unfortunately in reality we don’t have the option of measuring something perfectly and so all we can do is be cautious about whether a better specification would further cram down the main effect we’re trying to measure.

Here’s the code

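clear
set obs 1500
*x is a dummy, true for about half the cases
gen x=round(runiform())
*control_good is the control measured without error; it is lower when x=1
gen control_good=rnormal(.05,1) - x/2
*y depends only on control_good, so x has no net effect on y
gen y=control_good+rnormal(0.5,1)
*control is control_good as observed, ie, with measurement error
gen control=control_good+rnormal(0.5,1)
eststo clear
eststo: reg y x
eststo: reg y x control
*first table: models 1 and 2
esttab, se b(3) se(3) nodepvars nomtitles
eststo: reg y x control_good
*second table: adds model 3
esttab, se b(3) se(3) nodepvars nomtitles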

*have a nice day
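
The size of the leftover x penalty in model 2 is roughly what classical measurement error predicts. As a sketch, assuming the parameters in the code above and using notation that is not in the original post (c for control_good, c* for control, e and u for the two unit-variance noise terms, and λ for the reliability of control net of x):

```latex
\begin{aligned}
c &= 0.05 - 0.5x + e, \qquad c^{*} = c + 0.5 + u, \qquad \operatorname{Var}(e)=\operatorname{Var}(u)=1 \\
\lambda &= \frac{\operatorname{Var}(e)}{\operatorname{Var}(e)+\operatorname{Var}(u)} = \frac{1}{1+1} = 0.5 \\
\hat\beta_{c^{*}} &\to 1\cdot\lambda = 0.5 \\
\hat\beta_{x} &\to -0.5\,(1-\lambda) = -0.25
\end{aligned}
```

These limits are close to the 0.492 and -0.257 reported in column 2. In this setup, half of the mediated effect of x stays loaded on x simply because the control is only half signal.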

March 15, 2012 at 5:29 am 8 comments

Misc Links

| Gabriel |

  • There’s a very interesting discussion at AdAge comparing buzz metrics (basically, data mining blogs and Twitter) to traditional surveys. Although the context is market research, this is an issue that potentially has a lot of relevance for basic research and so I recommend it even to people who don’t particularly care about advertising. The epistemological issue is basically the old validity versus generalizability debate. Surveys are more representative of the general consumer but they suffer from extremely low salience and so answers are so riddled with question wording effects and that sort of thing as to be almost meaningless. On the other hand buzz metrics are meaningful but not representative (what kind of person tweets about laundry bleach?). The practical issue is that buzz metrics are cheaper and faster than surveys.
  • I listened to the bhtv between Fodor and Sober and I really don’t get Fodor’s argument about natural selection. He seems to think that the co-occurrence of traits is some kind of devastating problem for biology when in fact biologists have well-articulated theories (i.e., “hitchhiking,” “spandrels,” and the “selection for vs. selection of” distinction) for understanding exactly these issues and as implied by the charge “hyper-adaptationist” there’s already an understanding within the field that these make natural selection a little more complicated than it otherwise might be. However the internal critics who raise these issues (e.g., the late Stephen Jay Gould) wouldn’t come anywhere close to claiming that these issues are an anomaly that challenges the paradigm.
  • As a related philosophy of science issue, Phil @ Gelman’s blog has some thoughts on (purposeful or inadvertent) data massaging to fit the model. He takes it as a Bayesian math issue, but I think you can agree with him on Quinean/Kuhnian philosophical grounds.
  • The essay “Why is there no Jewish Narnia?” has been much discussed lately (e.g., Douthat). The essay basically argues that this is because modern Judaism simply is not a mythic religion. The interesting thing though is that it once was, as can be seen clearly in various Babylonian cognates (eg, the parts of Genesis and Exodus from the J source and the 41st chapter of the book of Job). However, as the essay argues, the mythic aspects were driven out by the rabbinic tradition. Myself, I would go further than that and say that the disenchantment really began with P, though I agree that the rabbinate finished it off, as evidenced by the persistence of myth well through the composition of “Daniel” in the 2nd c. BCE. This reminds me of the conclusion to The Sacred Canopy, where Berger basically says disenchantment has been a long-term trend ever since animism gave way to distinct pagan gods and especially with monotheism.
  • Of course the animism -> paganism -> henotheism -> monotheism -> atheism thing isn’t cleanly monotonic as we sometimes see with pagan survivalism. The first episode of the new season of Breaking Bad cold opens with a couple of narcos praying at a shrine to La Santa Muerte. In a great NYer piece on narco culture, one of the worshippers says “Yes, it was true that the Catholic Church disapproved of her ‘Little Skinny One,’ she said. ‘But have you noticed how empty their churches are?’” Maybe Rodney Stark should write his next book on the market theory of religion using Mexican Satanism as a case study of a new market entrant that more effectively met the needs of worshippers than the incumbent Catholic church, what with its stodgy rules against murder. (This isn’t a critique of Stark. Since he’s fond of Chesterton’s aphorism that when people don’t believe in God they don’t believe in nothing, they believe in anything, I think he’d argue that the popularity of the Santa Muerte cult is the product of a lack of competition among decent religions).
  • The Red Letter feature length deconstructions of the Star Wars prequels are why we have the fair use doctrine. They make dense and creative use of irony, especially with the brilliant contrasts between the narrative and the visual collage. Probably the funniest two segments are the first segment of the Episode I critique when he talks about the poor character development and the fifth segment of the Episode II critique when he plays dating coach for Anakin.

April 8, 2010 at 5:14 am 2 comments

p(gay married couple | married couple reporting same sex)

| Gabriel |

Over at Volokh, Dale Carpenter reproduces an email from Gary Gates (who unfortunately I don’t know personally, even though we’re both faculty affiliates of CCPR). In the email, Gates disputes a Census report on gay couples that Carpenter had previously discussed, arguing that many of the “gay” couples were actually straight couples who had coding errors for gender. This struck me as pretty funny, in no small part because in grad school my advisor used to warn me that no variable is reliable, even self-reported gender. (Paul, you were right). More broadly, this points to the problems of studying small groups. (Gays and lesbians are about 3% of the population, the famous 10% figure is a myth based on Kinsey’s use of convenience/purposive sampling).

Of course the usual problem with studying minorities is how to recruit a decent sample size in such a way that still approximates a random sample drawn from the (minority) population. If you take a random sample of the population and then do a screening question (“do you consider yourself gay”) you’re facing a lot of expense and also problems of refusal if the screener involves stigma because refusal and social desirability bias will be higher on a screener than if the same question is asked later on in the interview. On the other hand if you just direct your sample recruitment to areas where your minority is concentrated you’ll save a lot of time but you will also be getting only members of the minority who experience segregation, which is unfortunate as gays who live in West Hollywood are very different from those who live in Northridge, American Indians who live on reservations are very different from those who live in Phoenix, etc. Both premature screeners involving stigma and recruitment by concentrated area are likely to lead to recruiting unrepresentative members of the group on such dimensions as salience of the group identity.

These problems are familiar nightmares to anyone who knows survey methods. However the issue described by Gates in response to Carpenter (and the underlying Census study) presents a wholly different issue: when you are dealing with a small class you can have problems even if sampling is not a problem and even if measurement error in defining the class is minimal. Really this is the familiar Bayesian problem that when you are dealing with low baseline probability events, even reasonably accurate measures can lead to false positives outnumbering true positives. The usual example given in statistics/probability textbooks is that if few people actually have a disease and you have a very accurate test for that disease, nonetheless the large majority of people who initially test positive for this disease will ultimately turn out to be healthy. Similarly, if straight marriages are much more common than gay marriages then it can still be that most so-called gay marriages are actually coding errors of straight marriages, even if the odds of a miscoded household roster for a given straight marriage are very low.
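
To make the arithmetic concrete, here is a minimal Stata sketch of the base rate problem. Both inputs are hypothetical numbers picked for illustration, not figures from the Census report or from Gates, and the scalar names are likewise made up for this example.

```stata
*both inputs below are hypothetical, chosen only for illustration
scalar p_gay=.005     /*share of married couples that really are same-sex*/
scalar miscode=.01    /*chance that a couple's roster miscodes the sex of one partner*/
*a same-sex couple is reported as same-sex unless a partner is miscoded
scalar truepos=p_gay*(1-miscode)
*an opposite-sex couple is reported as same-sex only through miscoding
scalar falsepos=(1-p_gay)*miscode
scalar posterior=truepos/(truepos+falsepos)
disp "p(gay married couple | married couple reporting same sex) = " %5.3f posterior
```

With these made-up inputs only about a third of the couples reporting the same sex are really same-sex couples, even though 99% of rosters are coded correctly.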

June 22, 2009 at 8:03 am

Publication bias

| Gabriel |

One of the things I try to stress to my grad students is all the ways that stats can go wrong. Case in point is publication bias (the tendency of scientists to abandon, and journals to reject, work that is not statistically significant). The really weird thing about publication bias is that it means that the p-value means different things depending on where you read it. When you run numbers in Stata and it tells you “p<.05” it more or less means what you think it does (i.e., probability of seeing these results if the null were true). However, when you publish that same result and I read it I should interpret the p-value more conservatively.

The reason is that when you get a null finding you tend to give up and if you are so tenacious as to submit it, the peer reviewers tend to reject it. So if you think of the literature as a sample, the null findings tend to be censored out. This wouldn’t necessarily be a problem except that the literature does not also censor false positives. We would expect there to be rather a lot of false positives since the conventional alpha means that about 1 in 20 analyses of noise data would appear significant.

Really what you want the p-value to mean is “what’s the probability that this result is a fluke?” When it first comes out of Stata it basically has that interpretation. But once it’s gone through the peer review process a much more relevant question is:


Yes, this makes my head hurt too and it’s even worse because the figure corresponding to the left-side of the equation doesn’t appear anywhere in the tables. But the take home is that the censorship process of peer review implies that p-values are too generous, even if you assume no problems with specification, measurement error, etc. (and there’s no reason not to assume those problems).
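
One hedged way to write out that question is with Bayes' rule. The notation and the numbers here are illustrative assumptions: "fluke" means there is no true effect, π0 is the share of tested hypotheses that are truly null (set to 0.8), only significant results get published, and studies of real effects have power of 0.5.

```latex
\begin{aligned}
P(\text{fluke}\mid\text{published})
 &= \frac{P(\text{published}\mid\text{fluke})\,P(\text{fluke})}
         {P(\text{published}\mid\text{fluke})\,P(\text{fluke}) + P(\text{published}\mid\text{real})\,P(\text{real})} \\
 &\approx \frac{\alpha\,\pi_0}{\alpha\,\pi_0 + \text{power}\cdot(1-\pi_0)}
  = \frac{0.05 \times 0.8}{0.05 \times 0.8 + 0.5 \times 0.2} \approx 0.29
\end{aligned}
```

Under those assumptions more than a quarter of published "p<.05" findings would be flukes, which is a long way from 1 in 20.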

Anyway, that’s the logic of publication bias and the most famous study of it in the social sciences is by Card and Krueger. They were trying to explain why, in a previous study, they found that an increase in the minimum wage increases employment of unskilled labor. This finding is pretty surprising since the law of supply and demand predicts that an exogenous increase in price (of labor) will lead to a decrease in the quantity (of labor) demanded. Likewise a decent size literature had findings consistent with the orthodox prediction. Card and Krueger therefore had to explain why their (well-designed) PA/NJ natural experiment was so anomalous. Against the theoretical argument they basically argued that various kinds of friction lead to the “assume a can opener” version of theory not bearing out. This sounds plausible enough, though I think it’s very likely that such friction is most relevant in the short-run and for fairly small changes in price so I would be very skeptical that their finding about a $1 increase in the minimum wage would generalize to, say, a $10 increase.

The more interesting thing they did was argue that the literature was censored and that there were in fact a large number of studies that either found no effect or (like their PA/NJ study) a small positive effect on employment, but these studies were never published. This sounds like an eminently unprovable theory of the sort given by stubborn paranoids, but in fact it had testable empirical implications which they demonstrated in a meta-analysis of minimum wage studies. Specifically, statistical significance is a function of the root of sample size and so weaker effects are “significant” with a larger sample. Therefore a spurious literature should have a negative correlation between n and beta but no correlation between n and t. On the other hand, a true literature should not be censored and therefore n should have no correlation with beta and a positive correlation with t. As an illustration for my grad students I wrote this simulation which shows this to be the case.

In a typical run, the simulation produced this graph:


begin code
*this Stata do file illustrates publication bias by simulating two literatures
*   in each case a binary variable is used to predict a continuous variable
*   in literature "spurious" there is no true underlying effect
*   in literature "true" there is an underlying effect of an arbitrary size defined by the parameter "e"
*   multiple studies with varying n are simulated for both literatures and only statistically significant results are "published"
*   finally, it shows the distribution of "published" results for both literatures
*dependency: "estout.ado" to install "ssc install estout, replace"
*since i'm not very good with scalars, etc, this program involves writing some temp files to disk. i suggest that before running the file you "cd" to a convenient temp directory for later elimination

global trials=2000
*each literature gets $trials initial studies (ie, potential publications) at each sample size (defined below)
*   this is an unrealistically large number but is meant to imply inference to an infinite number of studies

global e=.2
*the "true" literature's underlying model is Y=0 + $e * X
*   where X is a binary variable found half of the time
*the "spurious" literature's underlying model is Y=0
*in both cases Y is drawn from a normal

capture program drop pubbias
program define pubbias
 set more off
 capture macro drop n effect effect00
 global effect `1' /* size of the true effect. should be between 0 and 1 */
 global n      `2' /*how large should the sample size be per trial*/
 global effect00=$effect*100
 clear /*start from an empty dataset so that "set obs" works on repeated calls*/
 set obs 1
 gen v2="t"
 outsheet using e$n.txt, replace
 set obs $n
 gen predictor=0
 local halfn=($n/2) +1
 replace predictor=1 in `halfn'/$n
 gen fakep=predictor
 gen effect=rnormal()
 gen fakeeffect=rnormal()
 forvalues t=1/$trials {
  replace effect=rnormal()
  replace effect=rnormal()+$effect if predictor==1 /*note, the effect is created here for the true model */
  replace fakeeffect=rnormal()
  regress effect predictor
  estout using e$n.txt, cell(t) append
  regress fakeeffect fakep
  estout using e$n.txt, cell(t) append
  disp "iteration `t' of $trials complete"
 }
 insheet using e$n.txt, clear
 keep if v1=="predictor" | v1=="fakep"
 gen t=real(v2)
 gen published=0
 replace published=1 if t>=1.96 /*note that this is where the censorship occurs, you can manipulate alpha to show type I vs II error trade-off. likewise you can make the criteria more complicated with a weighted average of t and n or by adding a noise element */
 gen pubbias=0
 replace pubbias=1 if v1=="fakep"
 keep published t pubbias
 gen n=$n
 save e$n.dta, replace
end

pubbias $e 200

pubbias $e 400

pubbias $e 600

pubbias $e 800

pubbias $e 1000

use e200, clear
append using e400
append using e600
append using e800
append using e1000
sort n pubbias

*traditionally, meta-analysis does a scatterplot, but i use a boxplot to avoid either having dots superimposed on each other or having to add jitter

lab def pubbias 0 "true effect" 1 "spurious findings"
lab val pubbias pubbias

graph box t if published==1, over(n, gap(5) label(angle(vertical))) over(pubbias) title("Simulation of Reliable Literature Vs Publication Bias") ytitle("Range of T Statistics Across Published Literature") note("Y is drawn from a standard normal. T is for beta, which is $e in the true lit and 0 in the false lit.")

*have a nice day
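
The root-n logic behind the Card and Krueger test can also be checked with a few lines of arithmetic rather than simulation. This is a sketch under the same assumptions as the do-file above (a binary predictor split evenly and an outcome with unit variance); the local macro names are made up for the example.

```stata
*smallest beta that clears t>=1.96 at each sample size used above
*with an even split and unit variance, se(beta) = sqrt(1/(n/2) + 1/(n/2)) = 2/sqrt(n)
forvalues n=200(200)1000 {
 local se=2/sqrt(`n')
 local minbeta=1.96*`se'
 local expt=.2/`se' /*expected t if the true effect is .2, as with $e above*/
 disp "n=`n'" _col(10) "se=" %5.3f `se' _col(22) "smallest significant beta=" %5.3f `minbeta' _col(56) "expected t at beta=.2: " %4.2f `expt'
}
```

The smallest publishable beta falls from about .28 at n=200 to about .12 at n=1000, which is why a censored literature of spurious findings shows beta shrinking as n grows. The expected t for a real effect of .2 rises from about 1.4 to about 3.2, which is why a true literature shows t growing with n.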

March 17, 2009 at 3:23 pm 3 comments
