## Posts tagged ‘random variables’

### Regression to the mean [updated]

| Gabriel |

I updated* my old script for simulating regression to the mean.

Regression to the mean is the phenomenon whereby, when you have a condition measured before and after a treatment and recruitment into the treatment is conditional on the condition at time zero, you can get artifactual results. For instance, people tend to go into rehab when they hit bottom (i.e., are especially screwed up) so even if rehab were useless you’d expect some people to sober up after a stint in rehab. Likewise, the placebo effect is often understood as something like the “magic feather” in *Dumbo* but another component is regression to the mean, which is why you can get a placebo effect with plants. A special case of regression to the mean is the “sophomore slump” which occurs when you select cases that were *high* rather than low for treatment.

The code simulates the process for a population of 100,000 agents (a number chosen to be large enough that sampling error is asymptotically zero). Each agent has a latent tendency drawn from a standard normal that is measured at any given time with (specifiable) noise and is sensitive to (specifiable) treatment effects. The program takes the following arguments in order:

- **Noisiness**, defined as the noise:signal ratio for any given observation. Can take any non-negative value but 0–1 is a reasonable range to play with. Low values indicate a reliable variable (like height) whereas high values indicate an unreliable variable (like mood). At “zero” there is no measurement error and at “one” any given observation is equal parts latent tendency and random instantaneous error.
- **True effect of the treatment.** −1 to +1 is a reasonable range but it can take any value: positive, negative, or zero. For raw regression to the mean choose “zero.”
- **Selection of the cases for treatment.** Cases are selected for treatment on the basis of initial measured condition. The parameter defines how far out into the left tail (negative values) or right tail (positive values) the cases are selected. Negative values are “adverse selection” and positive values are “advantageous selection.” Largish absolute values (i.e., +/− 2 sigmas or higher) indicate that the treatment is applied only to a few extreme cases whereas low values indicate that the treatment is applied to a large number of moderate cases.

After a run the program has in memory the parameters it started with and two output measures. “bias1” is the classic regression to the mean effect and “bias0” is the change in the non-treatment group (which is usually much smaller than bias1). The program gives text output summarizing the data for those parameters. I designed this mode for use in lab pedagogy — let students play with different parameters to see how much bias they get and try to figure out what’s responsible for it.
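For readers who don’t run Stata, here is a minimal Python sketch of the same simulation logic. The function name, defaults, and seed are mine, not from the do-file below; it is an illustration, not the post’s actual program.

```python
# Hypothetical Python translation of the simulation described above;
# the post's actual implementation is the Stata do-file below.
import random

def reg2mean(noisiness, beta_treatment, recruitment, n=100_000, seed=0):
    """Return (bias1, bias0): artifactual change in treated and untreated groups."""
    rng = random.Random(seed)
    bias_treated, bias_control = [], []
    for _ in range(n):
        y0_true = rng.gauss(0, 1)                       # latent tendency
        y0_obs = y0_true + rng.gauss(0, 1) * noisiness  # noisy pre-treatment measure
        # recruit from the left tail for adverse selection, right tail otherwise
        treated = y0_obs < recruitment if recruitment < 0 else y0_obs > recruitment
        y1_true = y0_true + beta_treatment * treated
        y1_obs = y1_true + rng.gauss(0, 1) * noisiness  # noisy post-treatment measure
        bias = (y1_obs - y0_obs) - beta_treatment * treated
        (bias_treated if treated else bias_control).append(bias)
    mean = lambda xs: sum(xs) / len(xs)
    return mean(bias_treated), mean(bias_control)

# no true effect, 1:1 noise, adverse selection two sigmas out:
bias1, bias0 = reg2mean(noisiness=1, beta_treatment=0, recruitment=-2)
# bias1 comes out around +1.3 sigmas; bias0 is slightly negative
```

With `noisiness=0` both biases collapse to zero, which is the “no noise, no regression to the mean” point made below.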

Alternately, you can batch it and see the big picture. Doing so shows that the true effect doesn’t matter much for the size of the regression to the mean effect (though of course they might be conflated with each other, which is the whole point). What really drives regression to the mean is primarily the noisiness (i.e., low reliability) of the condition measurement and secondarily how intensive the selection is. This is shown below in a surface graph (which is based on simulations where there is no true effect). In this graph width is noisiness, depth is where in the tail agents get recruited, and height/color is the magnitude of the regression to the mean effect.

The first thing to note is that for very reliably measurable conditions (the left side of the graph) there is no regression to the mean effect. No noise, no regression to the mean. So if you take your shortest students (as measured standing up straight with their shoes off) and have them do jumping jacks for a week to stretch them out you’ll find that they are still your shortest students after the exercise. This is true regardless of whether you impose this on the single shortest student or the shorter half of the class.

As you increase the noise (the right side of the graph) you get more regression to the mean, especially as you have more intensive selection (the front and back of the graph). So if you read your students’ midterms and send the low scorers for tutoring you’ll see improvement even if the tutoring is useless, but the effect will be bigger if you do this only for the very worst student than for the whole bottom half of the class. When you have high noise and intense selection (the front right and back right corners of the graph) you get *huge* regression to the mean effects, on the order of +/- 1.3 standard deviations. The really scary thing is that this is not some simulation fantasy but a realistic scenario. Lots of the outcomes we care about for policy purposes show intense day-to-day variation such that, if anything, assuming that error is of equal magnitude to latent tendency is a *conservative* assumption. Likewise, lots of policy interventions are targeted at extreme cases (whether it be a positive “rookie of the year” or negative “hitting bottom” extreme). This is one reason to expect that programs developed with hard cases will be less effective when applied to a more representative population.

```
capture log close
log using reg2mean.log, replace
*DEPENDENCY
*full do-file (but not the core reg2mean program) depends on gnuplot and gnuplotpm3d.ado
*can get similar results with surface.ado, tddens.ado, by piping to R, or even MS Excel
capture program drop reg2mean
program define reg2mean
	set more off
	if `1'>=0 {
		local noisiness `1' /* how bad is our measure of Y, should range 0 (perfect measure) to 1 (1 signal: 1 noise), >1 indicates noise>signal */
	}
	else {
		disp "NOTE: Noisiness must be non-negative. Set to zero for now"
		local noisiness = 0
	}
	local beta_treatment `2' /* how effective is the treatment. should range from -.5 (counter-productive) to .5 (pretty good), where 0 means no effect */
	local recruitment `3' /* as measured in sigmas. for adverse selection use "<-1", for advantageous selection use ">1" -- note, the program assumes that the median is in the control */
	clear
	quietly set obs 100000 /*note large number is hard-coded to avoid conflating sampling error with reg2mean effects */
	gen y_0true=rnormal()
	gen y_0observed=y_0true + (rnormal()*`noisiness')
	gen treatment=0
	*this code defines recruitment
	if `recruitment'<0 {
		quietly replace treatment=1 if y_0observed<`recruitment'
	}
	else {
		quietly replace treatment=1 if y_0observed>`recruitment'
	}
	quietly gen y_1true=y_0true + (treatment*`beta_treatment')
	quietly gen y_1observed=y_1true + (rnormal()*`noisiness')
	quietly gen delta_observed=y_1observed-y_0observed
	quietly gen bias=delta_observed - (treatment*`beta_treatment')
	collapse (mean) bias, by(treatment)
	quietly gen noisiness=round(`noisiness',.001)
	quietly gen beta_treatment=round(`beta_treatment',.001)
	quietly gen recruitment=round(`recruitment',.001)
	quietly reshape wide bias, i(noisiness beta_treatment recruitment) j(treatment)
	local treatmentbias = bias1 in 1
	local controlbias = bias0 in 1
	if `recruitment'<0 {
		disp "You have simulated regression to the mean where the signal:noise ratio is " _newline "1:" float(`noisiness') ", the true effect of the treatment is " float(`2') ", and there is adverse " _newline "selection such that the treatment is allocated if and only if the " _newline "pre-treatment measure of the condition is below " float(`3') " standard deviations."
	}
	else {
		disp "You have simulated regression to the mean where the signal:noise ratio is " _newline "1:" float(`noisiness') ", the true effect of the treatment is " float(`2') ", and there is advantageous " _newline "selection such that the treatment is allocated if and only if the " _newline "pre-treatment measure of the condition is above " float(`3') " standard deviations."
	}
	disp "Net of the true treatment effect, the regression to the mean artifactual " _newline "effect on those exposed to the treatment is about " round(`treatmentbias',.001) ". Furthermore, " _newline "the non-treated group will experience an average change of " round(`controlbias',.001) "."
end

tempname results
tempfile resultsfile
postfile `results' bias0 bias1 noisiness beta_treatment recruitment using "`resultsfile'"
forvalues noi=0(.1)1 {
	forvalues beta=-.5(.25).5 {
		disp "noise     beta      recruitment"
		forvalues recr=-2(.25)2 {
			disp round(`noi',.01) _column(10) round(`beta',.01) _column(20) round(`recr',.01)
			quietly reg2mean `noi' `beta' `recr'
			local bias0 = bias0 in 1
			local bias1 = bias1 in 1
			post `results' (`bias0') (`bias1') (`noi') (`beta') (`recr')
		}
	}
}
postclose `results'
use `resultsfile', clear
foreach var in bias0 bias1 noisiness beta_treatment recruitment {
	replace `var'=round(`var',.0001)
}
lab var bias0 "artifactual change - nontreatment group"
lab var bias1 "artifactual change - treatment group"
lab var noisiness "measurement error of Y"
lab var beta_treatment "true efficacy of treatment"
lab var recruitment "sigmas out in tail that treatment is recruited"
compress
save reg2mean.dta, replace
keep if beta_treatment==0
gnuplotpm3d noisiness recruitment bias1, title(Regression to the Mean with No True Effect) xlabel(Noisiness) ylabel(Who Gets Treatment) using(r2m_0)
shell open r2m_0.eps
*have a nice day
```

*The changes are a larger set of agents, integration of postfile, improved handling of macros, specification of selection, interactive mode, and surface plotting (dependent on my Gnuplot pipe).

### Sampling on the independent variables

| Gabriel |

At Scatterplot, Jeremy notes that in a reader poll, Megan Fox was voted both “worst” and “sexiest” actress. Personally, I’ve always found Megan Fox to be less sexy than a painfully deliberate simulacrum of sexy. The interesting question Jeremy asks is whether this negative association is correlation or causation. My answer is neither, it’s truncation.

What you have to understand is that the question is implicitly about *famous* actresses. It is quite likely that somewhere in Glendale there is some barista with a headshot by the register who is both fugly and reads lines like a robot. However this person is not famous (and probably not even Taft-Hartleyed). If there is any meritocracy at all in Hollywood, the famous are — on average — going to be desirable in at least one dimension. They may become famous because they are hot or because they are talented, but our friend at the Starbucks on Colorado is staying at the Starbucks on Colorado.

This means that when we ask about the association of acting talent and sexiness amongst the famous, we have truncated data where people who are low on both dimensions are selected out. Within the truncated sample there may be a robust negative association, but the causal relationship is very indirect, and it’s not as if having perky breasts *directly* obstructs the ability to convincingly express emotions (a botoxed face on the other hand …).

You can see this clearly in simulation (code is at the end of the post). I’ve modeled a population of ten thousand aspiring actresses as having two dimensions, body and mind, each of which is drawn from a random normal. As built in by assumption, there is no correlation between body and mind.

Stars are a subsample of aspirants. Star power is defined as a Poisson centered on the sum of body and mind (and re-centered to avoid negative values). That is, star power is a combination of body, mind, and luck. Only the 10% of aspirants with the most star power become famous. If we now look at the correlation of body and mind among stars, it’s negative.

This is a silly example, but it reflects a serious methodological problem that I’ve seen in the literature and I propose to call “sampling on the independent variable.” You sometimes see this directly in the sample construction when a researcher takes several overlapping datasets and combines them. If the researcher then uses membership in one of the constituent datasets (or something closely associated with it) to predict membership in another of the constituent datasets (or something closely associated with it), the beta is inevitably negative. (I recently reviewed a paper that did this and treated the negative associations as substantive findings rather than methodological artifacts.)

Likewise, it is very common for a researcher to rely on prepackaged composite data rather than explicitly creating original composite data. For instance, consider that favorite population of econ soc, the Fortune 500. *Fortune* defines this population as the top 500 firms ranked by sales. Now imagine decomposing sales by industry. Inevitably, sales in manufacturing will be negatively correlated with sales in retail. However this is an artifact of sample truncation. In the broader population the two types of sales will be *positively* correlated (at least among multi-dimensional firms).

```
clear
set obs 10000
gen body=rnormal()
gen mind=rnormal()
*corr in the population
corr body mind
scatter body mind
graph export bodymind_everybody.png, replace
*keep only the stars
gen talent=body+mind+3
recode talent -100/0=0
gen stardom=rpoisson(talent)
gsort -stardom
keep in 1/1000
*corr amongst stars
corr body mind
scatter body mind
graph export bodymind_stars.png, replace
```
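The Fortune 500 version of the argument works the same way. Here is a quick Python analogue (entirely hypothetical numbers, and a deterministic top-500 cutoff rather than the Poisson stardom draw in the Stata code): two positively correlated sales components flip to a negative correlation once you keep only the top firms ranked by total sales.

```python
# Hypothetical illustration: manufacturing and retail sales share a common
# firm-size component, so they correlate positively in the full population,
# but negatively among the top 500 firms ranked by total sales.
import random

rng = random.Random(7)
firms = []
for _ in range(10_000):
    size = rng.gauss(0, 1)                  # overall firm scale
    firms.append((size + rng.gauss(0, 1),   # manufacturing sales
                  size + rng.gauss(0, 1)))  # retail sales

def corr(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / (sxx * syy) ** 0.5

top500 = sorted(firms, key=lambda f: f[0] + f[1], reverse=True)[:500]
# corr(firms) is positive; corr(top500) is negative
```

Selecting on the sum compresses variation in the sum while leaving variation in the difference untouched, which is exactly what produces the artifactual negative beta.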

### Probability distributions

| Gabriel |

I wrote this little demo for my stats class to show how normal distributions result from complex processes that sum the constituent parts whereas count distributions result from complex processes where a single constituent failure is catastrophic.

```
*this do-file is a simple demo of how statistical distributions are built up from additive vs sudden-death causation
*this is entirely based on a simulated coin toss -- the function "round(uniform())"
*one either counts how many heads out of 10 tosses or how long a streak of heads lasts
*I'm building up from this simple function for pedagogical purposes, in actual programming there are much more direct functions like rnormal()

*1. The normal distribution
*Failure is an additive setback
clear
set obs 1000
forvalues var=1/10 {
	quietly gen x`var'=.
}
forvalues row=1/1000 {
	forvalues var=1/10 {
		quietly replace x`var'=round(uniform()) in `row'
	}
}
gen sumheads=x1+x2+x3+x4+x5+x6+x7+x8+x9+x10
order sumheads
lab var sumheads "How Many Heads Out of 10 Flips"
*show five examples
list in 1/5
histogram sumheads, discrete normal
graph export sumheads.png, replace

*2. Count distribution
*Failure is catastrophic
clear
set obs 1000
forvalues var=1/30 {
	quietly gen x`var'=.
}
gen streak=0
lab var streak "consecutive heads before first tails"
gen fail=0
forvalues row=1/1000 {
	forvalues var=1/30 {
		quietly replace x`var'=round(uniform()) in `row'
		quietly replace fail=1 if x`var'==0
		quietly replace streak=`var' if fail==0
	}
	quietly replace fail=. in `row'/`row'
}
quietly replace streak=0 if x1==0
*show five partial examples
list streak x1 x2 x3 x4 x5 in 1/5
histogram streak, discrete
graph export streakheads.png, replace
*have a nice day
```

### so random

| Gabriel |

Nate Silver at 538 has accused Strategic Vision of fudging their numbers and his argument is simply that few of their estimates end in “0” or “5” and a lot of them end in “7.” The reason this is meaningful is that there’s a big difference between random and the perception of random. A true random number generator will give you nearly equal frequency of trailing digits “0” and “7,” but to a human being a number ending in “7” *seems* more random than one ending in “0.” Likewise clusters occur in randomness but human beings see clustering as suspicious. A scatterplot of two random variables drawn from a uniform has a lot of dense and sparse patches but people expect it to look like a slightly off-kilter lattice. That is, we intuitively can’t understand that there is a difference between a uniform distribution and a random variable drawn from a uniform distribution.
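The trailing-digit point is easy to check for yourself. This toy Python demo (my illustration, not Silver’s actual test) draws random integers and tabulates their last digits:

```python
# Toy demo: last digits of genuinely random integers are nearly uniform,
# so a real RNG produces about as many trailing 0s as trailing 7s.
import random
from collections import Counter

rng = random.Random(42)
draws = 100_000
last_digits = Counter(rng.randrange(10_000) % 10 for _ in range(draws))
shares = {d: last_digits[d] / draws for d in range(10)}
# every digit lands close to the expected 10% share
```

Large, persistent deviations from that flat 10% line are what make a batch of reported numbers look hand-made rather than random.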

This reminded me of two passages from literature. One is in *Silence of the Lambs* when Hannibal Lecter tells Clarice that the locations of Buffalo Bill’s crime scenes are “desperately random, like the elaborations of a bad liar.” The other is from Stephenson’s *Cryptonomicon*, where a mathematician explains how he broke a theoretically perfect encryption scheme:

> That is true in theory, … In practice, this is only true if the letters that make up the one-time pad are chosen perfectly randomly … An English speaker is accustomed to a certain frequency distribution of letters. He expects to see a great many e’s, t’s, and a’s, and not so many z’s and q’s and x’s. So if such a person were using some supposedly random algorithm to generate the letters, he would be subconsciously irritated every time a z or an x came up, and, conversely, soothed by the appearance of e or t. Over time, this might skew the frequency distribution.

Going a little bit further afield, in a recent bloggingheads, Knobe and Morewidge discuss the latter’s psych lab research on various issues, including how people tend to ascribe misfortune to malicious agency but fortune to chance. They then note that this is the opposite of how we tend to talk about God, seeing fortune as divine agency and misfortune as random. This is true for Americans, but this has less to do with human nature than with the unusual nature of the Abrahamic religions.*

Ironically, the lab research *is* pretty consistent with the modal human religious experience — animism organized around a “do ut des” relationship with innumerable spirits that control every aspect of the natural world. Most noteworthy is that much of this worship appears aimed not at some special positive favor but at getting the gods to leave you alone. So the Romans had sacrifices and festivals to appease gods like Robigus, the god of mold, and Cato the Elder’s *De Agricultura* explains things like how when you clear a grove of trees you need to sacrifice a pig to the fairies who lived in the trees so they don’t haunt the farm. These religious practices seem pretty clearly derived from a human tendency to treat misfortune as the result of agency and to generalize this to supernatural agency, absent cultural traditions to the contrary.

—————-

*I generally get pretty frustrated with people who talk about religion and human nature proceeding from the assumption that ethical monotheism and atheism are the basic alternatives. Appreciating that historically and pre-historically most human beings have been animists makes the spandrel theory of hyper-sensitive agency-detection much more plausible than the group-selectionist theory of solidarity and intra-group altruism.

### St with shared frailty only

| Gabriel |

Several Stata commands in the xt family allow you to specify a random model (i.e., structured error terms) with no fixed model (i.e., independent variables). For instance:

```
xtreg y, re i(clustervar)
xtmixed y || clustervar:
gllamm y, i(clustervar)
```

This is very useful if the only thing you’re interested in is rho, the proportion of variance clustered within groups. To take a classic example of multilevel modeling, you might have test score data on students by classroom and you may be interested simply in how much good performance clusters by classroom (rho) before you get to independent variables like whether teacher credentials or class size matter.
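As a rough illustration of what the “empty” model buys you, here is a Python sketch on simulated classroom data. It uses the simple one-way ANOVA variance-components estimator of rho rather than the ML estimates the Stata commands report, and every number in it is invented for the example:

```python
# Sketch of rho (the intraclass correlation) on simulated classroom data,
# via the one-way ANOVA variance-components estimator. This approximates,
# but is not identical to, what xtreg/xtmixed/gllamm would report.
import random

rng = random.Random(1)
n_groups, n_per = 50, 20
sd_between, sd_within = 1.0, 2.0   # true rho = 1 / (1 + 4) = 0.2

data = []
for _ in range(n_groups):
    u = rng.gauss(0, sd_between)   # classroom effect
    data.append([u + rng.gauss(0, sd_within) for _ in range(n_per)])

grand = sum(sum(row) for row in data) / (n_groups * n_per)
means = [sum(row) / n_per for row in data]

# mean squares between and within classrooms
ms_between = n_per * sum((m - grand) ** 2 for m in means) / (n_groups - 1)
ms_within = (sum((y - m) ** 2 for row, m in zip(data, means) for y in row)
             / (n_groups * (n_per - 1)))

var_between = max((ms_between - ms_within) / n_per, 0.0)
rho = var_between / (var_between + ms_within)  # share of variance between classrooms
```

With these made-up parameters rho should land near 0.2, i.e., about a fifth of the variance in scores clusters by classroom before any independent variables enter the model.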

In the [st] syntax, shared frailty is closely analogous to random effects (and strata are analogous to fixed effects). However unlike most xt commands, the st syntax expects there to be independent variables and it chokes if it doesn’t get them. Fortunately this is not a limitation of the model, only the syntax parsing, and you can trick Stata by feeding it a constant. It drops the constant from the model and estimates only the shared frailty. For instance, this model shows only the extent to which radio station adoptions of a particular song clustered by the stations’ corporate owners:

```
. gen x1=1

. streg x1, shared(owner_n) distribution(exponential)
Note: frailty(gamma) assumed.

         failure _d:  add
   analysis time _t:  (fpdate-origin)
             origin:  time firstevent
                 id:  station_n

note: x1 dropped because of collinearity

Fitting exponential model:

Iteration 0:   log likelihood = -260.2503
Iteration 1:   log likelihood = -252.85211
Iteration 2:   log likelihood = -246.83872
Iteration 3:   log likelihood = -246.28281
Iteration 4:   log likelihood = -246.10157
Iteration 5:   log likelihood = -246.10117
Iteration 6:   log likelihood = -246.10117

Exponential regression -- log relative-hazard form
Gamma shared frailty                            Number of obs      =       171
Group variable: owner_n                         Number of groups   =        46

No. of subjects =          171                  Obs per group: min =         1
No. of failures =          164                                 avg =  3.717391
Time at risk    =         3739                                 max =        58

                                                F(   0,      .)    =         .
Log likelihood  =   -246.10117                  Prob > F           =         .

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
-------------+----------------------------------------------------------------
     /ln_the |  -1.792063   .4458721    -4.02   0.000    -2.665956   -.9181693
-------------+----------------------------------------------------------------
       theta |   .1666161   .0742895                      .0695329    .3992493
------------------------------------------------------------------------------
Likelihood-ratio test of theta=0: chibar2(01) = 15.60   Prob>=chibar2 = 0.000
```
