Archive for May, 2010

Regression to the mean [updated]

| Gabriel |

I updated* my old old script for simulating regression to the mean.

Regression to the mean is the phenomena that when you have a condition measured before and after a treatment, where recruitment into a treatment is conditional on the condition at time zero, you can get artifactual results. For instance, people tend to go into rehab when they hit bottom (i.e., are especially screwed up) so even if rehab were useless you’d expect some people to sober up after a stint in rehab. Likewise, the placebo effect is often understood as something like the “magic feather” in Dumbo but another component is regression to the mean, which is why you can get a placebo effect with plants. A special case of regression to the mean is the “sophomore slump” which occurs when you select cases that were high rather than low for treatment.

The code simulates the process for a population of 100,000 agents (a number chosen to be large enough that sampling error is asymptotically zero). Each agent has a latent tendency drawn from a standard normal that is measured at any given time with (specifiable) noise and is sensitive to (specifiable) treatment effects. The program takes the following arguments in order:

  1. Noisiness defined as noise:signal ratio for any given observation. Can take any non-negative value but 0-1 is a reasonable range to play with. Low values indicate a reliable variable (like height) whereas high values indicate an unreliable variable (like mood). At “zero” there is no measurement error and at “one” any given observation is equal parts latent tendency and random instantaneous error.
  2. True effect of the treatment -1 to +1 is a reasonable range but can take any value: positive, negative, or zero. For raw regression to the mean choose “zero.”
  3. Selection of the cases for treatment. Cases are selected for treatment on the basis of  initial measured condition. The parameter defines how far out into the left tail (negative values) or right tail (positive values) the cases are selected. Negative values are “adverse selection” and positive values are “advantageous selection.” Largish absolute values (i.e., +/- 2 sigmas or higher) indicate that the treatment is applied only to a few extreme cases whereas low values indicate that the treatment is applied to a large number of moderate cases.

After a run the program has in memory the parameters it started with and two output measures. “bias1” is the classic regression to the mean effect and “bias0” is the change in non-treatment group (which is usually much smaller than bias1). The program gives text output summarizing the data for those parameters. I designed this mode for use in lab pedagogy — let students play with different parameters to see how much bias they get and try to figure out what’s responsible for it.

Alternately, you can batch it and see the big picture. Doing so shows that the true effect doesn’t matter much for the size of the regression to the mean effect (though of course they might be conflated with each other, which is the whole point). What really drives regression to the mean is primarily the noisiness (i.e., low reliability) of the condition measurement and secondarily how intensive the selection is. This is shown below in a surface graph (which is based on simulations where there is no true effect). In this graph width is noisiness, depth is where in the tail agents get recruited, and height/color is the magnitude of the regression to the mean effect.

The first thing to note is that for very reliably measurable conditions (the left side of the graph) there is no regression to the mean effect. No noise, no regression to the mean. So if you take your shortest students (as measured standing up straight with their shoes off) and have them do jumping jacks for a week to stretch them out you’ll find that they are still your shortest students after the exercise. This is true regardless of whether you impose this on the single shortest student or the shorter half of the class.

As you increase the noise (the right side of the graph) you get more regression to the mean, especially as you have more intensive selection (the front and back of the graph). So if you read your students’ midterms and send the low scorers for tutoring you’ll see improvement even if the tutoring is useless, but the effect will be bigger if you do this only for the very worst student than for the whole bottom half of the class. When you have high noise and intense selection (the front right and back right corners of the graph) you get huge regression to the mean effects, on the order of +/- 1.3 standard deviations. The really scary thing is that this is not some simulation fantasy but a realistic scenario. Lots of the outcomes we care about for policy purposes show intense day-to-day variation such that, if anything assuming that error is of equal magnitude to latent tendency is a conservative assumption. Likewise, lots of policy interventions are targeted at extreme cases (whether it be a positive “rookie of the year” or negative “hitting bottom” extreme). This is one reason to expect that programs developed with hard cases will be less effective when applied to a more representative population.

capture log close
log using reg2mean.log, replace

*full do-file (but not the core reg2mean program) depends on gnuplot and gnuplotpm3d.ado
*can get similar results with surface.ado, tddens.ado, by piping to R, or even MS Excel

capture program drop reg2mean
program define reg2mean
	set more off
	if `1'>=0 {
		local noisiness `1'
		/* how bad is our measure of Y, should range 0 (perfect measure) to 1 (1 signal: 1 noise), >1 indicates noise>signal */
	else {
		disp "NOTE: Noisiness must be non-negative. Set to zero for now"
		local noisiness = 0
	local beta_treatment `2'
	/* how effective is the treatment. should range from -.5 (counter-productive) to .5 (pretty good), where 0 means no effect */
	local recruitment `3'
	/* as measured in sigmas for adverse selection use "<-1" , for advantageous selection use ">1" -- note, the program assumes that the median is in the control */
	quietly set obs 100000  /*note large number is hard-coded to avoid conflating sampling error with reg2mean effects. */
	gen y_0true=rnormal()
	gen y_0observed=y_0true + (rnormal()*`noisiness')
	gen treatment=0
	*this code defines recruitment
	if `recruitment'<0 {
		quietly replace treatment=1 if y_0observed<`recruitment'
	else {
		quietly replace treatment=1 if y_0observed>`recruitment'
	quietly gen y_1true=y_0true+ (treatment*`beta_treatment')
	quietly gen y_1observed=y_1true+ (rnormal()*`noisiness')
	quietly gen delta_observed=y_1observed-y_0observed
	quietly gen bias=delta_observed - (treatment*`beta_treatment')
	collapse (mean) bias , by (treatment)
	quietly gen noisiness=round(`noisiness',.001)
	quietly gen beta_treatment=round(`beta_treatment',.001)
	quietly gen recruitment=round(`recruitment',.001)
	quietly reshape wide bias, i(noisiness beta_treatment recruitment) j(treatment)
	local treatmentbias = bias1 in 1
	local controlbias = bias0 in 1
	if `recruitment'<0 {
		disp "You have simulated regression to the mean where the signal:noise ratio is " _newline "1:" float(`noisiness') ", the true effect of the treatment is " float(`2')  ", and there is adverse " _newline "selection such that the treatment is allocated if and only if the " _newline "the pre-treatment measure of the condition is below " float(`3') " standard deviations."
	else {
		disp "You have simulated regression to the mean where the signal:noise ratio is" _newline "1:" float(`noisiness') ", the true effect of the treatment is " float(`2')  ", and there is advantageous" _newline "selection such that the treatment is allocated if and only if the " _newline "pre-treatment measure of the condition is above " float(`3') " standard deviations."
	disp "Net of the true treatment effect, the regression to the mean artifactual " _newline "effect on those exposed to the treatment is about " round(`treatmentbias',.001) ". Furthermore, " _newline "the non-treated group will experience an average change of " round(`controlbias',.001) "."

tempname results
tempfile resultsfile
postfile `results' bias0 bias1 noisiness beta_treatment recruitment using "`resultsfile'"

forvalues noi=0(.1)1 {
	forvalues beta=-.5(.25).5 {
		disp "noise    beta      recruitment"
		forvalues recr=-2(.25)2 {
			disp round(`noi',.01) _column(10) round(`beta',.01) _column(20) round(`recr',.01)
			quietly reg2mean `noi' `beta' `recr'
			local bias0 = bias0 in 1
			local bias1 = bias1 in 1
			post `results' (`bias0') (`bias1') (`noi') (`beta') (`recr')

postclose `results'
use `resultsfile', clear
foreach var in bias0 bias1 noisiness beta_treatment recruitment {
	replace `var'=round(`var',.0001)
lab var bias0 "artifactual change - nontreatment group"
lab var bias1 "artifactual change - treatment group"
lab var noisiness "measurement error of Y"
lab var beta_treatment "true efficacy of treatment"
lab var recruitment "sigmas out in tail that treatment is recruited"
save reg2mean.dta, replace

keep if beta_treatment==0
gnuplotpm3d noisiness recruitment bias1, title (Regression to the Mean with No True Effect) xlabel(Noisiness) ylabel(Who Gets Treatment) using(r2m_0)
shell open r2m_0.eps

*have a nice day

*The changes are a larger set of agents, integration of postfile, improved handling of macros, specification of selection, interactive mode, and surface plotting (dependent on my Gnuplot pipe).

May 27, 2010 at 4:37 am 3 comments

Gnuplot [updated]

| Gabriel |

Here’s an updated version of my old script for piping from Stata to Gnuplot. Gnuplot is a generic graphing utility. The main thing it can do that Stata can’t is good 3D graphs. (Although the ado files “surface” and “tddens” make a good effort at adding this capability). In this version I set it to a coarser smooth, set the horizontal plane to the minimum value for Z, and allow either contour or surface.

Note that the dgrid3d smoothing algorithm in Gnuplot 4.2 (which is still the version maintained in Fink and Ubuntu) has some problems — the most annoying of which is that it can go haywire at the borders. I decided to handle this by setting it to a coarse smooth (5,5,1). However if unlike me you’re lucky enough to have Gnuplot 4.3 you have access to several much better smoothing algorithms and you may want to experiment with increasing the resolution and using one of the new algorithms by editing line 41 of this script.

*GHR 5/25/2010
*gnuplotpm3d 1.1
*pipes data to gnuplot's splot command, useful for color surface graphs
capture program drop gnuplotpm3d
program define gnuplotpm3d
syntax varlist(min=3 max=3 numeric) [ , title(string asis) xlabel(string asis) ylabel(string asis) using(string asis) contour rotate(string asis)]

disp "`rotate'"

if "`xlabel'"=="" {
local xlabel="`x'"
if "`ylabel'"=="" {
local ylabel="`y'"
tempfile gnuplotpm3cmd
tempfile data

keep `varlist'
order `varlist'
disp "`using'"
outsheet using `data'.txt

*assumes that the software is running on a mac using
*if installed with fink or macports, set to the default (Unix) mode of just "gnuplot"
local gnuplot="gnuplot"
if "`c(os)'"=="MacOSX" {
	local gnuplot="exec '/Applications/'"  /*assumes binary install, set to simply "gnuplot" if fink or macports install*/
if "`c(os)'"=="Windows" {
	*not tested
	local gnuplot="gnuplot.exe"  

file open gpcmd using `gnuplotpm3cmd', write text
file write gpcmd "cd '`c(pwd)'' ; " _n
file write gpcmd "set terminal postscript color ; set output '`using'.eps' ; set palette color positive ; " _n
file write gpcmd "set auto ; set parametric ; " _n
file write gpcmd "set dgrid3d 5,5,1 ; " _n  /*interpolation algorithm, allows tolerance for data irregularities -- set to low numbers for smooth graph and high numbers for bumpier graph*/
file write gpcmd `"set title "`title'" ; set ylabel "`ylabel'"; set xlabel "`xlabel'"; "' _n
file write gpcmd "unset contour; unset surface; " _n
if "`contour'"=="contour" {
	file write gpcmd "set view map; " _n
if "`rotate'"!="" & "`contour'"=="contour" {
	file write gpcmd "set view `rotate'; " _n
if "`contour'"=="contour" & "`rotate'"!="" {
	disp "rotate options ignored as they are incompatible with contour view"
file write gpcmd "set xyplane 0; " _n  /*put the xy plane right at min(z) */
file write gpcmd "set pm3d;" _n
file write gpcmd "set pm3d interpolate 10,10; " _n
file write gpcmd `"splot "`data'.txt"; "' _n
file close gpcmd

shell `gnuplot' `gnuplotpm3cmd'

May 25, 2010 at 3:01 pm 2 comments

Doing well by doing good

| Gabriel |

In the course of a recent post describing Haitian resentment of relief workers (who they view as stealing their jobs), Bryan Caplan links to a much older post he wrote (as inspired by a Cowen talk) about the “cargo” practice in rural Mexico. In this system anyone with wealth is expected to take on public office during which he funds public expenses from his private wealth. Caplan and Cowen describe this as providing an extreme incentive to either avoid wealth accumulation or to become an evangelico, which is a voluntary form of social death.

At first I found this pretty convincing (in part because I’ve heard an almost identical analysis from Alex Portes). However I was thinking about it some more and I realized that it’s a little more complicated than that. First of all, this dynamic is not unique to modern rural Latin America. An almost identical practice of euergetism was common in the ancient world, which didn’t exactly have the state capacity to collect a Value-Added Tax. Instead they relied on a combination of direct levies (conscription from the middle classes and euergetism from the big shots) and the relatively minor state treasuries came from the highly inefficient practice of tax farming the provinces. For instance, although Pericles financed most of his Athenian public works out of tribute from the other members of the Delian League, he offered to pay for them out of pocket on several occasions. Although euergetism was important in classical Greece and the Hellenistic world, I’m going to focus on Republican Rome as an example.

The main way euregetism worked in Republican Rome was through the aedileship (municipal officer), which was the second step in the cursus honorum. Every year four ambitious young men were made aediles and during their tenure they provided many municipal services, especially the ludi (games) which provided the young man an opportunity to show off his wealth and connections by bringing in exotic animals, gladiators, performers, etc. An arms race developed for largesse and spectacle and many aediles went heavily into debt.

So far this is all consistent with the picture drawn by Caplan, Cowen, and Portes, of the expectation of magnificent philanthropy being an indirect form of high marginal tax rates, with all that implies. Note though that their model implies that:
a) people would tend to avoid public service either directly or by forgoing wealth accumulation in the first place
b) the poor bastards who get stuck with public service would be bled dry

However neither of these are entirely true in the Roman Republic. The simplest way to demonstrate this is that there is a power-law distribution for how often families held the consulship. This implies cumulative advantage and thus holding high political office must have been, on net, profitable. How could this be given the massive expenses necessary to achieve and exercise high political office?

Consider the case of Gaius Julius Caesar. He was driven heavily into debt by his aedileship and praetorship (a judicial office that followed the aedileship). However then he was made propraetor (military governor) of outer Spain during which he won a lot of wealth by conquest and acquired the patronage of Crassus. Caesar then became consul (albeit in an especially ugly fashion) and went back North of the Alps to acquire enormous wealth in Gaul. So basically, the early stages of political office were expensive, but they provided the opportunity for later provincial administration where one could squeeze the provincials. Put more broadly, philanthropy is the price of purchase for political rent-seeking which could be extremely valuable. Likewise, my understanding is that in Latin America there are many opportunities for rent-seeking that are opened by embedding oneself in a patronage system. This may actually be the most invidious aspect of the system — it may not so much discourage wealth accumulation per se so much as redirect efforts away from positive sum wealth creation and to negative sum rent-seeking.

May 20, 2010 at 3:00 pm 2 comments

Matrices within matrices

| Gabriel |

I was playing with the multiple correspondence analysis command in Stata and noticed that some of the return matrices had not just row and column labels, but hierarchy within the table. For instance, the return matrix e(A) is organized such that columns are dimensions, the high level rows are variables, and the low level rows are variables. I thought this was interesting but couldn’t find any explicit documentation of the hierarchy issue, either with reference to mca return matrices specifically or to matrices in general. After some tinkering (and help from UCLA ATS) I figured out that you can either ignore the high level rows and call values by absolute position in the matrix or call values by using a colon like this “main:sub”. It seems to work best with labels.

Anyway, here’s what I’m talking about. Note that the last few commands are synonymous as all three call the Cartesian coordinates for people who strongly agree with the statement that “any change makes nature worse”. The only difference is in their use of labels or numbers and absolute position in the matrix versus using the “variable:value” structure:

. webuse issp93a, clear
(Selection from ISSP (1993))

. quietly mca A B C D

. mat mca_coeff=e(A)

. matlist mca_coeff

             |      dim1       dim2 
A            |                      
agree_stro~y |  1.836627   .7274591 
       agree |  .5462399  -.2844426 
neither_ag~e | -.4467973  -1.199439 
    disagree | -1.165903   .7367824 
disagree_s~y | -1.995217   2.470026 
B            |                      
agree_stro~y |  2.924321   1.370078 
       agree |   .641516  -.6669377 
neither_ag~e |  .3460504  -.9639179 
    disagree |  -.714126  -.2800712 
disagree_s~y | -1.353725   2.107677 
C            |                      
agree_stro~y |  2.157782    .908553 
       agree |  .2468277  -.5916111 
neither_ag~e | -.6189958  -1.044412 
    disagree | -1.348858   .6346467 
disagree_s~y | -1.467582   3.016588 
D            |                      
agree_stro~y |  1.203782   1.821975 
       agree | -.2211514  -.0069347 
neither_ag~e | -.3846555  -1.158694 
    disagree | -.2216352  -.2105125 
disagree_s~y |  .7077495   1.151804 

. matlist mca_coeff[11,1..2]

             |      dim1       dim2 
C            |                      
agree_stro~y |  2.157782    .908553 

. matlist mca_coeff[11,"dim1".."dim2"]

             |      dim1       dim2 
C            |                      
agree_stro~y |  2.157782    .908553 

. matlist mca_coeff["C:agree_strongly","dim1".."dim2"]

             |      dim1       dim2 
C            |                      
agree_stro~y |  2.157782    .908553  

Note that this kind of thing is common in other languages. So Perl uses package qualifiers (“main::sub”) to distinguish between objects within objects (or commands within libraries). Likewise R (which treats data more like Stata matrices than Stata master data) uses lots of nested data which you can call as “main[sub]”. For instance, igraph has 9 objects nested within a graph data object.

May 18, 2010 at 5:21 am

Cited reference search time-series

| Gabriel |

[Update: it looks like Google redirects you to a captcha after the first 100 pages, so the script won’t work for mega-cites like DiMaggio and Powell 1983].

I was recently talking to somebody who suspected an article he wrote 30 years ago was something of a “sleeper hit” and wanted to see an actual time-series. I wrote this little script to read Google Scholar and extract the dates. You have to tell it the Google Scholar serial number for the focal cite and how many pages to collect.

For instance if you search GS for Strang and Soule’s ARS and click where it says “Cited by 493” you get the URL “;. The important part of the URL is the number between “cites=” and “&”. To figure out how many pages to collect divide the number of citations by 10 and round down. So the syntax to scrape for this cite would be:

bash 3071200965662451019 49

Here’s the time-series for citations to that article

Here’s the code. Note that with fairly little modification you could get it to also give the names citing journal or book and authors.

# this script scrapes google scholar for references to a given cite
# GHR,

#takes as arguments the serial number of the cite followed by the number of pages deep to scrape (# of cites / 10)
#eg, for DiMaggio and Powell ASR 1983 the syntax is 11439231157488236678 1103
for (( i = 0; i < $2; i++ )); do
	curl -A "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.307.11 Safari/532.9" -o gs$1_page$i.htm "$j&hl=en&cites=$1"

echo "date" >  $1.txt
for (( i = 0; i < $2; i++ )); do
	perl -nle 'print for m/ ([0-9][0-9][0-9][0-9]) - /g' gs$1_page$i.htm >> $1.txt

# have a nice day

May 13, 2010 at 4:58 am 2 comments

Social network packages poll

| Gabriel |

Unfortunately (for now) Stata has basically no social network capabilities (though for some early efforts see Rense Corten’s netplot program and .net import filter, as well as my own .net export filter). So this necessitates leaving the warm cocoon of Stata and learning something else and a grad student recently asked me by email what social network package I recommend. Given that there’s a (steep) learning curve to all these alternatives, I think it’s reasonable to pick just one, at least to start with.

Personally, I use igraph running through R because it’s free, it’s easy to pipe to it, it can do alpha centrality, the documentation includes examples and tutorials rather than just a dictionary of functions and arguments, and it has a relatively easy syntax for importing Pajek data and generating graphs. You can see my attempts to muddle through this here. However I’m not going to claim that this is the best and it’s entirely possible I would have been better off grappling with something else (I especially have these “grass is always greener” thoughts about running igraph through Python). My only strong opinion is that whatever you use, it should be scriptable even if it comes at the cost of a higher initial learning curve.

Anyway, what do you all prefer and why? I’m allowing repeated voting for the benefit of people who use several. I’m interested in what people like in both a “wisdom of the crowd” sense and a network externalities sense.

May 6, 2010 at 4:43 am 14 comments

Misc links (niche partitioning and categories edition)

| Gabriel |

  • Attention niche partitioning people in search of a research topic! For awhile there was a fairly clean divide between ethnic shops and general shops in the advertising industry. Now the general shops are increasingly creating multicultural divisions and taking market share from the specialists. For some good background reading on ethnic partitioning of the ad market see Davila and Turow.
  • Intel is creating a more powerful dual-core version of the Atom chip that is found in most netbooks. This of course blurs the boundary between netbooks and laptops since the whole point of a netbook was it was weak but only cost $250, so if it’s powerful and $500 doesn’t that make it a laptop? (h/t Mark Kennedy.) On the other hand, this could actually heighten the contrast between netbooks and laptops. Suppose that in its quest to climb back up the value chain, Intel effectively prices itself out of the netbook market. If there is still a demand at or below the $250 price point, it will be met by computers based on ARM chips (the same thing that’s in the iPad and most cell phones) which are even cheaper and weaker than the Atom. The hitch is that ARM chips can’t run software compiled for an x86 which in practice means they can only run Linux. (Windows CE runs on ARM but there’s very little application software available for it whereas it’s pretty easy to recompile open source software for ARM and cloud services don’t care what chips or OS you’re using). In this scenario the current netbook niche would fragment into two, with some climbing the value chain to be low-end laptops running Windows and others being dirt cheap and running Chrome OS or Ubuntu Netbook Edition.
  • Everything I needed to know about economic sociology I learned from watching this old Disney cartoon. The main focus is on categorical uncertainty but it’s also got legal-rational authority/ bureaucracy, the railroads as the archetypal organizations of the second industrial revolution, reliance on expertise, social construction of price, exponential growth processes, etc.
  • Finally, it’s worth remembering that niche partitioning specifically and the organizational ecology paradigm more broadly is based on regular old biology and so even more than your average liberally educated person, econ soc folks should know something about biology. Fortunately, Stephen Stearns has podcast his intro to EEB. EEB is short for “evolution, ecology, and behavior” which us laymen can think of as the kind of biology that’s interesting, in contrast to MCB (molecular cell bio) which is the kind of biology that’s boring but will get you a job at Pfizer. As usual, Yale did a great technical job of capturing a talented lecturer. FWIW, Stearns’ voice sounds a little like that of President Obama so I keep expecting to say things like “if you like your genome, you can keep your genome.”

May 5, 2010 at 4:38 am 2 comments

Older Posts

The Culture Geeks