Archive for June, 2010

importspss.ado (requires R)

| Gabriel |

Mike Gruszczynski has a post up pointing out that you can use R to translate files, for instance from SPSS to Stata. I like this a lot because it lets you avoid using SPSS but I’d like it even better if it let you avoid using R as well.

As such I rewrote the script to work entirely from Stata. Mike wanted to do this in Bash but couldn’t figure out how to pass arguments from the shell to R. Frankly, I don’t know how to do this either, which is why my solution is to have Stata write and execute an R source file so all the argument passing occurs within Stata. This follows my general philosophy of doing a lot of code mise en place in a user-friendly language so I can spend as little time as necessary in R. (Note that you could just as easily write this in Bash, but I figured this way you can a) make it cross-platform and b) attach it to “use” for a one-stop shop “import” command.)

*importspss.ado
*by GHR 6/29/2010
*this script uses R to translate SPSS to Stata
*it takes as arguments the SPSS file and Stata file
*adapted from http://mikegruz.tumblr.com/post/704966440/convert-spss-to-stata-without-stat-transfer 

*DEPENDENCY: R and library(foreign) 
*if R exists but is not in PATH, change the reference to "R" in line 27 to be the specific location

capture program drop importspss
program define importspss
	set more off
	local spssfile `1'
	if "`2'"=="" {
		local statafile "`spssfile'.dta"
	}
	else {
		local statafile `2'	
	}
	local sourcefile=round(runiform()*1000)
	capture file close rsource
	file open rsource using `sourcefile'.R, write text replace
	file write rsource "library(foreign)" _n
	file write rsource `"data <- read.spss("`spssfile'", to.data.frame=TRUE)"' _n
	file write rsource `"write.dta(data, file="`statafile'")"' _n
	file close rsource
	shell R --vanilla <`sourcefile'.R
	erase `sourcefile'.R
	use "`statafile'", clear
end
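
A minimal usage sketch, assuming the program above is saved as importspss.ado somewhere on the adopath and that R (with the foreign library) is in the PATH. The file names are hypothetical.

```stata
* hypothetical file names; substitute your own
* two arguments: SPSS file in, Stata file out
importspss "survey.sav" "survey.dta"

* one argument: the output name defaults to the input name plus ".dta",
* so this writes and then loads survey.sav.dta
importspss "survey.sav"
```

Either way the translated file ends up loaded in memory, since the last line of the program is a "use" command.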

June 29, 2010 at 3:01 pm 6 comments

People don’t think in sigmas?

| Gabriel |

The website TouringPlans.com applies some pretty sophisticated statistical analysis to going to DisneyWorld, including “traveling salesman” algorithms of the most efficient order in which to go on rides. One of their features is the crowd calendar, which estimates how crowded the parks will be for up to a year out from the present. The model is based on historical averages, hotel occupancies, convention and school calendars, etc., but the thing that interests me is how they present the data.

Their old method was to say the peak wait time for the three mountains (Space, Thunder, and Splash) in tens of minutes. So if at the peak time of the day, the line for Space Mountain would get as long as about 70 minutes, that day would be a “7.” In practice the range across the calendar would be about 4-10, with most days being a 5 or 6. The authors found they had three problems:

  1. The authors wanted to present information speaking to a large basket of rides rather than just a few major rides. Describing such data with reference to the mountains would be problematic because when the parks get really busy the wait times at less popular rides converge on those of major rides.
  2. Many customers asked why there weren’t any ones, twos, or threes. That is, they were thinking of it as a ten point Likert scale rather than an intrinsically meaningful scale denominated in tens of minutes.
  3. Many customers asked how they should decide between two days with the same number.

As such they created a new system that states the day’s decile ranking rather than tens of minutes. The interesting thing is that in switching from a raw scale to deciles, they not only solved issue #2 but also issue #3. Here’s why.

Percentiles (and similar scales like deciles and quartiles) are uniform distributions. Most things in reality map to something like a normal or Poisson distribution. When you project normal (or whatever) data onto a uniform, you get distortion such that you exaggerate differences near the mode and underplay differences in the tails. (It’s pretty much the opposite of the distortions of a Mercator projection, which makes Iceland too big and Brazil too little). For American men the height distribution has a mean of 5’9″ and a standard deviation of 3″. This means that if you move from the median to the 60th percentile you only gain about one inch whereas if you go from the 90th to the 99th percentile you gain about three inches.
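
The height arithmetic can be checked in a few lines of Stata. This is only a sketch, and it assumes heights are exactly normal with the mean (69 inches) and standard deviation (3 inches) given above; the scalar names are just for illustration.

```stata
* assumption: height ~ Normal(mean = 69 in, sd = 3 in), as stated in the text
scalar mu = 69
scalar sd = 3

* height at a given percentile = mean + sd * (z-score of that percentile)
scalar h50 = mu + sd*invnormal(0.50)
scalar h60 = mu + sd*invnormal(0.60)
scalar h90 = mu + sd*invnormal(0.90)
scalar h99 = mu + sd*invnormal(0.99)

* ten percentile points near the middle vs. nine in the upper tail
display "50th to 60th percentile: " %4.2f (h60 - h50) " inches"
display "90th to 99th percentile: " %4.2f (h99 - h90) " inches"
```

Under those assumptions the first gap comes out to about 0.76 inches and the second to about 3.13 inches, so nine percentile points in the tail are worth roughly four times as much height as ten points near the median.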

To return to the crowd calendar, if most days are moderately crowded then describing the crowd distribution as deciles will highlight distinctions between the high side and low side of moderate. You can see this if you read their page explaining how to interpret the decile scores: at most of the parks the three-decile jump from a “4” to a “7” corresponds to about the same difference in minutes as the one-decile jump from a “9” to a “10.”
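
To see the same pattern without the real calendar, here is a toy simulation. The wait times are made up (normal with a mean of 60 minutes and a standard deviation of 15, both numbers my assumption rather than anything from TouringPlans), and the variable names are hypothetical.

```stata
* toy data: 10,000 simulated days of peak wait time, NOT TouringPlans data
clear
set seed 2010
set obs 10000
gen wait = rnormal(60, 15)

* assign each day to a decile of the simulated distribution
xtile decile = wait, nq(10)

* average, shortest, and longest wait within each decile
tabstat wait, by(decile) statistics(mean min max)
```

With normal data like this, the difference in mean wait between deciles 4 and 7 and the difference between deciles 9 and 10 should both come out at roughly 11 minutes, even though the first spans three deciles and the second only one.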

I should add that I totally understand why the authors changed it to meet customer demands, but safely ensconced in my ivory tower I am free to sneer at the customer for wanting data in a way that reifies trivial distinctions. What the old version of the calendar was telling the customer was that many days are effectively interchangeable and they should choose on some other margin. If two days are both a “6,” then you should probably worry a lot less about which is the “better” 6 than about which will have better weather, a better fit with your family’s work/school schedule, cheaper airfare, or fall outside of tv sweeps so you won’t miss new episodes of your favorite shows while you’re on vacation. I can think of a long list of things more important to me than whether the 2:30pm wait time for Space Mountain is 57 minutes vs. 63 minutes, assuming the model is even accurate to such a fine-grained distinction.

A similar issue comes up in college and graduate admissions. I was doing admissions this year and I noticed I perceived GRE and the other tests differently depending on whether I read the “score” column or the “percentile” column. The score difference between a GRE-Q at the 70th vs 90th percentile is actually pretty substantial and may be decisive in and of itself whereas the difference between 50th and 70th is trivial and should be decided on another margin (like the letters or the writing sample). However it doesn’t feel that way and it takes an act of will to remind myself that the distortion implied by projecting normal onto uniform means that 20!=20.

Nonetheless the percentiles were very tempting for two reasons. First, much like the tourists complaining that most of the days were similarly crowded, I had to make decisions, even if the decision was arbitrary. Second, while (as a product of American education) the SAT/GRE scale of 200-800 makes intuitive sense to me, the tests on other scales like the GRE-W have no facial meaning to me. Don’t even get me started on the TOEFL which has like five different scales depending on what version of the test the student took and so in practice it’s opaque. These weird scales make no sense in and of themselves so I went to percentile as this was at least familiar, even if it’s distorted.

So what’s my solution for commensurability without distortion? Standardizing. There are two problems with this though. One is that it assumes a normal distribution which implies distortion of its own, though with a Poisson you get less distortion from assuming a normal (standardizing) than assuming a uniform (percentiles). The other is that most people aren’t comfortable thinking in terms of sigmas. This latter issue can be resolved to a certain extent by creating a scale that is more comfortable to people, one of the most important aspects of which is that it should have a natural zero. So for instance you could multiply Z by 2.3, then add 5.5, and then round. This puts things on a familiar 1-10 scale, albeit one that clusters around 5 and 6. This makes things interpretable but doesn’t distort the magnitude of distinctions, which depending on how you look at it is either a bug or a feature.
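
A rough sketch of that rescaling in Stata, using simulated data. The variable wait and its distribution are hypothetical; the 2.3 and 5.5 are the constants suggested above.

```stata
* toy data: 1,000 simulated wait times (hypothetical variable and parameters)
clear
set seed 12345
set obs 1000
gen wait = rnormal(60, 15)

* standardize to a z-score (mean 0, sd 1)
egen z = std(wait)

* rescale so that about +/- 2 sigma spans 1 to 10, then round to integers
gen score = round(2.3*z + 5.5)

* most observations should pile up around 5 and 6
tabulate score
```

Note that nothing here caps the scale: an observation more than about 2.2 sigma from the mean rounds to something below 1 or above 10, so whether to truncate those cases is a separate choice.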

June 24, 2010 at 1:34 pm 7 comments

Breaking Bad Edges

| Gabriel |

First let me say that I love Breaking Bad for exactly the same reasons as Ross Douthat. If you haven’t watched it, go to Amazon and buy the first and second seasons — it will be the best $29.98 plus shipping you ever spend.

It struck me that the season 3 finale is entirely about exchange theory. [really serious spoilers follow].

(more…)

June 18, 2010 at 1:16 pm 1 comment

the weed, period, cohort problem

| Gabriel |

Over at EconLog Bryan Caplan asks why marijuana decriminalization has gained such traction in the last decade. I ran a few xtabs on the SDA GSS data and said that it appeared to be mostly cohort replacement. Caplan and his commenter “agnostic” both asked if there wasn’t something special about recent periods. To check this out, I did the analysis slightly more formally and got this graph by plotting % favoring legal weed as a function of birth cohort, broken out by period.

So it looks like Caplan and agnostic were on to something: cohort is a big part of it but so is period. Since the period effect is non-monotonic it’s probably a true period effect rather than an age thing. I think the simplest way to describe the pattern I see is that more recent cohorts are more favorable to legal weed than older cohorts, but both the slope and the intercept rise in the last decade. My only guess as to why is that it’s a policy diffusion process seeded by the 1996 California plebiscite that legalized “medical” marijuana, but I would chalk it up to diffusion, wouldn’t I?

To replicate this go to SDA and get the variables GRASS, AGE, and YEAR. I used the cleaning script and dictionary SDA provides and added this to the end.

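* birth cohort = survey year minus age at interview
gen cohort=YEAR-AGE
* turn GRASS into a 0/1 dummy (1 = favors legalization) so lowess gives a smoothed proportion
recode GRASS 2=0
* keep only respondents born after 1900
keep if cohort> 1900
* one lowess curve per decade of survey year (the period)
twoway (lowess GRASS cohort if YEAR<1980) /*
    */ (lowess GRASS cohort if YEAR>=1980 & YEAR<1990) /*
    */ (lowess GRASS cohort if YEAR>=1990 & YEAR<2000) /*
    */ (lowess GRASS cohort if YEAR>=2000) /*
    */, legend(title(period) order(1 "70s" 2 "80s" 3 "90s" 4 "00s")) xtitle(Birth Cohort) ytitle(Pro-Legalization)
graph export grass.png, replace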

June 11, 2010 at 1:44 am 1 comment

An open letter to ICM Research

| Gabriel |

Dear ICM Research,

I recently started taking your survey on journals and quit when I got to this question:


I didn’t complete the survey because it is premised on the assumption that publishers are highly salient to the survey taker. Distinguishing between publishers might be something that could be articulated by librarians, journal editors, or the publication committees of professional societies, but for most practicing scientists the publisher is a very non-salient distinction and the particular journal is much more significant. Questions like “Would you recommend Blackwell to a colleague” are almost literally meaningless for most scientists. Not only do I not have opinions about whether Blackwell is “customer focused” but I can’t even remember what journals Blackwell publishes or what their journal websites look like from either the reader or peer review side. Any survey asking about such issues needs to have an implicit or explicit “don’t know” option. In general people collecting surveys need to get over the assumption that multiple dimensions about obscure objects are as interesting to the person taking the survey as to the person administering the survey.

Sincerely,

Gabriel Rossman
Assistant Professor
Sociology, UCLA

June 2, 2010 at 12:48 pm 2 comments

