People don’t think in sigmas?

June 24, 2010 at 1:34 pm 7 comments

| Gabriel |

The website TouringPlans.com applies some pretty sophisticated statistical analysis to going to DisneyWorld, including “traveling salesman” algorithms of the most efficient order in which to go on rides. One of their features is the crowd calendar, which estimates how crowded the parks will be for up to a year out from the present. The model is based on historical averages, hotel occupancies, convention and school calendars, etc., but the thing that interests me is how they present the data.

Their old method was to say the peak wait time for the three mountains (Space, Thunder, and Splash) in tens of minutes. So if at the peak time of the day, the line for Space Mountain would get as long as about 70 minutes, that day would be a “7.” In practice the range across the calendar would be about 4-10, with most days being a 5 or 6. The authors found they had three problems:

  1. The authors wanted to present information speaking to a large basket of rides rather than just a few major rides. Describing such data with reference to the mountains would be problematic because when the parks get really busy the wait times at less popular rides converge on those of major rides.
  2. Many customers asked why there weren’t any ones, twos, or threes. That is, they were thinking of it as a ten point Likert scale rather than an intrinsically meaningful scale denominated in tens of minutes.
  3. Many customers asked how they should decide between two days with the same number.

As such they created a new system that states the day’s decile ranking rather than tens of minutes. The interesting thing is that in switching from a raw scale to deciles, they not only solved issue #2 but also issue #3. Here’s why.

Percentiles (and similar scales like deciles and quartiles) are uniformly distributed by construction. Most things in reality map to something like a normal or Poisson distribution. When you project normal (or whatever) data onto a uniform, you get distortion such that you exaggerate differences near the mode and underplay differences in the tails. (It’s pretty much the opposite of the distortions of a Mercator projection, which makes Iceland too big and Brazil too little.) For American men the height distribution has a mean of 5’9″ and a standard deviation of 3″. This means that if you move from the median to the 60th percentile you gain less than an inch, whereas if you go from the 90th to the 99th percentile you gain about three inches.
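
To check the height arithmetic, here is a minimal Python sketch using the standard library’s `statistics.NormalDist`. It assumes heights are exactly normal with the mean (69 inches) and standard deviation (3 inches) quoted above; the helper name `height_at` is only illustrative.

```python
from statistics import NormalDist

# Assumption: heights are exactly normal, mean 69 in (5'9"), sd 3 in.
heights = NormalDist(mu=69, sigma=3)

def height_at(percentile):
    """Height in inches at a given percentile (0 < percentile < 100)."""
    return heights.inv_cdf(percentile / 100)

gain_middle = height_at(60) - height_at(50)  # ten percentile points near the mode
gain_tail = height_at(99) - height_at(90)    # nine percentile points in the tail

print(f"50th to 60th: {gain_middle:.2f} in")
print(f"90th to 99th: {gain_tail:.2f} in")
```

Under these assumptions the first gap comes out a bit under an inch and the second a bit over three inches, so nine percentile points in the tail are worth roughly four times as much height as ten points in the middle.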

To return to the crowd calendar, if most days are moderately crowded then describing the crowd distribution as deciles will highlight distinctions between the high side and low side of moderate. You can see this if you read their page explaining how to interpret the decile scores: at most of the parks the three-decile jump from a “4” to a “7” corresponds to about the same difference in minutes as the one-decile jump from a “9” to a “10.”

I should add that I totally understand why the authors changed it to meet customer demands, but safely ensconced in my ivory tower I am free to sneer at the customer for wanting data in a way that reifies trivial distinctions. What the old version of the calendar was telling the customer was that many days are effectively interchangeable and they should choose on some other margin. If two days are both a “6,” then you should probably worry a lot less about which is the “better” 6 than about which will have better weather, a better fit with your family’s work/school schedule, cheaper airfare, or fall outside of TV sweeps so you won’t miss new episodes of your favorite shows while you’re on vacation. I can think of a long list of things more important to me than whether the 2:30pm wait time for Space Mountain is 57 minutes vs. 63 minutes, assuming the model is even accurate to such a fine-grained distinction.

A similar issue comes up in college and graduate admissions. I was doing admissions this year and I noticed I perceived GRE and the other tests differently depending on whether I read the “score” column or the “percentile” column. The score difference between a GRE-Q at the 70th vs. 90th percentile is actually pretty substantial and may be decisive in and of itself, whereas the difference between 50th and 70th is trivial and should be decided on another margin (like the letters or the writing sample). However, it doesn’t feel that way, and it takes an act of will to remind myself that the distortion implied by projecting normal onto uniform means that 20 != 20.
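
As a rough illustration of why equal percentile gaps are not equal score gaps, the sketch below converts two 20-point percentile gaps into standard deviations. It assumes scores are normally distributed, which actual GRE-Q scores need not be; `z_gap` is a hypothetical helper name.

```python
from statistics import NormalDist

# Assumption: scores are normal, so a percentile maps to a z-score
# through the inverse CDF of the standard normal.
std_normal = NormalDist()

def z_gap(lower_pct, upper_pct):
    """Distance in standard deviations between two percentiles."""
    return std_normal.inv_cdf(upper_pct / 100) - std_normal.inv_cdf(lower_pct / 100)

middle = z_gap(50, 70)
upper = z_gap(70, 90)

print(f"50th to 70th: {middle:.2f} sd")
print(f"70th to 90th: {upper:.2f} sd")
```

Under normality the 70th-to-90th gap is roughly one and a half times the 50th-to-70th gap in sigma terms. How that translates into scaled points depends on the real score distribution, which this sketch does not model.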

Nonetheless the percentiles were very tempting for two reasons. First, much like the tourists complaining that most of the days were similarly crowded, I had to make decisions, even if the decision was arbitrary. Second, while (as a product of American education) the SAT/GRE scale of 200-800 makes intuitive sense to me, the tests on other scales like the GRE-W have no facial meaning to me. Don’t even get me started on the TOEFL, which has like five different scales depending on what version of the test the student took, and so in practice it’s opaque. These weird scales make no sense in and of themselves, so I went to percentile as this was at least familiar, even if it’s distorted.

So what’s my solution for commensurability without distortion? Standardizing. There are two problems with this though. One is that it assumes a normal distribution which implies distortion of its own, though with a Poisson you get less distortion from assuming a normal (standardizing) than assuming a uniform (percentiles). The other is that most people aren’t comfortable thinking in terms of sigmas. This latter issue can be resolved to a certain extent by creating a scale that is more comfortable to people, one of the most important aspects of which is that it should have a natural zero. So for instance you could multiply Z by 2.3, then add 5.5, and then round. This puts things on a familiar 1-10 scale, albeit one that clusters around 5 and 6. This makes things interpretable but doesn’t distort the magnitude of distinctions, which depending on how you look at it is either a bug or a feature.
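
The rescaling just described can be sketched in a few lines of Python. The function name `sigma_to_ten_point` is hypothetical, and the sketch uses Python’s built-in `round`, which rounds exact halves to the nearest even integer; the post does not say which rounding rule is intended.

```python
def sigma_to_ten_point(z):
    """Rescale a z-score as described above: multiply by 2.3, add 5.5, round."""
    # Python's round() sends an exact .5 to the nearest even integer,
    # so z = 0 (5.5 before rounding) becomes 6.
    return round(z * 2.3 + 5.5)

# Equal steps in sigma give equal steps (2.3 points) before rounding.
for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(z, sigma_to_ten_point(z))
```

Values between about two sigmas below and two sigmas above the mean land on 1 through 10. Anything further out falls off the scale (three sigmas above gives 12), so a real implementation would need to decide whether to clamp; that choice is not covered in the post.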


7 Comments

  • 1. drschweitzer  |  June 29, 2010 at 8:01 pm

    I think you should test this empirically by going to Disneyland.🙂

    • 2. gabrielrossman  |  June 30, 2010 at 12:47 am

      yes, but i’d have to go dozens of times to get enough n. want to be co-PI on the grant application?

  • 3. drschweitzer  |  July 7, 2010 at 3:38 pm

    Of course! We would have to have multiple case studies, too, like waiting for scuba equipment on Caribbean Islands!

  • 4. Misc Links « Code and Culture  |  August 9, 2011 at 5:33 am

    […] In an apparent attempt to make the GRE as useless to admissions committees as the TOEFL, ETS has completely revamped the GRE to be more “practical” and less of your classic abstract IQ test general aptitude measure. I lack the expertise in psychometrics to have an opinion about that change, but what pisses me off is they abandoned the old scoring system which means that the new scores are incommensurable with the old scores and people will probably give up trying to interpret the scores and just read the percentiles. This is a problem because percentiles lead to bad decision-making. […]

  • 5. Paul Walsh  |  November 18, 2011 at 6:47 am

    How would you code a graph in Stata to show this effect, one of squashed ‘centiles’ bars at the center and stretched ones at the tails of a normal curve? I’d like to use it to demonstrate this visually since we’re in applicant season.

    • 6. gabrielrossman  |  November 18, 2011 at 7:11 am

      I’d recommend a scatterplot of percentile against standardized.

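      * Simulate 500 standard-normal draws (these are already z-scores)
      clear
      set obs 500
      gen x = rnormal()
      * Rank the draws; with 500 obs each step in rank is 0.2 percentile points
      sort x
      gen ptile = _n/5
      * Plot z-score against percentile: flat in the middle, steep in the tails
      twoway scatter x ptile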
      
      • 7. Paul Walsh  |  November 19, 2011 at 4:17 am

        Oh this is much simpler than the machinations I was fumbling with! Thanks.

