Shufflevar

October 26, 2009 at 5:09 am 3 comments

| Gabriel |

[Update: I’ve rewritten the command to be more flexible and posted it to ssc. to get it type “ssc install shufflevar”. this post may still be of interest for understanding how to apply the command].

Sometimes you face a situation where it’s really hard to see what the null is because the data structure is really complicated and there is all sorts of nonlinearity, etc. Analyses of non-sparse square network matrices can use the quadratic assignment procedure, but you can do something similar with other data structures, including bipartite networks.

A good null keeps everything constant, but shows what associations we would expect were association random. The simplest way to do this is to keep the actual variable vectors but randomly sort one of the vectors. So for instance, you could keep the actual income distribution and the actual values of peoples’ education, race, etc, but randomly assign actual incomes to people.

Fernandez, Castilla, and Moore used what was basically this approach to build a null distribution of the effects of employment referrals. Since then Ezra Zuckerman has used it in several papers on Hollywood to measure the strength of repeat collaboration. I myself am using it in some of my current radio work to understand how much corporate clustering we’d expect to see in the diffusion of pop songs under the null hypothesis that radio corporations don’t actually practice central coordination.

I wrote a little program that takes the argument of the variable you want shuffled. It has a similar application as bsample, and like bsample it’s best used as part of a loop.

capture program drop shufflevar
program define shufflevar
  local shufflevar `1'
  tempvar oldsortorder
  gen `oldsortorder'=[_n]
  tempvar newsortorder
  gen `newsortorder'=uniform()
  sort `newsortorder'
  capture drop `shufflevar'_shuffled
  gen `shufflevar'_shuffled=`shufflevar'[_n-1]
  replace `shufflevar'_shuffled=`shufflevar'[_N] in 1/1
  sort `oldsortorder'
  drop `newsortorder' `oldsortorder'
end

Here’s an example to show how much clustering of “y” you’d expect to see by “clusterid” if we keep the observed distributions of “y” and “clusterid” but break any association between them:

shell echo "run rho" > _results_shuffled.txt

forvalues run=1/1000 {
  disp "iteration # `run' of 1000"
  quietly shufflevar clusterid
  quietly xtreg y, re i(clusterid_shuffled)
  shell echo "`run' `e(rho)'" >> _results_shuffled.txt
}

insheet using _results_shuffled.txt, names clear delimiter(" ")
histogram rho
sum rho

(Note that “shell echo” only works with Mac/Unix, Windows users should try postfile).

Entry filed under: Uncategorized. Tags: , , , , .

La vie en mort Copy mac files when booting from dvd

3 Comments

  • 1. mike3550  |  October 26, 2009 at 10:31 am

    This is a great piece of code to do this, thank you! Another area where this is used is in spatial statistics, where you keep all of the discrete units’ values the same but you modify the adjacency matrix to see what would happen if you threw all of the units up the air and re-ordered them.

  • 2. Team Sorting « Code and Culture  |  November 8, 2009 at 7:28 pm

    […] I’d do it better (i.e., I’d allow the data to have more structure and I’d build confidence intervals from randomness), but for exploratory purposes the simplest way to measure sorting is to see if a given film had at […]

  • 3. Shufflevar [update] « Code and Culture  |  February 1, 2010 at 1:47 pm

    […] wrote a more flexible version of shufflevar (here’s the old version) and an accompanying help file. The new version allows shuffling an entire varlist (rather than […]


The Culture Geeks


%d bloggers like this: