Posts tagged ‘resampling’
Shufflevar update
| Gabriel |
Thanks to Elizabeth Blankenspoor (Michigan), I corrected a bug with the “cluster” option in shufflevar. I’ve submitted the update to SSC, but it’s also here:
*1.1 GHR January 24, 2011
*changelog
*1.1 -- fixed bug that let one case per "cluster" be misallocated (thanks to Elizabeth Blankenspoor)
capture program drop shufflevar
program define shufflevar
	version 10
	syntax varlist(min=1) [ , Joint DROPold cluster(varname)]
	tempvar oldsortorder
	gen `oldsortorder'=[_n]
	if "`cluster'"!="" {
		local bystatement "by `cluster': "
	}
	else {
		local bystatement ""
	}
	if "`joint'"=="joint" {
		tempvar newsortorder
		gen `newsortorder'=uniform()
		sort `cluster' `newsortorder'
		foreach var in `varlist' {
			capture drop `var'_shuffled
			quietly {
				`bystatement' gen `var'_shuffled=`var'[_n-1]
				`bystatement' replace `var'_shuffled=`var'[_N] if _n==1
			}
			if "`dropold'"=="dropold" {
				drop `var'
			}
		}
		sort `oldsortorder'
		drop `newsortorder' `oldsortorder'
	}
	else {
		foreach var in `varlist' {
			tempvar newsortorder
			gen `newsortorder'=uniform()
			sort `cluster' `newsortorder'
			capture drop `var'_shuffled
			quietly {
				`bystatement' gen `var'_shuffled=`var'[_n-1]
				`bystatement' replace `var'_shuffled=`var'[_N] if _n==1
			}
			drop `newsortorder'
			if "`dropold'"=="dropold" {
				drop `var'
			}
		}
		sort `oldsortorder'
		drop `oldsortorder'
	}
end
Shufflevar [update]
| Gabriel |
I wrote a more flexible version of shufflevar (here’s the old version) and an accompanying help file. The new version allows shuffling an entire varlist (rather than just one variable) jointly or independently and shuffling within clusters. The easiest way to install the command is to type:
ssc install shufflevar
For thoughts on this and related algorithms, see my original shufflevar post and/or the lecture notes on bootstrapping for my grad stats class.
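For a quick sense of how the new options work, here’s a minimal usage sketch (the auto dataset and the particular variables are just for illustration):

sysuse auto, clear
* shuffle mpg and weight together, preserving their relation to each other
* but breaking their association with everything else in the dataset
shufflevar mpg weight, joint
* shuffle price within levels of foreign, so prices only trade places
* among cars of the same origin
shufflevar price, cluster(foreign)
* the shuffled copies are named mpg_shuffled, weight_shuffled, and price_shuffled
sum *_shuffled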
Here’s the code.
*1.0 GHR January 29, 2010
capture program drop shufflevar
program define shufflevar
	version 10
	syntax varlist(min=1) [ , Joint DROPold cluster(varname)]
	tempvar oldsortorder
	gen `oldsortorder'=[_n]
	if "`joint'"=="joint" {
		tempvar newsortorder
		gen `newsortorder'=uniform()
		sort `cluster' `newsortorder'
		foreach var in `varlist' {
			capture drop `var'_shuffled
			quietly gen `var'_shuffled=`var'[_n-1]
			quietly replace `var'_shuffled=`var'[_N] in 1/1
			if "`dropold'"=="dropold" {
				drop `var'
			}
		}
		sort `oldsortorder'
		drop `newsortorder' `oldsortorder'
	}
	else {
		foreach var in `varlist' {
			tempvar newsortorder
			gen `newsortorder'=uniform()
			sort `cluster' `newsortorder'
			capture drop `var'_shuffled
			quietly gen `var'_shuffled=`var'[_n-1]
			quietly replace `var'_shuffled=`var'[_N] in 1/1
			drop `newsortorder'
			if "`dropold'"=="dropold" {
				drop `var'
			}
		}
		sort `oldsortorder'
		drop `oldsortorder'
	}
end
And here’s the help file.
{smcl}
{* 29jan2010}{...}
{hline}
help for {hi:shufflevar}
{hline}

{title:Randomly shuffle variables}

{p 8 17 2}
{cmd:shufflevar} {it:varlist} [, {cmdab:J:oint} {cmdab:DROP:old} {cmd:cluster}({it:varname})]

{title:Description}

{p 4 4 2}
{cmd:shufflevar} takes {it:varlist} and either jointly or for each variable shuffles {it:varlist} relative to the rest of the dataset. This means any association between {it:varlist} and the rest of the dataset will be random. Much like {help bootstrap} or the Quadratic Assignment Procedure (QAP), one can build a distribution of results out of randomness to serve as a baseline against which to compare empirical results, especially for overall model-fit or clustering measures.

{title:Remarks}

{p 4 4 2}
The program is intended for situations where it is hard to model error formally, either because the parameter is exotic or because the application violates the parameter's assumptions. For instance, the algorithm has been used by Fernandez et al. and Zuckerman to interpret network data, the author wrote this implementation for use in interpreting {help st} frailty models with widely varying cluster sizes, and others have suggested using the metric for adjacency matrices in spatial analysis.

{p 4 4 2}
Much like {help bsample}, the {cmd:shufflevar} command is only really useful when worked into a {help forvalues} loop or {help program} that records the results of each iteration using {help postfile}. See the example code below to see how to construct the loop.

{p 4 4 2}
To avoid confusion with the actual data, the shuffled variables are renamed {it:varname}_shuffled.

{p 4 4 2}
This command is an implementation of an algorithm used in two papers that used it to measure network issues:

{p 4 4 2}
Fernandez, Roberto M., Emilio J. Castilla, and Paul Moore. 2000. "Social Capital at Work: Networks and Employment at a Phone Center." {it:American Journal of Sociology} 105:1288-1356.

{p 4 4 2}
Zuckerman, Ezra W. 2005. "Typecasting and Generalism in Firm and Market: Career-Based Career Concentration in the Feature Film Industry, 1935-1995." {it:Research in the Sociology of Organizations} 23:173-216.

{title:Options}

{p 4 8 2}
{cmd:joint} specifies that the variables in {it:varlist} will keep their actual relations to one another even as they are shuffled relative to the rest of the variables. If {cmd:joint} is omitted, each variable in {it:varlist} will be shuffled separately.

{p 4 8 2}
{cmd:dropold} specifies that the original (unshuffled) versions of {it:varlist} will be dropped.

{p 4 8 2}
{cmd:cluster}({it:varname}) specifies that shuffling will occur within clusters defined by {it:varname}.

{title:Examples}

{p 4 8 2}{cmd:. sysuse auto, clear}{p_end}
{p 4 8 2}{cmd:. regress price weight}{p_end}
{p 4 8 2}{cmd:. local obs_r2=`e(r2)'}{p_end}
{p 4 8 2}{cmd:. tempname memhold}{p_end}
{p 4 8 2}{cmd:. tempfile results}{p_end}
{p 4 8 2}{cmd:. postfile `memhold' r2 using "`results'"}{p_end}
{p 4 8 2}{cmd:. forvalues i=1/100 {c -(}}{p_end}
{p 4 8 2}{cmd:. shufflevar weight, cluster(foreign)}{p_end}
{p 4 8 2}{cmd:. quietly regress price weight_shuffled}{p_end}
{p 4 8 2}{cmd:. post `memhold' (`e(r2)')}{p_end}
{p 4 8 2}{cmd:. {c )-}}{p_end}
{p 4 8 2}{cmd:. postclose `memhold'}{p_end}
{p 4 8 2}{cmd:. use "`results'", clear}{p_end}
{p 4 8 2}{cmd:. sum r2}{p_end}
{p 4 8 2}{cmd:. disp "The observed R^2 of " `obs_r2' " is " (`obs_r2'-`r(mean)')/`r(sd)' " sigmas out on the" _newline "distribution of shuffled R^2s."}{p_end}

{title:Author}

{p 4 4 2}Gabriel Rossman, UCLA{break}
rossman@soc.ucla.edu

{title:Also see}

{p 4 13 2}On-line: help for {help bsample}, help for {help forvalues}, help for {help postfile}, help for {help program}
Shufflevar
| Gabriel |
[Update: I’ve rewritten the command to be more flexible and posted it to SSC. To get it, type “ssc install shufflevar”. This post may still be of interest for understanding how to apply the command.]
Sometimes you face a situation where it’s really hard to see what the null is because the data structure is complicated and there are all sorts of nonlinearities, etc. Analyses of non-sparse square network matrices can use the quadratic assignment procedure, but you can do something similar with other data structures, including bipartite networks.
A good null keeps everything constant but shows what associations we would expect if association were random. The simplest way to do this is to keep the actual variable vectors but randomly sort one of them. So for instance, you could keep the actual income distribution and the actual values of people’s education, race, etc., but randomly assign actual incomes to people.
Fernandez, Castilla, and Moore used what was basically this approach to build a null distribution of the effects of employment referrals. Since then Ezra Zuckerman has used it in several papers on Hollywood to measure the strength of repeat collaboration. I myself am using it in some of my current radio work to understand how much corporate clustering we’d expect to see in the diffusion of pop songs under the null hypothesis that radio corporations don’t actually practice central coordination.
I wrote a little program that takes as its argument the variable you want shuffled. It serves a similar purpose as bsample, and like bsample it’s best used as part of a loop.
capture program drop shufflevar
program define shufflevar
	local shufflevar `1'
	tempvar oldsortorder
	gen `oldsortorder'=[_n]
	tempvar newsortorder
	gen `newsortorder'=uniform()
	sort `newsortorder'
	capture drop `shufflevar'_shuffled
	gen `shufflevar'_shuffled=`shufflevar'[_n-1]
	replace `shufflevar'_shuffled=`shufflevar'[_N] in 1/1
	sort `oldsortorder'
	drop `newsortorder' `oldsortorder'
end
Here’s an example to show how much clustering of “y” you’d expect to see by “clusterid” if we keep the observed distributions of “y” and “clusterid” but break any association between them:
shell echo "run rho" > _results_shuffled.txt forvalues run=1/1000 { disp "iteration # `run' of 1000" quietly shufflevar clusterid quietly xtreg y, re i(clusterid_shuffled) shell echo "`run' `e(rho)'" >> _results_shuffled.txt } insheet using _results_shuffled.txt, names clear delimiter(" ") histogram rho sum rho
(Note that “shell echo” only works on Mac/Unix; Windows users should try postfile.)
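Here’s a sketch of the same loop rewritten with postfile, carrying over the clusterid and y variable names from the example above (it parallels the postfile example in the help file):

tempname memhold
tempfile results
postfile `memhold' run rho using "`results'"
forvalues run=1/1000 {
	* shuffle the cluster identifier, refit the null model, and store rho
	quietly shufflevar clusterid
	quietly xtreg y, re i(clusterid_shuffled)
	post `memhold' (`run') (`e(rho)')
}
postclose `memhold'
use "`results'", clear
histogram rho
sum rho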
Bootstrapping superstars
| Gabriel |
Most cultural products and cultural workers follow a scale-free or power-law distribution of success, with a tiny handful of ludicrously successful products/workers, a fairly small number with a modicum of success, and a truly ginormous number that are absolute failures. Ever since Sherwin Rosen described it in an influential American Economic Review theory piece, this phenomenon has been nicknamed “the superstar effect.” For a review of the major theories as to why there is a superstar effect, check out this lecture from my undergrad course (mp3 of the lecture and pdf of the slides).
One methodological problem this creates is that if you are interested in describing the overall market share of abstract categories, the measure is highly sensitive to flukes. For instance, say you were interested in drawing trend lines for the sales volumes of different book genres and you noticed that in 2002 there was a decent-sized jump in sales of the genre “Christian.” One interpretation would be that this is a real trend; for instance, you could make up some post hoc explanation that after the 9/11 terrorist attacks people turned to God for comfort. Another interpretation would be that there was no trend and all this reflects is that one book, The Purpose Driven Life, was a surprise hit. Distinguishing statistically between these interpretations is surprisingly hard because it’s difficult (at least for me) to figure out how to model the standard error of a ratio based on an underlying count.
Fortunately you don’t have to because when in doubt about error structure you can just bootstrap it. My solution is to bootstrap on titles then calculate the ratio variable (e.g., genre market share) based on the bootstrapped sample of titles. You can then use the standard deviation of the bootstrapped distribution of ratios as a standard error. To return to our example of book genres, we could bootstrap book titles in 2001 and 2002 and calculate a bootstrapped distribution of estimates of Christian market share for book sales. You then do a t-test of means to see whether 2002 was statistically different from 2001 or whether any apparent difference is just the result of a few fluke hits. In other words, was 2002 a good year for Christian books as a genre, or just a good year for Rick Warren (whose book happened to be in that genre).
Here’s some Stata code to create bootstrapped samples of titles, weight them by sales, and record the proportion of sales with the relevant trait:
clear
set obs 1
gen x=.
save results.dta, replace

use salesdatset, clear
*var desc
* title  -- a title
* sales  -- some measure of success
* trait  -- dummy for some trait of interest for title (eg genre, author gender, etc)
* period -- period of the obs (eg, year)
compress
forvalues i=1/1000 {
	preserve
	bsample, strata(period)
	gen traitsales = trait*sales
	ren sales salestotal
	collapse (sum) traitsales salestotal, by(period)
	gen traitshare=traitsales/salestotal
	drop traitsales salestotal
	gen bs=`i' /*records the run of the bstrap */
	reshape wide traitshare, i(bs) j(period)
	append using results
	save results.dta, replace
	restore
}
use results, clear
sum
*for each traitshare variable, the sd can be interpreted as bstrapped standard error
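To carry out the t-test described above, you can compare the bootstrapped shares for two periods once results.dta is loaded. Here’s a minimal sketch that assumes period is coded as the calendar year, so the reshape created variables like traitshare2001 and traitshare2002; it also drops the empty observation used to initialize results.dta:

use results, clear
drop if missing(bs) /*drop the seed observation used to initialize results.dta */
* paired t-test across bootstrap replicates: was the 2002 share
* statistically different from the 2001 share?
ttest traitshare2002 == traitshare2001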