Posts tagged ‘resampling’

Resampling Approach to Power Analysis

A coauthor and I are doing power analyses based on a pilot test, and we quickly realized that it's really hard to do a power analysis for anything much more exotic than a t-test of means. As I usually do when there's no obvious solution, I decided to just brute force it with a Monte Carlo approach.

The algorithm is as follows:

  1. Start with pilot data
  2. Draw a sample of the target sample size n with replacement, and do this trials times
  3. Run the estimation with each of these resampled datasets and see if enough of them achieve significance (with conventional power and alpha this means 80% have p<.05)

In theory, this approach should allow a power analysis for any estimator, no matter how strange. I was really excited about this and figured we could get a methods piece out of it, until my coauthor pointed out that Green and MacLeod beat us to it. Nonetheless, I figured it's still worth a blog post in case you want to do something that doesn't fit within the lme4 package, which is what Green and MacLeod's simr package wraps the above algorithm around. For instance, below I show how you can do a power analysis for a stratified sample.
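
Here is a minimal sketch of the algorithm as a standalone function. The function name power_resample and the est_fun argument are hypothetical conveniences, not anything from simr; est_fun should take a resampled data frame and return the z (or t) statistic you care about.

library(dplyr)

# Minimal sketch (hypothetical helper): estimate power for sample size n by resampling.
power_resample <- function(pilot, n, trials, est_fun, alpha = 0.05) {
  z_crit <- qnorm(1 - alpha/2) # 1.96 for alpha = .05
  sig <- replicate(trials, {
    resample <- sample_n(pilot, size = n, replace = TRUE)
    abs(est_fun(resample)) >= z_crit
  })
  mean(sig) # proportion of resamples reaching significance, i.e., estimated power
}

# e.g., for the glm example below:
# power_resample(pilot, n = 70, trials = 1000,
#   est_fun = function(d) coef(summary(glm(outcome ~ treatment, data = d, family = "binomial")))["treatment", "z value"])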

Import pre-test data

In a real workflow you would just use some minor variation on pilot <- read_csv("pilot.csv"). But as an illustration, we can assume a pilot study with a binary treatment, a binary DV, and n=6. Let us assume that 1/3 of the control group are positive for the outcome but 2/3 of the treatment group. Note that a real pilot should be bigger – I certainly wouldn’t trust a pilot with n=6.

library(tidyverse) # provides the dplyr, ggplot2, and tidyr functions used throughout

treatment <- c(0,0,0,1,1,1)
outcome <- c(0,0,1,0,1,1)
pilot <- data.frame(treatment,outcome)

I suggest selecting just the variables you need to save memory given that you’ll be making many copies of the data. In this case it’s superfluous as there are no other variables, but I’m including it here to illustrate the workflow.

pilot_small <- pilot %>% select(treatment,outcome)

Display observed results

As a first step, do the analysis on the pilot data itself. This gives you a good baseline and also helps you see what the estimation object looks like. (Unfortunately this varies considerably for different estimation functions).

est_emp <- glm(outcome ~ treatment, data = pilot, family = "binomial") %>% summary()
z_emp <- est_emp[["coefficients"]][,3]
est_emp
## 
## Call:
## glm(formula = outcome ~ treatment, family = "binomial", data = pilot)
## 
## Deviance Residuals: 
##       1        2        3        4        5        6  
## -0.9005  -0.9005   1.4823  -1.4823   0.9005   0.9005  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -0.6931     1.2247  -0.566    0.571
## treatment     1.3863     1.7321   0.800    0.423
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8.3178  on 5  degrees of freedom
## Residual deviance: 7.6382  on 4  degrees of freedom
## AIC: 11.638
## 
## Number of Fisher Scoring iterations: 4

Note that we are most interested in the t or z column. In the case of glm, this is the third column of the coefficient object, but for other estimators it may be a different column, or you may have to create it manually by dividing the estimates column by the standard errors column.
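
If your estimator reports only estimates and standard errors, here is a minimal sketch of building z by hand (the column names are glm's; other estimators label them differently):

est_coefs <- est_emp[["coefficients"]]
z_emp <- est_coefs[,"Estimate"] / est_coefs[,"Std. Error"]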

Now you need to see how the z column appears when conceptualized as a vector. Look at the values and see the corresponding places in the table. I wrapped it in as.vector() because some estimators give z as a matrix.

as.vector(z_emp)
## [1] -0.5659524  0.8003776

In this case, we can see that z_emp[1] is z for the intercept and z_emp[2] is z for the treatment effect. Obviously the latter is more interesting.

Set range of assumptions for resamples

Now we need to set a range of assumptions for the resampling.

trials is the number of times you want to test each sample size. Higher values for trials are slower but make the results more reliable. I suggest starting with 100 or 1000 for exploratory purposes and then going to 10,000 once you’re pretty sure you have a good value and want to confirm it. The arithmetic is much simpler if you stick with powers of ten.


nrange is a vector of sample sizes you want to test out. Note that the z value for your pilot gives you a hint: if it's about 2, you should try sample sizes similar to the pilot's; if it's much smaller than 2, you should try values much bigger.

trials <- 1000 # how many resamples per sample size
nrange <- c(50,60,70,80,90,100) # values of sample size to test

Set up resampled data, do the regressions, and store the results

This is the main part of the script.

It creates the resampled datasets in a list called dflist. The list is initialized empty and dataframes are stored in the list as they are generated.

Z scores and sample sizes are stored in the results data frame.

dflist <- list()
k <- 1 # this object keeps track of which row of the results object to write to
results <- matrix(nrow = length(nrange)*trials, ncol=3) %>% data.frame() #adjust ncol value to be length(as.vector(z_emp))+1
colnames(results) <- c('int','treatment','n') # replace the vector with names for the as.vector(z_emp) positions followed by "n". The names need not match the names in the regression table but should capture the same concepts.

for (i in 1:length(nrange)) {
  dflist[[i]] <- list()
  for (j in 1:trials) {
    dflist[[i]][[j]] <- sample_n(pilot_small,size=nrange[i],replace=T)
    est <- glm(outcome ~ treatment, data = dflist[[i]][[j]], family = "binomial") %>% summary() # adjust the estimation to be similar to whatever you did in the "test estimation" block of code, just using data=dflist[[i]][[j]] instead of data=pilot
    z <- est[["coefficients"]][,3] # you may need to tweak this line if not using glm
    results[k,] <- c(as.vector(z),nrange[i])
    k <- k +1
  }
}

# create vector summarizing each resample as significant (1) or not significant (0)
results$treatment.stars <- 0
results$treatment.stars[abs(results$treatment)>=1.96] <- 1

Interpret results

Distribution of z by sample size

As an optional first step, plot the distributions of z-scores across resamples by sample size.

results %>% ggplot(mapping = aes(x=treatment)) + 
  geom_density(alpha=0.4) + 
  theme_classic() + 
  facet_wrap(~n)

Number of Significant Resamples for Treatment Effect

Next, make a table of what you really want to know, which is how often resamples of a given sample size give you statistical significance. This rate can be interpreted as power.

table(results$n,results$treatment.stars)
##      
##         0   1
##   50  355 645
##   60  252 748
##   70  195 805
##   80  137 863
##   90  105 895
##   100  66 934

As you can see, n=70 seems to give about 80% power. To confirm this and get a more precise value, you’d probably want to run the script again but this time with nrange <- c(67,68,69,70,71,72,73) and trials <- 10000.
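
If you prefer a single power estimate per sample size rather than the raw counts, you can collapse the same results object; a minimal sketch:

results %>% 
  group_by(n) %>% 
  summarise(power = mean(treatment.stars))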

Stratified Samples

50/50 Strata (Common for RCTs)

Note that you can modify the approach slightly to draw stratified resamples. For instance, you might want to ensure an equal number of treatment and control cases in each resample to mirror a 50/50 random assignment design. (This should mostly be an issue for relatively small resamples, as for large resamples you are likely to get very close to the ratio in the pilot test just by chance.)

To do this we modify the algorithm by first splitting the pilot data into treatment and control data frames and then sampling separately from each before recombining, but otherwise using the same approach as before.

pilot_control <- pilot_small %>% filter(treatment==0)
pilot_treatment <- pilot_small %>% filter(treatment==1)

trials_5050 <- 1000 # how many resamples per sample size
nrange_5050 <- c(50,60,70,80,90,100) # values of sample size to test

nrange_5050 <- 2 * round(nrange_5050/2) # ensure all values of nrange are even

dflist_5050 <- list()
k <- 1 # this object keeps track of which row of the results object to write to
results_5050 <- matrix(nrow = length(nrange_5050)*trials_5050, ncol=3) %>% data.frame() #adjust ncol value to be length(as.vector(z_emp))+1
colnames(results_5050) <- c('int','treatment','n') # replace the vector with names for the as.vector(z_emp) positions followed by "n". The names need not match the names in the regression table but should capture the same concepts.


for (i in 1:length(nrange_5050)) {
  dflist_5050[[i]] <- list()
  for (j in 1:trials_5050) {
    dflist_5050[[i]][[j]] <- rbind(sample_n(pilot_control,size=nrange_5050[i]/2,replace=T),
                                   sample_n(pilot_treatment,size=nrange_5050[i]/2,replace=T))
    est <- glm(outcome ~ treatment, 
               data = dflist_5050[[i]][[j]], 
               family = "binomial") %>% 
               summary()
    z <- est[["coefficients"]][,3]
    results_5050[k,] <- c(as.vector(z),nrange_5050[i])
    k <- k +1
  }
}

results_5050$treatment.stars <- 0
results_5050$treatment.stars[abs(results_5050$treatment)>=1.96] <- 1

Number of Significant Resamples for Treatment Effect With 50/50 Strata

table(results_5050$n,results_5050$treatment.stars)
##      
##         0   1
##   50  330 670
##   60  248 752
##   70  201 799
##   80  141 859
##   90   91 909
##   100  63 937

Not surprisingly, the sample size needed for 80% power is still about 70.

One Fixed Stratum and the Other Estimated

Or perhaps you know the size of the sample in one stratum and want to test the necessary size of another stratum. Perhaps a power analysis for a hypothesis specific to stratum one gives n1 as its necessary sample size, but you want a power analysis for a pooled sample where n=n1+n2. Likewise, you may wish to estimate the necessary size of an oversample. Note that if you already have the data for one stratum in hand, you could modify this code to work, but you should just use that stratum's actual data, not resamples of it.

For one fixed and one estimated stratum, let's assume our pilot test is departments A and B from the UCBAdmissions data, that we know we need n=500 for 80% power on some hypothesis specific to department A, and that we are trying to determine how many cases we need from department B in order to pool and analyze them together. I specify a dummy for gender (male) and a dummy for department A (vs. B).

ucb_tidy <- UCBAdmissions %>% 
  as_tibble() %>% 
  uncount(n) %>% # expand to one row per applicant (n is the count column created by as_tibble)
  mutate(male = (Gender=="Male"), 
         admitted = (Admit=="Admitted")) %>% 
  select(male,admitted,Dept)

ucb_A <- ucb_tidy %>% filter(Dept=="A") %>% mutate(depA=1)
ucb_B <- ucb_tidy %>% filter(Dept=="B") %>% mutate(depA=0)

n_A <- 500
trials_B <- 1000 # how many resamples per sample size
nrange_B <- c(50,100,150,200,250,300,350,400,450,500,550,600,650,700,750,800,850,900,950,1000) # values of sample size to test

dflist_B <- list()
k <- 1 # this object keeps track of which row of the results object to write to
results_B <- matrix(nrow = length(nrange_B)*trials_B, ncol=4) %>% data.frame() #adjust ncol value to be length(as.vector(z_emp))+1
colnames(results_B) <- c('int','male','depA','n') # replace the vector with names for the as.vector(z_emp) positions followed by "n". The names need not match the names in the regression table but should capture the same concepts.

for (i in 1:length(nrange_B)) {
  dflist_B[[i]] <- list()
  for (j in 1:trials_B) {
    dflist_B[[i]][[j]] <- rbind(sample_n(ucb_B,size=nrange_B[i],replace=T),
                                sample_n(ucb_A,size=n_A,replace=T))
    est <- glm(admitted ~ male + depA, data = dflist_B[[i]][[j]], family = "binomial") %>% summary()
    z <- est[["coefficients"]][,3]
    results_B[k,] <- c(as.vector(z),nrange_B[i])
    k <- k +1
  }
}

# create vector summarizing each resample as significant (1) or not significant (0) for the gender effect

results_B$male.stars <- 0
results_B$male.stars[abs(results_B$male)>=1.96] <- 1

Number of Significant Resamples for Gender Effect With Unbalanced Strata

table(results_B$n,results_B$male.stars)
##       
##          0   1
##   50   108 892
##   100  132 868
##   150  128 872
##   200  120 880
##   250  144 856
##   300  133 867
##   350  153 847
##   400  194 806
##   450  181 819
##   500  159 841
##   550  186 814
##   600  168 832
##   650  185 815
##   700  197 803
##   750  201 799
##   800  194 806
##   850  191 809
##   900  209 791
##   950  183 817
##   1000 214 786

This reveals a tricky pattern. We see about 90% power when there are either 50 or 100 cases from department B (i.e., 550-600 total including the 500 from department A). With trials_B <- 1000 it’s a bit noisy but still apparent that the power drops as we add cases from B and then rises again along a U-shaped curve.
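
One way to see the pattern more clearly is to plot estimated power against the size of the department B resample; a minimal sketch using the results_B object from above:

results_B %>% 
  group_by(n) %>% 
  summarise(power = mean(male.stars)) %>% 
  ggplot(mapping = aes(x = n, y = power)) + 
  geom_line() + 
  geom_hline(yintercept = 0.8, linetype = "dashed") + # conventional 80% power
  theme_classic()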

Normally you’d expect that more sample size would mean more statistical power because standard error is inversely proportional to the square root of degrees of freedom. The trick is that this assumes nothing happens to β. As it happens, UCBAdmissions is a famous example of Simpson’s Paradox and specifically the gender effects are much stronger for department A …

glm(admitted ~ male, data = ucb_A, family = "binomial") %>% summary()
## 
## Call:
## glm(formula = admitted ~ male, family = "binomial", data = ucb_A)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8642  -1.3922   0.9768   0.9768   0.9768  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.5442     0.2527   6.110 9.94e-10 ***
## maleTRUE     -1.0521     0.2627  -4.005 6.21e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1214.7  on 932  degrees of freedom
## Residual deviance: 1195.7  on 931  degrees of freedom
## AIC: 1199.7
## 
## Number of Fisher Scoring iterations: 4

… than the gender effects are for department B.

glm(admitted ~ male, data = ucb_B, family = "binomial") %>% summary()
## 
## Call:
## glm(formula = admitted ~ male, family = "binomial", data = ucb_B)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5096  -1.4108   0.9607   0.9607   0.9607  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   0.7538     0.4287   1.758   0.0787 .
## maleTRUE     -0.2200     0.4376  -0.503   0.6151  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 769.42  on 584  degrees of freedom
## Residual deviance: 769.16  on 583  degrees of freedom
## AIC: 773.16
## 
## Number of Fisher Scoring iterations: 4

Specifically, department A strongly prefers to admit women whereas department B has only a weak preference for admitting women. The pooled model has a dummy to account for department A generally being much less selective than department B, but it tacitly assumes that the gender effect is the same in both departments, since it has no interaction term. This means that as we increase the size of the department B resample, we're effectively flattening the gender slope through compositional shifts towards department B and its weaker preference for women.
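
One way to check that assumption, though it is not part of the power workflow itself, is to refit the pooled model with a gender-by-department interaction; a minimal sketch:

glm(admitted ~ male*depA, data = rbind(ucb_A, ucb_B), family = "binomial") %>% summary()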


December 3, 2021 at 2:14 pm

Shufflevar update

| Gabriel |

Thanks to Elizabeth Blankenspoor (Michigan) I corrected a bug with the “cluster” option in shufflevar. I’ve submitted the update to SSC but it’s also here:

*1.1 GHR January 24, 2011

*changelog
*1.1 -- fixed bug that let one case per "cluster" be misallocated (thanks to Elizabeth Blankenspoor)

capture program drop shufflevar
program define shufflevar
	version 10
	syntax varlist(min=1) [ , Joint DROPold cluster(varname)]
	tempvar oldsortorder
	gen `oldsortorder'=[_n]
	if "`cluster'"!="" {
		local bystatement "by `cluster': "
	}
	else {
		local bystatement ""
	}
	if "`joint'"=="joint" {
		tempvar newsortorder
		gen `newsortorder'=uniform()
		sort `cluster' `newsortorder'
		foreach var in `varlist' {
			capture drop `var'_shuffled
			quietly {
				`bystatement' gen `var'_shuffled=`var'[_n-1]
				`bystatement' replace `var'_shuffled=`var'[_N] if _n==1
			}
			if "`dropold'"=="dropold" {
				drop `var'
			}
		}
		sort `oldsortorder'
		drop `newsortorder' `oldsortorder'
	}
	else {
		foreach var in `varlist' {
			tempvar newsortorder
			gen `newsortorder'=uniform()
			sort `cluster' `newsortorder'
			capture drop `var'_shuffled
			quietly {
				`bystatement' gen `var'_shuffled=`var'[_n-1]
				`bystatement' replace `var'_shuffled=`var'[_N] if _n==1
			}
			drop `newsortorder'
			if "`dropold'"=="dropold" {
				drop `var'
			}
		}
		sort `oldsortorder'
		drop `oldsortorder'
	}
end

January 24, 2011 at 7:30 pm 6 comments

Shufflevar [update]

| Gabriel |

I wrote a more flexible version of shufflevar (here’s the old version) and an accompanying help file. The new version allows shuffling an entire varlist (rather than just one variable) jointly or independently and shuffling within clusters. The easiest way to install the command is to type:
ssc install shufflevar
For thoughts on this and related algorithms, see my original shufflevar post and/or the lecture notes on bootstrapping for my grad stats class.
Here’s the code.

*1.0 GHR January 29, 2010
capture program drop shufflevar
program define shufflevar
	version 10
	syntax varlist(min=1) [ , Joint DROPold cluster(varname)]
	tempvar oldsortorder
	gen `oldsortorder'=[_n]
	if "`joint'"=="joint" {
		tempvar newsortorder
		gen `newsortorder'=uniform()
		sort `cluster' `newsortorder'
		foreach var in `varlist' {
			capture drop `var'_shuffled
			quietly gen `var'_shuffled=`var'[_n-1]
			quietly replace `var'_shuffled=`var'[_N] in 1/1
			if "`dropold'"=="dropold" {
				drop `var'
			}
		}
		sort `oldsortorder'
		drop `newsortorder' `oldsortorder'
	}
	else {
		foreach var in `varlist' {
			tempvar newsortorder
			gen `newsortorder'=uniform()
			sort `cluster' `newsortorder'
			capture drop `var'_shuffled
			quietly gen `var'_shuffled=`var'[_n-1]
			quietly replace `var'_shuffled=`var'[_N] in 1/1
			drop `newsortorder'
			if "`dropold'"=="dropold" {
				drop `var'
			}
		}
		sort `oldsortorder'
		drop `oldsortorder'
	}
end

And here’s the help file.

{smcl}
{* 29jan2010}{...}
{hline}
help for {hi:shufflevar}
{hline}

{title:Randomly shuffle variables}

{p 8 17 2}
{cmd:shufflevar} {it:varlist}[, {cmdab: Joint DROPold cluster}({it:varname})]

{title:Description}

{p 4 4 2}
{cmd:shufflevar} takes {it:varlist} and either jointly or for each variable
shuffles {it:varlist} relative to the rest of the dataset. This means any
association between {it:varlist} and the rest of the dataset will be random.
Much like {help bootstrap} or the Quadratic Assignment Procedure (QAP), one
can build a distribution of results out of randomness to serve as a baseline
against which to compare empirical results, especially for overall model-fit
or clustering measures.

{title:Remarks}

{p 4 4 2}
The program is intended for situations where it is hard to model error
formally, either because the parameter is exotic or because the application
violates the parameter's assumptions. For instance, the algorithm has been
used by Fernandez et al. and Zuckerman to interpret network data, the author
wrote this implementation for use in interpreting {help st} frailty models
with widely varying cluster sizes, and others have suggested using the metric
for adjacency matrices in spatial analysis.

{p 4 4 2}
Much like {help bsample}, the {cmd:shufflevar} command is only really useful
when worked into a {help forvalues} loop or {help program} that records the
results of each iteration using {help postfile}. See the example code below to
see how to construct the loop.

{p 4 4 2}
To avoid confusion with the actual data, the shuffled variables are renamed
{it:varname}_shuffled.

{p 4 4 2}
This command is an implementation of an algorithm used in two papers that used
it to measure network issues:

{p 4 4 2}
Fernandez, Roberto M., Emilio J. Castilla, and Paul Moore. 2000. "Social
Capital at Work: Networks and Employment at a Phone Center." {it:American Journal of Sociology} 105:1288-1356.

{p 4 4 2}
Zuckerman, Ezra W. 2005. "Typecasting and Generalism in Firm and Market: Genre-Based Career Concentration in the Feature Film Industry, 1935-1995." {it:Research in the Sociology of Organizations} 23:173-216.

{title:Options}

{p 4 8 2}
{cmd:joint} specifies that the variables in {it:varlist} will keep their actual relations to
one another even as they are shuffled relative to the rest of the variables.
If {cmd:joint} is omitted, each variable in the {it:varlist} will be
shuffled separately.

{p 4 8 2}
{cmd:dropold} specifies that the original (unshuffled) versions of {it:varlist}
will be dropped.

{p 4 8 2}
{cmd:cluster}({it:varname}) specifies that shuffling will occur by {it:varname}.

{title:Examples}

{p 4 8 2}{cmd:. sysuse auto, clear}{p_end}
{p 4 8 2}{cmd:. regress price weight}{p_end}
{p 4 8 2}{cmd:. local obs_r2=`e(r2)'}{p_end}
{p 4 8 2}{cmd:. tempname memhold}{p_end}
{p 4 8 2}{cmd:. tempfile results}{p_end}
{p 4 8 2}{cmd:. postfile `memhold' r2 using "`results'"}{p_end}
{p 4 8 2}{cmd:. forvalues i=1/100 {c -(}}{p_end}
{p 4 8 2}{cmd:. 	shufflevar weight, cluster(foreign)}{p_end}
{p 4 8 2}{cmd:. 	quietly regress price weight_shuffled}{p_end}
{p 4 8 2}{cmd:. 	post `memhold' (`e(r2)')}{p_end}
{p 4 8 2}{cmd:. {c )-}}{p_end}
{p 4 8 2}{cmd:. postclose `memhold'}{p_end}
{p 4 8 2}{cmd:. use "`results'", clear}{p_end}
{p 4 8 2}{cmd:. sum r2}{p_end}
{p 4 8 2}{cmd:. disp "The observed R^2 of " `obs_r2' " is " (`obs_r2'-`r(mean)')/`r(sd)' " sigmas out on the" _newline "distribution of shuffled R^2s."}{p_end}

{title:Author}

{p 4 4 2}Gabriel Rossman, UCLA{break}
rossman@soc.ucla.edu

{title:Also see}

{p 4 13 2}On-line:
help for {help bsample},
help for {help forvalues},
help for {help postfile},
help for {help program}

February 1, 2010 at 4:33 am 1 comment

Shufflevar

| Gabriel |

[Update: I’ve rewritten the command to be more flexible and posted it to ssc. to get it type “ssc install shufflevar”. this post may still be of interest for understanding how to apply the command].

Sometimes you face a situation where it's really hard to see what the null is because the data structure is really complicated and there are all sorts of nonlinearities, etc. Analyses of non-sparse square network matrices can use the quadratic assignment procedure, but you can do something similar with other data structures, including bipartite networks.

A good null keeps everything constant but shows what associations we would expect if the association were random. The simplest way to do this is to keep the actual variable vectors but randomly sort one of the vectors. So, for instance, you could keep the actual income distribution and the actual values of people's education, race, etc., but randomly assign actual incomes to people.

Fernandez, Castilla, and Moore used what was basically this approach to build a null distribution of the effects of employment referrals. Since then Ezra Zuckerman has used it in several papers on Hollywood to measure the strength of repeat collaboration. I myself am using it in some of my current radio work to understand how much corporate clustering we’d expect to see in the diffusion of pop songs under the null hypothesis that radio corporations don’t actually practice central coordination.

I wrote a little program that takes as its argument the variable you want shuffled. It has a similar application to bsample, and like bsample it's best used as part of a loop.

capture program drop shufflevar
program define shufflevar
  local shufflevar `1'
  tempvar oldsortorder
  gen `oldsortorder'=[_n]
  tempvar newsortorder
  gen `newsortorder'=uniform()
  sort `newsortorder'
  capture drop `shufflevar'_shuffled
  gen `shufflevar'_shuffled=`shufflevar'[_n-1]
  replace `shufflevar'_shuffled=`shufflevar'[_N] in 1/1
  sort `oldsortorder'
  drop `newsortorder' `oldsortorder'
end

Here’s an example to show how much clustering of “y” you’d expect to see by “clusterid” if we keep the observed distributions of “y” and “clusterid” but break any association between them:

shell echo "run rho" > _results_shuffled.txt

forvalues run=1/1000 {
  disp "iteration # `run' of 1000"
  quietly shufflevar clusterid
  quietly xtreg y, re i(clusterid_shuffled)
  shell echo "`run' `e(rho)'" >> _results_shuffled.txt
}

insheet using _results_shuffled.txt, names clear delimiter(" ")
histogram rho
sum rho

(Note that “shell echo” only works with Mac/Unix, Windows users should try postfile).

October 26, 2009 at 5:09 am 3 comments

Bootstrapping superstars

| Gabriel |

Most cultural products and cultural workers follow a scale-free or power-law distribution for success, with a tiny handful of ludicrously successful products/workers, a fairly small number with a modicum of success, and a truly ginormous number that are absolute failures. Ever since Sherwin Rosen described this phenomenon in an influential American Economic Review theory piece, it has been nicknamed "the superstar effect." For a review of the major theories as to why there is a superstar effect, check out this lecture from my undergrad course (mp3 of the lecture and pdf of the slides).

One methodological problem this creates is that if you are interested in describing the overall market share of abstract categories, the measure is highly sensitive to flukes. For instance, say you were interested in drawing trend lines for the sales volumes of different book genres and you noticed that in 2002 there was a decent-sized jump in sales of the genre "Christian." One interpretation would be that this is a real trend; for instance, you could make up some post hoc explanation that after the 9/11 terrorist attacks people turned to God for comfort. Another interpretation would be that there was no trend and all this reflects is that one book, The Purpose Driven Life, was a surprise hit. Distinguishing statistically between these interpretations is surprisingly hard because it's very hard (at least for me) to figure out how to model the standard error of a ratio based on an underlying count.

Fortunately you don’t have to because when in doubt about error structure you can just bootstrap it. My solution is to bootstrap on titles then calculate the ratio variable (e.g., genre market share) based on the bootstrapped sample of titles. You can then use the standard deviation of the bootstrapped distribution of ratios as a standard error. To return to our example of book genres, we could bootstrap book titles in 2001 and 2002 and calculate a bootstrapped distribution of estimates of Christian market share for book sales. You then do a t-test of means to see whether 2002 was statistically different from 2001 or whether any apparent difference is just the result of a few fluke hits. In other words, was 2002 a good year for Christian books as a genre, or just a good year for Rick Warren (whose book happened to be in that genre).

Here’s some Stata code to create bootstrapped samples of titles, weight them by sales, and record the proportion of sales with the relevant trait:

clear
set obs 1
gen x=.
save results.dta, replace
use salesdatset, clear
*var desc
*  title  -- a title
*  sales  -- some measure of success
*  trait  -- dummy for some trait of interest for title (eg genre, author gender, etc)
*  period -- period of the obs (eg, year)
compress
forvalues i=1/1000 {
 preserve
 bsample, strata (period)
 gen traitsales = trait*sales
 ren sales salestotal
 collapse (sum) traitsales salestotal, by (period)
 gen traitshare=traitsales/salestotal
 drop traitsales salestotal
 gen bs=`i' /*records the run of the bstrap */
 reshape wide traitshare, i(bs) j(period)
 append using results
 save results.dta, replace
 restore
}
use results, clear
sum
*for each traitshare variable, the sd can be interpreted as bstrapped standard error

March 19, 2009 at 7:25 pm

