Posts tagged ‘loops’

Recursively building graph commands

| Gabriel |

I really like multi-line graphs and scatterplots where the marker color/style reflects categories. Such graphs are both more compact than just having multiple graphs and they make it easier to compare different things. The way you do this is with “twoway,” a lot of parentheses, and the “if” condition. For example:

twoway (kdensity x if year==1985) (kdensity x if year==1990) (kdensity x if year==1995) (kdensity x if year==2000) (kdensity x if year==2005), legend(order(1 "1985" 2 "1990" 3 "1995" 4 "2000" 5 "2005"))

Unfortunately such graphs can be difficult to script. This is especially so for the legends, which by default show the variable name rather than the selection criteria. I handle this by recursively looping over a local, which in the case of the legend involves embedded quote marks.

capture program drop multigraph
program define multigraph
	local var         `1'
	local interval	  `2'

	local command ""
	local legend ""
	local legendtick=1
	forvalues i=1985(`interval')2005 {
		local command "`command' (kdensity `var' if year==`i')"
		local legend `legend' `legendtick' `" `i' "'
		*"
		local legendtick=`legendtick'+1
	}
	disp "twoway ""`command'" ", legend:" "`legend'"
	twoway `command' , legend(order(`legend'))
end

To replicate the hard-coded command above, you’d call it like this:

multigraph x 5
Advertisement

March 8, 2012 at 9:56 am 2 comments

Scraping Using twitteR (updated)

| Gabriel |

Last time I described using the twitteR library for R. In that post I had R itself read over a list to loop. In this post I make the have looping occur in Bash with argument passing to R through the commandArgs() function.

First, one major limitation of using Twitter is that it times you out for an hour after 150 queries. (You can double this if you use OAuth but I’ve yet to get that to work). For reasons I don’t really understand, getting one feed can mean multiple queries, especially if you’re trying to go far back in the timeline. For this reason you need to break up your list into a bunch of small lists and cron them at least 80 minutes apart. This bit of Bash code will split up a file called “list.txt” into several files. Also, to avoid trouble later on, it makes sure you have Unix EOL.

split -l 50 list.txt short_tw
perl -pi -e 's/\r\n/\n/g' short_tw*

The next thing to keep in mind is that you’ll need to pass arguments to R. Argument passing is when a script takes input from outside the script and processes it as variables. The enthusiastic use of argument passing in Unix is the reason why there is a fine line between a file and a command in that operating system.

In theory you could have R read the target list itself but this crashes when you hit your first dead URL. Running the loop from outside R makes it more robust but this requires passing arguments to R. I’d previously solved this problem by having Stata write an entire R script, which Stata understood as having variables (or “macros”) but which from R’s perspective was hard-coded. However I was recently delighted to discover that R can accept command-line arguments with the commandArgs() function. Not surprisingly, this is more difficult than $1 in Bash, @ARGV in Perl, or `1′ in Stata, but it’s not that bad. To use it you have to use the “–args” option when invoking R and then inside of R you use the commandArgs() function to pass arguments to an array object, which behaves just like the @ARGV array in Perl.

Here’s an R script that accepts a Twitter screenname as a command-line argument, uses the twitteR library to collect that feed, and then saves it as a tab-delimited text file of the same name. (It appends if there’s an existing file). Also note that (thanks to commenters on the previous post) it turns internal EOL into regular spaces. It’s currently set to collect the last 200 tweets but you can adjust this with the third line (or you could rewrite the script to make this a command-line argument as well).

args <- commandArgs(trailingOnly = TRUE)
library(twitteR)
howmany <- 200 #how many past tweets to collect

user <- args[1]
outputfile <- paste('~/project/feeds/',user,'.txt',sep="")
print(user)
print(outputfile)

tmptimeline <- userTimeline(user,n=as.character(howmany))
tmptimeline.df <- twListToDF(tmptimeline)
tmptimeline.df$text <- gsub("\\n|\\r|\\t", " ", tmptimeline.df$text)
write.table(tmptimeline.df,file=outputfile,append=TRUE,sep="\t",col.names=FALSE)

quit()

To use the script to get just a single feed, you invoke it like this from the command-line.

R --vanilla --args asanews < datacollection.R

Of course the whole reason to write the script this way is to loop it over the lists. Here it is for the list “short_twaa”.

for i in `cat short_twaa`; do R --vanilla --args $i < datacollection.R ; done

Keep in mind that you’ll probably want to cron this, either because you want a running scrape or because it makes it easier to space put the “short_tw*” files so you don’t get timed out.

December 20, 2011 at 11:04 pm 1 comment

Another R bleg about loops and graph devices

| Gabriel |

I’m having another problem with R and was hoping somebody could help me out in the comments. Long story short, I can make graphs properly if I do them one at a time, but when I try to loop it I get device errors. In particular it creates the PDFs but they are either empty or corrupted.

Here is the log, which first fails to do it with a loop, then does it right for one case where I manually assign the variables rather than looping.

> # File-Name:       surfacegraphs.R                 
> # Date:            2011-04-25
> # Author:          Gabriel Rossman                                       
> # Purpose:         graph from Stata
> # Packages Used:   lattice   
> # note, wireframe code from lisa 
> timestamp()
##------ Tue Apr 26 10:44:09 2011 ------##
> library(lattice)
> 
> histopath <- '~/Documents/project/histograms'
> image2 <- '~/Documents/project/images/histograms'
> 
> timestamp()
##------ Tue Apr 26 10:44:10 2011 ------##
> 
> #create surface histograms, showing how population evolves over time
> #  parameters held constant
> setwd(histopath)
> for(d in 0:10) {
+ 	for(p in 0:5) {
+ 		d10 <- d*10
+ 		p100 <- p*100
+ 		datafile <- paste(histopath,'/d',d10,'p',p100,'.txt', sep="")
+ 		dataobject <- read.table(file=datafile,header=TRUE)
+ 		pdfcolor <- paste(image2,'/hist_color_d',d10,'p',p100,'.pdf', sep="")
+ 		pdfgrey <- paste(image2,'/hist_grey_d',d10,'p',p100,'.pdf', sep="")
+ 		pdf(pdfcolor)
+ 		wireframe( dataobject$z~dataobject$x*dataobject$y, shade=TRUE) 
+ 		dev.off()
+ 		
+ 		pdf(pdfgrey)
+ 		wireframe( dataobject$z~dataobject$x*dataobject$y, shade=TRUE, par.settings=standard.theme(color=FALSE))
+ 		dev.off()
+ 	}
+ }
There were 50 or more warnings (use warnings() to see the first 50)
> timestamp()
##------ Tue Apr 26 10:44:12 2011 ------##
> 
> #loop doesn't work
> #  seems to be the dev.off()
> #try a few manually
> d10 <- 0
> p100 <- 0
> datafile <- paste(histopath,'/d',d10,'p',p100,'.txt', sep="")
> dataobject <- read.table(file=datafile,header=TRUE)
> pdfcolor <- paste(image2,'/hist_color_d',d10,'p',p100,'.pdf', sep="")
> pdfgrey <- paste(image2,'/hist_grey_d',d10,'p',p100,'.pdf', sep="")
> pdf(pdfcolor)
> wireframe( dataobject$z~dataobject$x*dataobject$y, shade=TRUE) 
> dev.off()
null device 
          1 
> 
> 
> timestamp()
##------ Tue Apr 26 10:44:14 2011 ------##
> 
> 

The warnings start like this and go on from there:

> warnings()
Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
3: In min(x) : no non-missing arguments to min; returning Inf
4: In max(x) : no non-missing arguments to max; returning -Inf
5: In min(x) : no non-missing arguments to min; returning Inf

Any ideas?
Do I just need to hard-code the graphs I really want rather than batching them?

[Update]
As Michal suggested, I needed to wrap wireframe in print. Here’s an example of the output (for a baseline simulation).
hist_color_d0p0

April 27, 2011 at 4:51 am 7 comments

Shufflevar update

| Gabriel |

Thanks to Elizabeth Blankenspoor (Michigan) I corrected a bug with the “cluster” option in shufflevar. I’ve submitted the update to SSC but it’s also here:

*1.1 GHR January 24, 2011

*changelog
*1.1 -- fixed bug that let one case per "cluster" be misallocated (thanks to Elizabeth Blankenspoor)

capture program drop shufflevar
program define shufflevar
	version 10
	syntax varlist(min=1) [ , Joint DROPold cluster(varname)]
	tempvar oldsortorder
	gen `oldsortorder'=[_n]
	if "`cluster'"!="" {
		local bystatement "by `cluster': "
	}
	else {
		local bystatement ""
	}
	if "`joint'"=="joint" {
		tempvar newsortorder
		gen `newsortorder'=uniform()
		sort `cluster' `newsortorder'
		foreach var in `varlist' {
			capture drop `var'_shuffled
			quietly {
				`bystatement' gen `var'_shuffled=`var'[_n-1]
				`bystatement' replace `var'_shuffled=`var'[_N] if _n==1
			}
			if "`dropold'"=="dropold" {
				drop `var'
			}
		}
		sort `oldsortorder'
		drop `newsortorder' `oldsortorder'
	}
	else {
		foreach var in `varlist' {
			tempvar newsortorder
			gen `newsortorder'=uniform()
			sort `cluster' `newsortorder'
			capture drop `var'_shuffled
			quietly {
				`bystatement' gen `var'_shuffled=`var'[_n-1]
				`bystatement' replace `var'_shuffled=`var'[_N] if _n==1
			}
			drop `newsortorder'
			if "`dropold'"=="dropold" {
				drop `var'
			}
		}
		sort `oldsortorder'
		drop `oldsortorder'
	}
end

January 24, 2011 at 7:30 pm 6 comments

Keep the best 5 (updated)

| Gabriel |

Last year I mentioned my policy of assigning about seven quizzes and then keeping the best 5. I then had a real Rube Goldberg-esque workflow that involved piping to Perl. Several people came up with simpler ideas in the comments, but the most “why didn’t I think of that” was definitely John-Paul Ferguson’s suggestions to just use reshape. Now that I’m teaching the class again, I’ve rewritten the script to work on that logic.

Also, I’ve made the script a bit more flexible by allowing it to specify in the header how many quizzes were offered and how many to keep. To make this work I made a loop that builds a local called sumstring.

[UPDATE 11/29/2010, applied Nick Cox’s suggestions. Old code remains but is commented out]

local numberofquizzes 6
local keepbest 5

*import grades, which look like this
*uid    name    mt  q1  q2  q3
*5001   Joe     40  5   4   6
*4228   Alex    20  6   3   5
insheet using grades.txt, clear
*rescale the quizzes from raw points to proportion 
forvalues qnum=1/`numberofquizzes' {
	quietly sum q`qnum'
	replace q`qnum'=q`qnum'/`r(max)'
}
/*
*build the sumstring local (original code)
local sumstring ""
forvalues i=1/`keepbest' {
	local sumstring "`sumstring' + q`i'"
	disp "`sumstring'"
	local sumstring=subinstr("`sumstring'","+","",1)
	disp "`sumstring'"
}
*/
*reshape long, keep top few quizzes
reshape long q, i( notes uid name mt) j(qnum)
recode q .=0
gsort uid -q
by uid: drop if _n>`keepbest'
by uid: replace qnum=_n
*reshape wide, calc average
reshape wide q, i(notes uid name mt) j(qnum)
*build the sumstring local (w/ Nick Cox's suggestions)
unab sumstring : q* 
disp "`sumstring'"
local sumstring : subinstr local sumstring " " "+", all
disp "`sumstring'"
gen q_avg=(`sumstring')/`keepbest'
sort name
sum q_avg

*have a nice day

November 24, 2010 at 4:24 am 4 comments

Stata Programming Lecture [updated]

| Gabriel |

I gave my introduction to Stata programming lecture again. This time the lecture was cross-listed between my graduate statistics course and the UCLA ATS faculty seminar series. Here are the lecture notes.

October 14, 2010 at 3:50 pm 3 comments

Thanks for the Aspirin Guys

| Gabriel |

In a recent post, I lamented that I couldn’t figure out how to do loops and a few other things in R. The script at issue was intended to create a slideshow of network graphs which I’d then use Image Magick to convert to an animated gif. With some help from Brian and Kieran, I ultimately got it to work and the results are in this post (as well as a presentation at a mini-conference I made a few days later).

Kieran very generously further improved on the code at his own blog using simulated data. Kieran’s code has a certain elegance to it, but it’s an R style of elegance based on objects and lists and the like so it’s not completely intuitive to me, despite his thorough commenting. Anyway, I worked his suggestions into my code (see below) and it works great.

Since Kieran was “interested to see whether this approach was at all faster,” I ran the old version and the new version with timestamps. (To make it fair, I commented out the “pdf” version of the flipbook, which only appears in the “old” code — this saves about one second). Turns out they’re almost exactly the same speed — about 26 or 27 seconds (it varies about a second or two every time I do it, sometimes new is faster, sometimes old). The vast majority of that half a minute is taken up by Image Magick. The actual R code only takes about 7 seconds to execute in either version. I think that’s amazingly fast to import 60 or so datasets (albeit small datasets), generate two different FR layouts, apply the layouts to 60 or so graphs (albeit graphs that are written directly to disk rather than to screen), and do miscellaneous housekeeping. I haven’t run a timestamped comparison, but my impression is that this is much faster than comparable operations in Pajek or NWB and appreciably faster than doing scatterplots in Stata.

# File-Name:       chrnetwork.R                 
# Date:            2010-03-11
# Created Date:    2009-11-24                               
# Author:          Gabriel Rossman (w a fair amount of help from BR and KJH)           
# Purpose:         graph CHR station network
# Data Used:       ties_bounded.net
# Packages Used:   igraph plyr   
timestamp()
library(igraph)
library(plyr)
setwd("~/Documents/Sjt/radio/survey")
#ties -- including ties to non-top40 (who can't reciprocate)
chrnet <- read.graph("ties.net", c("pajek"))
pdf("~/Documents/book/images/chrnetwork.pdf")
 plot.igraph(chrnet, layout=layout.fruchterman.reingold, vertex.size=4, vertex.label=NA, vertex.color="red", edge.color="gray20", edge.arrow.size=0.3, margin=0)
dev.off()
#ties bounded to only top 40, includes adoption time color-codes, but use is optional
chrnetbounded <- read.graph("ties_bounded_humpcolor.net", c("pajek"))
la = layout.fruchterman.reingold(chrnetbounded)  #create layout for use on several related graphs
#graph structure only
pdf("~/Documents/book/images/chrnetworkbounded.pdf")
 plot.igraph(chrnetbounded, layout=la, vertex.size=4, vertex.label=NA, vertex.color="red", edge.color="gray20", edge.arrow.size=0.3, margin=0)
dev.off()
#graph color coded diffusion
pdf("~/Documents/book/images/chrnetworkboundedcolor.pdf")
 plot.igraph(chrnetbounded, layout=la, vertex.size=4, vertex.label=NA, edge.color="gray80", edge.arrow.size=0.3, margin=0)
dev.off()
#flipbook
timestamp()
setwd("~/Documents/Sjt/radio/survey/flipbook")
filenames <- list.files(getwd(), pattern="\\.net$")
ind <- order(as.numeric(gsub("[^[:digit:]]", "", filenames)))
filenames <- filenames[ind]
g.list <- llply(filenames, read.graph, format="pajek")
png(file="~/Documents/book/images/flipbook/chrnet_hc%d.png", width=600, height=600, units="px") 
l_ply(g.list, plot, layout=la, vertex.size=4, vertex.label=NA, edge.color="gray60", edge.arrow.size=0.3, margin=0)
dev.off()
timestamp()
### Finally, generate the gif without having to renumber the files
### individually.
png.filenames <- list.files("~/Documents/book/images/flipbook/", pattern="\\.png$")
timestamp()
## Resort properly again
ind <- order(as.numeric(gsub("[^[:digit:]]", "", png.filenames)))
png.filenames <- png.filenames[ind]
png.filenames <- paste("~/Documents/book/images/flipbook/", png.filenames, sep="") # add the path back
png.string <- capture.output(cat(png.filenames))
## Open a pipe to the shell to execute the convert command
## directly.
timestamp()
gifpipe <- pipe(paste("convert", png.string, "~/Documents/book/images/flipbook/chrnet_humps.gif", sep=" "), "w")
close(gifpipe)
timestamp()

March 16, 2010 at 4:31 am 2 comments

More R headaches

| Gabriel |

I’ve continued to play with the R package igraph and I think I’ve gotten the hang of igraph itself but R itself is still pretty opaque to me and that limits my ability to use igraph. Specifically, I’m trying to merge diffusion data onto a network graph and plot it as kind of a slideshow, where each successive image is a new period. I can actually do this, but I’m having trouble looping it and so my code is very repetitive. [Update, thanks to Brian Rubineau I’ve gotten it to work]

First the two tricks that I have solved.

1. Merging the diffusion data (a vertice-level trait) onto the network.

The preferred way to do this is to read in the network file to R then merge in the vertice-level trait (or more technically, read the vertice data and associate it with the network data). This has proven really difficult to me so instead I wrote an alternate version of stata2pajek that let’s me do this within Stata. The upside is that I spend more time in Stata and less time in R, the downside to this is that I need a separate “.net” file for every version of the graph.

2. Getting the different versions of the graph to appear comparable.

It does no good to have multiple versions of a graph if they don’t look at all similar, which is what you get if you generate the layout on the fly as part of each plot. The solution to this is to first generate a layout (the object “la”) then apply it to each graph by defining the “layout” parameter to “=la” within the “plot.igraph()” function.

So instead of this:

plot.igraph(chrnetbounded, layout=layout.fruchterman.reingold, vertex.size=4, vertex.label=NA, vertex.color="red", edge.color="gray20", edge.arrow.size=0.3, margin=0)

Do this:

la = layout.fruchterman.reingold(chrnetbounded)
plot.igraph(chrnetbounded, layout=la, vertex.size=4, vertex.label=NA, vertex.color="red", edge.color="gray20", edge.arrow.size=0.3, margin=0)

Now the problem I can’t solve, at least with anything approaching elegance.

3. Looping it.

Currently my code is really long and repetitive. I want one graph per week for a whole year. To make 52 versions of the graph, I need about two hundred lines of code, whereas I should be able to just loop my core four lines of code 52 times — pdf(); read.graph();plot.igraph();dev.off(). This works, but is ridiculous.

The thing is that I can’t figure out how to make a loop in R where the looping local feeds to be part of a filename. This kind of thing is trivially easy in Stata since Stata expands locals and then interprets them. For example, here’s how I would do this in Stata. (I’m using “twoway scatter” as a placeholder for “plot.igraph()”, which of course doesn’t exist in Stata).

forvalues i=0/52 {
	use ties_bounded`i', clear
	twoway scatter x y
	graph export chrnet_hc`i'.png, replace
}

R not only doesn’t let you do this directly, but I can’t even figure out how to do it by adding an extra step where the looped local feeds into a new object to write the filename. (The problem is that the “paste()” function adds whitespace). So basically I’m at an impasse and I see three ways to do it:

  • Wait for somebody to tell me in the comments how to do this kind of loop in R. (hint, hint).
  • Give up on writing this kind of loop and just resign myself to writing really repetitive R code.
  • Give up on using igraph from within R and learn to use it from within Python. I have basically zero experience using Python but it has a good reputation for usability. In fact, it only took me, a complete Python-noob, about ten minutes to figure out what’s been a real stumper in R. I haven’t yet worked igraph into this loop but I’m thinking it can’t be that hard.
for i in range(0,52):
     datafile='ties_bounded%d' % i
     # igraph code here that treats the python object "datafile" as a filename

Update
So thanks to Brian Rubineau’s suggestion on how to better use the “paste()” function, which is functionally equivalent to the second line of the Python code above. The catch is that you have to add a “, sep=”” ” parameter to suppress the whitespace that had been annoying me. I thought I tried this already, but apparently not. Anyway, the Python / R method of first defining a new object then calling it is an extra step compared to Stata loops (where the looping local can expand directly) but it’s still reasonably easy. Here’s my complete R code, which I’m now very happy with.

# File-Name:       chrnetwork.R                 
# Date:            2010-02-26
# Created Date:    2009-11-24                               
# Author:          Gabriel Rossman                                       
# Purpose:         graph CHR station network
# Data Used:       ties_bounded.net
# Packages Used:   igraph    
library(igraph)
setwd("~/Documents/Sjt/radio/survey")
#ties bounded to only top 40, includes adoption time color-codes, but use is optional
chrnetbounded <- read.graph("ties_bounded_humpcolor.net", c("pajek"))
la = layout.fruchterman.reingold(chrnetbounded)  #create layout for use on several related graphs
#graph structure only
pdf("~/Documents/book/images/chrnetworkbounded.pdf")
 plot.igraph(chrnetbounded, layout=la, vertex.size=4, vertex.label=NA, vertex.color="red", edge.color="gray20", edge.arrow.size=0.3, margin=0)
dev.off()
#graph color coded diffusion
pdf("~/Documents/book/images/chrnetworkboundedcolor.pdf")
 plot.igraph(chrnetbounded, layout=la, vertex.size=4, vertex.label=NA, edge.color="gray60", edge.arrow.size=0.3, margin=0)
dev.off()
#do as flipbook
setwd("~/Documents/Sjt/radio/survey/flipbook")
for(i in 0:52) {
	datafile<-paste('ties_bounded_hc',i,'.net', sep="")
	pngfile<-paste('~/Documents/book/images/chrnet_hc',i,'.png', sep="")
	chrnetbounded <- read.graph(datafile, c("pajek"))
	png(pngfile)
	plot.igraph(chrnetbounded, layout=la, vertex.size=4, vertex.label=NA, edge.color="gray60", edge.arrow.size=0.3, margin=0)
	dev.off()
}

February 28, 2010 at 1:23 pm 9 comments

Shufflevar [update]

| Gabriel |

I wrote a more flexible version of shufflevar (here’s the old version) and an accompanying help file. The new version allows shuffling an entire varlist (rather than just one variable) jointly or independently and shuffling within clusters. The easiest way to install the command is to type:
ssc install shufflevar
For thoughts on this and related algorithms, see my original shufflevar post and/or the lecture notes on bootstrapping for my grad stats class.
Here’s the code.

*1.0 GHR January 29, 2010
capture program drop shufflevar
program define shufflevar
	version 10
	syntax varlist(min=1) [ , Joint DROPold cluster(varname)]
	tempvar oldsortorder
	gen `oldsortorder'=[_n]
	if "`joint'"=="joint" {
		tempvar newsortorder
		gen `newsortorder'=uniform()
		sort `cluster' `newsortorder'
		foreach var in `varlist' {
			capture drop `var'_shuffled
			quietly gen `var'_shuffled=`var'[_n-1]
			quietly replace `var'_shuffled=`var'[_N] in 1/1
			if "`dropold'"=="dropold" {
				drop `var'
			}
		}
		sort `oldsortorder'
		drop `newsortorder' `oldsortorder'
	}
	else {
		foreach var in `varlist' {
			tempvar newsortorder
			gen `newsortorder'=uniform()
			sort `cluster' `newsortorder'
			capture drop `var'_shuffled
			quietly gen `var'_shuffled=`var'[_n-1]
			quietly replace `var'_shuffled=`var'[_N] in 1/1
			drop `newsortorder'
			if "`dropold'"=="dropold" {
				drop `var'
			}
		}
		sort `oldsortorder'
		drop `oldsortorder'
	}
end

And here’s the help file.

{smcl}
{* 29jan2010}{...}
{hline}
help for {hi:shufflevar}
{hline}

{title:Randomly shuffle variables}

{p 8 17 2}
{cmd:shufflevar} {it:varlist}[, {cmdab: Joint DROPold cluster}({it:varname})]

{title:Description}

{p 4 4 2}
{cmd:shufflevar} takes {it:varlist} and either jointly or for each variable
shuffles {it:varlist} relative to the rest of the dataset. This means any
association between {it:varlist} and the rest of the dataset will be random.
Much like {help bootstrap} or the Quadratic Assignment Procedure (QAP), one
can build a distribution of results out of randomness to serve as a baseline
against which to compare empirical results, especially for overall model-fit
or clustering measures.

{title:Remarks}

{p 4 4 2}
The program is intended for situations where it is hard to model error
formally, either because the parameter is exotic or because the application
violates the parameter's assumptions. For instance, the algorithm has been
used by Fernandez et. al. and Zuckerman to interpret network data, the author
wrote this implementation for use in interpreting {help st} frailty models
with widely varying cluster sizes, and others have suggested using the metric
for adjacency matrices in spatial analysis.

{p 4 4 2}
Much like {help bsample}, the {cmd:shufflevar} command is only really useful
when worked into a {help forvalues} loop or {help program} that records the
results of each iteration using {help postfile}. See the example code below to
see how to construct the loop.

{p 4 4 2}
To avoid confusion with the actual data, the shuffled variables are renamed
{it:varname}_shuffled.

{p 4 4 2}
This command is an implementation of an algorithm used in two papers that used
it to measure network issues:

{p 4 4 2}
Fernandez, Roberto M., Emilio J. Castilla, and Paul Moore. 2000. "Social
Capital at Work: Networks and Employment at a Phone Center." {it:American Journal of Sociology} 105:1288-1356.

{p 4 4 2}
Zuckerman, Ezra W. 2005. "Typecasting and Generalism in Firm and Market: Career-Based Career Concentration in the Feature Film Industry, 1935-1995." {it:Research in the Sociology of Organizations} 23:173-216.

{title:Options}

{p 4 8 2}
{cmd:joint} specifies that {it:varlist} will be keep their actual relations to
one another even as they are shuffled relative to the rest of the variables.
If {cmd:joint} is omitted, each variable in the {it:varlist} will be
shuffled separately.

{p 4 8 2}
{cmd:dropold} specifies that the original sort order versions of {it:varlist}
will be dropped.

{p 4 8 2}
{cmd:cluster}({it:varname}) specifies that shuffling will occur by {it:varname}.

{title:Examples}

{p 4 8 2}{cmd:. sysuse auto, clear}{p_end}
{p 4 8 2}{cmd:. regress price weight}{p_end}
{p 4 8 2}{cmd:. local obs_r2=`e(r2)'}{p_end}
{p 4 8 2}{cmd:. tempname memhold}{p_end}
{p 4 8 2}{cmd:. tempfile results}{p_end}
{p 4 8 2}{cmd:. postfile `memhold' r2 using "`results'"}{p_end}
{p 4 8 2}{cmd:. forvalues i=1/100 {c -(}}{p_end}
{p 4 8 2}{cmd:. 	shufflevar weight, cluster(foreign)}{p_end}
{p 4 8 2}{cmd:. 	quietly regress price weight_shuffled}{p_end}
{p 4 8 2}{cmd:. 	post `memhold' (`e(r2)')}{p_end}
{p 4 8 2}{cmd:. }}{p_end}
{p 4 8 2}{cmd:. postclose `memhold'}{p_end}
{p 4 8 2}{cmd:. use "`results'", clear}{p_end}
{p 4 8 2}{cmd:. sum r2}{p_end}
{p 4 8 2}{cmd:. disp "The observed R^2 of " `obs_r2' " is " (`obs_r2'-`r(mean)')/`r(sd)' " sigmas out on the" _newline "distribution of shuffled R^2s."}{p_end}

{title:Author}

{p 4 4 2}Gabriel Rossman, UCLA{break}
rossman@soc.ucla.edu

{title:Also see}

{p 4 13 2}On-line:
help for {help bsample},
help for {help forvalues},
help for {help postfile},
help for {help program}

February 1, 2010 at 4:33 am 1 comment

Drop the entirely missing variables

| Gabriel |

One of the purely technical frustrations with GSS is that it’s hard to figure out if a particular question was in a particular wave. The SDA server is pretty good about kicking these from extracts, but it still leaves some in. The other day I was playing with the 2008 wave and got sick of saying “oooh, that looks interesting” only to find the variable was missing for my wave, which happened repeatedly. (For instance, I thought it would be fun to run Biblical literalism against the Israel feeling thermometer, but no dice, at least for the 2008 wave).

To get rid of these phantom variables, I wrote this little loop that drops variables with entirely missing data (or that are coded as strings, see below):

foreach varname of varlist * {
 quietly sum `varname'
 if `r(N)'==0 {
  drop `varname'
  disp "dropped `varname' for too much missing data"
 }
}

Unfortunately “sum” thinks that string variables have no observations so this will also drop strings. There’s a workaround, but it involves the “ds” command, which works in Stata 11 but has been deprecated and so may not work in future versions.

ds, not(type string)
foreach varname of varlist `r(varlist)' {
 quietly sum `varname'
 if `r(N)'==0 {
  drop `varname'
  disp "dropped `varname' for too much missing data"
 }
}

December 2, 2009 at 5:00 am 9 comments

Older Posts


The Culture Geeks