Posts tagged ‘loops’
Recursively building graph commands
| Gabriel |
I really like multi-line graphs and scatterplots where the marker color/style reflects categories. Such graphs are more compact than multiple separate graphs, and they make it easier to compare different things. The way you do this is with “twoway,” a lot of parentheses, and the “if” condition. For example:
twoway (kdensity x if year==1985) (kdensity x if year==1990) (kdensity x if year==1995) (kdensity x if year==2000) (kdensity x if year==2005), legend(order(1 "1985" 2 "1990" 3 "1995" 4 "2000" 5 "2005"))
Unfortunately such graphs can be difficult to script. This is especially so for the legends, which by default show the variable name rather than the selection criteria. I handle this by recursively looping over a local, which in the case of the legend involves embedded quote marks.
capture program drop multigraph
program define multigraph
	local var `1'
	local interval `2'
	local command ""
	local legend ""
	local legendtick=1
	forvalues i=1985(`interval')2005 {
		local command "`command' (kdensity `var' if year==`i')"
		local legend `legend' `legendtick' `" `i' "' *"
		local legendtick=`legendtick'+1
	}
	disp "twoway ""`command'" ", legend:" "`legend'"
	twoway `command' , legend(order(`legend'))
end
To replicate the hard-coded command above, you’d call it like this:
multigraph x 5
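If it helps to see the accumulate-in-a-loop logic outside of Stata, here is a rough Python sketch of the same idea (the function name build_multigraph is mine, not part of the program above). It assembles the overlay pieces and legend labels in a loop, then joins them into one command string:

```python
# Sketch of multigraph's string-building: for each year, append one
# kdensity overlay and one numbered legend label, then join the pieces.
def build_multigraph(var, interval, start=1985, stop=2005):
    plots = []
    legend = []
    for tick, year in enumerate(range(start, stop + 1, interval), start=1):
        plots.append('(kdensity %s if year==%d)' % (var, year))
        legend.append('%d "%d"' % (tick, year))
    return 'twoway %s, legend(order(%s))' % (' '.join(plots), ' '.join(legend))

# Reproduces the hard-coded twoway command from the top of the post.
print(build_multigraph('x', 5))
```

The only real difference from the Stata version is that Python accumulates into lists and joins at the end, where Stata accumulates into a growing local macro.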
Scraping Using twitteR (updated)
| Gabriel |
Last time I described using the twitteR library for R. In that post I had R itself loop over the list of feeds. In this post I instead have the looping occur in Bash, with arguments passed to R through the commandArgs() function.
First, one major limitation of using Twitter is that it times you out for an hour after 150 queries. (You can double this if you use OAuth but I’ve yet to get that to work). For reasons I don’t really understand, getting one feed can mean multiple queries, especially if you’re trying to go far back in the timeline. For this reason you need to break up your list into a bunch of small lists and cron them at least 80 minutes apart. This bit of Bash code will split up a file called “list.txt” into several files. Also, to avoid trouble later on, it makes sure you have Unix EOL.
split -l 50 list.txt short_tw
perl -pi -e 's/\r\n/\n/g' short_tw*
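For what it's worth, the same chunking step can be sketched in plain Python (the function and the feed names here are made up for illustration; the chunk size of 50 matches the split -l 50 call above):

```python
# Break a long list of feeds into sublists of at most `size` entries,
# mirroring what `split -l 50` does to the lines of list.txt.
def chunk(names, size=50):
    return [names[i:i + size] for i in range(0, len(names), size)]

feeds = ['feed%d' % i for i in range(120)]
print([len(c) for c in chunk(feeds)])   # → [50, 50, 20]
```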
The next thing to keep in mind is that you’ll need to pass arguments to R. Argument passing is when a script takes input from outside the script and processes it as variables. The enthusiastic use of argument passing in Unix is the reason why there is a fine line between a file and a command in that operating system.
In theory you could have R read the target list itself, but this crashes when you hit your first dead URL. Running the loop from outside R makes it more robust, but this requires passing arguments to R. I’d previously solved this problem by having Stata write an entire R script, which Stata understood as having variables (or “macros”) but which from R’s perspective was hard-coded. However, I was recently delighted to discover that R can accept command-line arguments with the commandArgs() function. Not surprisingly, this is more difficult than $1 in Bash, @ARGV in Perl, or `1' in Stata, but it’s not that bad. To use it you have to pass the "--args" option when invoking R, and then inside of R you use the commandArgs() function to read the arguments into an array object, which behaves much like the @ARGV array in Perl.
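As a rough cross-language illustration (not part of the twitteR script itself), Python's sys.argv plays the same role as commandArgs() in R or @ARGV in Perl: element zero is the script name and the user-supplied arguments follow. The script and path names below are made up to parallel the R script that comes next:

```python
# Minimal argument-passing sketch: pull the screenname off the command
# line and build the output filename from it, as the R script below does.
import sys

def main(argv):
    user = argv[1]                      # first argument after the script name
    outputfile = '~/project/feeds/%s.txt' % user
    return user, outputfile

# e.g. invoked as: python datacollection.py asanews
print(main(sys.argv if len(sys.argv) > 1 else ['datacollection.py', 'asanews']))
```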
Here’s an R script that accepts a Twitter screenname as a command-line argument, uses the twitteR library to collect that feed, and then saves it as a tab-delimited text file of the same name. (It appends if there’s an existing file). Also note that (thanks to commenters on the previous post) it turns internal EOL into regular spaces. It’s currently set to collect the last 200 tweets but you can adjust this with the third line (or you could rewrite the script to make this a command-line argument as well).
args <- commandArgs(trailingOnly = TRUE)
library(twitteR)
howmany <- 200 #how many past tweets to collect
user <- args[1]
outputfile <- paste('~/project/feeds/',user,'.txt',sep="")
print(user)
print(outputfile)
tmptimeline <- userTimeline(user,n=as.character(howmany))
tmptimeline.df <- twListToDF(tmptimeline)
tmptimeline.df$text <- gsub("\\n|\\r|\\t", " ", tmptimeline.df$text)
write.table(tmptimeline.df,file=outputfile,append=TRUE,sep="\t",col.names=FALSE)
quit()
To use the script to get just a single feed, you invoke it like this from the command-line.
R --vanilla --args asanews < datacollection.R
Of course the whole reason to write the script this way is to loop it over the lists. Here it is for the list “short_twaa”.
for i in `cat short_twaa`; do R --vanilla --args $i < datacollection.R ; done
Keep in mind that you’ll probably want to cron this, either because you want a running scrape or because it makes it easier to space out the “short_tw*” files so you don’t get timed out.
Another R bleg about loops and graph devices
| Gabriel |
I’m having another problem with R and was hoping somebody could help me out in the comments. Long story short, I can make graphs properly if I do them one at a time, but when I try to loop it I get device errors. In particular it creates the PDFs but they are either empty or corrupted.
Here is the log, which first fails to do it with a loop, then does it right for one case where I manually assign the variables rather than looping.
> # File-Name: surfacegraphs.R
> # Date: 2011-04-25
> # Author: Gabriel Rossman
> # Purpose: graph from Stata
> # Packages Used: lattice
> # note, wireframe code from lisa
> timestamp()
##------ Tue Apr 26 10:44:09 2011 ------##
> library(lattice)
>
> histopath <- '~/Documents/project/histograms'
> image2 <- '~/Documents/project/images/histograms'
>
> timestamp()
##------ Tue Apr 26 10:44:10 2011 ------##
>
> #create surface histograms, showing how population evolves over time
> # parameters held constant
> setwd(histopath)
> for(d in 0:10) {
+ 	for(p in 0:5) {
+ 		d10 <- d*10
+ 		p100 <- p*100
+ 		datafile <- paste(histopath,'/d',d10,'p',p100,'.txt', sep="")
+ 		dataobject <- read.table(file=datafile,header=TRUE)
+ 		pdfcolor <- paste(image2,'/hist_color_d',d10,'p',p100,'.pdf', sep="")
+ 		pdfgrey <- paste(image2,'/hist_grey_d',d10,'p',p100,'.pdf', sep="")
+ 		pdf(pdfcolor)
+ 		wireframe( dataobject$z~dataobject$x*dataobject$y, shade=TRUE)
+ 		dev.off()
+
+ 		pdf(pdfgrey)
+ 		wireframe( dataobject$z~dataobject$x*dataobject$y, shade=TRUE, par.settings=standard.theme(color=FALSE))
+ 		dev.off()
+ 	}
+ }
There were 50 or more warnings (use warnings() to see the first 50)
> timestamp()
##------ Tue Apr 26 10:44:12 2011 ------##
>
> #loop doesn't work
> # seems to be the dev.off()
> #try a few manually
> d10 <- 0
> p100 <- 0
> datafile <- paste(histopath,'/d',d10,'p',p100,'.txt', sep="")
> dataobject <- read.table(file=datafile,header=TRUE)
> pdfcolor <- paste(image2,'/hist_color_d',d10,'p',p100,'.pdf', sep="")
> pdfgrey <- paste(image2,'/hist_grey_d',d10,'p',p100,'.pdf', sep="")
> pdf(pdfcolor)
> wireframe( dataobject$z~dataobject$x*dataobject$y, shade=TRUE)
> dev.off()
null device
          1
>
>
> timestamp()
##------ Tue Apr 26 10:44:14 2011 ------##
>
The warnings start like this and go on from there:
> warnings()
Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
3: In min(x) : no non-missing arguments to min; returning Inf
4: In max(x) : no non-missing arguments to max; returning -Inf
5: In min(x) : no non-missing arguments to min; returning Inf
Any ideas?
Do I just need to hard-code the graphs I really want rather than batching them?
[Update]
As Michal suggested, I needed to wrap the wireframe() calls in print(). Here’s an example of the output (for a baseline simulation).
[image: hist_color_d0p0]
Shufflevar update
| Gabriel |
Thanks to Elizabeth Blankenspoor (Michigan) I corrected a bug with the “cluster” option in shufflevar. I’ve submitted the update to SSC but it’s also here:
*1.1 GHR January 24, 2011
*changelog
*1.1 -- fixed bug that let one case per "cluster" be misallocated (thanks to Elizabeth Blankenspoor)
capture program drop shufflevar
program define shufflevar
	version 10
	syntax varlist(min=1) [ , Joint DROPold cluster(varname)]
	tempvar oldsortorder
	gen `oldsortorder'=[_n]
	if "`cluster'"!="" {
		local bystatement "by `cluster': "
	}
	else {
		local bystatement ""
	}
	if "`joint'"=="joint" {
		tempvar newsortorder
		gen `newsortorder'=uniform()
		sort `cluster' `newsortorder'
		foreach var in `varlist' {
			capture drop `var'_shuffled
			quietly {
				`bystatement' gen `var'_shuffled=`var'[_n-1]
				`bystatement' replace `var'_shuffled=`var'[_N] if _n==1
			}
			if "`dropold'"=="dropold" {
				drop `var'
			}
		}
		sort `oldsortorder'
		drop `newsortorder' `oldsortorder'
	}
	else {
		foreach var in `varlist' {
			tempvar newsortorder
			gen `newsortorder'=uniform()
			sort `cluster' `newsortorder'
			capture drop `var'_shuffled
			quietly {
				`bystatement' gen `var'_shuffled=`var'[_n-1]
				`bystatement' replace `var'_shuffled=`var'[_N] if _n==1
			}
			drop `newsortorder'
			if "`dropold'"=="dropold" {
				drop `var'
			}
		}
		sort `oldsortorder'
		drop `oldsortorder'
	}
end
Keep the best 5 (updated)
| Gabriel |
Last year I mentioned my policy of assigning about seven quizzes and then keeping the best 5. I then had a real Rube Goldberg-esque workflow that involved piping to Perl. Several people came up with simpler ideas in the comments, but the most “why didn’t I think of that” was definitely John-Paul Ferguson’s suggestion to just use reshape. Now that I’m teaching the class again, I’ve rewritten the script to work on that logic.
Also, I’ve made the script a bit more flexible by allowing you to specify in the header how many quizzes were offered and how many to keep. To make this work I made a loop that builds a local called sumstring.
[UPDATE 11/29/2010, applied Nick Cox’s suggestions. Old code remains but is commented out]
local numberofquizzes 6
local keepbest 5

*import grades, which look like this
*uid	name	mt	q1	q2	q3
*5001	Joe	40	5	4	6
*4228	Alex	20	6	3	5
insheet using grades.txt, clear

*rescale the quizzes from raw points to proportion
forvalues qnum=1/`numberofquizzes' {
	quietly sum q`qnum'
	replace q`qnum'=q`qnum'/`r(max)'
}

/*
*build the sumstring local (original code)
local sumstring ""
forvalues i=1/`keepbest' {
	local sumstring "`sumstring' + q`i'"
	disp "`sumstring'"
	local sumstring=subinstr("`sumstring'","+","",1)
	disp "`sumstring'"
}
*/

*reshape long, keep top few quizzes
reshape long q, i( notes uid name mt) j(qnum)
recode q .=0
gsort uid -q
by uid: drop if _n>`keepbest'
by uid: replace qnum=_n

*reshape wide, calc average
reshape wide q, i(notes uid name mt) j(qnum)

*build the sumstring local (w/ Nick Cox's suggestions)
unab sumstring : q*
disp "`sumstring'"
local sumstring : subinstr local sumstring " " "+", all
disp "`sumstring'"
gen q_avg=(`sumstring')/`keepbest'
sort name
sum q_avg

*have a nice day
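For readers who don't use Stata, the core keep-the-best-k logic can be sketched in a few lines of Python (the function name and the example scores are mine, not the do-file's; here the scores are already rescaled to proportions):

```python
# Treat a missing quiz as zero, then average the student's top `keepbest`
# scores -- the same thing the reshape/gsort/drop sequence accomplishes.
def best_k_average(scores, keepbest=5):
    scores = [0 if s is None else s for s in scores]   # recode missing to 0
    top = sorted(scores, reverse=True)[:keepbest]
    return sum(top) / keepbest

# A student took six quizzes and missed one; only the best five count.
print(best_k_average([0.8, 1.0, 0.9, None, 0.7, 0.6]))
```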
Stata Programming Lecture [updated]
| Gabriel |
I gave my introduction to Stata programming lecture again. This time the lecture was cross-listed between my graduate statistics course and the UCLA ATS faculty seminar series. Here are the lecture notes.
Thanks for the Aspirin Guys
| Gabriel |
In a recent post, I lamented that I couldn’t figure out how to do loops and a few other things in R. The script at issue was intended to create a slideshow of network graphs which I’d then use Image Magick to convert to an animated gif. With some help from Brian and Kieran, I ultimately got it to work and the results are in this post (as well as a presentation at a mini-conference I made a few days later).
Kieran very generously further improved on the code at his own blog using simulated data. Kieran’s code has a certain elegance to it, but it’s an R style of elegance based on objects and lists and the like so it’s not completely intuitive to me, despite his thorough commenting. Anyway, I worked his suggestions into my code (see below) and it works great.
Since Kieran was “interested to see whether this approach was at all faster,” I ran the old version and the new version with timestamps. (To make it fair, I commented out the “pdf” version of the flipbook, which only appears in the “old” code — this saves about one second). Turns out they’re almost exactly the same speed — about 26 or 27 seconds (it varies about a second or two every time I do it, sometimes new is faster, sometimes old). The vast majority of that half a minute is taken up by Image Magick. The actual R code only takes about 7 seconds to execute in either version. I think that’s amazingly fast to import 60 or so datasets (albeit small datasets), generate two different FR layouts, apply the layouts to 60 or so graphs (albeit graphs that are written directly to disk rather than to screen), and do miscellaneous housekeeping. I haven’t run a timestamped comparison, but my impression is that this is much faster than comparable operations in Pajek or NWB and appreciably faster than doing scatterplots in Stata.
# File-Name: chrnetwork.R
# Date: 2010-03-11
# Created Date: 2009-11-24
# Author: Gabriel Rossman (w a fair amount of help from BR and KJH)
# Purpose: graph CHR station network
# Data Used: ties_bounded.net
# Packages Used: igraph plyr
timestamp()
library(igraph)
library(plyr)
setwd("~/Documents/Sjt/radio/survey")
#ties -- including ties to non-top40 (who can't reciprocate)
chrnet <- read.graph("ties.net", c("pajek"))
pdf("~/Documents/book/images/chrnetwork.pdf")
plot.igraph(chrnet, layout=layout.fruchterman.reingold, vertex.size=4, vertex.label=NA, vertex.color="red", edge.color="gray20", edge.arrow.size=0.3, margin=0)
dev.off()
#ties bounded to only top 40, includes adoption time color-codes, but use is optional
chrnetbounded <- read.graph("ties_bounded_humpcolor.net", c("pajek"))
la = layout.fruchterman.reingold(chrnetbounded) #create layout for use on several related graphs
#graph structure only
pdf("~/Documents/book/images/chrnetworkbounded.pdf")
plot.igraph(chrnetbounded, layout=la, vertex.size=4, vertex.label=NA, vertex.color="red", edge.color="gray20", edge.arrow.size=0.3, margin=0)
dev.off()
#graph color coded diffusion
pdf("~/Documents/book/images/chrnetworkboundedcolor.pdf")
plot.igraph(chrnetbounded, layout=la, vertex.size=4, vertex.label=NA, edge.color="gray80", edge.arrow.size=0.3, margin=0)
dev.off()
#flipbook
timestamp()
setwd("~/Documents/Sjt/radio/survey/flipbook")
filenames <- list.files(getwd(), pattern="\\.net$")
ind <- order(as.numeric(gsub("[^[:digit:]]", "", filenames)))
filenames <- filenames[ind]
g.list <- llply(filenames, read.graph, format="pajek")
png(file="~/Documents/book/images/flipbook/chrnet_hc%d.png", width=600, height=600, units="px")
l_ply(g.list, plot, layout=la, vertex.size=4, vertex.label=NA, edge.color="gray60", edge.arrow.size=0.3, margin=0)
dev.off()
timestamp()
### Finally, generate the gif without having to renumber the files
### individually.
png.filenames <- list.files("~/Documents/book/images/flipbook/", pattern="\\.png$")
timestamp()
## Resort properly again
ind <- order(as.numeric(gsub("[^[:digit:]]", "", png.filenames)))
png.filenames <- png.filenames[ind]
png.filenames <- paste("~/Documents/book/images/flipbook/", png.filenames, sep="") # add the path back
png.string <- capture.output(cat(png.filenames))
## Open a pipe to the shell to execute the convert command
## directly.
timestamp()
gifpipe <- pipe(paste("convert", png.string, "~/Documents/book/images/flipbook/chrnet_humps.gif", sep=" "), "w")
close(gifpipe)
timestamp()
More R headaches
| Gabriel |
I’ve continued to play with the R package igraph and I think I’ve gotten the hang of igraph itself, but R itself is still pretty opaque to me and that limits my ability to use igraph. Specifically, I’m trying to merge diffusion data onto a network graph and plot it as a kind of slideshow, where each successive image is a new period. I can actually do this, but I’m having trouble looping it and so my code is very repetitive. [Update: thanks to Brian Rubineau I’ve gotten it to work]
First the two tricks that I have solved.
1. Merging the diffusion data (a vertex-level trait) onto the network.
The preferred way to do this is to read the network file into R and then merge in the vertex-level trait (or, more technically, read the vertex data and associate it with the network data). This has proven really difficult for me, so instead I wrote an alternate version of stata2pajek that lets me do this within Stata. The upside is that I spend more time in Stata and less time in R; the downside is that I need a separate “.net” file for every version of the graph.
2. Getting the different versions of the graph to appear comparable.
It does no good to have multiple versions of a graph if they don’t look at all similar, which is what you get if you generate the layout on the fly as part of each plot. The solution is to first generate a layout (the object “la”) and then apply it to each graph by setting the “layout” parameter to “la” within the “plot.igraph()” function.
So instead of this:
plot.igraph(chrnetbounded, layout=layout.fruchterman.reingold, vertex.size=4, vertex.label=NA, vertex.color="red", edge.color="gray20", edge.arrow.size=0.3, margin=0)
Do this:
la = layout.fruchterman.reingold(chrnetbounded)
plot.igraph(chrnetbounded, layout=la, vertex.size=4, vertex.label=NA, vertex.color="red", edge.color="gray20", edge.arrow.size=0.3, margin=0)
Now the problem I can’t solve, at least with anything approaching elegance.
3. Looping it.
Currently my code is really long and repetitive. I want one graph per week for a whole year. To make 52 versions of the graph, I need about two hundred lines of code, whereas I should be able to just loop my core four lines of code 52 times: pdf(); read.graph(); plot.igraph(); dev.off(). The repetitive version works, but it’s ridiculous.
The thing is that I can’t figure out how to make a loop in R where the looping index feeds into part of a filename. This kind of thing is trivially easy in Stata, since Stata expands locals and then interprets the result. For example, here’s how I would do this in Stata. (I’m using “twoway scatter” as a placeholder for “plot.igraph()”, which of course doesn’t exist in Stata.)
forvalues i=0/52 {
	use ties_bounded`i', clear
	twoway scatter x y
	graph export chrnet_hc`i'.png, replace
}
R not only doesn’t let you do this directly, but I can’t even figure out how to do it by adding an extra step where the looped local feeds into a new object to write the filename. (The problem is that the “paste()” function adds whitespace). So basically I’m at an impasse and I see three ways to do it:
- Wait for somebody to tell me in the comments how to do this kind of loop in R. (hint, hint).
- Give up on writing this kind of loop and just resign myself to writing really repetitive R code.
- Give up on using igraph from within R and learn to use it from within Python. I have basically zero experience using Python but it has a good reputation for usability. In fact, it only took me, a complete Python noob, about ten minutes to figure out what’s been a real stumper in R. I haven’t yet worked igraph into this loop but I’m thinking it can’t be that hard.
for i in range(0,52):
    datafile = 'ties_bounded%d' % i
    # igraph code here that treats the python object "datafile" as a filename
Update
Thanks to Brian Rubineau’s suggestion, I now know how to better use the “paste()” function, which is functionally equivalent to the second line of the Python code above. The catch is that you have to add a sep="" parameter to suppress the whitespace that had been annoying me. I thought I had tried this already, but apparently not. Anyway, the Python / R method of first defining a new object and then calling it is an extra step compared to Stata loops (where the looping local can expand directly), but it’s still reasonably easy. Here’s my complete R code, which I’m now very happy with.
# File-Name: chrnetwork.R
# Date: 2010-02-26
# Created Date: 2009-11-24
# Author: Gabriel Rossman
# Purpose: graph CHR station network
# Data Used: ties_bounded.net
# Packages Used: igraph
library(igraph)
setwd("~/Documents/Sjt/radio/survey")
#ties bounded to only top 40, includes adoption time color-codes, but use is optional
chrnetbounded <- read.graph("ties_bounded_humpcolor.net", c("pajek"))
la = layout.fruchterman.reingold(chrnetbounded) #create layout for use on several related graphs
#graph structure only
pdf("~/Documents/book/images/chrnetworkbounded.pdf")
plot.igraph(chrnetbounded, layout=la, vertex.size=4, vertex.label=NA, vertex.color="red", edge.color="gray20", edge.arrow.size=0.3, margin=0)
dev.off()
#graph color coded diffusion
pdf("~/Documents/book/images/chrnetworkboundedcolor.pdf")
plot.igraph(chrnetbounded, layout=la, vertex.size=4, vertex.label=NA, edge.color="gray60", edge.arrow.size=0.3, margin=0)
dev.off()
#do as flipbook
setwd("~/Documents/Sjt/radio/survey/flipbook")
for(i in 0:52) {
	datafile<-paste('ties_bounded_hc',i,'.net', sep="")
	pngfile<-paste('~/Documents/book/images/chrnet_hc',i,'.png', sep="")
	chrnetbounded <- read.graph(datafile, c("pajek"))
	png(pngfile)
	plot.igraph(chrnetbounded, layout=la, vertex.size=4, vertex.label=NA, edge.color="gray60", edge.arrow.size=0.3, margin=0)
	dev.off()
}
Shufflevar [update]
| Gabriel |
I wrote a more flexible version of shufflevar (here’s the old version) and an accompanying help file. The new version allows shuffling an entire varlist (rather than just one variable) jointly or independently and shuffling within clusters. The easiest way to install the command is to type:
ssc install shufflevar
For thoughts on this and related algorithms, see my original shufflevar post and/or the lecture notes on bootstrapping for my grad stats class.
Here’s the code.
*1.0 GHR January 29, 2010
capture program drop shufflevar
program define shufflevar
	version 10
	syntax varlist(min=1) [ , Joint DROPold cluster(varname)]
	tempvar oldsortorder
	gen `oldsortorder'=[_n]
	if "`joint'"=="joint" {
		tempvar newsortorder
		gen `newsortorder'=uniform()
		sort `cluster' `newsortorder'
		foreach var in `varlist' {
			capture drop `var'_shuffled
			quietly gen `var'_shuffled=`var'[_n-1]
			quietly replace `var'_shuffled=`var'[_N] in 1/1
			if "`dropold'"=="dropold" {
				drop `var'
			}
		}
		sort `oldsortorder'
		drop `newsortorder' `oldsortorder'
	}
	else {
		foreach var in `varlist' {
			tempvar newsortorder
			gen `newsortorder'=uniform()
			sort `cluster' `newsortorder'
			capture drop `var'_shuffled
			quietly gen `var'_shuffled=`var'[_n-1]
			quietly replace `var'_shuffled=`var'[_N] in 1/1
			drop `newsortorder'
			if "`dropold'"=="dropold" {
				drop `var'
			}
		}
		sort `oldsortorder'
		drop `oldsortorder'
	}
end
And here’s the help file.
{smcl}
{* 29jan2010}{...}
{hline}
help for {hi:shufflevar}
{hline}

{title:Randomly shuffle variables}

{p 8 17 2}
{cmd:shufflevar}
{it:varlist}
[, {cmdab: Joint DROPold cluster}({it:varname})]

{title:Description}

{p 4 4 2}
{cmd:shufflevar} takes {it:varlist} and either jointly or for each variable shuffles {it:varlist} relative to the rest of the dataset. This means any association between {it:varlist} and the rest of the dataset will be random. Much like {help bootstrap} or the Quadratic Assignment Procedure (QAP), one can build a distribution of results out of randomness to serve as a baseline against which to compare empirical results, especially for overall model-fit or clustering measures.

{title:Remarks}

{p 4 4 2}
The program is intended for situations where it is hard to model error formally, either because the parameter is exotic or because the application violates the parameter's assumptions. For instance, the algorithm has been used by Fernandez et al. and Zuckerman to interpret network data, the author wrote this implementation for use in interpreting {help st} frailty models with widely varying cluster sizes, and others have suggested using the metric for adjacency matrices in spatial analysis.

{p 4 4 2}
Much like {help bsample}, the {cmd:shufflevar} command is only really useful when worked into a {help forvalues} loop or {help program} that records the results of each iteration using {help postfile}. See the example code below to see how to construct the loop.

{p 4 4 2}
To avoid confusion with the actual data, the shuffled variables are renamed {it:varname}_shuffled.

{p 4 4 2}
This command is an implementation of an algorithm used in two papers that used it to measure network issues:

{p 4 4 2}
Fernandez, Roberto M., Emilio J. Castilla, and Paul Moore. 2000. "Social Capital at Work: Networks and Employment at a Phone Center." {it:American Journal of Sociology} 105:1288-1356.

{p 4 4 2}
Zuckerman, Ezra W. 2005. "Typecasting and Generalism in Firm and Market: Career-Based Career Concentration in the Feature Film Industry, 1935-1995." {it:Research in the Sociology of Organizations} 23:173-216.

{title:Options}

{p 4 8 2}
{cmd:joint} specifies that the variables in {it:varlist} will keep their actual relations to one another even as they are shuffled relative to the rest of the variables. If {cmd:joint} is omitted, each variable in the {it:varlist} will be shuffled separately.

{p 4 8 2}
{cmd:dropold} specifies that the original sort order versions of {it:varlist} will be dropped.

{p 4 8 2}
{cmd:cluster}({it:varname}) specifies that shuffling will occur by {it:varname}.

{title:Examples}

{p 4 8 2}{cmd:. sysuse auto, clear}{p_end}
{p 4 8 2}{cmd:. regress price weight}{p_end}
{p 4 8 2}{cmd:. local obs_r2=`e(r2)'}{p_end}
{p 4 8 2}{cmd:. tempname memhold}{p_end}
{p 4 8 2}{cmd:. tempfile results}{p_end}
{p 4 8 2}{cmd:. postfile `memhold' r2 using "`results'"}{p_end}
{p 4 8 2}{cmd:. forvalues i=1/100 {c -(}}{p_end}
{p 4 8 2}{cmd:. shufflevar weight, cluster(foreign)}{p_end}
{p 4 8 2}{cmd:. quietly regress price weight_shuffled}{p_end}
{p 4 8 2}{cmd:. post `memhold' (`e(r2)')}{p_end}
{p 4 8 2}{cmd:. {c )-}}{p_end}
{p 4 8 2}{cmd:. postclose `memhold'}{p_end}
{p 4 8 2}{cmd:. use "`results'", clear}{p_end}
{p 4 8 2}{cmd:. sum r2}{p_end}
{p 4 8 2}{cmd:. disp "The observed R^2 of " `obs_r2' " is " (`obs_r2'-`r(mean)')/`r(sd)' " sigmas out on the" _newline "distribution of shuffled R^2s."}{p_end}

{title:Author}

{p 4 4 2}Gabriel Rossman, UCLA{break}
rossman@soc.ucla.edu

{title:Also see}

{p 4 13 2}On-line: help for {help bsample}, help for {help forvalues}, help for {help postfile}, help for {help program}
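For intuition, here is a stripped-down Python sketch of the permutation-baseline idea that shufflevar implements. This is not the Stata code: it uses a simple correlation where in practice you might use R^2, a frailty parameter, or a clustering measure, and the simulated data are invented for the example.

```python
# Shuffle one variable relative to the rest of the data many times,
# recompute the statistic each time, and compare the observed value
# to the resulting null distribution.
import random

def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(1)
x = list(range(30))
y = [xi + random.gauss(0, 3) for xi in x]   # y really depends on x
observed = correlation(x, y)

shuffled = []
for _ in range(200):
    x_shuffled = x[:]            # like var_shuffled: same values, new order
    random.shuffle(x_shuffled)
    shuffled.append(correlation(x_shuffled, y))

# The observed correlation should sit far outside the shuffled distribution.
print(observed > max(shuffled))
```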
Drop the entirely missing variables
| Gabriel |
One of the purely technical frustrations with GSS is that it’s hard to figure out if a particular question was in a particular wave. The SDA server is pretty good about kicking these from extracts, but it still leaves some in. The other day I was playing with the 2008 wave and got sick of saying “oooh, that looks interesting” only to find the variable was missing for my wave, which happened repeatedly. (For instance, I thought it would be fun to run Biblical literalism against the Israel feeling thermometer, but no dice, at least for the 2008 wave).
To get rid of these phantom variables, I wrote this little loop that drops variables with entirely missing data (or that are coded as strings, see below):
foreach varname of varlist * {
	quietly sum `varname'
	if `r(N)'==0 {
		drop `varname'
		disp "dropped `varname' for too much missing data"
	}
}
Unfortunately “sum” treats string variables as having no observations, so this will also drop strings. There’s a workaround, but it involves the “ds” command, which works in Stata 11 but has been deprecated and so may not work in future versions.
ds, not(type string)
foreach varname of varlist `r(varlist)' {
	quietly sum `varname'
	if `r(N)'==0 {
		drop `varname'
		disp "dropped `varname' for too much missing data"
	}
}
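For comparison, the same drop-the-empty-columns idea can be sketched in Python on a dict of columns (the variable names below are invented for illustration, and None stands in for Stata's missing value "."):

```python
# Keep only the columns that have at least one non-missing value,
# reporting each drop the way the Stata loop above does.
def drop_empty_columns(data):
    kept = {}
    for name, values in data.items():
        if all(v is None for v in values):
            print("dropped %s for too much missing data" % name)
        else:
            kept[name] = values
    return kept

gss = {'year': [2008, 2008], 'bible': [1, 3], 'israel': [None, None]}
print(sorted(drop_empty_columns(gss)))   # → ['bible', 'year']
```

Unlike the Stata version, this has no trouble with string columns, since it tests for missingness directly rather than relying on a numeric summary.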