Posts tagged ‘loops’

Recursively building graph commands

| Gabriel |

I really like multi-line graphs and scatterplots where the marker color or style reflects categories. Such graphs are more compact than a set of separate graphs, and they make it easier to compare across categories. The way you do this is with “twoway,” a lot of parentheses, and “if” conditions. For example:

twoway (kdensity x if year==1985) (kdensity x if year==1990) (kdensity x if year==1995) (kdensity x if year==2000) (kdensity x if year==2005), legend(order(1 "1985" 2 "1990" 3 "1995" 4 "2000" 5 "2005"))

Unfortunately such graphs can be difficult to script. This is especially so for the legends, which by default show the variable name rather than the selection criteria. I handle this by building up a local recursively, appending to it on each pass through the loop, which in the case of the legend involves embedded (compound) quote marks.

capture program drop multigraph
program define multigraph
	local var         `1'
	local interval	  `2'

	local command ""
	local legend ""
	local legendtick=1
	forvalues i=1985(`interval')2005 {
		local command "`command' (kdensity `var' if year==`i')"
		local legend `legend' `legendtick' `" `i' "'
		*" (this line is just a comment to rebalance the quote marks for syntax highlighting)
		local legendtick=`legendtick'+1
	}
	disp "twoway ""`command'" ", legend:" "`legend'"
	twoway `command' , legend(order(`legend'))
end

To replicate the hard-coded command above, you’d call it like this:

multigraph x 5
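
If you want the year range to be arguments as well rather than hard-coded, the same trick of accumulating the command in a local generalizes. Here is a minimal sketch (multigraph2 and its argument order are just hypothetical, and it assumes the same year variable as above):

capture program drop multigraph2
program define multigraph2
	local var       `1'
	local start     `2'
	local stop      `3'
	local interval  `4'

	local command ""
	local legend ""
	local legendtick=1
	forvalues i=`start'(`interval')`stop' {
		local command "`command' (kdensity `var' if year==`i')"
		local legend `legend' `legendtick' `" `i' "'
		*"
		local legendtick=`legendtick'+1
	}
	twoway `command' , legend(order(`legend'))
end

multigraph2 x 1985 2005 5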

March 8, 2012 at 9:56 am 2 comments

Scraping Using twitteR (updated)

| Gabriel |

Last time I described using the twitteR library for R. In that post I had R itself read over a list to loop. In this post I have the looping occur in Bash, with arguments passed to R through the commandArgs() function.

First, one major limitation of using Twitter is that it times you out for an hour after 150 queries. (You can double this if you use OAuth but I’ve yet to get that to work). For reasons I don’t really understand, getting one feed can mean multiple queries, especially if you’re trying to go far back in the timeline. For this reason you need to break up your list into a bunch of small lists and cron them at least 80 minutes apart. This bit of Bash code will split up a file called “list.txt” into several files. Also, to avoid trouble later on, it makes sure you have Unix EOL.

split -l 50 list.txt short_tw
perl -pi -e 's/\r\n/\n/g' short_tw*

The next thing to keep in mind is that you’ll need to pass arguments to R. Argument passing is when a script takes input from outside the script and processes it as variables. The enthusiastic use of argument passing in Unix is the reason why there is a fine line between a file and a command in that operating system.

In theory you could have R read the target list itself but this crashes when you hit your first dead URL. Running the loop from outside R makes it more robust but this requires passing arguments to R. I’d previously solved this problem by having Stata write an entire R script, which Stata understood as having variables (or “macros”) but which from R’s perspective was hard-coded. However I was recently delighted to discover that R can accept command-line arguments with the commandArgs() function. Not surprisingly, this is more difficult than $1 in Bash, @ARGV in Perl, or `1' in Stata, but it’s not that bad. To use it you have to use the “--args” option when invoking R and then inside of R you use the commandArgs() function to read the arguments into an array object, which behaves just like the @ARGV array in Perl.
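
A two-line test script makes the mechanics concrete (testargs.R is just a hypothetical name):

args <- commandArgs(trailingOnly = TRUE)  #keep only what follows --args
print(args)

Invoking it as R --vanilla --args foo bar < testargs.R prints [1] "foo" "bar", which is exactly the sort of array you can then index with args[1], args[2], and so on.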

Here’s an R script that accepts a Twitter screenname as a command-line argument, uses the twitteR library to collect that feed, and then saves it as a tab-delimited text file of the same name. (It appends if there’s an existing file). Also note that (thanks to commenters on the previous post) it turns internal EOL into regular spaces. It’s currently set to collect the last 200 tweets but you can adjust this with the third line (or you could rewrite the script to make this a command-line argument as well).

args <- commandArgs(trailingOnly = TRUE)
library(twitteR)
howmany <- 200 #how many past tweets to collect

user <- args[1]
outputfile <- paste('~/project/feeds/',user,'.txt',sep="")
print(user)
print(outputfile)

tmptimeline <- userTimeline(user,n=as.character(howmany))
tmptimeline.df <- twListToDF(tmptimeline)
tmptimeline.df$text <- gsub("\\n|\\r|\\t", " ", tmptimeline.df$text)
write.table(tmptimeline.df,file=outputfile,append=TRUE,sep="\t",col.names=FALSE)

quit()

To use the script to get just a single feed, you invoke it like this from the command-line.

R --vanilla --args asanews < datacollection.R

Of course the whole reason to write the script this way is to loop it over the lists. Here it is for the list “short_twaa”.

for i in `cat short_twaa`; do R --vanilla --args $i < datacollection.R ; done

Keep in mind that you’ll probably want to cron this, either because you want a running scrape or because it makes it easier to space out the “short_tw*” runs so you don’t get timed out.
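
For example, a crontab along these lines would run each sub-list once a day with roughly 90 minutes between runs (the paths and times here are placeholders; split names its output files short_twaa, short_twab, short_twac, and so on):

0 2 * * * cd ~/project && for i in `cat short_twaa`; do R --vanilla --args $i < datacollection.R ; done
30 3 * * * cd ~/project && for i in `cat short_twab`; do R --vanilla --args $i < datacollection.R ; done
0 5 * * * cd ~/project && for i in `cat short_twac`; do R --vanilla --args $i < datacollection.R ; done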

December 20, 2011 at 11:04 pm 1 comment

Another R bleg about loops and graph devices

| Gabriel |

I’m having another problem with R and was hoping somebody could help me out in the comments. Long story short, I can make graphs properly if I do them one at a time, but when I try to loop it I get device errors. In particular it creates the PDFs but they are either empty or corrupted.

Here is the log, which first shows the loop failing and then shows it working for one case where I manually assign the variables rather than looping.

> # File-Name:       surfacegraphs.R                 
> # Date:            2011-04-25
> # Author:          Gabriel Rossman                                       
> # Purpose:         graph from Stata
> # Packages Used:   lattice   
> # note, wireframe code from lisa 
> timestamp()
##------ Tue Apr 26 10:44:09 2011 ------##
> library(lattice)
> 
> histopath <- '~/Documents/project/histograms'
> image2 <- '~/Documents/project/images/histograms'
> 
> timestamp()
##------ Tue Apr 26 10:44:10 2011 ------##
> 
> #create surface histograms, showing how population evolves over time
> #  parameters held constant
> setwd(histopath)
> for(d in 0:10) {
+ 	for(p in 0:5) {
+ 		d10 <- d*10
+ 		p100 <- p*100
+ 		datafile <- paste(histopath,'/d',d10,'p',p100,'.txt', sep="")
+ 		dataobject <- read.table(file=datafile,header=TRUE)
+ 		pdfcolor <- paste(image2,'/hist_color_d',d10,'p',p100,'.pdf', sep="")
+ 		pdfgrey <- paste(image2,'/hist_grey_d',d10,'p',p100,'.pdf', sep="")
+ 		pdf(pdfcolor)
+ 		wireframe( dataobject$z~dataobject$x*dataobject$y, shade=TRUE) 
+ 		dev.off()
+ 		
+ 		pdf(pdfgrey)
+ 		wireframe( dataobject$z~dataobject$x*dataobject$y, shade=TRUE, par.settings=standard.theme(color=FALSE))
+ 		dev.off()
+ 	}
+ }
There were 50 or more warnings (use warnings() to see the first 50)
> timestamp()
##------ Tue Apr 26 10:44:12 2011 ------##
> 
> #loop doesn't work
> #  seems to be the dev.off()
> #try a few manually
> d10 <- 0
> p100 <- 0
> datafile <- paste(histopath,'/d',d10,'p',p100,'.txt', sep="")
> dataobject <- read.table(file=datafile,header=TRUE)
> pdfcolor <- paste(image2,'/hist_color_d',d10,'p',p100,'.pdf', sep="")
> pdfgrey <- paste(image2,'/hist_grey_d',d10,'p',p100,'.pdf', sep="")
> pdf(pdfcolor)
> wireframe( dataobject$z~dataobject$x*dataobject$y, shade=TRUE) 
> dev.off()
null device 
          1 
> 
> 
> timestamp()
##------ Tue Apr 26 10:44:14 2011 ------##
> 
> 

The warnings start like this and go on from there:

> warnings()
Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
3: In min(x) : no non-missing arguments to min; returning Inf
4: In max(x) : no non-missing arguments to max; returning -Inf
5: In min(x) : no non-missing arguments to min; returning Inf

Any ideas?
Do I just need to hard-code the graphs I really want rather than batching them?

[Update]
As Michal suggested, I needed to wrap wireframe in print. Here’s an example of the output (for a baseline simulation).
[Figure: hist_color_d0p0, the shaded surface histogram for the baseline simulation]
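
For anyone hitting the same thing: lattice functions like wireframe() return trellis objects that only draw when they are print()ed, which happens automatically at the interactive prompt but not inside a for loop. So inside the loop the two device blocks need to look something like this (same objects and file names as above):

pdf(pdfcolor)
print(wireframe(dataobject$z ~ dataobject$x * dataobject$y, shade=TRUE))
dev.off()

pdf(pdfgrey)
print(wireframe(dataobject$z ~ dataobject$x * dataobject$y, shade=TRUE, par.settings=standard.theme(color=FALSE)))
dev.off()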

April 27, 2011 at 4:51 am 7 comments

Shufflevar update

| Gabriel |

Thanks to Elizabeth Blankenspoor (Michigan) I corrected a bug with the “cluster” option in shufflevar. I’ve submitted the update to SSC but it’s also here:

*1.1 GHR January 24, 2011

*changelog
*1.1 -- fixed bug that let one case per "cluster" be misallocated (thanks to Elizabeth Blankenspoor)

capture program drop shufflevar
program define shufflevar
	version 10
	syntax varlist(min=1) [ , Joint DROPold cluster(varname)]
	tempvar oldsortorder
	gen `oldsortorder'=[_n]
	if "`cluster'"!="" {
		local bystatement "by `cluster': "
	}
	else {
		local bystatement ""
	}
	if "`joint'"=="joint" {
		tempvar newsortorder
		gen `newsortorder'=uniform()
		sort `cluster' `newsortorder'
		foreach var in `varlist' {
			capture drop `var'_shuffled
			quietly {
				`bystatement' gen `var'_shuffled=`var'[_n-1]
				`bystatement' replace `var'_shuffled=`var'[_N] if _n==1
			}
			if "`dropold'"=="dropold" {
				drop `var'
			}
		}
		sort `oldsortorder'
		drop `newsortorder' `oldsortorder'
	}
	else {
		foreach var in `varlist' {
			tempvar newsortorder
			gen `newsortorder'=uniform()
			sort `cluster' `newsortorder'
			capture drop `var'_shuffled
			quietly {
				`bystatement' gen `var'_shuffled=`var'[_n-1]
				`bystatement' replace `var'_shuffled=`var'[_N] if _n==1
			}
			drop `newsortorder'
			if "`dropold'"=="dropold" {
				drop `var'
			}
		}
		sort `oldsortorder'
		drop `oldsortorder'
	}
end
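
For reference, a minimal usage sketch (the auto dataset here is just an illustration): this creates mpg_shuffled and weight_shuffled, permuting the two variables jointly within levels of foreign.

sysuse auto, clear
shufflevar mpg weight, joint cluster(foreign)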

January 24, 2011 at 7:30 pm 6 comments

Keep the best 5 (updated)

| Gabriel |

Last year I mentioned my policy of assigning about seven quizzes and then keeping the best 5. I then had a real Rube Goldberg-esque workflow that involved piping to Perl. Several people came up with simpler ideas in the comments, but the biggest “why didn’t I think of that” moment was definitely John-Paul Ferguson’s suggestion to just use reshape. Now that I’m teaching the class again, I’ve rewritten the script to work on that logic.

Also, I’ve made the script a bit more flexible by letting you specify in the header how many quizzes were offered and how many to keep. To make this work I wrote a loop that builds a local called sumstring.

[UPDATE 11/29/2010, applied Nick Cox’s suggestions. Old code remains but is commented out]

local numberofquizzes 6
local keepbest 5

*import grades, which look like this
*uid    name    mt  q1  q2  q3
*5001   Joe     40  5   4   6
*4228   Alex    20  6   3   5
insheet using grades.txt, clear
*rescale the quizzes from raw points to proportion 
forvalues qnum=1/`numberofquizzes' {
	quietly sum q`qnum'
	replace q`qnum'=q`qnum'/`r(max)'
}
/*
*build the sumstring local (original code)
local sumstring ""
forvalues i=1/`keepbest' {
	local sumstring "`sumstring' + q`i'"
	disp "`sumstring'"
	local sumstring=subinstr("`sumstring'","+","",1)
	disp "`sumstring'"
}
*/
*reshape long, keep top few quizzes
reshape long q, i( notes uid name mt) j(qnum)
recode q .=0
gsort uid -q
by uid: drop if _n>`keepbest'
by uid: replace qnum=_n
*reshape wide, calc average
reshape wide q, i(notes uid name mt) j(qnum)
*build the sumstring local (w/ Nick Cox's suggestions)
unab sumstring : q* 
disp "`sumstring'"
local sumstring : subinstr local sumstring " " "+", all
disp "`sumstring'"
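*with the header values above (keepbest of 5), sumstring is now "q1+q2+q3+q4+q5"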
gen q_avg=(`sumstring')/`keepbest'
sort name
sum q_avg

*have a nice day

November 24, 2010 at 4:24 am 4 comments

Stata Programming Lecture [updated]

| Gabriel |

I gave my introduction to Stata programming lecture again. This time the lecture was cross-listed between my graduate statistics course and the UCLA ATS faculty seminar series. Here are the lecture notes.

October 14, 2010 at 3:50 pm 3 comments

Thanks for the Aspirin Guys

| Gabriel |

In a recent post, I lamented that I couldn’t figure out how to do loops and a few other things in R. The script at issue was intended to create a slideshow of network graphs, which I’d then convert to an animated gif with ImageMagick. With some help from Brian and Kieran, I ultimately got it to work; the results are in this post (as well as in a presentation I gave at a mini-conference a few days later).

Kieran very generously further improved on the code at his own blog using simulated data. Kieran’s code has a certain elegance to it, but it’s an R style of elegance based on objects and lists and the like so it’s not completely intuitive to me, despite his thorough commenting. Anyway, I worked his suggestions into my code (see below) and it works great.

Since Kieran was “interested to see whether this approach was at all faster,” I ran the old version and the new version with timestamps. (To make it fair, I commented out the “pdf” version of the flipbook, which only appears in the “old” code; this saves about one second.) It turns out they’re almost exactly the same speed, about 26 or 27 seconds (it varies by a second or two every time I do it, sometimes new is faster, sometimes old). The vast majority of that half minute is taken up by ImageMagick. The actual R code only takes about 7 seconds to execute in either version. I think that’s amazingly fast to import 60 or so datasets (albeit small ones), generate two different Fruchterman-Reingold layouts, apply the layouts to 60 or so graphs (albeit graphs written directly to disk rather than to screen), and do miscellaneous housekeeping. I haven’t run a timestamped comparison, but my impression is that this is much faster than comparable operations in Pajek or NWB and appreciably faster than doing scatterplots in Stata.

# File-Name:       chrnetwork.R                 
# Date:            2010-03-11
# Created Date:    2009-11-24                               
# Author:          Gabriel Rossman (w a fair amount of help from BR and KJH)           
# Purpose:         graph CHR station network
# Data Used:       ties_bounded.net
# Packages Used:   igraph plyr   
timestamp()
library(igraph)
library(plyr)
setwd("~/Documents/Sjt/radio/survey")
#ties -- including ties to non-top40 (who can't reciprocate)
chrnet <- read.graph("ties.net", c("pajek"))
pdf("~/Documents/book/images/chrnetwork.pdf")
 plot.igraph(chrnet, layout=layout.fruchterman.reingold, vertex.size=4, vertex.label=NA, vertex.color="red", edge.color="gray20", edge.arrow.size=0.3, margin=0)
dev.off()
#ties bounded to only top 40, includes adoption time color-codes, but use is optional
chrnetbounded <- read.graph("ties_bounded_humpcolor.net", c("pajek"))
la = layout.fruchterman.reingold(chrnetbounded)  #create layout for use on several related graphs
#graph structure only
pdf("~/Documents/book/images/chrnetworkbounded.pdf")
 plot.igraph(chrnetbounded, layout=la, vertex.size=4, vertex.label=NA, vertex.color="red", edge.color="gray20", edge.arrow.size=0.3, margin=0)
dev.off()
#graph color coded diffusion
pdf("~/Documents/book/images/chrnetworkboundedcolor.pdf")
 plot.igraph(chrnetbounded, layout=la, vertex.size=4, vertex.label=NA, edge.color="gray80", edge.arrow.size=0.3, margin=0)
dev.off()
#flipbook
timestamp()
setwd("~/Documents/Sjt/radio/survey/flipbook")
filenames <- list.files(getwd(), pattern="\\.net$")
ind <- order(as.numeric(gsub("[^[:digit:]]", "", filenames)))
filenames <- filenames[ind]
g.list <- llply(filenames, read.graph, format="pajek")
png(file="~/Documents/book/images/flipbook/chrnet_hc%d.png", width=600, height=600, units="px") 
l_ply(g.list, plot, layout=la, vertex.size=4, vertex.label=NA, edge.color="gray60", edge.arrow.size=0.3, margin=0)
dev.off()
timestamp()
### Finally, generate the gif without having to renumber the files
### individually.
png.filenames <- list.files("~/Documents/book/images/flipbook/", pattern="\\.png$")
timestamp()
## Resort properly again
ind <- order(as.numeric(gsub("[^[:digit:]]", "", png.filenames)))
png.filenames <- png.filenames[ind]
png.filenames <- paste("~/Documents/book/images/flipbook/", png.filenames, sep="") # add the path back
png.string <- capture.output(cat(png.filenames))
## Open a pipe to the shell to execute the convert command
## directly.
timestamp()
gifpipe <- pipe(paste("convert", png.string, "~/Documents/book/images/flipbook/chrnet_humps.gif", sep=" "), "w")
close(gifpipe)
timestamp()

March 16, 2010 at 4:31 am 2 comments
