Posts tagged ‘R’

Another R bleg about loops and graph devices

| Gabriel |

I’m having another problem with R and was hoping somebody could help me out in the comments. Long story short, I can make graphs properly if I do them one at a time, but when I try to loop over them I get device errors. In particular, the loop creates the PDFs, but they are either empty or corrupted.

Here is the log. It first shows the loop failing, then a successful run for one case where I manually assign the variables rather than looping.

> # File-Name:       surfacegraphs.R                 
> # Date:            2011-04-25
> # Author:          Gabriel Rossman                                       
> # Purpose:         graph from Stata
> # Packages Used:   lattice   
> # note, wireframe code from lisa 
> timestamp()
##------ Tue Apr 26 10:44:09 2011 ------##
> library(lattice)
> 
> histopath <- '~/Documents/project/histograms'
> image2 <- '~/Documents/project/images/histograms'
> 
> timestamp()
##------ Tue Apr 26 10:44:10 2011 ------##
> 
> #create surface histograms, showing how population evolves over time
> #  parameters held constant
> setwd(histopath)
> for(d in 0:10) {
+ 	for(p in 0:5) {
+ 		d10 <- d*10
+ 		p100 <- p*100
+ 		datafile <- paste(histopath,'/d',d10,'p',p100,'.txt', sep="")
+ 		dataobject <- read.table(file=datafile,header=TRUE)
+ 		pdfcolor <- paste(image2,'/hist_color_d',d10,'p',p100,'.pdf', sep="")
+ 		pdfgrey <- paste(image2,'/hist_grey_d',d10,'p',p100,'.pdf', sep="")
+ 		pdf(pdfcolor)
+ 		wireframe( dataobject$z~dataobject$x*dataobject$y, shade=TRUE) 
+ 		dev.off()
+ 		
+ 		pdf(pdfgrey)
+ 		wireframe( dataobject$z~dataobject$x*dataobject$y, shade=TRUE, par.settings=standard.theme(color=FALSE))
+ 		dev.off()
+ 	}
+ }
There were 50 or more warnings (use warnings() to see the first 50)
> timestamp()
##------ Tue Apr 26 10:44:12 2011 ------##
> 
> #loop doesn't work
> #  seems to be the dev.off()
> #try a few manually
> d10 <- 0
> p100 <- 0
> datafile <- paste(histopath,'/d',d10,'p',p100,'.txt', sep="")
> dataobject <- read.table(file=datafile,header=TRUE)
> pdfcolor <- paste(image2,'/hist_color_d',d10,'p',p100,'.pdf', sep="")
> pdfgrey <- paste(image2,'/hist_grey_d',d10,'p',p100,'.pdf', sep="")
> pdf(pdfcolor)
> wireframe( dataobject$z~dataobject$x*dataobject$y, shade=TRUE) 
> dev.off()
null device 
          1 
> 
> 
> timestamp()
##------ Tue Apr 26 10:44:14 2011 ------##
> 
> 

The warnings start like this and go on from there:

> warnings()
Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
3: In min(x) : no non-missing arguments to min; returning Inf
4: In max(x) : no non-missing arguments to max; returning -Inf
5: In min(x) : no non-missing arguments to min; returning Inf

Any ideas?
Do I just need to hard-code the graphs I really want rather than batching them?

[Update]
As Michal suggested, I needed to wrap wireframe in print. Here’s an example of the output (for a baseline simulation).
[Image: hist_color_d0p0]
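For the record, the underlying issue is a well-known R gotcha (it’s in the R FAQ): lattice functions like wireframe() return trellis objects, and outside the interactive top level (e.g., inside a loop or a sourced script) nothing is drawn unless you explicitly print() the object. A minimal sketch of the corrected loop body:

pdf(pdfcolor)
#print() forces the trellis object to render to the open pdf device
print(wireframe(dataobject$z ~ dataobject$x * dataobject$y, shade=TRUE))
dev.off()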

April 27, 2011 at 4:51 am 7 comments

Growl in R and Stata

| Gabriel |

Growl is a system notification tool for the Mac that lets applications, the system itself, or hardware display brief notifications, usually in the top-right corner. The translucent floating look reminds me of the more recent versions of KDE.

Anyway, one of the things it’s good for is letting programs that run in the background tell you when something noteworthy has happened. Of course, a large statistics batch would qualify. I was running a 10-minute job in the R package igraph and got tired of checking to see when it was done, so I found this tip. In a nutshell, it says to download Growl, including the command-line tool GrowlNotify from the “Extras” folder, and then create this R function.

growl <- function(m = 'Hello world') {
	system(paste("growlnotify -a R -m '", m, "' -t 'R is calling'; echo '\a' ", sep=""))
}
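Then at the end of a long-running script you just call it. A minimal usage sketch (the simulation is a stand-in for whatever slow job you’re actually running, and of course growlnotify must be installed):

results <- replicate(1000, mean(rnorm(10000)))  #stand-in for a long computation
growl('Simulation finished')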

Since the R function “system()” is equivalent to the Stata command “shell,” I realized this would work in Stata as well, so I wrote this ado-file.* You call it just by typing “growl”. It takes as an (optional) argument whatever you’d like to see displayed in Growl, such as “Done with first analysis” or “All finished,” but by default it displays “Stata needs attention.” Note that while Stata already bounces in the dock when it completes a script or hits an error, you can also have Growl appear at various points during the run of a script.

I’ve only tested it with StataMP, but I’d appreciate it if people who use Growl and other versions of Stata would post their results in the comments. If it proves robust I’ll submit it to SSC.

Here’s the Stata code. The most important line is #32; if you were doing it by hand you’d do it as a one-liner, but the rest of the code allows argument passing and compatibility with the different flavors (Small, IC, SE, MP) of Stata:

*1.0 GHR Jan 19, 2011
capture program drop growl
program define growl
	version 10
	set more off
	syntax [anything]

	if "`anything'"=="" {
		local message "Stata needs attention"		
	}
	else {
		local message "`anything'"
	}
	
	local appversion "Stata"
	if "`c(flavor)'"=="Small" {
		local appversion "smStata"
	}
	else {
		if `c(SE)'==1 {
			if `c(MP)'==1 {
				local appversion "StataMP"
			}
			else {
				local appversion "StataSE"
			}
		}
		else {
			local appversion "Stata"
		}
	}
	shell growlnotify -a `appversion' -m "`message'" -t "`appversion'"
end

*Most programming/scripting languages and a few GUI applications have a similar system call and you can likewise get Growl to work with them. For instance, Perl also has a function called system(). In shell scripting of course you can just use the “growlnotify” command directly.

January 20, 2011 at 5:10 am 10 comments

Some ways Stata is an unusual language

| Gabriel |

As I’ve tried to learn other languages, I’ve realized that part of the difficulty isn’t that they’re hard (although in some cases they are) but that I’m used to Stata’s very distinctive paradigm and nomenclature. Some aspects of Stata are pretty standard (e.g., “while”/”foreach”/”forvalues” loops, log files, and the “file” syntax for using text files on disk), but other bits are pretty strange. Or rather, they’re strange from a computer science perspective but intuitive from a social science perspective.

Stata seems to have been designed to make sense to social scientists, and if this makes it confusing to programmers, then so be it. A simple example of this is that Stata uses the word “variable” in the sense meant by social scientists. More broadly, Stata is pretty bold about defaults so as to make things easy for beginners. It presumes that anything you’re doing applies to the dataset (aka the master data), which is always a flat-file database. Other things that might be held in memory have a secondary status, and beginning users don’t even know that they’re there. Likewise, commands distinguish between the important arguments (usually variables) and the secondary arguments, which Stata calls “options”. There are also very sensible assumptions about what to report and what to put in ephemeral data objects that can be accessed immediately after the primary command (but need not be stored as part of the original command, as they would be in most other languages).

Note, I’m not complaining about any of this. Very few of Stata’s quirks are pointlessly arbitrary. (The only arbitrary deviation I can think of is using “*” instead of “#” for commenting.) Most of Stata’s quirks are necessary in order to make it so user-friendly to social scientists. In a lot of ways R is a more conventional language than Stata, but most social scientists find Stata much easier to learn. In part that’s because Stata is willing to deviate from the conventions of general-purpose programming languages: running and interpreting a regression in Stata looks like “reg y x” instead of “summary(lm(y~x))”, and loading a dataset looks like “use mydata, clear” instead of “data <- read.table('mydata.txt')”. Stata has some pretty complicated syntax (e.g., the entire Mata language) but you can get a lot done with just a handful of simple commands like “use,” “gen,” and “reg”.

Nonetheless, all this means that when native Stata speakers like me learn a second programming language, it can be a bit confusing. And FWIW, I worry that rumored improvements to Stata (such as allowing relational data in memory) will detract from its user-friendliness. Anyway, the point is that I love Stata and I think it’s entirely appropriate for social scientists to learn it first. I do most of my work in Stata and I teach/mentor my graduate students in Stata unless there’s a specific reason for them to learn something else. At the same time, I know that many social scientists would benefit a lot from also learning other languages. For instance, people into social networks should learn R, people who want to do content analysis should learn Perl or Python, and people who want to do simulations should learn NetLogo or Java. The thing is that when you do, you’re in for a culture shock, so here I’m making explicit some ways in which Stata is weird.

Do-files and Ado-files. In any other language a do-file would be called a script and an ado-file would be called a library. Also note that Stata very conveniently reads all your ado-files automatically, whereas most other languages require you to specifically load the relevant libraries into memory at the beginning of each script.

Commands, Programs, and Functions. In Stata a program is basically just a command that you wrote yourself. Stata is somewhat unusual in drawing a distinction between a command/program and a function. So in Stata a function usually means some kind of transformation that attaches its output to a variable or macro, as in “gen ln_income=log(income)”. In contrast, a command/program is pretty much anything that doesn’t directly attach to an operator, including all file operations (e.g., “use”) and estimations (e.g., “regress”). Other languages don’t really draw this distinction but consider everything a function, no matter what it does and whether the user wrote it or not. (Some languages use “primitive” to mean something like the Stata command vs. program distinction, but it’s not terribly important.)

Because most languages only have functions, pretty much everything has to be assigned to an object via an operator. Hence Stata users would usually type “reg y x” whereas R users would usually type “myregression <- lm(y~x)”. This is because “regress” in Stata is a command whereas “lm()” in R is a function. Also note that Stata distinguishes commands from everything else by word order, with the command being the first word. In contrast, functions in other languages (just like Stata functions) put the function name outside the parentheses and all of the arguments, both data objects and options, inside them.

The Dataset. Stata is one of the only languages where it’s appropriate to use the definite article in reference to data. (NetLogo is arguably another case of this). In other languages it’s more appropriate to speak of “a data object” than “the dataset,” even if there only happens to be one data object in memory. For the same reason, most languages don’t “use” or “open” data, but “read” the data and assign it to an object. Another way to think about it is that only Stata has a “dataset” whereas other languages only have “matrices.” Of course, Stata/Mata also has matrices but most Stata end users don’t bother with them as they tend to be kind of a backend thing that’s usually handled by ado-files. Furthermore, in other languages (e.g., Perl) it’s common to not even load a file into memory but to process it line-by-line, which in Stata terms is kind of like a cross between the “file read/write” syntax and a “while” loop.
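To make that concrete, here’s a minimal sketch of line-by-line processing in R, the same idea as Perl’s while(<FILE>) idiom (the filename is hypothetical):

con <- file("bigfile.txt", open="r")
while (length(line <- readLines(con, n=1)) > 0) {
	#process one line at a time without ever loading the whole file
}
close(con)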

Variables. Stata uses the term “variable” in the statistical or social scientific meaning of the term. In other languages this would usually be called a field or vector.

Macros. What most other languages call variables, Stata calls local and global “macros.” Stata’s usage of the local vs global distinction is standard. In other languages the concept of “declaring” a variable is usually a little more explicit than it is in Stata.

Stata is extremely good about expanding macros in situ, and this can spoil us Stata users. In other languages you often have to do a somewhat crude workaround: first use some kind of concatenate function to create a string object containing the expansion, and then use that string object. For instance, if you wanted to access a series of numbered files in Stata you could just loop over this:

use ~/project/file`i', clear 

In other languages you’d have to add a separate line for the expansion. So in R you’d loop over:

filename <- paste('~/project/file',i, sep="")
data <- read.table(filename)
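For completeness, the full R loop would look something like this (a sketch; the range and path are hypothetical):

for (i in 1:10) {
	filename <- paste('~/project/file', i, sep="")
	data <- read.table(filename)
	#...do something with data...
}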

[Update: Also see this Statalist post by Nick Cox on the distinction between variables and macros]

Reporting. Stata allows you to pass estimations on for further work (that’s what return macros, ereturn matrices, and postestimation commands are all about), but it assumes you probably won’t and so it is unusually generous in reporting most of the really interesting things after a command. In other languages you usually have to specifically ask to get this level of reporting. Another way to put it is that in Stata verbosity is assumed by default and can be suppressed with “quietly,” whereas in R silence is assumed by default and verbosity can be invoked by wrapping the estimation (or an object saving the estimation) in the “summary()” function.
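A minimal R illustration with simulated data:

mydata <- data.frame(x=rnorm(100))
mydata$y <- 2*mydata$x + rnorm(100)
fit <- lm(y ~ x, data=mydata)  #silent: the estimation is just stored in an object
summary(fit)                   #verbose: prints coefficients, standard errors, R-squared, etc.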

August 6, 2010 at 4:36 am 12 comments

Using R to parse (a lot of) HTML tables

| Gabriel |

For a few months I’ve been doing a daily scrape of a website but I’ve put off actually parsing the data until a colleague was dealing with a similar problem, and solving his problem reminded me of my problem. The scrape creates a folder named after the date with several dozen html files in it. So basically, the data is stored like this:

project/raw/
  20100601/
    page1.htm
    page2.htm
  20100602/
    page1.htm
    page2.htm

Each html file has one main table along with a couple of sidebar tables. For each html file, I want to extract the main table and write it to a text file. These text files will be put in a “clean” directory that mirrors the “raw” directory.

This is the kind of thing most people would do in Perl (or Python). I had trouble getting the Perl HTML libraries to load although I probably could have coded it from scratch since HTML table structure is pretty simple (push the contents of <td> tags to an array, then write it out and clear the memory when you hit a </tr> tag). In any case, I ended up using R’s XML library, which is funny because usually I clean data in Perl or Stata and use R only as a last resort. Nonetheless, in what is undoubtedly a sign of the end times, here I am using R for cleaning. Forty years of darkness; The dead rising from the grave; Cats and dogs living together; Mass hysteria!
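For what it’s worth, the from-scratch approach described above would look something like this (a rough sketch in R rather than Perl, assuming a file like page1.htm):

html <- paste(readLines("page1.htm"), collapse="")
rows <- unlist(strsplit(html, "</tr>"))  #a table row ends at each </tr>
cells <- lapply(rows, function(r) {
	tds <- regmatches(r, gregexpr("<td[^>]*>.*?</td>", r, perl=TRUE))[[1]]
	gsub("<[^>]+>", "", tds)  #strip the tags, keeping the cell contents
})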

Anyway, the first step is to get a list of the directories in “raw” and use that to seed the top level loop. (Though note that R’s XML library can also read data directly off the web). Within this loop I create a clean subdirectory to mirror the raw subdirectory. I then get a list of every file in the raw subdirectory and seed the lower level loop. The lower level loop reads each file with “readHTMLTable” and writes it out to the mirroring clean subdirectory. Then I come out of both loops and don’t really care if the top is still spinning.

# File-Name:       websiteclean.R
# Date:            2010-07-28
# Author:          Gabriel Rossman
# Purpose:         parse the scraped files
# Packages Used:   XML

timestamp()
library(XML)
parentpath<-"~/Documents/project"
rawdir<-paste(parentpath,"/raw",sep="")
setwd(rawdir)
dirlist <- list.files()
for (dir in dirlist) {
	setwd(rawdir)
	setwd(dir)
	filenames <- list.files()
	cleandir<-paste(parentpath,'/clean/',dir, sep="") #create ../../clean/`dir' and call `cleandir'
	shellcommand<-paste("mkdir ",cleandir, sep="")
	system(shellcommand)
	print(cleandir) #progress report
	for (targetfile in filenames) {
		setwd(rawdir)
		setwd(dir)
	datafromtarget <- readHTMLTable(targetfile, header=FALSE)
		outputfile<-paste(targetfile,'.txt', sep="")
		setwd(cleandir)
		write.table(datafromtarget[1], file = outputfile , sep = "\t", quote=TRUE)  #when writing out, limit to subobject 1 to avoid the sidebar tables
	}
}

# have a nice day

July 29, 2010 at 4:58 am 2 comments

importspss.ado (requires R)

| Gabriel |

Mike Gruszczynski has a post up pointing out that you can use R to translate files, for instance from SPSS to Stata. I like this a lot because it lets you avoid using SPSS, but I’d like it even better if it let you avoid using R as well.

As such, I rewrote the script to work entirely from Stata. Mike wanted to do this in Bash but couldn’t figure out how to pass arguments from the shell to R. Frankly, I don’t know how to do this either, which is why my solution is to have Stata write and execute an R source file so that all the argument passing occurs within Stata. This follows my general philosophy of doing a lot of code mise en place in a user-friendly language so I can spend as little time as necessary in R. (Note that you could just as easily write this in Bash, but I figured this way you can a) make it cross-platform and b) attach it to “use” for a one-stop-shop “import” command.)

*importspss.ado
*by GHR 6/29/2010
*this script uses R to translate SPSS to Stata
*it takes as arguments the SPSS file and Stata file
*adapted from http://mikegruz.tumblr.com/post/704966440/convert-spss-to-stata-without-stat-transfer 

*DEPENDENCY: R and library(foreign) 
*if R exists but is not in PATH, change the reference to "R" in line 27 to be the specific location

capture program drop importspss
program define importspss
	set more off
	local spssfile `1'
	if "`2'"=="" {
		local statafile "`spssfile'.dta"
	}
	else {
		local statafile `2'	
	}
	local sourcefile=round(runiform()*1000)
	capture file close rsource
	file open rsource using `sourcefile'.R, write text replace
	file write rsource "library(foreign)" _n
	file write rsource `"data <- read.spss("`spssfile'", to.data.frame=TRUE)"' _n
	file write rsource `"write.dta(data, file="`statafile'")"' _n
	file close rsource
	shell R --vanilla <`sourcefile'.R
	erase `sourcefile'.R
	use `statafile', clear
end

June 29, 2010 at 3:01 pm 6 comments

Thanks for the Aspirin Guys

| Gabriel |

In a recent post, I lamented that I couldn’t figure out how to do loops and a few other things in R. The script at issue was intended to create a slideshow of network graphs, which I’d then use ImageMagick to convert into an animated gif. With some help from Brian and Kieran, I ultimately got it to work, and the results are in this post (as well as in a presentation at a mini-conference I gave a few days later).

Kieran very generously further improved on the code at his own blog using simulated data. Kieran’s code has a certain elegance to it, but it’s an R style of elegance based on objects and lists and the like so it’s not completely intuitive to me, despite his thorough commenting. Anyway, I worked his suggestions into my code (see below) and it works great.

Since Kieran was “interested to see whether this approach was at all faster,” I ran the old version and the new version with timestamps. (To make it fair, I commented out the “pdf” version of the flipbook, which only appears in the “old” code; this saves about one second.) It turns out they’re almost exactly the same speed: about 26 or 27 seconds (it varies by a second or two every time I run it; sometimes new is faster, sometimes old). The vast majority of that half minute is taken up by ImageMagick. The actual R code only takes about 7 seconds to execute in either version. I think that’s amazingly fast to import 60 or so datasets (albeit small ones), generate two different Fruchterman-Reingold layouts, apply the layouts to 60 or so graphs (albeit graphs that are written directly to disk rather than to screen), and do miscellaneous housekeeping. I haven’t run a timestamped comparison, but my impression is that this is much faster than comparable operations in Pajek or NWB and appreciably faster than doing scatterplots in Stata.

# File-Name:       chrnetwork.R                 
# Date:            2010-03-11
# Created Date:    2009-11-24                               
# Author:          Gabriel Rossman (w a fair amount of help from BR and KJH)           
# Purpose:         graph CHR station network
# Data Used:       ties_bounded.net
# Packages Used:   igraph plyr   
timestamp()
library(igraph)
library(plyr)
setwd("~/Documents/Sjt/radio/survey")
#ties -- including ties to non-top40 (who can't reciprocate)
chrnet <- read.graph("ties.net", c("pajek"))
pdf("~/Documents/book/images/chrnetwork.pdf")
 plot.igraph(chrnet, layout=layout.fruchterman.reingold, vertex.size=4, vertex.label=NA, vertex.color="red", edge.color="gray20", edge.arrow.size=0.3, margin=0)
dev.off()
#ties bounded to only top 40, includes adoption time color-codes, but use is optional
chrnetbounded <- read.graph("ties_bounded_humpcolor.net", c("pajek"))
la <- layout.fruchterman.reingold(chrnetbounded)  #create layout for use on several related graphs
#graph structure only
pdf("~/Documents/book/images/chrnetworkbounded.pdf")
 plot.igraph(chrnetbounded, layout=la, vertex.size=4, vertex.label=NA, vertex.color="red", edge.color="gray20", edge.arrow.size=0.3, margin=0)
dev.off()
#graph color coded diffusion
pdf("~/Documents/book/images/chrnetworkboundedcolor.pdf")
 plot.igraph(chrnetbounded, layout=la, vertex.size=4, vertex.label=NA, edge.color="gray80", edge.arrow.size=0.3, margin=0)
dev.off()
#flipbook
timestamp()
setwd("~/Documents/Sjt/radio/survey/flipbook")
filenames <- list.files(getwd(), pattern="\\.net$")
ind <- order(as.numeric(gsub("[^[:digit:]]", "", filenames)))
filenames <- filenames[ind]
g.list <- llply(filenames, read.graph, format="pajek")
png(file="~/Documents/book/images/flipbook/chrnet_hc%d.png", width=600, height=600, units="px") 
l_ply(g.list, plot, layout=la, vertex.size=4, vertex.label=NA, edge.color="gray60", edge.arrow.size=0.3, margin=0)
dev.off()
timestamp()
### Finally, generate the gif without having to renumber the files
### individually.
png.filenames <- list.files("~/Documents/book/images/flipbook/", pattern="\\.png$")
timestamp()
## Resort properly again
ind <- order(as.numeric(gsub("[^[:digit:]]", "", png.filenames)))
png.filenames <- png.filenames[ind]
png.filenames <- paste("~/Documents/book/images/flipbook/", png.filenames, sep="") # add the path back
png.string <- capture.output(cat(png.filenames))
## Open a pipe to the shell to execute the convert command
## directly.
timestamp()
gifpipe <- pipe(paste("convert", png.string, "~/Documents/book/images/flipbook/chrnet_humps.gif", sep=" "), "w")
close(gifpipe)
timestamp()

March 16, 2010 at 4:31 am 2 comments

Network slideshow

| Gabriel |

Now that I’ve gotten R and igraph to make a set of 53 png files (see yesterday’s post), the next step is animating them. I did this using the command-line tool ImageMagick, which I installed using Fink, the (buggy) Mac version of the Debian package manager. Once ImageMagick is installed, I can do everything directly from within R using system(). To accomplish this, I just added these lines of code to the end of yesterday’s script. The “mv” commands are necessary because ImageMagick has a naive view of alphabetical order (a looped version of the renaming appears after the code block).

#create animated gif in image magick
setwd("~/Documents/book/images/")
system("mv chrnet_hc0.png chrnet_hc00.png")
system("mv chrnet_hc1.png chrnet_hc01.png")
system("mv chrnet_hc2.png chrnet_hc02.png")
system("mv chrnet_hc3.png chrnet_hc03.png")
system("mv chrnet_hc4.png chrnet_hc04.png")
system("mv chrnet_hc5.png chrnet_hc05.png")
system("mv chrnet_hc6.png chrnet_hc06.png")
system("mv chrnet_hc7.png chrnet_hc07.png")
system("mv chrnet_hc8.png chrnet_hc08.png")
system("mv chrnet_hc9.png chrnet_hc09.png")
system("convert *.png chrnet_humps.gif")

Here are the results. Vertices are stations, which turn black when the station has begun playing “My Humps” by the Black Eyed Peas. Yellow vertices have missing data on airplay (true missing data, not just right-censored). The graph layout is based on directed nominations from a survey, so vertices near each other have low path length, but I hid the actual edges to preserve some privacy about the stations’ social network ties. My substantive interpretation of this animation (and a comparable event history) is that the network doesn’t really matter and the endogenous cascade is based on attention to aggregate peer behavior rather than the behavior of specific alters.

Note that you may have to click on the image to see the animation.

March 1, 2010 at 1:35 pm 1 comment
