Posts tagged ‘R’

importspss.ado (requires R)

| Gabriel |

Mike Gruszczynski has a post up pointing out that you can use R to translate files, for instance from SPSS to Stata. I like this a lot because it let’s you avoid using SPSS but I’d like it even better if it let you avoid using R as well.

As such I rewrote the script to work entirely from Stata. Mike wanted to do this in Bash but couldn’t figure out how to pass arguments from the shell to R. Frankly, I don’t know how to do this either which is why my solution is to have Stata write and execute an R source file so all the argument passing occurs within Stata. This follows my general philosophy of doing a lot of code mise en place in a user-friendly language so I can spend as little time as necessary in R. (Note that you could just as easily write this in Bash, but I figured this way you can a) make it cross-platform and b) attach it to “use” for a one-stop shop “import” command).

*by GHR 6/29/2010
*this script uses R to translate SPSS to Stata
*it takes as arguments the SPSS file and Stata file
*adapted from 

*DEPENDENCY: R and library(foreign) 
*if R exists but is not in PATH, change the reference to "R" in line 27 to be the specific location

capture program drop importspss
program define importspss
	set more off
	local spssfile `1'
	if "`2'"=="" {
		local statafile "`spssfile'.dta"
	else {
		local statafile `2'	
	local sourcefile=round(runiform()*1000)
	capture file close rsource
	file open rsource using `sourcefile'.R, write text replace
	file write rsource "library(foreign)" _n
	file write rsource `"data <- read.spss("`spssfile'","' _n
	file write rsource `"write.dta(data, file="`statafile'")"' _n
	file close rsource
	shell R --vanilla <`sourcefile'.R
	erase `sourcefile'.R
	use `statafile', clear

June 29, 2010 at 3:01 pm 6 comments

Thanks for the Aspirin Guys

| Gabriel |

In a recent post, I lamented that I couldn’t figure out how to do loops and a few other things in R. The script at issue was intended to create a slideshow of network graphs which I’d then use Image Magick to convert to an animated gif. With some help from Brian and Kieran, I ultimately got it to work and the results are in this post (as well as a presentation at a mini-conference I made a few days later).

Kieran very generously further improved on the code at his own blog using simulated data. Kieran’s code has a certain elegance to it, but it’s an R style of elegance based on objects and lists and the like so it’s not completely intuitive to me, despite his thorough commenting. Anyway, I worked his suggestions into my code (see below) and it works great.

Since Kieran was “interested to see whether this approach was at all faster,” I ran the old version and the new version with timestamps. (To make it fair, I commented out the “pdf” version of the flipbook, which only appears in the “old” code — this saves about one second). Turns out they’re almost exactly the same speed — about 26 or 27 seconds (it varies about a second or two every time I do it, sometimes new is faster, sometimes old). The vast majority of that half a minute is taken up by Image Magick. The actual R code only takes about 7 seconds to execute in either version. I think that’s amazingly fast to import 60 or so datasets (albeit small datasets), generate two different FR layouts, apply the layouts to 60 or so graphs (albeit graphs that are written directly to disk rather than to screen), and do miscellaneous housekeeping. I haven’t run a timestamped comparison, but my impression is that this is much faster than comparable operations in Pajek or NWB and appreciably faster than doing scatterplots in Stata.

# File-Name:       chrnetwork.R                 
# Date:            2010-03-11
# Created Date:    2009-11-24                               
# Author:          Gabriel Rossman (w a fair amount of help from BR and KJH)           
# Purpose:         graph CHR station network
# Data Used:
# Packages Used:   igraph plyr   
#ties -- including ties to non-top40 (who can't reciprocate)
chrnet <- read.graph("", c("pajek"))
 plot.igraph(chrnet, layout=layout.fruchterman.reingold, vertex.size=4, vertex.label=NA, vertex.color="red", edge.color="gray20", edge.arrow.size=0.3, margin=0)
#ties bounded to only top 40, includes adoption time color-codes, but use is optional
chrnetbounded <- read.graph("", c("pajek"))
la = layout.fruchterman.reingold(chrnetbounded)  #create layout for use on several related graphs
#graph structure only
 plot.igraph(chrnetbounded, layout=la, vertex.size=4, vertex.label=NA, vertex.color="red", edge.color="gray20", edge.arrow.size=0.3, margin=0)
#graph color coded diffusion
 plot.igraph(chrnetbounded, layout=la, vertex.size=4, vertex.label=NA, edge.color="gray80", edge.arrow.size=0.3, margin=0)
filenames <- list.files(getwd(), pattern="\\.net$")
ind <- order(as.numeric(gsub("[^[:digit:]]", "", filenames)))
filenames <- filenames[ind]
g.list <- llply(filenames, read.graph, format="pajek")
png(file="~/Documents/book/images/flipbook/chrnet_hc%d.png", width=600, height=600, units="px") 
l_ply(g.list, plot, layout=la, vertex.size=4, vertex.label=NA, edge.color="gray60", edge.arrow.size=0.3, margin=0)
### Finally, generate the gif without having to renumber the files
### individually.
png.filenames <- list.files("~/Documents/book/images/flipbook/", pattern="\\.png$")
## Resort properly again
ind <- order(as.numeric(gsub("[^[:digit:]]", "", png.filenames)))
png.filenames <- png.filenames[ind]
png.filenames <- paste("~/Documents/book/images/flipbook/", png.filenames, sep="") # add the path back
png.string <- capture.output(cat(png.filenames))
## Open a pipe to the shell to execute the convert command
## directly.
gifpipe <- pipe(paste("convert", png.string, "~/Documents/book/images/flipbook/chrnet_humps.gif", sep=" "), "w")

March 16, 2010 at 4:31 am 2 comments

Network slideshow

| Gabriel |

Now that I’ve gotten R and igraph to make a set of 53 png files (see yesterday’s post), the next step is animating them. I did this using the command line tool ImageMagick, which I installed using Fink, the (buggy) Mac version of the Debian package manager. Once ImageMagick is installed, I can do everything from directly within R using system(). To accomplish this, I just added these lines of code to the end of yesterday’s script. The “mv” commands are necessary because ImageMagick has a naive view of alphabetical order.

#create animated gif in image magick
system("mv chrnet_hc0.png chrnet_hc00.png")
system("mv chrnet_hc1.png chrnet_hc01.png")
system("mv chrnet_hc2.png chrnet_hc02.png")
system("mv chrnet_hc3.png chrnet_hc03.png")
system("mv chrnet_hc4.png chrnet_hc04.png")
system("mv chrnet_hc5.png chrnet_hc05.png")
system("mv chrnet_hc6.png chrnet_hc06.png")
system("mv chrnet_hc7.png chrnet_hc07.png")
system("mv chrnet_hc8.png chrnet_hc08.png")
system("mv chrnet_hc9.png chrnet_hc09.png")
system("convert *.png chrnet_humps.gif")

Here are the results. Vertices are stations, which turn black when the station has begun playing “My Humps” by Black Eyed Peas. Yellow vertices have missing data on airplay (true missing data, not just right-censored). The graph layout is based on directed nominations from a survey so vertices near each other have low path length, but I hid the actual edges to preserve some privacy about the stations social network ties. My substantive interpretation of this animation (and a comparable event history) is that the network doesn’t really matter and the endogenous cascade is based on attention to aggregate peer behavior rather than that of specific alters.

Note that you may have to click on the image to see the animation.

March 1, 2010 at 1:35 pm 1 comment

More R headaches

| Gabriel |

I’ve continued to play with the R package igraph and I think I’ve gotten the hang of igraph itself but R itself is still pretty opaque to me and that limits my ability to use igraph. Specifically, I’m trying to merge diffusion data onto a network graph and plot it as kind of a slideshow, where each successive image is a new period. I can actually do this, but I’m having trouble looping it and so my code is very repetitive. [Update, thanks to Brian Rubineau I've gotten it to work]

First the two tricks that I have solved.

1. Merging the diffusion data (a vertice-level trait) onto the network.

The preferred way to do this is to read in the network file to R then merge in the vertice-level trait (or more technically, read the vertice data and associate it with the network data). This has proven really difficult to me so instead I wrote an alternate version of stata2pajek that let’s me do this within Stata. The upside is that I spend more time in Stata and less time in R, the downside to this is that I need a separate “.net” file for every version of the graph.

2. Getting the different versions of the graph to appear comparable.

It does no good to have multiple versions of a graph if they don’t look at all similar, which is what you get if you generate the layout on the fly as part of each plot. The solution to this is to first generate a layout (the object “la”) then apply it to each graph by defining the “layout” parameter to “=la” within the “plot.igraph()” function.

So instead of this:

plot.igraph(chrnetbounded, layout=layout.fruchterman.reingold, vertex.size=4, vertex.label=NA, vertex.color="red", edge.color="gray20", edge.arrow.size=0.3, margin=0)

Do this:

la = layout.fruchterman.reingold(chrnetbounded)
plot.igraph(chrnetbounded, layout=la, vertex.size=4, vertex.label=NA, vertex.color="red", edge.color="gray20", edge.arrow.size=0.3, margin=0)

Now the problem I can’t solve, at least with anything approaching elegance.

3. Looping it.

Currently my code is really long and repetitive. I want one graph per week for a whole year. To make 52 versions of the graph, I need about two hundred lines of code, whereas I should be able to just loop my core four lines of code 52 times — pdf(); read.graph();plot.igraph(); This works, but is ridiculous.

The thing is that I can’t figure out how to make a loop in R where the looping local feeds to be part of a filename. This kind of thing is trivially easy in Stata since Stata expands locals and then interprets them. For example, here’s how I would do this in Stata. (I’m using “twoway scatter” as a placeholder for “plot.igraph()”, which of course doesn’t exist in Stata).

forvalues i=0/52 {
	use ties_bounded`i', clear
	twoway scatter x y
	graph export chrnet_hc`i'.png, replace

R not only doesn’t let you do this directly, but I can’t even figure out how to do it by adding an extra step where the looped local feeds into a new object to write the filename. (The problem is that the “paste()” function adds whitespace). So basically I’m at an impasse and I see three ways to do it:

  • Wait for somebody to tell me in the comments how to do this kind of loop in R. (hint, hint).
  • Give up on writing this kind of loop and just resign myself to writing really repetitive R code.
  • Give up on using igraph from within R and learn to use it from within Python. I have basically zero experience using Python but it has a good reputation for usability. In fact, it only took me, a complete Python-noob, about ten minutes to figure out what’s been a real stumper in R. I haven’t yet worked igraph into this loop but I’m thinking it can’t be that hard.
for i in range(0,52):
     datafile='ties_bounded%d' % i
     # igraph code here that treats the python object "datafile" as a filename

So thanks to Brian Rubineau’s suggestion on how to better use the “paste()” function, which is functionally equivalent to the second line of the Python code above. The catch is that you have to add a “, sep=”” ” parameter to suppress the whitespace that had been annoying me. I thought I tried this already, but apparently not. Anyway, the Python / R method of first defining a new object then calling it is an extra step compared to Stata loops (where the looping local can expand directly) but it’s still reasonably easy. Here’s my complete R code, which I’m now very happy with.

# File-Name:       chrnetwork.R                 
# Date:            2010-02-26
# Created Date:    2009-11-24                               
# Author:          Gabriel Rossman                                       
# Purpose:         graph CHR station network
# Data Used:
# Packages Used:   igraph    
#ties bounded to only top 40, includes adoption time color-codes, but use is optional
chrnetbounded <- read.graph("", c("pajek"))
la = layout.fruchterman.reingold(chrnetbounded)  #create layout for use on several related graphs
#graph structure only
 plot.igraph(chrnetbounded, layout=la, vertex.size=4, vertex.label=NA, vertex.color="red", edge.color="gray20", edge.arrow.size=0.3, margin=0)
#graph color coded diffusion
 plot.igraph(chrnetbounded, layout=la, vertex.size=4, vertex.label=NA, edge.color="gray60", edge.arrow.size=0.3, margin=0)
#do as flipbook
for(i in 0:52) {
	datafile<-paste('ties_bounded_hc',i,'.net', sep="")
	pngfile<-paste('~/Documents/book/images/chrnet_hc',i,'.png', sep="")
	chrnetbounded <- read.graph(datafile, c("pajek"))
	plot.igraph(chrnetbounded, layout=la, vertex.size=4, vertex.label=NA, edge.color="gray60", edge.arrow.size=0.3, margin=0)

February 28, 2010 at 1:23 pm 9 comments

R and TextMate

| Gabriel |

Now that I’ve started dabbling in R, I figured I needed to get my text editor to highlight the Klingon-esque syntax. TextWrangler and Smultron already support R, but getting it for TextMate requires the Terminal:

cd "~/Library/Application Support/TextMate/Bundles"
svn co

Note that 64-bit R is buggy so if you have trouble piping scripts from TextMate to Rdaemon (i.e., the command line R running in the background), you can use the bundle editor to redirect it to “R32″ instead of just “R” which will force it to use the slightly slower but more reliable 32-bit R. Or if that’s too hard, just stick to piping to instead of Rdaemon.

Also, as long as you’re playing with the TextMate library, you might as well install “GetBundles,” a GUI frontend for browsing the TextMate bundle server.
svn co

Note that GetBundles (with “s”) supersedes the now defunct GetBundle (without “s”) that you might see mentioned if you google things like “TextMate Bundle” or “TextMate R syntax”.

December 1, 2009 at 4:54 am 6 comments

some R baby steps

| Gabriel |

I’ve tried to learn R a few times but the syntax has always been opaque to my Stata-centric mind. I think the trick is to realize that two of the key differences are that:

  • whereas handles only come up in a few Stata commands (e.g., file, postfile, log), they are very important in R, what with all the “<-” statements
  • in R there’s a much fuzzier line between commands and functions than in Stata. What I mean by this is both the superficial thing of all the parentheses and also the more substantive issue that often you don’t put them one to a line and they just do something (like Stata commands) but you usually put them many to a line and feed them into something else (like Stata functions). Related to this is that the typical Stata line has the syntax “verb object, adverb” whereas the typical R line has the syntax “object <- verb(object2, adverb)”

The two combine in an obvious way with something as simple as opening a dataset, which is just use file in Stata but is filehandle <- read.table(“file”) in R, that is, there’s not a read.table() command but a read.table() function and you feed this function to a handle. (And people say R isn’t intuitive!)

At least I think that’s a good way to think about the basic syntax — I suck at R and I really could be totally wrong about this. (Pierre or Kieran please correct me).

Anyway, I wrote my first useful R file the other day. It reads my Pajek formatted network data on top 40 radio stations and does a graph.

# File-Name: testgraph.R
# Date: 2009-11-20
# Author: Gabriel Rossman
# Purpose: graph CHR station network
# Data Used:
# Packages Used: igraph
chrnet <- read.graph("", c("pajek"))
plot.igraph(chrnet, layout=layout.fruchterman.reingold, vertex.size=2, vertex.label=NA, vertex.color="red", edge.color="gray20", edge.arrow.size=0.3, margin=0)

The weird thing is that it works fine in but breaks when I try to R run from the Terminal, regardless of whether I try to do it all in one line or first invoke R and then feed it the script. [Update: the issue is a 32 bit library and 64 bit R, the simple solution is to invoke "R32" rather than just plain "R". see the comments for details]. Here’s a session with both problems:

gabriel-rossmans-macbook-2:~ rossman$ Rscript ~/Documents/book/stata/testgraph.R
Error in dyn.load(file, DLLpath = DLLpath, ...) :
unable to load shared library '/Library/Frameworks/R.framework/Resources/library/igraph/libs/x86_64/':
dlopen(/Library/Frameworks/R.framework/Resources/library/igraph/libs/x86_64/, 10): Symbol not found: ___gmpz_clear
Referenced from: /Library/Frameworks/R.framework/Resources/library/igraph/libs/x86_64/
Expected in: dynamic lookup

Error : .onLoad failed in 'loadNamespace' for 'igraph'
Error: package/namespace load failed for 'igraph'
Execution halted
gabriel-rossmans-macbook-2:~ rossman$ R 'source("~/Documents/book/stata/testgraph.R")'
ARGUMENT 'source("~/Documents/book/stata/testgraph.R")' __ignored__

R version 2.10.0 (2009-10-26)
Copyright (C) 2009 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> source("~/Documents/book/stata/testgraph.R")
Error in dyn.load(file, DLLpath = DLLpath, ...) :
unable to load shared library '/Library/Frameworks/R.framework/Resources/library/igraph/libs/x86_64/':
dlopen(/Library/Frameworks/R.framework/Resources/library/igraph/libs/x86_64/, 10): Symbol not found: ___gmpz_clear
Referenced from: /Library/Frameworks/R.framework/Resources/library/igraph/libs/x86_64/
Expected in: dynamic lookup


Error : .onLoad failed in 'loadNamespace' for 'igraph'
Error: package/namespace load failed for 'igraph'

The problem seems to be that R (terminal) can’t find the igraph library. This is weird because has no trouble finding it. Furthermore, I get the same error even if I make sure igraph is installed directly from R (Terminal) in the same R session:

chooseCRANmirror(graphics = FALSE)

I guess that’s another difference with Stata, StataConsole knows where the ado library is. I’d like to be able to use the Terminal mode for R as this would let me to reach my Nirvana-like goal of having a single script that does everything without any proximate human intervention. So I’ll just ask? How do I get R (Terminal) to run as reliably as Is this a naive question?

Or would it be better to try to feed a “source” script from the command line? Much how like I can do this for Stata to launch a do-file into the Stata GUI
exec /Applications/Stata/ ~/Documents/book/stata/

November 24, 2009 at 4:42 am 11 comments

R, Stata and descriptive stats

| Pierre |

It’s amazing how R can make complicated things look simple and simple things look complicated.

I tried to explain in my previous post that R could have important advantages over Stata when it came to managing weird, “non-rectangular” datasets that need to be transformed/combined/reshaped in non-trivial ways: R makes it much easier to work on several datasets at the same time, and different types of objects can be used in consistent ways.

Still, I haven’t completely stopped using Stata: one of the things that bother me when I use R is the lack of nice and quick descriptive statistics functions like “tabulate” or “tabstat”. Of course, it is possible to use standard R functions to get about the same desired output, but they tend to be quite a bit more cumbersome. Here’s an example:

tabstat y, by(group) stats(N mean p10 median p90)

could be translated into R as:

tapply(levels(group), levels(group), function(i)
cbind(N=length(y[group == i],
mean(y[group == i]),
quantile(y[group == i], c(.1,.5,.9)))

or, for a more concise version:

by(y, group, function(x)

That’s quite ugly compared to the simple tabstat command, but I could deal with it… Now suppose I am working on survey data and observations have sampling weights, and the syntax will have to get even more complicated — I’d have to think about something for a few minutes, when all Stata would need is a quick [fw=weight] statement added before the comma.

True, R can deal with survey weights, but it almost never matches the simplicity of Stata when all I am trying to do is get a few simple descriptive statistics on survey data:

One of my latest problems with R involved trying to make a two-way table of relative frequencies by column with weighted data… yes, a simple contingency table! The table() function cannot even compare with Stata’s tabulate twoway command, since:

  1. it does not handle weights;
  2. it does not report marginal distributions in the last row and column of the table (which I always find helpful);
  3. it calculates cell frequencies but not relative frequencies by row or column.

Luckily, writing an R function that can achieve this is not too hard:

col.table <- function(var1, var2, weights=rep(1,length(var1)), margins=TRUE){
# Creating table of (weighted) relative frequencies by column, and adding row variable margins as the last column
crosstab <- prop.table(xtabs(weights ~ var1 + var2), margin=2)
t <- cbind(crosstab, Total=prop.table(xtabs(weights ~ var1)))
# Adding column sums in the last row
t <- rbind(t,Total = colSums(t))
# Naming rows and columns of the table after var1 and var2 used, and returning result
names(dimnames(t)) <- c(deparse(substitute(var1)), deparse(substitute(var2)))

col.table(x,y,w) gives the same output as Stata’s “tabulate x y [fw=w], col nofreq”. Note that the weight argument is optional so that: col.table(x,y) is equivalent to tabulate x y, col nofreq.

Here’s the same function, but for relative distributions by row:

row.table <- function(var1, var2, weights=rep(1,length(var1)), margins=TRUE){
t <- rbind(prop.table(xtabs(weights ~ var1 + var2), margin=1),
Total=prop.table(xtabs(weights ~ var2)))
t <- cbind(t,Total = rowSums(t))
names(dimnames(t)) <- c(deparse(substitute(var1)), deparse(substitute(var2)))

May 6, 2009 at 7:41 pm 11 comments

R, Stata and “non-rectangular” data

| Pierre |

Thanks Gabriel for letting me join you on this blog. For those who don’t know me, my name is Pierre, I am a graduate student in sociology at Princeton and I’ve been doing work on organizations, culture and economic sociology (guess who’s on my committee). Recently, I’ve become interested in diffusion processes — in quite unrelated domains: the emergence of new composers and their adoption in orchestra repertoires, the evolution of attitudes towards financial risk, the diffusion of stock-ownership and the recent stock-market booms and busts.

When Gabriel asked me if I wanted to post on this Stata/soc-of-culture-oriented blog, I first told him I was actually slowly moving away from Stata and using R more and more frequently… which is going to be the topic of my first post. I am not trying to start the first flamewar of “Code and culture” — rather I’d like to argue that both languages have their own strengths and weaknesses; the important thing for me is not to come to a definitive conclusion (“Stata is better than R” or vice versa) and only use one package while discarding the other, but to identify conditions under which R or Stata are more or less painful to use for the type of data analysis I am working on.

People usually emphasize graphics functions and the number of high-quality user-contributed packages for cutting-edge models as being R’s greatest strengths over other statistical packages. I have to say I don’t run very often into R estimation functions for which I can’t find an equivalent Stata command. And while I agree that R-generated graphs can be amazingly cool, Stata has become much better in recent years. For me, R is particularly useful when I need to manipulate certain kinds of data and turn them into a “rectangular” dataset:

Stata is perfect for “rectangular” data, when the dataset fits nicely inside a rectangle of observations (rows) and variables (colums) and when the conceptual difference between rows and columns is clear — this is what a dataset will have to look like just before running a regression. But things can get complicated when the raw dataset I need to manipulate is not already “rectangular”: this may include network data and multilevel data — even when the ultimate goal is to turn these messy-looking data, sometimes originating from multiple sources, into a nice rectangular dataset that can be analyzed with a simple linear model… Sure, Stata has a few powerful built-in commands (although I’d be embarrassed to say how many times I had to recheck the proper syntax for “reshape” in the Stata help). But sometimes egen, merge, expand, collapse and reshape won’t do the trick… and I find myself sorting, looping, using, saving and merging until I realize (too late of course!) that Stata can be a horrible, horrible tool when it comes to manipulating datasets that are not already rectangular. R on the other hand has two features that make it a great tool for data management:

  1. R can have multiple objects loaded in memory at the same time. Stata on the other hand can only work on one dataset at a time — which is not just inefficient (you always need to write the data into temporary files and read a new file to switch from one dataset to another), it can also  unnecessarily add lines to the code and create confusion.
  2. R can easily handle multiple types of objects: vectors, matrices, arrays, data frames (i.e. datasets), lists, functions… Stata on the other hand is mostly designed to work on datasets: most commands take variables or variable lists as input; and when Stata tries to handle other types of objects (matrices, scalars, macros, ado files…), Stata uses distinct commands each with a different syntax (e.g. “matrix define”, “scalar”, “local”, “global”, “program define” instead of “generate”…) and sometimes a completely different language (Mata for matrix operations — which I have never had the patience to learn). R on the other hand handles these objects in a simple and consistent manner (for example it uses the same assignment operator “<-” for a matrix, a vector, an array, a list or a function…) and can extract elements which are seamlessly “converted” into other object types (e.g. a column of a matrix, or coefficients/standard errors from a model are by definition vectors, which can be treated as such and added as variables in a data frame, without even using any special command à la “svmat”).

In my next post, I’ll try to explain why I keep using Stata despite all this…

April 28, 2009 at 12:57 pm 9 comments

Newer Posts

Recent Posts


Get every new post delivered to your Inbox.

Join 1,477 other followers