Posts Tagged R

some R baby steps

| Gabriel |

I’ve tried to learn R a few times but the syntax has always been opaque to my Stata-centric mind. I think the trick is to realize that two of the key differences are that:

  • whereas handles only come up in a few Stata commands (e.g., file, postfile, log), they are very important in R, what with all the “<-” statements
  • in R there’s a much fuzzier line between commands and functions than in Stata. What I mean by this is both the superficial thing of all the parentheses and also the more substantive issue that often you don’t put them one to a line and they just do something (like Stata commands) but you usually put them many to a line and feed them into something else (like Stata functions). Related to this is that the typical Stata line has the syntax “verb object, adverb” whereas the typical R line has the syntax “object <- verb(object2, adverb)”

The two combine in an obvious way with something as simple as opening a dataset, which is just use file in Stata but is filehandle <- read.table(“file”) in R, that is, there’s not a read.table() command but a read.table() function and you feed this function to a handle. (And people say R isn’t intuitive!)

At least I think that’s a good way to think about the basic syntax — I suck at R and I really could be totally wrong about this. (Pierre or Kieran please correct me).

Anyway, I wrote my first useful R file the other day. It reads my Pajek formatted network data on top 40 radio stations and does a graph.

# File-Name: testgraph.R
# Date: 2009-11-20
# Author: Gabriel Rossman
# Purpose: graph CHR station network
# Data Used: ties_bounded.net
# Packages Used: igraph
library(igraph)
setwd("~/Documents/Sjt/radio/survey")
chrnet <- read.graph("ties.net", c("pajek"))
pdf("~/Documents/book/images/chrnetworkbounded.pdf")
plot.igraph(chrnet, layout=layout.fruchterman.reingold, vertex.size=2, vertex.label=NA, vertex.color="red", edge.color="gray20", edge.arrow.size=0.3, margin=0)
dev.off()

The weird thing is that it works fine in R.app but breaks when I try to R run from the Terminal, regardless of whether I try to do it all in one line or first invoke R and then feed it the script. [Update: the issue is a 32 bit library and 64 bit R, the simple solution is to invoke "R32" rather than just plain "R". see the comments for details]. Here’s a session with both problems:

gabriel-rossmans-macbook-2:~ rossman$ Rscript ~/Documents/book/stata/testgraph.R
Error in dyn.load(file, DLLpath = DLLpath, ...) :
unable to load shared library '/Library/Frameworks/R.framework/Resources/library/igraph/libs/x86_64/igraph.so':
dlopen(/Library/Frameworks/R.framework/Resources/library/igraph/libs/x86_64/igraph.so, 10): Symbol not found: ___gmpz_clear
Referenced from: /Library/Frameworks/R.framework/Resources/library/igraph/libs/x86_64/igraph.so
Expected in: dynamic lookup

Error : .onLoad failed in 'loadNamespace' for 'igraph'
Error: package/namespace load failed for 'igraph'
Execution halted
gabriel-rossmans-macbook-2:~ rossman$ R 'source("~/Documents/book/stata/testgraph.R")'
ARGUMENT 'source("~/Documents/book/stata/testgraph.R")' __ignored__

R version 2.10.0 (2009-10-26)
Copyright (C) 2009 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> source("~/Documents/book/stata/testgraph.R")
Error in dyn.load(file, DLLpath = DLLpath, ...) :
unable to load shared library '/Library/Frameworks/R.framework/Resources/library/igraph/libs/x86_64/igraph.so':
dlopen(/Library/Frameworks/R.framework/Resources/library/igraph/libs/x86_64/igraph.so, 10): Symbol not found: ___gmpz_clear
Referenced from: /Library/Frameworks/R.framework/Resources/library/igraph/libs/x86_64/igraph.so
Expected in: dynamic lookup

 

Error : .onLoad failed in 'loadNamespace' for 'igraph'
Error: package/namespace load failed for 'igraph'
>

The problem seems to be that R (terminal) can’t find the igraph library. This is weird because R.app has no trouble finding it. Furthermore, I get the same error even if I make sure igraph is installed directly from R (Terminal) in the same R session:

chooseCRANmirror(graphics = FALSE)
install.packages("igraph")
source("/Users/rossman/Documents/book/stata/testgraph.R")

I guess that’s another difference with Stata, StataConsole knows where the ado library is. I’d like to be able to use the Terminal mode for R as this would let me to reach my Nirvana-like goal of having a single script that does everything without any proximate human intervention. So I’ll just ask? How do I get R (Terminal) to run as reliably as R.app? Is this a naive question?

Or would it be better to try to feed R.app a “source” script from the command line? Much how like I can do this for Stata to launch a do-file into the Stata GUI
exec /Applications/Stata/StataMP.app/Contents/MacOS/stataMP ~/Documents/book/stata/import.do

6 comments November 24, 2009

R, Stata and descriptive stats

| Pierre |

It’s amazing how R can make complicated things look simple and simple things look complicated.

I tried to explain in my previous post that R could have important advantages over Stata when it came to managing weird, “non-rectangular” datasets that need to be transformed/combined/reshaped in non-trivial ways: R makes it much easier to work on several datasets at the same time, and different types of objects can be used in consistent ways.

Still, I haven’t completely stopped using Stata: one of the things that bother me when I use R is the lack of nice and quick descriptive statistics functions like “tabulate” or “tabstat”. Of course, it is possible to use standard R functions to get about the same desired output, but they tend to be quite a bit more cumbersome. Here’s an example:

tabstat y, by(group) stats(N mean p10 median p90)

could be translated into R as:

tapply(levels(group), levels(group), function(i)
cbind(N=length(y[group == i],
mean(y[group == i]),
quantile(y[group == i], c(.1,.5,.9)))

or, for a more concise version:

by(y, group, function(x)
c(N=length(x),mean=mean(x),quantile(x,c(.1,.5,.9)))

That’s quite ugly compared to the simple tabstat command, but I could deal with it… Now suppose I am working on survey data and observations have sampling weights, and the syntax will have to get even more complicated — I’d have to think about something for a few minutes, when all Stata would need is a quick [fw=weight] statement added before the comma.

True, R can deal with survey weights, but it almost never matches the simplicity of Stata when all I am trying to do is get a few simple descriptive statistics on survey data:

One of my latest problems with R involved trying to make a two-way table of relative frequencies by column with weighted data… yes, a simple contingency table! The table() function cannot even compare with Stata’s tabulate twoway command, since:

  1. it does not handle weights;
  2. it does not report marginal distributions in the last row and column of the table (which I always find helpful);
  3. it calculates cell frequencies but not relative frequencies by row or column.

Luckily, writing an R function that can achieve this is not too hard:

col.table <- function(var1, var2, weights=rep(1,length(var1)), margins=TRUE){
# Creating table of (weighted) relative frequencies by column, and adding row variable margins as the last column
crosstab <- prop.table(xtabs(weights ~ var1 + var2), margin=2)
t <- cbind(crosstab, Total=prop.table(xtabs(weights ~ var1)))
# Adding column sums in the last row
t <- rbind(t,Total = colSums(t))
# Naming rows and columns of the table after var1 and var2 used, and returning result
names(dimnames(t)) <- c(deparse(substitute(var1)), deparse(substitute(var2)))
return(round(100*t,2))
}

col.table(x,y,w) gives the same output as Stata’s “tabulate x y [fw=w], col nofreq”. Note that the weight argument is optional so that: col.table(x,y) is equivalent to tabulate x y, col nofreq.

Here’s the same function, but for relative distributions by row:

row.table <- function(var1, var2, weights=rep(1,length(var1)), margins=TRUE){
t <- rbind(prop.table(xtabs(weights ~ var1 + var2), margin=1),
Total=prop.table(xtabs(weights ~ var2)))
t <- cbind(t,Total = rowSums(t))
names(dimnames(t)) <- c(deparse(substitute(var1)), deparse(substitute(var2)))
return(round(100*t,2))
}

5 comments May 6, 2009

R, Stata and “non-rectangular” data

| Pierre |

Thanks Gabriel for letting me join you on this blog. For those who don’t know me, my name is Pierre, I am a graduate student in sociology at Princeton and I’ve been doing work on organizations, culture and economic sociology (guess who’s on my committee). Recently, I’ve become interested in diffusion processes — in quite unrelated domains: the emergence of new composers and their adoption in orchestra repertoires, the evolution of attitudes towards financial risk, the diffusion of stock-ownership and the recent stock-market booms and busts.

When Gabriel asked me if I wanted to post on this Stata/soc-of-culture-oriented blog, I first told him I was actually slowly moving away from Stata and using R more and more frequently… which is going to be the topic of my first post. I am not trying to start the first flamewar of “Code and culture” — rather I’d like to argue that both languages have their own strengths and weaknesses; the important thing for me is not to come to a definitive conclusion (“Stata is better than R” or vice versa) and only use one package while discarding the other, but to identify conditions under which R or Stata are more or less painful to use for the type of data analysis I am working on.

People usually emphasize graphics functions and the number of high-quality user-contributed packages for cutting-edge models as being R’s greatest strengths over other statistical packages. I have to say I don’t run very often into R estimation functions for which I can’t find an equivalent Stata command. And while I agree that R-generated graphs can be amazingly cool, Stata has become much better in recent years. For me, R is particularly useful when I need to manipulate certain kinds of data and turn them into a “rectangular” dataset:

Stata is perfect for “rectangular” data, when the dataset fits nicely inside a rectangle of observations (rows) and variables (colums) and when the conceptual difference between rows and columns is clear — this is what a dataset will have to look like just before running a regression. But things can get complicated when the raw dataset I need to manipulate is not already “rectangular”: this may include network data and multilevel data — even when the ultimate goal is to turn these messy-looking data, sometimes originating from multiple sources, into a nice rectangular dataset that can be analyzed with a simple linear model… Sure, Stata has a few powerful built-in commands (although I’d be embarrassed to say how many times I had to recheck the proper syntax for “reshape” in the Stata help). But sometimes egen, merge, expand, collapse and reshape won’t do the trick… and I find myself sorting, looping, using, saving and merging until I realize (too late of course!) that Stata can be a horrible, horrible tool when it comes to manipulating datasets that are not already rectangular. R on the other hand has two features that make it a great tool for data management:

  1. R can have multiple objects loaded in memory at the same time. Stata on the other hand can only work on one dataset at a time — which is not just inefficient (you always need to write the data into temporary files and read a new file to switch from one dataset to another), it can also  unnecessarily add lines to the code and create confusion.
  2. R can easily handle multiple types of objects: vectors, matrices, arrays, data frames (i.e. datasets), lists, functions… Stata on the other hand is mostly designed to work on datasets: most commands take variables or variable lists as input; and when Stata tries to handle other types of objects (matrices, scalars, macros, ado files…), Stata uses distinct commands each with a different syntax (e.g. “matrix define”, “scalar”, “local”, “global”, “program define” instead of “generate”…) and sometimes a completely different language (Mata for matrix operations — which I have never had the patience to learn). R on the other hand handles these objects in a simple and consistent manner (for example it uses the same assignment operator “<-” for a matrix, a vector, an array, a list or a function…) and can extract elements which are seamlessly “converted” into other object types (e.g. a column of a matrix, or coefficients/standard errors from a model are by definition vectors, which can be treated as such and added as variables in a data frame, without even using any special command à la “svmat”).

In my next post, I’ll try to explain why I keep using Stata despite all this…

8 comments April 28, 2009


The Culture Geeks

Tags

bayesian cleaning culture diffusion economics economic sociology ethnomethodology financial crisis graphs history IMDB loops lyx macros networks phenomenology philosophy of science R random variables regular expressions resampling shell sociology of organizations sociology of science st Stata superstar text editor typesetting

Archives

Recent Posts

Recent Comments

Blogroll