## R, Stata and “non-rectangular” data

| Pierre |

Thanks Gabriel for letting me join you on this blog. For those who don’t know me, my name is Pierre, I am a graduate student in sociology at Princeton and I’ve been doing work on organizations, culture and economic sociology (guess who’s on my committee). Recently, I’ve become interested in diffusion processes — in quite unrelated domains: the emergence of new composers and their adoption in orchestra repertoires, the evolution of attitudes towards financial risk, the diffusion of stock-ownership and the recent stock-market booms and busts.

When Gabriel asked me if I wanted to post on this Stata/soc-of-culture-oriented blog, I first told him I was actually slowly moving away from Stata and using R more and more frequently… which is going to be the topic of my first post. I am not trying to start the first flamewar of “Code and culture” — rather I’d like to argue that both languages have their own strengths and weaknesses; the important thing for me is not to come to a definitive conclusion (“Stata is better than R” or vice versa) and only use one package while discarding the other, but to identify conditions under which R or Stata are more or less painful to use for the type of data analysis I am working on.

People usually emphasize graphics functions and the number of high-quality user-contributed packages for cutting-edge models as being R’s greatest strengths over other statistical packages. I have to say I don’t run very often into R estimation functions for which I can’t find an equivalent Stata command. And while I agree that R-generated graphs can be amazingly cool, Stata has become much better in recent years. For me, R is particularly useful when I need to manipulate certain kinds of data and turn them into a “rectangular” dataset:

Stata is perfect for “rectangular” data, when the dataset fits nicely inside a rectangle of observations (rows) and variables (columns) and when the conceptual difference between rows and columns is clear: this is what a dataset will have to look like just before running a regression. But things can get complicated when the raw dataset I need to manipulate is not already “rectangular”: this may include network data and multilevel data, even when the ultimate goal is to turn these messy-looking data, sometimes originating from multiple sources, into a nice rectangular dataset that can be analyzed with a simple linear model… Sure, Stata has a few powerful built-in commands (although I’d be embarrassed to say how many times I had to recheck the proper syntax for “reshape” in the Stata help). But sometimes egen, merge, expand, collapse and reshape won’t do the trick… and I find myself sorting, looping, using, saving and merging until I realize (too late of course!) that Stata can be a horrible, horrible tool when it comes to manipulating datasets that are not already rectangular. R on the other hand has two features that make it a great tool for data management:

1. R can have multiple objects loaded in memory at the same time. Stata on the other hand can only work on one dataset at a time, which is not just inefficient (you always need to write the data into temporary files and read a new file to switch from one dataset to another); it can also unnecessarily add lines to the code and create confusion.
2. R can easily handle multiple types of objects: vectors, matrices, arrays, data frames (i.e. datasets), lists, functions… Stata on the other hand is mostly designed to work on datasets: most commands take variables or variable lists as input; and when Stata tries to handle other types of objects (matrices, scalars, macros, ado files…), Stata uses distinct commands each with a different syntax (e.g. “matrix define”, “scalar”, “local”, “global”, “program define” instead of “generate”…) and sometimes a completely different language (Mata for matrix operations — which I have never had the patience to learn). R on the other hand handles these objects in a simple and consistent manner (for example it uses the same assignment operator “<-” for a matrix, a vector, an array, a list or a function…) and can extract elements which are seamlessly “converted” into other object types (e.g. a column of a matrix, or coefficients/standard errors from a model are by definition vectors, which can be treated as such and added as variables in a data frame, without even using any special command à la “svmat”).
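To make these two points a bit more concrete, here is a minimal sketch in base R. The data and all object names (`orchestras`, `performances`, `rect`…) are made up purely for illustration; the point is only that two datasets sit in memory side by side, and that matrix columns and model coefficients come out as plain vectors.

```r
# Hypothetical toy data: two separate data frames held in memory at once.
orchestras <- data.frame(
  orch_id = c(1, 2, 3),
  city    = c("Boston", "Chicago", "Cleveland"),
  stringsAsFactors = FALSE
)
performances <- data.frame(
  orch_id = c(1, 1, 2, 2, 2, 3),
  year    = c(1950, 1952, 1951, 1953, 1955, 1954)
)

# Collapse the performance-level data to one row per orchestra
# (roughly the job of "collapse" in Stata), without saving a temporary file.
counts <- aggregate(year ~ orch_id, data = performances, FUN = length)
names(counts)[2] <- "n_perf"

# Merge the two objects directly into one rectangular dataset.
rect <- merge(orchestras, counts, by = "orch_id")

# A column of a matrix is just a vector, so it can become a variable as is.
m <- matrix(1:6, nrow = 3)
rect$from_matrix <- m[, 2]

# Coefficients from a model are also a plain named vector.
fit <- lm(year ~ orch_id, data = performances)
b <- coef(fit)

print(rect)
print(b)
```

Nothing here is specific to these toy variables: the same assignment operator is used for the data frames, the matrix, the fitted model and the coefficient vector, and both source datasets are still available, unchanged, after the merge.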

In my next post, I’ll try to explain why I keep using Stata despite all this…


• 1. Jenn Lena  |  April 28, 2009 at 1:23 pm

Welcome, Pierre!

• 2. Vincent  |  April 28, 2009 at 3:34 pm

Thanks a lot Pierre. This is a very timely post for me, and I look forward to your next entries.

• 3. Bobby Chen  |  April 28, 2009 at 3:46 pm

Hi Pierre:

In my department there’s a divide between R users and Stata users, mainly having to do with the R users doing more work in network analysis as well as Bayesian statistics. The Stata users (like myself) are generally thought of as lower-level non-quantoids. But I agree with you, working on merging multiple datasets gets very cumbersome with Stata.

• 4. gabrielrossman  |  April 28, 2009 at 6:28 pm

it’s funny because i spend a lot of time merging and reshaping datasets but it never occurred to me that there was a better way to do it, i just figured that’s the way it is.
a friend of mine likes to say “all 3s!” the fact that he (rightly) considers this to be a boast is pretty good evidence of what a pain merging issues are in Stata. i’m just surprised that it’s easier in another package. (no doubt you’re thinking about me right now the same way i think of spss people)

• 5. pkremp  |  April 28, 2009 at 9:20 pm

actually, i don’t have any opinion on people using spss — as long as they save their code/logs and can replicate their results, I’m fine with any statistical package… (but I agree that point-and-click without logs is a recipe for disaster)
i guess reformatting datasets can be a pain in any language; R is no exception… but being able to work on different datasets at the same time can definitely save time and avoid coding mistakes.

• 6. shrinkingisaac  |  April 28, 2009 at 8:04 pm

i’ve always been baffled by how Stata can take something so simple and make it so complicated (“reshape” – no need to be embarrassed by having to look that one up. It just doesn’t make sense).

• 7. Trey  |  April 28, 2009 at 9:09 pm

Agreed with what’s written above, but it is possible to open more than one copy of Stata at once (on Windows boxes) to work with multiple datasets.

• [...] Back at Gabe’s house, Code and Culture, Pierre is a new blogger and has two posts on R vs. Stata and how to make R behave. Meanwhile, the Rossman does a hazard analysis of how people unsubscribe to [...]

• 9. Stata2Pajek w vertice colors « Code and Culture  |  June 29, 2010 at 7:01 pm

[...] and this required tweaking the syntax a bit. Because this is a relational database problem (and Stata likes flat-file), I do it by merging in a vertice level file on disk. Although I’ve written it to merge on [...]