R, Stata and “non-rectangular” data
| Pierre |
Thanks Gabriel for letting me join you on this blog. For those who don’t know me, my name is Pierre, I am a graduate student in sociology at Princeton and I’ve been doing work on organizations, culture and economic sociology (guess who’s on my committee). Recently, I’ve become interested in diffusion processes — in quite unrelated domains: the emergence of new composers and their adoption in orchestra repertoires, the evolution of attitudes towards financial risk, the diffusion of stock-ownership and the recent stock-market booms and busts.
When Gabriel asked me if I wanted to post on this Stata/soc-of-culture-oriented blog, I first told him I was actually slowly moving away from Stata and using R more and more frequently… which is going to be the topic of my first post. I am not trying to start the first flamewar of “Code and culture” — rather I’d like to argue that both languages have their own strengths and weaknesses; the important thing for me is not to come to a definitive conclusion (“Stata is better than R” or vice versa) and only use one package while discarding the other, but to identify conditions under which R or Stata are more or less painful to use for the type of data analysis I am working on.
People usually emphasize graphics functions and the number of high-quality user-contributed packages for cutting-edge models as being R’s greatest strengths over other statistical packages. I have to say I don’t run very often into R estimation functions for which I can’t find an equivalent Stata command. And while I agree that R-generated graphs can be amazingly cool, Stata has become much better in recent years. For me, R is particularly useful when I need to manipulate certain kinds of data and turn them into a “rectangular” dataset:
Stata is perfect for “rectangular” data, when the dataset fits nicely inside a rectangle of observations (rows) and variables (colums) and when the conceptual difference between rows and columns is clear — this is what a dataset will have to look like just before running a regression. But things can get complicated when the raw dataset I need to manipulate is not already “rectangular”: this may include network data and multilevel data — even when the ultimate goal is to turn these messy-looking data, sometimes originating from multiple sources, into a nice rectangular dataset that can be analyzed with a simple linear model… Sure, Stata has a few powerful built-in commands (although I’d be embarrassed to say how many times I had to recheck the proper syntax for “reshape” in the Stata help). But sometimes egen, merge, expand, collapse and reshape won’t do the trick… and I find myself sorting, looping, using, saving and merging until I realize (too late of course!) that Stata can be a horrible, horrible tool when it comes to manipulating datasets that are not already rectangular. R on the other hand has two features that make it a great tool for data management:
- R can have multiple objects loaded in memory at the same time. Stata on the other hand can only work on one dataset at a time — which is not just inefficient (you always need to write the data into temporary files and read a new file to switch from one dataset to another), it can also unnecessarily add lines to the code and create confusion.
- R can easily handle multiple types of objects: vectors, matrices, arrays, data frames (i.e. datasets), lists, functions… Stata on the other hand is mostly designed to work on datasets: most commands take variables or variable lists as input; and when Stata tries to handle other types of objects (matrices, scalars, macros, ado files…), Stata uses distinct commands each with a different syntax (e.g. “matrix define”, “scalar”, “local”, “global”, “program define” instead of “generate”…) and sometimes a completely different language (Mata for matrix operations — which I have never had the patience to learn). R on the other hand handles these objects in a simple and consistent manner (for example it uses the same assignment operator “<-” for a matrix, a vector, an array, a list or a function…) and can extract elements which are seamlessly “converted” into other object types (e.g. a column of a matrix, or coefficients/standard errors from a model are by definition vectors, which can be treated as such and added as variables in a data frame, without even using any special command à la “svmat”).
In my next post, I’ll try to explain why I keep using Stata despite all this…