R, Stata and “non-rectangular” data

April 28, 2009

| Pierre |

Thanks Gabriel for letting me join you on this blog. For those who don’t know me, my name is Pierre, I am a graduate student in sociology at Princeton and I’ve been doing work on organizations, culture and economic sociology (guess who’s on my committee). Recently, I’ve become interested in diffusion processes — in quite unrelated domains: the emergence of new composers and their adoption in orchestra repertoires, the evolution of attitudes towards financial risk, the diffusion of stock-ownership and the recent stock-market booms and busts.

When Gabriel asked me if I wanted to post on this Stata/soc-of-culture-oriented blog, I first told him I was actually slowly moving away from Stata and using R more and more frequently… which is going to be the topic of my first post. I am not trying to start the first flamewar of “Code and culture” — rather I’d like to argue that both languages have their own strengths and weaknesses; the important thing for me is not to come to a definitive conclusion (“Stata is better than R” or vice versa) and only use one package while discarding the other, but to identify conditions under which R or Stata are more or less painful to use for the type of data analysis I am working on.

People usually emphasize graphics functions and the number of high-quality user-contributed packages for cutting-edge models as being R’s greatest strengths over other statistical packages. I have to say I don’t run very often into R estimation functions for which I can’t find an equivalent Stata command. And while I agree that R-generated graphs can be amazingly cool, Stata has become much better in recent years. For me, R is particularly useful when I need to manipulate certain kinds of data and turn them into a “rectangular” dataset:

Stata is perfect for “rectangular” data, when the dataset fits nicely inside a rectangle of observations (rows) and variables (columns) and when the conceptual difference between rows and columns is clear: this is what a dataset will have to look like just before running a regression. But things can get complicated when the raw dataset I need to manipulate is not already “rectangular”. This may include network data and multilevel data, even when the ultimate goal is to turn these messy-looking data, sometimes originating from multiple sources, into a nice rectangular dataset that can be analyzed with a simple linear model… Sure, Stata has a few powerful built-in commands (although I’d be embarrassed to say how many times I had to recheck the proper syntax for “reshape” in the Stata help). But sometimes egen, merge, expand, collapse and reshape won’t do the trick… and I find myself sorting, looping, using, saving and merging until I realize (too late of course!) that Stata can be a horrible, horrible tool when it comes to manipulating datasets that are not already rectangular. R on the other hand has two features that make it a great tool for data management:

  1. R can have multiple objects loaded in memory at the same time. Stata on the other hand can only work on one dataset at a time, which is not just inefficient (you always need to write the data into temporary files and read a new file to switch from one dataset to another): it can also add unnecessary lines to the code and create confusion.
  2. R can easily handle multiple types of objects: vectors, matrices, arrays, data frames (i.e. datasets), lists, functions… Stata on the other hand is mostly designed to work on datasets: most commands take variables or variable lists as input. When Stata has to handle other types of objects (matrices, scalars, macros, ado files…), it uses distinct commands, each with a different syntax (e.g. “matrix define”, “scalar”, “local”, “global”, “program define” instead of “generate”…), and sometimes a completely different language (Mata for matrix operations, which I have never had the patience to learn). R handles these objects in a simple and consistent manner: for example, it uses the same assignment operator “<-” for a matrix, a vector, an array, a list or a function. It can also extract elements that are seamlessly “converted” into other object types: a column of a matrix, or the coefficients and standard errors from a model, are by definition vectors, which can be treated as such and added as variables in a data frame without any special command à la “svmat”.
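
To make these two points a bit more concrete, here is a minimal sketch in base R. Everything in it is made up for illustration: the data frames `composers` and `performances`, their column names and all the values are hypothetical, and the regression is only there to produce some coefficients to extract. It uses nothing beyond base R functions (`merge`, `aggregate`, `lm`, `coef`, `vcov`).

```r
# Hypothetical toy data: one row per composer...
composers <- data.frame(
  composer   = c("A", "B", "C"),
  birth_year = c(1930, 1940, 1950),
  stringsAsFactors = FALSE
)

# ...and several rows per composer (one per season).
performances <- data.frame(
  composer     = c("A", "A", "B", "C", "C", "C"),
  season       = c(2005, 2006, 2005, 2005, 2006, 2007),
  n_orchestras = c(3, 5, 2, 4, 6, 9),
  stringsAsFactors = FALSE
)

# Point 1: both data frames sit in memory at once, so they can be
# combined directly, with no temporary file in between.
panel <- merge(performances, composers, by = "composer")

# A collapsed version can live alongside the other objects.
totals <- aggregate(n_orchestras ~ composer, data = performances, FUN = sum)

# Point 2: a fitted model is just another object, and its pieces are
# ordinary vectors assigned with the same "<-" operator.
fit <- lm(n_orchestras ~ season, data = panel)
est <- coef(fit)              # named numeric vector of coefficients
se  <- sqrt(diag(vcov(fit)))  # named numeric vector of standard errors

# Those vectors can go straight into a new data frame as variables.
results <- data.frame(
  term      = names(est),
  estimate  = est,
  std_error = se,
  row.names = NULL,
  stringsAsFactors = FALSE
)
```

At the end, `composers`, `performances`, `panel`, `totals`, `fit` and `results` all coexist in the same session, and any of them can be used in the next step without saving or reloading anything.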

In my next post, I’ll try to explain why I keep using Stata despite all this…


8 Comments

  • 1. Jenn Lena  |  April 28, 2009 at 1:23 pm

    Welcome, Pierre!

    • 2. Vincent  |  April 28, 2009 at 3:34 pm

      Thanks a lot Pierre. This is a very timely post for me, and I look forward to your next entries.

  • 3. Bobby Chen  |  April 28, 2009 at 3:46 pm

    Hi Pierre:

    In my department there’s a divide between R users and Stata users, mainly because the R users do more work in network analysis and Bayesian statistics. The Stata users (like myself) are generally thought of as lower-level non-quantoids. But I agree with you: merging multiple datasets gets very cumbersome with Stata.

  • 4. gabrielrossman  |  April 28, 2009 at 6:28 pm

    it’s funny because i spend a lot of time merging and reshaping datasets but it never occurred to me that there was a better way to do it, i just figured that’s the way it is.
    a friend of mine likes to say “all 3s!” the fact that he (rightly) considers this to be a boast is pretty good evidence of what a pain merging issues are in Stata. i’m just surprised that it’s easier in another package. (no doubt you’re thinking about me right now the same way i think of spss people)

    • 5. pkremp  |  April 28, 2009 at 9:20 pm

      actually, i don’t have any opinion on people using spss — as long as they save their code/logs and can replicate their results, I’m fine with any statistical package… (but I agree that point-and-click without logs is a recipe for disaster)
      i guess reformatting datasets can be a pain in any language; R is no exception… but being able to work on different datasets at the same time can definitely save time and avoid coding mistakes.

  • 6. shrinkingisaac  |  April 28, 2009 at 8:04 pm

    i’ve always been baffled by how Stata can take something so simple and make it so complicated (“reshape” – no need to be embarrassed by having to look that one up. It just doesn’t make sense).

  • 7. Trey  |  April 28, 2009 at 9:09 pm

    Agreed with what’s written above, but it is possible to open more than one copy of Stata at once (on Windows boxes) to work with multiple datasets.

  • [...] Back at Gabe’s house, Code and Culture, Pierre is a new blogger and has two posts on R vs. Stata and how to make R behave. Meanwhile, the Rossman does a hazard analysis of how people unsubscribe to [...]
