R, Stata and descriptive stats

May 6, 2009 at 7:41 pm 11 comments

| Pierre |

It’s amazing how R can make complicated things look simple and simple things look complicated.

I tried to explain in my previous post that R could have important advantages over Stata when it came to managing weird, “non-rectangular” datasets that need to be transformed/combined/reshaped in non-trivial ways: R makes it much easier to work on several datasets at the same time, and different types of objects can be used in consistent ways.

Still, I haven’t completely stopped using Stata: one of the things that bother me when I use R is the lack of nice and quick descriptive statistics functions like “tabulate” or “tabstat”. Of course, it is possible to use standard R functions to get about the same desired output, but they tend to be quite a bit more cumbersome. Here’s an example:

tabstat y, by(group) stats(N mean p10 median p90)

could be translated into R as:

tapply(levels(group), levels(group), function(i)
cbind(N=length(y[group == i],
mean(y[group == i]),
quantile(y[group == i], c(.1,.5,.9)))

or, for a more concise version:

by(y, group, function(x)

That’s quite ugly compared to the simple tabstat command, but I could deal with it… Now suppose I am working on survey data and observations have sampling weights, and the syntax will have to get even more complicated — I’d have to think about something for a few minutes, when all Stata would need is a quick [fw=weight] statement added before the comma.

True, R can deal with survey weights, but it almost never matches the simplicity of Stata when all I am trying to do is get a few simple descriptive statistics on survey data:

One of my latest problems with R involved trying to make a two-way table of relative frequencies by column with weighted data… yes, a simple contingency table! The table() function cannot even compare with Stata’s tabulate twoway command, since:

  1. it does not handle weights;
  2. it does not report marginal distributions in the last row and column of the table (which I always find helpful);
  3. it calculates cell frequencies but not relative frequencies by row or column.

Luckily, writing an R function that can achieve this is not too hard:

col.table <- function(var1, var2, weights=rep(1,length(var1)), margins=TRUE){
# Creating table of (weighted) relative frequencies by column, and adding row variable margins as the last column
crosstab <- prop.table(xtabs(weights ~ var1 + var2), margin=2)
t <- cbind(crosstab, Total=prop.table(xtabs(weights ~ var1)))
# Adding column sums in the last row
t <- rbind(t,Total = colSums(t))
# Naming rows and columns of the table after var1 and var2 used, and returning result
names(dimnames(t)) <- c(deparse(substitute(var1)), deparse(substitute(var2)))

col.table(x,y,w) gives the same output as Stata’s “tabulate x y [fw=w], col nofreq”. Note that the weight argument is optional so that: col.table(x,y) is equivalent to tabulate x y, col nofreq.

Here’s the same function, but for relative distributions by row:

row.table <- function(var1, var2, weights=rep(1,length(var1)), margins=TRUE){
t <- rbind(prop.table(xtabs(weights ~ var1 + var2), margin=1),
Total=prop.table(xtabs(weights ~ var2)))
t <- cbind(t,Total = rowSums(t))
names(dimnames(t)) <- c(deparse(substitute(var1)), deparse(substitute(var2)))

Entry filed under: Uncategorized. Tags: , .

Computer viruses, herd immunity, and public goods Reply to All: Unsubscribe!


  • […] house, Code and Culture, Pierre is a new blogger and has two posts on R vs. Stata and how to make R behave. Meanwhile, the Rossman does a hazard analysis of how people unsubrsribe to email lists. As usual, […]

  • 2. danny  |  May 17, 2009 at 8:29 am

    Thanks for this post Pierre. I’m just beginning to use R and your points about storing data (good) and descriptive statistics (bad) completely fit with my experience.

    I was just really surprised by how hard it seemed to be to get simple descriptive statistics. I figured that it was just my lack of coding experience in R, but it sounds like I am not alone.

  • 3. Art  |  June 24, 2009 at 6:35 pm

    Look at the CrossTable function in package gmodels


    • 4. pkremp  |  June 24, 2009 at 8:11 pm

      Thanks! It seems like a great package — the function doesn’t seem to handle weights though…

  • 5. Andrew  |  November 14, 2009 at 3:14 pm

    R has a number of functions for calculating descriptive statistics, depending on what you are looking for. I am not familiar with Stata so I can’t make a comparison, but I’d encourage you to take a look at a few functions that may do what you are looking for. Note the format is functionName {package it is in}:

    summary {base}: will describe the min, max, median, quartiles, and means of variables
    – ex. summary(age) OR summary(cbind(age, educ))

    describe {psych}: calculates means, medians, skew, etc.
    describe.by {psych}: calculates everything describe {psych} does, but allows you to subdivide by another variable
    – ex. describe.by(age, group=gender) will give you descriptive statistics for the age variable for men and women separately

    describe {Hmisc}: calculates means, etc. for variables, and allows for sampling weights

    What I am still trying to find, or maybe will just try and write, is how to combine the weighting function of describe {Hmisc} with the sub-dividing function of describe.by {psych}

  • 6. Brad  |  May 11, 2011 at 1:05 pm

    Pierre…In your code above, it’s unclear where to insert the weight variable.

    Thanks for your help.


  • 7. sae  |  May 17, 2011 at 11:36 pm

    Related: For way too long I was wondering how in R to use (frequency) weights instead of counts in a cross tabulation of level of satisfaction with transit service by income group. xtabs() allows weights on the left side of the function, e.g. xtabs(df$weight ~ df$satisf + df$income). But I much prefer to use the CrossTable function, which easily gives row and column percentages, chi-square and more. To my knowledge, CrossTable itself does not take weights, however. Finally, I found that CrossTable can be used together with xtabs, for the best of both words, like this:
    CrossTable(xtabs(df$weight ~ df$satisf + df$income))

    • 8. Zoltan  |  March 6, 2012 at 8:38 am

      Thank you for the tip. It is very useful for two-way contingency tables, but unfortunately CrossTable can not handle three- or more-way tables produced by xtabs.

  • 9. Dean Eckles  |  December 4, 2012 at 4:32 pm

    You might give plyr a try in combination with any function that takes weights to give you your subgroup analyses etc. It is great for these kinds of purposes. Folks I know who uses R much uses plyr in nearly every session.

  • 10. Dab  |  December 22, 2012 at 7:38 am

    How would you change your code to handle pweights?

    • 11. Andrew  |  January 2, 2013 at 11:18 am

      Who are you asking?

      As a general response, though, pweights are just probability weights if I recall correctly, which you should just be able to incorporate using the weight option in any descriptive function that takes weights. See the functions for weighted statistics in the Hmisc package (such as describe).

The Culture Geeks

%d bloggers like this: