R, Stata and descriptive stats
May 6, 2009
| Pierre |
It’s amazing how R can make complicated things look simple and simple things look complicated.
I tried to explain in my previous post that R could have important advantages over Stata when it came to managing weird, “non-rectangular” datasets that need to be transformed/combined/reshaped in non-trivial ways: R makes it much easier to work on several datasets at the same time, and different types of objects can be used in consistent ways.
Still, I haven’t completely stopped using Stata: one of the things that bother me when I use R is the lack of nice and quick descriptive statistics functions like “tabulate” or “tabstat”. Of course, it is possible to use standard R functions to get about the same desired output, but they tend to be quite a bit more cumbersome. Here’s an example:
tabstat y, by(group) stats(N mean p10 median p90)
could be translated into R as:
tapply(levels(group), levels(group), function(i)
cbind(N=length(y[group == i],
mean(y[group == i]),
quantile(y[group == i], c(.1,.5,.9)))
or, for a more concise version:
by(y, group, function(x)
c(N=length(x),mean=mean(x),quantile(x,c(.1,.5,.9)))
That’s quite ugly compared to the simple tabstat command, but I could deal with it… Now suppose I am working on survey data and observations have sampling weights, and the syntax will have to get even more complicated — I’d have to think about something for a few minutes, when all Stata would need is a quick [fw=weight] statement added before the comma.
True, R can deal with survey weights, but it almost never matches the simplicity of Stata when all I am trying to do is get a few simple descriptive statistics on survey data:
One of my latest problems with R involved trying to make a two-way table of relative frequencies by column with weighted data… yes, a simple contingency table! The table() function cannot even compare with Stata’s tabulate twoway command, since:
- it does not handle weights;
- it does not report marginal distributions in the last row and column of the table (which I always find helpful);
- it calculates cell frequencies but not relative frequencies by row or column.
Luckily, writing an R function that can achieve this is not too hard:
col.table <- function(var1, var2, weights=rep(1,length(var1)), margins=TRUE){
# Creating table of (weighted) relative frequencies by column, and adding row variable margins as the last column
crosstab <- prop.table(xtabs(weights ~ var1 + var2), margin=2)
t <- cbind(crosstab, Total=prop.table(xtabs(weights ~ var1)))
# Adding column sums in the last row
t <- rbind(t,Total = colSums(t))
# Naming rows and columns of the table after var1 and var2 used, and returning result
names(dimnames(t)) <- c(deparse(substitute(var1)), deparse(substitute(var2)))
return(round(100*t,2))
}
col.table(x,y,w) gives the same output as Stata’s “tabulate x y [fw=w], col nofreq”. Note that the weight argument is optional so that: col.table(x,y) is equivalent to tabulate x y, col nofreq.
Here’s the same function, but for relative distributions by row:
row.table <- function(var1, var2, weights=rep(1,length(var1)), margins=TRUE){
t <- rbind(prop.table(xtabs(weights ~ var1 + var2), margin=1),
Total=prop.table(xtabs(weights ~ var2)))
t <- cbind(t,Total = rowSums(t))
names(dimnames(t)) <- c(deparse(substitute(var1)), deparse(substitute(var2)))
return(round(100*t,2))
}
Entry Filed under: Uncategorized. Tags: R, Stata.
5 Comments Add your own
Leave a Comment
Some HTML allowed:
<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <pre> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>
Trackback this post | Subscribe to the comments via RSS Feed
1. sunday morning links – academic experiences edition « orgtheory.net | May 16, 2009 at 10:35 pm
[...] house, Code and Culture, Pierre is a new blogger and has two posts on R vs. Stata and how to make R behave. Meanwhile, the Rossman does a hazard analysis of how people unsubrsribe to email lists. As usual, [...]
2.
danny | May 17, 2009 at 8:29 am
Thanks for this post Pierre. I’m just beginning to use R and your points about storing data (good) and descriptive statistics (bad) completely fit with my experience.
I was just really surprised by how hard it seemed to be to get simple descriptive statistics. I figured that it was just my lack of coding experience in R, but it sounds like I am not alone.
3.
Art | June 24, 2009 at 6:35 pm
Look at the CrossTable function in package gmodels
Art
4.
pkremp | June 24, 2009 at 8:11 pm
Thanks! It seems like a great package — the function doesn’t seem to handle weights though…
5.
Andrew | November 14, 2009 at 3:14 pm
R has a number of functions for calculating descriptive statistics, depending on what you are looking for. I am not familiar with Stata so I can’t make a comparison, but I’d encourage you to take a look at a few functions that may do what you are looking for. Note the format is functionName {package it is in}:
summary {base}: will describe the min, max, median, quartiles, and means of variables
– ex. summary(age) OR summary(cbind(age, educ))
describe {psych}: calculates means, medians, skew, etc.
describe.by {psych}: calculates everything describe {psych} does, but allows you to subdivide by another variable
– ex. describe.by(age, group=gender) will give you descriptive statistics for the age variable for men and women separately
describe {Hmisc}: calculates means, etc. for variables, and allows for sampling weights
What I am still trying to find, or maybe will just try and write, is how to combine the weighting function of describe {Hmisc} with the sub-dividing function of describe.by {psych}