## R, Stata and descriptive stats

| Pierre |

It’s amazing how R can make complicated things look simple and simple things look complicated.

I tried to explain in my previous post that R could have important advantages over Stata when it came to managing weird, “non-rectangular” datasets that need to be transformed/combined/reshaped in non-trivial ways: R makes it much easier to work on several datasets at the same time, and different types of objects can be used in consistent ways.

Still, I haven’t completely stopped using Stata: one of the things that bother me when I use R is the lack of nice and quick descriptive statistics functions like “tabulate” or “tabstat”. Of course, it is possible to use standard R functions to get about the same desired output, but they tend to be quite a bit more cumbersome. Here’s an example:

`tabstat y, by(group) stats(N mean p10 median p90)`

could be translated into R as:

```tapply(levels(group), levels(group), function(i) cbind(N=length(y[group == i], mean(y[group == i]), quantile(y[group == i], c(.1,.5,.9)))```

or, for a more concise version:

```by(y, group, function(x) c(N=length(x),mean=mean(x),quantile(x,c(.1,.5,.9)))```

That’s quite ugly compared to the simple tabstat command, but I could deal with it… Now suppose I am working on survey data and observations have sampling weights, and the syntax will have to get even more complicated — I’d have to think about something for a few minutes, when all Stata would need is a quick [fw=weight] statement added before the comma.

True, R can deal with survey weights, but it almost never matches the simplicity of Stata when all I am trying to do is get a few simple descriptive statistics on survey data:

One of my latest problems with R involved trying to make a two-way table of relative frequencies by column with weighted data… yes, a simple contingency table! The table() function cannot even compare with Stata’s tabulate twoway command, since:

1. it does not handle weights;
2. it does not report marginal distributions in the last row and column of the table (which I always find helpful);
3. it calculates cell frequencies but not relative frequencies by row or column.

Luckily, writing an R function that can achieve this is not too hard:

```col.table <- function(var1, var2, weights=rep(1,length(var1)), margins=TRUE){ # Creating table of (weighted) relative frequencies by column, and adding row variable margins as the last column crosstab <- prop.table(xtabs(weights ~ var1 + var2), margin=2) t <- cbind(crosstab, Total=prop.table(xtabs(weights ~ var1))) # Adding column sums in the last row t <- rbind(t,Total = colSums(t)) # Naming rows and columns of the table after var1 and var2 used, and returning result names(dimnames(t)) <- c(deparse(substitute(var1)), deparse(substitute(var2))) return(round(100*t,2)) }```

col.table(x,y,w) gives the same output as Stata’s “tabulate x y [fw=w], col nofreq”. Note that the weight argument is optional so that: col.table(x,y) is equivalent to tabulate x y, col nofreq.

Here’s the same function, but for relative distributions by row:

```row.table <- function(var1, var2, weights=rep(1,length(var1)), margins=TRUE){ t <- rbind(prop.table(xtabs(weights ~ var1 + var2), margin=1), Total=prop.table(xtabs(weights ~ var2))) t <- cbind(t,Total = rowSums(t)) names(dimnames(t)) <- c(deparse(substitute(var1)), deparse(substitute(var2))) return(round(100*t,2)) }```

Entry filed under: Uncategorized. Tags: , .

• 1. sunday morning links – academic experiences edition « orgtheory.net  |  May 16, 2009 at 10:35 pm

[…] house, Code and Culture, Pierre is a new blogger and has two posts on R vs. Stata and how to make R behave. Meanwhile, the Rossman does a hazard analysis of how people unsubrsribe to email lists. As usual, […]

• 2. danny  |  May 17, 2009 at 8:29 am

Thanks for this post Pierre. I’m just beginning to use R and your points about storing data (good) and descriptive statistics (bad) completely fit with my experience.

I was just really surprised by how hard it seemed to be to get simple descriptive statistics. I figured that it was just my lack of coding experience in R, but it sounds like I am not alone.

• 3. Art  |  June 24, 2009 at 6:35 pm

Look at the CrossTable function in package gmodels

Art

• 4. pkremp  |  June 24, 2009 at 8:11 pm

Thanks! It seems like a great package — the function doesn’t seem to handle weights though…

• 5. Andrew  |  November 14, 2009 at 3:14 pm

R has a number of functions for calculating descriptive statistics, depending on what you are looking for. I am not familiar with Stata so I can’t make a comparison, but I’d encourage you to take a look at a few functions that may do what you are looking for. Note the format is functionName {package it is in}:

summary {base}: will describe the min, max, median, quartiles, and means of variables
– ex. summary(age) OR summary(cbind(age, educ))

describe {psych}: calculates means, medians, skew, etc.
describe.by {psych}: calculates everything describe {psych} does, but allows you to subdivide by another variable
– ex. describe.by(age, group=gender) will give you descriptive statistics for the age variable for men and women separately

describe {Hmisc}: calculates means, etc. for variables, and allows for sampling weights

What I am still trying to find, or maybe will just try and write, is how to combine the weighting function of describe {Hmisc} with the sub-dividing function of describe.by {psych}

• 6. Brad  |  May 11, 2011 at 1:05 pm

Pierre…In your code above, it’s unclear where to insert the weight variable.

• 7. sae  |  May 17, 2011 at 11:36 pm

Related: For way too long I was wondering how in R to use (frequency) weights instead of counts in a cross tabulation of level of satisfaction with transit service by income group. xtabs() allows weights on the left side of the function, e.g. xtabs(df\$weight ~ df\$satisf + df\$income). But I much prefer to use the CrossTable function, which easily gives row and column percentages, chi-square and more. To my knowledge, CrossTable itself does not take weights, however. Finally, I found that CrossTable can be used together with xtabs, for the best of both words, like this:
CrossTable(xtabs(df\$weight ~ df\$satisf + df\$income))

• 8. Zoltan  |  March 6, 2012 at 8:38 am

Thank you for the tip. It is very useful for two-way contingency tables, but unfortunately CrossTable can not handle three- or more-way tables produced by xtabs.

• 9. Dean Eckles  |  December 4, 2012 at 4:32 pm

You might give plyr a try in combination with any function that takes weights to give you your subgroup analyses etc. It is great for these kinds of purposes. Folks I know who uses R much uses plyr in nearly every session.

• 10. Dab  |  December 22, 2012 at 7:38 am

How would you change your code to handle pweights?

• 11. Andrew  |  January 2, 2013 at 11:18 am