The Workflow of Data Analysis Using Stata

June 29, 2009 at 5:48 am

| Gabriel |

I recently read Scott Long’s new book The Workflow of Data Analysis Using Stata and I highly recommend it. One of the ironies of graduate education in the social sciences is that we spend quite a bit of time trying to explain things like standard error but largely ignore that on a modal day quantitative research is all about data management and programming. Although Long is too charitable to mention it, one of the reasons to emphasize these issues is that many of the notorious horror stories of quantitative research involve not modeling but data management. For instance, “88” was an unnoticed missing value code, not actual data on senescent priapism; it was a weighting error that led to wildly exaggerated estimates of post-divorce income effects; and, most recently, findings about anomie were at least in part an artifact of a NORC missing data coding error.

By focusing on these largely neglected but critical data management issues, Long has done a service to the discipline. Its publication may even reduce Indiana’s comparative advantage in producing hotshot quant PhDs now that grad students elsewhere can vicariously benefit from this important aspect of the training there. Certain aspects of it aren’t relevant to everyone (e.g., his section on value labels is most applicable to surveys with lots of Likert scales) but almost any serious quant is likely to find an enormous amount of clearly presented, useful information.

For many of the issues the book addresses he shows a highly efficient and reliable way to do things. This is a service because many self-taught people will satisfice with a clunky and inefficient technique, even though with a little more upfront effort (an upfront effort greatly reduced by this book) they could avoid both effort and error in the long run. “Chapter 4: Automating Your Work” is particularly good in this respect. Since I lacked the benefit of a copy of this book time-warped to 1997, I used Stata for years before I learned the “program” and “foreach” syntax. Even now, I’d never understood how to use matrices (which is why this script is so hideously clunky, really, please don’t click the link) but Long has a very clear explanation of how to use all of these programming constructs. In the future I think my scripts will be much more elegant for having read his book, and especially chapter 4.
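To give a flavor of the kind of automation the chapter covers, here is a minimal sketch of a “foreach” loop. The variable names are hypothetical, not examples from the book:

```stata
* loop over a (hypothetical) list of income variables,
* creating a logged version of each with a descriptive label
foreach v of varlist faminc wageinc otherinc {
    generate ln_`v' = ln(`v')
    label variable ln_`v' "Log of `v'"
}
```

Three lines of loop replace a dozen lines of copy-pasted generate commands, which is exactly the kind of error-prone repetition the chapter is aimed at eliminating.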

A less obvious contribution is that in several places he suggests standards. For instance, he suggests several missing data codes to distinguish between different types of missing data (coding error, skip code, respondent refused, etc.). The particular codes he provides are necessarily arbitrary but no less useful for it, because standards benefit from network externalities and it would make data analysis much easier if Stata users harmonized on these standards. Therefore the important thing is to have a remotely sensible standard, regardless of what it is.
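Stata’s extended missing values (.a through .z) support exactly this kind of scheme. The sketch below is illustrative only — the raw codes and letter assignments are my own, not Long’s actual standard:

```stata
* map (hypothetical) raw survey codes to distinct missing values,
* so the reason data are missing survives into analysis
replace income = .d if income == -1   // don't know
replace income = .r if income == -2   // refused
replace income = .s if income == -3   // legitimate skip
replace income = .e if income == -4   // coding error
```

All the extended missing values are excluded from estimation just like ordinary ".", but tabulations and missing-data diagnostics can still tell a refusal from a skip.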

Despite my enthusiasm I had a few differences of opinion and style. The main one is that the book reads something like a series of clear but nonetheless relatively discrete pieces of advice with only implicitly unifying themes. Over the course of a 200+ page book even consistently good advice starts to feel like one thing after another.

I think it might have made more sense and been more engaging to lay out a short list of principles for good code in the introduction. Then throughout the text each particular technique or standard could be shown as a manifestation of one or more of these general rules. Here is my own attempt to codify the general principles that at present are only implicit. Over the next few weeks I’ll elaborate on how these principles manifest in the book.

  1. The project should be replicable. (As Hillel said, “this is the whole law, the rest is commentary.”)
  2. Document your work by doing everything through adequately commented, organized, and archived scripts.
  3. Treat the raw data files as read-only.
  4. Good code will let you make changes in one place and see those changes propagate. (Note: Long embraces this principle within a single version of a single script, but otherwise sees this as a bug not a feature. As I’ll discuss in a few days, I disagree with him on the trade-offs involved in this issue).
  5. Good code is modular.
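To show how several of these principles (scripted documentation, read-only raw data, modularity) can fit together in practice, here is a skeleton master do-file. The file names are hypothetical placeholders, not a structure prescribed by the book:

```stata
* master.do -- runs the whole project from raw data to results
version 10            // lock the interpreter version for replicability
clear all
do clean_data.do      // reads the raw files (never overwrites them),
                      //   saves a working dataset
do make_vars.do       // variable construction and recodes
do models.do          // estimation and output
```

Because everything runs from one entry point, re-running the entire project after a fix is a single command, which is most of what “replicable” means in day-to-day practice.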




  • 1. Eszter Hargittai  |  June 29, 2009 at 6:15 am

    Very helpful, thanks! (I’m with you on how nice it would have been to learn about some of this earlier… not only because of time saved back then, but because it’s become so much harder to find time to learn things like this now. I know we thought we were sooo busy in grad school, but yikes, we had so much time!)

  • 2. mike3550  |  June 29, 2009 at 11:54 am

    I haven’t had a chance to read Long’s book, though I have looked at his website and was waiting for the book to actually be published. I look forward to reading it and reading your commentary about workflow as well.

  • 3. Jose Sean  |  July 6, 2009 at 7:41 am

    I like this post. I never had a chance to read Long’s books, though…

  • 4. Trey  |  July 17, 2009 at 11:57 am

    I really enjoyed this book and would be really interested in hearing your further commentary on it.

  • 5. links « Code and Culture  |  August 7, 2009 at 6:12 am

    […] Despite the usual standard-error-centric statistics training I’ve managed to develop a decent workflow, but (as can be seen by reading my shell scripts) I still really struggle with data cleaning […]
