Some ways Stata is an unusual language

August 6, 2010 at 4:36 am 12 comments

| Gabriel |

As I’ve tried to learn other languages, I’ve realized that part of the difficulty isn’t that they’re hard (although in some cases they are) but that I’m used to Stata’s very distinctive paradigm and nomenclature. Some aspects of Stata are pretty standard (e.g., “while”/“foreach”/“forvalues” loops, log files, and the “file” syntax for using text files on disk), but other bits are pretty strange. Or rather, they’re strange from a computer science perspective but intuitive from a social science perspective.

Stata seems to have been designed to make sense to social scientists and if this makes it confusing to programmers, then so be it. A simple example of this is that Stata uses the word “variable” in the sense meant by social scientists. More broadly, Stata is pretty bold about defaults so as to make things easy for beginners. It presumes that anything you’re doing applies to the dataset (aka the master data), which is always a flat-file database. Other things that might be held in memory have a secondary status and beginning users don’t even know that they’re there. Likewise, commands distinguish between the important arguments (usually variables) and the secondary arguments, which Stata calls “options”. There are also very sensible assumptions about what to report and what to put in ephemeral data objects that can be accessed immediately after the primary command (but need not be stored as part of the original command, as they would be in most other languages).

Note, I’m not complaining about any of this. Very few of Stata’s quirks are pointlessly arbitrary. (The only arbitrary deviation I can think of is using “*” instead of “#” for commenting.) Most of Stata’s quirks are necessary in order to make it so user-friendly to social scientists. In a lot of ways R is a more conventional language than Stata, but most social scientists find Stata much easier to learn. In part because Stata is willing to deviate from the conventions of general purpose programming languages, running and interpreting a regression in Stata looks like “reg y x” instead of “summary(lm(y~x))”, and loading a dataset looks like “use mydata, clear” instead of “data <- read.table("mydata.txt")”. Stata has some pretty complicated syntax (e.g., the entire Mata language) but you can get a lot done with just a handful of simple commands like “use,” “gen,” and “reg”.

Nonetheless all this means that when Stata native speakers like me learn a second programming language it can be a bit confusing. And FWIW, I worry that rumored improvements to Stata (such as allowing relational data in memory) will detract from its user-friendliness. Anyway, the point is that I love Stata and I think it’s entirely appropriate for social scientists to learn it first. I do most of my work in Stata and I teach/mentor my graduate students in Stata unless there’s a specific reason for them to learn something else. At the same time I know that many social scientists would benefit a lot from also learning other languages. For instance, people into social networks should learn R, people who want to do content analysis should learn Perl or Python, and people who want to do simulations should learn NetLogo or Java. The thing is that when you do, you’re in for a culture shock and so I’m making explicit some ways in which Stata is weird.

Do-files and Ado-files. In any other language a do-file would be called a script and an ado-file would be called a library. Also note that Stata very conveniently reads all your ado-files automatically, whereas most other languages require you to specifically load the relevant libraries into memory at the beginning of each script.
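
To make the contrast concrete, here is a minimal Python sketch of the explicit loading step (Python is used purely for illustration, and the income figures are made up). The `statistics` module ships with Python, but its functions are not available until the script asks for it by name:

```python
# The library has to be loaded explicitly at the top of the script;
# nothing is picked up automatically the way ado-files are in Stata.
import statistics

# Made-up values for illustration only.
incomes = [32000, 45000, 51000, 38000]

mean_income = statistics.mean(incomes)
print(mean_income)
```

Without the `import` line the call to `statistics.mean()` would fail with a `NameError`, which is roughly the surprise a Stata user runs into the first time.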

Commands, Programs, and Functions. In Stata a program is basically just a command that you wrote yourself. Stata is somewhat unusual in drawing a distinction between a command/program and a function. So in Stata a function usually means some kind of transformation that attaches its output to a variable or macro, as in “gen ln_income=log(income)”. In contrast a command/program is pretty much anything that doesn’t directly attach to an operator and includes all file operations (e.g., “use”) and estimations (e.g., “regress”). Other languages don’t really draw this distinction but consider everything a function, no matter what it does and whether the user wrote it or not. (Some languages use “primitive” to mean something like the Stata command vs. program distinction, but it’s not terribly important.)

Because most languages only have functions this means that pretty much everything has to be assigned to an object via an operator. Hence Stata users would usually type “reg y x” whereas R users would usually type “myregression <- lm(y~x)”. This is because “regress” in Stata is a command whereas “lm()” in R is a function. Also note that Stata distinguishes between commands and everything else by word order syntax, with the command being the first word. In contrast, functions in other languages (just like Stata functions) put the function name outside the parentheses, and inside the parentheses go all of the arguments, both data objects and options.
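
The same pattern can be sketched in Python. The `tabulate()` function below is a hypothetical helper written for this illustration, not part of any library. The data (`values`) and the option-like arguments (`sort`, `drop`) all sit inside the parentheses, and the result only sticks around because it is assigned to a name:

```python
from collections import Counter


def tabulate(values, sort=False, drop=None):
    """Count how often each value occurs.

    `values` is the data; `sort` and `drop` play the role that
    options after the comma play in a Stata command.
    """
    counts = Counter(v for v in values if v != drop)
    items = list(counts.items())
    if sort:
        # Most frequent category first.
        items.sort(key=lambda pair: pair[1], reverse=True)
    return items


# Made-up survey responses for illustration.
responses = ["yes", "no", "yes", "n/a", "yes", "no"]

# Data and options go in together, and the output is assigned to an object.
freq = tabulate(responses, sort=True, drop="n/a")
print(freq)
```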

The Dataset. Stata is one of the few languages where it’s appropriate to use the definite article in reference to data. (NetLogo is arguably another case of this.) In other languages it’s more appropriate to speak of “a data object” than “the dataset,” even if there only happens to be one data object in memory. For the same reason, most languages don’t “use” or “open” data, but “read” the data and assign it to an object. Another way to think about it is that only Stata has a “dataset” whereas other languages only have “matrices.” Of course, Stata/Mata also has matrices but most Stata end users don’t bother with them as they tend to be kind of a backend thing that’s usually handled by ado-files. Furthermore, in other languages (e.g., Perl) it’s common to not even load a file into memory but to process it line-by-line, which in Stata terms is kind of like a cross between the “file read/write” syntax and a “while” loop.
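
Line-by-line processing is easy to show in a short Python sketch (Python instead of Perl, purely for illustration). The file contents and the two-column layout are assumptions made up for the example, and the file is written to a temporary location only so that the sketch is self-contained:

```python
import os
import tempfile

# Hypothetical sample data: one "name age" pair per line.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("alice 30\nbob 45\ncarol 27\n")
    path = tmp.name

total = 0
n_lines = 0

# Iterating over the file handle yields one line at a time,
# so there is never a "dataset" sitting in memory.
with open(path) as handle:
    for line in handle:
        name, age = line.split()
        total += int(age)
        n_lines += 1

os.remove(path)  # clean up the temporary file

mean_age = total / n_lines
print(mean_age)
```

Note that nothing here resembles “the dataset”: the only things held in memory are the current line and two running totals.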

Variables. Stata uses the term “variable” in the statistical or social scientific meaning of the term. In other languages this would usually be called a field or vector.

Macros. What most other languages call variables, Stata calls local and global “macros.” Stata’s usage of the local vs global distinction is standard. In other languages the concept of “declaring” a variable is usually a little more explicit than it is in Stata.

Stata is extremely good about expanding macros in situ and this can spoil us Stata users. In other languages you often have to do some kind of crude workaround: first use a concatenate function to create a string object containing the expansion, and then use that string object. For instance, if you wanted to access a series of numbered files in Stata you could just loop over this:

use ~/project/file`i', clear 

In other languages you’d have to add a separate line for the expansion. So in R you’d loop over:

filename <- paste('~/project/file',i, sep="")
data <- read.table(filename)
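
For comparison, a minimal Python sketch of the same two-step pattern. The `numbered_paths()` helper is hypothetical and only builds the path strings (it does not read any files). Python’s f-strings make the expansion fairly painless, but the result is still an ordinary string object that has to be built before it can be used:

```python
def numbered_paths(prefix, count):
    """Build prefix1, prefix2, ... as explicit string objects."""
    paths = []
    for i in range(1, count + 1):
        # Step 1: create the string containing the expansion.
        filename = f"{prefix}{i}"
        # Step 2: use the string object (here it is just collected).
        paths.append(filename)
    return paths


paths = numbered_paths("~/project/file", 3)
print(paths)
```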

[Update: Also see this Statalist post by Nick Cox on the distinction between variables and macros]

Reporting. Stata allows you to pass estimations on for further work (that’s what return macros, ereturn matrices, and postestimation commands are all about), but it assumes you probably won’t and so it is unusually generous in reporting most of the really interesting things after a command. In other languages you usually have to specifically ask to get this level of reporting. Another way to put it is that in Stata verbosity is assumed by default and can be suppressed with “quietly,” whereas in R silence is assumed by default and verbosity can be invoked by wrapping the estimation (or an object saving the estimation) in the “summary()” function.
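
The silent-by-default convention can be sketched in plain Python. Both functions below are hypothetical helpers written for this illustration (a bare-bones one-predictor least squares fit, not any library’s API), and the data are made up. Fitting returns an object and prints nothing; reporting is a separate step you have to ask for:

```python
def fit_line(x, y):
    """One-predictor ordinary least squares; returns results, prints nothing."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return {"slope": slope, "intercept": intercept, "n": n}


def summarize(fit):
    """Explicit reporting step, loosely analogous to R's summary()."""
    return (
        f"n = {fit['n']}, "
        f"intercept = {fit['intercept']:.3f}, "
        f"slope = {fit['slope']:.3f}"
    )


# Made-up data for illustration.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

model = fit_line(x, y)   # silent: the estimates are only stored
print(summarize(model))  # reporting happens only when requested
```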



12 Comments

  • 1. Andrew  |  August 6, 2010 at 4:25 pm

    Very good write-up!

    I’ve always thought that Stata is a great way for those unfamiliar with programming to learn and get things done. The “hello world!” moment is a huge confidence booster when learning a new language. In statistical programming “hello world” is more likely to be the moment where you’re first able to regress Y on X and see that slope parameter and its significance level. In Stata this can be done very quickly and intuitively where it may take multiple steps in others. For instance, in R, after the beginner figures out how to read in a CSV file with all the header parameters and runs their first model <- lm(Y~X) command they are greeted with nothing but a blank input line. Even running lm(Y~X) will not display the significance of the coefficient by default and one must dig deeper.

    That said, R is a very powerful language and its open-source nature is a huge advantage over Stata. I would encourage social scientists beginning in quantitative analysis to learn Stata first and gradually move into R as they become familiar with programming for statistics.

    • 2. gabrielrossman  |  August 6, 2010 at 6:31 pm

      thanks andrew. i definitely agree that one of the reasons people hate R is the simple issue that it assumes suppressed reporting, which makes for at least one extra step before you feel you’ve accomplished anything. people seem to think that user-friendliness is about having a GUI, but it seems like this kind of thing is equally important — and it’s telling that R Commander provides reporting by default.

      i’m a little more comfortable than you with the idea of using Stata forever since (unlike SPSS) it’s entirely possible to use it in a sophisticated fashion. i think for many very sophisticated people (demographers who use off-the-shelf data) Stata does everything they’d ever need and so learning a less user-friendly language is just a waste of effort. for people whose needs are met by Stata the only advantage of R is that it’s open source, but even that’s not as much of an advantage as it seems since except for a few primitives Stata is mostly written in ado files and so it’s free as in speech even if it’s not free as in beer.

  • 3. Gabi Huiber  |  August 7, 2010 at 1:14 am

    I agree with Andrew that this is a very nice write-up. I also share your comfort at the thought of just using Stata forever. I’m just not sure that that’s where we’re headed.

    There is a business opportunity in the amount of programming ignorance in the general PhD public. A lot of time is being wasted on pasting results from SAS/Stata/SPSS to Excel and then to Word, or on retracing your steps in the absence of proper version control. Some eventually get to know better, but not all. There are two ways to profit from this problem: one is to live with it. Another is to fix it.

    Stata is supremely good at the former. It will not make trouble if you refuse to learn about programming. The cost of this convenience is the sum of quirks you mention in your post. You may trust that people won’t notice them without a frame of reference, but I am no longer sure that you’ll win that way.

    Newer, easier languages might yet be brought to the masses. That is what software-carpentry.org is trying to do with Python, for example. If such efforts are successful, we’ll be seeing growing numbers of people who are competent enough programmers in the general sense of knowing what to look for.

    These people will find Stata strange, and will instead be at home in SciPy, or Incanter, or some other weird piece of kit that’s not even around yet.

    • 4. gabrielrossman  |  August 7, 2010 at 2:43 pm

      Gabi, good thoughts. i try pretty hard to get my grad students to use Stata elegantly and have been deliberating how to pursue more general programming for myself and my students. in fact, just yesterday i was shopping for Python manuals that emphasize general programming principles. as such i’m especially grateful for the software carpentry link, but i’m a bit reluctant to dive into it until they complete the Python 3 upgrade.

  • 5. mike  |  August 8, 2010 at 1:45 pm

    Gabriel — I completely agree with this post and it really spoke to me as someone who learned programming via Stata and then moved to other languages.

    I think that the biggest programming element missing from Stata is iterables, either lists or hashable dictionaries (using Python terminology). Although other programming languages don’t have the nice macro expansion that Stata has, some of the need for expansion is obviated by the ability to use lists or dictionaries that fulfill the same function.

  • 6. Nick Cox  |  August 9, 2010 at 12:41 pm

    A few marginal comments:

    It’s incidental to your main argument, but your presumption that Stata is in some sense designed for social scientists isn’t really correct. Stata is designed for whoever wants to use it. It so happens that the largest markets appear to be social scientists, wide sense, and biostatistics, medical statistics and epidemiology, which are of the same order in size, but Stata users come from many fields, and extend beyond academia to include business and government.

    Your characterisation of Stata as verbose is also about as wrong as it is right. Sure it is by comparison with R, but not compared with several other statistical languages (more often called packages).

    Whoever said that # was the standard way to indicate comments? Just about every language I ever used has a different way to do that.

    • 7. gabrielrossman  |  August 9, 2010 at 1:13 pm

      Professor Cox,

      With the possible exception of Bill Gould, there’s nobody better to discuss this with so thanks for sharing your thoughts (and in general for contributing so much to the language).

      On the social scientists question you’re definitely right that the language is also popular with and intended for others, especially in the life sciences. However these researchers tend to understand words like “variable” the same way I do and so it doesn’t much change the main point that Stata assumes the jargon of practicing scientists rather than computer programmers.

      As for comments, I accept that I’m probably overgeneralizing from Bash, R, Perl, and Python. To a lesser extent this may also apply to some of the other issues, like functions vs commands. Hence it might have been better to title the post “some ways Stata differs from several other languages that social and life scientists are especially likely to encounter c. 2010”

  • 8. Nick Cox  |  August 9, 2010 at 1:38 pm

    Thanks for your (too) kind comments.

    I agree with your key point about variables, which I didn’t mention in my previous post. Stata’s sense of variable is indeed the statistical sense, not the computer programming sense.

    Stata started out as a kind of statistical operating system, as Bill Gould has often emphasised, and it retains that role. Programmability strong sense wasn’t there from the outset as far as users were concerned but the ability to run scripts was.

  • 9. Nick Cox  |  August 10, 2010 at 5:52 am

    My posting
    http://www.stata.com/statalist/archive/2008-08/msg01258.html
    addresses the terms variable, macro, local, global etc. as used within Stata and without.

  • 10. Jakob Petersen  |  October 2, 2010 at 9:47 am

    V interesting comparison of stats programming packages. One comment on how results are handled; I like Stata’s ‘type a little – get a little’ principle compared to SPSS, where ‘type a little’ becomes ‘get pages of results reporting every last test possible’.

