Some ways Stata is an unusual language
| Gabriel |
As I’ve tried to learn other languages, I’ve realized that part of the difficulty isn’t that they’re hard (although in some cases they are) but that I’m used to Stata’s very distinctive paradigm and nomenclature. Some aspects of Stata are pretty standard (e.g., “while”/“foreach”/“forvalues” loops, log files, and the “file” syntax for using text files on disk), but other bits are pretty strange. Or rather, they’re strange from a computer science perspective but intuitive from a social science perspective.
Stata seems to have been designed to make sense to social scientists, and if this makes it confusing to programmers, then so be it. A simple example of this is that Stata uses the word “variable” in the sense meant by social scientists. More broadly, Stata is pretty bold about defaults so as to make things easy for beginners. It presumes that anything you’re doing applies to the dataset (aka the master data), which is always a flat-file database. Other things that might be held in memory have a secondary status, and beginning users don’t even know that they’re there. Likewise, commands distinguish between the important arguments (usually variables) and the secondary arguments, which Stata calls “options”. There are also very sensible assumptions about what to report and what to put in ephemeral data objects that can be accessed immediately after the primary command (but need not be stored as part of the original command, as they would be in most other languages).
Note, I’m not complaining about any of this. Very few of Stata’s quirks are pointlessly arbitrary. (The only arbitrary deviation I can think of is using “*” instead of “#” for commenting). Most of Stata’s quirks are necessary in order to make it so user-friendly to social scientists. In a lot of ways R is a more conventional language than Stata, but most social scientists find Stata much easier to learn. In part because Stata is willing to deviate from the conventions of general purpose programming languages, running and interpreting a regression in Stata looks like this “reg y x” instead of this “summary(lm(y~x))” and loading a dataset looks like this “use mydata, clear” instead of this “data <- read.table("mydata.txt")”. Stata has some pretty complicated syntax (e.g., the entire Mata language) but you can get a lot done with just a handful of simple commands like “use,” “gen,” and “reg”.
Nonetheless all this means that when Stata native speakers like me learn a second programming language it can be a bit confusing. And FWIW, I worry that rumored improvements to Stata (such as allowing relational data in memory) will detract from its user-friendliness. Anyway, the point is that I love Stata and I think it’s entirely appropriate for social scientists to learn it first. I do most of my work in Stata and I teach/mentor my graduate students in Stata unless there’s a specific reason for them to learn something else. At the same time I know that many social scientists would benefit a lot from also learning other languages. For instance, people into social networks should learn R, people who want to do content analysis should learn Perl or Python, and people who want to do simulations should learn NetLogo or Java. The thing is that when you do, you’re in for a culture shock and so I’m making explicit some ways in which Stata is weird.
Do-files and Ado-files. In any other language a do-file would be called a script and an ado-file would be called a library. Also note that Stata very conveniently reads all your ado-files automatically, whereas most other languages require you to specifically load the relevant libraries into memory at the beginning of each script.
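If you want to see the automatic ado-file lookup for yourself, here is a minimal sketch using two standard commands. Nothing in it is specific to any project; what it prints will depend on your own installation.

```stata
* List the directories Stata searches for ado-files, in search order
adopath

* Report whether a command is built in or, if not, which ado-file defines it
which regress
```

Dropping your own ado-file into one of the listed directories is roughly the equivalent of installing a library in another language, except that you never have to load it explicitly.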
Commands, Programs, and Functions. In Stata a program is basically just a command that you wrote yourself. Stata is somewhat unusual in drawing a distinction between a command/program and a function. So in Stata a function usually means some kind of transformation that attaches its output to a variable or macro, as in “gen ln_income=log(income)”. In contrast a command/program is pretty much anything that doesn’t directly attach to an operator and includes all file operations (e.g., “use”) and estimations (e.g., “regress”). Other languages don’t really draw this distinction but consider everything a function, no matter what it does and whether the user wrote it or not. (Some languages use “primitive” to mean something like the Stata command vs. program distinction, but it’s not terribly important).
Because most languages only have functions, pretty much everything has to be assigned to an object via an operator. Hence Stata users would usually type “reg y x” whereas R users would usually type “myregression <- lm(y~x)”. This is because “regress” in Stata is a command whereas “lm()” in R is a function. Also note that Stata distinguishes between commands and everything else by word order, with the command being the first word. In contrast, functions in other languages (just like Stata functions) put the function name outside the parentheses, and inside the parentheses go all of the arguments, both data objects and options.
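To make the three-way distinction concrete, here is a small sketch that uses the auto dataset that ships with Stata. The variable name ln_price and the program name sayhello are made up for illustration.

```stata
sysuse auto, clear

* log() is a function: it lives inside an expression and feeds a new variable
generate ln_price = log(price)

* regress is a command: first word on the line, variables next, options after the comma
regress ln_price mpg, vce(robust)

* a program is just a command you wrote yourself
capture program drop sayhello
program define sayhello
    display "Hello from a user-written program"
end

* once defined, it is called exactly like a built-in command
sayhello
```

Note that nothing here is assigned to an object. The regression results simply print, and the program runs as soon as its name appears as the first word on a line.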
The Dataset. Stata is one of the only languages where it’s appropriate to use the definite article in reference to data. (NetLogo is arguably another case of this). In other languages it’s more appropriate to speak of “a data object” than “the dataset,” even if there only happens to be one data object in memory. For the same reason, most languages don’t “use” or “open” data, but “read” the data and assign it to an object. Another way to think about it is that only Stata has a “dataset” whereas other languages only have “matrices.” Of course, Stata/Mata also has matrices but most Stata end users don’t bother with them as they tend to be kind of a backend thing that’s usually handled by ado-files. Furthermore, in other languages (e.g., Perl) it’s common to not even load a file into memory but to process it line-by-line, which in Stata terms is kind of like a cross between the “file read/write” syntax and a “while” loop.
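For Stata users who have never tried it, the line-by-line approach can be sketched with the “file” commands and a “while” loop. This is only an illustration: it first writes a two-line temporary text file so that it has something to read, and the macro names writer, reader, and line are arbitrary.

```stata
* write a small temporary text file to have something to read
tempfile demo
tempname writer reader
file open `writer' using "`demo'", write text replace
file write `writer' "first line" _n "second line" _n
file close `writer'

* process the file one line at a time without loading it as the dataset
file open `reader' using "`demo'", read text
local n = 0
file read `reader' line
while r(eof) == 0 {
    local ++n
    display "line `n': `line'"
    file read `reader' line
}
file close `reader'
display "read `n' lines"
```

The dataset in memory is never touched by any of this, which is exactly why it feels so foreign compared to the usual “use” workflow.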
Variables. Stata uses the term “variable” in the statistical or social scientific meaning of the term. In other languages this would usually be called a field or vector.
Macros. What most other languages call variables, Stata calls local and global “macros.” Stata’s usage of the local vs global distinction is standard. In other languages the concept of “declaring” a variable is usually a little more explicit than it is in Stata.
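Here is a minimal sketch of the difference, again using the auto dataset that ships with Stata. The names price_k, threshold, outcome, and project are all made up for illustration.

```stata
sysuse auto, clear

* a Stata "variable" is a column of the dataset
generate price_k = price / 1000

* a local macro is what other languages would call a variable
local threshold = 20
local outcome "price_k"

* macros are expanded in place before the command runs
count if mpg > `threshold'
summarize `outcome' if mpg > `threshold'

* a global macro is visible outside the current do-file or program
global project "demo"
display "$project"
```

Notice that there is no separate declaration step. The macro comes into existence the moment you assign to it with “local” or “global”.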
Stata is extremely good about expanding macros in situ and this can spoil us Stata users. In other languages you often have to use a crude workaround: first build a string object containing the expansion with some kind of concatenate function, and then use that string object. For instance, if you wanted to access a series of numbered files in Stata you could just loop over this:
use ~/project/file`i', clear
In other languages you’d have to add a separate line for the expansion. So in R you’d loop over:
filename <- paste('~/project/file', i, sep="")
data <- read.table(filename)
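For completeness, here is a self-contained sketch of the Stata side of that comparison. Instead of the ~/project path, it writes three throwaway numbered files to temporary locations (the macro names f1, f2, and f3 are hypothetical) and then reads them back, with the file name built by macro expansion right on the “use” line.

```stata
* create three small numbered files in temporary locations
tempfile f1 f2 f3
forvalues i = 1/3 {
    clear
    set obs `i'
    generate x = _n
    save "`f`i''"
}

* read them back: the nested macro is expanded in place, no separate string-building step
forvalues i = 1/3 {
    use "`f`i''", clear
    display "file `i' has " _N " observations"
}
```

The nested expansion works from the inside out, so on the second pass `f`i'' first becomes `f2' and then becomes the path stored in that macro.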
[Update: Also see this Statalist post by Nick Cox on the distinction between variables and macros]
Reporting. Stata allows you to pass estimations on for further work (that’s what return macros, ereturn matrices, and postestimation commands are all about), but it assumes you probably won’t and so it is unusually generous in reporting most of the really interesting things after a command. In other languages you usually have to specifically ask to get this level of reporting. Another way to put it is that in Stata verbosity is assumed by default and can be suppressed with “quietly,” whereas in R silence is assumed by default and verbosity can be invoked by wrapping the estimation (or an object saving the estimation) in the “summary()” function.
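A short sketch of this, using the auto dataset that ships with Stata (the choice of price and mpg is arbitrary):

```stata
sysuse auto, clear

* by default the full regression table is printed
regress price mpg

* quietly runs the same estimation but suppresses the output
quietly regress price mpg

* the stored results are still there to pick up afterwards
display "N = " e(N) ", R-squared = " e(r2)
display "slope on mpg = " _b[mpg]

* r-class commands work the same way
quietly summarize price
display "mean price = " r(mean)
```

The stored results only last until the next command of the same class overwrites them, so if you need them later you have to copy them into a macro or scalar right away.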