Posts tagged ‘Stata’

Keep the best 5 (updated)

| Gabriel |

Last year I mentioned my policy of assigning about seven quizzes and then keeping the best 5. I then had a real Rube Goldberg-esque workflow that involved piping to Perl. Several people came up with simpler ideas in the comments, but the most “why didn’t I think of that” was definitely John-Paul Ferguson’s suggestion to just use reshape. Now that I’m teaching the class again, I’ve rewritten the script to work on that logic.

Also, I’ve made the script a bit more flexible by letting you specify in the header how many quizzes were offered and how many to keep. To make this work, I wrote a loop that builds a local called sumstring.

[UPDATE 11/29/2010, applied Nick Cox’s suggestions. Old code remains but is commented out]

local numberofquizzes 6
local keepbest 5

*import grades, which look like this
*uid    name    mt  q1  q2  q3
*5001   Joe     40  5   4   6
*4228   Alex    20  6   3   5
insheet using grades.txt, clear
*rescale the quizzes from raw points to proportion 
forvalues qnum=1/`numberofquizzes' {
	quietly sum q`qnum'
	replace q`qnum'=q`qnum'/`r(max)'
}
/*
*build the sumstring local (original code)
local sumstring ""
forvalues i=1/`keepbest' {
	local sumstring "`sumstring' + q`i'"
	disp "`sumstring'"
	local sumstring=subinstr("`sumstring'","+","",1)
	disp "`sumstring'"
}
*/
*reshape long, keep top few quizzes
reshape long q, i( notes uid name mt) j(qnum)
recode q .=0
gsort uid -q
by uid: drop if _n>`keepbest'
by uid: replace qnum=_n
*reshape wide, calc average
reshape wide q, i(notes uid name mt) j(qnum)
*build the sumstring local (w/ Nick Cox's suggestions)
unab sumstring : q* 
disp "`sumstring'"
local sumstring : subinstr local sumstring " " "+", all
disp "`sumstring'"
gen q_avg=(`sumstring')/`keepbest'
sort name
sum q_avg

*have a nice day

November 24, 2010 at 4:24 am 4 comments

Misc Links: Stata Networks and Mac SPSS bugfix

| Gabriel |

Two quick links that might be of interest.

  • As probably became inevitable with the creation of Mata, progress marches on in bringing social networks to Stata. Specifically, SSC is now hosting “centpow.ado,” which calculates Bonacich centrality and a few related measures directly in Stata. Thanks to Zach Neal of Michigan State for contributing this command. A few more years of this kind of progress and I can do everything entirely within Stata rather than exporting my network data, using “shell” to send the work out to R/igraph, and merging back in.
  • Last week’s Java update for OS X broke some functionality in SPSS (or PASW, or IBM SPSS, or whatever they’re calling it now). If this is a problem for you, here’s some helpful advice on how to fix it. Or you could take my less helpful advice: switch to Stata.

October 25, 2010 at 12:54 pm 1 comment

fsx.ado, fork of fs.ado (capture ls as macro)

| Gabriel |

[Update: now hosted at ssc, just type “ssc install fsx” into Stata]

Nick Cox’s “fs.ado” command basically lets you capture the output of “ls” as a return macro. This is insanely useful if you want to do things like batch importing or appending.
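For instance, here’s a minimal sketch of the batch-append use case, assuming the current directory holds a set of .dta files that share a common structure (the file pattern and layout are hypothetical):

fsx *.dta
local files `r(files)'
*use the first file as the base, then append the rest
gettoken first files : files
use "`first'", clear
foreach f of local files {
	append using "`f'"
}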

For better or worse, when run without arguments “fs” shows all files, including files beginning with a dot. That is, “fs” behaves more like the Bash “ls -a” command rather than hiding these files like the Bash “ls” command does.

Since for most purposes Unix users are better off ignoring these files, I (with Nick’s blessing) wrote a fork of “fs” that by default suppresses the hidden files. The fork is called “fsx,” as in “fs for Unix.” I haven’t tested it on Windows yet, but it should run fine. However, doing so is kind of pointless, since Windows computers usually don’t have files beginning with dots unless they have been sharing a file system with Unix (for example, if a Mac user gives a Windows user data on a USB key). If you are interested in seeing the hidden files, you can either use the original “fs” command or use “fsx, all”.

After I write the help file I’ll post this to SSC.

BTW, as a coding note, this file has a lot of escaped quotes. I found that this chokes TextMate’s syntax parser, but Smultron highlights it correctly.

*! GHR and NJC 1.0 17 October 2010
* forked from fs.ado 1.0.5 (by NJC, Nov 2006)
program fsx, rclass
        syntax [anything] [, All ]
        version 8
        if `"`anything'"' == "" local anything *
        foreach f of local anything {
                if index("`f'", "/") | index("`f'", "\") ///
                 | index("`f'", ":") | inlist(substr("`f'", 1, 1), ".", "~") {
                        ParseSpec `f'
                        local files : dir "`d'" files "`f'"
                }
                else local files : dir . files "`f'"
                local files2 ""
                foreach f of local files {
                    if "`all'"=="all" {
                        local files2 "`files2' `f'"
                    }
                    else {
                        if strpos("`f'",".")!=1 {
                            local files2 "`files2' `f'"
                        }
                    }
                }
                local Files "`Files'`files2' "
        }
        DisplayInCols res 0 2 0 `Files'
        if trim(`"`Files'"') != "" {
                return local files `"`Files'"'
        }
end

program ParseSpec
        args f

        // first we need to strip off directory or folder information

        // if both "/" and "\" occur we want to know where the
        // last occurrence is -- which will be the first in
        // the reversed string
        // if only one of "/" and "\" occurs, index() will
        // return 0 in the other case

        local where1 = index(reverse("`f'"), "/")
        local where2 = index(reverse("`f'"), "\")
        if `where1' & `where2' local where = min(`where1', `where2')
        else                   local where = max(`where1', `where2')

        // map to position in original string and
        // extract the directory or folder
        local where = min(length("`f'"), 1 + length("`f'") - `where')
        local d = substr("`f'", 1, `where')

        // absolute references start with "/" or "\" or "." or "~"
        // or contain ":"
        local abs = inlist(substr("`f'", 1, 1), "/", "\", ".", "~")
        local abs = `abs' | index("`f'", ":")

        // prefix relative references
        if !`abs' local d "./`d'"

        // fix references to root
        else if "`d'" == "/" | "`d'" == "\" {
                local pwd "`c(pwd)'"
                local pwd : subinstr local pwd "\" "/", all
                local d = substr("`pwd'", 1, index("`pwd'","/"))
        }

        // absent filename list
        if "`f'" == "`d'" local f "*"
        else              local f = substr("`f'", `= `where' + 1', .)

        //  return to caller
        c_local f "`f'"
        c_local d "`d'"
end

program DisplayInCols /* sty #indent #pad #wid <list>*/
        gettoken sty    0 : 0
        gettoken indent 0 : 0
        gettoken pad    0 : 0
        gettoken wid    0 : 0

        local indent = cond(`indent'==. | `indent'<0, 0, `indent')
        local pad    = cond(`pad'==. | `pad'<1, 2, `pad')
        local wid    = cond(`wid'==. | `wid'<0, 0, `wid')

        local n : list sizeof 0
        if `n'==0 {
                exit
        }

        foreach x of local 0 {
                local wid = max(`wid', length(`"`x'"'))
        }

        local wid = `wid' + `pad'
        local cols = int((`c(linesize)'+1-`indent')/`wid')

        if `cols' < 2 {
                if `indent' {
                        local col "column(`=`indent'+1)"
                }
                foreach x of local 0 {
                        di as `sty' `col' `"`x'"'
                }
                exit
        }
        local lines = `n'/`cols'
        local lines = int(cond(`lines'>int(`lines'), `lines'+1, `lines'))

        /*
             1        lines+1      2*lines+1     ...  cols*lines+1
             2        lines+2      2*lines+2     ...  cols*lines+2
             3        lines+3      2*lines+3     ...  cols*lines+3
             ...      ...          ...           ...               ...
             lines    lines+lines  2*lines+lines ...  cols*lines+lines

             1        wid
        */

        * di "n=`n' cols=`cols' lines=`lines'"
        forvalues i=1(1)`lines' {
                local top = min((`cols')*`lines'+`i', `n')
                local col = `indent' + 1
                * di "`i'(`lines')`top'"
                forvalues j=`i'(`lines')`top' {
                        local x : word `j' of `0'
                        di as `sty' _column(`col') "`x'" _c
                        local col = `col' + `wid'
                }
                di as `sty'
        }
end

October 18, 2010 at 9:51 am 2 comments

Stata Programming Lecture [updated]

| Gabriel |

I gave my introduction to Stata programming lecture again. This time the lecture was cross-listed between my graduate statistics course and the UCLA ATS faculty seminar series. Here are the lecture notes.

October 14, 2010 at 3:50 pm 3 comments

Stata 11 Factor Variable / Margins Links

| Gabriel |

Michael Mitchell posted a brief tutorial on factor variables. (This has nothing to do with the “factor” command — think interaction terms and “xi,” not eigenvectors and structural equations.) People who are already familiar with FV won’t really get anything out of the tutorial, but it should serve as a very good introduction for people who are new to this syntax. Like all of Mitchell’s writing (e.g., the graphics book), it is very clearly written and generally well-suited to the user who is transitioning from novice to advanced usage.

The new Stata newsletter has a good write-up of “margins” syntax, which is useful in conjunction with factor variables for purposes of interpreting the betas (especially when there’s some nonlinearity involved). In a recent post explaining a comparable approach, I said that my impression was that margins only really works with categorical independent variables, but I’m pleased to see that I was mistaken and it in fact works with continuous variables as well.
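If you haven’t tried the new syntax yet, here’s a minimal sketch using the bundled auto data (my own toy example, not taken from either write-up) that combines a factor-variable interaction with “margins”:

sysuse auto, clear
*interact a categorical and a continuous regressor with factor-variable syntax
reg price i.foreign##c.mpg
*average marginal effect of mpg at each level of foreign
margins foreign, dydx(mpg)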

September 23, 2010 at 4:01 am 1 comment

Status, Sorting, and Meritocracy

| Gabriel |

Over at OrgTheory, Fabio asked about how much turnover we expect to see in the NRC rankings. In the comments, a few other people and I discussed the analysis of the rankings in Burris’s 2004 ASR article. Kieran mentioned the interpretation that the data could all be sorting.

To see how plausible this is, I wrote a simulation with 500 grad students, each of whom has a latent amount of talent that can be observed only with some noise. The students are admitted in cohorts of 15 each to 34 PhD-granting departments and are strictly sorted so that the (apparently) best students go to the best schools. There they work on their dissertations, the quality of which is a function of their talent, luck, and (to represent the possibility that top departments teach you more) a parameter proportional to the inverse root of the department’s rank. There is then a job market, with one job line per PhD-granting department, and again, strict sorting (without even an exception for the incest taboo). I then summarize the amount of reproduction as the proportion of top 10 jobs that are taken by grad students from the top 10 schools.

So how plausible is the meritocracy explanation? It turns out it’s pretty plausible. This table shows the average closure for the top 10 jobs, averaged over 100 runs each, for several combinations of assumptions. Each cell shows, on average, what proportion of the top 10 jobs we expect to be taken by students from the top 10 schools if we take the row and column parameters as assumptions. The rows represent different assumptions about how noisy our observation of talent is when we read an application to grad school or a job search. The columns represent a scaling parameter for how much you learn at schools of different ranks. For instance, if we assume a learning parameter of “1.5,” a student at the 4th highest-ranked school would learn 1.5/(4^0.5), or .75. It turns out that unless you assume the noise to be very high (something like a unit signal:noise ratio or worse), meritocracy is pretty plausible. Furthermore, if you assume that the top schools actually educate grad students better, then meritocracy looks very plausible even if there’s a lot of noise.

P of top 10 jobs taken by students from top 10 schools
---------------------------------------------
Noisiness of   |
Admissions and |How Much More Do You Learn at
Diss / Job     |         Top Schools
Market         |    0    .5     1   1.5     2
---------------+-----------------------------
             0 |    1     1     1     1     1
            .1 |    1     1     1     1     1
            .2 |    1     1     1     1     1
            .3 | .999     1     1     1     1
            .4 | .997     1     1     1     1
            .5 | .983  .995  .999     1     1
            .6 | .966   .99  .991  .999  .999
            .7 | .915   .96  .982  .991  .995
            .8 | .867  .932  .963  .975  .986
            .9 | .817  .887  .904  .957  .977
             1 | .788  .853  .873  .919   .95
---------------------------------------------
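As a quick sanity check of the learning-parameter arithmetic described above (a one-off calculation, not part of the simulation code below):

*value added at the 4th highest-ranked school with a learning parameter of 1.5
display 1.5*(1/(4^0.5))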

Of course, keep in mind this is all in a world of frictionless planes and perfectly spherical cows. If we assume that lots of people are choosing on other margins, or that there’s not a strict dual queue of positions and occupants (e.g., because searches are focused rather than “open”), then it gets a bit looser. Furthermore, I’m still not sure that the meritocracy model has a good explanation for the fact that academic productivity figures (citation counts, etc) have only a loose correlation with ranking.

Here’s the code; knock yourself out using different metrics of reproduction, inputting different assumptions, etc.

[Update: also see Jim Moody’s much more elaborate/realistic simulation, which gives similar results].

capture program drop socmeritocracy
program define socmeritocracy
	local gre_noise=round(`1',.001) /* size of error term, relative to standard normal, for apparenttalent=f(talent) */
	local diss_noise=round(`2',.001) /* size of error term, relative to standard normal, for dissquality=f(talent) */
	local quality=round(`3',.001) /* scaling parameter for valueadded (by quality grad school) */
	local cohortsize=round(`4',.001) /* size of annual graduate cohort (for each programs) */
	local facultylines=round(`5',.001) /* number of faculty lines (for each program)*/
	local batch `6'

	clear
	quietly set obs 500 /*create 500 BAs applying to grad school*/
	quietly gen talent=rnormal() /* draw talent from normal */
	quietly gen apparenttalent=talent + rnormal(0,`gre_noise') /*observe talent w error */
	*grad school admissions follows strict dual queue by apparent talent and dept rank
	gsort -apparenttalent
	quietly gen gradschool=1 + floor(([_n]-1)/`cohortsize')
	lab var gradschool "dept rank of grad school"
	*how much more do you actually learn at prestigious schools
	quietly gen valueadded=`quality'*(1/(gradschool^0.5))
	*how good is dissertation, as f(talent, gschool value added, noise)
	quietly gen dissquality=talent+rnormal(0,`diss_noise') + valueadded
	*job market follows strict dual queue of diss quality and dept rank (no incest taboo/preference)
	gsort -dissquality
	quietly gen placement=1 + floor(([_n]-1)/`facultylines')
	lab var placement "dept rank of 1st job"
	quietly sum gradschool
	quietly replace placement=. if placement>`r(max)' /*those not placed in PhD granting departments do not have research jobs (and may not even have finished PhD)*/
	*recode outcomes in a few ways for convenience of presentation
	quietly gen researchjob=placement
	quietly recode researchjob 0/999=1 .=0
	lab var researchjob "finished PhD and has research job"
	quietly gen gschool_type= gradschool
	quietly recode gschool_type 1/10=1 11/999=2 .=3
	quietly gen job_type= placement
	quietly recode job_type 1/10=1 11/999=2 .=3
	quietly gen job_top10= placement
	quietly recode job_top10 1/10=1 11/999=0
	lab def typology 1 "top 10" 2 "lower ranked" 3 "non-research"
	lab val gschool_type job_type typology
	if "`batch'"=="1" {
		quietly tab gschool_type job_type, matcell(xtab)
		local p_reproduction=xtab[1,1]/(xtab[1,1]+xtab[2,1])
		shell echo "`gre_noise' `diss_noise' `quality' `cohortsize' `facultylines' `p_reproduction'" >> socmeritocracyresults.txt
	}
	else {
		twoway (lowess researchjob gradschool), ytitle(Proportion Placed) xtitle(Grad School Rank)
		tab gschool_type job_type, chi2
	}
end

shell echo "gre_noise diss_noise quality cohortsize facultylines p_reproduction" > socmeritocracyresults.txt

forvalues gnoise=0(.1)1 {
	local dnoise=`gnoise'
	forvalues qualitylearning=0(.5)2 {
		forvalues i=1/100 {
			disp "`gnoise' `dnoise' `qualitylearning' 15 1 1 tick `i'"
			socmeritocracy `gnoise' `dnoise' `qualitylearning' 15 1 1
		}
	}
}

insheet using socmeritocracyresults.txt, clear delim(" ")
lab var gre_noise "Noisiness of Admissions and Diss / Job Market"
lab var quality "How Much More Do You Learn at Top Schools"
table gre_noise quality, c(m p_reproduction)

September 15, 2010 at 4:51 am 1 comment

Predicted vignettes

| Gabriel |

One of my favorite ways to interpret a complicated model is to make up hypothetical cases and see what predicted values they give you. For instance, at the end of my Oscars paper with Esparza and Bonacich, we compared predicted probabilities of Oscar nomination for the 2 x 2 of typical vs exceptional actors in typical vs exceptional films. Doing so helps make sense of a complicated model in a way that doesn’t boil down to p-value fetishism. Stata 11 has a very useful “margins” command for doing something comparable to this, but as best as I can tell, “margins” only works with categorical variables.

The way I like to do this kind of thing is to create some hypothetical cases representing various scenarios and use the “predict” command to see what the model predicts for such a scenario. (Note that you can do this in Excel, which is handy for trying to interpret published work, but it’s easier to use Stata for your own work). Below is a simple illustration of how this works using the 1978 cars dataset. In real life it’s overkill to do this for a model with only a few variables where everything is linear and there are no interactions, but this is just an illustration.

Also, note that this approach can get you into trouble if you feed it nonsensical independent variable values. So the example below includes the rather silly scenario of a car that accomplishes the formidable engineering feat of getting good mileage even though it’s very heavy. A conceptually related problem is that you have to be careful if you’re using interaction terms that are specified by the “gen” command (which is a good reason to use factor variables when possible instead of “gen” for interactions). Another way you can get in trouble is going beyond the observed range (e.g., trying to predict the price of a car that gets 100 mpg).
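As a hedged illustration of that interaction-term caveat, using the same 1978 cars data as the example below: with a hand-rolled product term, “predict” uses whatever value is sitting in the generated variable, so you would have to recompute it by hand for the hypothetical cases, whereas the factor-variable version rebuilds the interaction from its components.

sysuse auto, clear
*hand-rolled interaction: stale for hypothetical cases unless you recompute it
gen mpgXweight=mpg*weight
reg price mpg weight mpgXweight
*factor-variable interaction: predict recomputes it from mpg and weight
reg price c.mpg##c.weight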

sysuse auto, clear

local indyvars "mpg headroom weight"
local n_hypothetical 10

*create the hypothetical data

*first, append some (empty) cases 
local biggerdata=[_N]+`n_hypothetical'
set obs `biggerdata'
gen hypothetical=0
replace hypothetical=1 in -`n_hypothetical'/L

*for each of these hypothetical cases, set all independent variables to the 
* mean
*note that you can screw up here if some of the real cases are missing data 
* on some variables but not others. in such an instance the means for the 
* analytic subset will not match those of the hypothetical cases
foreach var in `indyvars' {
	sum `var' 
	replace `var'=`r(mean)' in -`n_hypothetical'/L
}

*change selected hypothetical values to be theoretically interesting values
*i'm using a nested loop with a tick counter to allow combinations of values 
* on two dimensions 
*if you only want to play with a single variable this can be a lot simpler
* alternately you don't need to loop at all, but can just "input" or "edit"
local i=-`n_hypothetical'
forvalues mpg=0/1 {
	forvalues weight=0/4 {
		quietly replace mpg=20 + 20*`mpg' in `i'
		quietly replace weight=2000 + 500*`weight' in `i'
		local i=`i'+1
	}
}

*make sure that hypothetical cases are missing on Y and therefore don't go 
*into regression model
replace price=. in -`n_hypothetical'/L

*do the regression
preserve
*to be extra sure the hypotheticals don't bias the regression, drop them, 
keep if hypothetical==0
reg price `indyvars'
*then bring them back
restore

*create predicted values (for all cases, real and hypothetical)
* this postestimation command does a lot of the work and you should explore
* the options, which are different for the various regressions commands 
predict yhat

*create table of vignette predictions (all else held at mean or median)
preserve
keep if hypothetical==1
table mpg weight, c(m yhat)
restore

September 10, 2010 at 4:59 am 2 comments

Weeks

| Gabriel |

I use a lot of data that exists at the daily level (stored as a string like “9/3/2010”), but which I prefer to handle at the weekly level. In part this is about keeping the memory manageable but it also lets me bracket the epiphenomenal issues associated with the weekly work pattern of the music industry (e.g., records drop on Tuesdays). There are a few ways to turn a date into a week in Stata: the official way, the way I used to do it, and the way I do it now.

1. Beginning with Stata 10, there is a weekly date format (%tw) that stores dates by the week, in contrast to %td (formerly just “%d”), which stores them by day. This is good for some purposes, but it gets messy if you’re trying to look at things that cross calendar years, since the last week of the year can have a funny number of days. (Another reason I don’t use these functions is simply that I’ve been working on these scripts since Stata 9.) However, if you want to do it, it looks like this:

gen fp_w=wofd(date( firstplayed,"MDY"))
format fp_w %tw

2a. The way I prefer to do it is to store the data as %td but force it to count by sevens, so effectively the %td value really stands for “the week ending on this date” or “this date give or take a few days.” Until a few days ago, I’d do this by dividing by seven, forcing the result to be an integer, and then multiplying by seven again.

gen fpdate=date(firstplayed,"MDY")
gen int fp1=fpdate/7
gen fp_w=fp1*7
format fp_w %td
drop fp1

This is really ugly code and I’m not proud of it. First, note that this is a lot of code considering how little it does. I could have done this more efficiently by using the “mod(x,y)” function to subtract the remainder. Second, this only works if you’re interested in rounding off to the closest Friday and not some other day of the week.
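Here’s a minimal sketch of that mod() alternative. Note that subtracting the remainder rounds down to the previous multiple of seven (i.e., the previous Friday, since %td day 0, 1jan1960, was a Friday) rather than to the nearest one:

gen fpdate=date(firstplayed,"MDY")
*subtract the remainder to snap to the Friday on or before the date
gen fp_w=fpdate-mod(fpdate,7)
format fp_w %td
drop fpdate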

2b. My new approach still stores the data as “%td” but is both more flexible and slightly simpler. In particular, it lets me define “week ending on X,” where X is any day of the week I choose, here specified as the local `dow' so I can define the end of the week once in the header and have it apply throughout the several places where I do something like this. Note that Stata treats Sunday as 0, Monday as 1, etc. What I do is subtract the actual day of the week and then add the target day of the week, so it substantively means “this thing happened on or a few days before this date.”

gen fp_w=date(firstplayed,"MDY")-dow(date(firstplayed,"MDY"))+`dow'
format fp_w %td

September 3, 2010 at 4:32 am 2 comments

Heads or Tails of your dta

| Gabriel |

A lot of languages (e.g., Bash and R) offer the commands/functions “head” and “tail.” These show you the first or last 5 or 10 things in a file or object. “Head” is similar to typing “list in 1/10” in Stata, and you’d do so for similar reasons. Because I’m getting used to the Unix versions, I wrote an ado file that lets “head” and “tail” work in Stata. Note that these Stata programs only work on the master dataset and can’t also be applied to using datasets or to output, as they can in Unix.

Update: see the comments for some suggestions on how to work in error messages and the like.

capture program drop head
program define head
	if [_N]<10 {
		local ten = [_N]
	}
	else {
		local ten 10
	}
	syntax [varlist(default=none)]
	list `varlist' in 1/`ten'
end

capture program drop tail
program define tail
	syntax [varlist(default=none)]
	local theend = [_N]
	local theend_min10 = [_N]-9 /* start 9 back so that exactly 10 observations are listed */
	if `theend_min10'<1 {
		local theend_min10 1
	}
	list `varlist' in `theend_min10'/`theend'
end

*have a nice day
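A quick usage sketch, assuming the programs above are in memory or saved as head.ado and tail.ado somewhere on the ado-path:

sysuse auto, clear
head
head make price mpg
tail make price mpg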

August 25, 2010 at 4:44 am 5 comments

Some ways Stata is an unusual language

| Gabriel |

As I’ve tried to learn other languages, I’ve realized that part of the difficulty isn’t that they’re hard (although in some cases they are) but that I’m used to Stata’s very distinctive paradigm and nomenclature. Some aspects of Stata are pretty standard (e.g., “while”/”foreach”/”forvalues” loops, log files, and the “file” syntax for using text files on disk), but other bits are pretty strange. Or rather, they’re strange from a computer science perspective but intuitive from a social science perspective.

Stata seems to have been designed to make sense to social scientists, and if this makes it confusing to programmers, then so be it. A simple example of this is that Stata uses the word “variable” in the sense meant by social scientists. More broadly, Stata is pretty bold about defaults so as to make things easy for beginners. It presumes that anything you’re doing applies to the dataset (aka the master data), which is always a flat-file database. Other things that might be held in memory have a secondary status, and beginning users don’t even know that they’re there. Likewise, commands distinguish between the important arguments (usually variables) and the secondary arguments, which Stata calls “options.” There are also very sensible assumptions about what to report and what to put in ephemeral data objects that can be accessed immediately after the primary command (but need not be stored as part of the original command, as they would be in most other languages).

Note, I’m not complaining about any of this. Very few of Stata’s quirks are pointlessly arbitrary. (The only arbitrary deviation I can think of is using “*” instead of “#” for commenting.) Most of Stata’s quirks are necessary in order to make it so user-friendly to social scientists. In a lot of ways R is a more conventional language than Stata, but most social scientists find Stata much easier to learn. In part because Stata is willing to deviate from the conventions of general-purpose programming languages, running and interpreting a regression in Stata looks like this, “reg y x”, instead of this, “summary(lm(y~x))”, and loading a dataset looks like this, “use mydata, clear”, instead of this, “data <- read.table("mydata.txt")”. Stata has some pretty complicated syntax (e.g., the entire Mata language), but you can get a lot done with just a handful of simple commands like “use,” “gen,” and “reg”.

Nonetheless, all this means that when native Stata speakers like me learn a second programming language, it can be a bit confusing. And FWIW, I worry that rumored improvements to Stata (such as allowing relational data in memory) will detract from its user-friendliness. Anyway, the point is that I love Stata and I think it’s entirely appropriate for social scientists to learn it first. I do most of my work in Stata and I teach/mentor my graduate students in Stata unless there’s a specific reason for them to learn something else. At the same time, I know that many social scientists would benefit a lot from also learning other languages. For instance, people into social networks should learn R, people who want to do content analysis should learn Perl or Python, and people who want to do simulations should learn NetLogo or Java. The thing is that when you do, you’re in for a culture shock, and so I’m making explicit some ways in which Stata is weird.

Do-files and Ado-files. In any other language a do-file would be called a script and an ado-file would be called a library. Also note that Stata very conveniently reads all your ado-files automatically, whereas most other languages require you to specifically load the relevant libraries into memory at the beginning of each script.

Commands, Programs, and Functions. In Stata a program is basically just a command that you wrote yourself. Stata is somewhat unusual in drawing a distinction between a command/program and a function. So in Stata a function usually means some kind of transformation that attaches its output to a variable or macro, as in “gen ln_income=log(income)”. In contrast, a command/program is pretty much anything that doesn’t directly attach to an operator, and it includes all file operations (e.g., “use”) and estimations (e.g., “regress”). Other languages don’t really draw this distinction but consider everything a function, no matter what it does and whether the user wrote it or not. (Some languages use “primitive” to mean something like the Stata command vs. program distinction, but it’s not terribly important.)
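A minimal sketch of the distinction, using the bundled auto data:

sysuse auto, clear
*log() is a function: its output attaches to a variable via "gen" and the = operator
gen ln_price=log(price)
*summarize is a command: it leads the line and reports directly
summarize ln_price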

Because most languages only have functions, this means that pretty much everything has to be assigned to an object via an operator. Hence Stata users would usually type “reg y x” whereas R users would usually type “myregression <- lm(y~x)”. This is because “regress” in Stata is a command whereas “lm()” in R is a function. Also note that Stata distinguishes between commands and everything else by word order, with the command being the first word. In contrast, functions in other languages (just like Stata functions) put the function name outside the parentheses, and all of the arguments, both data objects and options, go inside the parentheses.

The Dataset. Stata is one of the only languages where it’s appropriate to use the definite article in reference to data. (NetLogo is arguably another case of this). In other languages it’s more appropriate to speak of “a data object” than “the dataset,” even if there only happens to be one data object in memory. For the same reason, most languages don’t “use” or “open” data, but “read” the data and assign it to an object. Another way to think about it is that only Stata has a “dataset” whereas other languages only have “matrices.” Of course, Stata/Mata also has matrices but most Stata end users don’t bother with them as they tend to be kind of a backend thing that’s usually handled by ado-files. Furthermore, in other languages (e.g., Perl) it’s common to not even load a file into memory but to process it line-by-line, which in Stata terms is kind of like a cross between the “file read/write” syntax and a “while” loop.

Variables. Stata uses the term “variable” in the statistical or social scientific meaning of the term. In other languages this would usually be called a field or vector.

Macros. What most other languages call variables, Stata calls local and global “macros.” Stata’s usage of the local vs global distinction is standard. In other languages the concept of “declaring” a variable is usually a little more explicit than it is in Stata.

Stata is extremely good about expanding macros in situ, and this can spoil us Stata users. In other languages you often have to do some kind of crude workaround: first you use some kind of concatenation function to create a string object containing the expansion, and then you use that string object. For instance, if you wanted to access a series of numbered files in Stata you could just loop over this:

use ~/project/file`i', clear 

In other languages you’d have to add a separate line for the expansion. So in R you’d loop over:

filename <- paste('~/project/file',i, sep="")
data <- read.table(filename)

[Update: Also see this Statalist post by Nick Cox on the distinction between variables and macros]

Reporting. Stata allows you to pass estimations on for further work (that’s what return macros, ereturn matrices, and postestimation commands are all about), but it assumes you probably won’t and so it is unusually generous in reporting most of the really interesting things after a command. In other languages you usually have to specifically ask to get this level of reporting. Another way to put it is that in Stata verbosity is assumed by default and can be suppressed with “quietly,” whereas in R silence is assumed by default and verbosity can be invoked by wrapping the estimation (or an object saving the estimation) in the “summary()” function.
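For example, here’s a small sketch of that default verbosity and of the ephemeral results an estimation leaves behind:

sysuse auto, clear
*the full regression table prints by default
reg price mpg weight
*the report can be suppressed, but the results are still there afterwards
quietly reg price mpg weight
display e(r2)
ereturn list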

August 6, 2010 at 4:36 am 12 comments
