Posts tagged ‘macros’

Stata Programming Lecture [updated]

| Gabriel |

I gave my introduction to Stata programming lecture again. This time the lecture was cross-listed between my graduate statistics course and the UCLA ATS faculty seminar series. Here are the lecture notes.

October 14, 2010 at 3:50 pm 3 comments

Some ways Stata is an unusual language

| Gabriel |

As I’ve tried to learn other languages, I’ve realized that part of the difficulty isn’t that they’re hard (although in some cases they are) but that I’m used to Stata’s very distinctive paradigm and nomenclature. Some aspects of Stata are pretty standard (e.g., “while”/”foreach”/”forvalues” loops, log files, and the “file” syntax for using text files on disk), but other bits are pretty strange. Or rather, they’re strange from a computer science perspective but intuitive from a social science perspective.

Stata seems to have been designed to make sense to social scientists and if this makes it confusing to programmers, then so be it. A simple example of this is that Stata uses the word “variable” in the sense meant by social scientists. More broadly, Stata is pretty bold about defaults so as to make things easy for beginners. It presumes that anything you’re doing applies to the dataset (aka the master data)  — which is always a flat-file database. Other things that might be held in memory have a secondary status and beginning users don’t even know that they’re there. Likewise, commands distinguish between the important arguments (usually variables) and the secondary arguments, which Stata calls “options”. There’s also the very sensible assumptions about what to report and what to put in ephemeral data objects that can be accessed immediately after the primary command (but need not be stored as part of the original command, as they would in most other languages).

Note, I’m not complaining about any of this. Very few of Stata’s quirks are pointlessly arbitrary. (The only arbitrary deviation I can think of is using “*” instead of “#” for commenting). Most of Stata’s quirks are necessary in order to make it so user-friendly to social scientists. In a lot of ways R is a more conventional language than Stata, but most social scientists find Stata much easier to learn. In part because Stata is willing to deviate from the conventions of general purpose programming languages, running and interpreting a regression in Stata looks like this “reg y x” instead of this “summary(lm(y~x))” and loading a dataset looks like this “use mydata, clear” instead of this “data <- read.table(mydata.txt)”. Stata has some pretty complicated syntax (e.g., the entire Mata language) but you can get a lot done with just a handful of simple commands like “use,” “gen,” and “reg”.

Nonetheless all this means that when Stata native speakers like me learn a second programming language it can be a bit confusing. And FWIW, I worry that rumored improvements to Stata (such as allowing relational data in memory) will detract from its user-friendliness. Anyway, the point is that I love Stata and I think it’s entirely appropriate for social scientists to learn it first. I do most of my work in Stata and I teach/mentor my graduate students in Stata unless there’s a specific reason for them to learn something else. At the same time I know that many social scientists would benefit a lot from also learning other languages. For instance, people into social networks should learn R, people who want to do content analysis should learn Perl or Python, and people who want to do simulations should learn NetLogo or Java. The thing is that when you do, you’re in for a culture shock and so I’m making explicit some ways in which Stata is weird.

Do-files and Ado-files. In any other language a do-file would be called a script and an ado-file would be called a library. Also note that Stata very conveniently reads all your ado-files automatically, whereas most other languages require you to specifically load the relevant libraries into memory at the beginning of each script.

Commands, Programs, and Functions. In Stata a program is basically just a command that you wrote yourself. Stata is somewhat unusual in drawing a distinction between a command/program and a function. So in Stata a function usually means some kind of transformation that attaches its output to a variable or macro, as in “gen ln_income=log(income)”. In contrast a command/program is pretty much anything that doesn’t directly attach to an operator and includes all file operations (e.g., “use”) and estimations (e.g, “regress”). Other languages don’t really draw this distinction but consider everything a function, no matter what it does and whether the user wrote it or not. (Some languages use “primitive” to mean something like the Stata command vs. program distinction, but it’s not terribly important).

Because most languages only have functions this means that pretty much everything has to be assigned to an object via an operator. Hence Stata users would usually type “reg y x” whereas R users would usually type “myregression <- lm(y~x)”. This is because “regress” in Stata is a command whereas “lm()” in R is a function. Also note that Stata distinguishes between commands and everything else by word order syntax with the command being the first word. In contrast functions in other languages (just like Stata functions) have the function being the thing outside the parentheses and inside the parentheses goes all of the arguments, both data objects and options.

The Dataset. Stata is one of the only languages where it’s appropriate to use the definite article in reference to data. (NetLogo is arguably another case of this). In other languages it’s more appropriate to speak of “a data object” than “the dataset,” even if there only happens to be one data object in memory. For the same reason, most languages don’t “use” or “open” data, but “read” the data and assign it to an object. Another way to think about it is that only Stata has a “dataset” whereas other languages only have “matrices.” Of course, Stata/Mata also has matrices but most Stata end users don’t bother with them as they tend to be kind of a backend thing that’s usually handled by ado-files. Furthermore, in other languages (e.g., Perl) it’s common to not even load a file into memory but to process it line-by-line, which in Stata terms is kind of like a cross between the “file read/write” syntax and a “while” loop.

Variables. Stata uses the term “variable” in the statistical or social scientific meaning of the term. In other languages this would usually be called a field or vector.

Macros. What most other languages call variables, Stata calls local and global “macros.” Stata’s usage of the local vs global distinction is standard. In other languages the concept of “declaring” a variable is usually a little more explicit than it is in Stata.

Stata is extremely good about expanding macros in situ and this can spoil us Stata users. In other languages you often have to do some kind of crude work around by first using some kind of concatenate function to create a string object containing the expansion and then you use that string object. For instance, if you wanted to access a series of numbered files in Stata you could just loop over this:

use ~/project/file`i', clear 

In other languages you’d have to add a separate line for the expansion. So in R you’d loop over:

filename <- paste('~/project/file',i, sep="")
data <- read.table(filename)

[Update: Also see this Statalist post by Nick Cox on the distinction between variables and macros]

Reporting. Stata allows you to pass estimations on for further work (that’s what return macros, ereturn matrices, and postestimation commands are all about), but it assumes you probably won’t and so it is unusually generous in reporting most of the really interesting things after a command. In other languages you usually have to specifically ask to get this level of reporting. Another way to put it is that in Stata verbosity is assumed by default and can be suppressed with “quietly,” whereas in R silence is assumed by default and verbosity can be invoked by wrapping the estimation (or an object saving the estimation) in the “summary()” function.

August 6, 2010 at 4:36 am 12 comments

wos2tab.pl

| Gabriel |

One of my grad students is doing some citation network analysis, for which the Python script (and .exe wrapper) wos2pajek is very well-suited. (Since most network packages can read “.net” this is a good idea even if you’re not using Pajek).

However the student is also interested in node level attributes, not just the network. Unfortunately WOS queries are field-tagged which is kind of a pain to work with and the grad student horrified me by expressing the willingness to spend weeks reshaping the data by hand in Excel. (Even in grad school your time is a lot more valuable than that). To get the data into tab-delimited text, I modified an earlier script I wrote for parsing field-tagged IMDb files (in my case business.list but most of the film-level IMDb files are structured similarly). The basic approach is to read a file line-by-line and match its contents by field-tag, saving the contents in a variable named after the tag. Then when you get to the new record delimiter (in this case, a blank line), dump the contents to disk and wipe the variables. Note that since the “CR” (cited reference) field has internal carriage returns it would require a little doing to integrate into this script, which is one of the reasons you’re better off relying on wos2pajek for that functionality.

#!/usr/bin/perl
#wos2tab.pl by ghr
#this script converts field-tagged Web Of Science queries to tab-delimited text
#for creating a network from the "CR" field, see wos2pajek
#note, you can use the info extracted by this script to replicate a wos2pajek key and thus merge

use warnings; use strict;
die "usage: wos2tab.pl <wos data>\n" unless @ARGV==1;

my $rawdata = shift(@ARGV);

my $au ; #author
my $ti ; #title
my $py ; #year
my $j9 ; #j9 coding of journal title
my $dt ; #document type

# to extract another field, work it in along the lines of the existing vars
# each var must be
# 1. declared with a "my statement" (eg, lines 12-16)
# 2. added to the header with the "print OUT" statement (ie, line 29)
# 3. written into a search and store loop following an "if" statement (eg, lines 37-41)
# 4. inside the blank line match loop (ie, lines 59-66)
#  4a. add to the print statement (ie, line 60)
#  4b. add a clear statement (eg, lines 61-65)

open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">$rawdata.tsv") or die "error creating $rawdata.tsv\n";
print OUT "au\tdt\tpy\tti\tj9\n";
while (<IN>) {
	if($_ =~ m/^AU/) {
		$au = $_;
		$au =~ s/\015?\012//; #manual chomp
		$au =~ s/^AU //; #drop leading tag
		$au =~ s/,//; #drop comma -- author only
	}
	if($_ =~ m/^DT/) {
		$dt = $_;
		$dt =~ s/\015?\012//; #manual chomp
		$dt =~ s/^DT //; #drop leading tag
	}
	if($_ =~ m/^TI/) {
		$ti = $_;
		$ti =~ s/\015?\012//; #manual chomp
		$ti =~ s/^TI //; #drop leading tag
	}
	if($_ =~ m/^J9/) {
		$j9 = $_;
		$j9 =~ s/\015?\012//; #manual chomp
		$j9 =~ s/^J9 //; #drop leading tag
	}
	if($_ =~ m/^PY/) {
		$py = $_;
		$py =~ s/\015?\012//; #manual chomp
		$py =~ s/^PY //; #drop leading tag
	}
	
	#when blank line is reached, write out and clear memory 
	if($_=~ /^$/) {
		print OUT "$au\t$dt\t$py\t$ti\t$j9\n";
		$au = "" ;
		$dt = "" ;
		$ti = "" ;
		$py = "" ;
		$j9 = "" ;
	}
}
close IN;
close OUT;
print "\ndone\n";

July 19, 2010 at 2:13 pm 6 comments

Matrices within matrices

| Gabriel |

I was playing with the multiple correspondence analysis command in Stata and noticed that some of the return matrices had not just row and column labels, but hierarchy within the table. For instance, the return matrix e(A) is organized such that columns are dimensions, the high level rows are variables, and the low level rows are variables. I thought this was interesting but couldn’t find any explicit documentation of the hierarchy issue, either with reference to mca return matrices specifically or to matrices in general. After some tinkering (and help from UCLA ATS) I figured out that you can either ignore the high level rows and call values by absolute position in the matrix or call values by using a colon like this “main:sub”. It seems to work best with labels.

Anyway, here’s what I’m talking about. Note that the last few commands are synonymous as all three call the Cartesian coordinates for people who strongly agree with the statement that “any change makes nature worse”. The only difference is in their use of labels or numbers and absolute position in the matrix versus using the “variable:value” structure:

. webuse issp93a, clear
(Selection from ISSP (1993))

. quietly mca A B C D

. mat mca_coeff=e(A)

. matlist mca_coeff

             |      dim1       dim2 
-------------+----------------------
A            |                      
agree_stro~y |  1.836627   .7274591 
       agree |  .5462399  -.2844426 
neither_ag~e | -.4467973  -1.199439 
    disagree | -1.165903   .7367824 
disagree_s~y | -1.995217   2.470026 
-------------+----------------------
B            |                      
agree_stro~y |  2.924321   1.370078 
       agree |   .641516  -.6669377 
neither_ag~e |  .3460504  -.9639179 
    disagree |  -.714126  -.2800712 
disagree_s~y | -1.353725   2.107677 
-------------+----------------------
C            |                      
agree_stro~y |  2.157782    .908553 
       agree |  .2468277  -.5916111 
neither_ag~e | -.6189958  -1.044412 
    disagree | -1.348858   .6346467 
disagree_s~y | -1.467582   3.016588 
-------------+----------------------
D            |                      
agree_stro~y |  1.203782   1.821975 
       agree | -.2211514  -.0069347 
neither_ag~e | -.3846555  -1.158694 
    disagree | -.2216352  -.2105125 
disagree_s~y |  .7077495   1.151804 

. matlist mca_coeff[11,1..2]

             |      dim1       dim2 
-------------+----------------------
C            |                      
agree_stro~y |  2.157782    .908553 

. matlist mca_coeff[11,"dim1".."dim2"]

             |      dim1       dim2 
-------------+----------------------
C            |                      
agree_stro~y |  2.157782    .908553 

. matlist mca_coeff["C:agree_strongly","dim1".."dim2"]

             |      dim1       dim2 
-------------+----------------------
C            |                      
agree_stro~y |  2.157782    .908553  

Note that this kind of thing is common in other languages. So Perl uses package qualifiers (“main::sub”) to distinguish between objects within objects (or commands within libraries). Likewise R (which treats data more like Stata matrices than Stata master data) uses lots of nested data which you can call as “main[sub]”. For instance, igraph has 9 objects nested within a graph data object.

May 18, 2010 at 5:21 am

Programming notes

| Gabriel |

I gave a lecture to my grad stats class today on Stata programming. Regular readers of the blog will have heard most of this before but I thought it would help to have it consolidated and let more recent readers catch up.

Here’s my lecture notes. [Updated link 10/14/2010]

November 12, 2009 at 3:33 pm 2 comments

Strip the spaces from a string

| Gabriel |

Because both Stata and the OS treat (non-quote-escaped) whitespace as syntax parsing, I try to keep spaces out of strings when possible and either just run the words together or put in underscores. I especially like to do this for anything having to do with file paths. On the other hand I sometimes want to keep the spaces. For instance, if I have a file with lots of pop songs (many of which have spaces in their titles) and I want to graph them, I like to have the regular spelling in the title (as displayed in the graph) but take the spaces out for the filename. I wrote a little program called “nospaces” to strip the spaces out of a string and return it as a global.

capture program drop nospaces
program define nospaces
	set more off
	local x=lower("`1'")
	local char ""
	local char "`2'"
	local nspaces=wordcount("`x'")-1
	forvalues ns=1/`nspaces' {
		quietly disp regexm("`x'","(.+) (.+)")
		local x=regexs(1)+"`char'"+regexs(2)
	}
	quietly disp "`x'"
	global x = "`x'"
end

Note that I’m too clumsy of a programmer to figure out how to get it to return a local so there’s the somewhat clumsy workaround of having it return a global called “x.” There’s no reason to use this program interactively but it could be a good routine to work into a do-file. Here’s an example of how it would work. This little program (which would only be useful if you looped it) opens a dataset, keeps a subset, and saves that subset as a file named after the keep criteria.

local thisartist "Dance Hall Crashers"
local thissong "Fight All Night" 
use alldata, clear
keep if artist=="`thisartist'" & song=="`thissong'"
nospaces "`thisartist'"
local thisartist=$x
nospaces "`thissong'"
save `thisartist'_$x.dta, replace

May 5, 2009 at 3:23 am 2 comments

Globals

| Gabriel |

I’ve found that a very effective way to write complicated do-files is to start the file with a list of globals, which I then refer to throughout the script in various way. This is useful for things that I need to use multiple times (and want to do consistently and concisely) and for things that I need to change a lot, either because I’m experimenting with different specifications or because I might migrate the project to a different computer. The three major applications are directories, specification options, and varlists. So for instance, the beginning of a file I use to clean radio data into diffusion form looks like this:

*directories
global parentpath "/Users/rossman/Documents/book/stata"
global raw        "$parentpath/rawsongs"
global clean      "$parentpath/clean"
*options/switches
global catn       15 /*min n for first cut by category*/
global catn2      5  /*min n for second cut*/
global earlywin   60 /*how many days from first add to 5th %ile are OK? */
global earlydrop  1  /*drop adds that are earlier than p5-$earlywin?*/

Using globals to describe directories is the first trick I learned with globals (thanks to some of the excellent programming tutorials at UCLA CCPR). This is particularly useful for cleaning scripts where it’s a good idea to have several directories and move around between them. I often have a raw directory, a “variables to be merged in” directory and a clean directory and I have to switch around between them a lot. Using globals ensures that I spell the directory names consistently and makes it easy to migrate the code to another computer (such as a server). Without globals I’d have to search through the code and manually change every reference to a directory. When I’m feeling really hardcore I start the cleaning script by emptying the clean directory like this

cd $clean
shell rm *.*

This ensures that every file is created from scratch and I’m not, for instance, merging on an obsolete file that was created using old data which would be a very bad thing. The next thing I use globals for are (for lack of a better term) options or switches. As described in a previous post, one of the things you can do with this is use the “if” syntax to turn entire blocks of code on or off. There are other ways to use the switches. For instance, one of the issues I deal with in my radio data is where to draw the line between outlier and data. A fairly common diffusion pattern for the second single on an album is that two or three stations add the song when the album is first released, nothing happens for a few months (because most stations are playing the actual first single), and then dozens of stations add the song on or near the single release date, several months later a few stations may add the single here and there. So there’s basically one big wave of diffusion preceded by a few outliers and followed by a few outliers and there are good theoretical reasons for expecting that the causal process for the outliers is different from the main wave. Dealing with these outliers (especially the left ones) is a judgement call and so I like to experiment with different specifications. In the sample globals I have “earlydrop” to indicate whether I’m going to drop early outliers and “earlywin” to define how early is too early. Having these written as globals rather than written directly into the code makes it much easier to experiment with different specifications and see how robust the analysis is to them.

Likewise one of the things I like to do is break out the diffusion curves by some cluster and then another cluster within that cluster. For instance, I like to break out song diffusion by station format (genre) and within format, by the stations’ corporate owner. This way I can see if, holding constant format, Clear Channel stations time adoptions differently that other stations (in fact, they don’t). Of course this raises the question of how small of a cluster is worth bothering with. If a song gets played on 100 country stations, 25 AC stations, and 10 CHR stations, it’s obviously worth modeling the diffusion on country but it’s debatable whether it’s worth doing for AC or CHR. Likewise, if 35 of the country stations are owned by Clear Channel, 10 by Cumulus, and no more than 5 by any other given company, which corporate chains are worth breaking out and which get lumped into not-elsewhere-classified. Treating the thresholds “catn” and “cat2n” as globals lets me easily experiment with where I draw the line between broken out and lumped into n.e.c.

The last thing I use globals for is varlists. Because I mostly do this in analysis files not cleaning files, I don’t have any in the file I’ve been using as an example but here’s a hypothetical example. Suppose that I had exit poll data on a state ballot initiative to create vouchers for private school tuition and my hypothesis was that homeowners vote on this issue rationally so as to preserve their property values. So for instance, people who live in good school districts near bad school districts will oppose the initiative and I measure this with “betterschools” defined as the logged ratio of your school’s test scores to the average test scores of the adjacent schools. Like a lot of sociology papers I construct this as a nested model where I add in groups of variables, starting with the usual suspects, adding in some attitude variables and occupation/sector variables, and then the additive version of my hypothesis, and finally the interactions that really test the hypothesis. Finally I’ve gotten an R+R in which a peer reviewer is absolutely convinced that Cancer’s are selfless whereas Capricorn’s are cynical and demands that I include astrological sign as a control variable and the editor’s cover letter reminds me to address the reviewer’s important concerns about birth sign. It’s much easier to create globals for each group of variables then write the regression model on the globals than it is to write the full varlist for each model. One of the reasons is that it’s easier to change. For instance, imagine that I decide to experiment with specifying education as a dummy vs. years of education, using globals means that I need only change this in “usualsuspects” and the change propagates through the models.

global usualsuspects    "edu_ba black female ln_income married nkids"
global job              "occ1 occ2 occ3 occ4 union pubsector"
global attitudes        "catholic churchattend premarsx equal4"
global home             "homeowner betterschools"
global homeinteract     "homeowner_betterschools"
global astrology        "aries taurus gemini cancer leo virgo libra scorpio sagittarius capricorn aquarius pisces"
eststo clear
eststo: logit voteyes $usualsuspects
eststo: logit voteyes $usualsuspects $job
eststo: logit voteyes $usualsuspects $job $attitudes
eststo: logit voteyes $usualsuspects $job $attitudes $home
eststo: logit voteyes $usualsuspects $job $attitudes $home $homeinteract
eststo: logit voteyes $usualsuspects $job $attitudes $home $homeinteract $astrology
esttab using regressiontable.txt , se b(3) se(3) scalars(ll) title(Table: NESTED MODELS OF VOTING FOR SCHOOL VOUCHERS) replace fixed
eststo clear

April 25, 2009 at 11:42 am


The Culture Geeks