Posts tagged ‘shell’

Some ways Stata is an unusual language

| Gabriel |

As I’ve tried to learn other languages, I’ve realized that part of the difficulty isn’t that they’re hard (although in some cases they are) but that I’m used to Stata’s very distinctive paradigm and nomenclature. Some aspects of Stata are pretty standard (e.g., “while”/”foreach”/”forvalues” loops, log files, and the “file” syntax for using text files on disk), but other bits are pretty strange. Or rather, they’re strange from a computer science perspective but intuitive from a social science perspective.

Stata seems to have been designed to make sense to social scientists, and if this makes it confusing to programmers, then so be it. A simple example of this is that Stata uses the word “variable” in the sense meant by social scientists. More broadly, Stata is pretty bold about defaults so as to make things easy for beginners. It presumes that anything you’re doing applies to the dataset (aka the master data), which is always a flat-file database. Other things that might be held in memory have a secondary status, and beginning users don’t even know that they’re there. Likewise, commands distinguish between the important arguments (usually variables) and the secondary arguments, which Stata calls “options.” There are also the very sensible assumptions about what to report and what to put in ephemeral data objects that can be accessed immediately after the primary command (but need not be stored as part of the original command, as they would be in most other languages).

Note, I’m not complaining about any of this. Very few of Stata’s quirks are pointlessly arbitrary. (The only arbitrary deviation I can think of is using “*” instead of “#” for commenting.) Most of Stata’s quirks are necessary in order to make it so user-friendly to social scientists. In a lot of ways R is a more conventional language than Stata, but most social scientists find Stata much easier to learn, in part because Stata is willing to deviate from the conventions of general-purpose programming languages: running and interpreting a regression in Stata looks like “reg y x” instead of “summary(lm(y~x))”, and loading a dataset looks like “use mydata, clear” instead of “data <- read.table('mydata.txt')”. Stata has some pretty complicated syntax (e.g., the entire Mata language), but you can get a lot done with just a handful of simple commands like “use,” “gen,” and “reg.”

Nonetheless, all this means that when Stata native speakers like me learn a second programming language, it can be a bit confusing. And FWIW, I worry that rumored improvements to Stata (such as allowing relational data in memory) will detract from its user-friendliness. Anyway, the point is that I love Stata and I think it’s entirely appropriate for social scientists to learn it first. I do most of my work in Stata and I teach/mentor my graduate students in Stata unless there’s a specific reason for them to learn something else. At the same time, I know that many social scientists would benefit a lot from also learning other languages. For instance, people into social networks should learn R, people who want to do content analysis should learn Perl or Python, and people who want to do simulations should learn NetLogo or Java. The thing is that when you do, you’re in for a culture shock, and so I’m making explicit some ways in which Stata is weird.

Do-files and Ado-files. In any other language a do-file would be called a script and an ado-file would be called a library. Also note that Stata very conveniently reads all your ado-files automatically, whereas most other languages require you to specifically load the relevant libraries into memory at the beginning of each script.

Commands, Programs, and Functions. In Stata a program is basically just a command that you wrote yourself. Stata is somewhat unusual in drawing a distinction between a command/program and a function. So in Stata a function usually means some kind of transformation that attaches its output to a variable or macro, as in “gen ln_income=log(income)”. In contrast, a command/program is pretty much anything that doesn’t directly attach to an operator, and includes all file operations (e.g., “use”) and estimations (e.g., “regress”). Other languages don’t really draw this distinction but consider everything a function, no matter what it does and whether the user wrote it or not. (Some languages use “primitive” to mean something like the Stata command vs. program distinction, but it’s not terribly important.)

Because most languages only have functions, pretty much everything has to be assigned to an object via an operator. Hence Stata users would usually type “reg y x” whereas R users would usually type “myregression <- lm(y~x)”. This is because “regress” in Stata is a command whereas “lm()” in R is a function. Also note that Stata distinguishes commands from everything else by word order, with the command being the first word of the line. In contrast, functions in other languages (just like Stata functions) put the function name outside the parentheses, and all of the arguments, both data objects and options, go inside the parentheses.
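To make the contrast concrete, here is a minimal sketch (the income variable is hypothetical): a function attaches its output to something through an operator, while a command/program stands alone as the first word of the line.

*a function: transforms its input and attaches the result to a variable via an operator
gen ln_income = log(income)

*a command (here a trivial user-written program): the verb comes first and nothing is assigned
capture program drop hello
program define hello
	display "hello, `1'"
end
hello world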

The Dataset. Stata is one of the only languages where it’s appropriate to use the definite article in reference to data. (NetLogo is arguably another case of this). In other languages it’s more appropriate to speak of “a data object” than “the dataset,” even if there only happens to be one data object in memory. For the same reason, most languages don’t “use” or “open” data, but “read” the data and assign it to an object. Another way to think about it is that only Stata has a “dataset” whereas other languages only have “matrices.” Of course, Stata/Mata also has matrices but most Stata end users don’t bother with them as they tend to be kind of a backend thing that’s usually handled by ado-files. Furthermore, in other languages (e.g., Perl) it’s common to not even load a file into memory but to process it line-by-line, which in Stata terms is kind of like a cross between the “file read/write” syntax and a “while” loop.
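Something like the following sketch (the file name is hypothetical) is about as close as Stata gets to that line-by-line style, combining “file read” with a “while” loop:

*process a text file line by line without ever loading it as a dataset
tempname fh
file open `fh' using "mytext.txt", read text
file read `fh' line
while r(eof)==0 {
	*do something with the local macro `line' here
	file read `fh' line
}
file close `fh'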

Variables. Stata uses the term “variable” in the statistical or social scientific meaning of the term. In other languages this would usually be called a field or vector.

Macros. What most other languages call variables, Stata calls local and global “macros.” Stata’s usage of the local vs global distinction is standard. In other languages the concept of “declaring” a variable is usually a little more explicit than it is in Stata.
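For instance (the values here are arbitrary), a local macro exists only within the current do-file or program, while a global persists for the whole session:

local alpha = .05
global projdir "~/project"
display `alpha'
display "$projdir"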

Stata is extremely good about expanding macros in situ, and this can spoil us Stata users. In other languages you often have to do a crude workaround by first using some kind of concatenate function to create a string object containing the expansion and then using that string object. For instance, if you wanted to access a series of numbered files in Stata you could just loop over this:

use ~/project/file`i', clear 

In other languages you’d have to add a separate line for the expansion. So in R you’d loop over:

filename <- paste('~/project/file',i, sep="")
data <- read.table(filename)
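For completeness, the Stata one-liner above would normally sit inside a loop, something like this (assuming, hypothetically, files numbered 1 through 10):

forvalues i=1/10 {
	use ~/project/file`i', clear
	*do whatever you need with each file here
}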

[Update: Also see this Statalist post by Nick Cox on the distinction between variables and macros]

Reporting. Stata allows you to pass estimations on for further work (that’s what return macros, ereturn matrices, and postestimation commands are all about), but it assumes you probably won’t and so it is unusually generous in reporting most of the really interesting things after a command. In other languages you usually have to specifically ask to get this level of reporting. Another way to put it is that in Stata verbosity is assumed by default and can be suppressed with “quietly,” whereas in R silence is assumed by default and verbosity can be invoked by wrapping the estimation (or an object saving the estimation) in the “summary()” function.
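A minimal illustration (with hypothetical variables y and x): suppress the report, then pull whatever you need out of the ephemeral e() results afterwards:

quietly regress y x
ereturn list            /*see everything the estimation left behind*/
display _b[x], _se[x]   /*the coefficient and standard error on x*/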

August 6, 2010 at 4:36 am 12 comments

importspss.ado (requires R)

| Gabriel |

Mike Gruszczynski has a post up pointing out that you can use R to translate files, for instance from SPSS to Stata. I like this a lot because it lets you avoid using SPSS, but I’d like it even better if it let you avoid using R as well.

As such, I rewrote the script to work entirely from Stata. Mike wanted to do this in Bash but couldn’t figure out how to pass arguments from the shell to R. Frankly, I don’t know how to do this either, which is why my solution is to have Stata write and execute an R source file so all the argument passing occurs within Stata. This follows my general philosophy of doing a lot of code mise en place in a user-friendly language so I can spend as little time as necessary in R. (Note that you could just as easily write this in Bash, but I figured this way you can (a) make it cross-platform and (b) attach it to “use” for a one-stop-shop “import” command.)

*importspss.ado
*by GHR 6/29/2010
*this script uses R to translate SPSS to Stata
*it takes as arguments the SPSS file and Stata file
*adapted from http://mikegruz.tumblr.com/post/704966440/convert-spss-to-stata-without-stat-transfer 

*DEPENDENCY: R and library(foreign) 
*if R exists but is not in your PATH, change the reference to "R" in the shell command below to the specific location

capture program drop importspss
program define importspss
	set more off
	local spssfile `1'
	if "`2'"=="" {
		local statafile "`spssfile'.dta"
	}
	else {
		local statafile `2'	
	}
	local sourcefile=round(runiform()*1000)
	capture file close rsource
	file open rsource using `sourcefile'.R, write text replace
	file write rsource "library(foreign)" _n
	file write rsource `"data <- read.spss("`spssfile'", to.data.frame=TRUE)"' _n
	file write rsource `"write.dta(data, file="`statafile'")"' _n
	file close rsource
	shell R --vanilla <`sourcefile'.R
	erase `sourcefile'.R
	use `statafile', clear
end
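Usage is just the SPSS file followed (optionally) by the name you want for the Stata file; the file names here are hypothetical:

importspss mysurvey.sav mysurvey.dta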

June 29, 2010 at 3:01 pm 6 comments

Gnuplot [updated]

| Gabriel |

Here’s an updated version of my old script for piping from Stata to Gnuplot. Gnuplot is a generic graphing utility. The main thing it can do that Stata can’t is good 3D graphs. (Although the ado files “surface” and “tddens” make a good effort at adding this capability). In this version I set it to a coarser smooth, set the horizontal plane to the minimum value for Z, and allow either contour or surface.

Note that the dgrid3d smoothing algorithm in Gnuplot 4.2 (which is still the version maintained in Fink and Ubuntu) has some problems, the most annoying of which is that it can go haywire at the borders. I decided to handle this by setting it to a coarse smooth (5,5,1). However, if unlike me you’re lucky enough to have Gnuplot 4.3, you have access to several much better smoothing algorithms, and you may want to experiment with increasing the resolution and using one of the new algorithms by editing the “set dgrid3d” line of this script.

*GHR 5/25/2010
*gnuplotpm3d 1.1
*pipes data to gnuplot's splot command, useful for color surface graphs
capture program drop gnuplotpm3d
program define gnuplotpm3d
syntax varlist(min=3 max=3 numeric) [ , title(string asis) xlabel(string asis) ylabel(string asis) using(string asis) contour rotate(string asis)]

disp "`rotate'"

*label the axes with the first two variables in the varlist (x and y) unless told otherwise
tokenize `varlist'
if "`xlabel'"=="" {
local xlabel="`1'"
}
if "`ylabel'"=="" {
local ylabel="`2'"
}
tempfile gnuplotpm3cmd
tempfile data

preserve
keep `varlist'
order `varlist'
disp "`using'"
outsheet using `data'.txt
restore

*the default assumes a Unix-style install where "gnuplot" is on the PATH (e.g., fink or macports)
*on a Mac it instead assumes the binary Gnuplot.app install; if you installed with fink or macports, delete the MacOSX special case below so it falls through to plain "gnuplot"
local gnuplot="gnuplot"
if "`c(os)'"=="MacOSX" {
	local gnuplot="exec '/Applications/Gnuplot.app/Contents/Resources/bin/gnuplot'"  /*assumes binary install, set to simply "gnuplot" if fink or macports install*/
}
if "`c(os)'"=="Windows" {
	*not tested
	local gnuplot="gnuplot.exe"  
}

file open gpcmd using `gnuplotpm3cmd', write text
file write gpcmd "cd '`c(pwd)'' ; " _n
file write gpcmd "set terminal postscript color ; set output '`using'.eps' ; set palette color positive ; " _n
file write gpcmd "set auto ; set parametric ; " _n
file write gpcmd "set dgrid3d 5,5,1 ; " _n  /*interpolation algorithm, allows tolerance for data irregularities -- set to low numbers for smooth graph and high numbers for bumpier graph*/
file write gpcmd `"set title "`title'" ; set ylabel "`ylabel'"; set xlabel "`xlabel'"; "' _n
file write gpcmd "unset contour; unset surface; " _n
if "`contour'"=="contour" {
	file write gpcmd "set view map; " _n
}
if "`rotate'"!="" & "`contour'"!="contour" {
	file write gpcmd "set view `rotate'; " _n
}
if "`contour'"=="contour" & "`rotate'"!="" {
	disp "rotate options ignored as they are incompatible with contour view"
}
file write gpcmd "set xyplane 0; " _n  /*put the xy plane right at min(z) */
file write gpcmd "set pm3d;" _n
file write gpcmd "set pm3d interpolate 10,10; " _n
file write gpcmd `"splot "`data'.txt"; "' _n
file close gpcmd

shell `gnuplot' `gnuplotpm3cmd'
end
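Usage looks something like this (the variable and file names are hypothetical); the three variables are taken in x, y, z order and the graph is written to the using() name plus “.eps”:

gnuplotpm3d lon lat density, using(densitysurface) title(Density surface)
gnuplotpm3d lon lat density, using(densitymap) contour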

May 25, 2010 at 3:01 pm 2 comments

Cited reference search time-series

| Gabriel |

[Update: it looks like Google redirects you to a captcha after the first 100 pages, so the script won’t work for mega-cites like DiMaggio and Powell 1983].

I was recently talking to somebody who suspected an article he wrote 30 years ago was something of a “sleeper hit” and wanted to see an actual time-series. I wrote this little script to read Google Scholar and extract the dates. You have to tell it the Google Scholar serial number for the focal cite and how many pages to collect.

For instance, if you search GS for Strang and Soule’s ARS and click where it says “Cited by 493” you get the URL “http://scholar.google.com/scholar?cites=3071200965662451019&hl=en&as_sdt=2000”. The important part of the URL is the number between “cites=” and “&”. To figure out how many pages to collect, divide the number of citations by 10 and round down. So the syntax to scrape for this cite would be:

bash gscholarscrape.sh 3071200965662451019 49

Here’s the time-series for citations to that article

Here’s the code. Note that with fairly little modification you could get it to also give the names of the citing journals or books and their authors.

#!/bin/bash
# gscholarscrape.sh
# this script scrapes google scholar for references to a given cite
# GHR, rossman@soc.ucla.edu

#takes as arguments the serial number of the cite followed by the number of pages deep to scrape (# of cites / 10)
#eg, for DiMaggio and Powell ASR 1983 the syntax is
#gscholarscrape.sh 11439231157488236678 1103
for (( i = 0; i < $2; i++ )); do
	j=$(($i*10))
	curl -A "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.307.11 Safari/532.9" -o gs$1_page$i.htm "http://scholar.google.com/scholar?start=$j&hl=en&cites=$1"
done

echo "date" >  $1.txt
for (( i = 0; i < $2; i++ )); do
	perl -nle 'print for m/ ([0-9][0-9][0-9][0-9]) - /g' gs$1_page$i.htm >> $1.txt
done

# have a nice day

May 13, 2010 at 4:58 am 2 comments

Grepmerge

| Gabriel |

Over at the Orgtheory mothership, Fabio asked how to do a partial string match in Stata, specifically to see if certain keywords appear in scientific abstracts. This turns out to be hard, not because there are no appropriate functions in Stata (both strmatch() and regexm() can do it) but because Stata can only handle 244 characters in a string variable. Many of the kinds of data we’d want to do content analysis on are much bigger than this. For instance, scientific abstracts are about 2000 characters and news stories are about 10000 characters.

OW suggested using SPSS, and her advice is well-taken as she’s a master at ginormous content analysis projects. Andrew Perrin suggested using Perl and this is closer to my own sympathies. I agree that Perl is generally a good idea for content analysis, but in this case I think a simple grep will suffice.

grep "searchterm" filein.csv | cut -d "," -f 1 > fileout.csv

The way this works is you start with a csv file called filein.csv (or whatever) where the record id key is in the first column. You do a grep search for “searchterm” in that file and pipe the output to the “cut” command. The -d “,” option tells cut that the stream is comma-delimited and the -f 1 option tells it to only keep the first field (which is your unique record id). The “> fileout.csv” part redirects the output to disk. (Note that in Unix “>” as a file operator means replace and “>>” means append). You then have a text file called fileout.csv that’s just a list of records where your search term appears. You can merge this into Stata and treat a _merge==3 as meaning that the case includes the search term.

You can also wrap the whole thing in a Stata command that takes as arguments (in order): the term to search for, the file to look for it in, the name of the key variable in the master data, and (optionally) the name of the new variable that indicates a match. However, for some reason the Stata-wrapped version only works with literal strings and not regexp searches. Also note that all this is for Mac/Linux. You might be able to get it to work on Windows with Cygwin or PowerShell.

capture program drop grepmerge
program define grepmerge
	local searchterm	"`1'"
	local fileread	"`2'"
	local key "`3'"
	if "`4'"=="" {
		local newvar "`1'"
	}
	else {
		local newvar "`4'"
	}
	tempfile filewrite
	shell grep "`searchterm'" `fileread' | cut -d "," -f 1 > `filewrite'
	tempvar sortorder
	gen `sortorder'=_n
	tempfile masterdata
	save `masterdata'
	insheet using `filewrite', clear
	ren v1 `key'
	merge 1:1 `key' using `masterdata', gen(`newvar')
	sort `sortorder'
	recode `newvar' 1=.a 2=0 3=1
	notes `newvar' : "`searchterm'" appears in this case
	lab val `newvar'
end
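With the master data in memory, usage would look something like this (the file and variable names are hypothetical; abstracts.csv is the csv version of the data with the record id in its first column):

use abstracts.dta, clear
grepmerge nanotube abstracts.csv articleid nanotube_flag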

April 29, 2010 at 12:45 pm 8 comments

Using grep (or mdfind) to reshape data

| Gabriel |

Sometimes you have cross-class data that’s arranged the opposite of how you want. For instance, suppose I have a bunch of files organized by song, and I’m interested in finding all the song files that mention a particular radio station, say KIIS-FM. I can run the following command, which finds all the song files in my song directory (or its subdirectories) and puts the names of these files in a text file called “kiis.txt”:

grep -l -r 'KIIS' ~/Documents/book/stata/rawsongs/ > kiis.txt

Of course, to run it from within Stata I can prefix it with “shell”. By extension, I could then write a program around this shell command that will let me query station data from my song files (or vice versa). You could do something similar to see which news stories saved from Lexis-Nexis or which scraped web pages contain a certain keyword.
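A minimal sketch of such a wrapper (reusing the hypothetical directory from the example above):

capture program drop songsearch
program define songsearch
	*takes a search term as its argument and writes the matching file names to term.txt
	shell grep -l -r "`1'" ~/Documents/book/stata/rawsongs/ > `1'.txt
end
songsearch KIIS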

Unfortunately grep is pretty slow, but you can do it faster by accessing your desktop search index. It’s basically the difference between reading through a book looking for a reference versus looking the reference up in the book’s index. This is especially important if you’re searching over a lot of data — grep is fine for a few dozen files, but you want indexed search if you’re looking over thousands of files, let alone your whole file system. On a Mac, you can access your Spotlight index from shell scripts (or the Terminal) with “mdfind”. The syntax is a little different from grep, so the example above should be rewritten as:

mdfind -onlyin ~/Documents/book/stata/rawsongs/ "KIIS" > kiis.txt

While grep is slower than mdfind, it’s also more flexible. Fortunately (as described here), you can get the best of both worlds by doing a broad search with mdfind then piping the results to grep for more refined work.
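For instance, from within Stata something like this (a sketch that reuses the hypothetical paths above and assumes the file names contain no spaces) narrows the indexed hits with a regular expression:

shell mdfind -onlyin ~/Documents/book/stata/rawsongs/ "KIIS" | xargs grep -El "KIIS[- ]?FM" > kiis.txt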

April 7, 2010 at 5:13 am 1 comment

Stata shell “command not found” errors

| Gabriel |

I like to use the shell command to pipe commands from Stata to the OS and/or other programs. For instance, graphexportpdf pipes to the Ghostscript command ps2pdf. Unfortunately, I pretty often get error messages like this:
/bin/bash: ps2pdf: command not found

Sometimes just restarting Stata works, but I’ve found that the only 100% reliable way to get shell to work properly is to execute the script in the Stata console instead of Stata.app. You can do this from the Terminal as:

exec /Applications/Stata/StataMP.app/Contents/MacOS/stata-mp foo.do
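The likely culprit is that GUI-launched applications on a Mac don’t inherit the same PATH as your Terminal, so a quick check is to run the following from within Stata and again from the Terminal and compare:

shell printenv PATH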

December 14, 2009 at 4:20 am 3 comments

Get the path

| Gabriel |

When you’re scripting (whether in Stata or anything else) you need to tell the script where to look for things by giving it a directory path. As previously mentioned, I think it’s a good idea to treat the path as what Stata calls a “macro” and most other languages call a “variable.” That way you can define the path at the beginning of the script and if you later decide to change the target path you can change the one macro/variable rather than combing through the script looking for each instance.
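For example, a do-file might start with something like this (the directory and file names are hypothetical):

*define the project path once, up top
local parentpath "~/Documents/myproject"
use "`parentpath'/rawdata.dta", clear
*...analysis goes here...
save "`parentpath'/cleandata.dta", replace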

Of course, this assumes that you know what the path is, which can be hard to remember if it’s a long path. There are a few ways to get it.

From within Stata, `c(pwd)' holds the current working directory, and this info is also displayed in the interactive-mode interface (in the toolbar on a Mac, at the bottom of the main window on Windows).

TextWrangler has a “copy path” feature in “get info”.

From the Mac Terminal you can get the path in the clipboard with pwd | pbcopy

In Snow Leopard, you can also do it as a Finder service. Follow these instructions, except substitute this shell script:

sed -e 's/:/\//g' -e 's/\ /%20/g' -e 's,[^/]*$,,' | pbcopy

December 7, 2009 at 5:26 am 7 comments

Perl text library

| Gabriel |

I found this very useful library of perl scripts for text cleaning. You can use them even if you can’t code perl yourself. For instance, to transpose a dataset, just download the “transpose.pl” script to your ~/scripts directory and enter the shell command:
perl ~/scripts/transpose.pl row_col.txt > col_row.txt

The transpose script is particularly useful to me as I’ve never gotten Excel’s transpose function to work, and for some bizarre reason Stata’s “xpose” command only works with numeric variables. You can even use these scripts directly from a do-file, like so:

tempfile foo1
tempfile foo2
outsheet using `foo1'.txt
shell perl ~/scripts/transpose.pl `foo1'.txt > `foo2'.txt
insheet using `foo2'.txt, clear

November 30, 2009 at 4:49 am 1 comment

Copy mac files when booting from dvd

| Gabriel |

One of the frustrating things about the Mac is that there’s no such thing as a live cd (and live cds for Windows and Linux can’t read HFS disks). Of course you can boot from the installer dvd, but it doesn’t have the Finder. If you have problems booting from your internal disk and you don’t have a reasonably current backup this can induce alternating waves of panic and despair. (I’m speaking from experience. I’ve screwed up my partition table by playing with gparted. Actually, I’ve done this twice — as a dog returns to his vomit so a fool returns to his folly).

However, you can still copy files because the installer dvd does have the Terminal, and the Terminal can invoke the command “cp”. Here’s how to do it.

  1. Put the dvd in and restart, tapping option so it lets you choose the dvd.
  2. Choose a language, then instead of installing the OS, go to the Utilities menu and choose Terminal.
  3. Plug in a USB drive and type “ls /Volumes”. Figure out which one is your USB drive and which one is your internal drive, and write them down. If it doesn’t recognize the USB drive you’ll need to mount it.
  4. Use “cd” to navigate to your internal disk and find your most important files, which are probably in “/Volumes/Macintosh HD/Users/yournamehere/Documents”.
  5. Use the “cp source target” command to copy files from the internal disk to the USB disk. To copy a directory use the -R option. For example, to copy the directory “bookmanuscript” you’d use something like:
     cp -R '/Volumes/Macintosh HD/Users/yournamehere/Documents/bookmanuscript' /Volumes/USBdisk

October 27, 2009 at 5:07 am
