Posts Tagged Stata
Stata shell “command not found” errors
| Gabriel |
I like to use the shell command to pipe commands from Stata to the OS and/or other programs. For instance, graphexportpdf pipes to the Ghostscript command ps2pdf. Unfortunately I pretty often get error messages like this
/bin/bash: ps2pdf: command not found
Sometimes just restarting Stata works, but I’ve found that the only 100% reliable way to get shell to work properly is to execute the script in Stata console instead of Stata.app. You can do this from the Terminal as
exec /Applications/Stata/StataMP.app/Contents/MacOS/stata-mp foo.do
3 comments December 14, 2009
Stata2Pajek w vertice colors
| Gabriel |
My Stata program stata2pajek let’s you export an edge list or arc list from Stata to Pajek. However I recently wanted to make color-coded vertices and this required tweaking the syntax a bit. Because this is a relational database problem (and Stata likes flat-file), I do it by merging in a vertice level file on disk. Although I’ve written it to merge on colors it should be fairly easy to rewrite to deal with other vertice-level variables. Unfortunately the syntax is pretty awkward, which is why I’m forking it and posting it to the blog instead of updating the main stata2pajek file at SSC.
Note that the script assumes that nodes which appear in the arc list but not the vertice attributes file should be color-coded yellow. If you want yellow to have a substantive meaning you should change line 61 to be something else.
The syntax is the same as stata2pajek except there are a few more options. If you omit these options it behaves just like stata2pajek classic:
- attributefile() — the file where the vertice level variables are stored
- attributekey() — the merge key, defaults to “ego”
- color() — the variable storing the color-codes
For instance, to color-code a graph of radio stations based on when they first played “My Humps” (yes, really), I use this command:
stata2pajekalt ego alter, filename(ties_bounded_humpcolor) attributefile(humps_color) attributekey(ego) color(color)
It says to treat the data in memory as an arc list, to merge on station-level data on adoption time where the station name is stored as the variable “ego” in the file “humps_color”, and to write out the whole mess as a Pajek formatted text file called “ties_bounded_humpscolor.net”.
*1.2b GHR Dec 10, 2009
*forked from stata2pajek 1.2
capture program drop stata2pajekalt
program define stata2pajekalt
version 10
set more off
syntax varlist(min=2 max=2) [ , tiestrength(string asis) filename(string asis) EDGEs attributefile(string asis) attributekey(string asis) color(string asis)]
tempfile dataset
quietly save `dataset'
gettoken ego alter : varlist
local starchar="*"
if "`filename'"=="" {
local pajeknetfile "mypajekfile"
}
else {
local pajeknetfile "`filename'"
}
if "`attributekey'"=="" {
local attributekey "`ego'"
}
capture file close pajeknetfile
file open pajeknetfile using `pajeknetfile'.net, write text replace
use `dataset', clear
drop `ego'
ren `alter' `ego'
append using `dataset'
keep `ego'
contract `ego'
ren `ego' verticelabel
drop _freq
sort verticelabel
gen number=[_n]
order number verticelabel
if "`attributefile'"!="" {
ren verticelabel `attributekey'
merge `attributekey' using `attributefile'
drop if _merge==2
ren `attributekey' verticelabel
drop _merge
keep number verticelabel `color'
sort verticelabel
}
tempfile verticelabels
quietly save `verticelabels', replace
local nvertices=[_N]
file write pajeknetfile "`starchar'Vertices `nvertices'" _n
if "`attributefile'"=="" {
forvalues x=1/`nvertices' {
local c2=verticelabel in `x'
file write pajeknetfile `"`x' "`c2'""' _n
}
}
else {
forvalues x=1/`nvertices' {
local c2=verticelabel in `x'
local colvalue=`color' in `x'
if "`colvalue'"=="" {
local colvalue "Yellow"
}
file write pajeknetfile `"`x' "`c2'" ic `colvalue'"' _n
}
}
use `dataset', clear
ren `ego' verticelabel
sort verticelabel
quietly merge verticelabel using `verticelabels'
quietly keep if _merge==3
drop _merge verticelabel
ren number `ego'
ren `alter' verticelabel
sort verticelabel
quietly merge verticelabel using `verticelabels'
quietly keep if _merge==3
drop _merge verticelabel
ren number `alter'
order `ego' `alter' `tiestrength'
keep `ego' `alter' `tiestrength'
local narcs=[_N]
if "`edges'"=="edges" {
file write pajeknetfile `"`starchar'Edges"' _n
}
else {
file write pajeknetfile `"`starchar'Arcs"' _n
}
sort `ego' `alter'
if "`tiestrength'"~="" {
forvalues x=1/`narcs' {
local c1=`ego' in `x'
local c2=`alter' in `x'
local c3=`tiestrength' in `x'
file write pajeknetfile "`c1' `c2' `c3'" _n
}
}
else {
forvalues x=1/`narcs' {
local c1=`ego' in `x'
local c2=`alter' in `x'
file write pajeknetfile "`c1' `c2'" _n
}
}
file close pajeknetfile
*ensure that it's windows (CRLF) text format
if "$S_OS"~="Windows" {
filefilter `pajeknetfile'.net tmp, from(\M) to(\W) replace
shell mv tmp `pajeknetfile'.net
filefilter `pajeknetfile'.net tmp, from(\U) to(\W) replace
shell mv tmp `pajeknetfile'.net
}
use `dataset', clear
disp "Your output is saved as"
disp "`c(pwd)'`c(dirsep)'`pajeknetfile'.net"
end
2 comments December 10, 2009
the missing e(se) return matrix
| Gabriel |
One of the weird things with Stata’s postestimation returns is that it gives you the betas no problem, but there’s no obvious return for the standard error. The closest thing it gives you is e(V), the variance-covariance matrix, which you have to process to get standard error.
sysuse auto, clear
reg weight foreign length
* save the return matrices
matrix betas=e(b)
matrix varcovar=e(V)
*take the root of the diagonal of var-covar
mata: st_matrix("se",sqrt(diagonal(st_matrix("varcovar")))')
*combine and label the beta & se vectors into a single matrix
matrix results=betas\se
matrix rowname results = beta se
matrix list results
(Note: I’ve just discovered the “sourcecode” WordPress format, which should avoid some previous problems I’ve had with putting code directly in posts].
2 comments December 8, 2009
Drop the entirely missing variables
| Gabriel |
One of the purely technical frustrations with GSS is that it’s hard to figure out if a particular question was in a particular wave. The SDA server is pretty good about kicking these from extracts, but it still leaves some in. The other day I was playing with the 2008 wave and got sick of saying “oooh, that looks interesting” only to find the variable was missing for my wave, which happened repeatedly. (For instance, I thought it would be fun to run Biblical literalism against the Israel feeling thermometer, but no dice, at least for the 2008 wave).
To get rid of these phantom variables, I wrote this little loop that drops variables with entirely missing data (or that are coded as strings, see below):
foreach varname of varlist * {
quietly sum `varname'
if `r(N)'==0 {
drop `varname'
disp "dropped `varname' for too much missing data"
}
}
Unfortunately “sum” thinks that string variables have no observations so this will also drop strings. There’s a workaround, but it involves the “ds” command, which works in Stata 11 but has been deprecated and so may not work in future versions.
ds, not(type string)
foreach varname of varlist `r(varlist)' {
quietly sum `varname'
if `r(N)'==0 {
drop `varname'
disp "dropped `varname' for too much missing data"
}
}
7 comments December 2, 2009
Perl text library
| Gabriel |
I found this very useful library of perl scripts for text cleaning. You can use them even if you can’t code perl yourself, for instance to transpose a dataset just download “transpose.pl” script to your ~/scripts directory and enter the shell command:
perl ~/scripts/transpose.pl row_col.txt > col_row.txt
The transpose script is particularly useful to me as I’ve never gotten Excel’s transpose function to work and for some bizarre reason Stata’s “xpose” command only works with numeric variables. You can even use these scripts from directly in a do-file like so:
tempfile foo1
tempfile foo2
outsheet using `foo1'.txt
shell perl ~/scripts/transpose.pl `foo1'.txt > `foo2'.txt
insheet using `foo2'.txt, clear
1 comment November 30, 2009
Programming notes
| Gabriel |
I gave a lecture to my grad stats class today on Stata programming. Regular readers of the blog will have heard most of this before but I thought it would help to have it consolidated and let more recent readers catch up.
Here’s my lecture notes.
2 comments November 12, 2009
Team Sorting
| Gabriel |
Tyler Cowen links to an NBER paper by Hoxby that shows that in recent decades, status sorting has gotten more intense for college. Cowen asks “is this a more general prediction in a superstars model?” The archetypal superstar system is Hollywood, and here’s my quick and dirty stab at answering Tyler’s question for that field. Faulkner and Anderson’s 1987 AJS showed that there is a lot of quality sorting in Hollywood, but they didn’t give a time trend. As shown in my forthcoming ASR with Esparza and Bonacich, there are big team spillovers so this is something we ought to care about.
I’m reusing the dataset from our paper, which is a subset of IMDB for Oscar eligible films (basically, theatrically-released non-porn) from 1936-2005. If I were doing it for publication I’d do it better (i.e., I’d allow the data to have more structure and I’d build confidence intervals from randomness), but for exploratory purposes the simplest way to measure sorting is to see if a given film had at least one prior Oscar nominee writer, director, and actor. From that I can calculate the odds-ratio of having an elite peer in the other occupation.
Overall, a movie that has at least one prior nominee writer is 7.3 times more likely than other films to have a prior nominee director and 4.4 times more likely to have a prior nominee cast. A cast with a prior nominee is 6.5 times more likely to have a prior nominee director. Of course we already knew there was a lot of sorting from Faulker and Anderson, the question suggested by Hoxby/Cowen is what are the effects over time?
This little table shows odds-ratios for cast-director, writer-director, and writer-cast. Big numbers mean more intense sorting.
...+--------------------------------------+
...| decade cd wd wc |
...|--------------------------------------|
1. | 1936-1945 6.545898 6.452388 4.306554 |
2. | 1946-1955 9.407476 6.425553 5.368151 |
3. | 1956-1965 12.09229 8.741302 6.720059 |
4. | 1966-1975 4.697238 5.399081 4.781106 |
5. | 1976-1985 4.113508 6.984528 4.450109 |
6. | 1986-1995 4.923809 7.599852 3.301461 |
7. | 1996-2005 4.826018 12.35915 3.641975 |
+-----------------------------------------+
The trend is a little complicated. For collaborations between Oscar-nominated casts on the one-hand and either writers or directors, the sorting is most intense in the 1946-1955 decade and especially the 1956-1965 decade. My guess is that this is tied to the decline of the studio system and/or the peak power of MCA. The odds-ratio of good director for nom vs non-nom writers also has a jump around the end of the studio system, but it seems there’s a second jump starting in the 80s. My guess is that this is an artifact of the increasing number of writer-directors (see Baker and Faulkner AJS 1991), but it’s an empirical question.
Putting aside the writer-director thing, it seems that sorting is not growing stronger in Hollywood. My guess is that ever more intense sorting is not a logical necessity of superstar markets, but has to do with contingencies, such as the rise of a national market for elite education in Hoxby’s case or the machinations of Lew Wasserman and Olivia deHavilland in my case.
The Stata code is below. (sorry that wordpress won’t preserve the whitespace). The data consists of film-level data with dummies for having at least one prior nominee for the three occupations.
global parentpath "/Users/rossman/Documents/oscars"
capture program drop makedecade
program define makedecade
gen decade=year
recode decade 1900/1935=. 1936/1945=1 1946/1955=2 1956/1965=3 1966/1975=4 1976/1985=5 1986/1995=6 1996/2005=7
capture lab drop decade
lab def decade 1 "1936-1945" 2 "1946-1955" 3 "1956-1965" 4 "1966-1975" 5 "1976-1985" 6 "1986-1995" 7 "1996-2005"
lab val decade decade
end
cd $parentpath
capture log close
log using $parentpath/sorting_analysis.log, replace
use sorting, clear
makedecade
*do odds-ratio of working w oscar nom, by own status
capture program drop allstar
program define allstar
preserve
if "`1'"!="" {
keep if decade==`1'
}
tabulate cast director, matcell(CD)
local pooled_cd=(CD[2,2]*CD[1,1])/(CD[1,2]*CD[2,1])
tabulate writers director, matcell(WD)
local pooled_wd=(WD[2,2]*WD[1,1])/(WD[1,2]*WD[2,1])
tabulate writers cast, matcell(WC)
local pooled_wc=(WC[2,2]*WC[1,1])/(WC[1,2]*WC[2,1])
shell echo "`pooled_cd' `pooled_wd' `pooled_wc' `1'" >> sortingresults.txt
restore
end
shell echo "cd wd wc decade" > sortingresults.txt
quietly allstar
forvalues t=1/7 {
quietly allstar `t'
}
insheet using sortingresults.txt, delimiter(" ") names clear
lab val decade decade
*have a nice day
4 comments November 8, 2009
Stata 11 FV and margin
| Gabriel |
Yesterday I attended the ATS workshop on the new factor variables and margin syntax in Stata 11. Despite the usual statistical usage of the word “factor,” this has nothing to do eigenvectors and multi-dimensional scaling but is really about dummy sets and interactions. I might still be missing something, but it seems like the factor variables syntax is only an incremental improvement over the old “xi” syntax, mostly because it’s more elegant.
However the margin command is really impressive and should go a long way to making nonlinear models (including logit) more intelligible. I think a big reason people have p-fetishism is because with a lot of models it’s difficult to understand effects size. For this reason I like to close my results section with predicted values for various vignettes. I had been doing this in Excel or Numbers but “margin” will make this much easier, especially if I continue to experiment with specifications. (In general, I find that if you’re doing something once, GUI is faster than scripting, but we never just do something once so scripting is better in the long run). Anyway, it’s a very promising command.
My only reservation about both “factor variables” and “margin” is the value labeling. First, (like “xi”) neither command carries through value labels so you have to remember what occupation 3 is instead of it saying “sales.” Second, the numbers aren’t even consistent between factor and margin. Factor shows the value of the underlying variable whereas margin numbers the categories sequentially. So for instance, your basic dummy would be “0″for no and “1″ for yes in factor variables because that’s how it’s stored in memory and “1″ for no and “2″ for yes in margin because “no” is the first category. What is this, SPSS? Anyway, margin is a very useful command, but it would be even more useful if the command itself or some kind of postestimation or wrapper ado file made the output more intuitive. Not that I’m volunteering to write it. Help us Ben Jann, you’re our only hope!
7 comments October 29, 2009
Shufflevar
| Gabriel |
Sometimes you face a situation where it’s really hard to see what the null is because the data structure is really complicated and there is all sorts of nonlinearity, etc. Analyses of non-sparse square network matrices can use the quadratic assignment procedure, but you can do something similar with other data structures, including bipartite networks.
A good null keeps everything constant, but shows what associations we would expect were association random. The simplest way to do this is to keep the actual variable vectors but randomly sort one of the vectors. So for instance, you could keep the actual income distribution and the actual values of peoples’ education, race, etc, but randomly assign actual incomes to people.
Fernandez, Castilla, and Moore used what was basically this approach to build a null distribution of the effects of employment referrals. Since then Ezra Zuckerman has used it in several papers on Hollywood to measure the strength of repeat collaboration. I myself am using it in some of my current radio work to understand how much corporate clustering we’d expect to see in the diffusion of pop songs under the null hypothesis that radio corporations don’t actually practice central coordination.
I wrote a little program that takes the argument of the variable you want shuffled. It has a similar application as bsample, and like bsample it’s best used as part of a loop.
capture program drop shufflevar program define shufflevar local shufflevar `1' tempvar oldsortorder gen `oldsortorder'=[_n] tempvar newsortorder gen `newsortorder'=uniform() sort `newsortorder' capture drop `shufflevar'_shuffled gen `shufflevar'_shuffled=`shufflevar'[_n-1] replace `shufflevar'_shuffled=`shufflevar'[_N] in 1/1 sort `oldsortorder' drop `newsortorder' `oldsortorder' end
Here’s an example to show how much clustering of “y” you’d expect to see by “clusterid” if we keep the observed distributions of “y” and “clusterid” but break any association between them:
shell echo "run rho" > _results_shuffled.txt
forvalues run=1/1000 {
disp "iteration # `run' of 1000"
quietly shufflevar clusterid
quietly xtreg y, re i(clusterid_shuffled)
shell echo "`run' `e(rho)'" >> _results_shuffled.txt
}
insheet using _results_shuffled.txt, names clear delimiter(" ")
histogram rho
sum rho
(Note that “shell echo” only works with Mac/Unix, Windows users should try postfile).
2 comments October 26, 2009
Probability distributions
| Gabriel |
I wrote this little demo for my stats class to show how normal distributions result from complex processes that sum the constituent parts whereas count distributions result from complex processes where a single constituent failure is catastrophic.
*this do-file is a simple demo of how you statistical distributions are built up from additive vs sudden-death causation
*this is entirely based on a simulated coin toss -- the function "round(uniform())"
*one either counts how many heads out of 10 tosses or how long a streak of heads lasts
*I'm building up from this simple function for pedagogical purposes, in actual programming there are much more direct functions like rnormal()
*1. The normal distribution
*Failure is an additive setback
clear
set obs 1000
forvalues var=1/10 {
quietly gen x`var'=.
}
forvalues row=1/1000 {
forvalues var=1/10 {
quietly replace x`var'=round(uniform()) in `row'
}
}
gen sumheads=x1+x2+x3+x4+x5+x6+x7+x8+x9+x10
order sumheads
lab var sumheads "How Many Heads Out of 10 Flips"
*show five examples
list in 1/5
histogram sumheads, discrete normal
graph export sumheads.png, replace
*2. Count distribution
*Failure is catastrophic
clear
set obs 1000
forvalues var=1/30 {
quietly gen x`var'=.
}
gen streak=0
lab var streak "consecutive heads before first tails"
gen fail=0
forvalues row=1/1000 {
forvalues var=1/30 {
quietly replace x`var'=round(uniform()) in `row'
quietly replace fail=1 if x`var'==0
quietly replace streak=`var' if fail==0
}
quietly replace fail=. in `row'/`row'
}
quietly replace streak=0 if x1==0
*show five partial examples
list streak x1 x2 x3 x4 x5 in 1/5
histogram streak, discrete
graph export streakheads.png, replace
*have a nice day
Add comment October 12, 2009