Posts tagged ‘Stata’

Christmas in July

| Gabriel |

Has it been two years already? Holy moly, Stata 12 looks awesome.

The headline feature is structural equation modeling. It comes with a graphic model builder, which even an “only scripting is replicable” zealot like me can appreciate as it helps you learn complicated command syntax. (I feel the same way about graphs). I had actually been thinking of working SEM into my next paper and was thinking through the logistics of getting a copy of M+, learning the syntax (again), etc. Now I can do it within Stata. I look forward to reading more papers that use SEM without really understanding the assumptions.

Probably the most satisfying new feature to me though is contour plots. Ever since I got interested in writing simulations a few years ago, I have been wanting to make heat maps in Stata. I’ve spent many hours writing code that can pipe to gnuplot and, not being satisfied with that, I (with some help from Lisa) have spent yet more time working on another script that can pipe to the wireframe function in R’s lattice library. Now I’m very happy to say that I will not finish writing this ado-file and submitting it to SSC as Stata 12 contains what looks to be really good native heat plots.

I’m thinking the set of commands I will feel most guilty about not using more often, is margin plots, which extends the margin command from Stata 11. In addition to the headline new features there’s a bunch of little stuff, including fixes to get more compatibility between the estimation commands and the ancillary commands (e.g., better “predict” support for count models and “svy” support for “xtmixed”). Also, Windows users should be pleased to hear that they can now do PDFs natively.

[Update: Also see Jeremy’s post on Stata 12. He closes with a pretty funny metaphor of stats packages to cell phone brands.]

June 27, 2011 at 12:46 pm 7 comments

Which of my cites is missing?

| Gabriel |

I was working on my book (in Lyx) and it drove me crazy that at the top of the bibliography was a missing citation. Finding the referent to this missing citation manually was easier said than done and ultimately I gave up and had the computer do it. These suggestions are provided rather inelegantly as a “log” spread across two languages. However you could pretty easily work them into an argument-passing script written in just one language. Likewise, it should be easy to modify them for use with plain vanilla LaTeX if need be.

First, I pulled all the citations from the book manuscript and all the keys from my Bibtex files.

grep '^key ' book.lyx | sort | uniq -u | perl -pe 's/^key "([^"]+)"/$1/' > cites.txt
grep '^\@' ~/Documents/latexfiles/ghrcites_manual.bib | perl -pe 's/\@.+{(.+),/$1/' > bibclean.txt
grep '^\@' ~/Documents/latexfiles/ghrcites_zotero.bib | perl -pe 's/\@.+{(.+),/$1/' >> bibclean.txt

Then in Stata I merged the two files and looked for Bibtex keys that appear in the manuscript but not the Bibtex files. [Update, see the comments for a better way to do this.] From that point it was easy to add the citations to the Bibtex files (or correct the spelling of the keys in the manuscript).

insheet using bibclean.txt, clear
tempfile x
save `x'
insheet using cites.txt, clear
merge 1:1 v1 using `x'
list if _merge==1

June 1, 2011 at 4:37 am 2 comments

Stata for Mac PDF and “set graphics on”

| Gabriel |

I recently noted that graph exporting to PDF in Stata for Mac is fixed. Turns out that this is only partially true. It works and creates beautiful output, but unlike the other “graph export” options it only works if you have “set graphics on” in the Stata GUI. If you’re running it as Stata console or have graphics set off in Stata GUI, it simply doesn’t work. (I do this when batching a lot of graphs as it is faster and less distracting).

My understanding is that this has something to do with how Stata relies on Mac’s Quartz driver to render PDF so it’s not really feasible to fix. So basically you have three options:

1) Do it in the GUI with “set graphics on” and accept the CPU performance hit and distraction of all the graphs rendering.

2) Use my graphexportpdf ado file or the “graph print” command with CUPS-PDF as the print driver.

3) Stick to using EPS

May 16, 2011 at 4:14 am 4 comments

Simulations, numlist, and order of operations

| Gabriel |

I’ve been programming another simulation and as is typical am batching it through various combinations of parameter values, recording the results each time. In making such heavy (and recursive) use of the forvalues loop I noticed some issues with numlist and orders of operation in algorithms.

First, Stata’s numlist expression (as in the “forvalues” syntax) introduces weird rounding errors, especially if specified as fractions. Thus it is preferable to count by integers then scale down to the fractional value within the loop. This is also useful if you want to save each run of the simulation as a file as it lets you avoid fractional filenames.

So instead of this:

forvalues i=0(.01)1 {
	replace x=sin(`i')
	save sin`i'.dta, replace
}

Do this:

forvalues i=0/100 {
	local i_scaled=`i'/100
	replace x=sin(`i_scaled')
	save sin`i'.dta, replace
}

Another issue with numlist is that it can introduce infintessimal errors so that evaluating “1==1” comes back false. If you have a situation like this you need to make the comparison operator fuzzy. So instead of just writing the expression “if y==x” you would use the expression

if y>x-.0001 & y<x+.0001

Finally, I’ve noticed that when you are running nested loops the number of operations grows exponentially and so it makes a big difference in what order you do things. In particular, you want to arrange operations so they are repeated the least numbers of times. For instance, suppose you have batched a simulation over three parameters (x, y, and z) and saved each combination in its own dataset with the convention “results_x_y_z” and you wish to append the results in such a way that the parameter values are variables in the new appended dataset. The simple (but slow) way to run the append is like this:

clear
gen x=.
gen y=.
gen z=.
forvalues x=1/100 {
	forvalues y=1/100 {
		forvalues z=1/100 {
			append using results_`x'_`y'_`z'
			recode x .=`x'
			recode y .=`y'
			recode z .=`z'
		}
	}
}

Unfortunately this is really slow. The following code has the same number of lines but it involves about half as many operations for the computer to do. In the first version there are four commands that are each run 100^3 times. The second version has two commands that run 100^3 times, one command that runs 100^2 times, and one command that runs 100 times.

clear
gen x=.
gen y=.
gen z=.
forvalues x=1/100 {
	forvalues y=1/100 {
		forvalues z=1/100 {
			append using results_`x'_`y'_`z'
			recode z .=`z'
		}
		recode y .=`y'
	}
	recode x .=`x'
}

April 26, 2011 at 4:46 am 2 comments

Stata for Mac PDF fixed in Stata 11.2

| Gabriel |

About a year ago, I got frustrated with Stata’s “graph export foo.pdf” command, which at the time gave hideous output. Apparently the problem was that Stata used the same code to write to disk as to write to screen. As a work-around, I wrote graphexportpdf.ado, which is basically a wrapper to pipe Stata-generated eps files through Ghostscript.

I am happy to report that the revision notes for Stata 11.2 include this line, from the section about Stata for Mac:

33.  Graphs exported as PDF files are now exported with increased resolution.

That is to say, they fixed it. I tested it and it creates beautiful output and does so very quickly. Thanks StataCorp!

I highly recommend that Mac users of Stata 11.2 and higher use the native PDF capabilities through the standard “graph export foo.pdf” syntax. Graphexportpdf.ado may still be useful for Mac users of versions 10 and earlier and to Linux users (who don’t have Quartz but usually have Ghostscript as part of a LaTeX distro).

Finally, remember that “graph export foo.pdf” is a Mac only option so if you want your code to be portable you should treat it like this:

if "`c(os)'"=="MacOSX" { 
  graph export mygraph.pdf, replace
} 
else {
  graph export mygraph.eps, replace
}

April 11, 2011 at 4:29 pm 3 comments

Escaped quotes and syntax highlighting

| Gabriel |

Quotes usually delimit where a string with embedded spaces begin, but sometimes you want the quote to be literal and this requires escaping it. To recycle an example I’ve used before, suppose you wanted to display:

Beavis said "Fire! Fire!"

To get Stata to display this, you would escape the quotes by encompassing them in left and right apostrophes (just like calling a local) so the command would be:

disp `"Beavis said "Fire! Fire!""'

This is a trivial example, but a more realistic application is you might want to put some things that involve quotes inside a local and since the content of the local is itself delimited by quotes you’ll need to escape them.

OK, easy enough, but the problem is that most external text editors don’t appreciate this nuance of Stata syntax and end up showing the rest of the document as quoted text, effectively making the syntax highlighting useless. (Stata’s internal editor doesn’t suffer this problem, but I’m in the habit of using TextMate since prior to Stata 11 the editor didn’t have highlighting and it still doesn’t do code-folding). The solution is to let two syntax-parsing wrongs make a right by putting a single quote in a comment, which Stata will ignore but which the text editor will parse as closing a previous hanging quote. It works like this:

disp `"Beavis said "Fire! Fire!""'
* " this line exists only to let the text editor's parser know that everything is back to normal
disp "see, it works. this quoted text should show up as quoted whereas the word 'disp' appears as a command"

February 7, 2011 at 5:05 am 5 comments

Shufflevar update

| Gabriel |

Thanks to Elizabeth Blankenspoor (Michigan) I corrected a bug with the “cluster” option in shufflevar. I’ve submitted the update to SSC but it’s also here:

*1.1 GHR January 24, 2011

*changelog
*1.1 -- fixed bug that let one case per "cluster" be misallocated (thanks to Elizabeth Blankenspoor)

capture program drop shufflevar
program define shufflevar
	version 10
	syntax varlist(min=1) [ , Joint DROPold cluster(varname)]
	tempvar oldsortorder
	gen `oldsortorder'=[_n]
	if "`cluster'"!="" {
		local bystatement "by `cluster': "
	}
	else {
		local bystatement ""
	}
	if "`joint'"=="joint" {
		tempvar newsortorder
		gen `newsortorder'=uniform()
		sort `cluster' `newsortorder'
		foreach var in `varlist' {
			capture drop `var'_shuffled
			quietly {
				`bystatement' gen `var'_shuffled=`var'[_n-1]
				`bystatement' replace `var'_shuffled=`var'[_N] if _n==1
			}
			if "`dropold'"=="dropold" {
				drop `var'
			}
		}
		sort `oldsortorder'
		drop `newsortorder' `oldsortorder'
	}
	else {
		foreach var in `varlist' {
			tempvar newsortorder
			gen `newsortorder'=uniform()
			sort `cluster' `newsortorder'
			capture drop `var'_shuffled
			quietly {
				`bystatement' gen `var'_shuffled=`var'[_n-1]
				`bystatement' replace `var'_shuffled=`var'[_N] if _n==1
			}
			drop `newsortorder'
			if "`dropold'"=="dropold" {
				drop `var'
			}
		}
		sort `oldsortorder'
		drop `oldsortorder'
	}
end

January 24, 2011 at 7:30 pm 6 comments

Growl in R and Stata

| Gabriel |

Growl is a system notification tool for Mac that lets applications, the system itself, or hardware display brief notification, usually in the top-right corner. The translucent floating look reminds me of the more recent versions of KDE.

Anyway, one of the things it’s good for is letting programs run in the background and let you know when something noteworthy has happened. Of course, a large statistics batch would qualify. I was running a 10 minute job in the R package igraph and got tired of checking to see when it was done so I found this tip. In a nutshell, it says to download Growl, including the command-line tool GrowlNotify from the “Extras” folder, then create this R function.

growl <- function(m = 'Hello world')
system(paste('growlnotify -a R -m \'',m,'\' -t \'R is calling\'; echo \'\a\' ', sep=''))

Since the R function “system()” is equivalent to the Stata command “shell,” I realized this would work in Stata as well and so I wrote this ado file.* You call it just by typing “growl”. It takes as an (optional) argument whatever you’d like to see displayed in Growl, such as “Done with first analysis” or “All finished” but by default displays “Stata needs attention.” Note that while Stata already bounces in the dock when it completes a script or hits an error, you can also have Growl appear at various points during the run of a script.

I’ve only tested it with StataMP, but I’d appreciate it if people who use Growl and other versions of Stata would post their results in the comments. If it proves robust I’ll submit it to SSC.

Here’s the Stata code. The most important line is #32 and if you were doing it by hand you’d do it as a one-liner but the rest of this stuff allows argument passing and compatibility with the different versions (small, IC, SE, MP) of Stata:

*1.0 GHR Jan 19, 2011
capture program drop growl
program define growl
	version 10
	set more off
	syntax [anything]

	if "`anything'"=="" {
		local message "Stata needs attention"		
	}
	else {
		local message "`anything'"
	}
	
	local appversion "Stata"
	if "`c(flavor)'"=="Small" {
		local appversion "smStata"
	}
	else {
		if `c(SE)'==1 {
			if `c(MP)'==1 {
				local appversion "StataMP"
			}
			else {
				local appversion "StataSE"
			}
		}
		else {
			local appversion "Stata"
		}
	}
	shell growlnotify -a `appversion' -m \ "`message'" \ `appversion' \
end

*Most programming/scripting languages and a few GUI applications have a similar system call and you can likewise get Growl to work with them. For instance, Perl also has a function called system(). In shell scripting of course you can just use the “growlnotify” command directly.

January 20, 2011 at 5:10 am 10 comments

Stata tv

| Gabriel |

Stata Daily links to a bunch of instructional YouTube videos on Stata basics provided by U Minnesota.

Similarly, UCLA ATS has a Stata starter kit, which includes videos.

Stata Tidbit of the Week also has lots of screencasts. His videos tend to be on slightly more advanced issues than the other two sites so people who are already used to Stata should go here but absolute beginners should start with the Minnesota or UCLA material.

December 14, 2010 at 4:37 am

Conditioning on a Collider Between a Dummy and a Continuous Variable

| Gabriel |

In a post last year, I described a logical fallacy of sample truncation that helpful commenters explained to me is known in the literature as conditioning on a collider. As is common, I illustrated the issue with two continuous variables, where censorship is a function of the sum. (Specifically, I used the example of physical attractiveness and acting ability for a latent population of aspiring actresses and an observed population of working actresses to explain the paradox that Megan Fox was considered both “sexiest” and “worst” actress in a reader poll).

In revising my notes for grad stats this year, I generalized the problem to cases where at least one of the variables is categorical. For instance, college admissions is a censorship process (only especially attractive applicants become matriculants) and attractiveness to admissions officers is a function of both categorical (legacy, athlete, artist or musician, underrepresented ethnic group, in-state for public schools or out-of-state for private schools, etc) and continuous distinctions (mostly SAT and grades).

For simplicity, we can restrict the issue just to SAT and legacy. (See various empirical studies and counterfactual extrapolations by Espenshade and his collaborators for how it works with the various other things that determine admissions.) Among college applicant pools, the children of alumni to prestigious schools tend to score about a hundred points higher on the SAT than do other high school students. Thus the applicant pool looks something like this.

However, many prestigious colleges have policies of preferring legacy applicants. In practice this mean that the child of an alum can still be admitted with an SAT score about 150 points below non-legacy students. Thus admission is a function of both SAT (a continuous variable) and legacy (a dummy variable). This implies the paradox that the SAT scores of legacies are about half a sigma above average for the applicant pool but about a full sigma below average in the freshman class, as seen in this graph.

Here’s the code.

clear
set obs 1000
gen legacy=0
replace legacy=1 in 1/500
lab def legacy 0 "Non-legacy" 1 "Legacy"
lab val legacy legacy
gen sat=0
replace sat=round(rnormal(1100,250)) if legacy==1
replace sat=round(rnormal(1000,250)) if legacy==0
lab var sat "SAT score"
recode sat -1000/0=0 1600/20000=1600 /*top code and bottom code*/
graph box sat, over(legacy) ylabel(0(200)1600) title(Applicants)
graph export collider_collegeapplicants.png, replace
graph export collider_collegeapplicants.eps, replace
ttest sat, by (legacy)
keep if (sat>1400 & legacy==0) | (sat>1250 & legacy==1)
graph box sat, over(legacy) ylabel(0(200)1600) title(Admits)
graph export collider_collegeadmits.png, replace
graph export collider_collegeadmits.eps, replace
ttest sat, by (legacy)
*have a nice day

November 30, 2010 at 4:40 am 5 comments

Older Posts Newer Posts


The Culture Geeks