Posts tagged ‘graphs’
Caveat Smoother
| Gabriel |
All metrics and models are assumption-laden, but some are more assumption-laden than others. Among the worst offenders are smoothers, which as the name implies, assume that the underlying reality is smooth. If the underlying reality has discontinuity then the smoother will obfuscate this in the course of trying to smooth out “noise.” This can actually have big theoretical implications. Most notably, there was always a lot of zig-zag in the fossil record but traditionally people assumed it was just noise and so they smoothed it out. Then Gould came up with the theory of punctuated equilibrium and said that evolution substantively works through bursts, which at a data level is equivalent to saying that the zig zags are signal, not noise.
Here’s an illustration. Let’s simulate a dataset that, by assumption, follows a step function. To keep it simple we’ll have no noise at all, just the underlying step function. Now, let’s apply a LOWESS smooth on the time-series. As you can see, the smoothed trend is basically an s-curve even though we know by assumption that the underlying structure of causation is a step.
set obs 50 gen t=[_n] gen x=0 in 1/30 replace x=1 in 31/50 twoway (lowess x t) (scatter x t, msymbol(circle_hollow)), legend(off)
Moral of the story, think carefully about whether the smoother is theoretically appropriate. If there are substantive reasons to expect discontinuities then it probably ain’t. For a similar reason you may want to not just assume a linear effect or even a polynomial specification in regression but compare various transformations (e.g., linear splines vs quadratics) and see what fits best.
Recursively building graph commands
| Gabriel |
I really like multi-line graphs and scatterplots where the marker color/style reflects categories. Such graphs are both more compact than just having multiple graphs and they make it easier to compare different things. The way you do this is with “twoway,” a lot of parentheses, and the “if” condition. For example:
twoway (kdensity x if year==1985) (kdensity x if year==1990) (kdensity x if year==1995) (kdensity x if year==2000) (kdensity x if year==2005), legend(order(1 "1985" 2 "1990" 3 "1995" 4 "2000" 5 "2005"))
Unfortunately such graphs can be difficult to script. This is especially so for the legends, which by default show the variable name rather than the selection criteria. I handle this by recursively looping over a local, which in the case of the legend involves embedded quote marks.
capture program drop multigraph program define multigraph local var `1' local interval `2' local command "" local legend "" local legendtick=1 forvalues i=1985(`interval')2005 { local command "`command' (kdensity `var' if year==`i')" local legend `legend' `legendtick' `" `i' "' *" local legendtick=`legendtick'+1 } disp "twoway ""`command'" ", legend:" "`legend'" twoway `command' , legend(order(`legend')) end
To replicate the hard-coded command above, you’d call it like this:
multigraph x 5
So Long Arial
| Gabriel |
Now that I’ve been looking into setting my graphs in a serif font instead of the default Arial, I find that changing fonts in Stata graphs is surprisingly difficult. It’s not that it’s intrinsically difficult, just confusing to learn because Stata’s handling of fonts breaks with the general Stata convention of specifying options as part of a command and is instead more similar to the Gnuplot style of changing device preferences then executing a command targeting the device. One implication of this is that there’s no option to change the graph font in the graph GUI interface (which is how I usually learn new bits of command-line syntax).
Another issue is that graphs aren’t WYSIWYG. Rather the interactive display graph can look different from the graph saved to disk and that in turn can be inconsistent depending on what file format you use. To avoid confusion I just set everything at once, like this:
local graphfont "Palatino" graph set eps fontface `graphfont' graph set eps fontfaceserif `graphfont' graph set eps /*echo back preferences*/ graph set window fontface `graphfont' graph set window fontfaceserif `graphfont' graph set window /*echo back preferences*/
One of the oddities is that there is no set of PDF options. Rather (at least on a Mac) you control the PDF device as part of the display device (“window”). My understanding is that Stata for Mac relies on the low level OS X PDF support for creating PDFs, and this would explain why it considers PDF to be part of the display device rather than one of the file type devices (as well as why it won’t make PDFs if you suppress screen rendering, why Stata for Mac got PDF support earlier than the other platforms, and why the PDFs looked like this until they fixed it in 11.2 and 12). Note that this means that your PDF will use the EPS fonts if you use my graphexportpdf.ado script and the display fonts if you use the base “graph export foo.pdf” syntax.
I haven’t checked, but I wouldn’t be surprised if the PDF preferences in Stata for Windows are controlled by the EPS preferences rather than the display preferences. In any case, I recommend just setting all devices the same so it won’t matter.
(Thanks to Eric Booth, whose StataList post, helped me figure this out).
Adding elements to graphs as a slideshow
| Gabriel |
One of the tricks to a successful presentation is to limit what your audience sees so they don’t get ahead of you and also to preserve a general sense of timing and flow. This helps keep the audience’s attention and also is good for focusing expectations in such a way that the next bit is counter-intuitive and therefore interesting. Nothing is so boring as sitting in a talk and seeing ten bullet points and realizing that the speaker is only on bullet number three.
Similarly if you’re using graphs in a talk (which you should as much as possible since they read better than tables), you may only want to reveal part of a graph as you talk about it, then reveal the next bit when you’re ready. The most obvious way to do this is to just crop the graph or cover it with boxes that match the background or something. Unfortunately that’s ugly and clunky and doesn’t work if the graph elements are tightly commingled. Another way to do it is to generate two graphs, one of which has the elements and the other of which doesn’t. The problem with this is that the graphs don’t match up properly. For instance, if you have a line graph and you keep adding lines to it, the legend will first appear and then grow larger, crowding out the graph itself.
Ideally, what you want is a set of graphs that are completely identical except some elements are missing in one version which are added in the other version. You can then line the graphs up, talk about the first set of elements, and then do a smooth transition to the version with the full set of elements. Here’s an example from the talk I gave yesterday. In order to explain crossover I first show the song’s native formats then dissolve to also show the crossover formats.
Here’s how I did it. The basic trick is that Stata can create transparent graph elements by setting the color to “none”. You do the exact same graph multiple times, you just set colors to be transparent when you want to conceal elements. That is, the code in lines 10–14 is identical to that in lines 17–21 except that lines 13 and 14 set line color to “none” instead of Stata’s standard s2color scheme.
use final_f, clear keep if artist=="SARA BAREILLES" drop if format=="All" | format=="Other" sum date local maxdate=`r(max)' local mindate=`r(min)' local interval=(`maxdate'-`mindate')/10 local interval=round(`interval',7) twoway (line Nt_inc_p date if format=="AAA_Rock", lwidth(thick) lcolor(navy)) /* */ (line Nt_inc_p date if format=="Hot_AC", lwidth(thick) lcolor(maroon)) /* */ (line Nt_inc_p date if format=="Top_40", lwidth(thick) lcolor(none)) /* */ (line Nt_inc_p date if format=="Mainstream_AC", lwidth(thick) lcolor(none)) /* */ , xtitle("") xmtick(`mindate'(7)`maxdate') xlabel(`mindate'(`interval')`maxdate', labsize(vsmall) angle(forty_five) format(%tdMon_dd,_CCYY)) legend(order (1 "AAA Rock" 2 "Hot AC" 3 "Top 40" 4 "Mainstream AC")) graphregion(fcolor(white)) graph export $images/sarabareilles_lovesong_1.pdf, replace twoway (line Nt_inc_p date if format=="AAA_Rock", lwidth(thick) lcolor(navy)) /* */ (line Nt_inc_p date if format=="Hot_AC", lwidth(thick) lcolor(maroon)) /* */ (line Nt_inc_p date if format=="Top_40", lwidth(thick) lcolor(dkorange)) /* */ (line Nt_inc_p date if format=="Mainstream_AC", lwidth(thick) lcolor(forest_green)) /* */ , xtitle("") xmtick(`mindate'(7)`maxdate') xlabel(`mindate'(`interval')`maxdate', labsize(vsmall) angle(forty_five) format(%tdMon_dd,_CCYY)) legend(order (1 "AAA Rock" 2 "Hot AC" 3 "Top 40" 4 "Mainstream AC")) graphregion(fcolor(white)) graph export $images/sarabareilles_lovesong_2.pdf, replace
Christmas in July
| Gabriel |
Has it been two years already? Holy moly, Stata 12 looks awesome.
The headline feature is structural equation modeling. It comes with a graphic model builder, which even an “only scripting is replicable” zealot like me can appreciate as it helps you learn complicated command syntax. (I feel the same way about graphs). I had actually been thinking of working SEM into my next paper and was thinking through the logistics of getting a copy of M+, learning the syntax (again), etc. Now I can do it within Stata. I look forward to reading more papers that use SEM without really understanding the assumptions.
Probably the most satisfying new feature to me though is contour plots. Ever since I got interested in writing simulations a few years ago, I have been wanting to make heat maps in Stata. I’ve spent many hours writing code that can pipe to gnuplot and, not being satisfied with that, I (with some help from Lisa) have spent yet more time working on another script that can pipe to the wireframe function in R’s lattice library. Now I’m very happy to say that I will not finish writing this ado-file and submitting it to SSC as Stata 12 contains what looks to be really good native heat plots.
I’m thinking the set of commands I will feel most guilty about not using more often, is margin plots, which extends the margin command from Stata 11. In addition to the headline new features there’s a bunch of little stuff, including fixes to get more compatibility between the estimation commands and the ancillary commands (e.g., better “predict” support for count models and “svy” support for “xtmixed”). Also, Windows users should be pleased to hear that they can now do PDFs natively.
[Update: Also see Jeremy’s post on Stata 12. He closes with a pretty funny metaphor of stats packages to cell phone brands.]
Stata for Mac PDF and “set graphics on”
| Gabriel |
I recently noted that graph exporting to PDF in Stata for Mac is fixed. Turns out that this is only partially true. It works and creates beautiful output, but unlike the other “graph export” options it only works if you have “set graphics on” in the Stata GUI. If you’re running it as Stata console or have graphics set off in Stata GUI, it simply doesn’t work. (I do this when batching a lot of graphs as it is faster and less distracting).
My understanding is that this has something to do with how Stata relies on Mac’s Quartz driver to render PDF so it’s not really feasible to fix. So basically you have three options:
1) Do it in the GUI with “set graphics on” and accept the CPU performance hit and distraction of all the graphs rendering.
2) Use my graphexportpdf ado file or the “graph print” command with CUPS-PDF as the print driver.
3) Stick to using EPS
Another R bleg about loops and graph devices
| Gabriel |
I’m having another problem with R and was hoping somebody could help me out in the comments. Long story short, I can make graphs properly if I do them one at a time, but when I try to loop it I get device errors. In particular it creates the PDFs but they are either empty or corrupted.
Here is the log, which first fails to do it with a loop, then does it right for one case where I manually assign the variables rather than looping.
> # File-Name: surfacegraphs.R > # Date: 2011-04-25 > # Author: Gabriel Rossman > # Purpose: graph from Stata > # Packages Used: lattice > # note, wireframe code from lisa > timestamp() ##------ Tue Apr 26 10:44:09 2011 ------## > library(lattice) > > histopath <- '~/Documents/project/histograms' > image2 <- '~/Documents/project/images/histograms' > > timestamp() ##------ Tue Apr 26 10:44:10 2011 ------## > > #create surface histograms, showing how population evolves over time > # parameters held constant > setwd(histopath) > for(d in 0:10) { + for(p in 0:5) { + d10 <- d*10 + p100 <- p*100 + datafile <- paste(histopath,'/d',d10,'p',p100,'.txt', sep="") + dataobject <- read.table(file=datafile,header=TRUE) + pdfcolor <- paste(image2,'/hist_color_d',d10,'p',p100,'.pdf', sep="") + pdfgrey <- paste(image2,'/hist_grey_d',d10,'p',p100,'.pdf', sep="") + pdf(pdfcolor) + wireframe( dataobject$z~dataobject$x*dataobject$y, shade=TRUE) + dev.off() + + pdf(pdfgrey) + wireframe( dataobject$z~dataobject$x*dataobject$y, shade=TRUE, par.settings=standard.theme(color=FALSE)) + dev.off() + } + } There were 50 or more warnings (use warnings() to see the first 50) > timestamp() ##------ Tue Apr 26 10:44:12 2011 ------## > > #loop doesn't work > # seems to be the dev.off() > #try a few manually > d10 <- 0 > p100 <- 0 > datafile <- paste(histopath,'/d',d10,'p',p100,'.txt', sep="") > dataobject <- read.table(file=datafile,header=TRUE) > pdfcolor <- paste(image2,'/hist_color_d',d10,'p',p100,'.pdf', sep="") > pdfgrey <- paste(image2,'/hist_grey_d',d10,'p',p100,'.pdf', sep="") > pdf(pdfcolor) > wireframe( dataobject$z~dataobject$x*dataobject$y, shade=TRUE) > dev.off() null device 1 > > > timestamp() ##------ Tue Apr 26 10:44:14 2011 ------## > >
The warnings start like this and go on from there:
> warnings() Warning messages: 1: In min(x) : no non-missing arguments to min; returning Inf 2: In max(x) : no non-missing arguments to max; returning -Inf 3: In min(x) : no non-missing arguments to min; returning Inf 4: In max(x) : no non-missing arguments to max; returning -Inf 5: In min(x) : no non-missing arguments to min; returning Inf
Any ideas?
Do I just need to hard-code the graphs I really want rather than batching them?
[Update]
As Michal suggested, I needed to wrap wireframe in print. Here’s an example of the output (for a baseline simulation).
hist_color_d0p0
Stata for Mac PDF fixed in Stata 11.2
| Gabriel |
About a year ago, I got frustrated with Stata’s “graph export foo.pdf” command, which at the time gave hideous output. Apparently the problem was that Stata used the same code to write to disk as to write to screen. As a work-around, I wrote graphexportpdf.ado, which is basically a wrapper to pipe Stata-generated eps files through Ghostscript.
I am happy to report that the revision notes for Stata 11.2 include this line, from the section about Stata for Mac:
33. Graphs exported as PDF files are now exported with increased resolution.
That is to say, they fixed it. I tested it and it creates beautiful output and does so very quickly. Thanks StataCorp!
I highly recommend that Mac users of Stata 11.2 and higher use the native PDF capabilities through the standard “graph export foo.pdf” syntax. Graphexportpdf.ado may still be useful for Mac users of versions 10 and earlier and to Linux users (who don’t have Quartz but usually have Ghostscript as part of a LaTeX distro).
Finally, remember that “graph export foo.pdf” is a Mac only option so if you want your code to be portable you should treat it like this:
if "`c(os)'"=="MacOSX" { graph export mygraph.pdf, replace } else { graph export mygraph.eps, replace }
Gnuplot [updated]
| Gabriel |
Here’s an updated version of my old script for piping from Stata to Gnuplot. Gnuplot is a generic graphing utility. The main thing it can do that Stata can’t is good 3D graphs. (Although the ado files “surface” and “tddens” make a good effort at adding this capability). In this version I set it to a coarser smooth, set the horizontal plane to the minimum value for Z, and allow either contour or surface.
Note that the dgrid3d smoothing algorithm in Gnuplot 4.2 (which is still the version maintained in Fink and Ubuntu) has some problems — the most annoying of which is that it can go haywire at the borders. I decided to handle this by setting it to a coarse smooth (5,5,1). However if unlike me you’re lucky enough to have Gnuplot 4.3 you have access to several much better smoothing algorithms and you may want to experiment with increasing the resolution and using one of the new algorithms by editing line 41 of this script.
*GHR 5/25/2010 *gnuplotpm3d 1.1 *pipes data to gnuplot's splot command, useful for color surface graphs capture program drop gnuplotpm3d program define gnuplotpm3d syntax varlist(min=3 max=3 numeric) [ , title(string asis) xlabel(string asis) ylabel(string asis) using(string asis) contour rotate(string asis)] disp "`rotate'" if "`xlabel'"=="" { local xlabel="`x'" } if "`ylabel'"=="" { local ylabel="`y'" } tempfile gnuplotpm3cmd tempfile data preserve keep `varlist' order `varlist' disp "`using'" outsheet using `data'.txt restore *assumes that the software is running on a mac using Gnuplot.app *if installed with fink or macports, set to the default (Unix) mode of just "gnuplot" local gnuplot="gnuplot" if "`c(os)'"=="MacOSX" { local gnuplot="exec '/Applications/Gnuplot.app/Contents/Resources/bin/gnuplot'" /*assumes binary install, set to simply "gnuplot" if fink or macports install*/ } if "`c(os)'"=="Windows" { *not tested local gnuplot="gnuplot.exe" } file open gpcmd using `gnuplotpm3cmd', write text file write gpcmd "cd '`c(pwd)'' ; " _n file write gpcmd "set terminal postscript color ; set output '`using'.eps' ; set palette color positive ; " _n file write gpcmd "set auto ; set parametric ; " _n file write gpcmd "set dgrid3d 5,5,1 ; " _n /*interpolation algorithm, allows tolerance for data irregularities -- set to low numbers for smooth graph and high numbers for bumpier graph*/ file write gpcmd `"set title "`title'" ; set ylabel "`ylabel'"; set xlabel "`xlabel'"; "' _n file write gpcmd "unset contour; unset surface; " _n if "`contour'"=="contour" { file write gpcmd "set view map; " _n } if "`rotate'"!="" & "`contour'"=="contour" { file write gpcmd "set view `rotate'; " _n } if "`contour'"=="contour" & "`rotate'"!="" { disp "rotate options ignored as they are incompatible with contour view" } file write gpcmd "set xyplane 0; " _n /*put the xy plane right at min(z) */ file write gpcmd "set pm3d;" _n file write gpcmd "set pm3d interpolate 10,10; " _n file write gpcmd `"splot "`data'.txt"; "' _n file close gpcmd shell `gnuplot' `gnuplotpm3cmd' end
Network Graphs in Native Stata Code
| Gabriel |
My approach to handling networks in Stata is to not handle them in Stata. Rather I take an approach inspired by the Unix philosophy and export the data, then call an R script to do what I need, and in some cases use perl to clean the output for importing back into Stata. Since R/igraph has great network tools, this is a very flexible and powerful workflow. However it’s better suited for scripting than exploratory interactive work and it’s frustrating and slow-going as R syntax is really different from Stata.
Fortunately, Rense Corten has created “netplot.” The upside is that it has a simple syntax and is entirely Stata native code so there’s much less hassle than piping to R. The downside is that while it renders fast for small networks, it doesn’t scale as well to medium or large networks as the dedicated SNA packages. The current version only offers MDS and circle as layout algorithms, but future versions may work in K-K or F-R. Here’s a sample plot of a mid-sized random graph (note this took about 30 seconds to render, whereas R/igraph could do it in less than a second).
Anyway, I congratulate Dr. Corten on the accomplishment. I’m really excited about “netplot” in part as a proof of concept towards developing Stata native SNA capabilities but even more because it will make it much more convenient to generate SNA graphs from Stata. For now I still prefer the rendering speed and flexibility of R/igraph, but I really like that I can do a very good initial version natively as it makes exploratory work more convenient. Keep in mind though I have a considerable sunk cost in learning to pipe to R/igraph, someone who has yet to make such an investment might very well prefer to rely on netplot exclusively for graphing as it’s extremely easy to use and gives output that’s great for many purposes.
To install “netplot” type “ssc install netplot”. Also see this blog post and this manual.
Recent Comments