## Posts tagged ‘graphs’

### Caveat Smoother

| Gabriel |

All metrics and models are assumption-laden, but some are more assumption-laden than others. Among the worst offenders are smoothers, which as the name implies, assume that the underlying reality is smooth. If the underlying reality has discontinuity then the smoother will obfuscate this in the course of trying to smooth out “noise.” This can actually have big theoretical implications. Most notably, there was always a lot of zig-zag in the fossil record but traditionally people assumed it was just noise and so they smoothed it out. Then Gould came up with the theory of punctuated equilibrium and said that evolution substantively works through bursts, which at a data level is equivalent to saying that the zig zags are signal, not noise.

Here’s an illustration. Let’s simulate a dataset that, by assumption, follows a step function. To keep it simple we’ll have no noise at all, just the underlying step function. Now, let’s apply a LOWESS smooth on the time-series. As you can see, the smoothed trend is basically an s-curve even though we know by assumption that the underlying structure of causation is a step.

```set obs 50
gen t=[_n]
gen x=0 in 1/30
replace x=1 in 31/50
twoway (lowess x t) (scatter x t, msymbol(circle_hollow)), legend(off)```

Moral of the story, think carefully about whether the smoother is theoretically appropriate. If there are substantive reasons to expect discontinuities then it probably ain’t. For a similar reason you may want to not just assume a linear effect or even a polynomial specification in regression but compare various transformations (e.g., linear splines vs quadratics) and see what fits best.

### Recursively building graph commands

| Gabriel |

I really like multi-line graphs and scatterplots where the marker color/style reflects categories. Such graphs are both more compact than just having multiple graphs and they make it easier to compare different things. The way you do this is with “twoway,” a lot of parentheses, and the “if” condition. For example:

`twoway (kdensity x if year==1985) (kdensity x if year==1990) (kdensity x if year==1995) (kdensity x if year==2000) (kdensity x if year==2005), legend(order(1 "1985" 2 "1990" 3 "1995" 4 "2000" 5 "2005"))`

Unfortunately such graphs can be difficult to script. This is especially so for the legends, which by default show the variable name rather than the selection criteria. I handle this by recursively looping over a local, which in the case of the legend involves embedded quote marks.

```capture program drop multigraph
program define multigraph
local var         `1'
local interval	  `2'

local command ""
local legend ""
local legendtick=1
forvalues i=1985(`interval')2005 {
local command "`command' (kdensity `var' if year==`i')"
local legend `legend' `legendtick' `" `i' "'
*"
local legendtick=`legendtick'+1
}
disp "twoway ""`command'" ", legend:" "`legend'"
twoway `command' , legend(order(`legend'))
end```

To replicate the hard-coded command above, you’d call it like this:

`multigraph x 5`

### So Long Arial

| Gabriel |

Now that I’ve been looking into setting my graphs in a serif font instead of the default Arial, I find that changing fonts in Stata graphs is surprisingly difficult. It’s not that it’s intrinsically difficult, just confusing to learn because Stata’s handling of fonts breaks with the general Stata convention of specifying options as part of a command and is instead more similar to the Gnuplot style of changing device preferences then executing a command targeting the device. One implication of this is that there’s no option to change the graph font in the graph GUI interface (which is how I usually learn new bits of command-line syntax).

Another issue is that graphs aren’t WYSIWYG. Rather the interactive display graph can look different from the graph saved to disk and that in turn can be inconsistent depending on what file format you use. To avoid confusion I just set everything at once, like this:

```local graphfont "Palatino"
graph set eps fontface `graphfont'
graph set eps fontfaceserif `graphfont'
graph set eps  /*echo back preferences*/

graph set window fontface `graphfont'
graph set window fontfaceserif `graphfont'
graph set window /*echo back preferences*/
```

One of the oddities is that there is no set of PDF options. Rather (at least on a Mac) you control the PDF device as part of the display device (“window”). My understanding is that Stata for Mac relies on the low level OS X PDF support for creating PDFs, and this would explain why it considers PDF to be part of the display device rather than one of the file type devices (as well as why it won’t make PDFs if you suppress screen rendering, why Stata for Mac got PDF support earlier than the other platforms, and why the PDFs looked like this until they fixed it in 11.2 and 12). Note that this means that your PDF will use the EPS fonts if you use my graphexportpdf.ado script and the display fonts if you use the base “graph export foo.pdf” syntax.

I haven’t checked, but I wouldn’t be surprised if the PDF preferences in Stata for Windows are controlled by the EPS preferences rather than the display preferences. In any case, I recommend just setting all devices the same so it won’t matter.

(Thanks to Eric Booth, whose StataList post, helped me figure this out).

### Adding elements to graphs as a slideshow

| Gabriel |

One of the tricks to a successful presentation is to limit what your audience sees so they don’t get ahead of you and also to preserve a general sense of timing and flow. This helps keep the audience’s attention and also is good for focusing expectations in such a way that the next bit is counter-intuitive and therefore interesting. Nothing is so boring as sitting in a talk and seeing ten bullet points and realizing that the speaker is only on bullet number three.

Similarly if you’re using graphs in a talk (which you should as much as possible since they read better than tables), you may only want to reveal part of a graph as you talk about it, then reveal the next bit when you’re ready. The most obvious way to do this is to just crop the graph or cover it with boxes that match the background or something. Unfortunately that’s ugly and clunky and doesn’t work if the graph elements are tightly commingled. Another way to do it is to generate two graphs, one of which has the elements and the other of which doesn’t. The problem with this is that the graphs don’t match up properly. For instance, if you have a line graph and you keep adding lines to it, the legend will first appear and then grow larger, crowding out the graph itself.

Ideally, what you want is a set of graphs that are completely identical except some elements are missing in one version which are added in the other version. You can then line the graphs up, talk about the first set of elements, and then do a smooth transition to the version with the full set of elements. Here’s an example from the talk I gave yesterday. In order to explain crossover I first show the song’s native formats then dissolve to also show the crossover formats.

Here’s how I did it. The basic trick is that Stata can create transparent graph elements by setting the color to “none”. You do the exact same graph multiple times, you just set colors to be transparent when you want to conceal elements. That is, the code in lines 10–14 is identical to that in lines 17–21 except that lines 13 and 14 set line color to “none” instead of Stata’s standard s2color scheme.

```use final_f, clear
keep if artist=="SARA BAREILLES"
drop if format=="All" | format=="Other"
sum date
local maxdate=`r(max)'
local mindate=`r(min)'
local interval=(`maxdate'-`mindate')/10
local interval=round(`interval',7)

twoway (line Nt_inc_p date if format=="AAA_Rock", lwidth(thick) lcolor(navy)) /*
*/ (line Nt_inc_p date if format=="Hot_AC", lwidth(thick) lcolor(maroon)) /*
*/ (line Nt_inc_p date if format=="Top_40", lwidth(thick) lcolor(none)) /*
*/ (line Nt_inc_p date if format=="Mainstream_AC", lwidth(thick) lcolor(none)) /*
*/ , xtitle("") xmtick(`mindate'(7)`maxdate') xlabel(`mindate'(`interval')`maxdate', labsize(vsmall) angle(forty_five) format(%tdMon_dd,_CCYY)) legend(order (1 "AAA Rock" 2 "Hot AC" 3 "Top 40" 4 "Mainstream AC"))  graphregion(fcolor(white))
graph export \$images/sarabareilles_lovesong_1.pdf, replace

twoway (line Nt_inc_p date if format=="AAA_Rock", lwidth(thick) lcolor(navy)) /*
*/ (line Nt_inc_p date if format=="Hot_AC", lwidth(thick) lcolor(maroon)) /*
*/ (line Nt_inc_p date if format=="Top_40", lwidth(thick) lcolor(dkorange)) /*
*/ (line Nt_inc_p date if format=="Mainstream_AC", lwidth(thick) lcolor(forest_green)) /*
*/ , xtitle("") xmtick(`mindate'(7)`maxdate') xlabel(`mindate'(`interval')`maxdate', labsize(vsmall) angle(forty_five) format(%tdMon_dd,_CCYY)) legend(order (1 "AAA Rock" 2 "Hot AC" 3 "Top 40" 4 "Mainstream AC")) graphregion(fcolor(white))
graph export \$images/sarabareilles_lovesong_2.pdf, replace ```

### Christmas in July

| Gabriel |

Has it been two years already? Holy moly, Stata 12 looks awesome.

The headline feature is structural equation modeling. It comes with a graphic model builder, which even an “only scripting is replicable” zealot like me can appreciate as it helps you learn complicated command syntax. (I feel the same way about graphs). I had actually been thinking of working SEM into my next paper and was thinking through the logistics of getting a copy of M+, learning the syntax (again), etc. Now I can do it within Stata. I look forward to reading more papers that use SEM without really understanding the assumptions.

Probably the most satisfying new feature to me though is contour plots. Ever since I got interested in writing simulations a few years ago, I have been wanting to make heat maps in Stata. I’ve spent many hours writing code that can pipe to gnuplot and, not being satisfied with that, I (with some help from Lisa) have spent yet more time working on another script that can pipe to the wireframe function in R’s lattice library. Now I’m very happy to say that I will not finish writing this ado-file and submitting it to SSC as Stata 12 contains what looks to be really good native heat plots.

I’m thinking the set of commands I will feel most guilty about not using more often, is margin plots, which extends the margin command from Stata 11. In addition to the headline new features there’s a bunch of little stuff, including fixes to get more compatibility between the estimation commands and the ancillary commands (e.g., better “predict” support for count models and “svy” support for “xtmixed”). Also, Windows users should be pleased to hear that they can now do PDFs natively.

[Update: Also see Jeremy’s post on Stata 12. He closes with a pretty funny metaphor of stats packages to cell phone brands.]

### Stata for Mac PDF and “set graphics on”

| Gabriel |

I recently noted that graph exporting to PDF in Stata for Mac is fixed. Turns out that this is only partially true. It works and creates beautiful output, but unlike the other “graph export” options it only works if you have “set graphics on” in the Stata GUI. If you’re running it as Stata console or have graphics set off in Stata GUI, it simply doesn’t work. (I do this when batching a lot of graphs as it is faster and less distracting).

My understanding is that this has something to do with how Stata relies on Mac’s Quartz driver to render PDF so it’s not really feasible to fix. So basically you have three options:

1) Do it in the GUI with “set graphics on” and accept the CPU performance hit and distraction of all the graphs rendering.

2) Use my graphexportpdf ado file or the “graph print” command with CUPS-PDF as the print driver.

3) Stick to using EPS

### Another R bleg about loops and graph devices

| Gabriel |

I’m having another problem with R and was hoping somebody could help me out in the comments. Long story short, I can make graphs properly if I do them one at a time, but when I try to loop it I get device errors. In particular it creates the PDFs but they are either empty or corrupted.

Here is the log, which first fails to do it with a loop, then does it right for one case where I manually assign the variables rather than looping.

```> # File-Name:       surfacegraphs.R
> # Date:            2011-04-25
> # Author:          Gabriel Rossman
> # Purpose:         graph from Stata
> # Packages Used:   lattice
> # note, wireframe code from lisa
> timestamp()
##------ Tue Apr 26 10:44:09 2011 ------##
> library(lattice)
>
> histopath <- '~/Documents/project/histograms'
> image2 <- '~/Documents/project/images/histograms'
>
> timestamp()
##------ Tue Apr 26 10:44:10 2011 ------##
>
> #create surface histograms, showing how population evolves over time
> #  parameters held constant
> setwd(histopath)
> for(d in 0:10) {
+ 	for(p in 0:5) {
+ 		d10 <- d*10
+ 		p100 <- p*100
+ 		datafile <- paste(histopath,'/d',d10,'p',p100,'.txt', sep="")
+ 		pdfcolor <- paste(image2,'/hist_color_d',d10,'p',p100,'.pdf', sep="")
+ 		pdfgrey <- paste(image2,'/hist_grey_d',d10,'p',p100,'.pdf', sep="")
+ 		pdf(pdfcolor)
+ 		dev.off()
+
+ 		pdf(pdfgrey)
+ 		dev.off()
+ 	}
+ }
There were 50 or more warnings (use warnings() to see the first 50)
> timestamp()
##------ Tue Apr 26 10:44:12 2011 ------##
>
> #loop doesn't work
> #  seems to be the dev.off()
> #try a few manually
> d10 <- 0
> p100 <- 0
> datafile <- paste(histopath,'/d',d10,'p',p100,'.txt', sep="")
> pdfcolor <- paste(image2,'/hist_color_d',d10,'p',p100,'.pdf', sep="")
> pdfgrey <- paste(image2,'/hist_grey_d',d10,'p',p100,'.pdf', sep="")
> pdf(pdfcolor)
> dev.off()
null device
1
>
>
> timestamp()
##------ Tue Apr 26 10:44:14 2011 ------##
>
>
```

The warnings start like this and go on from there:

```> warnings()
Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
3: In min(x) : no non-missing arguments to min; returning Inf
4: In max(x) : no non-missing arguments to max; returning -Inf
5: In min(x) : no non-missing arguments to min; returning Inf
```

Any ideas?
Do I just need to hard-code the graphs I really want rather than batching them?

[Update]
As Michal suggested, I needed to wrap wireframe in print. Here’s an example of the output (for a baseline simulation).
hist_color_d0p0