Posts Tagged graphs
some R baby steps
| Gabriel |
I’ve tried to learn R a few times but the syntax has always been opaque to my Stata-centric mind. I think the trick is to realize that two of the key differences are that:
- whereas handles only come up in a few Stata commands (e.g., file, postfile, log), they are very important in R, what with all the “<-” statements
- in R there’s a much fuzzier line between commands and functions than in Stata. What I mean by this is both the superficial thing of all the parentheses and also the more substantive issue that often you don’t put them one to a line and they just do something (like Stata commands) but you usually put them many to a line and feed them into something else (like Stata functions). Related to this is that the typical Stata line has the syntax “verb object, adverb” whereas the typical R line has the syntax “object <- verb(object2, adverb)”
The two combine in an obvious way with something as simple as opening a dataset, which is just use file in Stata but is filehandle <- read.table(“file”) in R, that is, there’s not a read.table() command but a read.table() function and you feed this function to a handle. (And people say R isn’t intuitive!)
At least I think that’s a good way to think about the basic syntax — I suck at R and I really could be totally wrong about this. (Pierre or Kieran please correct me).
Anyway, I wrote my first useful R file the other day. It reads my Pajek formatted network data on top 40 radio stations and does a graph.
# File-Name: testgraph.R
# Date: 2009-11-20
# Author: Gabriel Rossman
# Purpose: graph CHR station network
# Data Used: ties_bounded.net
# Packages Used: igraph
library(igraph)
setwd("~/Documents/Sjt/radio/survey")
chrnet <- read.graph("ties.net", c("pajek"))
pdf("~/Documents/book/images/chrnetworkbounded.pdf")
plot.igraph(chrnet, layout=layout.fruchterman.reingold, vertex.size=2, vertex.label=NA, vertex.color="red", edge.color="gray20", edge.arrow.size=0.3, margin=0)
dev.off()
The weird thing is that it works fine in R.app but breaks when I try to R run from the Terminal, regardless of whether I try to do it all in one line or first invoke R and then feed it the script. [Update: the issue is a 32 bit library and 64 bit R, the simple solution is to invoke "R32" rather than just plain "R". see the comments for details]. Here’s a session with both problems:
gabriel-rossmans-macbook-2:~ rossman$ Rscript ~/Documents/book/stata/testgraph.R
Error in dyn.load(file, DLLpath = DLLpath, ...) :
unable to load shared library '/Library/Frameworks/R.framework/Resources/library/igraph/libs/x86_64/igraph.so':
dlopen(/Library/Frameworks/R.framework/Resources/library/igraph/libs/x86_64/igraph.so, 10): Symbol not found: ___gmpz_clear
Referenced from: /Library/Frameworks/R.framework/Resources/library/igraph/libs/x86_64/igraph.so
Expected in: dynamic lookup
Error : .onLoad failed in 'loadNamespace' for 'igraph'
Error: package/namespace load failed for 'igraph'
Execution halted
gabriel-rossmans-macbook-2:~ rossman$ R 'source("~/Documents/book/stata/testgraph.R")'
ARGUMENT 'source("~/Documents/book/stata/testgraph.R")' __ignored__
R version 2.10.0 (2009-10-26)
Copyright (C) 2009 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> source("~/Documents/book/stata/testgraph.R")
Error in dyn.load(file, DLLpath = DLLpath, ...) :
unable to load shared library '/Library/Frameworks/R.framework/Resources/library/igraph/libs/x86_64/igraph.so':
dlopen(/Library/Frameworks/R.framework/Resources/library/igraph/libs/x86_64/igraph.so, 10): Symbol not found: ___gmpz_clear
Referenced from: /Library/Frameworks/R.framework/Resources/library/igraph/libs/x86_64/igraph.so
Expected in: dynamic lookup
Error : .onLoad failed in 'loadNamespace' for 'igraph'
Error: package/namespace load failed for 'igraph'
>
The problem seems to be that R (terminal) can’t find the igraph library. This is weird because R.app has no trouble finding it. Furthermore, I get the same error even if I make sure igraph is installed directly from R (Terminal) in the same R session:
chooseCRANmirror(graphics = FALSE)
install.packages("igraph")
source("/Users/rossman/Documents/book/stata/testgraph.R")
I guess that’s another difference with Stata, StataConsole knows where the ado library is. I’d like to be able to use the Terminal mode for R as this would let me to reach my Nirvana-like goal of having a single script that does everything without any proximate human intervention. So I’ll just ask? How do I get R (Terminal) to run as reliably as R.app? Is this a naive question?
Or would it be better to try to feed R.app a “source” script from the command line? Much how like I can do this for Stata to launch a do-file into the Stata GUI
exec /Applications/Stata/StataMP.app/Contents/MacOS/stataMP ~/Documents/book/stata/import.do
6 comments November 24, 2009
Surface / contour graphs with tddens.ado
| Gabriel |
I have previously complained about the lack of surface / contour graphs in Stata, first providing native code for some thoroughly fug graphs, then code that pipes to gnuplot. The latter solution produces nice output but is a) a pain to install on Mac or Windows and b) doesn’t adhere to Stata’s graph conventions.
I was actually considering cleaning up my gnuplot pipe and submitting it to SSC, but no longer, for there is a new ado file by Austin Nichols that does this natively and produces absolutely gorgeous graphs. The only advantage of piping things to gnuplot is that gnuplot is faster, but I think the advantages of doing everything natively within Stata make the extra couple seconds well worth it.
Fire up your copy of Stata and type
ssc install tddens
Here’s some sample output:


On behalf of Stata users everywhere, thank you Dr. Nichols.
Add comment September 8, 2009
Journey to the True Tales of the IMDB!
| Gabriel |
Following up on yesterday’s post, check out this paper by Herr and his colleagues that graphs IMDB and provides some basic descriptions of the network. You can also see a zoomable version of their truly gorgeous visualization. Finally the answer to that age old question, what do you get for the quantitative cultural sociologist who has everything?
The authors are affiliated with the Cyberinfrastructure for Network Science Center at Indiana. Although Indiana sociology has a well-deserved reputation for hardcore quant research, CNS is at the school of Information. Following the logic I learned from reading Marvel comics as a kid I can only speculate that something about the pesticide run-off in the drinking water gives scholars at Indiana superhuman abilities to code.
Also of note is that CNS provides the cross-platform open source package Network Workbench. I was a little skeptical because it’s written in Java (which tends to be slow) but I got it to create a PageRank vector of a huge dataset in six minutes, which isn’t bad at all. I may have more to say about this program in the future as I plan to tinker with it.
Add comment June 19, 2009
Gnuplot and Stata
| Gabriel |
Stata has had pretty good graphs for a few years, but it can’t do everything, for instance contours. This means that if you’re interested in doing these graphs you either need to export to a GUI spreadsheet or charting package (preferably Excel, which is very versatile, despite the hideous default settings of yellow and pink lines on gray background) or you need to pipe things to a program that can batch it, like R or Gnuplot. I think batching is generally preferable to GUI for the simple reason that we usually don’t do things once but repeat it with subtle variations about a million times. As such I’ve been experimenting with Gnuplot though I readily admit that the same things would work (and perhaps work better with R).
The first step is getting it. Since the whole point is to have Stata pipe to it it should be running on the same OS rather than on a separate computer or in a VM (unless you’re really comfortable with ssh and that sort of thing). If you’re using Linux, gnuplot might already be installed and if not you can probably get it with your package manager either directly or by installing a program (eg, Octave) that has it as a dependency. Likewise you can download the source code and binaries, including Windows (“Win32″).
Unfortunately it’s a bit harder to get for the Mac. As with a lot of Unix software the standard solution is to use Fink but as mentioned before I can’t get Fink to work consistently. I have found a binary of Gnuplot, but since it’s not obvious where to look I’ll share it:
- Download Octave (basically a clone of Matlab)
- Open the disk image for Octave but you don’t need to install it if you don’t want it.
- In the dmg open “/extras” and open the Gnuplot disk image.
- Install Gnuplot.
Furthermore, I recommend that you use a text editor to open (or create) the file “~/.bashrc” and add the line:
alias gnuplot="exec '/Applications/Gnuplot.app/Contents/Resources/bin/gnuplot'"
The next step is to write a do-file that pipes your data and command from Stata to Gnuplot. The easy part is to outsheet the relevant subset of your data to tab-delimited text. Next you need to write the Gnuplot command. You could simply write it all into a single shell command, but I think it’s easier to write it to a text file then use the shell command to have Gnuplot execute the text file. (Why is it easier? Because it makes it much easier to make specific parts of the Gnuplot command contingent on options you specify from within Stata). This requires use of Stata’s “file” command. One trick is that because both Gnuplot and Stata read quotes you have to escape the embedded quotes to get Stata to treat them as literal and therefore pass them on to Gnuplot. A simple example of Stata’s use of escaping quotes is here:
disp `"Then Beavis said "Fire!""'
Whereas Gnuplot has a minimalist graph command (just “plot,” “splot,” or “replot” followed by the filename) Stata puts almost everything in the graph command itself. Gnuplot handles all these options by a series of “set” commands that you specify before running the graph. Because Gnuplot recognizes the semicolon delimiter you can put several commands on one line. So you “set terminal” (which in batch mode should be a file format like postscript but in interactive mode might be a display protocol like X11), set the title and axes titles, set the saving path, set the type of graph, etc. If this is confusing just think of it as changing the preferences, then doing a graph based on the preferences rather than (as Stata does) specifying everything on an ad hoc basis.
Below I’ve written a Stata program that will send data to Gnuplot, write a Gnuplot command, and execute that Gnuplot command. Because I’m too lazy to make it flexible right now the command is only good for doing contour plots (using the pm3d command which produces a color gradient rather the topographic lines of the standard contour command). However it’s entirely feasible to write a generic wrapper to pipe Stata to Gnuplot. The easy-to-program but hard-to-use approach would just have the user write the Gnuplot command from within Stata but facilitate piping the data and command. The hard-to-program but easy-to-use approach would be to write a wrapper that translates Stata syntax to Gnuplot syntax. Anyway, here’s my much more limited command for making a contour plot, followed by a sample of the output.
capture program drop gnuplotpm3dprogram define gnuplotpm3dcapture program drop gnuplotpm3d program define gnuplotpm3d syntax varlist(min=3 max=3 numeric) [ , title(string asis) xlabel(string asis) ylabel(string asis) using(string asis)] if "`xlabel'"~="" { local xlabel="`x'" } if "`ylabel'"~="" { local ylabel="`y'" } local d1=subinstr("`c(current_date)'"," ","",.) local d2=subinstr("`c(current_time)'",":","",.) local d="`d1'`d2'" preserve keep `varlist' order `varlist' outsheet using `using'.txt, replace restore local gnuplot="exec '/Applications/Gnuplot.app/Contents/Resources/bin/gnuplot'" /*this line is necessary if Stata can't find gnuplot, non-mac users may need to edit it*/ file open gpcmd using gnuplotpm3cmd`d', write text file write gpcmd "cd '`c(pwd)'' ; " _n file write gpcmd "set terminal postscript color ; set output '`using'.eps' ; set palette color positive ; " _n file write gpcmd "set auto ; set parametric ; set dgrid3d ; " _n file write gpcmd `"set title "`title'" ; set ylabel "`ylabel'"; set xlabel "`xlabel'"; "' _n file write gpcmd "unset contour; unset surface; " _n file write gpcmd "set view map; set pm3d; set pm3d interpolate 10,10; " _n file write gpcmd `"splot "`using'.txt"; "' _n file close gpcmd shell `gnuplot' gnuplotpm3cmd`d' end
To test it out I generated some noise data and ran the program on it generating the graph below
set obs 12 gen x=[_n]/3 save test, replace ren x y cross using test gen z=5*y+(tan(x+y))/x gnuplotpm3d x y z, using(test) title(Test Graph)

3 comments June 1, 2009
Contour graphs in Stata
| Gabriel |
Stata has a lot of graphing capabilities but it can’t do contour maps (and can only do monochrome surface maps through an ado file). Contour maps are kind of like a scatterplot except that they have three dimensions, where the color-coding stands in for the z-axis. They’re really useful for graphically illustrating complex nonlinear functions (which is why you see them a lot in hard science journals and really hardcore techy network stuff). Usually x and y are some parameter and z is some metric that’s calculated by feeding x and y into a nonlinear equation or simulation. Surface maps are similar but they show the z-axis literally instead of through color-coding. Grusky the Younger has used these to show ginormous occupational reproduction xtabs (x is dad’s job, y is son’s job, z is frequency of that particular combo).
Anyway, these graphs are really useful for certain things but they don’t work in Stata (or Numbers or OpenOffice). On the other hand Excel, R, and Gnuplot have all been able to do this forever. (Gnumeric can do contour but not surface). Anyway, this kind of sucks because I work in Stata and it’s a hassle to a) export a table to a Gnumeric or Excel or b) script Stata to push the graphing to R or GnuPlot. Since I need to do these graphs a lot for exploratory purposes I want to be able to do a quick and dirty draft directly within Stata. (I don’t mind using the other software for publication quality stuff but as every quant knows you do the exploratory stuff a thousand times before you’re ready to set it as publication quality). There’s a pretty good ado file for surface graphs (just type “findit surface”) but I find color-coding much easier to read than 3-D so I want contour graphs and I want it in Stata.
Anyway, I wrote this code to create draft contour graphs. The command syntax is just “crudecontour x y z” where x and y are the axes and z is the color-coding. The program automatically breaks z into quartiles and shows it color-coded from blue (low z) to red (high z).
Note that this little program revels in two distinct aspects of mediocrity. First, it expects the dataset to have exactly one cell for each combination of x and y. If there’s a missing cell it just plots it as white space (unlike the good packages which will impute). Second, it produces graphs that look a like a video game from 1979. On the other hand, it’s native (have you tried to install Gnuplot on a Mac?) and it took five minutes to code. It’s good enough for exploratory work but you definitely want to use something else for the final version. I still haven’t given up yet on scripting Stata to push it to R or Gnuplot because I’d really rather batch this than do it in a GUI spreadsheet.
capture program drop crudecontourcapture program drop crudecontour program define crudecontour set more off local x `1' local y `2' local z `3' /*color-coding variable*/ quietly sum `z', detail local z25=`r(p25)' local z50=`r(p50)' local z75=`r(p75)' twoway /* */ (scatter `x' `y' if `z'<`z25', mcolor(blue) msize(huge) msymbol(square)) /* */ (scatter `x' `y' if `z'>=`z25' & `z'<`z50', mcolor(green) msize(huge) msymbol(square)) /* */ (scatter `x' `y' if `z'>=`z50' & `z'<`z75', mcolor(yellow) msize(huge) msymbol(square)) /* */ (scatter `x' `y' if `z'>=`z75', mcolor(red) msize(huge) msymbol(square))/* */ , legend(order(1 "1Q `z'" 2 "2Q `z'" 3 "3Q `z'" 4 "4Q `z'")) end
3 comments May 29, 2009
Stata graphs in pdf fixed
| Gabriel |
Yesterday I complained that Stata makes ugly pdf graphs. Today I decided to fix it. Not only does this code make graphs that aren’t ugly, but it should port to Stata for Linux or Solaris (the current pdf feature only “works” on Stata for mac). This little program first saves a graph as eps (which Stata does well) and then uses the commnd line utility “ps2pdf” to make the pdf. Deleting the eps file is optional.
This program runs a little slower than the native pdf export but on the other hand it’s usually faster to load a pdf than an eps into a graphics program and I tend to be more impatient about that sort of thing (which I do interactively) than I am about running Stata (which I usually batch).
(Update 4/27 3:34pm, I have added the program to the ssc directory. So instead of copying the code below you can just type “ssc install graphexportpdf, replace”. Thanks to Kit Baum for some help with the syntax parsing)
*1.0.1 GHR April 27, 2009
program graphexportpdf
version 10
set more off
syntax anything(name=filename) [, DROPeps replace]
local extension=regexm("`filename'","(.+)\.pdf")
if `extension'==1 {
disp "{text:note, the file extension .pdf is allowed but not necessary}"
local filename=regexs(1)
}
if "`replace'"=="replace" {
disp "{text:note, replace option is always on with graphexportpdf}"
}
graph export "`filename'.eps", replace
if "$S_OS"=="Windows" {
disp "{error:Sorry, this command only works properly with Mac, Linux, and Solaris. Although I can't make a pdf for you, I have generated an eps file that you can convert to pdf with programs like ghostscript or acrobat distiller}"
}
else {
shell ps2pdf -dAutoPositionEPSFiles=true -dPreserveEPSInfo=true -dAutoRotatePages=/None -dEPSCrop=true "`filename'.eps" "`filename'.pdf"
if "`dropeps'"=="dropeps" | "`dropeps'"=="drop" {
shell rm "`filename'.eps"
}
}
end
Add comment April 27, 2009
Stata graphs and pdf go together like mustard and ice cream
| Gabriel |
[Update, see this follow-up post where I offer a solution]
I was embedding Stata graphs in Lyx and I noticed that the graphs came out kind of fug, with a lot of gratuitous jaggedness. I experimented with it a little and I realized that the problem was that I was having Stata save the graphs as pdf and for some reason it doesn’t do it very well. The top half of this image is part of my graph when I saved it as eps and the bottom half is saved as pdf. (both versions are zoomed in to 400% but you can see it even at 100%).

I really don’t understand why the pdf looks so fug. It’s not fuzzy or jaggy so it’s clearly a vector-graphic not a bitmap, it’s just a vector graphic with very few nodes and gratuitous line thickness. As far as I can tell this is a problem with Stata, not OS X, because if I have Stata create an eps and then use Preview (or for that matter, Lyx/LaTeX) to convert it to pdf, the results look beautiful. Likewise I’ve always been impressed with the deep integration of pdf into OS X. In any case, in the future I’m doing all my Stata graphs as eps even though in some ways pdf is more convenient (such as native integration with “quick look”).
1 comment April 26, 2009
Graphing novels
| Gabriel |
Via Andrew Gelman, I just found this link to TextArc, a package for automated social network visualization of books. Gelman says that (as is often true of visualizations) it’s beautiful and a technical achievement but it’s not clear what the theoretical payoff is. I wasn’t aware of TextArc at the time, but a few years ago I suggested some possibilities for this kind of analysis in a post on orgtheory in which I graphed R. Kelly’s “Trapped in the Closet.” My main suggestion at the time was, and still is, that social network analysis could be a very effective way to distinguish between two basic types of literature:
- episodic works. In these works a single protagonist or small group of protagonists continually meets new people in each chapter. For example The Odyssey or The Adventures of Huckleberry Finn. The network structure for these works will be a star, with a single hub consisting of the main protagonist and a bunch of small cliques each radiating from the hub but not connected to each other. There would also be a steep power-law distribution for centrality.
- serial works. In these works an ensemble of several characters, both major and minor, all seem to know each other but often with low tie strength. For example Neal Stephenson’s Baroque Cycle. This kind of work would have a small world structure. Centrality would follow a poisson but there’d be much higher dispersion for the frequency of repeated interaction across edges.
An engine like TextArc could be very powerful at coding for these things (although I’d want to tweak it so it only draws networks among proper nouns, which would be easy enough with a concordance and/or some regular expression filters.) Of course as Gelman asked, what’s the point?
I can think of several. First, there might be strong associations between network structure and genre. Second, we might imagine that the prevalence of these network structures might change over time. A related issue would be to distinguish between literary fiction and popular fiction. My impression is that currently episodic fiction is popular and serial fiction is literary (especially in the medium of television), but in other eras it was the opposite. A good coding method would allow you to track the relationship between formal structure, genre, prestige, and time.
3 comments April 26, 2009
Workflow and literate programming
| Gabriel |
In a thread over on scatterplot, olderwoman asks for advice on automating the table-making in a report with oodles of cross-tabs. Kieran describes his hardcore geek workflow of using Sweave to integrate R code directly into LaTeX so that he regenerates things on the fly. Although it’s a much less mature project, he mentions that StatWeave is a more portable solution that should generalize this approach to Stata. This sounds very promising but it’s not exactly user-friendly, and let’s face it, the kind of people who like writing raw LaTeX code probably already use R.
The workflow solution I’ve been developing (especially for writing a book) is much simpler but works pretty well for me. Basically, instead of weaving the write-up and math together, I have two files. As long as I execute them in the proper order this is equivalent to using a literate programming solution like StatWeave but much simpler.
The first is a Stata file called graphMMDDYY.do.* This do-file generates all the tables and figures in the book and saves them as png, pdf, or tex. The second are a series of Lyx files called chapterX_MMDDYY.lyx that contain both the actual writing (you know, of words, not code) and pointers to the figures and such. Text-based file formats like html, lyx, and tex don’t directly contain graphics but instead just point to them.** Unlike pasting a graphic into a Word document, when you update the thing they are pointing to it automatically propagates the update. (Of course this can be a disadvantage if you want to know how it used to be). So if I have a lyx file for chapter one that points to a file called ~/book/graphics/graph1_1.png, and then I use Stata to update the graph, then the next time I open Lyx and generate a PDF it’s going to have the new version of the graph.
So my solution gives similar results to the StatWeave approach. My way has two advantages. First, it has a much lower learning curve. Second, it makes it easier to use the same (or overlapping) sets of figures in two files. For instance imagine you wanted to try dozens of specifications for exploratory.pdf but only a few specifications for article.pdf. On the other hand there are two disadvantages to my way. One is that you have to remember to run the do-file before the lyx or tex file, but that’s no big deal. The other is that mine requires more effort on the marginal basis to sync the two files. With StatWeave, you just type the code generating a figure directly into the write-up file. With my way you first write the Stata code generating the figure into the do-file and you then go into the lyx or tex file and create a place-holder targeting the file generated by the Stata code.
*Most of my filenames end with the date because I save a new version of important files every day to facilitate debugging and/or buyer’s remorse about edits. This habit predates Time Machine but I still think it’s a good practice since Time Machine deletes really old versions of your files unless you have a truly ginormous backup disk.
**In the last few years most word processors, including Word, have switched from the old binary file formats to an xml based format. If you’ve ever created a webpage that has both html and jpg files this will be very familiar. However (for some very good reasons) they hide this from the user and make it feel like you’re still using a binary. I don’t know if you can do this on a PC, but on a mac you can select “show package contents” to see inside one of these documents and internally it works much the way I’m describing Lyx. However because Word, Pages, and OpenOffice all keep their own copy of the png file in the package rather than linking the source they would not propagate changes from the original. I’m sure there are ways to get regular word processors to do this (back in System 7 Macs had a clumsy attempt at this called “publish and subscribe”) but I’m happy with the Lyx approach.
Add comment April 7, 2009
Commercial visualization
| Gabriel |
Today Ad Age has a brief story on the use of data visualization techniques in marketing. Many of the things they describe/link are aesthetically appealing and entertaining and this is no small accomplishment. However if you go by Tufte’s standards they don’t accomplish the main purpose of visualization, which is to concisely convey large, complex, and often hyperplex data. In contrast, things like Social Explorer, Pajek graphs, the Herr et. al. network graph of the IMDB, and some of the projects at Microsoft Research are beautiful and informative. However I don’t want to be too gung ho about insisting that visualization meet the Tufte-test — people like ducks and if they see them in ads or music videos it may lead them to be more interested in more serious graphs. Certainly I’m all for anything that leads to popular attention and private sector employment opportunities for culture quants.
Add comment March 20, 2009