Workflow and literate programming
| Gabriel |
In a thread over on scatterplot, olderwoman asks for advice on automating the table-making in a report with oodles of cross-tabs. Kieran describes his hardcore geek workflow of using Sweave to integrate R code directly into LaTeX so that he regenerates things on the fly. Although it’s a much less mature project, he mentions that StatWeave is a more portable solution that should generalize this approach to Stata. This sounds very promising but it’s not exactly user-friendly, and let’s face it, the kind of people who like writing raw LaTeX code probably already use R.
The workflow solution I’ve been developing (especially for writing a book) is much simpler but works pretty well for me. Basically, instead of weaving the write-up and the statistical code together in one document, I keep them in two sets of files. As long as I execute them in the proper order, this gives the same result as a literate programming solution like StatWeave with a lot less machinery.
The first is a Stata file called graphMMDDYY.do.* This do-file generates all the tables and figures in the book and saves them as png, pdf, or tex. The second is a series of LyX files called chapterX_MMDDYY.lyx that contain both the actual writing (you know, of words, not code) and pointers to the figures and such. Text-based file formats like html, lyx, and tex don’t directly contain graphics but instead just point to them.** Unlike pasting a graphic into a Word document, when you update the file they point to, the change automatically propagates to the document the next time you build it. (Of course this can be a disadvantage if you want to know how it used to be.) So if I have a lyx file for chapter one that points to a file called ~/book/graphics/graph1_1.png, and then I use Stata to update the graph, the next time I open LyX and generate a PDF it’s going to have the new version of the graph.
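To make the pointer idea concrete, here is a minimal Python sketch (Python is just a convenient choice for illustration, not part of the workflow described above). It assumes the chapter has been exported to plain LaTeX, where a figure shows up as an \includegraphics command. The function name figure_targets and the sample text are hypothetical.

```python
import re

# Matches \includegraphics{path} and \includegraphics[options]{path}.
INCLUDE_RE = re.compile(r"\\includegraphics(?:\[[^\]]*\])?\{([^}]+)\}")


def figure_targets(tex_source):
    """Return the graphics paths a LaTeX source points to, in order."""
    return INCLUDE_RE.findall(tex_source)


# Hypothetical excerpt from a chapter exported to LaTeX.
sample = r"""
Figure 1 shows the trend.
\includegraphics[width=\textwidth]{graphics/graph1_1.png}
\includegraphics{graphics/graph1_2.pdf}
"""

print(figure_targets(sample))
```

The write-up stores only these paths, so whatever file sits at a path when the PDF is generated is what ends up in the document.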
So my solution gives similar results to the StatWeave approach. My way has two advantages. First, it has a much gentler learning curve. Second, it makes it easier to use the same (or overlapping) sets of figures in two write-ups. For instance, imagine you wanted to try dozens of specifications for exploratory.pdf but only a few for article.pdf: one do-file can generate all of them and each write-up just points to the ones it needs. On the other hand, there are two disadvantages to my way. One is that you have to remember to run the do-file before building the lyx or tex file, but that’s no big deal. The other is that mine takes more effort at the margin to keep the two files in sync. With StatWeave, you just type the code generating a figure directly into the write-up file. With my way, you first write the Stata code generating the figure into the do-file, and then you go into the lyx or tex file and create a placeholder targeting the file generated by the Stata code.
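The "run the do-file first" rule can be checked mechanically. The Python sketch below is one possible way to do it, assuming a single figure per do-file for simplicity. It compares file modification times and reports which steps are needed, always listing the figure step before the write-up step. The function names and the step labels ("run", "export") are hypothetical placeholders for however you launch Stata and LyX on your machine.

```python
import os


def is_stale(output_path, source_path):
    """True if the output is missing or older than the source that produces it."""
    if not os.path.exists(output_path):
        return True
    return os.path.getmtime(output_path) < os.path.getmtime(source_path)


def steps_to_run(do_file, figure, lyx_file, pdf):
    """List the steps needed, with the figure step always ahead of the write-up."""
    steps = []
    figures_stale = is_stale(figure, do_file)
    if figures_stale:
        steps.append("run " + do_file)
    # The PDF must be rebuilt if the figure is about to change, or if the PDF
    # is older than either the write-up or the figure it points to.
    if figures_stale or is_stale(pdf, lyx_file) or is_stale(pdf, figure):
        steps.append("export " + lyx_file)
    return steps
```

An empty list means the PDF already reflects the current do-file and write-up.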
*Most of my filenames end with the date because I save a new version of important files every day, which makes debugging easier and lets me recover from buyer’s remorse about edits. This habit predates Time Machine, but I still think it’s a good practice since Time Machine deletes really old versions of your files unless you have a truly ginormous backup disk.
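For anyone who wants to automate the daily copy, here is a small Python sketch of the naming scheme. The MMDDYY pattern comes from the filenames above. The function names, and the choice to skip the copy when today's file already exists, are assumptions made for illustration.

```python
import datetime
import os
import shutil


def dated_name(stem, ext, day):
    """Build a name like graph031509.do from a stem, an extension, and a date."""
    return f"{stem}{day.strftime('%m%d%y')}.{ext}"


def save_todays_version(latest_path, stem, ext, day=None):
    """Copy the most recent version to a new file stamped with the given date."""
    day = day or datetime.date.today()
    target = os.path.join(os.path.dirname(latest_path), dated_name(stem, ext, day))
    if not os.path.exists(target):  # never overwrite an existing daily copy
        shutil.copy2(latest_path, target)
    return target
```

Calling save_todays_version on yesterday's do-file at the start of a work session leaves the old version untouched and gives you a fresh file to edit.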
**In the last few years most word processors, including Word, have switched from the old binary file formats to an XML-based format. If you’ve ever created a webpage that has both html and jpg files, this will be very familiar. However (for some very good reasons) they hide this from the user and make it feel like you’re still using a binary. I don’t know if you can do this on a PC, but on a Mac you can select “show package contents” to see inside one of these documents, and internally it works much the way I’m describing LyX. However, because Word, Pages, and OpenOffice all keep their own copy of the png file in the package rather than linking to the source, they would not propagate changes from the original. I’m sure there are ways to get regular word processors to do this (back in System 7, Macs had a clumsy attempt at this called “publish and subscribe”), but I’m happy with the LyX approach.
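You can see the embedded copies for yourself with a few lines of Python. This sketch relies on the fact that a .docx file is a ZIP archive in which Word stores images under word/media/. The function name embedded_media is hypothetical, and the archive built here is a tiny in-memory stand-in rather than a real document.

```python
import io
import zipfile


def embedded_media(docx):
    """List image files stored inside a .docx (a path or a file-like object)."""
    with zipfile.ZipFile(docx) as package:
        return [n for n in package.namelist() if n.startswith("word/media/")]


# Build a stand-in archive in memory so the example runs without a real file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("word/document.xml", "<w:document/>")
    z.writestr("word/media/image1.png", b"not a real png")

print(embedded_media(buffer))
```

Each entry under word/media/ is a private copy of the graphic, which is why regenerating the original png does nothing to the document.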