| Gabriel |
I’ve found that a very effective way to write complicated do-files is to start the file with a list of globals, which I then refer to throughout the script in various way. This is useful for things that I need to use multiple times (and want to do consistently and concisely) and for things that I need to change a lot, either because I’m experimenting with different specifications or because I might migrate the project to a different computer. The three major applications are directories, specification options, and varlists. So for instance, the beginning of a file I use to clean radio data into diffusion form looks like this:
*directories global parentpath "/Users/rossman/Documents/book/stata" global raw "$parentpath/rawsongs" global clean "$parentpath/clean" *options/switches global catn 15 /*min n for first cut by category*/ global catn2 5 /*min n for second cut*/ global earlywin 60 /*how many days from first add to 5th %ile are OK? */ global earlydrop 1 /*drop adds that are earlier than p5-$earlywin?*/
Using globals to describe directories is the first trick I learned with globals (thanks to some of the excellent programming tutorials at UCLA CCPR). This is particularly useful for cleaning scripts where it’s a good idea to have several directories and move around between them. I often have a raw directory, a “variables to be merged in” directory and a clean directory and I have to switch around between them a lot. Using globals ensures that I spell the directory names consistently and makes it easy to migrate the code to another computer (such as a server). Without globals I’d have to search through the code and manually change every reference to a directory. When I’m feeling really hardcore I start the cleaning script by emptying the clean directory like this
cd $clean shell rm *.*
This ensures that every file is created from scratch and I’m not, for instance, merging on an obsolete file that was created using old data which would be a very bad thing. The next thing I use globals for are (for lack of a better term) options or switches. As described in a previous post, one of the things you can do with this is use the “if” syntax to turn entire blocks of code on or off. There are other ways to use the switches. For instance, one of the issues I deal with in my radio data is where to draw the line between outlier and data. A fairly common diffusion pattern for the second single on an album is that two or three stations add the song when the album is first released, nothing happens for a few months (because most stations are playing the actual first single), and then dozens of stations add the song on or near the single release date, several months later a few stations may add the single here and there. So there’s basically one big wave of diffusion preceded by a few outliers and followed by a few outliers and there are good theoretical reasons for expecting that the causal process for the outliers is different from the main wave. Dealing with these outliers (especially the left ones) is a judgement call and so I like to experiment with different specifications. In the sample globals I have “earlydrop” to indicate whether I’m going to drop early outliers and “earlywin” to define how early is too early. Having these written as globals rather than written directly into the code makes it much easier to experiment with different specifications and see how robust the analysis is to them.
Likewise one of the things I like to do is break out the diffusion curves by some cluster and then another cluster within that cluster. For instance, I like to break out song diffusion by station format (genre) and within format, by the stations’ corporate owner. This way I can see if, holding constant format, Clear Channel stations time adoptions differently that other stations (in fact, they don’t). Of course this raises the question of how small of a cluster is worth bothering with. If a song gets played on 100 country stations, 25 AC stations, and 10 CHR stations, it’s obviously worth modeling the diffusion on country but it’s debatable whether it’s worth doing for AC or CHR. Likewise, if 35 of the country stations are owned by Clear Channel, 10 by Cumulus, and no more than 5 by any other given company, which corporate chains are worth breaking out and which get lumped into not-elsewhere-classified. Treating the thresholds “catn” and “cat2n” as globals lets me easily experiment with where I draw the line between broken out and lumped into n.e.c.
The last thing I use globals for is varlists. Because I mostly do this in analysis files not cleaning files, I don’t have any in the file I’ve been using as an example but here’s a hypothetical example. Suppose that I had exit poll data on a state ballot initiative to create vouchers for private school tuition and my hypothesis was that homeowners vote on this issue rationally so as to preserve their property values. So for instance, people who live in good school districts near bad school districts will oppose the initiative and I measure this with “betterschools” defined as the logged ratio of your school’s test scores to the average test scores of the adjacent schools. Like a lot of sociology papers I construct this as a nested model where I add in groups of variables, starting with the usual suspects, adding in some attitude variables and occupation/sector variables, and then the additive version of my hypothesis, and finally the interactions that really test the hypothesis. Finally I’ve gotten an R+R in which a peer reviewer is absolutely convinced that Cancer’s are selfless whereas Capricorn’s are cynical and demands that I include astrological sign as a control variable and the editor’s cover letter reminds me to address the reviewer’s important concerns about birth sign. It’s much easier to create globals for each group of variables then write the regression model on the globals than it is to write the full varlist for each model. One of the reasons is that it’s easier to change. For instance, imagine that I decide to experiment with specifying education as a dummy vs. years of education, using globals means that I need only change this in “usualsuspects” and the change propagates through the models.
global usualsuspects "edu_ba black female ln_income married nkids" global job "occ1 occ2 occ3 occ4 union pubsector" global attitudes "catholic churchattend premarsx equal4" global home "homeowner betterschools" global homeinteract "homeowner_betterschools" global astrology "aries taurus gemini cancer leo virgo libra scorpio sagittarius capricorn aquarius pisces" eststo clear eststo: logit voteyes $usualsuspects eststo: logit voteyes $usualsuspects $job eststo: logit voteyes $usualsuspects $job $attitudes eststo: logit voteyes $usualsuspects $job $attitudes $home eststo: logit voteyes $usualsuspects $job $attitudes $home $homeinteract eststo: logit voteyes $usualsuspects $job $attitudes $home $homeinteract $astrology esttab using regressiontable.txt , se b(3) se(3) scalars(ll) title(Table: NESTED MODELS OF VOTING FOR SCHOOL VOUCHERS) replace fixed eststo clear