Archive for April, 2009

Why did you do that?

| Gabriel |

In a previous post, I applied diffusion methods to interpret the conversion of the Roman empire. Today I’m thinking about the conversion of one particular Roman and what it can teach us about the problem of accounts for action. In Confessions, Augustine of Hippo describes his conversion to Christianity and makes important contributions to theology and philosophy. The book is important to the history of Western thought both for its impact on Christian doctrine and (my concern) because it was the first introspective memoir. Augustine tells us much more about how he felt and why he did things than about what he actually did. Most obviously, he frequently laments his lust but doesn’t give us any of the dirt. After an introductory prayer, the book begins by telling us that he’s not completely positive that he remembers it, but he’s pretty sure that he was a sinner as a baby, and it goes on like that from there. A typical line about his boyhood goes “For in thy eyes, what was more infamous than I was already, since I displeased even my own kind and deceived, with endless lies, my tutor, my masters and parents–all from a love of play, a craving for frivolous spectacles, a stage-struck restlessness to imitate what I saw in these shows?”

Contrast this with this passage from Gallic Wars, “When Caesar was informed by spies that the Helvetii had already conveyed three parts of their forces across that river, but that the fourth part was left behind on this side of the Saone, he set out from the camp with three legions during the third watch, and came up with that division which had not yet crossed the river.” Caesar’s memoir is an extreme case of all plot, no character, but most other ancient works were similar. Xenophon’s Anabasis is also written in third person and focuses on plot. Xenophon never describes his motives or feelings in the narrator’s voice but only in the dialogue when he answers direct criticisms of his leadership by other soldiers at assembly. (In a few places Xenophon does provide character portraits of other people, most notably a sycophantic obituary for Cyrus and a hilariously nasty obituary for Menon). The closest thing you get to prose emphasizing personality and mental states prior to Augustine are Plutarch’s Lives and 1st and 2nd Samuel but these are biographies not autobiographies and they mostly take a “show, don’t tell” approach.

Anyway, much of Confessions consists of Augustine explaining his actions, which can broadly be categorized as sinning and conversion. His explanation for his sins is primarily concupiscence (i.e., original sin has distorted human nature such that men are depraved) and secondarily social contingencies, such as his parents emphasizing his social advancement over his moral education or his trying to impress other young miscreants. His explanation for his conversion is more complex. On one level he emphasizes social connections to Christianity. His father was a pagan but his mother, Monica, was a Christian and gave him an early education in Christianity, which he rejected as a young man encountering the sophistication of pagan philosophy. Later, as a young professor of rhetoric in Milan, he saw Ambrose preach. Augustine was an intellectual snob who had until then thought of Christianity as embarrassingly simple-minded, so to encounter a sophisticated and articulate bishop was very impressive to him and he became close to Ambrose. Meanwhile Monica and some of Augustine’s friends continued to push him to Christianity. It was only after Augustine came under the tutelage of Ambrose and returned to being close to his mother that he heard a voice in the garden saying “Take and read” whereby he opened Paul’s letter to the Romans, read a few sentences, and experienced a religious epiphany after which he consented to be baptized and ordained (dumping both his girlfriend and his fiancee in the process). Although Augustine tells us everything we need to know about the gradual influence exerted by Monica and Ambrose, he emphasizes the incident in the garden as the moment when he was converted.

To me Confessions illustrates both the potential and the problems of methodologies (such as in-depth interviews) that rely on actors giving accounts for the meanings of their actions. Note that Augustine is the best case scenario for accounts of action as he was not your average social science study respondent interviewed over the course of an hour but one of the world’s greatest philosophers who wrote an entire book of profound introspection. Yet even Augustine’s account of his conversion is self-evidently problematic.

Augustine recognizes the influence of his mother and friends. Likewise, he describes the gradual process by which he became intellectually disenchanted with Manichean dualism and interested in Christianity as consistent with neo-Platonist monism. Nonetheless he emphasizes the moment of grace in the garden. I have a feeling that if you put your digital voice recorder on the table and interviewed Augustine he’d give you very different accounts depending on whether you asked “when did you become a Christian,” “how did you become a Christian,” or “why did you become a Christian.” To the first question he’d just tell you about the garden, to the second he’d tell you about Monica and Ambrose but still close with the garden, and to the third he’d give an entirely non-biographical answer about neo-Platonism.

One way to interpret Augustine’s emphasis of the garden is that he is following a cultural script. In this interpretation he is trying to make sense of his own religious experience as comparable to the prototypical conversion, Paul hearing a voice on the road to Damascus (which itself echoes such passages from Tanakh as Moses encountering the burning bush). The accepted cultural script of conversion is not to debate religion rationally for decades before finally giving in to it, but to have an epiphany where God’s grace opens your heart. Even if the rational debate was vastly more important to Augustine, that’s not how it’s supposed to go and so he emphasizes the comparatively minor incident in the garden which fits the cultural script much better. In this respect Augustine is like many modern people (especially evangelicals but also other traditions as in Christensen’s testimony affirming his Mormonism) who are raised in their church but nonetheless can construct “conversion” narratives of the point in which they personally affirmed the religion of their upbringing.

Another interpretation (which is compatible with the first) is that Augustine really did have a religious epiphany in the garden but this epiphany was only the final stage of a process that was overwhelmingly mundane and gradual. A lot of recent work on cognition has established that, like fortune, insight favors the prepared. We subjectively experience insight as a sudden revelation of an often complex idea with all the parts hanging together fully-formed. However this only comes as the culmination of a long period of rumination. So Augustine had been thinking about Christianity and neo-Platonism for decades before he had an insight that synthesized these thoughts and finally brought him to Jesus. At the moment it probably did feel subjectively to Augustine like his mind had experienced a qualitative shift whereas his previous thinking to that point had been only evolutionary.

The same thing applies to much more mundane insights than the religious epiphany of a saint. I subjectively experience the basic concepts for most of my study designs as conceptual insights rather than things that I develop slowly. For instance, when I was in grad school I experienced a burst of insight of a complete methodology involving (what I later learned already existed and was called) a cross-classified fixed-effects model. It subjectively came to me all at once, but this was after I had been thinking fairly intensively about the meaning of fixed-effects for over two years. If I were trained in a theoretical tradition that emphasized cultural scripts of creative genius over the accumulation of knowledge I would probably emphasize the moment of insight when the method came to me and ignore the long period of thinking and tinkering that led up to it.

Anyway, my point is that even someone as brilliant as Augustine is incapable of really completely understanding his own motives, in part because both cultural scripts and the subjective experience of cognition push him to emphasize certain narratives over others. If we can’t take Augustine’s testimony about the most important decision of his life at face value, it gets even trickier to interpret transcribed in-depth interviews, let alone closed-ended GSS attitude questions that all start out with “do you strongly agree, somewhat agree, neither agree nor disagree, …” You can get really cynical about it and adopt a hyper-structural perspective (e.g., early Marxism, exchange theory or networks in sociology, revealed preferences in economics), where the actor’s account of subjective experience drops out almost entirely and all we care about is action. On the other hand, even if you are that cynical about the causes of action, the cultural scripts are fascinating objects of study in and of themselves. Certainly it’s very interesting that conversion narratives often culminate in an epiphany, even if we think that conversion is actually a process involving influences through social networks and gradual rumination.


April 30, 2009 at 6:15 am 2 comments

Have a nice day

| Gabriel |

In the comments, Mike3550 noticed that many of my scripts end:

*have a nice day

Believe it or not, this is actually a useful trick, the reason being that Stata only executes lines that end in a carriage return. Including a gratuitous comment at the end of the script is an easy way to ensure that Stata executes the last real command. If “*have a nice day” is too sappy for you, it works just as well to use “*it puts the lotion on its skin or else it gets the hose again.”
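As a minimal sketch of what this looks like in practice (the commands are just placeholders using Stata's built-in auto dataset, not anything from my actual scripts), the point is simply that the last real command is followed by a throwaway comment line:

```stata
* placeholder analysis on the auto dataset that ships with Stata
sysuse auto, clear
summarize price mpg
regress price mpg
*have a nice day
```

Even if the editor never writes a carriage return after the final comment, the regress line above it is already terminated and so gets executed.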

April 29, 2009 at 6:37 am 1 comment

R, Stata and “non-rectangular” data

| Pierre |

Thanks Gabriel for letting me join you on this blog. For those who don’t know me, my name is Pierre, I am a graduate student in sociology at Princeton and I’ve been doing work on organizations, culture and economic sociology (guess who’s on my committee). Recently, I’ve become interested in diffusion processes — in quite unrelated domains: the emergence of new composers and their adoption in orchestra repertoires, the evolution of attitudes towards financial risk, the diffusion of stock-ownership and the recent stock-market booms and busts.

When Gabriel asked me if I wanted to post on this Stata/soc-of-culture-oriented blog, I first told him I was actually slowly moving away from Stata and using R more and more frequently… which is going to be the topic of my first post. I am not trying to start the first flamewar of “Code and culture” — rather I’d like to argue that both languages have their own strengths and weaknesses; the important thing for me is not to come to a definitive conclusion (“Stata is better than R” or vice versa) and only use one package while discarding the other, but to identify conditions under which R or Stata are more or less painful to use for the type of data analysis I am working on.

People usually emphasize graphics functions and the number of high-quality user-contributed packages for cutting-edge models as being R’s greatest strengths over other statistical packages. I have to say I don’t run very often into R estimation functions for which I can’t find an equivalent Stata command. And while I agree that R-generated graphs can be amazingly cool, Stata has become much better in recent years. For me, R is particularly useful when I need to manipulate certain kinds of data and turn them into a “rectangular” dataset:

Stata is perfect for “rectangular” data, when the dataset fits nicely inside a rectangle of observations (rows) and variables (columns) and when the conceptual difference between rows and columns is clear; this is what a dataset will have to look like just before running a regression. But things can get complicated when the raw dataset I need to manipulate is not already “rectangular”: this may include network data and multilevel data, even when the ultimate goal is to turn these messy-looking data, sometimes originating from multiple sources, into a nice rectangular dataset that can be analyzed with a simple linear model… Sure, Stata has a few powerful built-in commands (although I’d be embarrassed to say how many times I had to recheck the proper syntax for “reshape” in the Stata help). But sometimes egen, merge, expand, collapse and reshape won’t do the trick… and I find myself sorting, looping, using, saving and merging until I realize (too late of course!) that Stata can be a horrible, horrible tool when it comes to manipulating datasets that are not already rectangular. R on the other hand has two features that make it a great tool for data management:

  1. R can have multiple objects loaded in memory at the same time. Stata on the other hand can only work on one dataset at a time, which is not just inefficient (you always need to write the data into temporary files and read a new file to switch from one dataset to another); it can also unnecessarily add lines to the code and create confusion.
  2. R can easily handle multiple types of objects: vectors, matrices, arrays, data frames (i.e. datasets), lists, functions… Stata on the other hand is mostly designed to work on datasets: most commands take variables or variable lists as input; and when Stata tries to handle other types of objects (matrices, scalars, macros, ado files…), Stata uses distinct commands each with a different syntax (e.g. “matrix define”, “scalar”, “local”, “global”, “program define” instead of “generate”…) and sometimes a completely different language (Mata for matrix operations — which I have never had the patience to learn). R on the other hand handles these objects in a simple and consistent manner (for example it uses the same assignment operator “<-” for a matrix, a vector, an array, a list or a function…) and can extract elements which are seamlessly “converted” into other object types (e.g. a column of a matrix, or coefficients/standard errors from a model are by definition vectors, which can be treated as such and added as variables in a data frame, without even using any special command à la “svmat”).
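To make the first point concrete, here is a rough sketch of the save/use/merge shuffle in Stata, using the built-in auto dataset. The tempfile names and the new variable mean_price are made up for illustration, and the “merge m:1” syntax assumes Stata 11 or later. (For this particular toy task a single egen call would do; the sketch is only meant to show the pattern you fall into when it doesn't.)

```stata
* Sketch only: attach group means to car-level data, one dataset at a time
sysuse auto, clear
tempfile cars means
save `cars'                  /* park the car-level data on disk */
collapse (mean) mean_price=price, by(foreign)
save `means'                 /* park the group-level data on disk */
use `cars', clear            /* swap the first dataset back into memory */
merge m:1 foreign using `means'
assert _merge==3             /* every car should match a group mean */
drop _merge
list make price mean_price in 1/5
```

Two datasets, two temporary files, and several lines of bookkeeping that have nothing to do with the analysis itself.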

In my next post, I’ll try to explain why I keep using Stata despite all this…

April 28, 2009 at 12:57 pm 9 comments

Stata graphs in pdf fixed

| Gabriel |

Yesterday I complained that Stata makes ugly pdf graphs. Today I decided to fix it. Not only does this code make graphs that aren’t ugly, but it should port to Stata for Linux or Solaris (the current pdf feature only “works” on Stata for Mac). This little program first saves a graph as eps (which Stata does well) and then uses the command line utility “ps2pdf” to make the pdf. Deleting the eps file is optional.

This program runs a little slower than the native pdf export but on the other hand it’s usually faster to load a pdf than an eps into a graphics program and I tend to be more impatient about that sort of thing (which I do interactively) than I am about running Stata (which I usually batch).

(Update 4/27 3:34pm, I have added the program to the ssc directory. So instead of copying the code below you can just type “ssc install graphexportpdf, replace”. Thanks to Kit Baum for some help with the syntax parsing)

*1.0.1 GHR April 27, 2009
*save the current graph as eps, then convert the eps to pdf with ps2pdf
program graphexportpdf
	version 10
	set more off
	syntax anything(name=filename) [, DROPeps replace]
	*strip the .pdf extension if the user supplied one
	local extension=regexm("`filename'","(.+)\.pdf")
	if `extension'==1 {
		disp "{text:note, the file extension .pdf is allowed but not necessary}"
		local filename=regexs(1)
	}
	if "`replace'"=="replace" {
		disp "{text:note, replace option is always on with graphexportpdf}"
	}
	graph export "`filename'.eps", replace
	if "$S_OS"=="Windows" {
		disp "{error:Sorry, this command only works properly with Mac, Linux, and Solaris. Although I can't make a pdf for you, I have generated an eps file that you can convert to pdf with programs like ghostscript or acrobat distiller}"
	}
	else {
		*ps2pdf is part of ghostscript and has to be on the shell path
		shell ps2pdf -dAutoPositionEPSFiles=true -dPreserveEPSInfo=true -dAutoRotatePages=/None -dEPSCrop=true "`filename'.eps" "`filename'.pdf"
		if "`dropeps'"=="dropeps" | "`dropeps'"=="drop" {
			shell rm "`filename'.eps"
		}
	}
end
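A short usage sketch, assuming the program has been installed from ssc and that ps2pdf (part of ghostscript) is on the shell path. The graph and the file name price_by_mpg are just placeholders:

```stata
* Sketch only: draw any graph, then export it as pdf via eps
ssc install graphexportpdf, replace
sysuse auto, clear
scatter price mpg
graphexportpdf price_by_mpg, dropeps   /* writes price_by_mpg.pdf, deletes the eps */
```

Leaving off the dropeps option keeps both the eps and the pdf in the working directory.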

April 27, 2009 at 11:37 am 3 comments

Did you control for this?

| Gabriel |

Probably because we deal with observational data (as compared to randomized experiments), sociologists tend to believe that the secret to a good model is throwing so many controls into it that the journal would need to have a centerfold to print the table. This is partly based on the idea that the goal of analysis is to strive for an r-squared of 1.0 or that it’s fun to treat control variables like a gift bag of theoretically irrelevant but substantively cute micro-findings.

The real problem though is based on a misconception of how control variables relate to hypothesis testing. To avoid spurious findings you do not need to control for everything that is plausibly related to the dependent variable. What you do need to control for is everything that is plausibly related to both the dependent variable and the independent variable. If a variable is uncorrelated with your independent variable then you don’t need to control for it even if it’s very good at explaining the dependent variable.

You can demonstrate this with a very simple simulation (code is below) by generating an independent variable and three controls.

  • Goodcontrol is related to both X and Y
  • Stupidcontrol1 is related to Y but not X
  • Stupidcontrol2 is related to X but not Y

Compared to a simple bivariate regression, both goodcontrol and stupidcontrol1 greatly improve the overall fit of the model. (Stupidcontrol2 doesn’t, so most people would drop it.) But controls aren’t about improving fit, they’re about avoiding spurious results. So what we really should care about is not how r-squared changes or the size of beta-stupidcontrol1, but how beta-x changes. Here the results do differ. Introducing stupidcontrol1 leaves beta-x essentially unchanged but introducing goodcontrol changes beta-x appreciably. The moral of the story: it’s only worthwhile to control for variables that are correlated with both the independent variable and the dependent variable.

Note that in this simulation we know by assumption which variables are good and which are stupid. In real life we often don’t know this, and just as importantly peer reviewers don’t know either. Whether it’s worth it to leave stupidcontrol1 and stupidcontrol2 in the model purely for “cover your ass” effect is an exercise left to the reader. However if a reviewer asks you to include “stupidcontrol1” and it proves unfeasible to collect this data then you can always try reminding the reviewer that you agree that it is probably a good predictor of the outcome, but your results would nonetheless be robust to its inclusion because there is no reason to believe that it is also correlated with your independent variable. I’m not guaranteeing that the reviewers would believe this, only that they should believe this so long as they agree with your assumption that stupidcontrol1 will have a low correlation with X.

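*eststo and esttab come from the estout package ("ssc install estout")
clear
set seed 42 /*arbitrary seed, added so the results are reproducible*/
set obs 10000
gen x=rnormal()
gen goodcontrol=rnormal()/2+(x/2)    /*related to x and (below) to y*/
gen stupidcontrol1=rnormal()         /*related to y but not x*/
gen stupidcontrol2=rnormal()/2+(x/2) /*related to x but not y*/
gen error=rnormal()
gen y=x+goodcontrol+stupidcontrol1+error
eststo clear
quietly eststo: reg y x
quietly eststo: reg y x goodcontrol
quietly eststo: reg y x stupidcontrol1
quietly eststo: reg y x stupidcontrol2
quietly eststo: reg y x goodcontrol stupidcontrol1 stupidcontrol2
esttab, scalars(r2) nodepvars nomtitles
*have a nice day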
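If you want to see the arithmetic behind the change in beta-x, here is a hedged sketch (not part of the original simulation; the seed and the scalar names b_short, b_long, b_gc and delta are arbitrary choices of mine). It checks the standard omitted-variable identity for OLS: beta-x without the control equals beta-x with the control plus the control's coefficient times the slope from regressing the control on x.

```stata
* Sketch only: rebuilds the simulated data so the block runs on its own
clear
set seed 12345
set obs 10000
gen x=rnormal()
gen goodcontrol=rnormal()/2+(x/2)
gen stupidcontrol1=rnormal()
gen y=x+goodcontrol+stupidcontrol1+rnormal()

quietly reg y x
scalar b_short=_b[x]              /* beta-x omitting goodcontrol */
quietly reg y x goodcontrol
scalar b_long=_b[x]               /* beta-x controlling for goodcontrol */
scalar b_gc=_b[goodcontrol]
quietly reg goodcontrol x
scalar delta=_b[x]                /* how strongly the control tracks x */

display "beta-x without goodcontrol: " b_short
display "beta-x with goodcontrol:    " b_long
display "b_long + b_gc*delta:        " b_long+b_gc*delta
assert reldif(b_short, b_long+b_gc*delta)<1e-6
```

By construction the first number should come out at roughly 1.5 and the second at roughly 1.0, since goodcontrol carries half of x with it. Run the same three regressions with stupidcontrol1 in place of goodcontrol and delta is close to zero, which is why beta-x barely moves no matter how well stupidcontrol1 predicts y.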

April 27, 2009 at 5:29 am 9 comments

Stata graphs and pdf go together like mustard and ice cream

| Gabriel |

[Update, see this follow-up post where I offer a solution]
[Update 2, this issue is fixed as of Stata 11.2 (3/30/11), “graph export foo.pdf” now works great on Macs]

I was embedding Stata graphs in Lyx and I noticed that the graphs came out kind of fug, with a lot of gratuitous jaggedness. I experimented with it a little and I realized that the problem was that I was having Stata save the graphs as pdf and for some reason it doesn’t do it very well. The top half of this image is part of my graph when I saved it as eps and the bottom half is saved as pdf. (both versions are zoomed in to 400% but you can see it even at 100%).

[Image: eps (top) vs pdf (bottom)]

I really don’t understand why the pdf looks so fug. It’s not fuzzy or jaggy so it’s clearly a vector-graphic not a bitmap, it’s just a vector graphic with very few nodes and gratuitous line thickness. As far as I can tell this is a problem with Stata, not OS X, because if I have Stata create an eps and then use Preview (or for that matter, Lyx/LaTeX) to convert it to pdf, the results look beautiful. Likewise I’ve always been impressed with the deep integration of pdf into OS X. In any case, in the future I’m doing all my Stata graphs as eps even though in some ways pdf is more convenient (such as native integration with “quick look”).

April 26, 2009 at 10:45 pm 3 comments

Graphing novels

| Gabriel |

Via Andrew Gelman, I just found this link to TextArc, a package for automated social network visualization of books. Gelman says that (as is often true of visualizations) it’s beautiful and a technical achievement but it’s not clear what the theoretical payoff is. I wasn’t aware of TextArc at the time, but a few years ago I suggested some possibilities for this kind of analysis in a post on orgtheory in which I graphed R. Kelly’s “Trapped in the Closet.” My main suggestion at the time was, and still is, that social network analysis could be a very effective way to distinguish between two basic types of literature:

  1. episodic works. In these works a single protagonist or small group of protagonists continually meets new people in each chapter. For example The Odyssey or The Adventures of Huckleberry Finn. The network structure for these works will be a star, with a single hub consisting of the main protagonist and a bunch of small cliques each radiating from the hub but not connected to each other. There would also be a steep power-law distribution for centrality.
  2. serial works. In these works an ensemble of several characters, both major and minor, all seem to know each other but often with low tie strength. For example Neal Stephenson’s Baroque Cycle. This kind of work would have a small world structure. Centrality would follow a Poisson but there’d be much higher dispersion for the frequency of repeated interaction across edges.

An engine like TextArc could be very powerful at coding for these things (although I’d want to tweak it so it only draws networks among proper nouns, which would be easy enough with a concordance and/or some regular expression filters.) Of course as Gelman asked, what’s the point?

I can think of several. First, there might be strong associations between network structure and genre. Second, we might imagine that the prevalence of these network structures might change over time. A related issue would be to distinguish between literary fiction and popular fiction. My impression is that currently episodic fiction is popular and serial fiction is literary (especially in the medium of television), but in other eras it was the opposite. A good coding method would allow you to track the relationship between formal structure, genre, prestige, and time.

April 26, 2009 at 1:12 pm 3 comments


| Gabriel |

I’ve found that a very effective way to write complicated do-files is to start the file with a list of globals, which I then refer to throughout the script in various ways. This is useful for things that I need to use multiple times (and want to do consistently and concisely) and for things that I need to change a lot, either because I’m experimenting with different specifications or because I might migrate the project to a different computer. The three major applications are directories, specification options, and varlists. So for instance, the beginning of a file I use to clean radio data into diffusion form looks like this:

global parentpath "/Users/rossman/Documents/book/stata"
global raw        "$parentpath/rawsongs"
global clean      "$parentpath/clean"
global catn       15 /*min n for first cut by category*/
global catn2      5  /*min n for second cut*/
global earlywin   60 /*how many days from first add to 5th %ile are OK? */
global earlydrop  1  /*drop adds that are earlier than p5-$earlywin?*/

Using globals to describe directories is the first trick I learned with globals (thanks to some of the excellent programming tutorials at UCLA CCPR). This is particularly useful for cleaning scripts where it’s a good idea to have several directories and move around between them. I often have a raw directory, a “variables to be merged in” directory and a clean directory and I have to switch around between them a lot. Using globals ensures that I spell the directory names consistently and makes it easy to migrate the code to another computer (such as a server). Without globals I’d have to search through the code and manually change every reference to a directory. When I’m feeling really hardcore I start the cleaning script by emptying the clean directory like this

cd "$clean"
shell rm *.*

This ensures that every file is created from scratch and I’m not, for instance, merging on an obsolete file that was created using old data, which would be a very bad thing.

The next thing I use globals for is (for lack of a better term) options or switches. As described in a previous post, one of the things you can do with this is use the “if” syntax to turn entire blocks of code on or off. There are other ways to use the switches. For instance, one of the issues I deal with in my radio data is where to draw the line between outlier and data. A fairly common diffusion pattern for the second single on an album is that two or three stations add the song when the album is first released, nothing happens for a few months (because most stations are playing the actual first single), and then dozens of stations add the song on or near the single release date. Several months later a few stations may add the single here and there. So there’s basically one big wave of diffusion preceded by a few outliers and followed by a few outliers, and there are good theoretical reasons for expecting that the causal process for the outliers is different from the main wave. Dealing with these outliers (especially the left ones) is a judgement call and so I like to experiment with different specifications. In the sample globals I have “earlydrop” to indicate whether I’m going to drop early outliers and “earlywin” to define how early is too early. Having these written as globals rather than written directly into the code makes it much easier to experiment with different specifications and see how robust the analysis is to them.
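Here is a minimal sketch of how such a switch might be wired up. This is not my actual cleaning code: the toy data and the variable names song_id and add_date are made up for illustration, and only the two globals come from the list above.

```stata
* Sketch only: toy data with a main wave of adds and two very early adds
clear
set obs 100
gen song_id=1
gen add_date=18000+_n          /* main wave: one add per day */
replace add_date=17000 in 1/2  /* two adds long before the main wave */

global earlywin   60 /*how many days from first add to 5th %ile are OK? */
global earlydrop  1  /*1 = drop early outliers, 0 = keep them*/

if $earlydrop==1 {
	bysort song_id: egen p5=pctile(add_date), p(5)
	drop if add_date<p5-$earlywin
	drop p5
}
count
```

With the switch set to 1 the two early adds should be dropped; setting earlydrop to 0 or widening earlywin at the top of the file changes the specification without touching the block itself.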

Likewise one of the things I like to do is break out the diffusion curves by some cluster and then another cluster within that cluster. For instance, I like to break out song diffusion by station format (genre) and within format, by the stations’ corporate owner. This way I can see if, holding constant format, Clear Channel stations time adoptions differently than other stations (in fact, they don’t). Of course this raises the question of how small of a cluster is worth bothering with. If a song gets played on 100 country stations, 25 AC stations, and 10 CHR stations, it’s obviously worth modeling the diffusion on country but it’s debatable whether it’s worth doing for AC or CHR. Likewise, if 35 of the country stations are owned by Clear Channel, 10 by Cumulus, and no more than 5 by any other given company, which corporate chains are worth breaking out and which get lumped into not-elsewhere-classified? Treating the thresholds “catn” and “catn2” as globals lets me easily experiment with where I draw the line between broken out and lumped into n.e.c.

The last thing I use globals for is varlists. Because I mostly do this in analysis files not cleaning files, I don’t have any in the file I’ve been using as an example, but here’s a hypothetical example. Suppose that I had exit poll data on a state ballot initiative to create vouchers for private school tuition and my hypothesis was that homeowners vote on this issue rationally so as to preserve their property values. So for instance, people who live in good school districts near bad school districts will oppose the initiative and I measure this with “betterschools” defined as the logged ratio of your school’s test scores to the average test scores of the adjacent schools. Like a lot of sociology papers I construct this as a nested model where I add in groups of variables, starting with the usual suspects, adding in some attitude variables and occupation/sector variables, and then the additive version of my hypothesis, and finally the interactions that really test the hypothesis. Finally I’ve gotten an R+R in which a peer reviewer is absolutely convinced that Cancers are selfless whereas Capricorns are cynical and demands that I include astrological sign as a control variable, and the editor’s cover letter reminds me to address the reviewer’s important concerns about birth sign. It’s much easier to create globals for each group of variables and then write the regression model on the globals than it is to write the full varlist for each model. One of the reasons is that it’s easier to change. For instance, imagine that I decide to experiment with specifying education as a dummy vs. years of education. Using globals means that I need only change this in “usualsuspects” and the change propagates through the models.

global usualsuspects    "edu_ba black female ln_income married nkids"
global job              "occ1 occ2 occ3 occ4 union pubsector"
global attitudes        "catholic churchattend premarsx equal4"
global home             "homeowner betterschools"
global homeinteract     "homeowner_betterschools"
global astrology        "aries taurus gemini cancer leo virgo libra scorpio sagittarius capricorn aquarius pisces"
eststo clear
eststo: logit voteyes $usualsuspects
eststo: logit voteyes $usualsuspects $job
eststo: logit voteyes $usualsuspects $job $attitudes
eststo: logit voteyes $usualsuspects $job $attitudes $home
eststo: logit voteyes $usualsuspects $job $attitudes $home $homeinteract
eststo: logit voteyes $usualsuspects $job $attitudes $home $homeinteract $astrology
esttab using regressiontable.txt , se b(3) se(3) scalars(ll) title(Table: NESTED MODELS OF VOTING FOR SCHOOL VOUCHERS) replace fixed
eststo clear

April 25, 2009 at 11:42 am

Choosing a (Mac) text editor with Stata

| Gabriel |

I love TextWrangler (free) but I was a little frustrated that it doesn’t allow code folding. (I should note that it does support everything else on my text editor wish list, most notably regular expressions, syntax highlighting, and pushing). I’d seen code folding in action with html editors like Bluefish and this struck me as a great feature. If you’re not familiar with it, code folding is when you hide some block of code, usually a subroutine or loop. TextWrangler’s big brother, BBEdit ($49 educational, $125 commercial) does offer code folding but it’s only really useful if the syntax files are written to make it work because the program has to be able to recognize what a loop looks like. Unfortunately BBEdit doesn’t come with Stata syntax files and the excellent TextWrangler Stata language file written by dataninja doesn’t support BBEdit-only features like code folding. I looked to see if I could figure out how to write code folding syntax into dataninja’s language module but I couldn’t find any documentation about code folding in the developer kit and in any case I’m not that talented. Aaargh!

In desperation I’ve considered using another text editor, even though I really like TextWrangler. Apparently Kate (free) works pretty well with Stata but there’s no Mac version available. (In theory I could recompile it using Fink but that never works for me). Likewise Notepad++ (free) has an excellent Stata syntax file and I highly recommend it to people who use Windows, but there’s no Mac version so I’d have to run it through CrossOver/Wine and again that’s a hassle (the key bindings are different and you lose access to the native file browser and AppleScript integration). UltraEdit ($49) also has good Stata support and apparently it will be ported to Mac/Linux, but it’s not going to be out for a few months.

Editra (free) is a very well-featured and cross-platform editor, but there’s no Stata syntax file yet, nor can I figure out how to write one. One minor limitation I’ve noticed is that Editra can’t handle extremely long rows, but I only ever used extremely long rows for file list globals and there’s a better way to do that. A nice feature is that the language support is in the app package (as compared to “~/Library/Application Support”) which makes it easier to run off a key. Likewise Smultron keeps the syntax in the app package and is well-suited to run off a key. It has excellent Stata highlighting but no code folding. Smultron is the only editor I’ve seen that comes with the Stata syntax file included so it might be a good choice for beginners who don’t want to fiddle with the language preferences, libraries, and that sort of thing to install a user-written file.

Currently the best option is looking like TextMate ($45 educational, $53 commercial). Timothy Beatty at York University has put together a bundle that integrates it beautifully with Stata (note that the bundle assumes you have MP and requires some light editing for some of the features to work with other versions). Something I didn’t expect to like as much as I did is that every open file has its own window (rather than a tab drawer like TW) and this makes it much easier to compare two similar files, though it would get unwieldy with dozens of open files. On the other hand I still prefer certain features of TextWrangler. For instance, it’s much easier to execute a multi-file find/replace in TextWrangler than it is in TextMate (which requires you to first set up a “project” then apply the batch to the project). Both the tab thing and the batch thing have something in common, which is that TextWrangler is better suited for cleaning multiple data files, something I do a lot of. However for coding it’s looking like TextMate. I’ve been using it for about a week while working with a very complex file and so far I’ve been very happy with its code folding, (limited) syntax completion, (excellent) syntax highlighting, etc. Some of the other editors I’ve mentioned could be this good in principle (and already are for some languages), but they would need the as-of-yet unwritten syntax files to do so for Stata.

(btw, here are the definitive thoughts on using text editors with Stata for various platforms).

April 15, 2009 at 6:30 am 2 comments

The greatest diffusion story ever told

| Gabriel |

While I do most of my serious work on trivial pop culture things like pop songs, I’m always interested in the big questions of western civ. As such, I figured I’d post something topical for Easter that I’d been playing around with. According to Mark and Matthew,* Jesus returned from the dead and told the eleven remaining apostles to convert the nations. In Acts they basically each strike out in a different direction to try to achieve that goal. Whatever else it may be, this is the beginning of a very interesting diffusion story.

The apostles established local churches and became the first bishops. Their churches survived them and new bishops succeeded them. For instance, the Apostle Peter was executed in 68 and succeeded by Linus as bishop of Rome. These churches ordained new clergy who in turn established churches in other territories. In the Catholic and Orthodox churches (and I believe the Anglican Communion as well) the concept of “apostolic succession” is a key aspect of legitimacy and so churches have maintained lists of bishops and you can use this to date the point at which Christianity became somewhat institutionalized in a city. (Yes, a city: the Christians converted the urban population much earlier than the rural population. The word “pagan” is actually Latin for “hick”.) Although these Christian sources are partial, we have some independent confirmation for the early spread of Christianity, most notably an early 2nd century letter from Pliny to Trajan. In this letter Pliny described his efforts to suppress Christianity in Northern Turkey and even relates how he tortured two Christian women so they would tell him about a mass. Likewise several Roman historians from the early empire mention rumors that various nobles had taken up “Jewish superstitions” and there is even a famous second century piece of Roman graffiti mocking Christians.

In any case, the Christian doctrine of apostolic succession can be understood as diffusion data. Diffusion is just prevalence over time and so if you count the number of cities with a bishop that’s what you get. Unfortunately the data isn’t as fine-grained as we might like and so for most cities I could only find the establishment date to the nearest century. Nonetheless this gives us a crude idea of how Christianity spread. In this graph I show how many cities had bishops by century.
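The tally behind a curve like this is just a running count of establishment dates. Here is a minimal Stata sketch, assuming a dataset with one row per city and the century in which its first bishop is recorded. The variable names (city, century, newsees, cumulative) and the five example rows are hypothetical placeholders, not the actual data used for the graph.

```stata
* Hypothetical toy data: one row per city, with the century of its first recorded bishop
clear
input str12 city century
"Rome"      1
"Antioch"   1
"Lyon"      2
"Carthage"  2
"Cologne"   4
end

* Collapse to one row per century, counting newly established sees
contract century, freq(newsees)

* Running sum across centuries = number of cities with a bishop so far
sort century
generate cumulative = sum(newsees)
list century newsees cumulative
```

From there, something like `twoway line cumulative century` would plot the curve. Note that centuries with no new sees simply don't appear as rows after `contract`.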


The first thing to note is that this graph is basically an s-curve. By the end of the first century there were only 33 bishops, about triple the number of the original apostles. By the fall of the Roman empire in the West there were hundreds of bishops but Europe didn’t become fully Christianized until the 11th century when the Swedes finally gave up on Odin and Thor. (Of course by this time almost all of the old Eastern empire was now Islamic; you win some, you lose some). Note though that interpreting this as an s-curve is a bit tricky because one of the assumptions of an s-curve is a constant population (in this case a population of cities). In fact it’s not clear what geographic boundaries we’d expect for the population or how many cities big enough to deserve a bishop there were within these boundaries. For instance, there seems to have been a severe population decline in the third century and again during late antiquity (all those Germans were partly filling a demographic vacuum in the Mediterranean). We might thus expect that the number of decent sized cities would shrink during these periods as some cities shrank to the level of small towns or were abandoned altogether.
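To see why the constant-population assumption matters, it helps to write down the textbook logistic curve. This is only the standard form, not something fitted to the bishop data, and the symbols are assumptions introduced here: N(t) is the number of cities with a bishop at time t, K is the fixed number of cities that could ever have one, r is the growth rate, and N_0 is the starting count.

```latex
\frac{dN}{dt} = r\,N\left(1 - \frac{N}{K}\right),
\qquad
N(t) = \frac{K}{1 + \frac{K - N_0}{N_0}\,e^{-rt}}
```

The ceiling K is treated as a constant. If the number of bishop-worthy cities shrinks or grows over the period, K is really a function of time and the observed counts need not trace out a clean s-curve.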

The second thing to note in looking at the curve is the dog that didn’t bark. Christianity was officially (but very haphazardly) persecuted until the Edict of Milan in 313, briefly subjected to (rather mild) persecution again under Julian the Apostate, and then finally given a state monopoly under Theodosius in 391. Thus Christianity experienced a very different policy regime between the 3rd century (when it was most intensely persecuted) and the 4th century (when we see the beginnings of caesaropapism) and you might expect this to affect the diffusion. However the hazard doesn’t change. There’s a very smooth s-curve both before and after Constantine. I think the implication is that the emperors of the Dominate did not so much drive Christianity as respond to it.

This seems to be a special case of what Lieberson calls “riding the wave,” which is when prominent people adopt a trend shortly before it peaks. Lieberson provides numerous examples of movie stars who were either born with a name or adopted a stage name shortly before that name peaked in popularity. If you look only at a small time range it’s easy to misunderstand the prominent people as driving the trend when in fact they are often just responding to it like everybody else. Likewise it seems more appropriate to say that the church caused the Edict of Milan than vice versa.

*Update: I checked and it turns out this part of Mark (the earliest Gospel) probably wasn’t part of the original but was added a generation or two later. The undisputed part of Mark ends with the empty tomb. This is noteworthy because the disputed passage in Mark includes the phrase “go and make disciples of the nations.” Jews used “the nations” to mean non-Jews and this passage may have been added to bolster support for the relatively late decision advocated by Paul to evangelize the gentiles.

April 12, 2009 at 6:22 am 5 comments

The Culture Geeks