Archive for April, 2009

Why did you do that?

| Gabriel |

In a previous post, I applied diffusion methods to interpret the conversion of the Roman Empire; today I'm thinking about the conversion of one particular Roman and what it can teach us about the problem of accounts for action. In Confessions, Augustine of Hippo describes his conversion to Christianity and makes important contributions to theology and philosophy. The book matters to the history of Western thought both for its impact on Christian doctrine and (my concern here) for being the first introspective memoir. Augustine tells us much more about how he felt and why he did things than about what he actually did. Most obviously, he frequently laments his lust but doesn't give us any of the dirt. After an introductory prayer, the book begins by telling us that he's not completely positive he remembers it, but he's pretty sure he was a sinner as a baby, and it goes on like that from there. A typical line about his boyhood goes "For in thy eyes, what was more infamous than I was already, since I displeased even my own kind and deceived, with endless lies, my tutor, my masters and parents–all from a love of play, a craving for frivolous spectacles, a stage-struck restlessness to imitate what I saw in these shows?"

Contrast this with a passage from the Gallic Wars: "When Caesar was informed by spies that the Helvetii had already conveyed three parts of their forces across that river, but that the fourth part was left behind on this side of the Saone, he set out from the camp with three legions during the third watch, and came up with that division which had not yet crossed the river." Caesar's memoir is an extreme case of all plot, no character, but most other ancient works were similar. Xenophon's Anabasis is also written in the third person and focuses on plot. Xenophon never describes his motives or feelings in the narrator's voice, only in dialogue, when he answers direct criticisms of his leadership by other soldiers at assembly. (In a few places Xenophon does provide character portraits of other people, most notably a sycophantic obituary for Cyrus and a hilariously nasty obituary for Menon.) The closest you get to prose emphasizing personality and mental states before Augustine is Plutarch's Lives and 1st and 2nd Samuel, but these are biographies, not autobiographies, and they mostly take a "show, don't tell" approach.

Anyway, much of Confessions consists of Augustine explaining his actions, which can broadly be categorized as sinning and conversion. His explanation for his sins is primarily concupiscence (i.e., original sin has distorted human nature such that men are depraved) and secondarily social contingencies, such as his parents emphasizing his social advancement over his moral education or his trying to impress other young miscreants. His explanation for his conversion is more complex. On one level he emphasizes social connections to Christianity. His father was a pagan but his mother, Monica, was a Christian and gave him an early education in Christianity, which he rejected as a young man encountering the sophistication of pagan philosophy. Later, as a young professor of rhetoric in Milan, he saw Ambrose preach. Augustine was an intellectual snob who had until then thought of Christianity as embarrassingly simple-minded, so encountering a sophisticated and articulate bishop impressed him greatly, and he became close to Ambrose. Meanwhile Monica and some of Augustine's friends continued to push him toward Christianity. It was only after Augustine came under the tutelage of Ambrose and returned to being close to his mother that he heard a voice in the garden saying "Take and read," whereupon he opened Paul's letter to the Romans, read a few sentences, and experienced a religious epiphany, after which he consented to be baptized and ordained (dumping both his girlfriend and his fiancée in the process). Although Augustine tells us everything we need to know about the gradual influence exerted by Monica and Ambrose, he emphasizes the incident in the garden as the moment when he was converted.

To me Confessions illustrates both the potential and the problems of methodologies (such as in-depth interviews) that rely on actors giving accounts of the meanings of their actions. Note that Augustine is the best-case scenario for accounts of action: he was not your average social science respondent interviewed for an hour, but one of the world's greatest philosophers, and he wrote an entire book of profound introspection. Yet even Augustine's account of his conversion is self-evidently problematic.

Augustine recognizes the influence of his mother and friends. Likewise, he describes the gradual process by which he became intellectually disenchanted with Manichean dualism and interested in Christianity as consistent with neo-Platonist monism. Nonetheless he emphasizes the moment of grace in the garden. I have a feeling that if you put your digital voice recorder on the table and interviewed Augustine, he'd give you very different accounts depending on whether you asked "when did you become a Christian," "how did you become a Christian," or "why did you become a Christian." For the first question he'd just tell you about the garden; for the second he'd tell you about Monica and Ambrose but still close with the garden; and for the third he'd give an entirely non-biographical answer about neo-Platonism.

One way to interpret Augustine's emphasis on the garden is that he is following a cultural script. In this interpretation he is trying to make sense of his own religious experience as comparable to the prototypical conversion, Paul hearing a voice on the road to Damascus (which itself echoes such passages from the Tanakh as Moses encountering the burning bush). The accepted cultural script of conversion is not to debate religion rationally for decades before finally giving in, but to have an epiphany in which God's grace opens your heart. Even if the rational debate was vastly more important to Augustine, that's not how it's supposed to go, and so he emphasizes the comparatively minor incident in the garden, which fits the cultural script much better. In this respect Augustine is like many modern people (especially evangelicals, but also members of other traditions, as in Christensen's testimony affirming his Mormonism) who are raised in their church but nonetheless construct "conversion" narratives of the point at which they personally affirmed the religion of their upbringing.

Another interpretation (which is compatible with the first) is that Augustine really did have a religious epiphany in the garden, but this epiphany was only the final stage of a process that was overwhelmingly mundane and gradual. A lot of recent work on cognition has established that, like fortune, insight favors the prepared. We subjectively experience insight as a sudden revelation of an often complex idea, with all the parts hanging together fully formed. However, this only comes as the culmination of a long period of rumination. So Augustine had been thinking about Christianity and neo-Platonism for decades before he had an insight that synthesized these thoughts and finally brought him to Jesus. In the moment, it probably did feel to Augustine as if his mind had experienced a qualitative shift, even though his thinking up to that point had been merely evolutionary.

The same thing applies to insights much more mundane than the religious epiphany of a saint. I subjectively experience the basic concepts for most of my study designs as conceptual insights rather than things that I develop slowly. For instance, when I was in grad school I experienced a burst of insight of a complete methodology involving (what I later learned already existed and was called) the cross-classified fixed-effects model. It subjectively came to me all at once, but only after I had been thinking fairly intensively about the meaning of fixed effects for over two years. If I were trained in a theoretical tradition that emphasized cultural scripts of creative genius over the accumulation of knowledge, I would probably emphasize the moment of insight when the method came to me and ignore the long period of thinking and tinkering that led up to it.

Anyway, my point is that even someone as brilliant as Augustine is incapable of completely understanding his own motives, in part because both cultural scripts and the subjective experience of cognition push him to emphasize certain narratives over others. If we can't take Augustine's testimony about the most important decision of his life at face value, it gets even trickier to interpret transcribed in-depth interviews, let alone closed-form GSS attitude questions that all start out with "do you strongly agree, somewhat agree, neither agree nor disagree, …" You can get really cynical about it and adopt a hyper-structural perspective (e.g., early Marxism, exchange theory or networks in sociology, revealed preferences in economics) in which the actor's account of subjective experience drops out almost entirely and all we care about is action. On the other hand, even if you are that cynical about the causes of action, the cultural scripts are fascinating objects of study in and of themselves. Certainly it's very interesting that conversion narratives often culminate in an epiphany, even if we think conversion is actually a process involving influence through social networks and gradual rumination.

April 30, 2009 at 6:15 am 2 comments

Have a nice day

| Gabriel |

In the comments, Mike3550 noticed that many of my scripts end:

*have a nice day

Believe it or not, this is actually a useful trick: Stata only executes lines that end in a carriage return, so including a gratuitous comment at the end of the script is an easy way to ensure that Stata executes the last real command. If "*have a nice day" is too sappy for you, it works just as well to use "*it puts the lotion on its skin or else it gets the hose again."
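For concreteness, here is a minimal sketch of a do-file ending this way (the dataset and variable names are just placeholders):

use mydata, clear
reg y x
*have a nice day

If the file ended at "reg y x" without a trailing carriage return, that last command might silently never run; with the gratuitous comment as the final line, it always does.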

April 29, 2009 at 6:37 am 1 comment

R, Stata and “non-rectangular” data

| Pierre |

Thanks Gabriel for letting me join you on this blog. For those who don’t know me, my name is Pierre, I am a graduate student in sociology at Princeton and I’ve been doing work on organizations, culture and economic sociology (guess who’s on my committee). Recently, I’ve become interested in diffusion processes — in quite unrelated domains: the emergence of new composers and their adoption in orchestra repertoires, the evolution of attitudes towards financial risk, the diffusion of stock-ownership and the recent stock-market booms and busts.

When Gabriel asked me if I wanted to post on this Stata/soc-of-culture-oriented blog, I first told him I was actually slowly moving away from Stata and using R more and more frequently… which is the topic of my first post. I am not trying to start the first flamewar of "Code and culture"; rather, I'd like to argue that both languages have their own strengths and weaknesses. The important thing for me is not to come to a definitive conclusion ("Stata is better than R" or vice versa) and use one package while discarding the other, but to identify the conditions under which R or Stata is more or less painful to use for the type of data analysis I am working on.

People usually emphasize graphics and the number of high-quality user-contributed packages for cutting-edge models as R's greatest strengths over other statistical packages. I have to say I don't often run into an R estimation function for which I can't find an equivalent Stata command. And while I agree that R-generated graphs can be amazingly cool, Stata graphics have become much better in recent years. For me, R is particularly useful when I need to manipulate certain kinds of data and turn them into a "rectangular" dataset:

Stata is perfect for "rectangular" data, when the dataset fits nicely inside a rectangle of observations (rows) and variables (columns) and when the conceptual difference between rows and columns is clear; this is what a dataset has to look like just before running a regression. But things can get complicated when the raw data I need to manipulate are not already "rectangular": this may include network data and multilevel data, even when the ultimate goal is to turn these messy-looking data, sometimes originating from multiple sources, into a nice rectangular dataset that can be analyzed with a simple linear model. Sure, Stata has a few powerful built-in commands (although I'd be embarrassed to say how many times I have had to recheck the proper syntax for "reshape" in the Stata help). But sometimes egen, merge, expand, collapse and reshape won't do the trick, and I find myself sorting, looping, using, saving and merging until I realize (too late, of course!) that Stata can be a horrible, horrible tool for manipulating datasets that are not already rectangular. R, on the other hand, has two features that make it a great tool for data management:

  1. R can have multiple objects loaded in memory at the same time. Stata, by contrast, can only work on one dataset at a time, which is not just inefficient (you always need to write the data into temporary files and read a new file to switch from one dataset to another), it can also unnecessarily add lines to the code and create confusion (see the sketch after this list).
  2. R can easily handle multiple types of objects: vectors, matrices, arrays, data frames (i.e. datasets), lists, functions… Stata, by contrast, is mostly designed to work on datasets: most commands take variables or variable lists as input, and when Stata does handle other types of objects (matrices, scalars, macros, ado files…) it uses distinct commands, each with a different syntax (e.g. "matrix define", "scalar", "local", "global", "program define" instead of "generate"…) and sometimes a completely different language (Mata for matrix operations, which I have never had the patience to learn). R handles these objects in a simple and consistent manner (for example, it uses the same assignment operator "<-" for a matrix, a vector, an array, a list or a function) and can extract elements that are seamlessly converted into other object types (e.g. a column of a matrix, or the coefficients or standard errors from a model, is by definition a vector, which can be treated as such and added as a variable in a data frame without any special command à la "svmat").
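To illustrate the first point, here is a minimal sketch (the file and variable names are hypothetical) of the tempfile dance Stata requires just to combine two datasets, something R does with two data frames sitting side by side in memory:

* merge firm-level data onto a person-level dataset
tempfile firms
use firmdata, clear
sort firmid
save `firms'
use persondata, clear
sort firmid
merge firmid using `firms'

In R, the equivalent is a single merge() call on two data frames that are both already loaded.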

In my next post, I’ll try to explain why I keep using Stata despite all this…

April 28, 2009 at 12:57 pm 9 comments

Stata graphs in pdf fixed

| Gabriel |

Yesterday I complained that Stata makes ugly pdf graphs. Today I decided to fix it. Not only does this code make graphs that aren't ugly, but it should also port to Stata for Linux or Solaris (the current pdf feature only "works" on Stata for Mac). This little program first saves a graph as eps (which Stata does well) and then uses the command-line utility "ps2pdf" to make the pdf. Deleting the eps file is optional.

This program runs a little slower than the native pdf export but on the other hand it’s usually faster to load a pdf than an eps into a graphics program and I tend to be more impatient about that sort of thing (which I do interactively) than I am about running Stata (which I usually batch).

(Update 4/27 3:34pm, I have added the program to the ssc directory. So instead of copying the code below you can just type “ssc install graphexportpdf, replace”. Thanks to Kit Baum for some help with the syntax parsing)

*1.0.1 GHR April 27, 2009
program graphexportpdf
	version 10
	set more off
	syntax anything(name=filename) [, DROPeps replace]
	* strip the .pdf extension if the user supplied one
	local extension=regexm("`filename'","(.+)\.pdf")
	if `extension'==1 {
		disp "{text:note, the file extension .pdf is allowed but not necessary}"
		local filename=regexs(1)
	}
	if "`replace'"=="replace" {
		disp "{text:note, replace option is always on with graphexportpdf}"
	}
	* save as eps (which Stata renders well), then convert with ghostscript's ps2pdf
	graph export "`filename'.eps", replace
	if "$S_OS"=="Windows" {
		disp "{error:Sorry, this command only works properly with Mac, Linux, and Solaris. Although I can't make a pdf for you, I have generated an eps file that you can convert to pdf with programs like ghostscript or acrobat distiller}"
	}
	else {
		shell ps2pdf -dAutoPositionEPSFiles=true -dPreserveEPSInfo=true -dAutoRotatePages=/None -dEPSCrop=true "`filename'.eps" "`filename'.pdf"
		* optionally delete the intermediate eps file
		if "`dropeps'"=="dropeps" | "`dropeps'"=="drop" {
			shell rm "`filename'.eps"
		}
	}
end
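For example, assuming you have a graph in memory and want a file called myfigure.pdf (the filename is hypothetical):

graphexportpdf myfigure, dropeps

This saves myfigure.eps, converts it to myfigure.pdf with ps2pdf, and deletes the intermediate eps file.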

April 27, 2009 at 11:37 am 3 comments

Did you control for this?

| Gabriel |

Probably because we deal with observational data (as compared to randomized experiments), sociologists tend to believe that the secret to a good model is throwing so many controls into it that the journal would need to have a centerfold to print the table. This is partly based on the idea that the goal of analysis is to strive for an r-squared of 1.0 or that it’s fun to treat control variables like a gift bag of theoretically irrelevant but substantively cute micro-findings.

The real problem, though, is a misconception of how control variables relate to hypothesis testing. To avoid spurious findings you do not need to control for everything that is plausibly related to the dependent variable. What you do need to control for is everything that is plausibly related to the dependent variable and the independent variable. If a variable is uncorrelated with your independent variable then you don't need to control for it, even if it's very good at explaining the dependent variable.
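The standard omitted-variable algebra makes this concrete. Suppose the true model is y = b1*x + b2*z + e, but you regress y on x alone. The expected coefficient on x is then b1 + b2*d, where d is the slope from regressing z on x. If z is uncorrelated with x, then d = 0 and omitting z leaves beta-x unbiased no matter how much of y it explains; only when z is correlated with both x and y does the bias term b2*d kick in.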

You can demonstrate this with a very simple simulation (code is below) by generating an independent variable and three controls.

  • Goodcontrol is related to both X and Y
  • Stupidcontrol1 is related to Y but not X
  • Stupidcontrol2 is related to X but not Y

Compared to a simple bivariate regression, both goodcontrol and stupidcontrol1 greatly improve the overall fit of the model. (Stupidcontrol2 doesn't, so most people would drop it.) But controls aren't about improving fit; they're about avoiding spurious results. So what we really should care about is not how r-squared changes or the size of beta-stupidcontrol1, but how beta-x changes. Here the results do differ. Introducing stupidcontrol1 leaves beta-x essentially unchanged, but introducing goodcontrol changes beta-x appreciably. The moral of the story: it's only worthwhile to control for variables that are correlated with both the independent variable and the dependent variable.

Note that in this simulation we know by assumption which variables are good and which are stupid. In real life we often don't know this, and just as importantly, peer reviewers don't know either. Whether it's worth leaving stupidcontrol1 and stupidcontrol2 in the model purely for their "cover your ass" effect is an exercise left to the reader. However, if a reviewer asks you to include "stupidcontrol1" and it proves infeasible to collect the data, you can always try reminding the reviewer that while you agree it is probably a good predictor of the outcome, your results would nonetheless be robust to its inclusion because there is no reason to believe it is also correlated with your independent variable. I'm not guaranteeing that reviewers will believe this, only that they should, so long as they agree with your assumption that stupidcontrol1 has a low correlation with X.

clear
set obs 10000
* data-generating process:
*   goodcontrol is correlated with both x and y
*   stupidcontrol1 predicts y but is independent of x
*   stupidcontrol2 is correlated with x but does not enter y
gen x=rnormal()
gen goodcontrol=rnormal()/2+(x/2)
gen stupidcontrol1=rnormal()
gen stupidcontrol2=rnormal()/2+(x/2)
gen error=rnormal()
gen y=x+goodcontrol+stupidcontrol1+error
* compare beta-x across specifications (eststo/esttab are part of the estout package)
eststo clear
quietly eststo: reg y x
quietly eststo: reg y x goodcontrol
quietly eststo: reg y x stupidcontrol1
quietly eststo: reg y x stupidcontrol2
quietly eststo: reg y x goodcontrol stupidcontrol1 stupidcontrol2
esttab, scalars(r2) nodepvars nomtitles
*have a nice day

April 27, 2009 at 5:29 am 9 comments

Stata graphs and pdf go together like mustard and ice cream

| Gabriel |

[Update, see this follow-up post where I offer a solution]
[Update 2, this issue is fixed as of Stata 11.2 (3/30/11), “graph export foo.pdf” now works great on Macs]

I was embedding Stata graphs in Lyx and I noticed that the graphs came out kind of fug, with a lot of gratuitous jaggedness. I experimented a little and realized that the problem was that I was having Stata save the graphs as pdf, which for some reason it doesn't do very well. The top half of this image is part of my graph saved as eps and the bottom half is the same graph saved as pdf. (Both versions are zoomed in to 400%, but you can see the difference even at 100%.)

[image: eps vs pdf]

I really don’t understand why the pdf looks so fug. It’s not fuzzy or jaggy so it’s clearly a vector-graphic not a bitmap, it’s just a vector graphic with very few nodes and gratuitous line thickness. As far as I can tell this is a problem with Stata, not OS X, because if I have Stata create an eps and then use Preview (or for that matter, Lyx/LaTeX) to convert it to pdf, the results look beautiful. Likewise I’ve always been impressed with the deep integration of pdf into OS X. In any case, in the future I’m doing all my Stata graphs as eps even though in some ways pdf is more convenient (such as native integration with “quick look”).

April 26, 2009 at 10:45 pm 3 comments

Graphing novels

| Gabriel |

Via Andrew Gelman, I just found this link to TextArc, a package for automated social network visualization of books. Gelman says that (as is often true of visualizations) it’s beautiful and a technical achievement but it’s not clear what the theoretical payoff is. I wasn’t aware of TextArc at the time, but a few years ago I suggested some possibilities for this kind of analysis in a post on orgtheory in which I graphed R. Kelly’s “Trapped in the Closet.” My main suggestion at the time was, and still is, that social network analysis could be a very effective way to distinguish between two basic types of literature:

  1. Episodic works. In these works a single protagonist or small group of protagonists continually meets new people in each chapter, as in The Odyssey or The Adventures of Huckleberry Finn. The network structure for these works will be a star, with a single hub consisting of the main protagonist and a bunch of small cliques each radiating from the hub but not connected to each other. There would also be a steep power-law distribution for centrality.
  2. Serial works. In these works an ensemble of several characters, both major and minor, all seem to know each other, but often with low tie strength, as in Neal Stephenson's Baroque Cycle. This kind of work would have a small-world structure. Centrality would follow a Poisson distribution, but there'd be much higher dispersion in the frequency of repeated interaction across edges.

An engine like TextArc could be very powerful at coding for these things, although I'd want to tweak it so it only draws networks among proper nouns, which would be easy enough with a concordance and/or some regular expression filters.
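A minimal sketch of such a filter in Stata, assuming the text has already been tokenized into a dataset with one word per observation (the variable name is hypothetical):

* crude proper-noun filter: keep capitalized tokens
* (sentence-initial words would still need screening against a dictionary or concordance)
gen byte propernoun = regexm(word, "^[A-Z][a-z]+$")
keep if propernoun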

Of course, as Gelman asked, what's the point? I can think of several. First, there might be strong associations between network structure and genre. Second, the prevalence of these network structures might change over time. A related issue would be distinguishing between literary fiction and popular fiction. My impression is that currently episodic fiction is popular and serial fiction is literary (especially in the medium of television), but in other eras it was the opposite. A good coding method would allow you to track the relationship between formal structure, genre, prestige, and time.

April 26, 2009 at 1:12 pm 3 comments
