Archive for June, 2009

The Workflow of Data Analysis Using Stata

| Gabriel |

I recently read Scott Long’s new book The Workflow of Data Analysis Using Stata and I highly recommend it. One of the ironies of graduate education in the social sciences is that we spend quite a bit of time trying to explain things like standard error but largely ignore that on a modal day quantitative research is all about data management and programming. Although Long is too charitable to mention it, one of the reasons to emphasize these issues is that many of the notorious horror stories of quantitative research involve not modeling but data management. For instance, “88” turned out to be an unnoticed missing value code rather than actual data on senescent priapism, a weighting error led to wildly exaggerated estimates of post-divorce income effects, and, most recently, findings about anomie were at least in part an artifact of a NORC missing data coding error.

By focusing on these largely neglected but critical data management issues, Long has done a service to the discipline. The publication of it may even reduce Indiana’s comparative advantage of producing hotshot quant PhDs now that grad students elsewhere can vicariously benefit from this important aspect of the training there. Certain aspects of it aren’t relevant to everyone (e.g., his section on value labels is most applicable to surveys with lots of Likert scales) but almost any serious quant is likely to find an enormous amount of clearly presented useful information.

For many of the issues the book addresses he shows a highly efficient and reliable way to do things. This is a service because many self-taught people will satisfice with a clunky and inefficient technique, even though with a little more upfront effort (an upfront effort greatly reduced by this book) they could avoid both effort and error in the long run. “Chapter 4: Automating Your Work” is particularly good in this respect. Since I lacked the benefit of a copy of this book time-warped to 1997, I used Stata for years before I learned the “program” and “foreach” syntax. Until now, I’d never understood how to use matrices (which is why this script is so hideously clunky, really, please don’t click the link) but Long has a very clear explanation of how to use all of these programming constructs. In the future I think my scripts will be much more elegant for having read his book, and especially chapter 4.

A less obvious contribution is that in several places he suggests standards. For instance, he suggests several missing data codes to distinguish between different types of missing data (coding error, skip code, respondent refused, etc). The particular codes he provides are necessarily arbitrary but no less useful for it because standards benefit from network externalities and it would make data analysis much easier if Stata users harmonized on these standards. Therefore the important thing is to have a remotely sensible standard, regardless of what it is.
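As a rough illustration of what such a standard looks like in practice, here is a minimal sketch using Stata's extended missing values. The variable, the raw codes 88 and 99, and the letters .r and .s are all made up for the example; they are not the particular codes Long proposes.

```stata
* toy data: a 1-5 satisfaction item with raw missing codes 88 and 99 (hypothetical)
clear
input id satis
1 4
2 88
3 99
4 2
end
* give each raw code its own extended missing value
mvdecode satis, mv(88=.r \ 99=.s)
* label the extended missing values so the reason survives in the data
label define satis_lbl .r "respondent refused" .s "skipped by design"
label values satis satis_lbl
* both codes now drop out of statistics, as any missing value would
summarize satis
```

The point of the design is that the two kinds of missingness stay distinguishable (you can still tabulate them with the "missing" option) while neither can be mistaken for real data.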

Despite my enthusiasm I had a few differences of opinion and style. The main one is that the book reads something like a series of clear but nonetheless relatively discrete pieces of advice with only implicitly unifying themes. Over the course of a 200+ page book even consistently good advice starts to feel like one thing after another.

I think it might have made more sense and been more engaging to lay out a short list of principles for good code in the introduction. Then throughout the text each particular technique or standard could be shown as a manifestation of one or more of these general rules. Here is my own attempt to codify the general principles that at present are only implicit. Over the next few weeks I’ll elaborate on how these principles manifest in the book.

  1. The project should be replicable. (As Hillel said, “this is the whole law, the rest is commentary.”)
  2. Document your work by doing everything through adequately commented, organized, and archived scripts.
  3. Treat the raw data files as read-only.
  4. Good code will let you make changes in one place and see those changes propagate. (Note: Long embraces this principle within a single version of a single script, but otherwise sees this as a bug not a feature. As I’ll discuss in a few days, I disagree with him on the trade-offs involved in this issue).
  5. Good code is modular.

June 29, 2009 at 5:48 am 5 comments


| Gabriel |

The House is scheduled to vote today on the Waxman-Markey CO2 cap and trade bill. The interesting thing about this bill is that it’s watered down to ensure passage, meaning that even if it works exactly as intended, it will have only trivial direct impacts on climate change. Some supporters of the bill acknowledge that it will have minimal direct effects, but argue that it will have much greater indirect effects because it will show global leadership such that the BRIC countries will sign on for CO2 limits of their own. That is, we can view the bill as a big bet on the macro institutionalist world polity theory associated with John Meyer’s group at Stanford.

I think this is a very interesting bet because cap and trade is a hard case for macro institutionalism. The typical case examined by macro institutionalism involves countries signing onto vague but pleasant sounding treaties which they don’t necessarily plan to enforce (i.e. “decoupling”). The classic example is that North Korea signs any human rights treaty you put in front of it and several countries that practice genital mutilation have signed and ratified the UN women’s rights treaty.

On the other hand there are limits to the decoupling, especially in countries with rule of law. (This is why the US has not ratified the women’s rights treaty, because we know that unlike Saudi Arabia our courts would actually enforce it and unlike Canada or European countries we don’t want them to). The classic example of this is that Japan signed the indigenous rights treaty, thinking of it as cheap grace on the grounds that they had no indigenous minorities who might demand rights, only to discover later that the Ainu were able to effectively use it to form an identity and make enforceable claims on the state.

So in order for Waxman-Markey to have an appreciable effect on climate change, it would not only have to motivate other countries to pass similar legislation but for them to do so effectively rather than just symbolically. The thing is that, unlike protecting the rights of imaginary indigenous minorities, serious CO2 reduction is really expensive. This is why you see a certain amount of decoupling both in Waxman-Markey and in earlier European Union cap and trade regulation. In both cases incumbent emitters have been essentially grandfathered in with grants of permits so as to buy them off from mobilizing to oppose the legislation. Given that China’s social contract amounts to “the CCP provides jobs and increasing prosperity, the people don’t challenge its legitimacy” and the CCP is willing to do things like currency manipulation to keep this going, I find it very difficult to believe that they will stop building coal plants and replace them with wind mills because America and the EU showed that this was the legitimate practice for a nation-state.

My prediction is that the institutionalist model for climate change legislation will succeed in the sense that many countries will pass it, in part inspired by America’s example (which in turn was inspired by the EU’s example). However institutionalism also predicts that it will largely fail in the sense that such legislation will (like the American and EU laws) be substantially decoupled from actual practice, containing substantial carve-outs and exemptions. In other words, if you need real estate in Bangladesh, you’re better off leasing rather than buying.

June 26, 2009 at 5:12 am 2 comments

Would you buy a car from these guys?

| Gabriel |

This week’s episode of EconTalk (host Russ Roberts, guest Mike Munger) discusses the relationship between GM and its local dealers. Car dealers are franchisees and in most states the manufacturer can only cancel the franchise agreement by buying out the dealer. Thus in the short to medium run it can be less costly to keep a brand going at a loss than to close it (and buy out the dealers). More broadly, the argument goes that under corporatism GM was making so much money that it didn’t mind rent-seeking from its stakeholders. However once GM started to face serious competition from the Japanese in the late 70s it was so constrained by these arrangements that it was unable to adapt. Probably the most curious case is the launch of the Saturn nameplate in the early 1990s, where GM seemed to have some good ideas about how to imitate Toyota, but rather than tackling the Herculean task of directly taking on its stakeholders by applying those ideas to, say, Chevy, GM tried to create a new nameplate that would escape the historical relations with its stakeholders. (It didn’t work).

Let me count the ways that I enjoyed this episode:

  • It’s a nice respite from the many episodes they’ve had on macroeconomics over the last year.
  • The guys are adopting a path dependency argument in that if you’re going to be a company operating in neoliberalism, there’s a huge difference between having been born under neoliberalism vs making a transition from corporatism. I thought this was interesting coming from EconTalk, which is usually skeptical of friction arguments.
  • Mike Munger and Russ Roberts have great chemistry. This episode sounds like an ordinary conversation where they honestly don’t understand why the situation is turning out like this and are trying to tease it out. In some cases they note an argument seems plausible until you consider a piece of counter-evidence and table the issue as undecided. (This reflects Roberts’ trend over the last few months towards almost agnostic intellectual humility). In many older episodes with Munger, they have the didactic feel of Plato’s dialogues, or rather, what Plato’s dialogues would have sounded like if Socrates and Alcibiades continually riffed to see who can outdo the other in elaborate sarcastic imagery. (I’ve listened to their gouging conversation three times and I laugh every time).

June 25, 2009 at 5:14 am 1 comment

Frialator research methods

| Gabriel |

AdAge notes that market research consistently finds that consumers say they want more healthy options, yet mysteriously people buy fried chicken, not grilled chicken. In other news, voters want more government services but lower taxes and Augustine prayed for chastity, but not yet.

The only real question to me is whether opinion/marketing survey and focus group respondents have clear self-awareness of their preferences but obfuscate them for reasons of social desirability bias. The alternative is that people do not really understand their own latent preferences (which are only made manifest in interaction) and when asked to articulate their preferences they fall back onto cultural scripts. My own vote goes for the “we don’t understand our own preferences” model as I think of myself as being an idiot far more often than a liar and I generalize from that n of 1.

June 24, 2009 at 4:58 am 1 comment

Underneath it all

| Gabriel |

A few years ago I had a friendly argument with Jenn Lena and Pete Peterson about their ASR article on genre trajectories. While I generally love that article, my one minor quibble is their position that there is such a thing as non-genre music, and in particular that “pop” can be considered unmarked, in genre terms. They write “Not all commercial music can be properly considered a genre in our sense of the term.” They exclude Tin Pan Alley (showtunes) and go on to write that, “Much the same argument holds for pop and teen music. At its core, pop music is music found in Billboard magazine’s Hot 100 Singles chart. Songs intended for the pop music market usually have their distinguishing genre characteristics purposely obscured or muted in the interest of gaining wider appeal.”

Myself, I disagree with treating pop as beyond genre. First, the Hot 100 is an aggregate without any real meaning as a categorical marker. I find it interesting that in radio it’s increasingly prevalent to refer to “Top 40” as “Contemporary Hit Radio” in recognition of the fact that in the literal sense top 40 hasn’t existed for decades. Many bands who are very popular would nonetheless not get played in CHR and many bands (think Britney Spears) only get played in CHR, implying that CHR is itself a genre of what we might call “high pop.” Billboard itself distinguishes between the Hot 100 (whatever is really popular, regardless of genre) and Top 40 Mainstream (CHR).

Second, and more importantly, it is impossible to have non-genre music in the same way that it is impossible to have language-less speech if you take the Howard Becker perspective that genre is about having sufficient shared understandings and expectations so as to allow coordination between actors. Consider the fact that most genres work on the Buddy Holly model of long-lasting bands who write their own songs whereas high pop almost exclusively involves project-based collaborations of songwriters, session musicians, producers, and (most salient to the audience) singers. Since standards are especially important when the collaborations are ephemeral, coordination through strong shared expectations is more important in high pop than in genre music. Likewise, high pop sounds more monotonous than much genre-based music. Furthermore, high pop is not merely the baseline, but involves specialized skills and techniques (e.g., vocal filters) not found in “genres.”

For the most part this issue is orthogonal to the argument they present in the article (which is why I like the article despite this dispute) but I think it potentially creates problems for the IST (Industry-> Scene-> Traditional) trajectory, most of which involves a spin-off of high pop music (as is seen most clearly with the Nashville Sound, which was basically Tin Pan Alley with cowboy hats). In response to this Pete said that there is a distinction between pop and genre in that with pop change is gradual and more Lamarckian than the creative destruction and churn seen with genres. I think this definition is fair enough, and certainly it’s highly relevant to their purposes. So the question of whether it is possible to have non-genre music ultimately comes down to whether you choose to emphasize churn or shared expectations as the defining feature of genre.

Anyway, I was reminded of this discussion a few days ago when my wife and I went to see No Doubt. This band has had 8 singles on the Billboard Hot 100 chart and had multiple singles in four different Billboard format charts (rhythmic, CHR, adult, modern rock) so I think they are a fair candidate for what Jenn and Pete have in mind as “pop.” However the performance I attended made it apparent that at their core they are ultimately still a ska band. Most obviously, during one of Gwen’s costume changes the band did a cover of The Specials’ arrangement of “Guns of Navarone” and when she came back she was wearing what can only be described as a two-tone sequined romper and later on she wore a metallic Fred Perry shirt and braces (worn hanging). More generally all of their dancing was based on ska steps, their rhythm section dominates their lead guitar, and they had a horn section and keyboard (tuned as an organ).

In a sense, I think you can take No Doubt as a vindication of what Jenn and Pete are arguing. Here you have a band that started out within genre music but graduated into commercial success by recording unmarked pop. Note that their return to ska/dancehall with “Rock Steady” didn’t sell nearly as many copies as the mostly pop albums “Tragic Kingdom” and “Return of Saturn”. However there’s also the interesting fact that when Gwen decided to dive headfirst into high pop, she did so as a “solo” act, which in effect meant that she went from collaborating with Tony Kanal to doing so with Dr Dre and the Neptunes. I take Gwen’s solo career as a vindication for my perspective, the idea being that going into high pop involves not just the negative act of losing the markings and skills of genre and becoming generic music (which presumably Kanal could have done), but the positive act of acquiring the markings and skills of high pop (which required soliciting the efforts of high pop specialists like the Neptunes).

Special bonus armchair speculation!

Compare and contrast No Doubt and Dance Hall Crashers. Both are up-tempo California ska bands that started in the late 80s and have girl singers (two of them in the case of DHC). Although this is necessarily disputable, I would submit that c. 1995 (when No Doubt broke), DHC was the more talented band. Likewise, DHC has the better pedigree, being (along with Rancid) the successors to Operation Ivy. So why is it that Gwen Stefani rather than Elyse Rogers or Karina Denike is the one who ultimately became a world class pop star and an entrepreneur of overpriced designer fauxriental baby clothes?

I have three speculations, listed below in rough order of how much credence I give each of them:

  1. Looking for an explanation is futile because cultural markets are radically stochastic. If you have two talented bands it is literally impossible to predict ex ante which will become popular and in some alternate universe DHC are gazillionaires whereas No Doubt is known only to aficionados of California 90s music.
  2. Jenn and Pete are right and the issue is that No Doubt was better at transcending genre. Noteworthy in this respect is that basically all of DHC’s music is skacore whereas from their very first recordings No Doubt has always included elements of disco and pop, including AC-friendly Tin-Pan-Alley-esque ballads like “Don’t Speak” that it’s pretty hard to imagine DHC playing.
  3. There’s a cluster economy explanation in that No Doubt is from Orange County (which c. 1994 was supposed to be the next Seattle) whereas DHC is from the East Bay.

June 23, 2009 at 5:42 am 1 comment

p(gay married couple | married couple reporting same sex)

| Gabriel |

Over at Volokh, Dale Carpenter reproduces an email from Gary Gates (who unfortunately I don’t know personally, even though we’re both faculty affiliates of CCPR). In the email, Gates disputes a Census report on gay couples that Carpenter had previously discussed, arguing that many of the “gay” couples were actually straight couples who had coding errors for gender. This struck me as pretty funny, in no small part because in grad school my advisor used to warn me that no variable is reliable, even self-reported gender. (Paul, you were right). More broadly, this points to the problems of studying small groups. (Gays and lesbians are about 3% of the population, the famous 10% figure is a myth based on Kinsey’s use of convenience/purposive sampling).

Of course the usual problem with studying minorities is how to recruit a decent sample size in such a way that still approximates a random sample drawn from the (minority) population. If you take a random sample of the population and then do a screening question (“do you consider yourself gay”) you’re facing a lot of expense and also problems of refusal if the screener involves stigma because refusal and social desirability bias will be higher on a screener than if the same question is asked later on in the interview. On the other hand if you just direct your sample recruitment to areas where your minority is concentrated you’ll save a lot of time but you will also be getting only members of the minority who experience segregation, which is unfortunate as gays who live in West Hollywood are very different from those who live in Northridge, American Indians who live on reservations are very different from those who live in Phoenix, etc. Both premature screeners involving stigma and recruitment by concentrated area are likely to lead to recruiting unrepresentative members of the group on such dimensions as salience of the group identity.

These problems are familiar nightmares to anyone who knows survey methods. However the issue described by Gates in response to Carpenter (and the underlying Census study) presents a wholly new issue: when you are dealing with a small class you can have problems even if sampling is not a problem and even if measurement error in defining the class is minimal. Really this is the familiar Bayesian problem that when you are dealing with low baseline probability events, even reasonably accurate measures can lead to false positives outnumbering true positives. The usual example given in statistics/probability textbooks is that if few people actually have a disease and you have a very accurate test for that disease, nonetheless the large majority of people who initially test positive for this disease will ultimately turn out to be healthy. Similarly, if straight marriages are much more common than gay marriages then it can still be that most so-called gay marriages are actually coding errors of straight marriages, even if the odds of a miscoded household roster for a given straight marriage are very low.
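To make the base-rate arithmetic concrete, here is a worked example with made-up numbers. They are assumptions for illustration only, not figures from Gates, Carpenter, or the Census. Let G mean that a married couple really is same-sex and R mean that the roster records it as same-sex. Suppose 0.5% of married couples are same-sex, a straight couple is miscoded 1% of the time, and a same-sex couple is recorded correctly 99% of the time:

```latex
\[
P(G \mid R)
= \frac{P(R \mid G)\,P(G)}{P(R \mid G)\,P(G) + P(R \mid \neg G)\,P(\neg G)}
= \frac{0.99 \times 0.005}{0.99 \times 0.005 + 0.01 \times 0.995}
= \frac{0.00495}{0.01490}
\approx 0.33
\]
```

Under these assumed rates only about a third of the couples recorded as same-sex would really be same-sex, even though the coding is right 99% of the time.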

June 22, 2009 at 8:03 am

Where was this published? Who cares? Viva Jeremy!

| Gabriel |

In honor of Jeremy’s election to the publications committee, I’m posting a BibTeX style file that incorporates his campaign promise to abolish the anachronistic “place of publication” field from ASA citation style. The file is hand-modified from the Dierkes and Louch style file of the soon to be defunct ASA citation style.

Because it’s a particularly long bit of code it’s below the fold. (more…)

June 19, 2009 at 10:50 pm 1 comment

Journey to the True Tales of the IMDB!

| Gabriel |

Following up on yesterday’s post, check out this paper by Herr and his colleagues that graphs IMDB and provides some basic descriptions of the network. You can also see a zoomable version of their truly gorgeous visualization. Finally the answer to that age old question, what do you get for the quantitative cultural sociologist who has everything?

The authors are affiliated with the Cyberinfrastructure for Network Science Center at Indiana. Although Indiana sociology has a well-deserved reputation for hardcore quant research, CNS is at the school of Information. Following the logic I learned from reading Marvel comics as a kid I can only speculate that something about the pesticide run-off in the drinking water gives scholars at Indiana superhuman abilities to code.

Also of note is that CNS provides the cross-platform open source package Network Workbench. I was a little skeptical because it’s written in Java (which tends to be slow) but I got it to create a PageRank vector of a huge dataset in six minutes, which isn’t bad at all. I may have more to say about this program in the future as I plan to tinker with it.

June 19, 2009 at 5:11 am

Son of True Tales of the IMDB!

| Gabriel |

Continuing with the discussion of IMDB networks …

Although it’s pretty much futile to get Stata to calculate any network parameters beyond degree centrality, it’s actually good at cleaning collaboration network data and converting it to an edge list. You can then export this edge list to a package better suited for network analysis like Pajek, Mathematica, or any of several packages written in R or SAS.

The IMDB is a bipartite network where the worker is one mode and the film is the other. Presumably you’ll be reducing this to a one-mode network, traditionally a network of actors (connected by films) but you can do a network of films (connected by actors). So you’ll need to start with the personnel files (writers.list, actors.list, actresses.list, etc). Whether you want one profession (just actors) or all the professions is a judgement call but actors are traditional.

Having decided which files you want to use, you have to clean them. (See previous thoughts here). Most of the files are organized as some variation on this:

Birch, Thora	Alaska (1996)  [Jessie Barnes]  <1>
	Ghost World (2000)  [Enid]  <1>

So first you clean the file in perl or a text editor, then “insheet” with Stata. There are two issues:

  1. In all of the files the worker name appears only on the first record; subsequent credits are whitespace. To fill it in use this command:
     replace var1=var1[_n-1] if var1==""
  2. The name is tab-delimited from the credit, but the “credit” includes several types of information. You’ll need to do a regular expression search either in the text editor to turn the tags to tabs or in Stata use regexm/regexs to pull the information out from within the tags. For instance in the actor/actress files “[]” shows the name of the character and “<>” the credit rank. Parentheses show the release date, but that’s effectively part of the film title as it helps distinguish between remakes.
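Here is a minimal sketch of the Stata route, using the two Thora Birch records as toy data. It assumes the imported columns are called var1 and var2, that each credit has exactly one “[]” and one “<>” tag, and that the year is a plain four-digit number in parentheses; real IMDB records have variants that these patterns will miss, so treat it as a starting point rather than a complete parser.

```stata
* toy data mimicking the layout shown above
clear
input str20 var1 str60 var2
"Birch, Thora" "Alaska (1996)  [Jessie Barnes]  <1>"
"" "Ghost World (2000)  [Enid]  <1>"
end
* fill the worker name down from the first credit
replace var1=var1[_n-1] if var1==""
ren var1 name
* film title, including the release year in parentheses
gen film=regexs(1) if regexm(var2, "^(.+\([0-9][0-9][0-9][0-9]\))")
* release year as a number
gen year=real(regexs(1)) if regexm(var2, "\(([0-9][0-9][0-9][0-9])\)")
* character name from between the square brackets
gen character=regexs(1) if regexm(var2, "\[(.+)\]")
* credit rank from between the angle brackets
gen rank=real(regexs(1)) if regexm(var2, "<([0-9]+)>")
list name film year character rank
```

The variable names name, film, and year are chosen to match the ones the later code blocks expect in credits.dta.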

Now you need to append the personnel files to each other in a file we can call credits.dta. Whether you include just actors and actresses or all the professions is a judgement call. The next couple steps are not necessary in theory but in practice they are very helpful for keeping the file sizes reasonably small. So it helps a lot to encode the data, though because of the large number of values you have to do it manually.

*the following block of code is basically a roundabout "encode" command but it doesn't have the same limitations
use credits.dta, clear
contract name
drop _freq
sort name
*create "i" as a name serial number based on row number/ alphabetical order
*store as long so that large ids keep full integer precision
gen long i=_n
lab var i "name id"
save i_key, replace
outsheet using i_key.txt, replace
sort i
*i_keyb.dta is same as i_key but sorted by "i" instead of name.
*substantively they are identical, but having two versions is useful for merging
save i_keyb.dta, replace

*create list of films and assign serial number "j" to each, just as with "i" for name
use credits.dta, clear
keep film
contract film
drop _freq
sort film
gen long j=_n
lab var j "film id"
save j_key, replace
outsheet using j_key.txt, replace
sort j
save j_keyb, replace

The next memory-saving step is to break it up into annual files. This will work if you plan to have films connect actors but not the other way around.

*create annual credit (ijt) files
forvalues t=1900/2009 {
 use credits.dta, clear
 keep if year==`t'
 sort name
 merge name using i_key.dta
 tab _merge
 keep if _merge==3
 drop name _merge
 sort film
 merge film using j_key.dta
 tab _merge
 keep if _merge==3
 keep i j
 sort i j
 save ij_`t'.dta, replace
}

Now that you have a set of encoded annual credit files, it’s time to turn these two-mode files into one-mode edge lists.

*create dyads/collaborations (ii) by year
forvalues t=1900/2009 {
 use ij_`t'.dta, clear
 ren i ib
 sort j
 *square the matrix of each film's credits
 joinby j using ij_`t'.dta
 *eliminate auto-ties
 drop if i==ib
 *drop film titles.
 drop j
 contract i ib
 drop _freq /*optional, keep it and treat as tie strength*/
 save ii_`t'.dta, replace
}

At this point you can use “append” (followed by contract or collapse) to combine waves. Export to ASCII and knock yourself out in a program better suited for network analysis than Stata. (At least until somebody inevitably jerry-rigs SNA out of Mata). Remember that the worker and film names are stored in i_key.txt and j_key.txt.
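A minimal sketch of that last step follows. So that it runs on its own it builds two tiny stand-in files (toy_2008.dta and toy_2009.dta, hypothetical names) instead of touching the real ii_ files; with the real data you would loop over ii_1900 through ii_2009. It also assumes you dropped _freq in the annual files, so contract is simply counting how many waves each pair appears in.

```stata
* toy stand-ins for two of the annual edge lists
clear
input i ib
1 2
2 1
end
save toy_2008.dta, replace
clear
input i ib
1 2
2 1
1 3
3 1
end
save toy_2009.dta, replace

* stack the waves, then collapse repeated pairs into a single edge
clear
forvalues t=2008/2009 {
 append using toy_`t'.dta
}
contract i ib
ren _freq waves /*number of annual files containing the pair*/
* export to tab-delimited ASCII for the network package
outsheet i ib waves using toy_all.txt, replace
```

If you kept _freq as tie strength in the annual files, you would use collapse (sum) on it instead of contract so that the strengths add up across years.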

June 18, 2009 at 5:37 am 1 comment

Bride of True Tales of the IMDB!

| Gabriel |

One of the things social scientists (and the physicists who love them) like to do with IMDB is use it to build up collaboration networks, which is basically playing the Kevin Bacon game but dressed up with terms like “mean path length” and “reachability.” This dates back to the late 1990s, before the “social media” fad made for an abundance of easily downloadable (or scrapable) large-scale social networks. Believe it or not, as recently as the early 1990s network people were still doing secondary analyses of the same handful of tiny datasets they’d been using for decades. If you spent a career trying to model rivalry among a couple dozen monks or marriage alliances amongst Florentine merchant families, you would have been excited about graphing the IMDB too.

Anyway, there are a few problems with using IMDB, several of which I’ve already discussed. The main thing is that it’s really, really, really big and when you try to make it into a network it just gets ludicrous. In part this is because of a few outlier works with really large casts.

Consider the 800 pound gorilla of IMDB, General Hospital, which has been on tv since 1963 (and was on the radio long before that).* That’s 46 years of not just the gradually churning ensemble cast, but guest stars and even bit part players with one line. I forget the exact number, but something like 1000 people have appeared in General Hospital. Since the logic of affiliation networks treats all members of the affiliation as a clique, and a clique of n nodes has n(n-1)/2 edges, this is one big black mess of 1000 nodes and 499,500 edges. A ginormous clique like this can make an appreciable impact on things like the overall clustering coefficient (which in turn is part of the small world index). Likewise it can do weird things to node-level traits like centrality.

Furthermore, unless you have some really esoteric theoretical concerns, it doesn’t even make sense to think of this being a collaboration that includes both the original actors and the current stars (most of whom weren’t even born in 1963). Many of the “edges” in the clique involve people who, far from having any kind of meaningful contact, didn’t even set foot on set within four decades of each other. For a different approach, consider an article in the current issue of Connections which graphs the Dutch national soccer team (pre-print here). The article does not treat the entire history of the team as one big clique (which would make for a short article) but rather an edge is defined as appearing in the same match. Not surprisingly the resulting structure is basically a chain as the team slowly rotates out old players and in new players. Overall it reminds me of one of the towers you’d build in World of Goo. The closest it gets to breaking a structure off from the giant component is the substantial turnover over the hiatus of WW2, but aside from that it’s pretty regular.

So anyway, unless you think General Hospital is the one true ring of Hollywood I think you only have two options:

  1. Follow the approach in the Dutch soccer paper and break a long running institution into smaller contemporaneous collaborations — games for Orange and episodes for General Hospital. Unfortunately IMDB doesn’t always have episode specific data for tv shows.
  2. Drop any non-theatrical content from the dataset. One of the perennial issues in any social research, and especially networks, is bounding the population. I think you can make an excellent substantive case that the production systems for television (and pornography) are sufficiently loosely coupled from theatrical film that they don’t belong in the same network dataset.

[*updated 5/18/15, General Hospital was never on the radio. I confused it with Guiding Light which, like Amos & Andy, did make the jump from network radio to network television]

June 17, 2009 at 5:44 am 1 comment
