Archive for June, 2009

The Workflow of Data Analysis Using Stata

| Gabriel |

I recently read Scott Long’s new book The Workflow of Data Analysis Using Stata and I highly recommend it. One of the ironies of graduate education in the social sciences is that we spend quite a bit of time trying to explain things like standard error but largely ignore that on a modal day quantitative research is all about data management and programming. Although Long is too charitable to mention it, one of the reasons to emphasize these issues is that many of the notorious horror stories of quantitative research do not involve modeling but data management. For instance, “88” was an unnoticed missing value code, not actual data on senescent priapism; it was a weighting error that led to wildly exaggerated estimates of post-divorce income effects; and, most recently, findings about anomie were at least in part an artifact of a NORC missing data coding error.

By focusing on these largely neglected but critical data management issues, Long has done a service to the discipline. The publication of it may even reduce Indiana’s comparative advantage of producing hotshot quant PhDs now that grad students elsewhere can vicariously benefit from this important aspect of the training there. Certain aspects of it aren’t relevant to everyone (e.g., his section on value labels is most applicable to surveys with lots of Likert scales) but almost any serious quant is likely to find an enormous amount of clearly presented useful information.

For many of the issues the book addresses he shows a highly efficient and reliable way to do things. This is a service because many self-taught people will satisfice with a clunky and inefficient technique, even though with a little more upfront effort (an upfront effort greatly reduced by this book) they could avoid both effort and error in the long run. “Chapter 4: Automating Your Work” is particularly good in this respect. Since I lacked the benefit of a copy of this book time-warped to 1997, I used Stata for years before I learned the “program” and “foreach” syntax. Even until now, I’d never understood how to use matrices (which is why this script is so hideously clunky, really, please don’t click the link) but Long has a very clear explanation of how to use all of these programming constructs. In the future I think my scripts will be much more elegant for having read his book, and especially chapter 4.

A less obvious contribution is that in several places he suggests standards. For instance, he suggests several missing data codes to distinguish between different types of missing data (coding error, skip code, respondent refused, etc). The particular codes he provides are necessarily arbitrary but no less useful for it because standards benefit from network externalities and it would make data analysis much easier if Stata users harmonized on these standards. Therefore the important thing is to have a remotely sensible standard, regardless of what it is.
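To make the point about distinct missing data codes concrete, here is a minimal sketch in Python rather than Stata (Stata itself offers extended missing values such as .a and .b for this purpose). The codes 88 and 99, the category names, and the helper functions are all hypothetical illustrations, not the particular standard Long proposes.

```python
from enum import Enum


class Missing(Enum):
    """Hypothetical categories of missing data (not Long's actual codes)."""
    REFUSED = "respondent refused"
    SKIPPED = "skip pattern"
    CODING_ERROR = "coding error"


# Hypothetical codebook entry: raw numeric codes that really mean "missing".
AGE_MISSING_CODES = {88: Missing.REFUSED, 99: Missing.SKIPPED}


def recode_missing(values, missing_codes):
    """Replace raw missing-value codes with labeled Missing markers."""
    return [missing_codes.get(v, v) for v in values]


def mean_observed(values):
    """Average only the genuinely observed values."""
    observed = [v for v in values if not isinstance(v, Missing)]
    return sum(observed) / len(observed)


raw_ages = [34, 88, 51, 99, 29]
naive_mean = sum(raw_ages) / len(raw_ages)  # treats 88 and 99 as real ages
clean_ages = recode_missing(raw_ages, AGE_MISSING_CODES)
clean_mean = mean_observed(clean_ages)      # uses only 34, 51, and 29
```

The design choice worth noticing is that the reason for missingness survives the cleaning step, so a later analysis can still distinguish refusals from skips instead of lumping them together or, worse, averaging them in as data.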

Despite my enthusiasm I had a few differences of opinion and style. The main one is that the book reads something like a series of clear but nonetheless relatively discrete pieces of advice with only implicitly unifying themes. Over the course of a 200+ page book even consistently good advice starts to feel like one thing after another.

I think it might have made more sense and been more engaging to lay out a short list of principles for good code in the introduction. Then throughout the text each particular technique or standard could be shown as a manifestation of one or more of these general rules. Here is my own attempt to codify the general principles that at present are only implicit. Over the next few weeks I’ll elaborate on how these principles manifest in the book.

  1. The project should be replicable. (As Hillel said, “this is the whole law, the rest is commentary.”)
  2. Document your work by doing everything through adequately commented, organized, and archived scripts.
  3. Treat the raw data files as read-only.
  4. Good code will let you make changes in one place and see those changes propagate. (Note: Long embraces this principle within a single version of a single script, but otherwise sees this as a bug not a feature. As I’ll discuss in a few days, I disagree with him on the trade-offs involved in this issue).
  5. Good code is modular.

June 29, 2009 at 5:48 am 5 comments


| Gabriel |

The House is scheduled to vote today on the Waxman-Markey CO2 cap and trade bill. The interesting thing about this bill is that it’s watered down to ensure passage, meaning that even if it works exactly as intended, it will have only trivial direct impacts on climate change. Some supporters of the bill acknowledge that it will have minimal direct effects, but argue that it will have much greater indirect effects because it will show global leadership such that the BRIC countries will sign on for CO2 limits of their own. That is, we can view the bill as a big bet on the macro institutionalist world polity theory associated with John Meyer’s group at Stanford.

I think this is a very interesting bet because cap and trade is a hard case for macro institutionalism. The typical case examined by macro institutionalism involves countries signing onto vague but pleasant sounding treaties which they don’t necessarily plan to enforce (i.e. “decoupling”). The classic example is that North Korea signs any human rights treaty you put in front of it and several countries that practice genital mutilation have signed and ratified the UN women’s rights treaty.

On the other hand there are limits to the decoupling, especially in countries with rule of law. (This is why the US has not ratified the women’s rights treaty, because we know that unlike Saudi Arabia our courts would actually enforce it and unlike Canada or European countries we don’t want them to). The classic example of this is that Japan signed the indigenous rights treaty, thinking of it as cheap grace on the grounds that they had no indigenous minorities who might demand rights, only to discover later that the Ainu were able to effectively use it to form an identity and make enforceable claims on the state.

So in order for Waxman-Markey to have an appreciable effect on climate change, it would have to motivate other countries not only to pass similar legislation but to do so effectively rather than just symbolically. The thing is that, unlike protecting the rights of imaginary indigenous minorities, serious CO2 reduction is really expensive. This is why you see a certain amount of decoupling both in Waxman-Markey and in earlier European Union cap and trade regulation. In both cases incumbent emitters have been essentially grandfathered in with grants of permits so as to buy them off from mobilizing to oppose the legislation. Given that China’s social contract amounts to “the CCP provides jobs and increasing prosperity, the people don’t challenge its legitimacy” and the CCP is willing to do things like currency manipulation to keep this going, I find it very difficult to believe that they will stop building coal plants and replace them with windmills because America and the EU showed that this was the legitimate practice for a nation-state.

My prediction is that the institutionalist model for climate change legislation will succeed in the sense that many countries will pass it, in part inspired by America’s example (which in turn was inspired by the EU’s example). However institutionalism also predicts that it will largely fail in the sense that such legislation will (like the American and EU laws) be substantially decoupled from actual practice, containing substantial carve-outs and exemptions. In other words, if you need real estate in Bangladesh, you’re better off leasing rather than buying.

June 26, 2009 at 5:12 am 2 comments

Would you buy a car from these guys?

| Gabriel |

This week’s episode of EconTalk (host Russ Roberts, guest Mike Munger) discusses the relationship between GM and its local dealers. Car dealers are franchisees and in most states the manufacturer can only cancel the franchise agreement by buying out the dealer. Thus in the short to medium run it can be less costly to keep a brand going at a loss than to close it (and buy out the dealers). More broadly, the argument goes that under corporatism GM was making so much money that it didn’t mind rent-seeking from its stakeholders. However once GM started to face serious competition from the Japanese in the late 70s it was so constrained by these arrangements that it was unable to adapt. Probably the most curious case is the launch of the Saturn nameplate in the early 1990s, where GM seemed to have some good ideas about how to imitate Toyota, but rather than tackling the Herculean task of directly taking on its stakeholders by applying those ideas to, say, Chevy, GM tried to create a new nameplate that would escape the historical relations with its stakeholders. (It didn’t work).

Let me count the ways that I enjoyed this episode:

  • It’s a nice respite from the many episodes they’ve had on macroeconomics over the last year.
  • The guys are adopting a path dependency argument in that if you’re going to be a company operating in neoliberalism, there’s a huge difference between having been born under neoliberalism vs making a transition from corporatism. I thought this was interesting coming from EconTalk, which is usually skeptical of friction arguments.
  • Mike Munger and Russ Roberts have great chemistry. This episode sounds like an ordinary conversation where they honestly don’t understand why the situation is turning out like this and are trying to tease it out. In some cases they note an argument seems plausible until you consider a piece of counter-evidence and table the issue as undecided. (This reflects Roberts’ trend over the last few months towards almost agnostic intellectual humility). Many older episodes with Munger have the didactic feel of Plato’s dialogues, or rather, what Plato’s dialogues would have sounded like if Socrates and Alcibiades continually riffed to see who could outdo the other in elaborate sarcastic imagery. (I’ve listened to their gouging conversation three times and I laugh every time).

June 25, 2009 at 5:14 am 1 comment

Frialator research methods

| Gabriel |

AdAge notes that market research consistently finds that consumers say they want more healthy options, yet mysteriously people buy fried chicken, not grilled chicken. In other news, voters want more government services but lower taxes and Augustine prayed for chastity, but not yet.

The only real question to me is whether opinion/marketing survey and focus group respondents have clear self-awareness of their preferences but obfuscate them for reasons of social desirability bias. The alternative is that people do not really understand their own latent preferences (which are only made manifest in interaction) and when asked to articulate their preferences they fall back onto cultural scripts. My own vote goes for the “we don’t understand our own preferences” model as I think of myself as being an idiot far more often than a liar and I generalize from that n of 1.

June 24, 2009 at 4:58 am 1 comment

Underneath it all

| Gabriel |

A few years ago I had a friendly argument with Jenn Lena and Pete Peterson about their ASR article on genre trajectories. While I generally love that article, my one minor quibble is their position that there is such a thing as non-genre music, and in particular that “pop” can be considered unmarked, in genre terms. They write “Not all commercial music can be properly considered a genre in our sense of the term.” They exclude Tin Pan Alley (showtunes) and go on to write that, “Much the same argument holds for pop and teen music. At its core, pop music is music found in Billboard magazine’s Hot 100 Singles chart. Songs intended for the pop music market usually have their distinguishing genre characteristics purposely obscured or muted in the interest of gaining wider appeal.”

Myself, I disagree with treating pop as beyond genre. First, the Hot 100 is an aggregate without any real meaning as a categorical marker. I find it interesting that in radio it’s increasingly prevalent to refer to “Top 40” as “Contemporary Hits Radio” in recognition of the fact that in the literal sense top 40 hasn’t existed for decades. Many bands who are very popular would nonetheless not get played in CHR and many bands (think Britney Spears) only get played in CHR, implying that CHR is itself a genre of what we might call “high pop.” Billboard itself distinguishes between the Hot 100 (whatever is really popular, regardless of genre) and Top 40 Mainstream (CHR).

Second, and more importantly, it is impossible to have non-genre music in the same way that it is impossible to have language-less speech if you take the Howard Becker perspective that genre is about having sufficient shared understandings and expectations so as to allow coordination between actors. Consider the fact that most genres work on the Buddy Holly model of long-lasting bands who write their own songs whereas high pop almost exclusively involves project-based collaborations of songwriters, session musicians, producers, and (most salient to the audience) singers. Since standards are especially important when the collaborations are ephemeral, coordination through strong shared expectations is more important in high pop than in genre music. Likewise, high pop sounds more monotonous than much genre-based music. Furthermore, high pop is not merely the baseline, but involves specialized skills and techniques (e.g., vocal filters) not found in “genres.”

For the most part this issue is orthogonal to the argument they present in the article (which is why I like the article despite this dispute) but I think it potentially creates problems for the IST (Industry-> Scene-> Traditional) trajectory, most of which involves a spin-off of high pop music (as is seen most clearly with the Nashville Sound, which was basically Tin Pan Alley with cowboy hats). In response to this Pete said that there is a distinction between pop and genre in that with pop change is gradual and more Lamarckian than the creative destruction and churn seen with genres. I think this definition is fair enough, and certainly it’s highly relevant to their purposes. So the question of whether it is possible to have non-genre music ultimately comes down to whether you choose to emphasize churn or shared expectations as the defining feature of genre.

Anyway, I was reminded of this discussion a few days ago when my wife and I went to see No Doubt. This band has had 8 singles on the Billboard 100 chart and had multiple singles in four different Billboard format charts (rhythmic, CHR, adult, modern rock) so I think they are a fair candidate for what Jenn and Pete have in mind as “pop.” However the performance I attended made it apparent that at their core they are ultimately still a ska band. Most obviously, during one of Gwen’s costume changes the band did a cover of The Specials’ arrangement of “Guns of Navarone” and when she came back she was wearing what can only be described as a two-tone sequined romper and later on she wore a metallic Fred Perry shirt and braces (worn hanging). More generally all of their dancing was based on ska steps, their rhythm section dominates their lead guitar, and they had a horn section and keyboard (tuned as an organ).

In a sense, I think you can take No Doubt as a vindication of what Jenn and Pete are arguing. Here you have a band that started out within genre music but graduated into commercial success by recording unmarked pop. Note that their return to ska/dancehall with “Rock Steady” didn’t sell nearly as many copies as the mostly pop albums “Tragic Kingdom” and “Return of Saturn”. However there’s also the interesting fact that when Gwen decided to dive headfirst into high pop, she did so as a “solo” act, which in effect meant that she went from collaborating with Tony Kanal to doing so with Dr Dre and the Neptunes. I take Gwen’s solo career as a vindication for my perspective, the idea being that going into high pop involves not just the negative act of losing the markings and skills of genre and becoming generic music (which presumably Kanal could have done), but the positive act of acquiring the markings and skills of high pop (which required soliciting the efforts of high pop specialists like the Neptunes).

Special bonus armchair speculation!

Compare and contrast No Doubt and Dance Hall Crashers. Both are up-tempo California ska bands that started in the late 80s and have girl singers (two of them in the case of DHC). Although this is necessarily disputable, I would submit that c. 1995 (when No Doubt broke), DHC was the more talented band. Likewise, DHC has the better pedigree, being (along with Rancid) the successors to Operation Ivy. So why is it that Gwen Stefani rather than Elyse Rogers or Karina Denike is the one who ultimately became a world class pop star and an entrepreneur of overpriced designer fauxriental baby clothes?

I have three speculations, listed below in rough order of how much credence I give each of them:

  1. Looking for an explanation is futile because cultural markets are radically stochastic. If you have two talented bands it is literally impossible to predict ex ante which will become popular and in some alternate universe DHC are gazillionaires whereas No Doubt is known only to aficionados of California 90s music.
  2. Jenn and Pete are right and the issue is that No Doubt was better at transcending genre. Noteworthy in this respect is that basically all of DHC’s music is skacore whereas from their very first recordings No Doubt has always included elements of disco and pop, including AC-friendly Tin-Pan-Alley-esque ballads like “Don’t Speak” that it’s pretty hard to imagine DHC playing.
  3. There’s a cluster economy explanation in that No Doubt is from Orange County (which c. 1994 was supposed to be the next Seattle) whereas DHC is from the East Bay.

June 23, 2009 at 5:42 am 1 comment

p(gay married couple | married couple reporting same sex)

| Gabriel |

Over at Volokh, Dale Carpenter reproduces an email from Gary Gates (who unfortunately I don’t know personally, even though we’re both faculty affiliates of CCPR). In the email, Gates disputes a Census report on gay couples that Carpenter had previously discussed, arguing that many of the “gay” couples were actually straight couples who had coding errors for gender. This struck me as pretty funny, in no small part because in grad school my advisor used to warn me that no variable is reliable, even self-reported gender. (Paul, you were right). More broadly, this points to the problems of studying small groups. (Gays and lesbians are about 3% of the population; the famous 10% figure is a myth based on Kinsey’s use of convenience/purposive sampling).

Of course the usual problem with studying minorities is how to recruit a decent sample size in such a way that still approximates a random sample drawn from the (minority) population. If you take a random sample of the population and then do a screening question (“do you consider yourself gay”) you’re facing a lot of expense and also problems of refusal if the screener involves stigma because refusal and social desirability bias will be higher on a screener than if the same question is asked later on in the interview. On the other hand if you just direct your sample recruitment to areas where your minority is concentrated you’ll save a lot of time but you will also be getting only members of the minority who experience segregation, which is unfortunate as gays who live in West Hollywood are very different from those who live in Northridge, American Indians who live on reservations are very different from those who live in Phoenix, etc. Both premature screeners involving stigma and recruitment by concentrated area are likely to lead to recruiting unrepresentative members of the group on such dimensions as salience of the group identity.

These problems are familiar nightmares to anyone who knows survey methods. However the case described by Gates in response to Carpenter (and the underlying Census study) presents a wholly new issue: when you are dealing with a small class you can have problems even if sampling is not a problem and even if measurement error in defining the class is minimal. Really this is the familiar Bayesian problem that when you are dealing with low baseline probability events, even reasonably accurate measures can lead to false positives outnumbering true positives. The usual example given in statistics/probability textbooks is that if few people actually have a disease and you have a very accurate test for that disease, nonetheless the large majority of people who initially test positive for this disease will ultimately turn out to be healthy. Similarly, if straight marriages are much more common than gay marriages then it can still be that most so-called gay marriages are actually coding errors of straight marriages, even if the odds of a miscoded household roster for a given straight marriage are very low.
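The arithmetic behind the title of this post can be sketched in a few lines of Python. The counts and the error rate below are made-up round numbers chosen for illustration, not figures from the Census report or from Gates’s email, and the function name is my own.

```python
def share_true_positives(n_rare, n_common, error_rate):
    """P(truly in the rare class | recorded as in the rare class).

    Assumes a record is miscoded with probability `error_rate`
    regardless of which class it truly belongs to.
    """
    true_positives = n_rare * (1 - error_rate)   # rare cases coded correctly
    false_positives = n_common * error_rate      # common cases miscoded as rare
    return true_positives / (true_positives + false_positives)


# Hypothetical inputs: 100,000 same-sex married couples, 60 million
# different-sex married couples, and a 0.2% chance that one spouse's
# sex is miscoded on the household roster.
share = share_true_positives(
    n_rare=100_000,
    n_common=60_000_000,
    error_rate=0.002,
)
print(f"Share of recorded same-sex couples that are real: {share:.1%}")
```

Under these assumed numbers the miscoded straight couples (120,000) outnumber the correctly coded gay couples (99,800), so fewer than half of the couples recorded as same-sex actually are, even though the roster is right 99.8% of the time.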

June 22, 2009 at 8:03 am

Where was this published? Who cares? Viva Jeremy!

| Gabriel |

In honor of Jeremy’s election to the publications committee, I’m posting a BibTeX style file that incorporates his campaign promise to abolish the anachronistic “place of publication” field from ASA citation style. The file is hand-modified from the Dierkes and Louch style file of the soon to be defunct ASA citation style.

Because it’s a particularly long bit of code it’s below the fold. (more…)

June 19, 2009 at 10:50 pm 1 comment
