Archive for November, 2009

Perl text library

| Gabriel |

I found this very useful library of perl scripts for text cleaning. You can use them even if you can’t code perl yourself. For instance, to transpose a dataset, just download the “transpose.pl” script to your ~/scripts directory and enter the shell command:
perl ~/scripts/transpose.pl row_col.txt > col_row.txt

The transpose script is particularly useful to me as I’ve never gotten Excel’s transpose function to work, and for some bizarre reason Stata’s “xpose” command only works with numeric variables. You can even use these scripts directly from a do-file like so:

tempfile foo1
tempfile foo2
* write the current dataset out as a text file
outsheet using `foo1'.txt
* transpose it with the perl script
shell perl ~/scripts/transpose.pl `foo1'.txt > `foo2'.txt
* read the transposed version back in
insheet using `foo2'.txt, clear


November 30, 2009 at 4:49 am 1 comment

some R baby steps

| Gabriel |

I’ve tried to learn R a few times but the syntax has always been opaque to my Stata-centric mind. I think the trick is to realize that there are two key differences:

  • whereas handles only come up in a few Stata commands (e.g., file, postfile, log), they are very important in R, what with all the “<-” statements
  • in R there’s a much fuzzier line between commands and functions than in Stata. What I mean by this is both the superficial thing of all the parentheses and also the more substantive issue that in R you usually don’t put them one to a line where they just do something (like Stata commands), but instead put several on a line and feed them into something else (like Stata functions). Related to this is that the typical Stata line has the syntax “verb object, adverb” whereas the typical R line has the syntax “object <- verb(object2, adverb)”

The two combine in an obvious way with something as simple as opening a dataset, which is just use file in Stata but is filehandle <- read.table(“file”) in R. That is, there’s no read.table() command, only a read.table() function, and you assign what it returns to a handle. (And people say R isn’t intuitive!)
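To make that concrete, here’s roughly what the pattern looks like; the file and object names below are just made-up placeholders rather than anything from my actual code:

# Stata equivalent: insheet using mydata.csv, clear
# R: read.csv() is a function that returns a data frame, which you assign to an object
mydata <- read.csv("mydata.csv")   # "mydata.csv" is a hypothetical file
summary(mydata)                    # then feed that object into other functions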

At least I think that’s a good way to think about the basic syntax — I suck at R and I really could be totally wrong about this. (Pierre or Kieran please correct me).

Anyway, I wrote my first useful R file the other day. It reads my Pajek-formatted network data on Top 40 radio stations and draws a graph.

# File-Name: testgraph.R
# Date: 2009-11-20
# Author: Gabriel Rossman
# Purpose: graph CHR station network
# Data Used: ties_bounded.net
# Packages Used: igraph
library(igraph)
setwd("~/Documents/Sjt/radio/survey")
chrnet <- read.graph("ties.net", c("pajek"))
pdf("~/Documents/book/images/chrnetworkbounded.pdf")
plot.igraph(chrnet, layout=layout.fruchterman.reingold, vertex.size=2, vertex.label=NA, vertex.color="red", edge.color="gray20", edge.arrow.size=0.3, margin=0)
dev.off()

The weird thing is that it works fine in R.app but breaks when I try to run R from the Terminal, regardless of whether I try to do it all in one line or first invoke R and then feed it the script. [Update: the issue is a 32-bit library and 64-bit R; the simple solution is to invoke “R32” rather than just plain “R”. See the comments for details.] Here’s a session with both problems:

gabriel-rossmans-macbook-2:~ rossman$ Rscript ~/Documents/book/stata/testgraph.R
Error in dyn.load(file, DLLpath = DLLpath, ...) :
unable to load shared library '/Library/Frameworks/R.framework/Resources/library/igraph/libs/x86_64/igraph.so':
dlopen(/Library/Frameworks/R.framework/Resources/library/igraph/libs/x86_64/igraph.so, 10): Symbol not found: ___gmpz_clear
Referenced from: /Library/Frameworks/R.framework/Resources/library/igraph/libs/x86_64/igraph.so
Expected in: dynamic lookup

Error : .onLoad failed in 'loadNamespace' for 'igraph'
Error: package/namespace load failed for 'igraph'
Execution halted
gabriel-rossmans-macbook-2:~ rossman$ R 'source("~/Documents/book/stata/testgraph.R")'
ARGUMENT 'source("~/Documents/book/stata/testgraph.R")' __ignored__

R version 2.10.0 (2009-10-26)
Copyright (C) 2009 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> source("~/Documents/book/stata/testgraph.R")
Error in dyn.load(file, DLLpath = DLLpath, ...) :
unable to load shared library '/Library/Frameworks/R.framework/Resources/library/igraph/libs/x86_64/igraph.so':
dlopen(/Library/Frameworks/R.framework/Resources/library/igraph/libs/x86_64/igraph.so, 10): Symbol not found: ___gmpz_clear
Referenced from: /Library/Frameworks/R.framework/Resources/library/igraph/libs/x86_64/igraph.so
Expected in: dynamic lookup

 

Error : .onLoad failed in 'loadNamespace' for 'igraph'
Error: package/namespace load failed for 'igraph'
>

The problem seems to be that R (terminal) can’t find the igraph library. This is weird because R.app has no trouble finding it. Furthermore, I get the same error even if I make sure igraph is installed directly from R (Terminal) in the same R session:

chooseCRANmirror(graphics = FALSE)
install.packages("igraph")
source("/Users/rossman/Documents/book/stata/testgraph.R")

I guess that’s another difference with Stata: StataConsole knows where the ado library is. I’d like to be able to use the Terminal mode for R, as this would let me reach my Nirvana-like goal of having a single script that does everything without any proximate human intervention. So I’ll just ask: How do I get R (Terminal) to run as reliably as R.app? Is this a naive question?

Or would it be better to try to feed R.app a “source” script from the command line? Much like how I can do this for Stata to launch a do-file into the Stata GUI:
exec /Applications/Stata/StataMP.app/Contents/MacOS/stataMP ~/Documents/book/stata/import.do

November 24, 2009 at 4:42 am 11 comments

I am shocked–shocked–to find scientists abusing peer review

| Gabriel |

A major climate lab in Britain was hacked (leaked?) last week and a lot of the material was really embarrassing: stuff along the lines of obstructing freedom of information requests, smoothing messy data, and using peer review and shunning to freeze out contradictory perspectives. From the WaPo write-up:

“I can’t see either of these papers being in the next IPCC report,” Jones writes. “Kevin and I will keep them out somehow — even if we have to redefine what the peer-review literature is!”

In another, Jones and Mann discuss how they can pressure an academic journal not to accept the work of climate skeptics with whom they disagree. “Perhaps we should encourage our colleagues in the climate research community to no longer submit to, or cite papers in, this journal,” Mann writes.

“I will be emailing the journal to tell them I’m having nothing more to do with it until they rid themselves of this troublesome editor,” Jones replies.

All I can say is:

Most people have been looking at this in terms of the science or politics of climate change, but I’m completely with Robin Hanson in thinking that those are non sequiturs and what’s really interesting about this is the (office) politics of science. I mean, is anyone who has ever been through peer review at all surprised to hear that peer reviewers can be malicious assholes willing to use power plays to effect closure against minority perspectives?

On the other hand, while I think this is an affront to decency, it doesn’t really give me severe problems as a matter of scientific epistemology. Sure, I’d rather that scientists approached the J. S. Mill ideal of the market of ideas in the spirit of “let me hear you out and then if I’m still unconvinced I’ll give you my good faith rebuttal.” Nonetheless, I’m enough of a Quinean/Kuhnian to think that science isn’t about isolated findings but about the big picture, and that the dominant perspective is probably still right, even if its adherents aren’t themselves exactly Popperians actively seeking out (and failing to find) evidence against their perspective.

November 23, 2009 at 4:39 pm

12 weeks of culture

| Gabriel |

Jenn posted her draft syllabus for grad soc of culture / cultural sociology. It looks like about the best survey of the literature you could get in 12 weeks and about a thousand pages of material. Aside from simply choosing good readings, she’s managed to organize them into weeks in a way that imposes a good sense of order on an often messy and amorphous set of issues. She also does a much better job than I do of covering all parts of the field. (My syllabus is unabashedly production-centric, as it’s part of a two-quarter sequence with a sister course taught by a colleague on meaning-centric approaches.) I highly recommend it to any grad student prepping for a field exam or faculty member prepping a course.

November 23, 2009 at 2:57 pm

A few stats pedagogy notes

| Gabriel |

I’ve found the OS X zoom feature to be very effective when teaching stats. Most of the time I have the projector at full resolution (so any given thing on it looks small), but when I want to show a piece of code or output I just mouse over to it and zoom (hold “control” and scroll-wheel or two-finger swipe up). This lets me keep my Mac at its regular resolution and use Stata and TextMate in class instead of setting it to a lower resolution and/or pasting zoomed screenshots into PowerPoint or Keynote. This both has a more improvisatory feel and cuts down on the purely technical aspects of course prep.

Speaking of PowerPoint/Keynote, one of the problems with teaching code is that you lose syntax highlighting. However, you can keep it by copying the code from a text editor as RTF.

Finally, via Kai Arzheimer, I see the new site Teaching With Data, which includes both sample datasets and pedagogical materials.

November 17, 2009 at 4:47 am

Programming notes

| Gabriel |

I gave a lecture to my grad stats class today on Stata programming. Regular readers of the blog will have heard most of this before but I thought it would help to have it consolidated and let more recent readers catch up.

Here are my lecture notes. [Updated link 10/14/2010]

November 12, 2009 at 3:33 pm 2 comments

Science (esp. econ) made fun

| Gabriel |

In a review essay, Vromen talks about the (whodathunkit) popular book/magazine-column/blog genre of economics-made-fun that’s become a huge hit with the mass audience in the last 5 to 10 years. Although Vromen doesn’t mention it, this can be seen as a special case of the science-can-be-fun genre (e.g., Stephen Jay Gould’s short essays that use things like Hershey bars and Mickey Mouse to explain reasonably complex principles of evolutionary biology).

Vromen makes a careful distinction from the older genre of economists-can-be-funny (currently exemplified by the stand-up economist), which is really a special case of the general genre of scientists doing elaborate satires of their own disciplines for the benefit of their peers. There is an entire journal of this, but my all-time favorite example is a satire of mid-20th century psychology in the form of a review of the literature on when people are willing to pass the salt at the dinner table. Two excerpts from the “references” section should suffice to convince you to click the link and read the whole thing.

  • Festinger, R. “Let’s Give Some Subjects $20 and Some Subjects $1 and See What Happens.” Journal for Predictions Contrary to Common Sense 10, 1956, pp. 1-20.
  • Milgram, R. “An Electrician’s Wiring Guide to Social Science Experiments.” Popular Mechanics 23, 1969, pp. 74-87.

If you don’t remember what Festinger and Milgram actually did in the 50s and 60s this won’t be funny, but if you do it’s hilarious. Hence, the scientists-can-be-funny genre is a self-deprecating genre for an audience of insiders that simultaneously demonstrates the joker’s mastery of the field and the field’s foibles. In contrast, the science-can-be-fun genre is targeted to a mass audience and is about demonstrating the elegance and power of the field. The former inspires humility among practitioners, the latter awe among the yokels.

One of the interesting things about the econ-made-fun literary genre is that it is largely orthogonal to any theoretical distinction within scholarly economics. The most prominent “econ made fun” practitioners span such theoretical areas as applied micro (Levitt), behavioral (Ariely), and Austrian (Cowen). In part because the “econ made fun” genre exploded at about the same time as the Kahneman Nobel and in part because “econ made fun” tends to focus on unusual substantive issues (i.e., anything but financial markets), this has led a lot of people to conflate “econ made fun” and behavioral econ. I’ve heard Steve Levitt referred to as a “behavioral economist” several times. This drives me crazy because, at a theoretical level, behavioral economics is the opposite of applied micro, and in fact Levitt has done important work suggesting that behavioral econ may not generalize very well from the lab to the real world. That people (including people who ought to know better) nonetheless refer to him as a “behavioral economist” suggests to me that in the popular imagination literary genre is vastly more salient than theoretical content.

I myself occasionally do the “sociologists can be funny” genre (see here, here, and here) but these are basically elaborate deadpan in-jokes and I am under no illusions that anyone without a PhD would find them at all funny. I have no idea how to go about writing “sociology can be fun” (this is probably the closest I’ve come) along the lines of Levitt/Dubner or Harford, nor to be honest do I see any other sociologist doing it particularly well. There are plenty of sociologists who try to speak to a mass audience, but the tone tends to be professorial exposition or political exhortation rather than amusement at the surprising intricacy of social life. Fortunately Malcolm Gladwell has an intense and fairly serious interest in sociology and is very talented at making our field look fun.

November 10, 2009 at 4:40 am 1 comment

Team Sorting

| Gabriel |

Tyler Cowen links to an NBER paper by Hoxby showing that in recent decades status sorting has gotten more intense for college. Cowen asks “is this a more general prediction in a superstars model?” The archetypal superstar system is Hollywood, and here’s my quick and dirty stab at answering Tyler’s question for that field. Faulkner and Anderson’s 1987 AJS article showed that there is a lot of quality sorting in Hollywood, but they didn’t give a time trend. As shown in my forthcoming ASR article with Esparza and Bonacich, there are big team spillovers, so this is something we ought to care about.

I’m reusing the dataset from our paper, which is a subset of IMDB for Oscar-eligible films (basically, theatrically released non-porn) from 1936 to 2005. If I were doing it for publication I’d do it better (i.e., I’d allow the data to have more structure and I’d build confidence intervals from randomness), but for exploratory purposes the simplest way to measure sorting is to see whether a given film had at least one prior Oscar nominee among its writers, its director, and its cast. From that I can calculate the odds-ratio of having an elite peer in the other occupation, that is, the cross-product of the 2×2 table: the number of films with prior nominees in both occupations times the number with neither, divided by the product of the two mixed cells.

Overall, a movie that has at least one prior nominee writer has 7.3 times the odds of having a prior nominee director and 4.4 times the odds of having a prior nominee cast, relative to other films. A cast with a prior nominee has 6.5 times the odds of having a prior nominee director. Of course we already knew there was a lot of sorting from Faulkner and Anderson; the question suggested by Hoxby/Cowen is what the effects look like over time.

This little table shows odds-ratios for cast-director (cd), writer-director (wd), and writer-cast (wc). Big numbers mean more intense sorting.

     +--------------------------------------------+
     |    decade         cd         wd         wc |
     |--------------------------------------------|
  1. | 1936-1945   6.545898   6.452388   4.306554 |
  2. | 1946-1955   9.407476   6.425553   5.368151 |
  3. | 1956-1965   12.09229   8.741302   6.720059 |
  4. | 1966-1975   4.697238   5.399081   4.781106 |
  5. | 1976-1985   4.113508   6.984528   4.450109 |
  6. | 1986-1995   4.923809   7.599852   3.301461 |
  7. | 1996-2005   4.826018   12.35915   3.641975 |
     +--------------------------------------------+

The trend is a little complicated. For collaborations between Oscar-nominated casts on the one hand and either writers or directors on the other, the sorting is most intense in the 1946-1955 decade and especially the 1956-1965 decade. My guess is that this is tied to the decline of the studio system and/or the peak power of MCA. The odds-ratio of getting a prior-nominee director for nominated vs. non-nominated writers also jumps around the end of the studio system, but there seems to be a second jump starting in the 1980s. My guess is that this is an artifact of the increasing number of writer-directors (see Baker and Faulkner AJS 1991), but it’s an empirical question.

Putting aside the writer-director thing, it seems that sorting is not growing stronger in Hollywood. My guess is that ever more intense sorting is not a logical necessity of superstar markets, but has to do with contingencies, such as the rise of a national market for elite education in Hoxby’s case or the machinations of Lew Wasserman and Olivia de Havilland in my case.

The Stata code is below (sorry that WordPress won’t preserve the whitespace). The data are at the film level, with dummies for having at least one prior nominee in each of the three occupations.

global parentpath "/Users/rossman/Documents/oscars"

capture program drop makedecade
program define makedecade
    * collapse year into decade categories and attach value labels
    gen decade=year
    recode decade 1900/1935=. 1936/1945=1 1946/1955=2 1956/1965=3 1966/1975=4 1976/1985=5 1986/1995=6 1996/2005=7
    capture lab drop decade
    lab def decade 1 "1936-1945" 2 "1946-1955" 3 "1956-1965" 4 "1966-1975" 5 "1976-1985" 6 "1986-1995" 7 "1996-2005"
    lab val decade decade
end

cd $parentpath

capture log close
log using $parentpath/sorting_analysis.log, replace

use sorting, clear
makedecade

*do odds-ratio of working w oscar nom, by own status

capture program drop allstar
program define allstar
    preserve
    * if a decade code is passed as the first argument, restrict to that decade
    if "`1'"!="" {
        keep if decade==`1'
    }
    * cross-tab each pair of occupations and compute the odds-ratio
    * from the 2x2 cell counts stored by matcell()
    tabulate cast director, matcell(CD)
    local pooled_cd=(CD[2,2]*CD[1,1])/(CD[1,2]*CD[2,1])
    tabulate writers director, matcell(WD)
    local pooled_wd=(WD[2,2]*WD[1,1])/(WD[1,2]*WD[2,1])
    tabulate writers cast, matcell(WC)
    local pooled_wc=(WC[2,2]*WC[1,1])/(WC[1,2]*WC[2,1])
    * append the three odds-ratios (and the decade code) to a text file
    shell echo "`pooled_cd' `pooled_wd' `pooled_wc' `1'" >> sortingresults.txt
    restore
end

shell echo "cd wd wc decade" > sortingresults.txt
quietly allstar
forvalues t=1/7 {
quietly allstar `t'
}

insheet using sortingresults.txt, delimiter(" ") names clear
lab val decade decade

 

*have a nice day

November 8, 2009 at 7:27 pm 4 comments

A Note on the Uses of Official Statistics

| Gabriel |

They are ourselves, I replied; and they see only the shadows of the images which the fire throws on the wall of the den; to these they give names, and if we add an echo which returns from the wall, the voices of the passengers will seem to proceed from the shadows.  — Plato

One of the points I like to stress to my grad students is that data is not an objective (or even unbiased) representation of reality but the result of a social process. The WSJ had a story recently on how we get the “jobs created or saved” figures around the stimulus bill, and it makes me want to burn my Stata DVD, take a two-hour shower, and then switch to qualitative methods, where at least I know that I would be responsible for any validity problems in my work.

The idea of “jobs created or saved” by a government policy is a meaningful concept in principle but in practice it’s essentially impossible to reckon with any certainty. It’s the kind of problem you might be able to approach empirically if it happened many times and there was some relatively exogenous instrument, but in a single instance you’re probably better off using an answer derived from theory than actually trying to measure it. Nonetheless the political process demands that it be answered empirically and the results are absurd.

The way the government has tried to measure “jobs created or saved” by the stimulus is by simply asking contractors or subcontractors how many jobs were created or saved in their firm by the contract. This involves both false positives of contractors exaggerating the number of jobs they created or saved and false negatives of firms that were not direct beneficiaries of contracts but increased or retained production in expectation of benefiting from the multiplier. In the case covered by the WSJ, a shoe store that sold nine pairs of boots for $100 each to the Army Corps of Engineers didn’t know what else to put and so said it saved nine jobs. When asked about this by the WSJ, the shoe store owner’s daughter/bookkeeper replied:

“The question, I would like to know is: How do you answer that? Did we create zero? Is it creating a job because they have boots and go out and work for the Corps? I would be really curious to hear how somebody does create a job. The formula is out there for anyone to create, and it’s just so difficult,” she said.

Who’d a thunk it, but apparently FA Hayek was reincarnated as a shoe store worker in Kentucky.

(h/t McArdle)

November 4, 2009 at 1:46 pm 4 comments

Astro-baptists

| Gabriel |

On NPR the other day I heard a story about how a lobbyist forged letters to Congress from the NAACP and AAUW opposing the Waxman-Markey cap-and-trade bill. I thought this was amusing on several levels, only the first of which is that apparently the bill wasn’t convoluted and toothless enough to buy off all of the incumbent stakeholders, as some of them hired this guy. The real interest, though, is that the blatant absurdity of this story heightens the basic bootlegger-and-Baptist coalition dynamic, in that in this case the bootlegger was so desperate for a Baptist that he imagined one, much as the too-good-to-be-true quotes conjured by fabulist reporters heighten the absurd genre conventions of journalism.

The bootlegger and Baptist model is a part of public choice theory that argues that policy making often involves a coalition between stakeholders motivated by rent-seeking and ideologues with principled positions. In the titular example, the policy is blue laws, which would be supported both by Baptists who don’t like booze violating the sabbath and by clandestine alcohol entrepreneurs delighted to see demand pushed from legitimate retailers to the black market. We had something close to a literal bootlegger-Baptist model with the Abramoff scandal, in which various gambling interests paid the Christian Coalition to kneecap the competition. Another recent prominent example is that, before being airbrushed out of history for having, ahem, unorthodox political affiliations, Van Jones was best known for “green jobs,” which can be uncharitably described as a bit of political entrepreneurship proposing a grand bargain in which his constituents would get patronage jobs in exchange for supporting green policies.

Although bootlegger-Baptist is an econ model, soc and OB folks independently arrived at this same model by noting that resource dependence on the state is not a pure Tullock lottery, but is contingent on facial legitimacy. If you read chapter 8 of External Control of Organizations you’ll see that it’s not only the bridge between resource dependence and neo-institutionalism, but also a bootlegger-Baptist model avant la lettre.

One of the interesting things is that lately civil rights groups seem to have been the (real or imagined) Baptists of choice, and not just in the anti-Waxman-Markey forgery. So for instance a few weeks ago 72 Democratic Congressmen sent a letter to the FCC opposing net neutrality. It’s not surprising that the blue dogs were among them, as you’d expect fiscal conservatives to oppose a new regulation. The interesting thing is that the letter was also signed by most of the Congressional Black Caucus, as well as “the Hispanic Technology and Telecommunications Partnership, the National Association for the Advancement of Colored People (NAACP), the Asian American Justice Center.” Their (plausible) logic was essentially that preventing telecoms from charging content providers would delay the rollout of broadband and therefore maintain the digital divide. So here we have rent-seeking telecoms, hoping to soak content providers and prevent competition from VOIP, forming a coalition with civil rights groups and their legislative allies, who have a principled commitment to eliminating inequality in the use of technology.

I got total déjà vu when I read this, as the exact same thing happened a few years ago when Nielsen was attacked by the Don’t Count Us Out Coalition. The backstory is that Nielsen and Arbitron traditionally rely on diaries to collect the audience data that is used to set advertising rates. Unfortunately respondents are too lazy/stupid to complete diaries accurately. In recognition of this problem both Arbitron and Nielsen have been trying to switch to more accurate passive monitoring techniques that aren’t dependent on the diligence and recall of the respondent, but they still use diaries for sweeps.

Nielsen had the bright idea of the Local People Meter project, which would eliminate sweeps diaries in the largest media markets and rely entirely on a large continuous rolling sample using passive monitoring. This implies a substantial improvement in data quality for a large part of the advertising market. This sounds like a good thing but Nielsen found itself attacked by the “Don’t Count Us Out Coalition” which argued that Nielsen was a racist monopoly, mostly on the basis that in one or two of the test markets for LPM they undersampled blacks. The “Coalition” got some serious support in Congress until Nielsen was able to demonstrate that it was just an astroturf* group set up by NewsCorp, which stood to see a ratings drop under the improved technology. (Or more technically, the new technology would reveal that the old technology had been exaggerating the ratings of NewsCorp properties. Peterson and Anand have a great article on a similar dynamic in recorded music sales).

—–

*Given the rather promiscuous way that people throw around the term “astroturf” it’s necessary to clarify the term. I reserve the term “astroturf” exclusively for fax machine and letterhead operations organized by a lobbyist, PR firm, or the like. It is not analytically useful to extend the term to cover things like the tea parties, where elites mobilize ordinary people to come and protest. If you want to distinguish such things from the Platonic ideal of grassroots mobilization, fine, call them “fertilizing the grassroots” or something, but astroturf they ain’t. Likewise, it is lazy and slanderous conspiracy-mongering to assume without further evidence that anyone who takes the same position on an issue as a stakeholder must of course be bought by the stakeholder. If you want to echo Orwell and call such people “objectively pro-X” then fine, but that doesn’t mean the Baptist lacks principled reasons for siding with the bootlegger on a particular issue.

November 4, 2009 at 4:31 am 1 comment
