Archive for March, 2010

| Gabriel |

A few months ago I talked about reshaping field-tagged data and gave some clumsy advice for doing so. I’ve now written a perl script that does this more elegantly. It’s written to extract movie title (“MV”) and domestic box office (“GR”) from the IMDB file business.list, but you could adapt it to get other variables and/or work on other field-tagged data.
Basically, the script will turn this:

MV: Little Shop of Horrors (1986)

AD: 118,418 (Sweden) 

BT: USD 30,000,000 

GR: USD 34,656,704 (USA) (8 February 1987) 
GR: USD 33,126,503 (USA) (1 February 1987) 
GR: USD 30,810,276 (USA) (25 January 1987) 
GR: USD 27,781,027 (USA) (18 January 1987) 
GR: USD 23,727,232 (USA) (11 January 1987) 
GR: USD 19,546,049 (USA) (4 January 1987) 
GR: USD 11,412,248 (USA) (28 December 1986) 
GR: USD 3,659,884 (USA) (21 December 1986) 
GR: USD 38,747,385 (USA) 
GR: SEK 4,318,255 (Sweden) 

OW: USD 3,659,884 (USA) (21 December 1986) (866 screens) 

RT: USD 19,300,000 (USA) 

SD: 21 October 1985 - ? 

WG: USD 1,112,016 (USA) (8 February 1987) (871 screens) 
WG: USD 1,719,329 (USA) (1 February 1987) 
WG: USD 2,093,847 (USA) (25 January 1987) 
WG: USD 3,222,066 (USA) (18 January 1987) 
WG: USD 3,057,666 (USA) (11 January 1987) (858 screens) 
WG: USD 4,004,838 (USA) (4 January 1987) (866 screens) 
WG: USD 5,042,682 (USA) (28 December 1986) (866 screens) 
WG: USD 3,659,884 (USA) (21 December 1986) (866 screens) 


Into this:

Little Shop of Horrors (1986)	34,656,704 (USA) (8 February 1987) 
Little Shop of Horrors (1986)	33,126,503 (USA) (1 February 1987) 
Little Shop of Horrors (1986)	30,810,276 (USA) (25 January 1987) 
Little Shop of Horrors (1986)	27,781,027 (USA) (18 January 1987) 
Little Shop of Horrors (1986)	23,727,232 (USA) (11 January 1987) 
Little Shop of Horrors (1986)	19,546,049 (USA) (4 January 1987) 
Little Shop of Horrors (1986)	11,412,248 (USA) (28 December 1986) 
Little Shop of Horrors (1986)	3,659,884 (USA) (21 December 1986) 
Little Shop of Horrors (1986)	38,747,385 (USA) 

Here’s the code:

#!/usr/bin/perl by ghr
#this script cleans the IMDB file business.list
#raw data is field-tagged, key tags are "MV" (movie title) and "GR" (gross)
#record can have multiple "gross" fields, only interested in those with "(USA)"
#MV: Astronaut's Wife, The (1999)
#GR: USD 10,654,581 (USA) (7 November 1999) 
#find "MV" tag, keep in memory, go to "GR" tag and write out as "GR\tMV"

use warnings; use strict;
die "usage: <IMDB business file>\n" unless @ARGV==1;
my $rawdata = shift(@ARGV);

# if line=MV, redefine the "title" variable
# if line=GR, write out with "title" in front
#optional, screen out non "USA" gross, parse GR into 
#"currency, quantity, country, date"
my $title ;
my $gross ;
open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">gross.txt") or die "error creating gross.txt\n";
print OUT "title\tgross\n";
while (<IN>) {
	#match "MV" lines by looking for lines beginning "MV: "
	if($_=~ /^MV: /) {
		$title = $_; 
		$title =~ s/\015?\012//; #manual chomp
		$title =~ s/^MV: //; #drop leading tag
		print "$title ";
	}
	#match "GR" lines, write out with title in front
	if ($_ =~ m/^GR: USD .+\(USA\)/) {
		$gross = $_; 
		$gross =~ s/\015?\012//; #manual chomp
		$gross =~ s/^GR: USD //; #drop leading tag
		print OUT "$title\t$gross\n";
	}
}
close IN;
close OUT;
print "\ndone\n";
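The same MV/GR extraction logic is easy to sketch in Python for comparison (this is an illustration of the approach, not the script above; the `extract_gross` helper and the in-memory `sample` list are my own):

```python
import re

def extract_gross(lines):
    """Yield (title, gross) pairs from field-tagged IMDB-style lines."""
    title = None
    for line in lines:
        line = line.rstrip()                       # manual chomp, plus trailing spaces
        if line.startswith("MV: "):
            title = line[len("MV: "):]             # remember the current movie title
        elif title and re.match(r"^GR: USD .+\(USA\)", line):
            yield (title, line[len("GR: USD "):])  # keep USA grosses only

sample = [
    "MV: Little Shop of Horrors (1986)",
    "GR: USD 38,747,385 (USA) ",
    "GR: SEK 4,318,255 (Sweden) ",
]
print(list(extract_gross(sample)))
```

As in the perl version, the trick is simply to carry the most recent "MV" value forward until the next one appears.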

March 31, 2010 at 3:40 pm 4 comments

More adventures in the reliability of fixed traits

| Gabriel |

[Update: the finding is not robust. See Hannon and Defina]

Last year I described a memo by Gary Gates about some implications of coding errors for gender.

The new Social Problems has an article by Saperstein and Penner that’s comparably weird (and an impressive effort in secondary data analysis). They use data from NLSY on self-reported race over many waves and find that after a prison term an appreciable number of people who always or almost always had called themselves “white” will start calling themselves “black.” The finding seems to be a real effect rather than just random reliability problems as it’s often a sustained shift in identity over subsequent waves and the rate is strongly in one direction after the prison shock. You can’t really demonstrate validity for this kind of thing in a brief summary, but suffice it to say that if you read the article you get the very strong impression that the authors were both careful from the get go and faced demanding peer reviewers.

It’s obviously a weird finding but it makes some sense once you buy that a) identity is somewhat plastic and that b) the racial associations of crime and criminal justice could have an impact on this. One way to understand the results is that many people have multiple ancestry which they can invoke in kind of an ancestral toolkit. For instance Wentworth Miller ’95 (who given his most famous two performances could be seen as the personification of this article) might decide to emphasize his whiteness, Jewishness, blackness, or Arabness, whereas I am limited to the much narrower gamut of identifying as white or Jewish. Largely as a function of with whom I was associating at the time, I have in fact at times self-identified as mostly Irish or as mostly Jewish, but I’d always imagined that I lack the option of plausibly considering myself black. The really weird thing is that this intuition seems to be mistaken as the SP article finds that the effect is not limited to people who the authors could identify as being Hispanic or mixed-race but extends to many non-Hispanic whites with no known black ancestry.

One complication to keep in mind is that almost all American blacks have a lot of white (and to a lesser extent, American Indian) ancestry, but we still tend to consider them “black” rather than “mixed race” unless they have a very high proportion of white ancestry and/or a very white phenotype. That is, the American identity = f(ancestry) function is pretty bizarre, and it’s hard to puzzle out how that affects findings like those in the SP piece. For this and other reasons I’d love to see this kind of study replicated in a society with a racial schema other than the “one-drop” tradition.

March 23, 2010 at 5:21 am 9 comments

Ratings game

| Gabriel |

David Waguespack and Olav Sorenson have an interesting new paper on Hollywood (their earlier Hollywood paper is here) that contributes to the literature on categorization, rankings, and sensemaking that increasingly seems to be the dominant theme in econ soc. The new paper is about MPAA ratings (G, PG, PG13, R, NC17) and finds that, controlling for the salaciousness of the content, the big studios get more lenient ratings than small studios. The exact mechanism through which this occurs is hard to nail down, but it occurs even on the initial submission, so it’s not just that studios continuously edit down and resubmit the movie until they get a PG13 (which is what I would have expected). Thus the finding is similar to some of the extant literature on how private or quasi-private ranking systems can have similar effects to government mandates but adds the theoretical twist that rankings can function as a barrier to entry. This kind of thing has been suspected by the industry itself, and in fact I heard the findings discussed on “The Business” in the car and was planning to google the paper only to find that Olav had emailed me a copy while I was in transit.

Aside from the theoretical/substantive interest, there are two methods points worth noting. First, their raw data on salaciousness is a set of three Likert scales: sex, violence, and cussing. The natural thing to do would have been to just treat these as three continuous variables or even sum them to a single index. Of course this would be making the assumption that effects are additive, linear, and the intervals on the scale are consistent. They avoided this problem by creating a massive dummy set of all combinations of the three scores. Perhaps overkill, but pretty hard to second guess (unless you’re worried about over-fitting, but they present the parametric models too and everything is consistent). Second, to allow for replication, Olav’s website has a zip with their code and data (the unique salaciousness data, not the IMDB data that is available elsewhere). This is important because as several studies have shown, “available on request” is usually a myth.
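The combination-dummy idea is simple to sketch with made-up data (the variable names and scores below are invented for illustration, not taken from the paper): each unique triple of scores gets its own dummy, so no additivity, linearity, or equal-interval assumptions are imposed on the scales.

```python
films = [
    {"sex": 1, "violence": 2, "language": 1},
    {"sex": 3, "violence": 2, "language": 3},
    {"sex": 2, "violence": 1, "language": 1},
    {"sex": 3, "violence": 2, "language": 3},
]

# Enumerate the observed combinations of the three Likert scales...
combos = sorted({(f["sex"], f["violence"], f["language"]) for f in films})

# ...and build one dummy per combination, so each film gets exactly one 1.
dummy_rows = [
    [1 if (f["sex"], f["violence"], f["language"]) == c else 0 for c in combos]
    for f in films
]
print(combos)       # the observed combinations
print(dummy_rows)   # one row per film, exactly one 1 per row
```

In a real analysis you'd only generate dummies for combinations that actually occur (as above), which is what keeps the dummy set large but finite.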

March 22, 2010 at 4:27 am

Affiliation network 2 edge list

| Gabriel |

The grad student for whom I wrote dyadkey came back after realizing he had a more complicated problem than we thought. Long story short, he had a dataset where each record was an affiliation, with one of the variables being a space-delimited list of all the members of the affiliation. That is, his dataset was “wide” and basically looks like this:

j	i
a	"1 2 3 4"
b	"2 5 6"

Note that in most citation databases the data is field-tagged but the author tag itself is “wide.” Hence this code would be useful for making a co-authorship network but you’d first have to rearrange it so that the publication key is “j” and the author list is “i”.

In contrast, “long” affiliation data like IMDB is structured like this:

j	i
a	1
a	2
a	3
a	4
b	2
b	5
b	6

My student wanted to project the affiliation network into an edge list at the “i” level. As before, he only wanted each edge in once (so if we have “1 & 2” we don’t also want “2 & 1”). To accomplish this, I wrote him a program that takes as arguments the name of the “affiliation” variable and the name of the “members list” variable. It first reshapes the data to a long file, then uses joinby against itself to create all permutations, and finally drops redundant cases by keeping only one ordering of each dyad, based on the order in which members were listed in the original affiliation record. With minor modifications, the script would also work with affiliation data that starts out as long, like IMDB. Also note that it should work well in combination with stata2pajek classic (“ssc install stata2pajek”) or the version that lets you save vertex traits like color.

capture program drop affiliation2edges
program define affiliation2edges
	local affiliation `1'
	local memberslist `2'
	tempfile dataset
	tempvar sortorder
	quietly gen `sortorder'=[_n]
	sort `affiliation'
	quietly save "`dataset'"
	keep `affiliation' `memberslist'
	gen wc=wordcount(`memberslist')
	quietly sum wc
	local maxsize=`r(max)'
	forvalues word=1/`maxsize' {
		quietly gen member`word'=word(`memberslist',`word')
	}
	reshape long member, i(`affiliation') j(membernumber)
	quietly drop if member=="" /*drop the empty cells, n_dropped ~ sd(wc) */
	sort `affiliation'
	tempfile datasetlong
	quietly save "`datasetlong'"
	ren member member_a
	joinby `affiliation' using "`datasetlong'"
	ren member member_b
	ren membernumber membernumber_a
	quietly drop if member_a==member_b /*drop loops*/
	quietly gen membernumber_b=.
	forvalues w=1/`maxsize' {
		quietly replace membernumber_b=`w' if member_b==word(`memberslist',`w')
	}
	quietly drop if membernumber_a < membernumber_b /*keep only one version of a combination; ie, treat as "edges"*/
	drop membernumber_a membernumber_b `memberslist'
	lab var wc "n members in affiliation"
	sort `affiliation'
	merge m:1 `affiliation' using "`dataset'"
	disp "'from using' only are those with only one member in affiliation's memberlist"
	quietly sort `sortorder'
	quietly drop `sortorder'
end
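For comparison, here's a minimal Python sketch of the same projection (this is not the Stata program, just the underlying logic; the `wide` toy data mirrors the example above):

```python
from itertools import combinations

# "Wide" affiliation data: each record is an affiliation with a
# space-delimited members list, as in the example above.
wide = {"a": "1 2 3 4", "b": "2 5 6"}

edges = []
for affiliation, memberslist in wide.items():
    members = memberslist.split()
    # combinations() yields each unordered pair exactly once, in list
    # order, which handles the "keep only one version of each dyad" rule.
    for ego, alter in combinations(members, 2):
        edges.append((affiliation, ego, alter))
print(edges)
```

The Stata version has to fake this pairwise enumeration with joinby plus a drop, because Stata thinks in terms of whole datasets rather than rows; in a language with a combinations primitive the projection is a few lines.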

March 17, 2010 at 5:30 am 6 comments

Thanks for the Aspirin Guys

| Gabriel |

In a recent post, I lamented that I couldn’t figure out how to do loops and a few other things in R. The script at issue was intended to create a slideshow of network graphs which I’d then use Image Magick to convert to an animated gif. With some help from Brian and Kieran, I ultimately got it to work and the results are in this post (as well as a presentation at a mini-conference I made a few days later).

Kieran very generously further improved on the code at his own blog using simulated data. Kieran’s code has a certain elegance to it, but it’s an R style of elegance based on objects and lists and the like so it’s not completely intuitive to me, despite his thorough commenting. Anyway, I worked his suggestions into my code (see below) and it works great.

Since Kieran was “interested to see whether this approach was at all faster,” I ran the old version and the new version with timestamps. (To make it fair, I commented out the “pdf” version of the flipbook, which only appears in the “old” code — this saves about one second). Turns out they’re almost exactly the same speed — about 26 or 27 seconds (it varies about a second or two every time I do it, sometimes new is faster, sometimes old). The vast majority of that half a minute is taken up by Image Magick. The actual R code only takes about 7 seconds to execute in either version. I think that’s amazingly fast to import 60 or so datasets (albeit small datasets), generate two different FR layouts, apply the layouts to 60 or so graphs (albeit graphs that are written directly to disk rather than to screen), and do miscellaneous housekeeping. I haven’t run a timestamped comparison, but my impression is that this is much faster than comparable operations in Pajek or NWB and appreciably faster than doing scatterplots in Stata.

# File-Name:       chrnetwork.R                 
# Date:            2010-03-11
# Created Date:    2009-11-24                               
# Author:          Gabriel Rossman (w a fair amount of help from BR and KJH)           
# Purpose:         graph CHR station network
# Data Used:
# Packages Used:   igraph plyr   
library(igraph)
library(plyr)
#ties -- including ties to non-top40 (who can't reciprocate)
chrnet <- read.graph("", c("pajek"))
plot.igraph(chrnet, layout=layout.fruchterman.reingold, vertex.size=4, vertex.label=NA, vertex.color="red", edge.color="gray20", edge.arrow.size=0.3, margin=0)
#ties bounded to only top 40, includes adoption time color-codes, but use is optional
chrnetbounded <- read.graph("", c("pajek"))
la <- layout.fruchterman.reingold(chrnetbounded)  #create layout for use on several related graphs
#graph structure only
plot.igraph(chrnetbounded, layout=la, vertex.size=4, vertex.label=NA, vertex.color="red", edge.color="gray20", edge.arrow.size=0.3, margin=0)
#graph color coded diffusion
plot.igraph(chrnetbounded, layout=la, vertex.size=4, vertex.label=NA, edge.color="gray80", edge.arrow.size=0.3, margin=0)
filenames <- list.files(getwd(), pattern="\\.net$")
ind <- order(as.numeric(gsub("[^[:digit:]]", "", filenames)))
filenames <- filenames[ind]
g.list <- llply(filenames, read.graph, format="pajek")
png(file="~/Documents/book/images/flipbook/chrnet_hc%d.png", width=600, height=600, units="px") 
l_ply(g.list, plot, layout=la, vertex.size=4, vertex.label=NA, edge.color="gray60", edge.arrow.size=0.3, margin=0)
dev.off() #close the png device so the files get written out
### Finally, generate the gif without having to renumber the files
### individually.
png.filenames <- list.files("~/Documents/book/images/flipbook/", pattern="\\.png$")
## Resort properly again
ind <- order(as.numeric(gsub("[^[:digit:]]", "", png.filenames)))
png.filenames <- png.filenames[ind]
png.filenames <- paste("~/Documents/book/images/flipbook/", png.filenames, sep="") # add the path back
png.string <- capture.output(cat(png.filenames))
## Open a pipe to the shell to execute the convert command
## directly.
gifpipe <- pipe(paste("convert", png.string, "~/Documents/book/images/flipbook/chrnet_humps.gif", sep=" "), "w")
close(gifpipe) #closing the pipe runs the command

March 16, 2010 at 4:31 am 2 comments

Fiddler’s Green

| Gabriel |

Matt at Permutations links to a critique of the famous zombie epidemiology paper. The original paper was your basic diffusion model, which assumes spatial homogeneity, and as people like to do with these kinds of models, the critique relaxes that assumption. Specifically, the new model assumes a gradient of areas that are relatively favorable to humans, perhaps because of resource stocks, terrain that’s easily defensible with a rifle, etc. The new model then finds that the equilibrium is not total apocalypse, but a mixed equilibrium of substantial depopulation, with few zombies and a relatively large number of humans.

Being both the world’s foremost expert on the sociology of zombies and a diffusion guy, I feel obliged to weigh in on this. The new model adds a parameter of “safe areas” but assumes that “safeness” is exogenous. However, if the Romero movies have taught us anything, it’s that the defensive resources are only effective if they aren’t sabotaged by the internal squabbles of humans. (If you’re not familiar with Romero’s movies, think of what Newman from Seinfeld did in “Jurassic Park”). Thus you’d have to add another parameter, which is the probability in any given period that some jackass sabotages the defensive perimeter, steals the battle bus, etc. If such sabotage eliminates or even appreciably reduces the “safe area” efficacy then human survival in the “safe areas” is contingent on the act of sabotage not occurring. If we assume that p(sabotage) is 1% in any given month, then the probability of sabotage occurring at least once over the course of two years is 1-.99^24, which works out to 21%. That’s not bad, but if we assume a p(sabotage) per month of at least 2.9% then there’s a better than even chance that we’re fucked. Having a dim view of human nature, I don’t like those odds.
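The arithmetic above is easy to verify (a throwaway sketch; `p_sabotage_ever` is just an illustrative name):

```python
# Cumulative probability of at least one sabotage event over a horizon,
# given a constant per-month probability p_month of some jackass
# breaching the perimeter.
def p_sabotage_ever(p_month, months=24):
    return 1 - (1 - p_month) ** months

print(round(p_sabotage_ever(0.01), 3))    # p = 1%/month  -> about 0.214
print(round(p_sabotage_ever(0.029), 3))   # p = 2.9%/month -> just over 0.5
```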

So a more elaborated model would not only have to add in parameters for spatial heterogeneity, but also human sabotage. The two probably interact in that the higher the probability of sabotage, the more important it is to have many small resource islands rather than one big resource island. This may have policy implications in terms of how and where we stockpile resources in preparation for the inevitable zombie apocalypse.

March 12, 2010 at 2:06 pm 1 comment

Keep the best 5

| Gabriel |

In my undergrad lecture class I give pop quizzes, which are intended to measure both attendance and reading comprehension. Since I think an absence or two is understandable, I told the students I’d only count their best five quizzes out of seven offered. Of course this meant I had to figure out a way to calculate this, which is a lot harder to do than a simple sum. After some failed experimentation with Stata, I found this was easy to do in perl. This is because perl likes to process data row by row and its “sort()” function sorts elements in an array (rather than rows in a spreadsheet, like the “sort” commands in Stata or Excel). To view it from a Stata-centric or Excel-centric perspective, perl finds it quite natural to sort columns/variables within a row/record.

[Update: also see the comments for some solutions using Excel or R.]

Here’s the perl script

# Created by Gabriel Rossman, 2010-03-10
# this file cleans the grades by sorting the quizzes. this makes it easy to find the best 5

use strict; use warnings;
die "usage: <in.txt> <out.txt>\n" unless @ARGV == 2;

#read-write boilerplate
my $infile = shift (@ARGV);
my $outfile = shift (@ARGV);
open(IN, "<$infile") or die "error reading $infile";
open(OUT, ">$outfile") or die "error creating $outfile";
#loop to read, process, and write line by line
while (<IN>) {
	chomp; #drop \n
	my @fields = split("\t", $_); #parse the line into fields
	my $uid = shift (@fields); #lop off the first field, call it "uid" or University ID, which is my key variable
	my @sortedlist = sort {$b <=> $a} @fields; #descending sort other fields
	print OUT "$uid @sortedlist\n"; #write to disk as space-delimited text
}
close IN;
close OUT;

To execute it from the command line you’d type
perl dirtygrades.txt cleangrades.txt
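In the spirit of the Excel and R alternatives mentioned in the update, here's a minimal Python sketch of the same row-wise idea (the row layout, a uid followed by seven tab-delimited quiz scores, is assumed for illustration):

```python
# One tab-delimited gradebook row: uid, then seven quiz scores.
row = "12345\t7\t9\t0\t10\t8\t6\t9"

fields = row.split("\t")
uid, scores = fields[0], [float(x) for x in fields[1:]]

# Sort the scores within the row, descending, and keep the best five --
# the within-row sort that's awkward in Stata but natural here.
best5 = sorted(scores, reverse=True)[:5]
print(uid, sum(best5) / 5)
```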

You could easily use this script with, say, Excel, but I wrap it in a Stata script.

cd "~/Documents/w10_m176/exams/myuclagradebook/"
insheet using grades.csv, comma clear

drop if uid==. & midterm==.

sort uid
save grades.dta, replace
keep uid q1-q7
outsheet using grades.txt, nonames noquote replace

*send to perl to sort quizzes by score (by row)
shell perl grades.txt gradesclean.txt

insheet using gradesclean.txt, clear delim(" ")
ren v1 uid
ren v2 qg1
ren v3 qg2
ren v4 qg3
ren v5 qg4
ren v6 qg5
ren v7 qg6
ren v8 qg7
sort uid
merge 1:1 uid using grades
drop _merge q1-q7
gen q_top5=(qg1+qg2+qg3+qg4+qg5)/5

*have a nice day

March 12, 2010 at 5:05 am 11 comments
