Archive for March, 2010

| Gabriel |

A few months ago I talked about reshaping field-tagged data and gave some clumsy advice for doing so. I’ve now written a perl script that does this more elegantly. It’s written to extract movie title (“MV”) and domestic box office (“GR”) from the IMDB file business.list, but you could adapt it to get other variables and/or work on other field-tagged data.
Basically, the script will turn this:

MV: Little Shop of Horrors (1986)

AD: 118,418 (Sweden) 

BT: USD 30,000,000 

GR: USD 34,656,704 (USA) (8 February 1987) 
GR: USD 33,126,503 (USA) (1 February 1987) 
GR: USD 30,810,276 (USA) (25 January 1987) 
GR: USD 27,781,027 (USA) (18 January 1987) 
GR: USD 23,727,232 (USA) (11 January 1987) 
GR: USD 19,546,049 (USA) (4 January 1987) 
GR: USD 11,412,248 (USA) (28 December 1986) 
GR: USD 3,659,884 (USA) (21 December 1986) 
GR: USD 38,747,385 (USA) 
GR: SEK 4,318,255 (Sweden) 

OW: USD 3,659,884 (USA) (21 December 1986) (866 screens) 

RT: USD 19,300,000 (USA) 

SD: 21 October 1985 - ? 

WG: USD 1,112,016 (USA) (8 February 1987) (871 screens) 
WG: USD 1,719,329 (USA) (1 February 1987) 
WG: USD 2,093,847 (USA) (25 January 1987) 
WG: USD 3,222,066 (USA) (18 January 1987) 
WG: USD 3,057,666 (USA) (11 January 1987) (858 screens) 
WG: USD 4,004,838 (USA) (4 January 1987) (866 screens) 
WG: USD 5,042,682 (USA) (28 December 1986) (866 screens) 
WG: USD 3,659,884 (USA) (21 December 1986) (866 screens) 


Into this:

Little Shop of Horrors (1986)	34,656,704 (USA) (8 February 1987) 
Little Shop of Horrors (1986)	33,126,503 (USA) (1 February 1987) 
Little Shop of Horrors (1986)	30,810,276 (USA) (25 January 1987) 
Little Shop of Horrors (1986)	27,781,027 (USA) (18 January 1987) 
Little Shop of Horrors (1986)	23,727,232 (USA) (11 January 1987) 
Little Shop of Horrors (1986)	19,546,049 (USA) (4 January 1987) 
Little Shop of Horrors (1986)	11,412,248 (USA) (28 December 1986) 
Little Shop of Horrors (1986)	3,659,884 (USA) (21 December 1986) 
Little Shop of Horrors (1986)	38,747,385 (USA) 

Here’s the code:

#!/usr/bin/perl
# by ghr
# this script cleans the IMDB file business.list
# raw data is field-tagged, key tags are "MV" (movie title) and "GR" (gross)
# a record can have multiple "gross" fields; only interested in those with "(USA)"
#   MV: Astronaut's Wife, The (1999)
#   GR: USD 10,654,581 (USA) (7 November 1999)
# find "MV" tag, keep in memory, then write each "GR" line out as "MV\tGR"

use warnings; use strict;
die "usage: <IMDB business file>\n" unless @ARGV==1;
my $rawdata = shift(@ARGV);

# if line=MV, redefine the "title" variable
# if line=GR, write out with "title" in front
# optional: screen out non-"USA" gross, parse GR into
#   "currency, quantity, country, date"
my $title;
my $gross;
open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">gross.txt") or die "error creating gross.txt\n";
print OUT "title\tgross\n";
while (<IN>) {
	#match "MV" lines by looking for lines beginning "MV: "
	if ($_ =~ /^MV: /) {
		$title = $_;
		$title =~ s/\015?\012//; #manual chomp
		$title =~ s/^MV: //; #drop leading tag
		print "$title ";
	}
	#match "GR" lines, write out with the current title in front
	if ($_ =~ m/^GR: USD .+\(USA\)/) {
		$gross = $_;
		$gross =~ s/\015?\012//; #manual chomp
		$gross =~ s/^GR: USD //; #drop leading tag
		print OUT "$title\t$gross\n";
	}
}
close IN;
close OUT;
print "\ndone\n";
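The comments above mention an optional further step: parsing each GR value into currency, quantity, country, and date. As a sketch of how that parse might look (in Python rather than perl; the regex and field names are my own illustration, not anything in the original script):

```python
import re

# Illustrative regex for one "GR" line; cumulative grosses have no date,
# so the date group is optional. Not part of the original perl script.
GR_PATTERN = re.compile(
    r'^GR:\s+'
    r'(?P<currency>[A-Z]{3})\s+'      # e.g. "USD" or "SEK"
    r'(?P<amount>[\d,]+)\s+'          # e.g. "34,656,704"
    r'\((?P<country>[^)]+)\)'         # e.g. "(USA)"
    r'(?:\s+\((?P<date>[^)]+)\))?'    # e.g. "(8 February 1987)", optional
)

def parse_gr(line):
    m = GR_PATTERN.match(line.strip())
    if m is None:
        return None
    return m.groupdict()

print(parse_gr("GR: USD 34,656,704 (USA) (8 February 1987)"))
```

Note that the undated cumulative lines (like the 38,747,385 total above) parse with `date` as `None`, so you could keep or drop them depending on whether you want the weekly series or the running total.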

March 31, 2010 at 3:40 pm 4 comments

More adventures in the reliability of fixed traits

| Gabriel |

[Update: the finding is not robust. See Hannon and Defina]

Last year I described a memo by Gary Gates about some implications of coding errors for gender.

The new Social Problems has an article by Saperstein and Penner that’s comparably weird (and an impressive effort in secondary data analysis). They use data from NLSY on self-reported race over many waves and find that after a prison term an appreciable number of people who always or almost always had called themselves “white” will start calling themselves “black.” The finding seems to be a real effect rather than just random reliability problems, as it’s often a sustained shift in identity over subsequent waves and the rate is strongly in one direction after the prison shock. You can’t really demonstrate validity for this kind of thing in a brief summary, but suffice it to say that if you read the article you get the very strong impression that the authors were both careful from the get-go and faced demanding peer reviewers.

It’s obviously a weird finding but it makes some sense once you buy that a) identity is somewhat plastic and that b) the racial associations of crime and criminal justice could have an impact on this. One way to understand the results is that many people have multiple ancestry which they can invoke in kind of an ancestral toolkit. For instance Wentworth Miller ’95 (who given his most famous two performances could be seen as the personification of this article) might decide to emphasize his whiteness, Jewishness, blackness, or Arabness, whereas I am limited to the much narrower gamut of identifying as white or Jewish. Largely as a function of with whom I was associating at the time, I have in fact at times self-identified as mostly Irish or as mostly Jewish, but I’d always imagined that I lack the option of plausibly considering myself black. The really weird thing is that this intuition seems to be mistaken as the SP article finds that the effect is not limited to people who the authors could identify as being Hispanic or mixed-race but extends to many non-Hispanic whites with no known black ancestry.

One complication to keep in mind is that almost all American blacks have a lot of white (and to a lesser extent, American Indian) ancestry, but we still tend to consider them “black” rather than “mixed race” unless they have a very high proportion of white ancestry and/or a very white phenotype. That is, the American identity = f(ancestry) function is pretty bizarre, and it’s hard to puzzle out how that affects findings like those in the SP piece. For this and other reasons I’d love to see this kind of study replicated in a society with racial schemas other than the “one-drop” tradition.

March 23, 2010 at 5:21 am 9 comments

Ratings game

| Gabriel |

David Waguespack and Olav Sorenson have an interesting new paper on Hollywood (their earlier Hollywood paper is here) that contributes to the literature on categorization, rankings, and sensemaking that increasingly seems to be the dominant theme in econ soc. The new paper is about MPAA ratings (G, PG, PG13, R, NC17) and finds that, controlling for the salaciousness of the content, the big studios get more lenient ratings than small studios. The exact mechanism through which this occurs is hard to nail down, but it occurs even on the initial submission, so it’s not just that studios continuously edit down and resubmit the movie until they get a PG13 (which is what I would have expected). Thus the finding is similar to some of the extant literature on how private or quasi-private ranking systems can have similar effects to government mandates but adds the theoretical twist that rankings can function as a barrier to entry. This kind of thing has been suspected by the industry itself, and in fact I heard the findings discussed on “The Business” in the car and was planning to google the paper only to find that Olav had emailed me a copy while I was in transit.

Aside from the theoretical/substantive interest, there are two methods points worth noting. First, their raw data on salaciousness is a set of three Likert scales: sex, violence, and cussing. The natural thing to do would have been to just treat these as three continuous variables or even sum them to a single index. Of course this would be making the assumption that effects are additive, linear, and the intervals on the scale are consistent. They avoided this problem by creating a massive dummy set of all combinations of the three scores. Perhaps overkill, but pretty hard to second-guess (unless you’re worried about over-fitting, but they present the parametric models too and everything is consistent). Second, to allow for replication, Olav’s website has a zip with their code and data (the unique salaciousness data, not the IMDB data that is available elsewhere). This is important because, as several studies have shown, “available on request” is usually a myth.
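To make the dummy-set idea concrete, here is a minimal sketch (in Python, with made-up scores rather than the authors' data or code): every observed combination of the three scores gets its own indicator variable, so the model imposes no additivity, linearity, or equal-interval assumption across the scales.

```python
# Illustrative data: each film gets three Likert-type scores.
films = [
    {"sex": 2, "violence": 3, "cussing": 1},
    {"sex": 2, "violence": 3, "cussing": 1},
    {"sex": 0, "violence": 1, "cussing": 0},
]

# One dummy per unique (sex, violence, cussing) combination actually observed.
combos = sorted({(f["sex"], f["violence"], f["cussing"]) for f in films})

for f in films:
    for c in combos:
        f["d_%d_%d_%d" % c] = int((f["sex"], f["violence"], f["cussing"]) == c)

print(films[0])
```

With real data this generates a big dummy set (hence "perhaps overkill"), but it lets each combination have its own unconstrained effect.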

March 22, 2010 at 4:27 am

Affiliation network 2 edge list

| Gabriel |

The grad student for whom I wrote dyadkey came back after realizing he had a more complicated problem than we thought. Long story short, he had a dataset where each record was an affiliation, with one of the variables being a space-delimited list of all the members of the affiliation. That is, his dataset was “wide” and basically looks like this:

j	i
a	"1 2 3 4"
b	"2 5 6"

Note that in most citation databases the data is field-tagged but the author tag itself is “wide.” Hence this code would be useful for making a co-authorship network but you’d first have to rearrange it so that the publication key is “j” and the author list is “i”.

In contrast, “long” affiliation data like IMDB is structured like this:

j	i
a	1
a	2
a	3
a	4
b	2
b	5
b	6

My student wanted to project the affiliation network into an edge list at the “i” level. As before, he only wanted each edge in once (so if we have “1 & 2” we don’t also want “2 & 1”). To accomplish this, I wrote him a program that takes as arguments the name of the “affiliation” variable and the name of the “members list” variable. It first reshapes the data to a long file, then uses joinby against itself to create all permutations, and finally drops redundant cases by keeping each dyad in only one order (based on the positions of ego and alter in the original list of affiliation members). With minor modifications, the script would also work with affiliation data that starts out long, like IMDB. Also note that it should work well in combination with stata2pajek classic (“ssc install stata2pajek”) or the version that lets you save vertex traits like color.

capture program drop affiliation2edges
program define affiliation2edges
	local affiliation `1'
	local memberslist `2'
	tempfile dataset
	tempvar sortorder
	quietly gen `sortorder'=[_n]
	sort `affiliation'
	quietly save "`dataset'"
	keep `affiliation' `memberslist'
	gen wc=wordcount(`memberslist')
	quietly sum wc
	local maxsize=`r(max)'
	forvalues word=1/`maxsize' {
		quietly gen member`word'=word(`memberslist',`word')
	}
	reshape long member, i(`affiliation') j(membernumber)
	quietly drop if member=="" /*drop the empty cells, n_dropped ~ sd(wc) */
	sort `affiliation'
	tempfile datasetlong
	quietly save "`datasetlong'"
	ren member member_a
	joinby `affiliation' using "`datasetlong'"
	ren member member_b
	ren membernumber membernumber_a
	quietly drop if member_a==member_b /*drop loops*/
	quietly gen membernumber_b=.
	forvalues w=1/`maxsize' {
		quietly replace membernumber_b=`w' if member_b==word(`memberslist',`w')
	}
	quietly drop if membernumber_a < membernumber_b /*keep only one version of a combination; i.e., treat as "edges"*/
	drop membernumber_a membernumber_b `memberslist'
	lab var wc "n members in affiliation"
	sort `affiliation'
	merge m:1 `affiliation' using "`dataset'"
	disp "'from using' only are those with only one member in affiliation's memberlist"
	quietly sort `sortorder'
	quietly drop `sortorder'
end
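For readers who think in terms of combinations rather than joinby, the same projection can be sketched in a few lines of Python (illustrative data, not the student's):

```python
from itertools import combinations

# Each record is an affiliation "j" plus a space-delimited member list "i";
# emit each within-affiliation pair exactly once.
rows = [("a", "1 2 3 4"), ("b", "2 5 6")]  # the wide example from the post

edges = []
for j, members in rows:
    ilist = members.split()
    # combinations() preserves list order, so we get ("1","2") but never ("2","1")
    for ego, alter in combinations(ilist, 2):
        edges.append((j, ego, alter))

print(edges)
```

`combinations()` does in one call what the Stata program accomplishes with the joinby-then-drop-by-membernumber steps: it treats each pair as an unordered edge rather than two directed arcs.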

March 17, 2010 at 5:30 am 6 comments

Thanks for the Aspirin Guys

| Gabriel |

In a recent post, I lamented that I couldn’t figure out how to do loops and a few other things in R. The script at issue was intended to create a slideshow of network graphs which I’d then use Image Magick to convert to an animated gif. With some help from Brian and Kieran, I ultimately got it to work and the results are in this post (as well as a presentation at a mini-conference I made a few days later).

Kieran very generously further improved on the code at his own blog using simulated data. Kieran’s code has a certain elegance to it, but it’s an R style of elegance based on objects and lists and the like so it’s not completely intuitive to me, despite his thorough commenting. Anyway, I worked his suggestions into my code (see below) and it works great.

Since Kieran was “interested to see whether this approach was at all faster,” I ran the old version and the new version with timestamps. (To make it fair, I commented out the “pdf” version of the flipbook, which only appears in the “old” code — this saves about one second). Turns out they’re almost exactly the same speed — about 26 or 27 seconds (it varies about a second or two every time I do it, sometimes new is faster, sometimes old). The vast majority of that half a minute is taken up by Image Magick. The actual R code only takes about 7 seconds to execute in either version. I think that’s amazingly fast to import 60 or so datasets (albeit small datasets), generate two different FR layouts, apply the layouts to 60 or so graphs (albeit graphs that are written directly to disk rather than to screen), and do miscellaneous housekeeping. I haven’t run a timestamped comparison, but my impression is that this is much faster than comparable operations in Pajek or NWB and appreciably faster than doing scatterplots in Stata.

# File-Name:       chrnetwork.R                 
# Date:            2010-03-11
# Created Date:    2009-11-24                               
# Author:          Gabriel Rossman (w a fair amount of help from BR and KJH)           
# Purpose:         graph CHR station network
# Data Used:
# Packages Used:   igraph plyr   
library(igraph)
library(plyr)
#ties -- including ties to non-top40 (who can't reciprocate)
chrnet <- read.graph("", c("pajek"))
plot.igraph(chrnet, layout=layout.fruchterman.reingold, vertex.size=4, vertex.label=NA, vertex.color="red", edge.color="gray20", edge.arrow.size=0.3, margin=0)
#ties bounded to only top 40, includes adoption time color-codes, but use is optional
chrnetbounded <- read.graph("", c("pajek"))
la = layout.fruchterman.reingold(chrnetbounded)  #create layout for use on several related graphs
#graph structure only
plot.igraph(chrnetbounded, layout=la, vertex.size=4, vertex.label=NA, vertex.color="red", edge.color="gray20", edge.arrow.size=0.3, margin=0)
#graph color coded diffusion
plot.igraph(chrnetbounded, layout=la, vertex.size=4, vertex.label=NA, edge.color="gray80", edge.arrow.size=0.3, margin=0)
#read the per-wave network files in numeric order
filenames <- list.files(getwd(), pattern="\\.net$")
ind <- order(as.numeric(gsub("[^[:digit:]]", "", filenames)))
filenames <- filenames[ind]
g.list <- llply(filenames, read.graph, format="pajek")
#write one png per wave, then close the graphics device
png(file="~/Documents/book/images/flipbook/chrnet_hc%d.png", width=600, height=600, units="px") 
l_ply(g.list, plot, layout=la, vertex.size=4, vertex.label=NA, edge.color="gray60", edge.arrow.size=0.3, margin=0)
dev.off()
### Finally, generate the gif without having to renumber the files
### individually.
png.filenames <- list.files("~/Documents/book/images/flipbook/", pattern="\\.png$")
## Resort properly again
ind <- order(as.numeric(gsub("[^[:digit:]]", "", png.filenames)))
png.filenames <- png.filenames[ind]
png.filenames <- paste("~/Documents/book/images/flipbook/", png.filenames, sep="") # add the path back
png.string <- capture.output(cat(png.filenames))
## Open a pipe to the shell to execute the convert command
## directly.
gifpipe <- pipe(paste("convert", png.string, "~/Documents/book/images/flipbook/chrnet_humps.gif", sep=" "), "w")
close(gifpipe)

March 16, 2010 at 4:31 am 2 comments

Fiddler’s Green

| Gabriel |

Matt at Permutations links to a critique of the famous zombie epidemiology paper. The original paper was your basic diffusion model, which assumes spatial homogeneity, and as people like to do with these kinds of models, the critique relaxes that assumption. Specifically, the new model assumes a gradient of areas that are relatively favorable to humans, perhaps because of resource stocks, terrain that’s easily defensible with a rifle, etc. The new model then finds that the equilibrium is not total apocalypse but a mixed equilibrium of substantial depopulation, with few zombies and a relatively large number of surviving humans.

Being both the world’s foremost expert on the sociology of zombies and a diffusion guy, I feel obliged to weigh in on this. The new model adds a parameter of “safe areas” but assumes that “safeness” is exogenous. However, if the Romero movies have taught us anything, it’s that the defensive resources are only effective if they aren’t sabotaged by the internal squabbles of humans. (If you’re not familiar with Romero’s movies, think of what Newman from Seinfeld did in “Jurassic Park”). Thus you’d have to add another parameter, which is the probability in any given period that some jackass sabotages the defensive perimeter, steals the battle bus, etc. If such sabotage eliminates or even appreciably reduces the “safe area” efficacy then human survival in the “safe areas” is contingent on the act of sabotage not occurring. If we assume that p(sabotage) is 1% in any given month, then the probability of sabotage occurring at least once over the course of two years is 1-.99^24, which works out to 21%. That’s not bad, but if we assume a p(sabotage) per month of at least 2.9% then there’s a better than even chance that we’re fucked. Having a dim view of human nature, I don’t like those odds.
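The arithmetic above is just the complement of "no sabotage in any month." A quick sketch (Python, treating the per-month probability as constant and independent across months, as the back-of-the-envelope calculation assumes):

```python
# P(at least one sabotage over `months` periods) = 1 - (1 - p)^months
def p_sabotage(p_month, months=24):
    return 1 - (1 - p_month) ** months

print(round(p_sabotage(0.01), 2))   # ~0.21 for a 1% monthly risk
print(round(p_sabotage(0.029), 2))  # just over 0.5 for a 2.9% monthly risk
```

The 2.9% figure is the approximate break-even point: any higher monthly risk and the odds of at least one perimeter breach over two years are better than even.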

So a more elaborated model would not only have to add in parameters for spatial heterogeneity, but also human sabotage. The two probably interact in that the higher the probability of sabotage, the more important it is to have many small resource islands rather than one big resource island. This may have policy implications in terms of how and where we stockpile resources in preparation for the inevitable zombie apocalypse.

March 12, 2010 at 2:06 pm 1 comment

Keep the best 5

| Gabriel |

In my undergrad lecture class I give pop quizzes, which are intended to measure both attendance and reading comprehension. Since I think an absence or two is understandable, I told the students I’d only count their best five quizzes out of seven offered. Of course this meant I had to figure out a way to calculate this, which is a lot harder to do than a simple sum. After some failed experimentation with Stata, I found this was easy to do in perl. This is because perl likes to process data row by row and its “sort()” function sorts elements in an array (rather than rows in a spreadsheet, like the “sort” commands in Stata or Excel). To view it from a Stata-centric or Excel-centric perspective, perl finds it quite natural to sort columns/variables within a row/record.

[Update: also see the comments for some solutions using Excel or R.]

Here’s the perl script

# Created by Gabriel Rossman, 2010-03-10
# this file cleans the grades by sorting the quizzes. this makes it easy to find the best 5

use strict; use warnings;
die "usage: <in.txt> <out.txt>\n" unless @ARGV == 2;

#read-write boilerplate
my $infile = shift (@ARGV);
my $outfile = shift (@ARGV);
open(IN, "<$infile") or die "error reading $infile";
open(OUT, ">$outfile") or die "error creating $outfile";
#loop to read, process, and write line by line
while (<IN>) {
	chomp; #drop \n
	my @fields = split("\t", $_); #parse the line into tab-delimited fields
	my $uid = shift (@fields); #lop off the first field, call it "uid" or University ID, which is my key variable
	my @sortedlist = sort {$b <=> $a} @fields; #descending numeric sort of the remaining fields
	print OUT "$uid @sortedlist\n"; #write to disk as space-delimited text
}
close IN;
close OUT;

To execute it from the command line you’d type
perl dirtygrades.txt cleangrades.txt

You could easily use this script with, say, Excel, but I wrap it in a Stata script.

cd "~/Documents/w10_m176/exams/myuclagradebook/"
insheet using grades.csv, comma clear

drop if uid==. & midterm==.

sort uid
save grades.dta, replace
keep uid q1-q7
outsheet using grades.txt, nonames noquote replace

*send to perl to sort quizzes by score (by row)
shell perl grades.txt gradesclean.txt

insheet using gradesclean.txt, clear delim(" ")
ren v1 uid
ren v2 qg1
ren v3 qg2
ren v4 qg3
ren v5 qg4
ren v6 qg5
ren v7 qg6
ren v8 qg7
sort uid
merge 1:1 uid using grades
drop _merge q1-q7
gen q_top5=(qg1+qg2+qg3+qg4+qg5)/5

*have a nice day
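For comparison, the whole keep-the-best-five computation collapses to a couple of lines in a general-purpose language. A Python sketch with made-up quiz scores (not real gradebook data):

```python
# Sort a student's quiz scores descending, keep the top 5, average them.
def top5_mean(scores):
    best = sorted(scores, reverse=True)[:5]
    return sum(best) / 5

quizzes = [8, 10, 0, 9, 7, 10, 6]  # seven quizzes; the two worst get dropped
print(top5_mean(quizzes))
```

This is the same column-sort-within-row logic as the perl script, just with the averaging folded in rather than handed back to Stata.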

March 12, 2010 at 5:05 am 11 comments

Soc of mass media, week 10

| Gabriel |

On Monday I talked about cultural capital and on Wednesday I did cultural consecration. Since UCLA is on the quarter system, ten weeks and we’re done, so back to a monomaniacal devotion to research.

The best way to get the whole course (including the slides) is the iTunes U link.

March 10, 2010 at 6:25 pm

Ok, what’s your second favorite movie?

| Gabriel |

A few journalists have asked me about the recent changes to the “best picture” category at the Oscars. I’m reproducing my answer here. Also see my previous post on the increasing performativity of Oscar bait. Finally, the official version of the article on centrality and spillovers is finally out.

The Academy’s recent changes to the “best picture” category were a reaction to the increasing dominance of the category by obscure films. This has increasingly been an issue since the surprise win of “Shakespeare in Love” at the 1999 Oscars. This film was much less popular than “Saving Private Ryan” but took the Oscar in part because of an aggressive lobbying campaign from Miramax. Ever since, the Oscars have been increasingly dominated by obscure art films rather than big tentpole films for the simple reason that the small films benefit much more from Oscar attention than do the big films. Simply put, “Avatar” won’t sell any more tickets or DVDs if it wins or doesn’t win an Oscar, whereas Oscars could make a huge difference to the box office and DVD for films like “Precious” or “The Hurt Locker.” We saw this again last year in that “Slumdog Millionaire” made most of its money after the Oscars.

This dynamic has created a niche for “Oscar bait” films, which are released in November or December and often feature unpleasant (or if you prefer, challenging) material. The best example of this in the current slate is “Precious,” which is not exactly a “fun” movie. The downside for the Academy is that because audiences aren’t very interested in these movies, it depresses attention for the Oscars. Probably the breaking point was that last year none of the nominated films had made over $40 million domestic in calendar year 2008, but “Dark Knight” (which made over $530 million domestic in calendar year 2008) was not nominated. This despite the fact that “Dark Knight” was not just a popcorn movie but artistically defensible, having been well-received by critics who saw in it a lot of interesting themes about morality and moral culpability. In large part because “Dark Knight” was not nominated, the Oscars had poor ratings, lower than many episodes of the game show “American Idol.” The hope was that by expanding the nominee list and changing the voting system, the Academy could ensure that at least a few hits would be nominated and would be likely to win, thereby halting the evolution of the Oscars into a ghetto for obscure art films.

Of course the downside to expanding the nomination list is that it makes it plausible that a few broadly popular films could split the vote and a film with a cohesive minority of supporters could attain a plurality, despite broad distaste. Similar issues are seen every once in a while in politics when two mainstream political parties split the vote and seats go to an extremist party. To avoid such a possibility, the Academy has adopted an “instant runoff” voting system wherein voters do not just choose their favorite, but rank all of their choices. The tally then considers second and third choices until a film achieves a majority. The effect of the voting system should work as intended, which is to say it should ensure a consensus pick that most voters are reasonably happy with rather than a divisive pick with a few fervent fans but which is otherwise despised. The instant runoff system is less likely to produce a dark horse win than the old simple plurality system, but it’s worth noting that according to the “Arrow impossibility theorem” no voting system can reflect voter preferences with perfect accuracy.
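To make the tallying procedure concrete, here is a minimal instant-runoff sketch (Python, with invented ballots; this is the generic algorithm, not the Academy's actual procedure or data):

```python
from collections import Counter

# Each ballot ranks films best-first. Eliminate the film with the fewest
# first-place votes each round until some film holds a majority.
def instant_runoff(ballots):
    ballots = [list(b) for b in ballots]
    while True:
        firsts = Counter(b[0] for b in ballots)
        leader, votes = firsts.most_common(1)[0]
        if votes * 2 > len(ballots):
            return leader
        loser = min(firsts, key=firsts.get)
        ballots = [[f for f in b if f != loser] for b in ballots]

ballots = [
    ["Avatar", "Hurt Locker", "District 9"],
    ["Avatar", "District 9", "Hurt Locker"],
    ["Hurt Locker", "District 9", "Avatar"],
    ["Hurt Locker", "District 9", "Avatar"],
    ["District 9", "Hurt Locker", "Avatar"],
]
print(instant_runoff(ballots))
```

Note how the transfer works: "District 9" is eliminated first and its ballot flows to the voter's second choice, which is exactly the consensus-over-plurality property described above.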

In theory, one potential problem with instant runoff is strategic voting. With strategic voting, somebody might list a false second choice if they are afraid that their true second choice is likely to beat out their actual favorite. So if my favorite movie is “Hurt Locker” and my second favorite is “Avatar,” I might falsely claim that my second favorite is “District 9” because I see it as a longshot, thereby ensuring that I won’t contribute to “Avatar” beating “Hurt Locker.” In reality, I think this is unlikely to happen to any great extent because strategic voting requires a lot of coordination and the Academy is very strict about policing negative campaigning. In fact, they just censured a producer of “Hurt Locker” for implying that voters should vote against “Avatar.”

March 6, 2010 at 3:46 pm 4 comments

Soc of Mass Media, week 9

| Gabriel |

On Monday I did logics of gatekeeping in the entertainment industry: aesthetics, fads, and research. On Wednesday I did journalism “objectivity,” both how it is practiced (Tuchman) and where it came from historically (Starr).

Next week, cultural capital and cultural consecration. Since UCLA is on the quarter system, that’ll be it.

March 4, 2010 at 4:45 am
