Posts tagged ‘IMDB’

Strange Things Are Afoot at the IMDb

| Gabriel |

I was helping a friend check something on IMDb for a paper and so we went to the URL that gives you the raw data. We found it’s in a completely different format than it was last time I checked, about a year ago.

The old data will be available until November 2017. I suggest you grab a complete copy while you still can.

Good news: The data is in a much simpler format, being six wide tables that are tab-separated row/column text files. You’ll no longer need my Perl scripts to convert them from a few dozen files that are a weird mish mash of field-tagged format and the weirdest tab-delimited text you’ve ever seen. Good riddance.

Bad news: It’s hard to use. S3 is designed for developers not end users. You could download the old version with Chrome or “curl” from the command line. The new version requires you to create an S3 account and as best I can tell, there’s no way to just use the S3 web interface to get it. There is sample Java code, but it requires supplying your account credentials which gives me cold sweat flashbacks to when Twitter changed its API and my R scrape broke. Anyway, bottom line being you’ll probably need IT to help you with this.

Really bad news: A lot of the files are gone. There’s no country by country release dates, no box offices, no plot keywords, there are only up to three genres, no distributor or production company, etc. These are all things I’ve used in publications.


September 8, 2017 at 2:29 pm 3 comments

Oscar Appeal

| Gabriel |

This post contains two Stata do-files for constructing the “Oscar appeal” variable at the center of Rossman & Schilke “Close But No Cigar.”


July 29, 2013 at 8:06 am 5 comments

| Gabriel |

As previously remarked, IMDb files have a weird structure that ain’t exactly ready to rock. I already posted a file for dealing with business.list (which could also be modified to work with files like certificates.list). The personnel files (actors.list, actresses.list, directors.list, writers.list, etc) look like this:

Gilligan, Vince		2-Face (2013)  (screenplay)
			A.M.P.E.D. (2007) (TV)  (writer)
			Hancock (2008)  (written by)  <1,2,1>
			Home Fries (1998)  (written by)  <1,1,1>
			The X Files: Revelations (2008) (V)  (written by) (segment "Bad Blood")  <8,1,1>
			The X Files: Revelations (2008) (V)  (written by) (segment "Memento Mori")  <6,1,3>
			Wilder Napalm (1993)  (written by)  <1,1,1>
			"Breaking Bad" (2008)  (creator)
			"Breaking Bad" (2008) {(#3.12)}  (creator)  <1,1,1>
			"Breaking Bad" (2008) {(#3.13)}  (creator)  <1,1,1>

Whereas we’re used to data that looks like this:

Gilligan, Vince	2-Face (2013)  (screenplay)
Gilligan, Vince	A.M.P.E.D. (2007) (TV)  (writer)
Gilligan, Vince	Hancock (2008)  (written by)  <1,2,1>
Gilligan, Vince	Home Fries (1998)  (written by)  <1,1,1>
Gilligan, Vince	The X Files: Revelations (2008) (V)  (written by) (segment "Bad Blood")  <8,1,1>
Gilligan, Vince	The X Files: Revelations (2008) (V)  (written by) (segment "Memento Mori")  <6,1,3>
Gilligan, Vince	Wilder Napalm (1993)  (written by)  <1,1,1>
Gilligan, Vince	"Breaking Bad" (2008)  (creator)
Gilligan, Vince	"Breaking Bad" (2008) {(#3.12)}  (creator)  <1,1,1>

Of course that’s still not complete since ideally you want to parse the title of the work (eg “Breaking Bad” (2008) ) from details of the artist’s contribution to the work (eg (creator) ). Likewise, depending on what your analysis is about you might want to drop certain kinds of works entirely. (I usually drop the porn, television, and direct to video ASAP). However you can do all that from within Stata (assuming memory isn’t an issue, which it might be) and this script will suffice to get you that far:

#!/usr/bin/perl by ghr
#this script cleans IMDB personnel files (eg, writers.list)
#works best if you delete the header (about the first 300 lines)
#raw data is organized by artist with
# "ARTIST\t\tCREDIT" for the first credit (though sometimes w a single tab) and
# subsequent records are "\t\t\tCREDIT"
#this script makes all rows "ARTIST\tCREDIT" and drops blank rows
#the resulting file is about 20% larger than the original but has a simpler structure that is easier for other programs (eg Stata) to read
#further cleaning would parse the "CREDIT" field but the contents of "CREDIT" 
#vary by personnel file
#in all files "CREDIT" begins with "FILM TITLE (YEAR)" but has further info
# eg, writers.list distinguishes screenplay vs story, etc and actors.list gives character name, etc

use warnings; use strict;
die "usage: <IMDB personnel file>\n" unless @ARGV==1;
my $rawdata = shift(@ARGV);

# if $_ matches leading non-tab, redefine the "artist" variable
# if $_ matches 3 leading tabs, drop two tabs and add current "artist"
my $artist ;
open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">$rawdata.tsv") or die "error creating $rawdata.tsv\n";
print OUT "artist\tcredit\n";
while (<IN>) {
	#match beginning of artist's credits by looking for lines NOT beginning with a tab
	if($_=~ /^[^\t].+\t.+/) {
		$artist = $_; 
		$artist =~ s/\015?\012//; #manual chomp
		$artist =~ s/\t.+$//; #drop the tab(s) and all else after it 
		$_ =~ s/\t\t/\t/; #go from two tabs to one
		print OUT "$_";
	#match subsequent credits (three leading tabs)
	if ($_ =~ m/^\t\t\t/) {
		$_ =~ s/^\t\t\t//; #drop leading tabs
		print OUT "$artist\t$_";
	#when matching blank line, clear "artist"
	if ($_ =~ m/^$/) {
		$artist = "";
close IN;
close OUT;
print "\ndone\n";
#have a nice day

July 26, 2010 at 4:13 am 2 comments

| Gabriel |

A few months ago I talked about reshaping field-tagged data and gave some clumsy advice for doing so. I’ve now written a perl script that does this more elegantly. It’s written to extract movie title (“MV”) and domestic box office (“GR”) from the IMDB file business.list, but you could adapt it to get other variables and/or work on other field-tagged data.
Basically, the script will turn this:

MV: Little Shop of Horrors (1986)

AD: 118,418 (Sweden) 

BT: USD 30,000,000 

GR: USD 34,656,704 (USA) (8 February 1987) 
GR: USD 33,126,503 (USA) (1 February 1987) 
GR: USD 30,810,276 (USA) (25 January 1987) 
GR: USD 27,781,027 (USA) (18 January 1987) 
GR: USD 23,727,232 (USA) (11 January 1987) 
GR: USD 19,546,049 (USA) (4 January 1987) 
GR: USD 11,412,248 (USA) (28 December 1986) 
GR: USD 3,659,884 (USA) (21 December 1986) 
GR: USD 38,747,385 (USA) 
GR: SEK 4,318,255 (Sweden) 

OW: USD 3,659,884 (USA) (21 December 1986) (866 screens) 

RT: USD 19,300,000 (USA) 

SD: 21 October 1985 - ? 

WG: USD 1,112,016 (USA) (8 February 1987) (871 screens) 
WG: USD 1,719,329 (USA) (1 February 1987) 
WG: USD 2,093,847 (USA) (25 January 1987) 
WG: USD 3,222,066 (USA) (18 January 1987) 
WG: USD 3,057,666 (USA) (11 January 1987) (858 screens) 
WG: USD 4,004,838 (USA) (4 January 1987) (866 screens) 
WG: USD 5,042,682 (USA) (28 December 1986) (866 screens) 
WG: USD 3,659,884 (USA) (21 December 1986) (866 screens) 


Into this:

Little Shop of Horrors (1986)	34,656,704 (USA) (8 February 1987) 
Little Shop of Horrors (1986)	33,126,503 (USA) (1 February 1987) 
Little Shop of Horrors (1986)	30,810,276 (USA) (25 January 1987) 
Little Shop of Horrors (1986)	27,781,027 (USA) (18 January 1987) 
Little Shop of Horrors (1986)	23,727,232 (USA) (11 January 1987) 
Little Shop of Horrors (1986)	19,546,049 (USA) (4 January 1987) 
Little Shop of Horrors (1986)	11,412,248 (USA) (28 December 1986) 
Little Shop of Horrors (1986)	3,659,884 (USA) (21 December 1986) 
Little Shop of Horrors (1986)	38,747,385 (USA) 

Here’s the code:

#!/usr/bin/perl by ghr
#this script cleans the IMDB file business.list
#raw data is field-tagged, key tags are "MV" (movie title) and "GR" (gross)
#record can have multiple "gross" fields, only interested in those with "(USA)"
#MV: Astronaut's Wife, The (1999)
#GR: USD 10,654,581 (USA) (7 November 1999) 
#find "MV" tag, keep in memory, go to "GR" tag and write out as "GR\tMV"

use warnings; use strict;
die "usage: <IMDB business file>\n" unless @ARGV==1;
my $rawdata = shift(@ARGV);

# if line=MV, redefine the "title" variable
# if line=GR, write out with "title" in front
#optional, screen out non "USA" gross, parse GR into 
#"currency, quantity, country, date"
my $title ;
my $gross ;
open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">gross.txt") or die "error creating gross.txt\n";
print OUT "title\tgross\n";
while (<IN>) {
	#match "MV" lines by looking for lines beginning "MV: "
	if($_=~ /^MV: /) {
		$title = $_; 
		$title =~ s/\015?\012//; #manual chomp
		$title =~ s/^MV: //; #drop leading tag
		print "$title ";
	#match "GR" lines, write out with clid
	if ($_ =~ m/^GR: USD .+\(USA\)/) {
		$gross = $_; 
		$gross =~ s/\015?\012//; #manual chomp
		$gross =~ s/^GR: USD //; #drop leading tag
		print OUT "$title\t$gross\n";
close IN;
close OUT;
print "\ndone\n";

March 31, 2010 at 3:40 pm 4 comments

Ratings game

| Gabriel |

David Waguespack and Olav Sorenson have an interesting new paper on Hollywood (their earlier Hollywood paper is here) that contributes to the literature on categorization, rankings, and sensemaking that increasingly seems to be the dominant theme in econ soc. The new paper is about MPAA ratings (G, PG, PG13, R, NC17) and finds that, controlling for the salacious of the content, the big studios get more lenient ratings than small studios. The exact mechanism through which this occurs is hard to nail down but it occurs even on the initial submission so it’s not just that studios continuously edit down and resubmit the movie until they get a PG13 (which is what I would have expected). Thus the finding is similar to some of the extant literature on how private or quasi-private ranking systems can have similar effects to government mandates but adds the theoretical twist that rankings can function as a barrier to entry. This kind of thing has been suspected by the industry itself, and in fact I heard the findings discussed on “The Business” in the car and was planning to google the paper only to find that Olav had emailed me a copy while I was in transit.

Aside from the theoretical/substantive interest, there are two methods points worth noting. First, their raw data on salaciousness is a set of three Likert scales: sex, violence, and cussing. The natural thing to do would have been to just treat these as three continuous variables or even sum them to a single index. Of course this would be making the assumption that effects are additive, linear, and the intervals on the scale are consistent. They avoided this problem by creating a massive dummy set of all combinations of the three scores. Perhaps overkill, but pretty hard to second guess (unless you’re worried about over-fitting, but they present the parametric models too and everything is consistent). Second, to allow for replication, Olav’s website has a zip with their code and data (the unique salaciousness data, not the IMDB data that is available elsewhere). This is important because as several studies have shown, “available on request” is usually a myth.

March 22, 2010 at 4:27 am

Team Sorting

| Gabriel |

Tyler Cowen links to an NBER paper by Hoxby that shows that in recent decades, status sorting has gotten more intense for college. Cowen asks “is this a more general prediction in a superstars model?” The archetypal superstar system is Hollywood, and here’s my quick and dirty stab at answering Tyler’s question for that field. Faulkner and Anderson’s 1987 AJS showed that there is a lot of quality sorting in Hollywood, but they didn’t give a time trend. As shown in my forthcoming ASR with Esparza and Bonacich, there are big team spillovers so this is something we ought to care about.

I’m reusing the dataset from our paper, which is a subset of IMDB for Oscar eligible films (basically, theatrically-released non-porn) from 1936-2005. If I were doing it for publication I’d do it better (i.e., I’d allow the data to have more structure and I’d build confidence intervals from randomness), but for exploratory purposes the simplest way to measure sorting is to see if a given film had at least one prior Oscar nominee writer, director, and actor. From that I can calculate the odds-ratio of having an elite peer in the other occupation.

Overall, a movie that has at least one prior nominee writer is 7.3 times more likely than other films to have a prior nominee director and 4.4 times more likely to have a prior nominee cast. A cast with a prior nominee is 6.5 times more likely to have a prior nominee director. Of course we already knew there was a lot of sorting from Faulker and Anderson, the question suggested by Hoxby/Cowen is what are the effects over time?

This little table shows odds-ratios for cast-director, writer-director, and writer-cast. Big numbers mean more intense sorting.

...| decade    cd       wd       wc       |
1. | 1936-1945 6.545898 6.452388 4.306554 |
2. | 1946-1955 9.407476 6.425553 5.368151 |
3. | 1956-1965 12.09229 8.741302 6.720059 |
4. | 1966-1975 4.697238 5.399081 4.781106 |
5. | 1976-1985 4.113508 6.984528 4.450109 |
6. | 1986-1995 4.923809 7.599852 3.301461 |
7. | 1996-2005 4.826018 12.35915 3.641975 |

The trend is a little complicated. For collaborations between Oscar-nominated casts on the one-hand and either writers or directors, the sorting is most intense in the 1946-1955 decade and especially the 1956-1965 decade. My guess is that this is tied to the decline of the studio system and/or the peak power of MCA. The odds-ratio of good director for nom vs non-nom writers also has a jump around the end of the studio system, but it seems there’s a second jump starting in the 80s. My guess is that this is an artifact of the increasing number of writer-directors (see Baker and Faulkner AJS 1991), but it’s an empirical question.

Putting aside the writer-director thing, it seems that sorting is not growing stronger in Hollywood. My guess is that ever more intense sorting is not a logical necessity of superstar markets, but has to do with contingencies, such as the rise of a national market for elite education in Hoxby’s case or the machinations of Lew Wasserman and Olivia deHavilland in my case.

The Stata code is below. (sorry that wordpress won’t preserve the whitespace). The data consists of film-level data with dummies for having at least one prior nominee for the three occupations.

global parentpath "/Users/rossman/Documents/oscars"

capture program drop makedecade
program define makedecade
gen decade=year
recode decade 1900/1935=. 1936/1945=1 1946/1955=2 1956/1965=3 1966/1975=4 1976/1985=5 1986/1995=6 1996/2005=7
capture lab drop decade
lab def decade 1 "1936-1945" 2 "1946-1955" 3 "1956-1965" 4 "1966-1975" 5 "1976-1985" 6 "1986-1995" 7 "1996-2005"
lab val decade decade

cd $parentpath

capture log close
log using $parentpath/sorting_analysis.log, replace

use sorting, clear

*do odds-ratio of working w oscar nom, by own status

capture program drop allstar
program define allstar
if "`1'"!="" {
keep if decade==`1'
tabulate cast director, matcell(CD)
local pooled_cd=(CD[2,2]*CD[1,1])/(CD[1,2]*CD[2,1])
tabulate writers director, matcell(WD)
local pooled_wd=(WD[2,2]*WD[1,1])/(WD[1,2]*WD[2,1])
tabulate writers cast, matcell(WC)
local pooled_wc=(WC[2,2]*WC[1,1])/(WC[1,2]*WC[2,1])
shell echo "`pooled_cd' `pooled_wd' `pooled_wc' `1'" >> sortingresults.txt

shell echo "cd wd wc decade" > sortingresults.txt
quietly allstar
forvalues t=1/7 {
quietly allstar `t'

insheet using sortingresults.txt, delimiter(" ") names clear
lab val decade decade


*have a nice day

November 8, 2009 at 7:27 pm 4 comments


| Gabriel |

[Update: I’ve rewritten the command to be more flexible and posted it to ssc. to get it type “ssc install shufflevar”. this post may still be of interest for understanding how to apply the command].

Sometimes you face a situation where it’s really hard to see what the null is because the data structure is really complicated and there is all sorts of nonlinearity, etc. Analyses of non-sparse square network matrices can use the quadratic assignment procedure, but you can do something similar with other data structures, including bipartite networks.

A good null keeps everything constant, but shows what associations we would expect were association random. The simplest way to do this is to keep the actual variable vectors but randomly sort one of the vectors. So for instance, you could keep the actual income distribution and the actual values of peoples’ education, race, etc, but randomly assign actual incomes to people.

Fernandez, Castilla, and Moore used what was basically this approach to build a null distribution of the effects of employment referrals. Since then Ezra Zuckerman has used it in several papers on Hollywood to measure the strength of repeat collaboration. I myself am using it in some of my current radio work to understand how much corporate clustering we’d expect to see in the diffusion of pop songs under the null hypothesis that radio corporations don’t actually practice central coordination.

I wrote a little program that takes the argument of the variable you want shuffled. It has a similar application as bsample, and like bsample it’s best used as part of a loop.

capture program drop shufflevar
program define shufflevar
  local shufflevar `1'
  tempvar oldsortorder
  gen `oldsortorder'=[_n]
  tempvar newsortorder
  gen `newsortorder'=uniform()
  sort `newsortorder'
  capture drop `shufflevar'_shuffled
  gen `shufflevar'_shuffled=`shufflevar'[_n-1]
  replace `shufflevar'_shuffled=`shufflevar'[_N] in 1/1
  sort `oldsortorder'
  drop `newsortorder' `oldsortorder'

Here’s an example to show how much clustering of “y” you’d expect to see by “clusterid” if we keep the observed distributions of “y” and “clusterid” but break any association between them:

shell echo "run rho" > _results_shuffled.txt

forvalues run=1/1000 {
  disp "iteration # `run' of 1000"
  quietly shufflevar clusterid
  quietly xtreg y, re i(clusterid_shuffled)
  shell echo "`run' `e(rho)'" >> _results_shuffled.txt

insheet using _results_shuffled.txt, names clear delimiter(" ")
histogram rho
sum rho

(Note that “shell echo” only works with Mac/Unix, Windows users should try postfile).

October 26, 2009 at 5:09 am 3 comments

Older Posts

The Culture Geeks