imdb_personnel.pl

July 26, 2010 at 4:13 am 2 comments

| Gabriel |

As previously remarked, IMDb files have a weird structure that ain’t exactly ready to rock. I already posted a file for dealing with business.list (which could also be modified to work with files like certificates.list). The personnel files (actors.list, actresses.list, directors.list, writers.list, etc) look like this:

Gilligan, Vince		2-Face (2013)  (screenplay)
			A.M.P.E.D. (2007) (TV)  (writer)
			Hancock (2008)  (written by)  <1,2,1>
			Home Fries (1998)  (written by)  <1,1,1>
			The X Files: Revelations (2008) (V)  (written by) (segment "Bad Blood")  <8,1,1>
			The X Files: Revelations (2008) (V)  (written by) (segment "Memento Mori")  <6,1,3>
			Wilder Napalm (1993)  (written by)  <1,1,1>
			"Breaking Bad" (2008)  (creator)
			"Breaking Bad" (2008) {(#3.12)}  (creator)  <1,1,1>
			"Breaking Bad" (2008) {(#3.13)}  (creator)  <1,1,1>

Whereas we’re used to data that looks like this:

Gilligan, Vince	2-Face (2013)  (screenplay)
Gilligan, Vince	A.M.P.E.D. (2007) (TV)  (writer)
Gilligan, Vince	Hancock (2008)  (written by)  <1,2,1>
Gilligan, Vince	Home Fries (1998)  (written by)  <1,1,1>
Gilligan, Vince	The X Files: Revelations (2008) (V)  (written by) (segment "Bad Blood")  <8,1,1>
Gilligan, Vince	The X Files: Revelations (2008) (V)  (written by) (segment "Memento Mori")  <6,1,3>
Gilligan, Vince	Wilder Napalm (1993)  (written by)  <1,1,1>
Gilligan, Vince	"Breaking Bad" (2008)  (creator)
Gilligan, Vince	"Breaking Bad" (2008) {(#3.12)}  (creator)  <1,1,1>

Of course that’s still not complete since ideally you want to parse the title of the work (eg “Breaking Bad” (2008) ) from details of the artist’s contribution to the work (eg (creator) ). Likewise, depending on what your analysis is about you might want to drop certain kinds of works entirely. (I usually drop the porn, television, and direct to video ASAP). However you can do all that from within Stata (assuming memory isn’t an issue, which it might be) and this script will suffice to get you that far:

#!/usr/bin/perl
#imdb_personnel.pl by ghr
#this script cleans IMDB personnel files (eg, writers.list)
#works best if you delete the header (about the first 300 lines)
#raw data is organized by artist with
# "ARTIST\t\tCREDIT" for the first credit (though sometimes w a single tab) and
# subsequent records are "\t\t\tCREDIT"
#this script makes all rows "ARTIST\tCREDIT" and drops blank rows
#the resulting file is about 20% larger than the original but has a simpler structure that is easier for other programs (eg Stata) to read
#further cleaning would parse the "CREDIT" field but the contents of "CREDIT" 
#vary by personnel file
#in all files "CREDIT" begins with "FILM TITLE (YEAR)" but has further info
# eg, writers.list distinguishes screenplay vs story, etc and actors.list gives character name, etc

use warnings; use strict;
die "usage: imdb_personnel.pl <IMDB personnel file>\n" unless @ARGV==1;
my $rawdata = shift(@ARGV);

# if $_ matches leading non-tab, redefine the "artist" variable
# if $_ matches 3 leading tabs, drop two tabs and add current "artist"
my $artist ;
open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">$rawdata.tsv") or die "error creating $rawdata.tsv\n";
print OUT "artist\tcredit\n";
while (<IN>) {
	#match beginning of artist's credits by looking for lines NOT beginning with a tab
	if($_=~ /^[^\t].+\t.+/) {
		$artist = $_; 
		$artist =~ s/\015?\012//; #manual chomp
		$artist =~ s/\t.+$//; #drop the tab(s) and all else after it 
		$_ =~ s/\t\t/\t/; #go from two tabs to one
		print OUT "$_";
	}
	#match subsequent credits (three leading tabs)
	if ($_ =~ m/^\t\t\t/) {
		$_ =~ s/^\t\t\t//; #drop leading tabs
		print OUT "$artist\t$_";
	}
	#when matching blank line, clear "artist"
	if ($_ =~ m/^$/) {
		$artist = "";
	}
}
close IN;
close OUT;
print "\ndone\n";
#have a nice day

Entry filed under: Uncategorized. Tags: , , .

Diffusion Simulation (now with Network Structure) Diffusion Simulation [updated]

2 Comments

  • 1. impxia  |  January 25, 2012 at 6:10 am

    I’m having problems with the TV and porn data. How do you identify them?

    • 2. gabrielrossman  |  January 25, 2012 at 8:38 am

      First, use the titles themselves. The regular expression “\(T?VG?\)” will identify TV, direct to video, and video games. Also, anything with literal quotes is TV, even if they left out the “(TV)” code.

      Next, merge with the “genres.list” file. Anything coded as “adult” is porn.

      That will take care of 99% of the problem but a few will remain. If you want to further limit the data to only the kind of movies you’ve heard of (no student films, no foreign films w/o US release, etc) the easiest way is to use the release dates file. This only works well since the mid 1980s, prior to that there’s too much missing data even for major films.



%d bloggers like this: