Son of True Tales of the IMDB!

June 18, 2009 at 5:37 am

| Gabriel |

Continuing with the discussion of IMDB networks …

Although it’s pretty much futile to get Stata to calculate any network parameters beyond degree centrality, it’s actually good at cleaning collaboration network data and converting it to an edge list. You can then export this edge list to a package better suited for network analysis like Pajek, Mathematica, or any of several packages written in R or SAS.

The IMDB is a bipartite network where the worker is one mode and the film is the other. Presumably you’ll be reducing this to a one-mode network, traditionally a network of actors (connected by films) but you can do a network of films (connected by actors). So you’ll need to start with the personnel files (writers.list, actors.list, actresses.list, etc). Whether you want one profession (just actors) or all the professions is a judgement call but actors are traditional.
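To make the reduction concrete, here is a minimal sketch on made-up data, not the real IMDB files. The variable names i (worker id) and j (film id) and the five toy credits are assumptions chosen for illustration. Joining the credit list to itself on the film id pairs up everyone who appears in the same film.

```stata
* toy two-mode edge list: i = worker id, j = film id (made-up data)
clear
input i j
1 1
2 1
3 1
2 2
3 2
end
tempfile twomode
save `twomode'
* pair each worker with every worker credited on the same film
ren i ib
joinby j using `twomode'
* a worker is not tied to himself
drop if i==ib
* one row per ordered pair; _freq counts the films the pair shares
contract i ib
* film 1 yields six ordered pairs and film 2 repeats two of them
assert _N==6
list, noobs
```

Each tie shows up twice, once in each direction, and workers 2 and 3 get a _freq of 2 because they share both films.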

Having decided which files you want to use, you have to clean them. (See previous thoughts here). Most of the files are organized as some variation on this:

Birch, Thora	Alaska (1996)  [Jessie Barnes]  <1>
	Ghost World (2000)  [Enid]  <1>

So first you clean the file in perl or a text editor, then “insheet” with Stata. There are two issues:

  1. In all of the files the worker name appears only on the first record; on subsequent credits the name field is blank. To fill it in, use this command (where var1 is whatever the name column is called after import):
     replace var1=var1[_n-1] if var1==""
  2. The name is tab-delimited from the credit, but the “credit” includes several types of information. You’ll need a regular expression, either in the text editor to turn the tags into tabs, or in Stata using regexm/regexs to pull the information out from within the tags. For instance, in the actor/actress files “[]” encloses the name of the character and “<>” the credit rank. The parentheses enclose the release year, but that’s effectively part of the film title as it helps distinguish between remakes.
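As a rough sketch of the Stata route, the two records shown above can be typed in by hand and parsed with regexm/regexs. The variable names (name, credit, film, year, character, rank) and the patterns are my assumptions. The real files have more variations than these two lines, so treat the patterns as a starting point rather than a finished cleaner.

```stata
* two toy records mimicking actresses.list after a tab-delimited import
* (the variable names "name" and "credit" are assumptions)
clear
input str20 name str60 credit
"Birch, Thora" "Alaska (1996)  [Jessie Barnes]  <1>"
"" "Ghost World (2000)  [Enid]  <1>"
end
* carry the worker name down to the later credits
replace name=name[_n-1] if name==""
* film title, keeping the parenthetical year as part of the title
gen str60 film=regexs(1) if regexm(credit,"^(.*\([0-9][0-9][0-9][0-9][^)]*\))")
* release year as a number
gen year=real(regexs(1)) if regexm(credit,"\(([0-9][0-9][0-9][0-9])")
* character name from the square brackets
gen str40 character=regexs(1) if regexm(credit,"\[(.*)\]")
* credit rank from the angle brackets
gen rank=real(regexs(1)) if regexm(credit,"<([0-9]+)>")
* quick sanity checks on the toy records
assert name=="Birch, Thora"
assert film=="Ghost World (2000)" & year==2000 in 2
list name film year character rank, noobs
```

The year gets repeated four times as [0-9] rather than written with a repeat count because I am not sure every version of regexm accepts the brace syntax.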

Now you need to append the personnel files to each other in a file we can call credits.dta, with the variables name, film, and year that the code below expects. The next couple of steps are not necessary in theory, but in practice they are very helpful for keeping the file sizes reasonably small. It helps a lot to encode the data, that is, to replace each long name or title string with a numeric ID. Because of the large number of distinct values, “encode” itself won’t work (it stores the strings as value labels, which are capped at 65,536 entries in most flavors of Stata), so you have to do it manually.

*the following block of code is basically a roundabout "encode" command but it doesn't have the same limitations
use credits.dta, clear
contract name
drop _freq
sort name
*create "i" as a name serial number based on row number/ alphabetical order
gen long i=_n
lab var i "name id"
save i_key, replace
outsheet using i_key.txt, replace
sort i
*i_keyb.dta is same as i_key but sorted by "i" instead of name.
*substantively they are identical, but having two versions is useful for merging
save i_keyb.dta, replace

*create list of films and assign serial number "j" to each, just as with "i" for name
use credits.dta, clear
keep film
contract film
drop _freq
sort film
gen long j=_n
lab var j "film id"
save j_key, replace
outsheet using j_key.txt, replace
sort j
save j_keyb, replace

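The next memory-saving step is to break it up into annual files. This works if you plan to have films connect actors, since all of a film’s credits fall in the same year, but not the other way around, since an actor’s credits are spread across many years.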

*create annual credit (ijt) files
forvalues t=1900/2009 {
 use credits.dta, clear
 keep if year==`t'
 sort name
 merge name using i_key.dta
 tab _merge
 keep if _merge==3
 drop name _merge
 sort film
 merge film using j_key.dta
 tab _merge
 keep if _merge==3
 keep i j
 sort i j
 save ij_`t'.dta, replace
}

Now that you have a set of encoded annual credit files, it’s time to turn these two-mode files into one-mode edge lists.

*create dyads/collaborations (ii) by year
forvalues t=1900/2009 {
 use ij_`t'.dta, clear
 ren i ib
 sort j
 *square the matrix of each film's credits
 joinby j using ij_`t'.dta
 *eliminate auto-ties
 drop if i==ib
 *drop film titles.
 drop j
 contract i ib
 drop _freq /*optional, keep it and treat as tie strength*/
 save ii_`t'.dta, replace
}

At this point you can use “append” (followed by contract or collapse) to combine waves. Export to ASCII and knock yourself out in a program better suited for network analysis than Stata. (At least until somebody inevitably jerry-rigs SNA out of Mata). Remember that the worker and film names are stored in i_key.txt and j_key.txt.
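For instance, combining a decade of waves might look something like the sketch below. The 2000/2009 window, the variable name years_tied, and the output filename are arbitrary choices of mine. It also assumes you dropped _freq in the annual files; if you kept it as tie strength, you would want collapse (sum) _freq, by(i ib) in place of contract.

```stata
* stack the annual dyad files for an (arbitrary) window of years
clear
forvalues t=2000/2009 {
 append using ii_`t'.dta
}
* one row per pair; years_tied counts the years in which the pair collaborated
contract i ib, freq(years_tied)
* every tie appears in both directions, so keep one record per pair
keep if i<ib
* tab-delimited ASCII edge list for the network package of your choice
outsheet i ib years_tied using ii_2000_2009.txt, replace
```

Whether you keep one direction or both depends on what the receiving program expects, so skip the keep line if it wants a symmetric list.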

