Posts tagged ‘IMDB’

Strange Things Are Afoot at the IMDb

| Gabriel |

I was helping a friend check something on IMDb for a paper, so we went to the URL that gives you the raw data, and found it's in a completely different format than it was the last time I checked, about a year ago.

The old data will be available until November 2017. I suggest you grab a complete copy while you still can.

Good news: The data is in a much simpler format: six wide tables stored as tab-separated text files. You'll no longer need my Perl scripts to convert them from a few dozen files that are a weird mishmash of field-tagged format and the weirdest tab-delimited text you've ever seen. Good riddance.
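If it helps, the new tables are plain enough that a few lines of Python handle them. This sketch is my own, not anything official; the column names assume the title.basics table and the "\N" null convention, as I understand the new format:

```python
import csv
import io

# Two-row excerpt in the new tab-separated layout; column names assume
# the title.basics table and "\N" as the null marker.
sample = (
    "tconst\tprimaryTitle\tstartYear\tgenres\n"
    "tt0000001\tCarmencita\t1894\tDocumentary,Short\n"
    "tt9999999\tUntitled\t\\N\t\\N\n"
)

def read_imdb_tsv(text):
    """Parse a tab-separated IMDb table into dicts, mapping \\N to None."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    for row in reader:
        yield {k: (None if v == r"\N" else v) for k, v in row.items()}

rows = list(read_imdb_tsv(sample))
```

QUOTE_NONE matters because titles can contain stray quote characters that would otherwise confuse the csv parser.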

Bad news: It's hard to get. S3 is designed for developers, not end users. You could download the old version with Chrome or curl from the command line. The new version requires you to create an S3 account, and as best I can tell there's no way to just use the S3 web interface to get it. There is sample Java code, but it requires supplying your account credentials, which gives me cold-sweat flashbacks to when Twitter changed its API and my R scraper broke. Bottom line: you'll probably need IT to help you with this.

Really bad news: A lot of the files are gone. There are no country-by-country release dates, no box office figures, no plot keywords, only up to three genres per title, no distributor or production company, etc. These are all things I've used in publications.

September 8, 2017 at 2:29 pm 3 comments

Oscar Appeal

| Gabriel |

This post contains two Stata do-files for constructing the “Oscar appeal” variable at the center of Rossman & Schilke “Close But No Cigar.”


July 29, 2013 at 8:06 am 5 comments

imdb_personnel.pl

| Gabriel |

As previously remarked, IMDb files have a weird structure that ain’t exactly ready to rock. I already posted a file for dealing with business.list (which could also be modified to work with files like certificates.list). The personnel files (actors.list, actresses.list, directors.list, writers.list, etc) look like this:

Gilligan, Vince		2-Face (2013)  (screenplay)
			A.M.P.E.D. (2007) (TV)  (writer)
			Hancock (2008)  (written by)  <1,2,1>
			Home Fries (1998)  (written by)  <1,1,1>
			The X Files: Revelations (2008) (V)  (written by) (segment "Bad Blood")  <8,1,1>
			The X Files: Revelations (2008) (V)  (written by) (segment "Memento Mori")  <6,1,3>
			Wilder Napalm (1993)  (written by)  <1,1,1>
			"Breaking Bad" (2008)  (creator)
			"Breaking Bad" (2008) {(#3.12)}  (creator)  <1,1,1>
			"Breaking Bad" (2008) {(#3.13)}  (creator)  <1,1,1>

Whereas we’re used to data that looks like this:

Gilligan, Vince	2-Face (2013)  (screenplay)
Gilligan, Vince	A.M.P.E.D. (2007) (TV)  (writer)
Gilligan, Vince	Hancock (2008)  (written by)  <1,2,1>
Gilligan, Vince	Home Fries (1998)  (written by)  <1,1,1>
Gilligan, Vince	The X Files: Revelations (2008) (V)  (written by) (segment "Bad Blood")  <8,1,1>
Gilligan, Vince	The X Files: Revelations (2008) (V)  (written by) (segment "Memento Mori")  <6,1,3>
Gilligan, Vince	Wilder Napalm (1993)  (written by)  <1,1,1>
Gilligan, Vince	"Breaking Bad" (2008)  (creator)
Gilligan, Vince	"Breaking Bad" (2008) {(#3.12)}  (creator)  <1,1,1>

Of course that's still not complete, since ideally you want to parse the title of the work (eg "Breaking Bad" (2008)) from the details of the artist's contribution to it (eg (creator)). Likewise, depending on what your analysis is about, you might want to drop certain kinds of works entirely (I usually drop the porn, television, and direct-to-video stuff ASAP). However, you can do all that from within Stata (assuming memory isn't an issue, which it might be), and this script will suffice to get you that far:

#!/usr/bin/perl
#imdb_personnel.pl by ghr
#this script cleans IMDB personnel files (eg, writers.list)
#works best if you delete the header (about the first 300 lines)
#raw data is organized by artist with
# "ARTIST\t\tCREDIT" for the first credit (though sometimes w a single tab) and
# subsequent records are "\t\t\tCREDIT"
#this script makes all rows "ARTIST\tCREDIT" and drops blank rows
#the resulting file is about 20% larger than the original but has a simpler structure that is easier for other programs (eg Stata) to read
#further cleaning would parse the "CREDIT" field but the contents of "CREDIT" 
#vary by personnel file
#in all files "CREDIT" begins with "FILM TITLE (YEAR)" but has further info
# eg, writers.list distinguishes screenplay vs story, etc and actors.list gives character name, etc

use warnings; use strict;
die "usage: imdb_personnel.pl <IMDB personnel file>\n" unless @ARGV==1;
my $rawdata = shift(@ARGV);

# if $_ matches leading non-tab, redefine the "artist" variable
# if $_ matches 3 leading tabs, drop two tabs and add current "artist"
my $artist;
open(my $in, '<', $rawdata) or die "error opening $rawdata for reading\n";
open(my $out, '>', "$rawdata.tsv") or die "error creating $rawdata.tsv\n";
print $out "artist\tcredit\n";
while (<$in>) {
	#match beginning of artist's credits by looking for lines NOT beginning with a tab
	if ($_ =~ /^[^\t].+\t.+/) {
		$artist = $_;
		$artist =~ s/\015?\012//; #manual chomp
		$artist =~ s/\t.+$//; #drop the tab(s) and all else after them
		$_ =~ s/\t+/\t/; #collapse the separator (one or two tabs) to a single tab
		print $out $_;
	}
	#match subsequent credits (three leading tabs)
	if ($_ =~ m/^\t\t\t/) {
		$_ =~ s/^\t\t\t//; #drop leading tabs
		print $out "$artist\t$_";
	}
	#when matching blank line, clear "artist"
	if ($_ =~ m/^$/) {
		$artist = "";
	}
}
close $in;
close $out;
print "\ndone\n";
#have a nice day
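To go the next step and split the CREDIT field itself, here's a hedged Python sketch (mine, not part of the script above). The regex is an assumption inferred from the sample records — title plus "(year)" plus an optional "{episode}" chunk — not a complete grammar of the file:

```python
import re

# Hypothetical further-cleaning step: split a CREDIT into the work
# (title + "(year)" + optional "{episode}") and the contribution details.
# The pattern is inferred from the sample records, not a complete grammar.
credit_re = re.compile(
    r'^(?P<title>.+?\(\d{4}[^)]*\)(?:\s*\{[^}]*\})?)'  # work, through (year) and {episode}
    r'\s*(?P<details>.*)$'                             # (written by), <1,1,1>, etc.
)

def split_credit(credit):
    m = credit_re.match(credit.strip())
    if m is None:
        return credit, ""  # pass unparsed lines through rather than die
    return m.group("title").strip(), m.group("details").strip()
```

Note that flags like "(TV)" or "(V)" land in the details side here; whether that's right depends on your analysis.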

July 26, 2010 at 4:13 am 2 comments

Gross.pl

| Gabriel |

A few months ago I talked about reshaping field-tagged data and gave some clumsy advice for doing so. I’ve now written a perl script that does this more elegantly. It’s written to extract movie title (“MV”) and domestic box office (“GR”) from the IMDB file business.list, but you could adapt it to get other variables and/or work on other field-tagged data.
Basically, the script will turn this:

-------------------------------------------------------------------------------
MV: Little Shop of Horrors (1986)

AD: 118,418 (Sweden) 

BT: USD 30,000,000 

GR: USD 34,656,704 (USA) (8 February 1987) 
GR: USD 33,126,503 (USA) (1 February 1987) 
GR: USD 30,810,276 (USA) (25 January 1987) 
GR: USD 27,781,027 (USA) (18 January 1987) 
GR: USD 23,727,232 (USA) (11 January 1987) 
GR: USD 19,546,049 (USA) (4 January 1987) 
GR: USD 11,412,248 (USA) (28 December 1986) 
GR: USD 3,659,884 (USA) (21 December 1986) 
GR: USD 38,747,385 (USA) 
GR: SEK 4,318,255 (Sweden) 

OW: USD 3,659,884 (USA) (21 December 1986) (866 screens) 

RT: USD 19,300,000 (USA) 

SD: 21 October 1985 - ? 

WG: USD 1,112,016 (USA) (8 February 1987) (871 screens) 
WG: USD 1,719,329 (USA) (1 February 1987) 
WG: USD 2,093,847 (USA) (25 January 1987) 
WG: USD 3,222,066 (USA) (18 January 1987) 
WG: USD 3,057,666 (USA) (11 January 1987) (858 screens) 
WG: USD 4,004,838 (USA) (4 January 1987) (866 screens) 
WG: USD 5,042,682 (USA) (28 December 1986) (866 screens) 
WG: USD 3,659,884 (USA) (21 December 1986) (866 screens) 

-------------------------------------------------------------------------------

Into this:

Little Shop of Horrors (1986)	34,656,704 (USA) (8 February 1987) 
Little Shop of Horrors (1986)	33,126,503 (USA) (1 February 1987) 
Little Shop of Horrors (1986)	30,810,276 (USA) (25 January 1987) 
Little Shop of Horrors (1986)	27,781,027 (USA) (18 January 1987) 
Little Shop of Horrors (1986)	23,727,232 (USA) (11 January 1987) 
Little Shop of Horrors (1986)	19,546,049 (USA) (4 January 1987) 
Little Shop of Horrors (1986)	11,412,248 (USA) (28 December 1986) 
Little Shop of Horrors (1986)	3,659,884 (USA) (21 December 1986) 
Little Shop of Horrors (1986)	38,747,385 (USA) 

Here’s the code:

#!/usr/bin/perl
#gross.pl by ghr
#this script cleans the IMDB file business.list
#raw data is field-tagged, key tags are "MV" (movie title) and "GR" (gross)
#record can have multiple "gross" fields, only interested in those with "(USA)"
#ex
#MV: Astronaut's Wife, The (1999)
#GR: USD 10,654,581 (USA) (7 November 1999) 
#find "MV" tag, keep in memory, go to "GR" tag and write out as "MV\tGR"

use warnings; use strict;
die "usage: gross.pl <IMDB business file>\n" unless @ARGV==1;
my $rawdata = shift(@ARGV);

# if line=MV, redefine the "title" variable
# if line=GR, write out with "title" in front
#optional, screen out non "USA" gross, parse GR into 
#"currency, quantity, country, date"
my $title;
my $gross;
open(my $in, '<', $rawdata) or die "error opening $rawdata for reading\n";
open(my $out, '>', 'gross.txt') or die "error creating gross.txt\n";
print $out "title\tgross\n";
while (<$in>) {
	#match "MV" lines by looking for lines beginning "MV: "
	if ($_ =~ /^MV: /) {
		$title = $_;
		$title =~ s/\015?\012//; #manual chomp
		$title =~ s/^MV: //; #drop leading tag
		print "$title "; #progress indicator on stdout
	}
	#match "GR" lines, write out with the current title
	if ($_ =~ m/^GR: USD .+\(USA\)/) {
		$gross = $_;
		$gross =~ s/\015?\012//; #manual chomp
		$gross =~ s/^GR: USD //; #drop leading tag
		print $out "$title\t$gross\n";
	}
}
close $in;
close $out;
print "\ndone\n";
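As the header comments note, you could go further and parse the GR field into currency, quantity, country, and date. A sketch of that step in Python, with a pattern inferred from the sample record above (treat it as a starting point, not a full parser):

```python
import re

# Sketch of the optional parse noted in the header comments: break a GR
# line into currency, amount, country, and (optional) date. The pattern
# is inferred from the sample record above.
gr_re = re.compile(
    r'^GR:\s+(?P<currency>[A-Z]{3})\s+(?P<amount>[\d,]+)'
    r'\s+\((?P<country>[^)]+)\)'
    r'(?:\s+\((?P<date>[^)]+)\))?'
)

def parse_gross(line):
    m = gr_re.match(line.strip())
    if m is None:
        return None
    amount = int(m.group("amount").replace(",", ""))
    return m.group("currency"), amount, m.group("country"), m.group("date")
```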

March 31, 2010 at 3:40 pm 4 comments

Ratings game

| Gabriel |

David Waguespack and Olav Sorenson have an interesting new paper on Hollywood (their earlier Hollywood paper is here) that contributes to the literature on categorization, rankings, and sensemaking that increasingly seems to be the dominant theme in econ soc. The new paper is about MPAA ratings (G, PG, PG13, R, NC17) and finds that, controlling for the salaciousness of the content, the big studios get more lenient ratings than small studios. The exact mechanism through which this occurs is hard to nail down, but it occurs even on the initial submission, so it's not just that studios continuously edit down and resubmit the movie until they get a PG13 (which is what I would have expected). Thus the finding is similar to some of the extant literature on how private or quasi-private ranking systems can have similar effects to government mandates, but adds the theoretical twist that rankings can function as a barrier to entry. This kind of thing has been suspected by the industry itself, and in fact I heard the findings discussed on "The Business" in the car and was planning to google the paper only to find that Olav had emailed me a copy while I was in transit.

Aside from the theoretical/substantive interest, there are two methods points worth noting. First, their raw data on salaciousness is a set of three Likert scales: sex, violence, and cussing. The natural thing to do would have been to just treat these as three continuous variables or even sum them to a single index. Of course this would be making the assumption that effects are additive, linear, and the intervals on the scale are consistent. They avoided this problem by creating a massive dummy set of all combinations of the three scores. Perhaps overkill, but pretty hard to second guess (unless you’re worried about over-fitting, but they present the parametric models too and everything is consistent). Second, to allow for replication, Olav’s website has a zip with their code and data (the unique salaciousness data, not the IMDB data that is available elsewhere). This is important because as several studies have shown, “available on request” is usually a myth.
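To make the dummy-set idea concrete, here's a toy illustration (my own sketch, not their code): instead of entering the three scores as continuous regressors, give each observed combination of the three its own indicator.

```python
# Toy data: (sex, violence, cussing) scores for four films (made up).
films = [(1, 3, 2), (5, 5, 4), (1, 3, 2), (2, 1, 1)]

# One indicator per observed combination of the three scores, so no
# assumption of additive, linear, evenly-spaced effects is needed.
combos = sorted(set(films))
dummies = [[int(f == c) for c in combos] for f in films]
```

Each film gets exactly one 1 in its row, and two films with identical score profiles get identical rows.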

March 22, 2010 at 4:27 am

Team Sorting

| Gabriel |

Tyler Cowen links to an NBER paper by Hoxby that shows that in recent decades, status sorting has gotten more intense for college. Cowen asks “is this a more general prediction in a superstars model?” The archetypal superstar system is Hollywood, and here’s my quick and dirty stab at answering Tyler’s question for that field. Faulkner and Anderson’s 1987 AJS showed that there is a lot of quality sorting in Hollywood, but they didn’t give a time trend. As shown in my forthcoming ASR with Esparza and Bonacich, there are big team spillovers so this is something we ought to care about.

I’m reusing the dataset from our paper, which is a subset of IMDB for Oscar eligible films (basically, theatrically-released non-porn) from 1936-2005. If I were doing it for publication I’d do it better (i.e., I’d allow the data to have more structure and I’d build confidence intervals from randomness), but for exploratory purposes the simplest way to measure sorting is to see if a given film had at least one prior Oscar nominee writer, director, and actor. From that I can calculate the odds-ratio of having an elite peer in the other occupation.

Overall, a movie that has at least one prior nominee writer is 7.3 times more likely than other films to have a prior nominee director and 4.4 times more likely to have a prior nominee cast. A cast with a prior nominee is 6.5 times more likely to have a prior nominee director. Of course we already knew there was a lot of sorting from Faulkner and Anderson; the question suggested by Hoxby/Cowen is what are the effects over time?
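For the curious, the odds-ratio here is just the cross-product ratio of a 2x2 table of the two dummies. A quick Python sketch of the same calculation the Stata code below does with matcell(), using made-up counts:

```python
# Cross-product odds ratio from a 2x2 table of two dummies, e.g.
# "has a prior-nominee writer" x "has a prior-nominee director".
def odds_ratio(pairs):
    n = [[0, 0], [0, 0]]
    for x, y in pairs:
        n[x][y] += 1
    return (n[1][1] * n[0][0]) / (n[1][0] * n[0][1])

# Made-up counts: 30 both, 10 writer-only, 20 director-only, 40 neither.
films = [(1, 1)] * 30 + [(1, 0)] * 10 + [(0, 1)] * 20 + [(0, 0)] * 40
```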

This little table shows odds-ratios for cast-director, writer-director, and writer-cast. Big numbers mean more intense sorting.

+--------------------------------------------+
| decade            cd         wd         wc |
|--------------------------------------------|
| 1936-1945   6.545898   6.452388   4.306554 |
| 1946-1955   9.407476   6.425553   5.368151 |
| 1956-1965   12.09229   8.741302   6.720059 |
| 1966-1975   4.697238   5.399081   4.781106 |
| 1976-1985   4.113508   6.984528   4.450109 |
| 1986-1995   4.923809   7.599852   3.301461 |
| 1996-2005   4.826018   12.35915   3.641975 |
+--------------------------------------------+

The trend is a little complicated. For collaborations between Oscar-nominated casts on the one hand and either writers or directors on the other, sorting is most intense in the 1946-1955 decade and especially the 1956-1965 decade. My guess is that this is tied to the decline of the studio system and/or the peak power of MCA. The odds-ratio of a good director for nominated vs non-nominated writers also jumps around the end of the studio system, but there seems to be a second jump starting in the 80s. My guess is that this is an artifact of the increasing number of writer-directors (see Baker and Faulkner AJS 1991), but it's an empirical question.

Putting aside the writer-director thing, it seems that sorting is not growing stronger in Hollywood. My guess is that ever more intense sorting is not a logical necessity of superstar markets, but has to do with contingencies, such as the rise of a national market for elite education in Hoxby's case or the machinations of Lew Wasserman and Olivia de Havilland in my case.

The Stata code is below. (sorry that wordpress won’t preserve the whitespace). The data consists of film-level data with dummies for having at least one prior nominee for the three occupations.

global parentpath "/Users/rossman/Documents/oscars"

capture program drop makedecade
program define makedecade
gen decade=year
recode decade 1900/1935=. 1936/1945=1 1946/1955=2 1956/1965=3 1966/1975=4 1976/1985=5 1986/1995=6 1996/2005=7
capture lab drop decade
lab def decade 1 "1936-1945" 2 "1946-1955" 3 "1956-1965" 4 "1966-1975" 5 "1976-1985" 6 "1986-1995" 7 "1996-2005"
lab val decade decade
end

cd $parentpath

capture log close
log using $parentpath/sorting_analysis.log, replace

use sorting, clear
makedecade

*do odds-ratio of working w oscar nom, by own status

capture program drop allstar
program define allstar
preserve
if "`1'"!="" {
keep if decade==`1'
}
tabulate cast director, matcell(CD)
local pooled_cd=(CD[2,2]*CD[1,1])/(CD[1,2]*CD[2,1])
tabulate writers director, matcell(WD)
local pooled_wd=(WD[2,2]*WD[1,1])/(WD[1,2]*WD[2,1])
tabulate writers cast, matcell(WC)
local pooled_wc=(WC[2,2]*WC[1,1])/(WC[1,2]*WC[2,1])
shell echo "`pooled_cd' `pooled_wd' `pooled_wc' `1'" >> sortingresults.txt
restore
end

shell echo "cd wd wc decade" > sortingresults.txt
quietly allstar
forvalues t=1/7 {
quietly allstar `t'
}

insheet using sortingresults.txt, delimiter(" ") names clear
lab val decade decade

 

*have a nice day

November 8, 2009 at 7:27 pm 4 comments

Shufflevar

| Gabriel |

[Update: I’ve rewritten the command to be more flexible and posted it to ssc. to get it type “ssc install shufflevar”. this post may still be of interest for understanding how to apply the command].

Sometimes you face a situation where it’s really hard to see what the null is because the data structure is really complicated and there is all sorts of nonlinearity, etc. Analyses of non-sparse square network matrices can use the quadratic assignment procedure, but you can do something similar with other data structures, including bipartite networks.

A good null keeps everything constant, but shows what associations we would expect were association random. The simplest way to do this is to keep the actual variable vectors but randomly sort one of the vectors. So for instance, you could keep the actual income distribution and the actual values of peoples’ education, race, etc, but randomly assign actual incomes to people.
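In miniature, the same idea as a Python sketch (with made-up incomes): keep both marginal distributions intact but break the link between them by randomly reassigning one column.

```python
import random

# Keep the actual vector of values but randomly reassign who gets which
# one; the marginal distribution is preserved, the association is broken.
def shuffle_column(values, seed=None):
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    return shuffled

incomes = [20, 40, 60, 80, 100]  # made-up data
null_incomes = shuffle_column(incomes, seed=1)
```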

Fernandez, Castilla, and Moore used what was basically this approach to build a null distribution of the effects of employment referrals. Since then Ezra Zuckerman has used it in several papers on Hollywood to measure the strength of repeat collaboration. I myself am using it in some of my current radio work to understand how much corporate clustering we’d expect to see in the diffusion of pop songs under the null hypothesis that radio corporations don’t actually practice central coordination.

I wrote a little program that takes the argument of the variable you want shuffled. It has a similar application as bsample, and like bsample it’s best used as part of a loop.

capture program drop shufflevar
program define shufflevar
  local shufflevar `1'
  tempvar oldsortorder
  gen `oldsortorder'=[_n]
  tempvar newsortorder
  gen `newsortorder'=uniform()
  sort `newsortorder'
  capture drop `shufflevar'_shuffled
  gen `shufflevar'_shuffled=`shufflevar'[_n-1]
  replace `shufflevar'_shuffled=`shufflevar'[_N] in 1/1
  sort `oldsortorder'
  drop `newsortorder' `oldsortorder'
end

Here’s an example to show how much clustering of “y” you’d expect to see by “clusterid” if we keep the observed distributions of “y” and “clusterid” but break any association between them:

shell echo "run rho" > _results_shuffled.txt

forvalues run=1/1000 {
  disp "iteration # `run' of 1000"
  quietly shufflevar clusterid
  quietly xtreg y, re i(clusterid_shuffled)
  shell echo "`run' `e(rho)'" >> _results_shuffled.txt
}

insheet using _results_shuffled.txt, names clear delimiter(" ")
histogram rho
sum rho

(Note that “shell echo” only works with Mac/Unix, Windows users should try postfile).

October 26, 2009 at 5:09 am 3 comments

Journey to the True Tales of the IMDB!

| Gabriel |

Following up on yesterday’s post, check out this paper by Herr and his colleagues that graphs IMDB and provides some basic descriptions of the network. You can also see a zoomable version of their truly gorgeous visualization. Finally the answer to that age old question, what do you get for the quantitative cultural sociologist who has everything?

The authors are affiliated with the Cyberinfrastructure for Network Science Center at Indiana. Although Indiana sociology has a well-deserved reputation for hardcore quant research, CNS is at the school of Information. Following the logic I learned from reading Marvel comics as a kid I can only speculate that something about the pesticide run-off in the drinking water gives scholars at Indiana superhuman abilities to code.

Also of note is that CNS provides the cross-platform open source package Network Workbench. I was a little skeptical because it’s written in Java (which tends to be slow) but I got it to create a PageRank vector of a huge dataset in six minutes, which isn’t bad at all. I may have more to say about this program in the future as I plan to tinker with it.

June 19, 2009 at 5:11 am

Son of True Tales of the IMDB!

| Gabriel |

Continuing with the discussion of IMDB networks …

Although it’s pretty much futile to get Stata to calculate any network parameters beyond degree centrality, it’s actually good at cleaning collaboration network data and converting it to an edge list. You can then export this edge list to a package better suited for network analysis like Pajek, Mathematica, or any of several packages written in R or SAS.

The IMDB is a bipartite network where the worker is one mode and the film is the other. Presumably you’ll be reducing this to a one-mode network, traditionally a network of actors (connected by films) but you can do a network of films (connected by actors). So you’ll need to start with the personnel files (writers.list, actors.list, actresses.list, etc). Whether you want one profession (just actors) or all the professions is a judgement call but actors are traditional.

Having decided which files you want to use, you have to clean them. (See previous thoughts here). Most of the files are organized as some variation on this:

Birch, Thora	Alaska (1996)  [Jessie Barnes]  <1>
	Ghost World (2000)  [Enid]  <1>

So first you clean the file in perl or a text editor, then “insheet” with Stata. There are two issues:

  1. In all of the files the worker name appears only on the first record; subsequent credits are whitespace. To fill it in, use this command: replace var1=var1[_n-1] if var1==""
  2. The name is tab-delimited from the credit, but the "credit" includes several types of information. You'll need to do a regular expression search, either in the text editor to turn the tags into tabs, or in Stata using regexm/regexs to pull the information out from within the tags. For instance, in the actor/actress files "[]" shows the name of the character and "<>" the credit rank. Parentheses show the release date, but that's effectively part of the film title as it helps distinguish between remakes.
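For anyone who'd rather do the fill-down before the data ever touches Stata, here's a minimal Python sketch of the same logic as replace var1=var1[_n-1], using the Thora Birch example above:

```python
# Carry the last non-blank worker name forward, row by row (the same
# fill-down that replace var1=var1[_n-1] if var1=="" does in Stata).
def fill_down(rows):
    last = ""
    out = []
    for name, credit in rows:
        if name:
            last = name
        out.append((last, credit))
    return out

rows = [("Birch, Thora", "Alaska (1996)"), ("", "Ghost World (2000)")]
filled = fill_down(rows)
```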

Now you need to append the personnel files to each other into a file we can call credits.dta. The next couple of steps are not necessary in theory, but in practice they are very helpful for keeping the file sizes reasonably small: it helps a lot to encode the data, though because of the large number of values you have to do it manually.

*the following block of code is basically a roundabout "encode" command but it doesn't have the same limitations
use credits.dta, clear
contract name
drop _freq
sort name
*create "i" as a name serial number based on row number/ alphabetical order
gen i=[_n]
lab var i "name id"
save i_key, replace
outsheet using i_key.txt, replace
sort i
*i_keyb.dta is same as i_key but sorted by "i" instead of name.
*substantively they are identical, but having two versions is useful for merging
save i_keyb.dta, replace

*create list of films and assign serial number "j" to each, just as with "i" for name
use credits.dta, clear
keep film
contract film
drop _freq
sort film
gen j=[_n]
lab var j "film id"
save j_key, replace
outsheet using j_key.txt, replace
sort j
save j_keyb, replace
clear

The next memory-saving step is to break it up into annual files. This will work if you plan to have films connect actors but not the other way around.

*create annual credit (ijt) files
forvalues t=1900/2009 {
 use credits.dta, clear
 keep if year==`t'
 compress
 sort name
 merge name using i_key.dta
 tab _merge
 keep if _merge==3
 drop name _merge
 sort film
 merge film using j_key.dta
 tab _merge
 keep if _merge==3
 keep i j
 sort i j
 save ij_`t'.dta, replace
}

Now that you have a set of encoded annual credit files, it’s time to turn these two-mode files into one-mode edge lists.

*create dyads/collaborations (ii) by year
forvalues t=1900/2009 {
 use ij_`t'.dta, clear
 ren i ib
 sort j
 *square the matrix of each film's credits
 joinby j using ij_`t'.dta
 *eliminate auto-ties
 drop if i==ib
 *drop film titles.
 drop j
 contract i ib
 drop _freq /*optional, keep it and treat as tie strength*/
 compress
 save ii_`t'.dta, replace
}

At this point you can use “append” (followed by contract or collapse) to combine waves. Export to ASCII and knock yourself out in a program better suited for network analysis than Stata. (At least until somebody inevitably jerry-rigs SNA out of Mata). Remember that the worker and film names are stored in i_key.txt and j_key.txt.
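The joinby step above is doing a bipartite projection. The same film-to-collaboration move in miniature (a Python sketch with made-up ids, where the _freq count becomes an edge weight):

```python
from collections import Counter
from itertools import combinations

# (person id, film id) credits, made up; project onto person-person
# edges, with repeat collaborations counted as edge weights (_freq).
credits = [(1, "A"), (2, "A"), (3, "A"), (1, "B"), (2, "B")]

by_film = {}
for person, film in credits:
    by_film.setdefault(film, []).append(person)

edges = Counter()
for cast in by_film.values():
    for i, j in combinations(sorted(cast), 2):  # square each film's credits
        edges[(i, j)] += 1                      # sorting i<j avoids auto-ties and double-counting
```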

June 18, 2009 at 5:37 am 1 comment

Bride of True Tales of the IMDB!

| Gabriel |

One of the things social scientists (and the physicists who love them) like to do with IMDB is use it to build up collaboration networks, which is basically playing the Kevin Bacon game but dressed up with terms like "mean path length" and "reachability." This dates back to the late 1990s, before the "social media" fad made for an abundance of easily downloadable (or scrapable) large-scale social networks. Believe it or not, as recently as the early 1990s network people were still doing secondary analyses of the same handful of tiny datasets they'd been using for decades. If you'd spent a career trying to model rivalry among a couple dozen monks or marriage alliances among Florentine merchant families, you would have been excited about graphing the IMDB too.

Anyway, there are a few problems with using IMDB, several of which I’ve already discussed. The main thing is that it’s really, really, really big and when you try to make it into a network it just gets ludicrous. In part this is because of a few outlier works with really large casts.

Consider the 800 pound gorilla of IMDB, General Hospital, which has been on tv since 1963 (and was on the radio long before that).* That's 46 years of not just the gradually churning ensemble cast, but guest stars and even bit-part players with one line. I forget the exact number, but something like 1000 people have appeared in General Hospital. Since the logic of affiliation networks treats all members of the affiliation as a clique, this is one big black mess of 1000 nodes and 499,500 edges. A ginormous clique like this can make an appreciable impact on things like the overall clustering coefficient (which in turn is part of the small world index). Likewise it can do weird things to node-level traits like centrality.
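The arithmetic behind that mess: an affiliation treated as a clique contributes n(n-1)/2 edges, so cast size bites quadratically.

```python
# An affiliation treated as a clique contributes n*(n-1)/2 edges,
# so one huge cast swamps the graph quadratically.
def clique_edges(n):
    return n * (n - 1) // 2
```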

Furthermore, unless you have some really esoteric theoretical concerns, it doesn’t even make sense to think of this being a collaboration that includes both the original actors and the current stars (most of whom weren’t even born in 1963). Many of the “edges” in the clique involve people who, far from having any kind of meaningful contact, didn’t even set foot on set within four decades of each other. For a different approach, consider an article in the current issue of Connections which graphs the Dutch national soccer team (pre-print here). The article does not treat the entire history of the team as one big clique (which would make for a short article) but rather an edge is defined as appearing in the same match. Not surprisingly the resulting structure is basically a chain as the team slowly rotates out old players and in new players. Overall it reminds me of one of the towers you’d build in World of Goo. The closest it gets to breaking a structure off from the giant component is the substantial turnover over the hiatus of WW2, but aside from that it’s pretty regular.

So anyway, unless you think General Hospital is the one true ring of Hollywood I think you only have two options:

  1. Follow the approach in the Dutch soccer paper and break a long running institution into smaller contemporaneous collaborations — games for Orange and episodes for General Hospital. Unfortunately IMDB doesn’t always have episode specific data for tv shows.
  2. Drop any non-theatrical content from the dataset. One of the perennial issues in any social research, and especially networks, is bounding the population. I think you can make an excellent substantive case that the production systems for television (and pornography) are sufficiently loosely coupled from theatrical film that they don’t belong in the same network dataset.

[*updated 5/18/15, General Hospital was never on the radio. I confused it with Guiding Light which, like Amos & Andy, did make the jump from network radio to network television]

June 17, 2009 at 5:44 am 1 comment


