Bride of True Tales of the IMDB!
| Gabriel |
One of the things social scientists (and the physicists who love them) like to do with IMDB is use it to build up collaboration networks, which is basically playing the Kevin Bacon game but dressed up with terms like “mean path length” and “reachability.” This dates back to the late 1990s, before the “social media” fad made for an abundance of easily downloadable (or scrapable) large-scale social networks. Believe it or not, as recently as the early 1990s network people were still doing secondary analyses of the same handful of tiny datasets they’d been using for decades. If you spent a career trying to model rivalry among a couple dozen monks or marriage alliances amongst Florentine merchant families, you would have been excited about graphing the IMDB too.
Anyway, there are a few problems with using IMDB, several of which I’ve already discussed. The main thing is that it’s really, really, really big and when you try to make it into a network it just gets ludicrous. In part this is because of a few outlier works with really large casts.
Consider the 800-pound gorilla of IMDB, General Hospital, which has been on TV since 1963 (and was on the radio long before that).* That's 46 years of not just the gradually churning ensemble cast, but guest stars and even bit-part players with one line. I forget the exact number, but something like 1000 people have appeared in General Hospital. Since the logic of affiliation networks treats all members of the affiliation as a clique, this is one big black mess of 1000 nodes and 499,500 edges (1000 × 999 / 2). A ginormous clique like this can make an appreciable impact on things like the overall clustering coefficient (which in turn is part of the small world index). Likewise it can do weird things to node-level traits like centrality.
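To make the arithmetic concrete, here is a minimal sketch of the clique logic, assuming nothing about how IMDB data is actually stored. The function names and the toy cast are made up for illustration. Every unordered pair of cast members becomes one edge, so a cast of n people yields n(n-1)/2 edges, which for 1000 people is 499,500.

```python
from itertools import combinations


def clique_edges(cast):
    """Return one edge per unordered pair of cast members.

    Hypothetical helper: `cast` is any iterable of actor identifiers.
    """
    return set(combinations(sorted(set(cast)), 2))


def clique_edge_count(n):
    """Number of edges in a clique of n nodes: n choose 2."""
    return n * (n - 1) // 2


# Toy cast of four people, so we can enumerate the pairs by hand.
toy_cast = ["ann", "bob", "cal", "dee"]
toy_edges = clique_edges(toy_cast)

# A cast of roughly the size guessed at for General Hospital.
big_clique = clique_edge_count(1000)
```

The edge count grows with the square of the cast size, which is why a single long-running show can swamp the rest of the network.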
Furthermore, unless you have some really esoteric theoretical concerns, it doesn't even make sense to think of this as a single collaboration that includes both the original actors and the current stars (most of whom weren't even born in 1963). Many of the "edges" in the clique involve people who, far from having any kind of meaningful contact, didn't even set foot on set within four decades of each other. For a different approach, consider an article in the current issue of Connections which graphs the Dutch national soccer team (pre-print here). The article does not treat the entire history of the team as one big clique (which would make for a short article) but instead defines an edge as two players appearing in the same match. Not surprisingly, the resulting structure is basically a chain, as the team slowly rotates old players out and new players in. Overall it reminds me of one of the towers you'd build in World of Goo. The closest it gets to breaking a structure off from the giant component is the substantial turnover over the hiatus of WW2, but aside from that it's pretty regular.
So anyway, unless you think General Hospital is the one true ring of Hollywood, I think you only have two options:
- Follow the approach in the Dutch soccer paper and break a long running institution into smaller contemporaneous collaborations — games for Orange and episodes for General Hospital. Unfortunately IMDB doesn’t always have episode specific data for tv shows.
- Drop any non-theatrical content from the dataset. One of the perennial issues in any social research, and especially networks, is bounding the population. I think you can make an excellent substantive case that the production systems for television (and pornography) are sufficiently loosely coupled from theatrical film that they don’t belong in the same network dataset.
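
The difference between the first option and the one-big-clique approach can be sketched with toy data. This is only an illustration: the episode names, the cast lists, and the `project` helper are all hypothetical, and real IMDB data would need its own parsing. The sketch projects the same people into a network twice, once with each episode as its own affiliation and once with the whole show as a single affiliation.

```python
from itertools import combinations


def project(affiliations):
    """Turn {affiliation: cast list} into a set of unordered co-appearance edges."""
    edges = set()
    for cast in affiliations.values():
        # Each affiliation contributes a clique among its own members only.
        edges.update(combinations(sorted(set(cast)), 2))
    return edges


# Hypothetical show whose cast turns over gradually, one person per episode.
episodes = {
    "ep1": ["ann", "bob", "cal"],
    "ep2": ["bob", "cal", "dee"],
    "ep3": ["cal", "dee", "eve"],
}

# The same people lumped together as one affiliation for the whole run.
whole_show = {"show": [name for cast in episodes.values() for name in cast]}

by_episode = project(episodes)
as_one_clique = project(whole_show)

# Edges that exist only because the whole run was treated as one clique.
spurious = as_one_clique - by_episode
```

In this toy case the per-episode version links ann to bob and cal but never to eve, who joined after ann left, while the whole-show version ties everyone to everyone. The per-episode edges form the kind of chain described for the soccer team.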
[*updated 5/18/15: General Hospital was never on the radio. I confused it with Guiding Light, which, like Amos 'n' Andy, did make the jump from network radio to network television.]