Archive for June, 2009

Journey to the True Tales of the IMDB!

| Gabriel |

Following up on yesterday’s post, check out this paper by Herr and his colleagues that graphs the IMDB and provides some basic descriptions of the network. You can also see a zoomable version of their truly gorgeous visualization. Finally, the answer to that age-old question: what do you get for the quantitative cultural sociologist who has everything?

The authors are affiliated with the Cyberinfrastructure for Network Science Center at Indiana. Although Indiana sociology has a well-deserved reputation for hardcore quant research, CNS is at the school of Information. Following the logic I learned from reading Marvel comics as a kid I can only speculate that something about the pesticide run-off in the drinking water gives scholars at Indiana superhuman abilities to code.

Also of note is that CNS provides the cross-platform open source package Network Workbench. I was a little skeptical because it’s written in Java (which tends to be slow) but I got it to create a PageRank vector of a huge dataset in six minutes, which isn’t bad at all. I may have more to say about this program in the future as I plan to tinker with it.


June 19, 2009 at 5:11 am

Son of True Tales of the IMDB!

| Gabriel |

Continuing with the discussion of IMDB networks …

Although it’s pretty much futile to get Stata to calculate any network parameters beyond degree centrality, it’s actually good at cleaning collaboration network data and converting it to an edge list. You can then export this edge list to a package better suited for network analysis like Pajek, Mathematica, or any of several packages written in R or SAS.

The IMDB is a bipartite network where the worker is one mode and the film is the other. Presumably you’ll be reducing this to a one-mode network, traditionally a network of actors (connected by films), though you can also do a network of films (connected by actors). So you’ll need to start with the personnel files (writers.list, actors.list, actresses.list, etc.). Whether you want one profession (just actors) or all of them is a judgment call, but actors are traditional.

Having decided which files you want to use, you have to clean them. (See previous thoughts here). Most of the files are organized as some variation on this:

Birch, Thora	Alaska (1996)  [Jessie Barnes]  <1>
	Ghost World (2000)  [Enid]  <1>

So first you clean the file in perl or a text editor, then “insheet” with Stata. There are two issues:

  1. In all of the files the worker name appears only on the first record; subsequent credits are blank. To fill it in, use: replace var1=var1[_n-1] if var1==""
  2. The name is tab-delimited from the credit, but the “credit” field includes several types of information. You’ll need to do a regular expression search, either in the text editor to turn the tags into tabs or in Stata using regexm/regexs to pull the information out from within the tags (see the sketch after this list). For instance, in the actor/actress files “[]” shows the name of the character and “<>” the credit rank. Parentheses show the release date, but that’s effectively part of the film title since it helps distinguish between remakes.
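
If you go the Stata route, here’s a minimal sketch of the regexm/regexs step, assuming the credit column came in from insheet as a string variable that you’ve renamed credit (that variable name, and the exact layout of the raw line, are my assumptions):

*a rough sketch of pulling the tagged pieces out of a raw credit string
*assumes a string variable credit that looks like: Alaska (1996)  [Jessie Barnes]  <1>
gen character = regexs(1) if regexm(credit, "\[(.+)\]")
gen rank = real(regexs(1)) if regexm(credit, "<([0-9]+)>")
*everything before the first bracket is the film title (with its parenthetical year)
gen film = trim(substr(credit, 1, strpos(credit, "[") - 1)) if strpos(credit, "[") > 0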

Now you need to append the personnel files to each other into a file we can call credits.dta (whether you include just actors and actresses or all the professions is, again, a judgment call). The next couple of steps are not necessary in theory, but in practice they are very helpful for keeping the file sizes reasonably small. In particular, it helps a lot to encode the data, though because of the large number of distinct values you have to do it by hand rather than with Stata’s encode command.

*the following block of code is basically a roundabout "encode" command but it doesn't have the same limitations
use credits.dta, clear
contract name
drop _freq
sort name
*create "i" as a name serial number based on row number/ alphabetical order
gen i=_n
lab var i "name id"
save i_key, replace
outsheet using i_key.txt, replace
sort i
*i_keyb.dta is same as i_key but sorted by "i" instead of name.
*substantively they are identical, but having two versions is useful for merging
save i_keyb.dta, replace

*create list of films and assign serial number "j" to each, just as with "i" for name
use credits.dta, clear
keep film
contract film
drop _freq
sort film
gen j=_n
lab var j "film id"
save j_key, replace
outsheet using j_key.txt, replace
sort j
save j_keyb, replace
clear

The next memory-saving step is to break the credits up into annual files. This works if you plan to have films connect actors (each film sits in a single year), but not the other way around (an actor’s credits span many years, so annual files would miss ties between films from different years).

*create annual credit (ijt) files
forvalues t=1900/2009 {
 use credits.dta, clear
 keep if year==`t'
 compress
 sort name
 merge name using i_key.dta
 tab _merge
 keep if _merge==3
 drop name _merge
 sort film
 merge film using j_key.dta
 tab _merge
 keep if _merge==3
 keep i j
 sort i j
 save ij_`t'.dta, replace
}

Now that you have a set of encoded annual credit files, it’s time to turn these two-mode files into one-mode edge lists.

*create dyads/collaborations (ii) by year
forvalues t=1900/2009 {
 use ij_`t'.dta, clear
 ren i ib
 sort j
 *square the matrix of each film's credits
 joinby j using ij_`t'.dta
 *eliminate auto-ties
 drop if i==ib
 *drop film titles.
 drop j
 contract i ib
 drop _freq /*optional, keep it and treat as tie strength*/
 compress
 save ii_`t'.dta, replace
}

At this point you can use “append” (followed by contract or collapse) to combine waves. Export to ASCII and knock yourself out in a program better suited for network analysis than Stata. (At least until somebody inevitably jerry-rigs SNA out of Mata). Remember that the worker and film names are stored in i_key.txt and j_key.txt.
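
For what it’s worth, here is a rough sketch of that combine-and-export step, reusing the file and variable names from the loops above; pooling every year into a single edge list is an assumption on my part, not necessarily what you want:

*stack the annual dyad files into a single edge list and export to ASCII
clear
forvalues t=1900/2009 {
 append using ii_`t'.dta
}
*collapse repeated dyads; _freq is then a crude measure of tie strength
contract i ib
outsheet i ib _freq using ii_all.txt, replace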

June 18, 2009 at 5:37 am 1 comment

Bride of True Tales of the IMDB!

| Gabriel |

One of the things social scientists (and the physicists who love them) like to do with IMDB is use it to build up collaboration networks, which is basically playing the Kevin Bacon game but dressed up with terms like “mean path length” and “reachability.” This dates back to the late 1990s, before the “social media” fad made for an abundance of easily downloadable (or scrapable) large-scale social networks. Believe it or not, as recently as the early 1990s network people were still doing secondary analyses of the same handful of tiny datasets they’d been using for decades. If you spent a career trying to model rivalry among a couple dozen monks or marriage alliances amongst Florentine merchant families, you would have been excited about graphing the IMDB too.

Anyway, there are a few problems with using IMDB, several of which I’ve already discussed. The main thing is that it’s really, really, really big and when you try to make it into a network it just gets ludicrous. In part this is because of a few outlier works with really large casts.

Consider the 800-pound gorilla of IMDB, General Hospital, which has been on tv since 1963 (and was on the radio long before that).* That’s 46 years of not just the gradually churning ensemble cast, but guest stars and even bit part players with one line. I forget the exact number, but something like 1000 people have appeared in General Hospital. Since the logic of affiliation networks treats all members of the affiliation as a clique, this is one big black mess of 1000 nodes and roughly 500,000 edges. A ginormous clique like this can make an appreciable impact on things like the overall clustering coefficient (which in turn is part of the small world index). Likewise it can do weird things to node-level traits like centrality.

Furthermore, unless you have some really esoteric theoretical concerns, it doesn’t even make sense to think of this as a collaboration that includes both the original actors and the current stars (most of whom weren’t even born in 1963). Many of the “edges” in the clique involve people who, far from having any kind of meaningful contact, didn’t even set foot on set within four decades of each other. For a different approach, consider an article in the current issue of Connections which graphs the Dutch national soccer team (pre-print here). The article does not treat the entire history of the team as one big clique (which would make for a short article); rather, an edge is defined as appearing in the same match. Not surprisingly, the resulting structure is basically a chain as the team slowly rotates out old players and rotates in new ones. Overall it reminds me of one of the towers you’d build in World of Goo. The closest it gets to breaking a structure off from the giant component is the substantial roster turnover across the WW2 hiatus, but aside from that it’s pretty regular.

So anyway, unless you think General Hospital is the one true ring of Hollywood I think you only have two options:

  1. Follow the approach in the Dutch soccer paper and break a long-running institution into smaller contemporaneous collaborations (games for Orange and episodes for General Hospital). Unfortunately the IMDB doesn’t always have episode-specific data for tv shows.
  2. Drop any non-theatrical content from the dataset. One of the perennial issues in any social research, and especially networks, is bounding the population. I think you can make an excellent substantive case that the production systems for television (and pornography) are sufficiently loosely coupled from theatrical film that they don’t belong in the same network dataset.

[*updated 5/18/15, General Hospital was never on the radio. I confused it with Guiding Light which, like Amos & Andy, did make the jump from network radio to network television]

June 17, 2009 at 5:44 am 1 comment

MDC code

| Gabriel |

A reader requested that I post some code relevant to the multilevel diffusion curve (MDC) method I published in Sociological Methodology. I have code for both the more primitive techniques we discuss in the lit review and our new MDC technique, but neither script is as elegant as it should be.

I’ve already posted code to do the precursor approach by Edwin Mansfield, though I recently learned some matrix syntax that will let me rewrite it to run much more cleanly when I find a chance to do so. The problem with the current version is that it makes extensive use of writing to disk and POSIX commands via “shell.” On Mac/Linux this is ugly but perfectly functional; on Windows it won’t work at all (at least not without Cygwin). I hope to rewrite it to be more elegant and completely self-contained in Stata, but this is a luxury as the current ugly version works on my Mac.

Likewise, I have code (posted below) to do MDC, but it’s also less than ideal. MDC doesn’t regress anything interesting directly, but first runs a regression (“table 2” in the paper) and then uses the quadratic equation to make the results intelligible (“table 3” in the paper). The problem is that my Stata code only does the first step. To do the second half you need to take the output and put it in this Excel spreadsheet. I’m hoping to rewrite it so that the command produces useful output directly but this is easier said than done as it requires a lot of saving returned results, matrix multiplication, and other things that are somewhat difficult to program.

Anyway, in the meantime, here’s the code. It follows a syntax similar to xtreg: in addition to “i” you also specify “nt”, which means adoptions to date. (A hypothetical example call appears after the code.)

capture program drop mdcrun
program define mdcrun
	set more off
	syntax varlist , i(string asis) nt(string asis) 

	disp "This code gives information which must be interpreted"
	disp " with the spreadsheet at http://www.sscnet.ucla.edu/08F/soc210a-1/mdc.xls"
	disp "comments in this output give hints on how to use the spreadsheet"

	gettoken first varlist : varlist

	preserve
	gen cons=1
	foreach var in `varlist' cons {
		quietly gen `var'_1=`nt'*`var'
		quietly gen `var'_2=`nt'*`nt'*`var'
	}

	foreach var in `varlist' cons {
		local varlist_ext="`varlist_ext' `var' `var'_1 `var'_2"
	}

	* create `varlist_ext' as an alternate varlist macro that has the interactions

	disp "-------------------------------"
	disp "Columns M+J, mean and sd"
	sum `varlist'

	disp "-------------------------------"
	disp "put the baseline beta+sd model in J7:N5"
	xtreg `first' `varlist', re i(`i')
	disp "-------------------------------"
	disp "coefficients are vars + interactions with nt and nt^2"
	disp "additive beta+se in J and K"
	disp "var_1 beta+se in L and N"
	disp "var_2 beta+se in T and V"
	disp "Please see AC-AJ for interpretation"
	xtreg `first' `varlist_ext', re i(`i')
	disp "-------------------------------"
	disp "For citation and help with theory/interpretation, see"
	disp `"Rossman, Chiu, and Mol. 2008. "Modeling Diffusions of"'
	disp "Multiple Innovations Via Multilevel Diffusion Curves:"
	disp `"Payola in Pop Music Radio" Sociological Methodology"'
	disp "38:201-230."
	restore

end
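
As a purely hypothetical example of the syntax (none of these variable names come from the paper; they are invented for illustration): the first variable in the varlist is the outcome, the remaining variables are covariates, i() takes the panel id that gets passed to xtreg, and nt() names the variable holding adoptions to date.

*hypothetical call -- adopted is the outcome, x1 and x2 are covariates,
*song_id is the panel id, and nt is adoptions to date (all names invented)
mdcrun adopted x1 x2, i(song_id) nt(nt)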

June 16, 2009 at 5:04 am 1 comment

Clara est la fille de Sophie

| Gabriel |

As referenced at Montclair and Contexts-Graphic, there’s been some interesting work on French baby names. Of course the application of name data to diffusion was seriously kicked off by Lieberson’s book on American baby names. One of the most basic findings of these studies is that names are so fashion-prone that you can practically use them to carbon-date a birth cohort, especially for women. For instance I was born in 1977 and all the girls I grew up with had Hebrew names (Elizabeth, Rachel, Sarah). My daughter was born in 2007 and, at least in our social class, all the girls have Victorian names (Frances, Rose, Lillian). It seems like every girl born in the 1980s and 1990s has a Celtic name (Britney, Erin, Caitlin).

One piece of research I haven’t seen discussed in the soc-blogs is the recent Berger and LeMens PNAS article, which uses data from both countries. This article basically argues that names with extremely rapid rise are stigmatized as faddish and are thereafter dropped from the culture’s active repertoire. I loved this article, and as I’ve argued before, we need more studies of abandonment.

June 15, 2009 at 1:03 am

Delenda Affectus Epistula Est

| Gabriel |

The new version of Skype has a fantastic “screen sharing” feature but also an incredibly annoying one called mood messages. This feature constantly gives you updates on your contacts, including even what they happen to be listening to on iTunes and other status updates they choose to post. Not only does it make this information accessible should you choose to look for it, but it shows up as a history event, so I’m constantly thinking I missed a phone call or something important, only to find out that one of my friends is listening to another song. Personally, I don’t feel the need for all my friends and colleagues to know what song I’m listening to or when I’m using the toilet, nor do I care to know the same about them. If I wanted a ubiquitous adolescent stream of narcissistic micro-banalities I’d already be using Twitter.

Anyway, this feature is turned on by default, so I’m describing how to turn it off and return Skype to being what it ought to be: a dignified and professional tool for video conferencing, not another technologically enabled manifestation of the erosion of personal space through incessant distraction. First, regain your own privacy by going to “Preferences” and deselecting “Enable Mood Message Chat.” Second, ignore the extroversion of your contacts by right-clicking within the mood message pseudo-chat window and selecting “Chat Notification Settings.” In this screen choose “Do Not Notify Me” and “Mark unread messages as read immediately.”

Your dignity and privacy have now returned; enjoy them.

June 14, 2009 at 9:27 pm 4 comments

The fat tail

| Gabriel |

At Slate XX, Virginia Postrel has an article explaining why women’s clothing sizing hasn’t kept up with the increasingly large American woman herself. This is often explained as an indulgence of taste (designers don’t like making clothes for people they find unattractive) or a Podolny-esque status thing (serving stigmatized customers kills your brand). Postrel doesn’t buy any of that, taking the Gary Becker economics-of-discrimination line that some entrepreneur should be filling this demand unless there is some good business reason not to. (I think the “taste” and “status” things are very plausible at the high end, but I agree with Postrel about “the customer is always right” for the mass market.) Her argument is that it’s about the cost of fabric (boring) and, much more interestingly, the right-skewed distribution for size. I think this is worth unpacking and ruminating on because it’s a good example of how it’s more useful to think about distributions than just the central tendencies people find more natural to reason about.

While (within gender) height follows a normal distribution pretty closely, weight has a right skew. Here’s what that means. The medical people tell us that the ideal BMI is about 22, give or take a few points, which works out to about 130 lbs (+/- 20) for a 5’4″ woman. Now a woman at the first percentile for BMI is going to weigh about 90 lbs. On the other hand, a woman as fat (in percentile terms) as Victoria Beckham is thin is going to weigh over 220 lbs, probably more.
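
(If you want to check that arithmetic, it’s just the imperial BMI formula, where 703 is the standard pounds-and-inches conversion factor.)

*weight implied by a BMI of 22 for a woman who is 5 feet 4 inches (64 inches)
display "weight at BMI 22: " round(22*64^2/703) " lbs"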

So relative to healthy, extremely thin is about 40 lbs off whereas extremely fat is at least 90 lbs off. If “healthy” is a reasonable approximation for the median, then at any given point of weight to the left of the median there’s going to be about twice as much density (in the statistical sense) as at a comparable point to the right of the median. In other words (to paraphrase Tolstoy), thin women are all alike; every fat woman is fat in her own way. Postrel’s argument is that to the extent that clothing is meant to be tailored to be a pretty close fit for your body, any given plus size will fit few people even if many people fit some plus size.

If you think of clothing sizes as analogous to binning a distribution to plot it as a histogram, any given bin on the skewed side of a distribution will have less density than a bin at the opposite percentile. Postrel notes that there are certain design costs and inventory costs associated with keeping a size in stock and so if it takes sizes 16 and 17 combined to equal the sales just of size 5, then it’s rational for companies to consider dropping their plus sizes, even though in the aggregate they serve a lot of paying customers.

This makes a fair amount of sense, but I wonder about the extent to which it relies on the assumption that the breadth of a size is always a constant range, say +/- 3 lbs from some target customer. For all I know this is how clothes are sized and ought to be sized, but I wonder whether the practice is for sizing to have a wider tolerance at higher weights. In quant work, when we have a right-skewed distribution we often log the variable. What this effectively does is make the raw-scale bin width a function of x, so as you get higher on x the bins get wider on the raw scale even though they are all the same width on the log scale. I can think of two substantive reasons why it might be appropriate to imagine any given plus clothing size encompassing a wider range of weight than any given petite clothing size.
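
To make the log-scale binning concrete, here’s a toy calculation; the 0.1 log-scale bin width is an arbitrary number I picked for illustration:

*with equal-width bins on the log scale, raw-scale bin width grows in proportion to weight
display "a bin starting at 100 lbs spans about " round(100*(exp(.1)-1), .1) " lbs"
display "a bin starting at 200 lbs spans about " round(200*(exp(.1)-1), .1) " lbs"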

First, there might be a taste difference where thin people tend to prefer tighter-fitting clothes and fat people looser-fitting clothes, especially for things like jeans. Since loose clothes are more forgiving of fit, it would make sense to have broader plus sizes. You see a similar thing in that people with short hair get it cut much more often than people with long hair. I have very short hair and when I think “I need a hair cut,” I’m thinking something closer to “my hair is 30% longer than it should be” rather than “my hair is one inch longer than it should be.”

Second, maybe we shouldn’t be thinking about clothing sizes in pounds at all, but in something like inches (which is how men’s clothes are sized). In this case, geometry diminishes the skew. If you imagine the radius of a human being in cross-section, that person’s weight is approximately pi*r^2*height whereas that person’s waist size is approximately 2*pi*r. The squared term for weight means that weight will be more right-skewed than circumference. These are the same people, but depending on how you measure “size” the distribution may be skewed or it may be symmetrical. This is actually a big deal in statistics generally, since assuming the wrong distribution for a variable can lead to weird distributions for the error term. Hence good statistical practice either transforms skewed variables as part of the data cleaning or uses “count” analyses like Poisson and negative binomial that are designed to work with skewed distributions. There’s also a more basic theoretical question of whether the skewed variable is even the right operationalization. If we’re interested in the size of a person, is weight better than waist size? If we’re interested in the size of an organization, is the number of employees better than the length of the chain of command? In both cases the answer is that it depends on what you’re trying to explain.
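
A quick simulation shows the geometry point; every number here is invented purely for illustration (a symmetric “radius” with mean 5 and sd .5, and a height factor of 1.7):

*simulate a symmetric cross-sectional radius, then compare the skew of
*circumference (linear in r) with weight (quadratic in r)
clear
set obs 10000
set seed 1977
gen r = rnormal(5, .5)
gen circumference = 2*_pi*r
gen weight = _pi*r^2*1.7
sum circumference weight, detail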

Anyway, if, for reasons of either taste or geometry, a size 0 has less tolerance (as measured in pounds) than a size 8, which in turn has less tolerance than a size 16, then this could partially compensate for the dynamic Postrel is describing. However, note that, unlike me, she actually talked to clothiers, so take her data over my armchair speculation.

June 12, 2009 at 5:17 am 2 comments


