Posts Tagged IMDB

Team Sorting

| Gabriel |

Tyler Cowen links to an NBER paper by Hoxby that shows that in recent decades, status sorting has gotten more intense for college. Cowen asks “is this a more general prediction in a superstars model?” The archetypal superstar system is Hollywood, and here’s my quick and dirty stab at answering Tyler’s question for that field. Faulkner and Anderson’s 1987 AJS showed that there is a lot of quality sorting in Hollywood, but they didn’t give a time trend. As shown in my forthcoming ASR with Esparza and Bonacich, there are big team spillovers so this is something we ought to care about.

I’m reusing the dataset from our paper, which is a subset of IMDB for Oscar eligible films (basically, theatrically-released non-porn) from 1936-2005. If I were doing it for publication I’d do it better (i.e., I’d allow the data to have more structure and I’d build confidence intervals from randomness), but for exploratory purposes the simplest way to measure sorting is to see if a given film had at least one prior Oscar nominee writer, director, and actor. From that I can calculate the odds-ratio of having an elite peer in the other occupation.
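
For readers who want to see the arithmetic, here is a minimal Python sketch of an odds-ratio from a 2x2 table of films. It is not the Stata used for this post; the function name and the counts are made up for illustration only.

```python
def odds_ratio(both, a_only, b_only, neither):
    """Odds-ratio for a 2x2 table of film counts.

    both    = films with a prior nominee in both occupations
    a_only  = films with a prior nominee in occupation A only
    b_only  = films with a prior nominee in occupation B only
    neither = films with no prior nominee in either occupation
    """
    return (both * neither) / (a_only * b_only)


# Hypothetical counts, NOT taken from the actual dataset.
example = odds_ratio(both=300, a_only=700, b_only=500, neither=8500)
print(round(example, 2))
```

A value above 1 means films with an elite member in one occupation have higher odds of also having one in the other.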

Overall, a movie that has at least one prior nominee writer has 7.3 times the odds of other films of having a prior nominee director and 4.4 times the odds of having a prior nominee cast. A cast with a prior nominee has 6.5 times the odds of having a prior nominee director. Of course we already knew from Faulkner and Anderson that there was a lot of sorting; the question suggested by Hoxby/Cowen is how that sorting has changed over time.

This little table shows odds-ratios for cast-director, writer-director, and writer-cast. Big numbers mean more intense sorting.

     +-------------------------------------------+
     | decade           cd         wd         wc |
     |-------------------------------------------|
  1. | 1936-1945  6.545898   6.452388   4.306554 |
  2. | 1946-1955  9.407476   6.425553   5.368151 |
  3. | 1956-1965  12.09229   8.741302   6.720059 |
  4. | 1966-1975  4.697238   5.399081   4.781106 |
  5. | 1976-1985  4.113508   6.984528   4.450109 |
  6. | 1986-1995  4.923809   7.599852   3.301461 |
  7. | 1996-2005  4.826018   12.35915   3.641975 |
     +-------------------------------------------+

The trend is a little complicated. For collaborations between Oscar-nominated casts on the one hand and either writers or directors on the other, the sorting is most intense in the 1946-1955 decade and especially the 1956-1965 decade. My guess is that this is tied to the decline of the studio system and/or the peak power of MCA. The odds-ratio of having a nominated director for nominated versus non-nominated writers also jumps around the end of the studio system, but there seems to be a second jump starting in the 80s. My guess is that this is an artifact of the increasing number of writer-directors (see Baker and Faulkner AJS 1991), but it’s an empirical question.

Putting aside the writer-director thing, it seems that sorting is not growing stronger in Hollywood. My guess is that ever more intense sorting is not a logical necessity of superstar markets, but has to do with contingencies, such as the rise of a national market for elite education in Hoxby’s case or the machinations of Lew Wasserman and Olivia de Havilland in my case.

The Stata code is below. (sorry that wordpress won’t preserve the whitespace). The data consists of film-level data with dummies for having at least one prior nominee for the three occupations.

global parentpath "/Users/rossman/Documents/oscars"

capture program drop makedecade
program define makedecade
gen decade=year
recode decade 1900/1935=. 1936/1945=1 1946/1955=2 1956/1965=3 1966/1975=4 1976/1985=5 1986/1995=6 1996/2005=7
capture lab drop decade
lab def decade 1 "1936-1945" 2 "1946-1955" 3 "1956-1965" 4 "1966-1975" 5 "1976-1985" 6 "1986-1995" 7 "1996-2005"
lab val decade decade
end

cd $parentpath

capture log close
log using $parentpath/sorting_analysis.log, replace

use sorting, clear
makedecade

*do odds-ratio of working w oscar nom, by own status

capture program drop allstar
program define allstar
preserve
if "`1'"!="" {
keep if decade==`1'
}
tabulate cast director, matcell(CD)
local pooled_cd=(CD[2,2]*CD[1,1])/(CD[1,2]*CD[2,1])
tabulate writers director, matcell(WD)
local pooled_wd=(WD[2,2]*WD[1,1])/(WD[1,2]*WD[2,1])
tabulate writers cast, matcell(WC)
local pooled_wc=(WC[2,2]*WC[1,1])/(WC[1,2]*WC[2,1])
shell echo "`pooled_cd' `pooled_wd' `pooled_wc' `1'" >> sortingresults.txt
restore
end

shell echo "cd wd wc decade" > sortingresults.txt
quietly allstar
forvalues t=1/7 {
quietly allstar `t'
}

insheet using sortingresults.txt, delimiter(" ") names clear
lab val decade decade

 

*have a nice day

November 8, 2009

Shufflevar

| Gabriel |

Sometimes you face a situation where it’s really hard to see what the null is because the data structure is really complicated and there is all sorts of nonlinearity, etc. Analyses of non-sparse square network matrices can use the quadratic assignment procedure, but you can do something similar with other data structures, including bipartite networks.

A good null keeps everything constant, but shows what associations we would expect were association random. The simplest way to do this is to keep the actual variable vectors but randomly sort one of the vectors. So for instance, you could keep the actual income distribution and the actual values of people’s education, race, etc, but randomly assign actual incomes to people.

Fernandez, Castilla, and Moore used what was basically this approach to build a null distribution of the effects of employment referrals. Since then Ezra Zuckerman has used it in several papers on Hollywood to measure the strength of repeat collaboration. I myself am using it in some of my current radio work to understand how much corporate clustering we’d expect to see in the diffusion of pop songs under the null hypothesis that radio corporations don’t actually practice central coordination.

I wrote a little program that takes as its argument the variable you want shuffled. Its application is similar to that of bsample, and like bsample it’s best used as part of a loop.

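capture program drop shufflevar
program define shufflevar
  * usage: shufflevar varname (creates varname_shuffled)
  local shufflevar `1'
  * remember the current row order so it can be restored
  tempvar oldsortorder
  gen `oldsortorder'=[_n]
  * put the rows in random order
  tempvar newsortorder
  gen `newsortorder'=uniform()
  sort `newsortorder'
  capture drop `shufflevar'_shuffled
  * give each row the value of its neighbor in the random order
  gen `shufflevar'_shuffled=`shufflevar'[_n-1]
  replace `shufflevar'_shuffled=`shufflevar'[_N] in 1/1
  * restore the original row order
  sort `oldsortorder'
  drop `newsortorder' `oldsortorder'
end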

Here’s an example to show how much clustering of “y” you’d expect to see by “clusterid” if we keep the observed distributions of “y” and “clusterid” but break any association between them:

shell echo "run rho" > _results_shuffled.txt

forvalues run=1/1000 {
  disp "iteration # `run' of 1000"
  quietly shufflevar clusterid
  quietly xtreg y, re i(clusterid_shuffled)
  shell echo "`run' `e(rho)'" >> _results_shuffled.txt
}

insheet using _results_shuffled.txt, names clear delimiter(" ")
histogram rho
sum rho

(Note that “shell echo” only works on Mac/Unix; Windows users should try postfile.)
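
The same permutation logic can be sketched outside Stata. The Python below is an illustration under stated assumptions, not a port of the program above: the function names are hypothetical, and a simple between-cluster share of variance stands in for the rho that xtreg reports.

```python
import random
from statistics import mean, pvariance


def between_share(y, cluster):
    """Share of the variance in y that lies between cluster means."""
    groups = {}
    for value, c in zip(y, cluster):
        groups.setdefault(c, []).append(value)
    grand = mean(y)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values()) / len(y)
    return between / pvariance(y)


def null_distribution(y, cluster, runs=1000, seed=0):
    """Recompute the statistic after repeatedly shuffling the cluster ids."""
    rng = random.Random(seed)
    shuffled = list(cluster)
    draws = []
    for _ in range(runs):
        # Same ids, same y values, but any link between them is broken.
        rng.shuffle(shuffled)
        draws.append(between_share(y, shuffled))
    return draws
```

Comparing the observed statistic to the histogram of `draws` plays the same role as the histogram of rho above.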

October 26, 2009

Journey to the True Tales of the IMDB!

| Gabriel |

Following up on yesterday’s post, check out this paper by Herr and his colleagues that graphs IMDB and provides some basic descriptions of the network. You can also see a zoomable version of their truly gorgeous visualization. Finally the answer to that age old question, what do you get for the quantitative cultural sociologist who has everything?

The authors are affiliated with the Cyberinfrastructure for Network Science Center at Indiana. Although Indiana sociology has a well-deserved reputation for hardcore quant research, CNS is at the school of Information. Following the logic I learned from reading Marvel comics as a kid I can only speculate that something about the pesticide run-off in the drinking water gives scholars at Indiana superhuman abilities to code.

Also of note is that CNS provides the cross-platform open source package Network Workbench. I was a little skeptical because it’s written in Java (which tends to be slow) but I got it to create a PageRank vector of a huge dataset in six minutes, which isn’t bad at all. I may have more to say about this program in the future as I plan to tinker with it.

June 19, 2009

Son of True Tales of the IMDB!

| Gabriel |

Continuing with the discussion of IMDB networks …

Although it’s pretty much futile to get Stata to calculate any network parameters beyond degree centrality, it’s actually good at cleaning collaboration network data and converting it to an edge list. You can then export this edge list to a package better suited for network analysis like Pajek, Mathematica, or any of several packages written in R or SAS.

The IMDB is a bipartite network where the worker is one mode and the film is the other. Presumably you’ll be reducing this to a one-mode network, traditionally a network of actors (connected by films) but you can do a network of films (connected by actors). So you’ll need to start with the personnel files (writers.list, actors.list, actresses.list, etc). Whether you want one profession (just actors) or all the professions is a judgement call but actors are traditional.

Having decided which files you want to use, you have to clean them. (See previous thoughts here). Most of the files are organized as some variation on this:

Birch, Thora	Alaska (1996)  [Jessie Barnes]  <1>
	Ghost World (2000)  [Enid]  <1>

So first you clean the file in perl or a text editor, then “insheet” with Stata. There are two issues:

  1. In all of the files the worker name appears only on the first record; subsequent credits begin with whitespace. To fill it in use this command:
     replace var1=var1[_n-1] if var1==""
  2. The name is tab-delimited from the credit, but the “credit” includes several types of information. You’ll need to do a regular expression search either in the text editor to turn the tags to tabs or in Stata use regexm/regexs to pull the information out from within the tags. For instance in the actor/actress files “[]” shows the name of the character and “<>” the credit rank. Parentheses show the release date, but that’s effectively part of the film title as it helps distinguish between remakes.
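
As a rough illustration of both steps, here is a Python sketch that fills the name down and pulls the tags apart. It assumes the layout of the two sample lines above (name, tab, title with year, bracketed character, rank in angle brackets); the regular expression and function name are my own guesses and will not cover every variation in the real files.

```python
import re

# Assumed layout: title (year)  [character]  <rank>, with the last two optional.
CREDIT = re.compile(
    r"^(?P<film>.+?\(\d{4}[^)]*\))\s*"
    r"(?:\[(?P<role>[^\]]*)\])?\s*"
    r"(?:<(?P<rank>\d+)>)?\s*$"
)


def parse_credits(lines):
    """Return (name, film, role, rank) tuples, carrying the name forward."""
    rows = []
    name = ""
    for line in lines:
        if not line.strip():
            continue
        first, _, rest = line.rstrip("\n").partition("\t")
        if first:
            # A non-empty first column means a new worker starts here.
            name = first
        match = CREDIT.match(rest.strip())
        if match:
            rank = match.group("rank")
            rows.append((name, match.group("film"), match.group("role"),
                         int(rank) if rank else None))
    return rows
```

The year stays attached to the film title, for the reason given in the second item.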

Now you need to append the personnel files to each other in a file we can call credits.dta. Whether you include just actors and actresses or all the professions is a judgement call. The next couple steps are not necessary in theory but in practice they are very helpful for keeping the file sizes reasonably small. So it helps a lot to encode the data, though because of the large number of values you have to do it manually.

*the following block of code is basically a roundabout "encode" command but it doesn't have the same limitations
use credits.dta, clear
contract name
drop _freq
sort name
*create "i" as a name serial number based on row number/ alphabetical order
gen i=[_n]
lab var i "name id"
save i_key, replace
outsheet using i_key.txt, replace
sort i
*i_keyb.dta is same as i_key but sorted by "i" instead of name.
*substantively they are identical, but having two versions is useful for merging
save i_keyb.dta, replace

*create list of films and assign serial number "j" to each, just as with "i" for name
use credits.dta, clear
keep film
contract film
drop _freq
sort film
gen j=[_n]
lab var j "film id"
save j_key, replace
outsheet using j_key.txt, replace
sort j
save j_keyb, replace
clear

The next memory-saving step is to break it up into annual files. This will work if you plan to have films connect actors but not the other way around.

*create annual credit (ijt) files
forvalues t=1900/2009 {
 use credits.dta, clear
 keep if year==`t'
 compress
 sort name
 merge name using i_key.dta
 tab _merge
 keep if _merge==3
 drop name _merge
 sort film
 merge film using j_key.dta
 tab _merge
 keep if _merge==3
 keep i j
 sort i j
 save ij_`t'.dta, replace
}

Now that you have a set of encoded annual credit files, it’s time to turn these two-mode files into one-mode edge lists.

*create dyads/collaborations (ii) by year
forvalues t=1900/2009 {
 use ij_`t'.dta, clear
 ren i ib
 sort j
 *square the matrix of each film's credits
 joinby j using ij_`t'.dta
 *eliminate auto-ties
 drop if i==ib
 *drop film titles.
 drop j
 contract i ib
 drop _freq /*optional, keep it and treat as tie strength*/
 compress
 save ii_`t'.dta, replace
}

At this point you can use “append” (followed by contract or collapse) to combine waves. Export to ASCII and knock yourself out in a program better suited for network analysis than Stata. (At least until somebody inevitably jerry-rigs SNA out of Mata). Remember that the worker and film names are stored in i_key.txt and j_key.txt.
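
For comparison, the projection that the joinby step performs can be sketched in a few lines of Python. This is an illustration, not a replacement for the Stata above: the function name is hypothetical, and unlike the joinby version (which yields each tie in both directions) it records each pair once.

```python
from collections import Counter
from itertools import combinations


def project(credits):
    """Turn (worker_id, film_id) pairs into a Counter of worker-worker ties."""
    cast = {}
    for i, j in credits:
        cast.setdefault(j, set()).add(i)
    ties = Counter()
    for members in cast.values():
        # Every pair of workers on the same film gets one tie per shared film.
        for a, b in combinations(sorted(members), 2):
            ties[(a, b)] += 1
    return ties
```

The counts correspond to the optional _freq tie strength in the Stata loop.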

June 18, 2009

Bride of True Tales of the IMDB!

| Gabriel |

One of the things social scientists (and the physicists who love them) like to do with IMDB is use it to build up collaboration networks, which is basically playing the Kevin Bacon game but dressed up with terms like “mean path length” and “reachability.” This dates back to the late 1990s, before the “social media” fad made for an abundance of easily downloadable (or scrapable) large-scale social networks. Believe it or not, as recently as the early 1990s network people were still doing secondary analyses of the same handful of tiny datasets they’d been using for decades. If you spent a career trying to model rivalry among a couple dozen monks or marriage alliances amongst Florentine merchant families, you would have been excited about graphing the IMDB too.

Anyway, there are a few problems with using IMDB, several of which I’ve already discussed. The main thing is that it’s really, really, really big and when you try to make it into a network it just gets ludicrous. In part this is because of a few outlier works with really large casts.

Consider the 800 pound gorilla of IMDB, General Hospital, which has been on tv since 1963 (and was on the radio long before that). That’s 46 years of not just the gradually churning ensemble cast, but guest stars and even bit part players with one line. I forget the exact number, but something like 1000 people have appeared in General Hospital. Since the logic of affiliation networks treats all members of the affiliation as a clique, this is one big black mess of 1000 nodes and 499,500 edges. A ginormous clique like this can make an appreciable impact on things like the overall clustering coefficient (which in turn is part of the small world index). Likewise it can do weird things to node-level traits like centrality.
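
The edge count follows from the fact that a clique of n nodes has n(n-1)/2 undirected edges. A one-line Python check, using the rough cast size of 1000 recalled above (the function name is just for illustration):

```python
def clique_edges(n):
    """Number of undirected edges in a clique of n nodes."""
    return n * (n - 1) // 2


print(clique_edges(1000))  # 499500
```

Because the count grows with the square of the cast size, a single long-running show can contribute more ties than thousands of ordinary films.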

Furthermore, unless you have some really esoteric theoretical concerns, it doesn’t even make sense to think of this being a collaboration that includes both the original actors and the current stars (most of whom weren’t even born in 1963). Many of the “edges” in the clique involve people who, far from having any kind of meaningful contact, didn’t even set foot on set within four decades of each other. For a different approach, consider an article in the current issue of Connections which graphs the Dutch national soccer team (pre-print here). The article does not treat the entire history of the team as one big clique (which would make for a short article) but rather an edge is defined as appearing in the same match. Not surprisingly the resulting structure is basically a chain as the team slowly rotates out old players and in new players. Overall it reminds me of one of the towers you’d build in World of Goo. The closest it gets to breaking a structure off from the giant component is the substantial turnover over the hiatus of WW2, but aside from that it’s pretty regular.

So anyway, unless you think General Hospital is the one true ring of Hollywood I think you only have two options:

  1. Follow the approach in the Dutch soccer paper and break a long running institution into smaller contemporaneous collaborations — games for Orange and episodes for General Hospital. Unfortunately IMDB doesn’t always have episode specific data for tv shows.
  2. Drop any non-theatrical content from the dataset. One of the perennial issues in any social research, and especially networks, is bounding the population. I think you can make an excellent substantive case that the production systems for television (and pornography) are sufficiently loosely coupled from theatrical film that they don’t belong in the same network dataset.

June 17, 2009

Field-tagged data

| Gabriel |

Most of the datasets we deal with are rectangular in that the variables are always in the same order (whether they are free or fixed) and the records are delimited with a carriage return. A data format that’s less familiar to us but actually quite common in other applications is the field-tagged format. One example is the BibTex citation database format. Likewise, some of the files in IMDB are a weird hybrid of rectangular and field-tagged. If data formats were human languages and sentences were data records, rectangular formats would be word order syntax (like English) and field-tagged formats would be case marker syntax (like Latin or German). (Yes, I have a bad habit of making overly complicated metaphors that make sense only to me.)

In rectangular formats like field-delimited data (.csv or .tsv) or fixed-width data (.prn) you have one record per row and the same variables in the same order for each row, with the variables being either separated by a delimiter (usually comma or tab) or fixed-width with each variable being defined in the data dictionary as columns x-y (which was a really good idea back when we used punch cards, you know, to keep track of our dinosaur herds). In contrast, with a field-tagged format each record spans multiple rows and the first row contains the key that identifies the data. Subsequent rows usually begin with a tab, then a tag that identifies the name of the variable, followed by a delimiter and finally the actual content of the variable for that case. The beginning and end of the record are flagged with special characters. For example here’s a BibTex entry:

@book{vogel_entertainment_2007,
	address = {Cambridge},
	edition = {7th ed.},
	title = {Entertainment Industry Economics: A Guide for Financial Analysis},
	isbn = {9780521874854},
	publisher = {Cambridge University Press},
	author = {Harold Vogel},
	year = {2007}
},

The first thought is, why would anyone want to organize data this way? It certainly doesn’t make it easier to load into Stata (and even if it’s less difficult in R it’s still going to be harder than doing a csv). Basically the reasons people use field-tagged data are that it’s more human-readable / human-editable (a lot of people write BibTex files by hand, although personally I find it easier to let Zotero do it). Not only do you not have to remember what the fifth variable is, but you have more flexibility with things like “comment” fields which can be any length and have internal carriage returns. This is obviously a nice feature for a citation database as it means you can keep detailed notes directly in the file. Furthermore, they are good with situations where you have a lot of “missing data.” BibTex entries can potentially have dozens of variables but most works only require a few of them. For instance the Vogel citation only has eight fields and most of the other potential fields, things like translator, editor, journal title, series title, etc., are appropriately “missing” because they are simply not applicable to this book. It saves a lot of whitespace in the file just to omit these fields entirely rather than having them in but coded as missing (which is what you’d have to do to format BibTex as rectangular).

Nonetheless, if you want to get it into Stata, you need to shoehorn it into rectangular format. Perhaps this is all possible to handle with the “infile” command but last time I tried I couldn’t figure it out. (Comments are welcome if anyone actually knows how to do this). The very clumsy hack I use for this kind of data is to use a text editor to do a regular expression search that first deletes everything but the record key and the variable I want. I then do another search to convert carriage returns to tabs for lines beginning with the record key. I now have a rectangular dataset with the key and one variable. I can save this and get it into Stata. This is a totally insane example both because I can’t imagine why you’d want citation data in Stata and also because there are easier ways to do this (like export filters in citation software) but imagine that you wanted to get “year” and “author” out of a BibTex file and make it rectangular. You would want to run the following regexp patterns through a text editor (or write them into a perl script if you planned on doing it regularly):

^\t(?!year\b).+\r

Sometimes this is all you need, but what if you want several variables? Basically, lather, rinse, repeat until you have one file per variable, then merge them in Stata. The reason you need a separate file for each variable is that otherwise it’s really easy to get your variables switched around. Because field-tagged formats are so forgiving about having variables in arbitrary orders or missing altogether, when you try to turn it into rectangular you’ll get a lot of values in the wrong column.
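
If you would rather script it than run searches by hand, the idea can be sketched in Python. This is a minimal illustration that assumes one field per line and brace-delimited values, as in the Vogel entry above; the function name and patterns are hypothetical, and real BibTex files (multi-line fields, quoted values) would need more care.

```python
import re

ENTRY_START = re.compile(r"^@(\w+)\{([^,]+),")
FIELD = re.compile(r"^\s*(\w+)\s*=\s*\{(.*)\},?\s*$")


def bibtex_to_rows(text, wanted):
    """Return one dict per entry with the key plus the wanted fields.

    Fields absent from an entry come back as empty strings, which keeps
    every value in the right column.
    """
    rows = []
    current = None
    for line in text.splitlines():
        start = ENTRY_START.match(line)
        if start:
            current = {"key": start.group(2)}
            current.update({name: "" for name in wanted})
            rows.append(current)
            continue
        field = FIELD.match(line)
        if field and current is not None and field.group(1) in wanted:
            current[field.group(1)] = field.group(2)
    return rows
```

Looking fields up by tag rather than by position is what avoids the switched-column problem.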

June 10, 2009

Return of True Tales of The IMDB!

| Gabriel |

Note that this advice is written for IMDB but similar principles apply to other continuously updated large string-based relational databases like allmusic or wikipedia.

I recently had to merge in some new variables to the IMDB after not having used the raw data for a while. Unfortunately this is harder than it sounds since the IMDB frequently updates itself with corrections. For instance, a film called “Umbrella Woman, The (1987)” was called “Good Wife, The (1987)” for its American release. A few years ago the IMDB used the latter title, but it’s since been updated to the former. While this speaks well of the IMDB’s striving for perfection, it also creates a huge hassle for merging. So the short version of my advice is:

If you start using the IMDB, the very first thing you should do is download every file (even the ones you don’t expect to use), back it up to a DVD or external hard drive, and keep that backup very safe. This way you can stick to a single version of the IMDB. The extra couple days of download time this may take will repay itself many-fold by avoiding most merge errors introduced by updates if you ever need to add more variables in the future.

Note that this assumes that you’re already using the good (i.e., replicable) data cleaning practices of:

  • Always keep a copy of the raw file.
  • Thoroughly document all changes between the raw file and clean file, preferably by making all changes through a well-documented Stata do-file or a shell-script written in a language like perl or awk. If you do make any changes in a text editor, try to do so using regular expressions and keep detailed notes of the search patterns. Never just type changes directly into the data.

If you are unfortunate enough to be in the midst of a project using IMDB and you need to collect more data you will need to correct the merge errors introduced by the IMDB updates. Since film title is usually the merge key, I’ll concentrate on that but the same could go for personal names in the various personnel files if you wanted to merge the cv’s for people who work in multiple occupations (e.g., Woody Allen has entries in actors.list, writers.list, and directors.list).

First, apply the same changes to the new data as you did to the old data (you did document your changes to the old data, right?). For instance, I like to replace dangerous characters like quote and apostrophe with safe characters like spaces or underscores to keep Stata from interpreting them as something other than literal strings. If I did this to the old stuff, I need to do it to the new stuff too.

Second, a very large proportion of the updates to film title involve correcting the release date (which IMDB treats as part of the title) or changing the definite article. You can usually automate this by merging, breaking the title into a string component and a year component, sorting by string then year, and harmonizing to close matches in adjacent years.
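
The matching rule just described can be sketched in Python. This is an illustration under assumptions of my own (titles end in a parenthesized four-digit year, optionally followed by a slash and a Roman numeral, and a trailing “, The” is treated as optional); the function names are hypothetical.

```python
import re

TITLE = re.compile(r"^(?P<stem>.+) \((?P<year>\d{4})(?:/[IVX]+)?\)$")


def split_title(film):
    """Split 'Title, The (1987)' into ('Title', 1987); None if no year found."""
    m = TITLE.match(film)
    if not m:
        return None
    stem = re.sub(r", The$", "", m.group("stem"))
    return stem, int(m.group("year"))


def harmonize(new_titles, old_titles, tolerance=1):
    """Map unmatched new titles to old titles with the same stem and a close year."""
    old_set = set(old_titles)
    old_index = {}
    for film in old_titles:
        parts = split_title(film)
        if parts:
            old_index.setdefault(parts[0], []).append((parts[1], film))
    mapping = {}
    for film in new_titles:
        if film in old_set:
            continue  # exact match, nothing to fix
        parts = split_title(film)
        if not parts:
            continue
        stem, year = parts
        candidates = [(abs(year - y), old)
                      for y, old in old_index.get(stem, [])
                      if abs(year - y) <= tolerance]
        if candidates:
            mapping[film] = min(candidates)[1]
    return mapping
```

Anything left out of the returned mapping is a title with no near match, which still has to be checked by hand.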

*harmonize spelling
*the film titles in the main file and the newly downloaded data differ in two respects
* 1. the old file removed wildcard chars
* 2. some films have new years associated (eg American Ninja IV was either 1990 or 1991)
*create harmonized spelling key
use olddata.dta, clear
contract film
drop _freq
sort film
save olddata_filmlist.dta, replace
use newdata.dta, clear
sort film
merge film using olddata_filmlist
tab _merge
sort film _merge
quietly gen year=.
quietly gen filmx=""
quietly gen filmx2=""
quietly gen filmy=""
quietly replace year=real(regexs(2)) if regexm(film,"(.+) \(([0-9][0-9][0-9][0-9])/?[I]?[II]?[III]?[IV]?\)")
quietly replace filmx=regexs(1) if regexm(film,"(.+) \(([0-9][0-9][0-9][0-9])/?[I]?[II]?[III]?[IV]?\)")
quietly replace filmx2=regexs(1) if regexm(film,"(.+), The")
sort film _merge
gen close=0
gen mergeprobable=0
foreach n in -3 -2 -1 1 2 3 {
 quietly replace close=.
 quietly replace close=1 if year<=year[_n+`n']+1 & year>=year[_n+`n']-1
 quietly replace mergeprobable=.
 quietly replace mergeprobable=1 if _merge==1 & _merge[_n+`n']==2
 replace filmy=film[_n+`n'] if close==1 & mergeprobable==1 & filmx==filmx[_n+`n']
 replace filmy=film[_n+`n'] if close==1 & mergeprobable==1 & filmx==filmx2[_n+`n'] & filmy=="" /*new, intended to match superfluous "the"*/
}
keep if filmy~=""
quietly compress
keep film filmy
lab var filmy "old data film title"
sort film

*merge back onto the data
merge film using newdata.dta
tab _merge
drop _merge
gen mergename=0
replace film=filmy if filmy~=""
replace mergename=1 if filmy~=""
lab var mergename "dummy indicating title was reverted to old version"
drop filmy
quietly compress
sort film

This code will handle a very large proportion of the merge errors but it won’t do all of them. These you’ll have to identify manually. The way I like to do this is first create a file just of merge errors (both “1” and “2”), use the command “order _merge film”, and export this file to a text editor. If you’ve done anything sensible to your old data like drop all the porn, then you’re going to have a lot more values for the new data than the old data. What you then want to do is search for the merge code indicating this. So if you merged as “using olddata” then you want to search for the regular expression “^2\t” which means a line starting with the merge error code for present in using but not master. Then look around and see if there’s an obvious fit, usually some trivial spelling variation. If not, query the IMDB web interface for a hint. Once you figure out what the correction is, do not correct it directly in the data (which both fails to document your changes and makes it impossible to apply the same changes to other files), but write the correction into a do-file like so:

replace film="Good Wife, The (1987)" if film=="Umbrella Woman, The (1987)"

April 6, 2009

True Tales of the IMDB!

| Gabriel |

Sometimes the hardest thing is getting the data into Stata. I do some work with the raw IMDB files and these can be hard to get into Stata for all sorts of reasons, the first of which is that they are huge.

This is doubly frustrating because most of the reason the files are so huge is stuff like pornography that I plan to drop from the dataset as soon as possible. No kidding, I traced one of my data problems today to a writing credit for someone named “McNoise” for a film called “Business Ass.” (I’m presuming this is porn as I’d rather not look into the matter further).

The hugeness of these files is compounded by the fact that Stata doesn’t store data in memory as compactly as a text file stores it on disk. If you see a text file is 100 megs, you might foolishly type “set mem 120m” and expect the thing to insheet. In fact it almost certainly will not because Stata uses enough memory for each case of each string variable to have as many characters as the single longest value for that variable. In other words, if 99% of the movies in IMDB have a name that’s 20 characters or less long but a handful have names that are 244 characters long, then Stata will use as much RAM as if all of them were 244 characters. Thus the Stata memory allocation might have to be three or four times the size of the text file.
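
To make the arithmetic concrete, here is a toy Python comparison of the two storage schemes. It assumes one byte per character and a single string column, and the function names are made up; the ratio it produces for this extreme toy case is much larger than the three-or-four-times figure for a whole file.

```python
def fixed_width_bytes(values):
    """Bytes needed if every row reserves room for the longest value."""
    return len(values) * max(len(v) for v in values)


def text_bytes(values):
    """Approximate bytes in a text file: the characters plus one newline each."""
    return sum(len(v) + 1 for v in values)


# 99 titles of 20 characters and one of 244, echoing the example above.
titles = ["x" * 20] * 99 + ["x" * 244]
print(fixed_width_bytes(titles), text_bytes(titles))
```

A single outlier title is enough to set the width for every row in the column.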

But even if you somehow had a terabyte of RAM it’s not like you could just type insheet and leave it at that because the files are dirty (and not just because they have so much porn). The most obvious thing is that the tabs don’t match up. The basic organization of the file is like this:

writer1{tab}1st film credit
{tab}{tab}{tab}2nd film credit
...
{tab}{tab}{tab}kth film credit
writer2{tab}1st film credit

This organization means that when you insheet it the first film credit shows up as v2 but subsequent film credits show up as v4 in different rows. You could fix this in Stata (replace v2=v4 if v2=="") but remembering what I said about RAM you really wouldn’t want to. You’re much better off pre-cleaning the data in a good text editor (or if you plan on doing it routinely, perl). In addition to this systematic thing of first credit, later credit, there are also idiosyncratic errors. For instance, the rapper 50 Cent has a writing credit for a direct to video project called “Before I Self Destruct” and there are two tabs between his name and the credit instead of the usual one tab.

Now here’s the real trick. You insheet your data but half of it’s not there. Note that Stata doesn’t tell you this. You have to check it yourself by using your text editor to see how many rows are in your text file and then typing “desc” in Stata to see your n and notice if it matches. It took me about an hour to realize that the IMDB writers’ file has several hanging quotes (i.e. an odd number of " characters in a string). Because Stata uses " as a string delimiter when you insheet, Stata ignores all the rows in your text file between your first hanging quote and your second hanging quote (and then between your third and fourth, and so on). If I needed the quotes and/or were more patient I’d figure out how to write a regular expression to find hanging quotes and close them, but because I don’t need them (IMDB uses quotes for print and tv but not films and I only care about films) I just turned them all into underscores which is usually a safe character for Stata to handle.
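
Finding the offending rows is easy to script. A minimal Python sketch (the function name is hypothetical) that flags any line with an odd number of double quotes:

```python
def hanging_quote_lines(lines):
    """Return 1-based line numbers that contain an odd number of double quotes."""
    return [n for n, line in enumerate(lines, start=1)
            if line.count('"') % 2 == 1]
```

Running it over the raw file before importing tells you which rows to inspect; it does not try to guess where the missing quote belongs.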

Anyway, I did the cleaning in TextWrangler so there’s no script per se but I did keep notes. You could turn these notes into a perl script but it would only be worth it if you needed to do it several times. The notes show find/replace regular expression patterns. The notes are for the file “writers.list”. Because each IMDB file is formatted slightly differently (yeah, I know, isn’t that great) you’ll need different code for different files.

\r\r
\r

\)  \(
\)\t\(

\) \(as
\)\t\(as 

}  \(
}\t\(

^\t\t\t
\t

(twice)
\t\t
\t

"
_

the next few following commands will save memory but are not necessary. use
each of them as a find pattern to be replaced with nothing. they eliminate
non-theatrical credits but only if they are not the writer's first credit in
the file. the last pattern matches credit for a tv episode.
^\t.+ \([1-2][0-9][0-9][0-9]\) \(TV\).+\r
^\t.+ \([1-2][0-9][0-9][0-9]\) \(VG\).+\r
^\t.+ \([1-2][0-9][0-9][0-9]\) \(V\).+\r
^\t".+" \([1-2][0-9][0-9][0-9]\) {.+\r

March 26, 2009

