Return of True Tales of The IMDB!

April 6, 2009 at 11:39 am 2 comments

| Gabriel |

Note that this advice is written for IMDB but similar principles apply to other continuously updated large string-based relational databases like allmusic or wikipedia.

I recently had to merge in some new variables to the IMDB after not having used the raw data for awhile. Unfortunately this is harder than it sounds since the IMDB frequently updates itself with corrections. For instance, a film called “Umbrella Woman, The (1987)” was called “Good Wife, The (1987)” for its American release. A few years ago the IMDB used the latter title, but it’s since been updated to the former. While this speaks well to the strive for perfection exhibited by the IMDB, it also creates a huge hassle for merging. So the short version of my advice is:

If you start using the IMDB, the very first thing you should do is download every file (even the ones you don’t expect to use), back it up to a DVD or external hard drive, and keep that backup very safe. This way you can stick to a single version of the IMDB. The extra couple days of download time this may take will repay itself many-fold by avoiding most merge errors  introduced by updates if you ever need to add more variables in the future.

Note that this assumes that you’re already using the good (i.e., replicable) data cleaning practices of:

  • Always keep a copy of the raw file.
  • Thoroughly document all changes between the raw file and clean file, preferably by making all changes through a well-documented Stata do-file or a shell-script written in a language like perl or awk. If you do make any changes in a text editor, try to do so using regular expressions and keep detailed notes of the search patterns. Never just type changes directly into the data.

If you are unfortunate enough to be in the midst of a project using IMDB and you need to collect more data you will need to correct the merge errors introduced by the IMDB updates. Since film title is usually the merge key, I’ll concentrate on that but the same could go for personal names in the various personnel files if you wanted to merge the cv’s for people who work in multiple occupations (e.g., Woody Allen has entries in actors.list, writers.list,and directors.list).

First, apply the same changes to the new data as you did to the old data (you did document your changes to the old data, right?). For instance, I like to replace dangerous characters like quote and apostrophe with safe characters like spaces or underscores to keep Stata from interpreting them as something other than literal strings. If I did this to the old stuff, I need to do it to the new stuff too.

Second, a very large proportion of the updates to film title involve correcting the release date (which IMDB treats as part of the title) or changing the definite article. You can usually automate this by merging, breaking the title into a string component and a year component, sorting by string then year, and harmonizing to close matches in adjacent years.

*harmonize spelling
*the film titles in the main file and the newly downloaded data differ in two respects
* 1. the old file removed wildcard chars
* 2. some films have new years associated (eg American Ninja IV was either 1990 or 1991)
*create harmonized spelling key
use olddata.dta, clear
contract film
drop _freq
sort film
save olddata_filmlist.dta, replace
use newdata.dta, clear
sort film
merge film using olddata_filmlist
tab _merge
sort film _merge
quietly gen year=.
quietly gen filmx=""
quietly gen filmx2=""
quietly gen filmy=""
quietly replace year=real(regexs(2)) if regexm(film,"(.+) \(([0-9][0-9][0-9][0-9])/?[I]?[II]?[III]?[IV]?\)")
quietly replace filmx=regexs(1) if regexm(film,"(.+) \(([0-9][0-9][0-9][0-9])/?[I]?[II]?[III]?[IV]?\)")
quietly replace filmx2=regexs(1) if regexm(film,"(.+), The")
sort film _merge
gen close=0
gen mergeprobable=0
foreach n in -3 -2 -1 1 2 3 {
 quietly replace close=.
 quietly replace close=1 if year<=year[_n+`n']+1 & year>=year[_n+`n']-1
 quietly replace mergeprobable=.
 quietly replace mergeprobable=1 if _merge==1 & _merge[_n+`n']==2
 replace filmy=film[_n+`n'] if close==1 & mergeprobable==1 & filmx==filmx[_n+`n']
 replace filmy=film[_n+`n'] if close==1 & mergeprobable==1 & filmx==filmx2[_n+`n'] & filmy=="" /*new, intended to match superfluous "the"*/
}
keep if filmy~=""
quietly compress
keep film filmy
lab var filmy "old data film title"
sort film

*merge back onto the data
merge film using newdata.dta
tab _merge
drop _merge
gen mergename=0
replace film=filmy if filmy~=""
replace mergename=1 if filmy~=""
lab var mergename "dummy indicating title was reverted to old version"
drop filmy
quietly compress
sort film

This code will handle a very large proportion of the merge errors but it won’t do all of them. These you’ll have to identify manually. The way I like to do this is first create a file just of merge errors (both “1” and “2”), use the command “order _merge film”, and export this file to a text editor. If you’ve done anything sensible to your old data like drop all the porn, then you’re going to have a lot more values for the new data than the old data. What you then want to do is search for the merge code indicating this. So if you merged as “using olddata” then you want to search for the regular expression “^2\t” which means a line starting with the merge error code for present in using but not master. Then look around and see if there’s an obvious fit, usually some trivial spelling variation. If not, query the IMDB web interface for a hint. Once you figure out what the correction is, do not correct it directly in the data (which both fails to document your changes and makes it impossible to apply the same changes to other files), but write the correction into a do-file like so:

replace film="Good Wife, The (1987)" if film=="Umbrella Woman, The (1987)"

Entry filed under: Uncategorized. Tags: , , , .

Code and Culture Live on Stage! Diffusion of cell phone viruses and network (dis)externalities

2 Comments

  • 1. Mike3550  |  April 6, 2009 at 11:59 am

    Gabriel, Andrew Gelman had a post recommending a Firefox extension to download all files on a page. I didn’t know if it would be helpful for this (it is certainly helpful downloading Census data that comes in hundreds of different tables).

  • 2. gabrielrossman  |  April 6, 2009 at 12:49 pm

    mike,
    thanks for another good idea. i just tested “Download Them All” and it seems to work with IMDB, which is impressive because it’s FTP not HTTP.
    there are about thirty files in the IMDB and it’s a hassle to manually download them all, especially since it’s a good idea to only start downloading one after you’ve finished the previous one (lest you lose your connection and end up with 30 transfers each 5% complete).
    anyway, i should probably link the three mirrors because it’s not obvious where to find them.
    Germany
    Finland
    Sweden


The Culture Geeks


%d bloggers like this: