True Tales of the IMDB!

March 26, 2009 at 3:29 pm GR

| Gabriel |

Sometimes the hardest thing is getting the data into Stata. I do some work with the raw IMDB files and these can be hard to get into Stata for all sorts of reasons, the first of which is that they are huge.

This is doubly frustrating because most of the reason the files are so huge is stuff like pornography that I plan to drop from the dataset as soon as possible. No kidding, I traced one of my data problems today to a writing credit for someone named “McNoise” for a film called “Business Ass.” (I’m presuming this is porn as I’d rather not look into the matter further).

The hugeness of these files is compounded by the fact that Stata doesn’t store data in memory as compactly as a text file stores it on disk. If you see that a text file is 100 megs, you might foolishly type “set mem 120m” and expect the thing to insheet. In fact it almost certainly will not, because Stata allocates enough memory for every case of a string variable to hold as many characters as the single longest value of that variable. In other words, if 99% of the movies in IMDB have a name that’s 20 characters or fewer but a handful have names that are 244 characters long, then Stata will use as much RAM as if all of them were 244 characters. Thus the Stata memory allocation might have to be three or four times the size of the text file.
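To make the arithmetic concrete, here is a rough Python sketch (Python rather than Stata, and certainly not what Stata does internally) comparing the size of one string column as text with the size it would take if every row were padded to the longest value. It assumes one byte per character and ignores everything else Stata has to store; the function names and the toy data are made up for illustration.

```python
def text_bytes(values):
    """Approximate size on disk: one byte per character plus a newline per row."""
    return sum(len(v) + 1 for v in values)


def fixed_width_bytes(values):
    """Approximate size if every row is padded to the longest value."""
    return max(len(v) for v in values) * len(values)


# Toy column: 99 titles of 20 characters and one title of 244 characters.
titles = ["x" * 20] * 99 + ["y" * 244]

as_text = text_bytes(titles)
as_fixed = fixed_width_bytes(titles)
print(as_text, as_fixed, round(as_fixed / as_text, 1))
```

A single long outlier is enough to inflate the whole column, which is why the text file's size is a poor guide to how much memory to set.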

But even if you somehow had a terabyte of RAM it’s not like you could just type insheet and leave it at that because the files are dirty (and not just because they have so much porn). The most obvious thing is that the tabs don’t match up. The basic organization of the file is like this:

writer1{tab}1st film credit
{tab}{tab}{tab}2nd film credit
...
{tab}{tab}{tab}kth film credit
writer2{tab}1st film credit

This organization means that when you insheet it, the first film credit shows up as v2 but subsequent film credits show up as v4 in different rows. You could fix this in Stata (replace v2=v4 if v2=="") but, remembering what I said about RAM, you really wouldn’t want to. You’re much better off pre-cleaning the data in a good text editor (or, if you plan on doing it routinely, perl). In addition to this systematic thing of first credit, later credit, there are also idiosyncratic errors. For instance, the rapper 50 Cent has a writing credit for a direct-to-video project called “Before I Self Destruct” and there are two tabs between his name and the credit instead of the usual one tab.
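If you did want to script the pre-cleaning, the tab problem might be handled along these lines. This is a minimal Python sketch, not the procedure I actually used: it collapses any run of tabs into one tab, which covers both the three-tab continuation lines and the stray double tab. The second function, which carries the writer's name down onto continuation rows, is an extra assumption about what you'd want downstream. Both function names and the sample lines are hypothetical.

```python
import re


def normalize_tabs(line):
    """Collapse any run of tabs into a single tab."""
    return re.sub(r"\t+", "\t", line)


def iter_credits(lines):
    """Yield (writer, credit) pairs, carrying the writer name forward
    onto continuation lines that start with a tab."""
    writer = ""
    for raw in lines:
        line = normalize_tabs(raw.rstrip("\r\n"))
        if not line.strip():
            continue  # skip blank separator lines
        name, _, credit = line.partition("\t")
        if name:
            writer = name
        yield writer, credit


# Made-up lines in the layout described above.
sample = [
    "writer1\tFilm A (1999)\n",
    "\t\t\tFilm B (2001)\n",
    "\n",
    "50 Cent\t\tBefore I Self Destruct\n",
]
for pair in iter_credits(sample):
    print(pair)
```

After this, every credit sits in the second column, so there is no v4 to merge back in Stata.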

Now here’s the real trick. You insheet your data but half of it’s not there. Note that Stata doesn’t tell you this. You have to check it yourself by using your text editor to see how many rows are in your text file and then typing “desc” in Stata to see whether the number of observations matches. It took me about an hour to realize that the IMDB writers’ file has several hanging quotes (i.e., an odd number of " characters in a string). Because Stata treats " as a string delimiter when you insheet, it ignores all the rows in your text file between your first hanging quote and your second hanging quote (and then between your third and fourth, and so on). If I needed the quotes and/or were more patient I’d figure out how to write a regular expression to find hanging quotes and close them, but because I don’t need them (IMDB uses quotes for print and TV but not films, and I only care about films) I just turned them all into underscores, which is usually a safe character for Stata to handle.
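For the hanging quotes, a few lines of Python can do the checking that I did by eye. This is only a sketch under simple assumptions: it treats a line as having a hanging quote if it contains an odd number of " characters, and it takes the blunt route of turning every quote into an underscore rather than trying to close the open ones. The function names and sample rows are invented for the example.

```python
def lines_with_hanging_quotes(lines):
    """Return 1-based numbers of lines with an odd number of double quotes."""
    return [i for i, line in enumerate(lines, start=1)
            if line.count('"') % 2 == 1]


def replace_quotes(line):
    """Turn every double quote into an underscore."""
    return line.replace('"', "_")


# Made-up rows: the second one has an unclosed quote.
rows = [
    'a\t"Some Show" (1990)',
    'b\t"Broken Show (1991)',
    'c\tSome Film (1992)',
]
print(len(rows))                        # row count to compare with Stata's N
print(lines_with_hanging_quotes(rows))  # which rows would trip up insheet
cleaned = [replace_quotes(r) for r in rows]
```

The row count printed here is the number to compare against what “desc” reports after the insheet.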

Anyway, I did the cleaning in TextWrangler so there’s no script per se, but I did keep notes. You could turn these notes into a perl script but it would only be worth it if you needed to do it several times. The notes show find/replace regular expression patterns: in each pair, the first line is the find pattern and the second is the replacement. The notes are for the file “writers.list”. Because each IMDB file is formatted slightly differently (yeah, I know, isn’t that great) you’ll need different code for different files.

\r\r
\r

\)  \(
\)\t\(

\) \(as
\)\t\(as 

}  \(
}\t\(

^\t\t\t
\t

(twice)
\t\t
\t

"
_

the next few commands save memory but are not necessary. use each of them as a
find pattern to be replaced with nothing. they eliminate non-theatrical credits,
but only if they are not the writer's first credit in the file. the last pattern
matches credits for tv episodes; it looks for literal quotes, so run it before
the quote-to-underscore replacement above.
^\t.+ \([1-2][0-9][0-9][0-9]\) \(TV\).+\r
^\t.+ \([1-2][0-9][0-9][0-9]\) \(VG\).+\r
^\t.+ \([1-2][0-9][0-9][0-9]\) \(V\).+\r
^\t".+" \([1-2][0-9][0-9][0-9]\) {.+\r
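For anyone who would rather script it, here is one way the notes above might translate into Python. Treat it as a sketch of my reading of the notes, not a tested cleaner for the real file: TextWrangler's \r becomes \n (assuming the text was read with normal newline handling), the optional deletions run before the quote replacement because the last pattern matches literal quotes, and the name clean_writers_text and its drop_nontheatrical flag are made up for illustration.

```python
import re

# Optional patterns from the notes: non-theatrical credits on continuation lines.
DROP_PATTERNS = [
    r'^\t.+ \([12][0-9]{3}\) \(TV\).+\n',   # made for TV
    r'^\t.+ \([12][0-9]{3}\) \(VG\).+\n',   # video game
    r'^\t.+ \([12][0-9]{3}\) \(V\).+\n',    # direct to video
    r'^\t".+" \([12][0-9]{3}\) \{.+\n',     # TV episode
]


def clean_writers_text(text, drop_nontheatrical=False):
    """Apply the find/replace steps from the notes, in order, to one string."""
    text = text.replace("\n\n", "\n")        # squeeze blank lines (one pass)
    text = text.replace(")  (", ")\t(")      # two spaces between parentheses
    text = text.replace(") (as", ")\t(as")   # "(as ...)" attribution
    text = text.replace("}  (", "}\t(")      # two spaces after an episode brace
    text = re.sub(r"^\t\t\t", "\t", text, flags=re.MULTILINE)
    for _ in range(2):                       # the notes say "(twice)"
        text = text.replace("\t\t", "\t")
    if drop_nontheatrical:
        for pattern in DROP_PATTERNS:
            text = re.sub(pattern, "", text, flags=re.MULTILINE)
    return text.replace('"', "_")            # quotes last, see lead-in


# Made-up input in the shape of writers.list.
sample_text = (
    'Writer, A\tFilm One (1999)  (written by)\n'
    '\t\t\tFilm Two (2001) (TV)  (writer)\n'
    '\t\t\t"Some Show" (2003) {Pilot (#1.1)}  (writer)\n'
    '\n'
    'Writer, B\t\tFilm Three (2005) (as B. Writer)\n'
)
print(clean_writers_text(sample_text, drop_nontheatrical=True))
```

Reading and writing the actual file is left out on purpose; the point is only the order of the replacements.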

Entry filed under: Uncategorized. Tags: cleaning, IMDB, regular expressions, Stata.


