Posts tagged ‘regular expressions’

Merging Pajek vertices into Stata

| Gabriel |

Sometimes I use Pajek (or something that behaves similarly, like Mathematica or Network Workbench) to generate a variable which I then want to merge back into Stata. The problem is that the output requires a little cleaning: it’s not as if the first column is your “id” variable as it exists in Stata and the second column is the metric, so that you can just merge on “id.” Instead these programs tend to encode your Stata id variable, which means you have to merge twice: first to associate the Stata id variable with the Pajek id variable, and second to associate the new data with your main dataset.

So the first step is to create a merge file that associates the encoded key with the Stata id variable. You get this from the Pajek “.net” file (i.e., the data file). The first part of this file is the encoding of the nodes; the rest (which you don’t care about for these purposes) is the connections between those nodes. In other words you want to go from this:

*Vertices 3
1 "tom"
2 "dick"
3 "harry"
*Edges
1 2
2 3

to this:

pajek_id	stata_id
1	tom
2	dick
3	harry

The thing that makes this a pain is that “.net” files are usually really big, so if you try to just select the “vertices” part of the file you may be holding down the mouse button for a really long time. My solution is to open the file in a text editor (I prefer TextWrangler for this) and put the cursor at the end of what I want. I then enter the regular expression search pattern “^.+$\r” (or “^.+$\n”) to be replaced with nothing, which has the effect of erasing everything after the cursor. Note that the search should start at the cursor and not wrap, so don’t check “start at top” or “wrap around.” You’ll then be left with just the labels, the edge list having been deleted. Another way to do it is to search the whole file and tell it to delete lines that do not include quote marks, as in the sketch below.
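
For instance, in TextWrangler’s grep syntax, a find pattern along these lines (a sketch, so test it on a copy of the file) should match any line containing no quote marks; replacing it with nothing deletes the edge list while keeping the labeled vertices. Note it will also delete the “*Vertices” and “*Edges” header lines, which you don’t need in the merge file anyway.

^[^"\r]*\r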

Having eliminated the edge list and kept only the encoding key, you still need to get the vertex labels into a nice tab-delimited format, which is easily accomplished with this pattern.

 "(.+)"$
\t\1

Note the leading space in the search regular expression. Also note that if the labels have embedded spaces there should be quotes around \1 in the replacement regular expression.

Manually name the first column “pajek_id” and the second column “stata_id” (or better yet, whatever you call your id variable in Stata) and save the file as something like “pajekmerge.txt”. Now go to Stata and use “insheet,” “sort,” and “merge” to add the “pajek_id” variable to your main dataset. You’re now ready to import the foreign data: use “insheet” to get it into Stata. Some of these programs include an id variable; if so, name it “pajek_id.” Others (e.g., Mathematica) don’t and just rely on ordering, in which case enter the command “gen mathematica_id=_n” (and rename it to match your “pajek_id” key before merging). You’re now ready to merge the foreign data into Stata.
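
Here’s a minimal sketch of that two-step merge in the old-style Stata merge syntax used throughout this post; the file names are hypothetical:

*step 1: attach pajek_id to the main dataset
insheet using pajekmerge.txt, clear
sort stata_id
save pajekmerge.dta, replace
use maindata.dta, clear
sort stata_id
merge stata_id using pajekmerge.dta
tab _merge
drop _merge
*step 2: bring in the foreign output, keyed on pajek_id
sort pajek_id
save maindata_keyed.dta, replace
insheet using pajekoutput.txt, clear
gen pajek_id=_n  /*only if the program relies on ordering, eg Mathematica*/
sort pajek_id
merge pajek_id using maindata_keyed.dta
tab _merge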

This is obviously a tricky process and there are a lot of stupid ways it could go wrong. Therefore it is absolutely imperative that you spot-check the results. There are usually some cases where you intuitively know about what the new metric should be. Likewise, you may have another variable native to your Stata dataset that should have a reasonably high (positive or negative) correlation with the new metric imported from Pajek. Check this correlation, as when things should be correlated but ain’t, it often means a merge error.
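
In Stata this check is a one-liner; the variable names here are hypothetical:

*sanity check: these should be decently correlated if the merge worked
pwcorr native_metric pajek_metric, sig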

Note that it’s not necessarily a problem if some cases in your Stata dataset don’t have corresponding entries in your Pajek output, since isolates are often dropped from your Pajek data. However you should know who these isolates are and be able to spot-check that the right people are missing. If you’re doing an interlocking-board study and you see that an investment bank in your Stata data doesn’t appear in your Pajek data, then you probably have a merge error.

July 22, 2009 at 5:02 am 2 comments

Field-tagged data

| Gabriel |

Most of the datasets we deal with are rectangular, in that the variables are always in the same order (whether free or fixed) and the records are delimited with a carriage return. A data format that’s less familiar to us but actually quite common in other applications is the field-tagged format. The classic example is the BibTeX citation database format. Likewise, some of the files in the IMDB are a weird hybrid of rectangular and field-tagged. If data formats were human languages and sentences were data records, rectangular formats would be word-order syntax (like English) and field-tagged formats would be case-marker syntax (like Latin or German). (Yes, I have a bad habit of making overly complicated metaphors that make sense only to me.)

In rectangular formats like field-delimited data (.csv or .tsv) or fixed-width data (.prn) you have one record per row and the same variables in the same order for each row, with the variables either separated by a delimiter (usually comma or tab) or fixed-width, with each variable defined in the data dictionary as columns x-y (which was a really good idea back when we used punch cards, you know, to keep track of our dinosaur herds). In contrast, in a field-tagged format each record spans multiple rows and the first row contains the key that identifies the record. Subsequent rows usually begin with a tab, then a tag that identifies the name of the variable, followed by a delimiter and finally the actual content of the variable for that case. The beginning and end of the record are flagged with special characters. For example, here’s a BibTeX entry:

@book{vogel_entertainment_2007,
	address = {Cambridge},
	edition = {7th ed.},
	title = {Entertainment Industry Economics: A Guide for Financial Analysis},
	isbn = {9780521874854},
	publisher = {Cambridge University Press},
	author = {Harold Vogel},
	year = {2007}
}

The first thought is, why would anyone want to organize data this way? It certainly doesn’t make it easier to load into Stata (and even if it’s less difficult in R, it’s still going to be harder than reading a csv). Basically, people use field-tagged data because it’s more human-readable and human-editable (a lot of people write BibTeX files by hand, although personally I find it easier to let Zotero do it). Not only do you not have to remember what the fifth variable is, but you have more flexibility with things like “comment” fields, which can be any length and have internal carriage returns. This is obviously a nice feature for a citation database, as it means you can keep detailed notes directly in the file. Furthermore, field-tagged formats are good for situations where you have a lot of “missing data.” BibTeX entries can potentially have dozens of fields but most works only require a few of them. For instance the Vogel citation only has seven fields, and most of the other potential fields, things like translator, editor, journal title, series title, etc., are appropriately “missing” because they are simply not applicable to this book. It saves a lot of whitespace in the file just to omit these fields entirely rather than having them in but coded as missing (which is what you’d have to do to format BibTeX as rectangular).

Nonetheless, if you want to get it into Stata, you need to shoehorn it into rectangular format. Perhaps this is all possible to handle with the “infile” command, but last time I tried I couldn’t figure it out. (Comments are welcome if anyone actually knows how to do this.) The very clumsy hack I use for this kind of data is a regular expression search in a text editor that first deletes everything but the record key and the variable I want. I then do another search to convert carriage returns to tabs so that each record ends up on one row. I now have a rectangular dataset with the key and one variable, which I can save and get into Stata. This is a totally insane example, both because I can’t imagine why you’d want citation data in Stata and because there are easier ways to do this (like export filters in citation software), but imagine that you wanted to get “year” (and then “author”) out of a BibTeX file and make it rectangular. For the first step, you would run a pattern like the following through a text editor (or write it into a perl script if you planned on doing it regularly), replacing matches with nothing so as to delete every tagged line except the field you want:

^\t(?!year).+\r
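
The second search then flattens each surviving record onto one row. Assuming the pruned file now consists of alternating “@book{key,” lines and tab-indented “year = {…}” lines followed by the closing brace (as in the Vogel example above), a pair along these lines should work, though you’d want to test it on your own file:

^@[a-z]+\{(.+),\r\tyear = \{(.+)\},?\r\},?\r
\1\t\2\r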

Sometimes this is all you need, but what if you want several variables? Basically: rinse, wash, repeat until you have one file per variable, then merge them in Stata, as in the sketch below. The reason you need a separate file for each variable is that otherwise it’s really easy to get your variables switched around. Because field-tagged formats are so forgiving about having variables in arbitrary orders or missing altogether, if you try to pull several variables into rectangular format at once you’ll get a lot of values in the wrong column.
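
A sketch of that merge step, with hypothetical file and variable names (and assuming you’ve added a header row naming the columns before insheeting):

*one file per variable, merged on the record key
insheet using year.txt, clear
sort key
save year.dta, replace
insheet using author.txt, clear
sort key
merge key using year.dta
tab _merge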

June 10, 2009 at 5:52 am 5 comments

Return of True Tales of The IMDB!

| Gabriel |

Note that this advice is written for IMDB but similar principles apply to other continuously updated large string-based relational databases like allmusic or wikipedia.

I recently had to merge some new variables into the IMDB after not having used the raw data for a while. Unfortunately this is harder than it sounds, since the IMDB frequently updates itself with corrections. For instance, a film called “Umbrella Woman, The (1987)” was called “Good Wife, The (1987)” for its American release. A few years ago the IMDB used the latter title, but it’s since been updated to the former. While this speaks well to the IMDB’s striving for perfection, it also creates a huge hassle for merging. So the short version of my advice is:

If you start using the IMDB, the very first thing you should do is download every file (even the ones you don’t expect to use), back it up to a DVD or external hard drive, and keep that backup very safe. This way you can stick to a single version of the IMDB. The extra couple of days of download time this may take will repay itself many times over by avoiding most of the merge errors introduced by updates if you ever need to add more variables in the future.

Note that this assumes that you’re already using the good (i.e., replicable) data cleaning practices of:

  • Always keep a copy of the raw file.
  • Thoroughly document all changes between the raw file and the clean file, preferably by making all changes through a well-documented Stata do-file or a script written in a language like perl or awk. If you do make any changes in a text editor, try to do so using regular expressions and keep detailed notes of the search patterns. Never just type changes directly into the data.

If you are unfortunate enough to be in the midst of a project using the IMDB and you need to collect more data, you will need to correct the merge errors introduced by the IMDB updates. Since film title is usually the merge key, I’ll concentrate on that, but the same goes for personal names in the various personnel files if you wanted to merge the CVs of people who work in multiple occupations (e.g., Woody Allen has entries in actors.list, writers.list, and directors.list).

First, apply the same changes to the new data as you did to the old data (you did document your changes to the old data, right?). For instance, I like to replace dangerous characters like quote and apostrophe with safe characters like spaces or underscores to keep Stata from interpreting them as something other than literal strings. If I did this to the old stuff, I need to do it to the new stuff too.
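
For example, here’s a sketch of the kind of scrubbing I mean, using the “film” variable from the code below (the compound double quotes let Stata refer to a literal quote character):

*replace dangerous characters with a safe one before merging
replace film=subinstr(film,`"""',"_",.)
replace film=subinstr(film,"'","_",.)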

Second, a very large proportion of the updates to film title involve correcting the release date (which IMDB treats as part of the title) or changing the definite article. You can usually automate this by merging, breaking the title into a string component and a year component, sorting by string then year, and harmonizing to close matches in adjacent years.

*harmonize spelling
*the film titles in the main file and the newly downloaded data differ in two respects
* 1. the old file removed wildcard chars
* 2. some films have new years associated (eg American Ninja IV was either 1990 or 1991)
*create harmonized spelling key
use olddata.dta, clear
contract film
drop _freq
sort film
save olddata_filmlist.dta, replace
use newdata.dta, clear
sort film
merge film using olddata_filmlist
tab _merge
sort film _merge
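*parse out the year, the pre-year title stem (filmx), and a stem stripped of a trailing ", The" (filmx2)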
quietly gen year=.
quietly gen filmx=""
quietly gen filmx2=""
quietly gen filmy=""
quietly replace year=real(regexs(2)) if regexm(film,"(.+) \(([0-9][0-9][0-9][0-9])(/[IVX]+)?\)")
quietly replace filmx=regexs(1) if regexm(film,"(.+) \(([0-9][0-9][0-9][0-9])(/[IVX]+)?\)")
quietly replace filmx2=regexs(1) if regexm(film,"(.+), The")
sort film _merge
gen close=0
gen mergeprobable=0
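*for each row, look up to three rows away in either direction for an unmatched old title (_merge==2) whose stem matches an unmatched new title (_merge==1) in an adjacent year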
foreach n in -3 -2 -1 1 2 3 {
 quietly replace close=.
 quietly replace close=1 if year<=year[_n+`n']+1 & year>=year[_n+`n']-1
 quietly replace mergeprobable=.
 quietly replace mergeprobable=1 if _merge==1 & _merge[_n+`n']==2
 replace filmy=film[_n+`n'] if close==1 & mergeprobable==1 & filmx==filmx[_n+`n']
 replace filmy=film[_n+`n'] if close==1 & mergeprobable==1 & filmx==filmx2[_n+`n'] & filmy=="" /*new, intended to match superfluous "the"*/
}
keep if filmy~=""
quietly compress
keep film filmy
lab var filmy "old data film title"
sort film

*merge back onto the data
merge film using newdata.dta
tab _merge
drop _merge
gen mergename=0
replace film=filmy if filmy~=""
replace mergename=1 if filmy~=""
lab var mergename "dummy indicating title was reverted to old version"
drop filmy
quietly compress
sort film

This code will handle a very large proportion of the merge errors, but it won’t catch all of them; the rest you’ll have to identify manually. The way I like to do this is to first create a file of just the merge errors (both “1” and “2”), use the command “order _merge film”, and export this file to a text editor. If you’ve done anything sensible to your old data, like dropping all the porn, then you’re going to have a lot more values from the new data than the old data. What you then want to do is search for the merge code indicating this. So if you merged as “using olddata” then you want to search for the regular expression “^2\t”, which matches a line starting with the merge code for present-in-using-but-not-master. Then look around and see if there’s an obvious fit, usually some trivial spelling variation. If not, query the IMDB web interface for a hint. Once you figure out what the correction is, do not correct it directly in the data (which both fails to document your changes and makes it impossible to apply the same changes to other files), but write the correction into a do-file like so:

replace film="Good Wife, The (1987)" if film=="Umbrella Woman, The (1987)"
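
And for the earlier step of pulling the merge errors out for inspection, a minimal sketch (the export file name is hypothetical):

*keep only the merge errors, put the key columns up front, and export
keep if _merge==1 | _merge==2
order _merge film
outsheet using mergeerrors.txt, replace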

April 6, 2009 at 11:39 am 2 comments

True Tales of the IMDB!

| Gabriel |

Sometimes the hardest thing is getting the data into Stata. I do some work with the raw IMDB files and these can be hard to get into Stata for all sorts of reasons, the first of which is that they are huge.

This is doubly frustrating because most of the reason the files are so huge is stuff like pornography that I plan to drop from the dataset as soon as possible. No kidding, I traced one of my data problems today to a writing credit for someone named “McNoise” for a film called “Business Ass.” (I’m presuming this is porn as I’d rather not look into the matter further).

The hugeness of these files is compounded by the fact that Stata doesn’t store strings as efficiently as text files do. If you see that a text file is 100 megs, you might foolishly type “set mem 120m” and expect the thing to insheet. In fact it almost certainly will not, because Stata allocates enough memory for every case of a string variable to hold as many characters as the single longest value of that variable. In other words, if 99% of the movies in the IMDB have names that are 20 characters or less but a handful have names that are 244 characters long, then Stata will use as much RAM as if all of them were 244 characters. Thus the Stata memory allocation might have to be three or four times the size of the text file.
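
In practice the brute-force approach looks something like this sketch (the file name is hypothetical, and “set mem” applies to Stata 11 and earlier):

*rule of thumb from above: allocate three or four times the text file's size
set mem 400m
insheet using writers.txt, tab clear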

But even if you somehow had a terabyte of RAM it’s not like you could just type insheet and leave it at that because the files are dirty (and not just because they have so much porn). The most obvious thing is that the tabs don’t match up. The basic organization of the file is like this:

writer1{tab}1st film credit
{tab}{tab}{tab}2nd film credit
...
{tab}{tab}{tab}kth film credit
writer2{tab}1st film credit

This organization means that when you insheet it, the first film credit shows up as v2 but subsequent film credits show up as v4 in different rows. You could fix this in Stata (replace v2=v4 if v2=="") but, remembering what I said about RAM, you really wouldn’t want to. You’re much better off pre-cleaning the data in a good text editor (or, if you plan on doing it routinely, perl). In addition to this systematic thing of first credit versus later credits, there are also idiosyncratic errors. For instance, the rapper 50 Cent has a writing credit for a direct-to-video project called “Before I Self Destruct” and there are two tabs between his name and the credit instead of the usual one tab.

Now here’s the real trick. You insheet your data but half of it’s not there. Note that Stata doesn’t tell you this; you have to check it yourself by using your text editor to see how many rows are in your text file and then typing “desc” in Stata to see whether your n matches. It took me about an hour to realize that the IMDB writers’ file has several hanging quotes (i.e., an odd number of " characters in a string). Because Stata uses " as a string delimiter when you insheet, Stata ignores all the rows in your text file between your first hanging quote and your second hanging quote (and then between your third and fourth, and so on). If I needed the quotes and/or were more patient I’d figure out how to write a regular expression to find hanging quotes and close them (a sketch of such a pattern is below), but because I don’t need them (the IMDB uses quotes for print and tv but not films, and I only care about films) I just turned them all into underscores, which is usually a safe character for Stata to handle.
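
If you did need to keep the quotes, a find pattern along these lines (a sketch in TextWrangler grep syntax, matching any line containing an odd number of quote characters) would at least flag the offenders for hand repair:

^[^"\r]*("[^"\r]*"[^"\r]*)*"[^"\r]*\r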

Anyway, I did the cleaning in TextWrangler so there’s no script per se, but I did keep notes. You could turn these notes into a perl script, but it would only be worth it if you needed to do it several times. The notes show find/replace regular expression patterns. The notes are for the file “writers.list”. Because each IMDB file is formatted slightly differently (yeah, I know, isn’t that great?) you’ll need different code for different files.

\r\r
\r

\)  \(
\)\t\(

\) \(as
\)\t\(as 

}  \(
}\t\(

^\t\t\t
\t

(twice)
\t\t
\t

"
_

the following patterns will save memory but are not necessary. use
each of them as a find pattern to be replaced with nothing. they eliminate
non-theatrical credits but only if they are not the writer's first credit in
the file. the last pattern matches credit for a tv episode.
^\t.+ \([1-2][0-9][0-9][0-9]\) \(TV\).+\r
^\t.+ \([1-2][0-9][0-9][0-9]\) \(VG\).+\r
^\t.+ \([1-2][0-9][0-9][0-9]\) \(V\).+\r
^\t".+" \([1-2][0-9][0-9][0-9]\) {.+\r

March 26, 2009 at 3:29 pm 1 comment

