Posts tagged ‘cleaning’

Oscar Appeal

| Gabriel |

This post contains two Stata do-files for constructing the “Oscar appeal” variable at the center of Rossman & Schilke “Close But No Cigar.”

(more…)

July 29, 2013 at 8:06 am 5 comments

Using filefilter to make insheet happy

| Gabriel |

As someone who mostly works with text files with a lot of strings, I often run into trouble with the Stata insheet command being extremely finicky about how it takes data. Frequently it ends up throwing out half the rows because at some point in the file there’s a stray character and Stata not only throws out that row but everything thereafter. In my recent work on IMDb I’ve gotten into the habit of first reading text files into Excel, then having Stata read the xlsx files. This is tolerable if you’re dealing with a relatively small number of files that you’re only importing once, but it won’t scale to repeated imports or a large number of files.

More recently, I’ve been dealing with data collected with the R library twitteR. Since tweets sometimes contain literal quotes, twitteR escapes them with a backslash. However Stata does not recognize this convention and it chokes on this when the quote characters are unbalanced. I realized this the other night when I was trying to test for left-censorship and fixed it using the batch find/replace in TextWrangler. Of course this is not scriptable and so I was contemplating taking the plunge into Perl when the Modeled Behavior hive mind suggested the Stata command filefilter. Using this command I can replace the escaped literal quotes (which chokes Stata’s insheet) with literal apostrophes (which Stata’s insheet can handle).

filefilter foo.txt foo2.txt, from(\BS\Q) to(\RQ)

Problem solved, natively in Stata, and I have about 25% more observations. Thanks guys.

June 8, 2012 at 1:24 pm 3 comments

Seven-inch heels, natural language processing, and sociology

The following is a guest post from Trey Causey, a long-time reader of codeandculture and a grad student at Washington who does a lot of work with web scraping. We got to discussing a dubious finding and at my request he graciously wrote up his thoughts into a guest post.

| Trey |

Recently, Gabriel pointed me to a piece in Ad Age (and original press release) about IBM researchers correlating the conversations of fashion bloggers with the state of the economy (make sure you file away the accompanying graph for the next time you teach data visualization). Trevor Davis, a “consumer-products expert” with IBM, claimed that as economic conditions improve, the average height of high heels mentioned by these bloggers decreases. Similarly, as economic conditions worsen, the average height would increase. As Gabriel pointed out, these findings seemed to lack any sort of face validity — how likely does it seem that, at any level of economic performance, the average high heel is seven inches tall (even among fashionistas)? I’ll return to the specific problems posed by what I’ll call the “seven-inch heel problem” in a moment, but first some background on the methods that most likely went into this study.

While amusing, if not very credible, the IBM study is part of a growing area (dubbed by some “computational social science”) situated at the intersection of natural language processing and machine learning. By taking advantage of the explosion of available digital text and computing power, researchers in this area are attempting to model structure in and test theories of large-scale social behavior. You’ve no doubt seen some of this work in the media, ranging from “predicting the Arab Spring” to using Twitter to predict GOP primary frontrunners. Many of these works hew towards the style end of the style-substance divide and are not typically motivated by any recognizable theory. However, this is changing as linguists use Twitter to discover regional dialect differences and model the daily cycle of positive and negative emotions.

Much of this work is being met by what I perceive to be reflexive criticism (as in automatic, rather than in the more sociological sense) from within the academy. The Golder and Macy piece in particular received sharp criticism in the comments on orgtheory, labeled variously “empiricism gone awry”, non-representative, and even fallacious (and in which yours truly was labeled “cavalier”). Some of this criticism is warranted, although as is often the case with new methods and data sources, much of the criticism seems rooted in misunderstanding. I suspect part of this is the surprisingly long-lived skepticism of scholarly work on “the internet” which, with the rise of Facebook and Twitter, seems to have been reinvigorated.

However, sociologists are doing themselves a disservice by seeing this work as research on the internet qua internet. Incredible amounts of textual and relational data are out there for the analyzing — and we all know if there’s one thing social scientists love, it’s original data. And these data are not limited to blog posts, status updates, and tweets. Newspapers, legislation, historical archives, and more are rapidly being digitized, providing pristine territory for analysis. Political scientists are warming to the approach, as evidenced by none other than the inimitable Gary King and his own start-up Crimson Hexagon, which performs sentiment analysis on social media using software developed for a piece in AJPS. Political Analysis, the top-ranked journal in political science and the methodological showcase for the discipline, devoted an entire issue in 2008 to the “text-as-data” approach. Additionally, a group of historians and literary scholars have adopted these methods, dubbing the new subfield the “digital humanities.”

Sociologists of culture and diffusion have already warmed to many of these ideas, but the potential for other subfields is significant and largely unrealized. Social movement scholars could find ways to empirically identify frames in wider public discourse. Sociologists of stratification have access to thousands of public- and private-sector reports, the texts of employment legislation, and more to analyze. Race, ethnicity, and immigration researchers can model changing symbolic boundaries across time and space. The real mistake, in my view, is dismissing these methods as an end in and of themselves rather than as a tool for exploring important and interesting sociological questions. Although many of the studies hitting the mass media seem more “proof of concept” than “test of theory,” this is changing; sociologists will not want to be left behind. Below, I will outline the basics of some of these methods and then return to the seven-inch heels problem.

The use of simple scripts or programs to scrape data from the web or Twitter has been featured several times on this blog. The data that I collected for my dissertation were crawled and then scraped from multiple English and Arabic news outlets that post their archives online, including Al Ahram, Al Masry Al Youm, Al Jazeera, and Asharq al Awsat. The actual scrapers are written in Python using the Scrapy framework.

Obtaining the data is the first and least interesting step (to sociologists). Using the scraped data, I am creating chains of topic models (specifically using Latent Dirichlet Allocation) to model latent discursive patterns in the media from the years leading up to the so-called “Arab Spring.” In doing so, I am trying to identify the convergence and divergence in discourse across and within sources to understand how contemporary actors were making sense of their social, political, and economic contexts prior to a major social upheaval. Estimating common knowledge prior to contentious political events is often problematic due to hindsight biases, because of the problems of conducting surveys in non-democracies, and for the obvious reason that we usually don’t know when a major social upheaval is about to happen even if we may know which places may be more susceptible.

Topic modeling is a method that will be look familiar in its generalities to anyone who has seen a cluster analysis. Essentially, topic models use unstructured text — i.e., text without labeled fields from a database or from a forced-choice survey — to model the underlying topical components that make up a document or set of documents. For instance, one modeled topic might be composed of the words “protest”, “revolution”, “dictator”, and “tahrir”. The model attempts to find the words that have the highest probability of being found with one another and with the lowest probability of being found with other words. The generated topics are devoid of meaning, however, without theoretically informed interpretation. This is analogous to survey researchers that perform cluster or factor analyses to find items that “hang together” and then attempt to figure out what the latent construct is that links them.

Collections of documents (a corpus) are usually represented as a document-term matrix, where each row is a document and the columns are all of the words that appear in your set of documents (the vocabulary). The contents of the individual cells are the per-document word frequencies. This produces a very sparse matrix, so some pre-processing is usually performed to reduce the dimensionality. The majority of all documents from any source are filled with words that convey little to no information — prepositions, articles, common adjectives, etc. (see Zipf’s law). Words that appear in every document or in a very small number of documents provide little explanatory power and are usually removed. The texts are often pre-processed using tools such as the Natural Language Toolkit for Python or RTextTools (which is developed in part here at the University of Washington) to remove these words and punctuation. Further, words are often “stemmed” or “lemmatized” so that the number of words with common suffixes and prefixes but with similar meanings is reduced. For example, “run”, “runner”, “running”, and “runs” might all be reduced to “run”.

This approach is known as a “bag-of-words” approach in that the order and context of the words is assumed to be unimportant (obviously, a contentious assumption, but perhaps that is a debate for another blog). Researchers that are uncomfortable with this assumption can use n-grams, groupings of two or more words, rather than single words. However, as the n increases, the number of possible combinations and the accompanying computing power required grows rapidly. You may be familiar with the Google Ngram Viewer. Most of the models are extendable to other languages and are indifferent to the actual content of the text although obviously the researcher needs to be able to read and make sense of the output.

Other methods require different assumptions. If you are interested in parts of speech, a part-of-speech tagger is required, which assumes that the document is fairly coherent and not riddled with typos. Tracking exact or near-exact phrases is difficult as well, as evidenced by the formidable team of computer scientists working on MemeTracker. The number of possible variations on even a short phrase quickly becomes unwieldy and requires substantial computational resources — which brings us back to the seven-inch heels.

Although IBM now develops the oft-maligned SPSS, they also produced Watson. This is why the total lack of validity of fashion blogging results is surprising. If one were seriously going to track the height of heels mentioned and attempt to correlate it with economic conditions, in order to have any confidence that you have captured a non-biased sample of mentions, at least two necessary steps would include:

  • Identifying possible combinations of size metrics and words for heels: seven-inch heels, seven inch heels, seven inch high heels, seven-inch high-heels, seven inch platforms, etc. And so on. This is further complicated by the fact that many text processing algorithms will treat “seven-inch” as one word.
  • Dealing with the problem of punctuational abbreviations for these metrics: 7″ heels, 7″ high heels, 7 and a 1/2 inch heels, etc. Since punctuation is usually stripped out, it would be necessary to leave it in, but then how to distinguish quotation marks that appear as size abbreviations and those that appear in other contexts?
  • Do we include all of these variations with “pumps?” Is there something systematic such as age, location, etc. about individuals that refer to “pumps” rather than “heels?”
  • Are there words or descriptions for heels that I’m not even aware of? Probably.

None of these is an insurmountable problem and I have no doubt that IBM researchers have easy access to substantial computing power. However, each of them requires careful thought prior to and following data collection; the combination of them together quickly complicates matters. Since IBM is unlikely to reveal their methods, though, I have serious doubts as to the validity of their findings.

As any content analyst can tell you, text is a truly unique data source as it is intentional language and is one of the few sources of observational data for which the observation process is totally unobtrusive. In some cases, the authors are no longer alive! Much of the available online text of interest to social scientists was not produced for scholarly inquiry and was not generated from survey responses. However, the sheer volume of the text requires some (but not much!) technical sophistication to acquire and make sense of and, like any other method, these analyses can produce results that are essentially meaningless. Just as your statistics package of choice will output meaningless regression results from just about any data you feed into it, automated and semi-automated text analysis produces its own share of seven-inch heels.

November 21, 2011 at 8:46 am 11 comments

Importnew.ado (requires R)

| Gabriel |

After hearing from two friends in a single day who are still on Stata 10 that they were having trouble opening Stata 12 .dta files, I rewrote my importspss.ado script to translate Stata files into an older format, by default Stata 9.

I’ve tested this with Stata 12 and in theory it should work with older versions, but please post positive or negative results in the comments. Remember that you need to have R installed. Anyway, I would recommend handling the backwards compatibility issue on the sender’s side with the native “saveold” command, but this should work in a pinch if for some reason you can’t impose on the sender to fix it and you need to fix it on the recipient’s end. Be especially careful if the dataset includes formats that Stata’s been updating a lot lately (e.g., the date formats).

The syntax is just:

importnew foo.dta

Here’s the code:

*importnew.ado
*by GHR 10/7/2011
*this script uses R to make new Stata files backwards compatible
* that is, use it when your collaborator forgot to use "saveold"

*use great caution if you are using data formats introduced in recent versions
* eg, %tb

*DEPENDENCY: R and library(foreign)
*if R exists but is not in PATH, change the reference to "R" in line 29 to be the specific location

capture program drop importnew
program define importnew
	set more off
	local future `1'
	local version=9  /* version number for your copy of Stata */ 
	local obsolete=round(runiform()*1000)
	local sourcefile=round(runiform()*1000)
	capture file close rsource
	file open rsource using `sourcefile'.R, write text replace
	file write rsource "library(foreign)" _n
	file write rsource `"directory <- "`c(pwd)'" "' _n
	file write rsource `"future <- "`future'" "' _n
	file write rsource `"obsolete <- paste("`obsolete'",".dta",sep="") "' _n
	file write rsource "setwd(directory)" _n
	file write rsource `"data <- read.dta(future, convert.factors=TRUE, missing.type=FALSE)"' _n
	file write rsource `"write.dta(data, file=obsolete, version=`version')"' _n
	file close rsource
	shell R --vanilla <`sourcefile'.R
	erase `sourcefile'.R
	use `obsolete'.dta, clear
	erase `obsolete'.dta
end
*have a nice day

October 10, 2011 at 11:15 am

Keep the best 5 (updated)

| Gabriel |

Last year I mentioned my policy of assigning about seven quizzes and then keeping the best 5. I then had a real Rube Goldberg-esque workflow that involved piping to Perl. Several people came up with simpler ideas in the comments, but the most “why didn’t I think of that” was definitely John-Paul Ferguson’s suggestions to just use reshape. Now that I’m teaching the class again, I’ve rewritten the script to work on that logic.

Also, I’ve made the script a bit more flexible by allowing it to specify in the header how many quizzes were offered and how many to keep. To make this work I made a loop that builds a local called sumstring.

[UPDATE 11/29/2010, applied Nick Cox’s suggestions. Old code remains but is commented out]

local numberofquizzes 6
local keepbest 5

*import grades, which look like this
*uid    name    mt  q1  q2  q3
*5001   Joe     40  5   4   6
*4228   Alex    20  6   3   5
insheet using grades.txt, clear
*rescale the quizzes from raw points to proportion 
forvalues qnum=1/`numberofquizzes' {
	quietly sum q`qnum'
	replace q`qnum'=q`qnum'/`r(max)'
}
/*
*build the sumstring local (original code)
local sumstring ""
forvalues i=1/`keepbest' {
	local sumstring "`sumstring' + q`i'"
	disp "`sumstring'"
	local sumstring=subinstr("`sumstring'","+","",1)
	disp "`sumstring'"
}
*/
*reshape long, keep top few quizzes
reshape long q, i( notes uid name mt) j(qnum)
recode q .=0
gsort uid -q
by uid: drop if _n>`keepbest'
by uid: replace qnum=_n
*reshape wide, calc average
reshape wide q, i(notes uid name mt) j(qnum)
*build the sumstring local (w/ Nick Cox's suggestions)
unab sumstring : q* 
disp "`sumstring'"
local sumstring : subinstr local sumstring " " "+", all
disp "`sumstring'"
gen q_avg=(`sumstring')/`keepbest'
sort name
sum q_avg

*have a nice day

November 24, 2010 at 4:24 am 4 comments

Weeks

| Gabriel |

I use a lot of data that exists at the daily level (stored as a string like “9/3/2010”), but which I prefer to handle at the weekly level. In part this is about keeping the memory manageable but it also lets me bracket the epiphenomenal issues associated with the weekly work pattern of the music industry (e.g., records drop on Tuesdays). There are a few ways to turn a date into a week in Stata: the official way, the way I used to do it, and the way I do it now.

1. Beginning with Stata 10, there is a data type (%tw) which stores dates by the week, in contrast to %td (formerly just “%d”) which stores it as a date. This is good for some purposes but it gets messy if you’re trying to look at things that cross calendar years since the last week of the year can have a funny number of days. (Another reason I don’t use these functions is simply that I’ve been working on these scripts since Stata 9). However if you want to do it, it looks like this:

gen fp_w=wofd(date( firstplayed,"MDY"))
format fp_w %tw

2a. The way I prefer to do it is to store data as %td, but to force it to count by sevens, so effectively the %td data really stands for “the week ending in this date” or “this date give or take a few days.” Until a few days ago, I’d do this by dividing by seven and forcing it to be an integer, then multiplying again.

gen fpdate=date(firstplayed,"MDY")
gen int fp1=fpdate/7
gen fp_w=fp1*7
format fp_w %td
drop fp1

This is really ugly code and I’m not proud of it. First, note that this is a lot of code considering how little it does. I could have done this more efficiently by using the “mod(x,y)” function to subtract the remainder. Second, this only works if you’re interested in rounding off to the closest Friday and not some other day of the week.

2b. My new approach still stores as “%td” but is both more flexible and slightly simpler. In particular, it lets me define “week ending on X” where X is any day of the week I choose, here specified as the local `dow’ so I can define the end of the week once in the header and have it apply throughout several places that I do something like this. Note that Stata treats Sunday as 0, Monday as 1, etc. What I do is subtract the actual day of the week, then add the target day of the week so it substantively means “this thing happened on or a few days before this date.”

gen fp_w=date(firstplayed,"MDY")-dow(date(firstplayed,"MDY"))+`dow'
format fp_w %td

September 3, 2010 at 4:32 am 2 comments

Getting long flat-files out of field-tagged data

| Gabriel |

Some field-tagged data can be pretty unproblematically reshaped into flat-files. However one of the reasons people like field-tagged data is that they can have internal long structures and this creates problems for reshaping. For instance in Web of Science, the “CR” field (works that are being cited by the main work) usually has dozens of references separated by carriage returns. To get this into most statistic packages it has to be reshaped into a long flat-file. In other words, you need a script to turn this:

PT J
AU LEVINSON, RM
TI TEACHING LABELING THEORY - 4 EXPERIENCES IN ILLNESS ATTRIBUTION
SO TEACHING SOCIOLOGY
LA English
DT Article
C1 EMORY UNIV,ATLANTA,GA 30322.
CR BECKER HS, 1963, OUTSIDERS
   MENDEL WM, 1969, ARCH GEN PSYCHIAT, V20, P321
   ROSENHAN DL, 1973, SCIENCE, V179, P250
   SCHEFF TJ, 1964, SOC PROBL, V11, P401
   SCHEFF TJ, 1966, BEING MENTALLY ILL
NR 5
TC 5
PU AMER SOCIOLOGICAL ASSOC
PI WASHINGTON
PA 1722 N ST NW, WASHINGTON, DC 20036-2981
SN 0092-055X
J9 TEACH SOCIOL
JI Teach. Sociol.
PY 1975
VL 2
IS 2
BP 207
EP 211
PG 5
SC Education & Educational Research; Sociology
GA AD732
UT ISI:A1975AD73200007
ER

Into this:

ISI:A1975AD73200007	TEACH SOCIOL	BECKER HS, 1963, OUTSIDERS
ISI:A1975AD73200007	TEACH SOCIOL	MENDEL WM, 1969, ARCH GEN PSYCHIAT, V20, P321
ISI:A1975AD73200007	TEACH SOCIOL	ROSENHAN DL, 1973, SCIENCE, V179, P250
ISI:A1975AD73200007	TEACH SOCIOL	SCHEFF TJ, 1964, SOC PROBL, V11, P401
ISI:A1975AD73200007	TEACH SOCIOL	SCHEFF TJ, 1966, BEING MENTALLY ILL

Note that the other fields usually lack carriage returns or other internal long delimiters so they can be cleaned like this. The two approaches can then be merged (in a flat-file like Stata) or linked (in a relational like R or Access) using the key.

I’ve actually already done this kind of thing twice, with my code for cleaning Memetracker and the IMDb business file. However those two datasets had the convenient property that the record key appears in the first row of the record. With data structures like these, you just remember the key then every time you come across a long entry, write it out along with the key.

Unfortunately, Web of Science has the record key appear towards the end of the record, a data structure that I propose to call “a huge pain in the ass.” This means that you have to collect all the long values in an array, then record the record key, then loop over the array to write out.

#!/usr/bin/perl
#wos2tab_cr.pl by ghr
#this script converts field-tagged WOS queries to tab-delimited text
#it extracts the CR field and attaches to it the fields UT and J9
#unlike wos2tab, this file outputs long-formatted data
#the two types of output can be merged with the UT field

#sice CR comes /before/ UT and J9, must save as array, then loop over array at ER

use warnings; use strict;
die "usage: wos2tab_cr.pl <wos data>\n" unless @ARGV==1;

my $rawdata = shift(@ARGV);

my $ut = "" ; #unique article identifier
my $j9 = "" ; #j9 coding of journal title
my $cr = "" ; #cited work
my @crlist = () ; #list of cited works
my $cr_continued = 0 ; #flag for recently hitting "^CR"

print "starting to read $rawdata\n";

open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">$rawdata.long") or die "error creating $rawdata.long\n";
print OUT "ut\tj9\tcr\n";
while (<IN>) {
	# if begins with non-whitespace character other than CR, flag cr_continued as not
	if($_ =~ m/^\S/) {
		if($_ =~ m/^[^CR]/) {$cr_continued = 0}
	}
	if($_ =~ m/^J9/) {
		$j9 = $_;
		$j9 =~ s/\015?\012//; #manual chomp
		$j9 =~ s/^J9 //; #drop leading tag
	}
	if($_ =~ m/^UT/) {
		$ut = $_;
		$ut =~ s/\015?\012//; #manual chomp
		$ut =~ s/^UT //; #drop leading tag
	}
	
	#first line of a CR field 
	if($_ =~ m/^CR/) {
		$cr = $_;
		$cr =~ s/\015?\012//; #manual chomp
		$cr =~ s/^CR //; #drop leading tag
		$cr_continued = 1 ; #flag to allow for multi-line CR field
		push @crlist, $cr; #add the new cite to the list
	}
	#subsequent lines of a CR field
	if($_ =~ m/^  /) {
		if($cr_continued==1) {
			$cr = $_ ;
			$cr =~ s/\015?\012//; #manual chomp
			$cr =~ s/^   //; #drop leading tag
			push @crlist, $cr; #add the new cite to the list
		}
	}
	
	#when "end record" code is reached, loop over array to write out as long file, then clear memory
	if($_=~ /^ER/) {
		#loop over CRLIST array, printing for each value so as to have long file
		for (my $i = 0; $i < @crlist; $i++) {
			print OUT "$ut\t$j9\t$crlist[$i]\n"; #write out
		}
		#clear memory, just do once per "end record"
		$j9 = "" ;
		$ut = "" ;
		$cr = "" ;
		$cr_continued = 0 ;
		@crlist = () ;
	}
}
close IN;
close OUT;
print "done writing $rawdata.long \n";

August 12, 2010 at 4:08 am 2 comments

Using R to parse (a lot of) HTML tables

| Gabriel |

For a few months I’ve been doing a daily scrape of a website but I’ve put off actually parsing the data until a colleague was dealing with a similar problem, and solving his problem reminded me of my problem. The scrape creates a folder named after the date with several dozen html files in it. So basically, the data is stored like this:

project/raw/
  20100601/
    page1.htm
    page2.htm
  20100601/
    page1.htm
    page2.htm

Each html file has one main table along with a couple of sidebar tables. For each html file, I want to extract the main table and write it to a text file. These text files will be put in a “clean” directory that mirrors the “raw” directory.

This is the kind of thing most people would do in Perl (or Python). I had trouble getting the Perl HTML libraries to load although I probably could have coded it from scratch since HTML table structure is pretty simple (push the contents of <td> tags to an array, then write it out and clear the memory when you hit a </tr> tag). In any case, I ended up using R’s XML library, which is funny because usually I clean data in Perl or Stata and use R only as a last resort. Nonetheless, in what is undoubtedly a sign of the end times, here I am using R for cleaning. Forty years of darkness; The dead rising from the grave; Cats and dogs living together; Mass hysteria!

Anyway, the first step is to get a list of the directories in “raw” and use that to seed the top level loop. (Though note that R’s XML library can also read data directly off the web). Within this loop I create a clean subdirectory to mirror the raw subdirectory. I then get a list of every file in the raw subdirectory and seed the lower level loop. The lower level loop reads each file with “readHTMLTable” and writes it out to the mirroring clean subdirectory. Then I come out of both loops and don’t really care if the top is still spinning.

# File-Name:       websiteclean.R
# Date:            2010-07-28
# Author:          Gabriel Rossman
# Purpose:         parse the scraped files
# Packages Used:   xml

timestamp()
library(XML)
parentpath<-"~/Documents/project"
rawdir<-paste(parentpath,"/raw",sep="")
setwd(rawdir)
dirlist <- list.files()
for (dir in dirlist) {
	setwd(rawdir)
	setwd(dir)
	filenames <- list.files()
	cleandir<-paste(parentpath,'/clean/',dir, sep="") #create ../../clean/`dir' and call `cleandir'
	shellcommand<-paste("mkdir ",cleandir, sep="")
	system(shellcommand)
	print(cleandir) #progress report
	for (targetfile in filenames) {
		setwd(rawdir)
		setwd(dir)
		datafromtarget = readHTMLTable(targetfile, header=FALSE)
		outputfile<-paste(targetfile,'.txt', sep="")
		setwd(cleandir)
		write.table(datafromtarget[1], file = outputfile , sep = "\t", quote=TRUE)  #when writing out, limit to subobject 1 to avoid the sidebar tables
	}
}

# have a nice day

July 29, 2010 at 4:58 am 2 comments

imdb_personnel.pl

| Gabriel |

As previously remarked, IMDb files have a weird structure that ain’t exactly ready to rock. I already posted a file for dealing with business.list (which could also be modified to work with files like certificates.list). The personnel files (actors.list, actresses.list, directors.list, writers.list, etc) look like this:

Gilligan, Vince		2-Face (2013)  (screenplay)
			A.M.P.E.D. (2007) (TV)  (writer)
			Hancock (2008)  (written by)  <1,2,1>
			Home Fries (1998)  (written by)  <1,1,1>
			The X Files: Revelations (2008) (V)  (written by) (segment "Bad Blood")  <8,1,1>
			The X Files: Revelations (2008) (V)  (written by) (segment "Memento Mori")  <6,1,3>
			Wilder Napalm (1993)  (written by)  <1,1,1>
			"Breaking Bad" (2008)  (creator)
			"Breaking Bad" (2008) {(#3.12)}  (creator)  <1,1,1>
			"Breaking Bad" (2008) {(#3.13)}  (creator)  <1,1,1>

Whereas we’re used to data that looks like this:

Gilligan, Vince	2-Face (2013)  (screenplay)
Gilligan, Vince	A.M.P.E.D. (2007) (TV)  (writer)
Gilligan, Vince	Hancock (2008)  (written by)  <1,2,1>
Gilligan, Vince	Home Fries (1998)  (written by)  <1,1,1>
Gilligan, Vince	The X Files: Revelations (2008) (V)  (written by) (segment "Bad Blood")  <8,1,1>
Gilligan, Vince	The X Files: Revelations (2008) (V)  (written by) (segment "Memento Mori")  <6,1,3>
Gilligan, Vince	Wilder Napalm (1993)  (written by)  <1,1,1>
Gilligan, Vince	"Breaking Bad" (2008)  (creator)
Gilligan, Vince	"Breaking Bad" (2008) {(#3.12)}  (creator)  <1,1,1>

Of course that’s still not complete since ideally you want to parse the title of the work (eg “Breaking Bad” (2008) ) from details of the artist’s contribution to the work (eg (creator) ). Likewise, depending on what your analysis is about you might want to drop certain kinds of works entirely. (I usually drop the porn, television, and direct to video ASAP). However you can do all that from within Stata (assuming memory isn’t an issue, which it might be) and this script will suffice to get you that far:

#!/usr/bin/perl
#imdb_personnel.pl by ghr
#this script cleans IMDB personnel files (eg, writers.list)
#works best if you delete the header (about the first 300 lines)
#raw data is organized by artist with
# "ARTIST\t\tCREDIT" for the first credit (though sometimes w a single tab) and
# subsequent records are "\t\t\tCREDIT"
#this script makes all rows "ARTIST\tCREDIT" and drops blank rows
#the resulting file is about 20% larger than the original but has a simpler structure that is easier for other programs (eg Stata) to read
#further cleaning would parse the "CREDIT" field but the contents of "CREDIT" 
#vary by personnel file
#in all files "CREDIT" begins with "FILM TITLE (YEAR)" but has further info
# eg, writers.list distinguishes screenplay vs story, etc and actors.list gives character name, etc

use warnings; use strict;
die "usage: imdb_personnel.pl <IMDB personnel file>\n" unless @ARGV==1;
my $rawdata = shift(@ARGV);

# if $_ matches leading non-tab, redefine the "artist" variable
# if $_ matches 3 leading tabs, drop two tabs and add current "artist"
my $artist ;
open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">$rawdata.tsv") or die "error creating $rawdata.tsv\n";
print OUT "artist\tcredit\n";
while (<IN>) {
	#match beginning of artist's credits by looking for lines NOT beginning with a tab
	if($_=~ /^[^\t].+\t.+/) {
		$artist = $_; 
		$artist =~ s/\015?\012//; #manual chomp
		$artist =~ s/\t.+$//; #drop the tab(s) and all else after it 
		$_ =~ s/\t\t/\t/; #go from two tabs to one
		print OUT "$_";
	}
	#match subsequent credits (three leading tabs)
	if ($_ =~ m/^\t\t\t/) {
		$_ =~ s/^\t\t\t//; #drop leading tabs
		print OUT "$artist\t$_";
	}
	#when matching blank line, clear "artist"
	if ($_ =~ m/^$/) {
		$artist = "";
	}
}
close IN;
close OUT;
print "\ndone\n";
#have a nice day

July 26, 2010 at 4:13 am 2 comments

wos2tab.pl

| Gabriel |

One of my grad students is doing some citation network analysis, for which the Python script (and .exe wrapper) wos2pajek is very well-suited. (Since most network packages can read “.net” this is a good idea even if you’re not using Pajek).

However the student is also interested in node level attributes, not just the network. Unfortunately WOS queries are field-tagged which is kind of a pain to work with and the grad student horrified me by expressing the willingness to spend weeks reshaping the data by hand in Excel. (Even in grad school your time is a lot more valuable than that). To get the data into tab-delimited text, I modified an earlier script I wrote for parsing field-tagged IMDb files (in my case business.list but most of the film-level IMDb files are structured similarly). The basic approach is to read a file line-by-line and match its contents by field-tag, saving the contents in a variable named after the tag. Then when you get to the new record delimiter (in this case, a blank line), dump the contents to disk and wipe the variables. Note that since the “CR” (cited reference) field has internal carriage returns it would require a little doing to integrate into this script, which is one of the reasons you’re better off relying on wos2pajek for that functionality.

#!/usr/bin/perl
#wos2tab.pl by ghr
#this script converts field-tagged Web Of Science queries to tab-delimited text
#for creating a network from the "CR" field, see wos2pajek
#note, you can use the info extracted by this script to replicate a wos2pajek key and thus merge

use warnings; use strict;
die "usage: wos2tab.pl <wos data>\n" unless @ARGV==1;

my $rawdata = shift(@ARGV);

my $au ; #author
my $ti ; #title
my $py ; #year
my $j9 ; #j9 coding of journal title
my $dt ; #document type

# to extract another field, work it in along the lines of the existing vars
# each var must be
# 1. declared with a "my statement" (eg, lines 12-16)
# 2. added to the header with the "print OUT" statement (ie, line 29)
# 3. written into a search and store loop following an "if" statement (eg, lines 37-41)
# 4. inside the blank line match loop (ie, lines 59-66)
#  4a. add to the print statement (ie, line 60)
#  4b. add a clear statement (eg, lines 61-65)

open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">$rawdata.tsv") or die "error creating $rawdata.tsv\n";
print OUT "au\tdt\tpy\tti\tj9\n";
while (<IN>) {
	if($_ =~ m/^AU/) {
		$au = $_;
		$au =~ s/\015?\012//; #manual chomp
		$au =~ s/^AU //; #drop leading tag
		$au =~ s/,//; #drop comma -- author only
	}
	if($_ =~ m/^DT/) {
		$dt = $_;
		$dt =~ s/\015?\012//; #manual chomp
		$dt =~ s/^DT //; #drop leading tag
	}
	if($_ =~ m/^TI/) {
		$ti = $_;
		$ti =~ s/\015?\012//; #manual chomp
		$ti =~ s/^TI //; #drop leading tag
	}
	if($_ =~ m/^J9/) {
		$j9 = $_;
		$j9 =~ s/\015?\012//; #manual chomp
		$j9 =~ s/^J9 //; #drop leading tag
	}
	if($_ =~ m/^PY/) {
		$py = $_;
		$py =~ s/\015?\012//; #manual chomp
		$py =~ s/^PY //; #drop leading tag
	}
	
	#when blank line is reached, write out and clear memory 
	if($_=~ /^$/) {
		print OUT "$au\t$dt\t$py\t$ti\t$j9\n";
		$au = "" ;
		$dt = "" ;
		$ti = "" ;
		$py = "" ;
		$j9 = "" ;
	}
}
close IN;
close OUT;
print "\ndone\n";

July 19, 2010 at 2:13 pm 6 comments

Older Posts


The Culture Geeks