Archive for July, 2010

Using R to parse (a lot of) HTML tables

| Gabriel |

For a few months I’ve been doing a daily scrape of a website, but I’d put off actually parsing the data until a colleague ran into a similar problem, and solving his problem reminded me of mine. The scrape creates a folder named after the date with several dozen html files in it. So basically, the data is stored like this:

project/raw/
  20100601/
    page1.htm
    page2.htm
  20100602/
    page1.htm
    page2.htm

Each html file has one main table along with a couple of sidebar tables. For each html file, I want to extract the main table and write it to a text file. These text files will be put in a “clean” directory that mirrors the “raw” directory.

This is the kind of thing most people would do in Perl (or Python). I had trouble getting the Perl HTML libraries to load, although I probably could have coded it from scratch since HTML table structure is pretty simple (push the contents of <td> tags to an array, then write it out and clear the memory when you hit a </tr> tag). In any case, I ended up using R’s XML library, which is funny because usually I clean data in Perl or Stata and use R only as a last resort. Nonetheless, in what is undoubtedly a sign of the end times, here I am using R for cleaning. Forty years of darkness; The dead rising from the grave; Cats and dogs living together; Mass hysteria!
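
For the curious, a minimal sketch of that from-scratch approach might look something like this in Perl (a toy, not a robust parser: it assumes simple markup with no nested tables and no <td>…</td> pairs split across lines, and “table.htm” is a hypothetical input file):

#!/usr/bin/perl
#minimal sketch of the from-scratch table parsing described above
#assumes no nested tables and no <td>...</td> pairs split across lines
use warnings; use strict;

my @row;
open(IN, "<table.htm") or die "error opening table.htm for reading\n";
while (<IN>) {
	#push the contents of each <td> tag to an array
	while ($_ =~ m{<td[^>]*>(.*?)</td>}gi) {
		push @row, $1;
	}
	#write the row out and clear memory when you hit a </tr> tag
	if ($_ =~ m{</tr>}i) {
		print join("\t", @row), "\n";
		@row = ();
	}
}
close IN;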

Anyway, the first step is to get a list of the directories in “raw” and use that to seed the top level loop. (Though note that R’s XML library can also read data directly off the web). Within this loop I create a clean subdirectory to mirror the raw subdirectory. I then get a list of every file in the raw subdirectory and seed the lower level loop. The lower level loop reads each file with “readHTMLTable” and writes it out to the mirroring clean subdirectory. Then I come out of both loops and don’t really care if the top is still spinning.

# File-Name:       websiteclean.R
# Date:            2010-07-28
# Author:          Gabriel Rossman
# Purpose:         parse the scraped files
# Packages Used:   XML

timestamp()
library(XML)
parentpath<-"~/Documents/project"
rawdir<-paste(parentpath,"/raw",sep="")
setwd(rawdir)
dirlist <- list.files()
for (dir in dirlist) {
	setwd(rawdir)
	setwd(dir)
	filenames <- list.files()
	cleandir<-paste(parentpath,'/clean/',dir, sep="") #create ../../clean/`dir' and call it `cleandir'
	dir.create(cleandir) #more portable than shelling out to mkdir
	print(cleandir) #progress report
	for (targetfile in filenames) {
		setwd(rawdir)
		setwd(dir)
		datafromtarget <- readHTMLTable(targetfile, header=FALSE)
		outputfile<-paste(targetfile,'.txt', sep="")
		setwd(cleandir)
		write.table(datafromtarget[[1]], file = outputfile, sep = "\t", quote = TRUE) #when writing out, extract table 1 (the main table) to avoid the sidebar tables
	}
}

# have a nice day

July 29, 2010 at 4:58 am

imdb_personnel.pl

| Gabriel |

As previously remarked, IMDb files have a weird structure that ain’t exactly ready to rock. I already posted a file for dealing with business.list (which could also be modified to work with files like certificates.list). The personnel files (actors.list, actresses.list, directors.list, writers.list, etc) look like this:

Gilligan, Vince		2-Face (2013)  (screenplay)
			A.M.P.E.D. (2007) (TV)  (writer)
			Hancock (2008)  (written by)  <1,2,1>
			Home Fries (1998)  (written by)  <1,1,1>
			The X Files: Revelations (2008) (V)  (written by) (segment "Bad Blood")  <8,1,1>
			The X Files: Revelations (2008) (V)  (written by) (segment "Memento Mori")  <6,1,3>
			Wilder Napalm (1993)  (written by)  <1,1,1>
			"Breaking Bad" (2008)  (creator)
			"Breaking Bad" (2008) {(#3.12)}  (creator)  <1,1,1>
			"Breaking Bad" (2008) {(#3.13)}  (creator)  <1,1,1>

Whereas we’re used to data that looks like this:

Gilligan, Vince	2-Face (2013)  (screenplay)
Gilligan, Vince	A.M.P.E.D. (2007) (TV)  (writer)
Gilligan, Vince	Hancock (2008)  (written by)  <1,2,1>
Gilligan, Vince	Home Fries (1998)  (written by)  <1,1,1>
Gilligan, Vince	The X Files: Revelations (2008) (V)  (written by) (segment "Bad Blood")  <8,1,1>
Gilligan, Vince	The X Files: Revelations (2008) (V)  (written by) (segment "Memento Mori")  <6,1,3>
Gilligan, Vince	Wilder Napalm (1993)  (written by)  <1,1,1>
Gilligan, Vince	"Breaking Bad" (2008)  (creator)
Gilligan, Vince	"Breaking Bad" (2008) {(#3.12)}  (creator)  <1,1,1>
Gilligan, Vince	"Breaking Bad" (2008) {(#3.13)}  (creator)  <1,1,1>

Of course that’s still not complete since ideally you want to parse the title of the work (eg “Breaking Bad” (2008)) from the details of the artist’s contribution to the work (eg (creator)). Likewise, depending on what your analysis is about you might want to drop certain kinds of works entirely. (I usually drop the porn, television, and direct-to-video ASAP). However, you can do all that from within Stata (assuming memory isn’t an issue, which it might be) and this script will suffice to get you that far:

#!/usr/bin/perl
#imdb_personnel.pl by ghr
#this script cleans IMDB personnel files (eg, writers.list)
#works best if you delete the header (about the first 300 lines)
#raw data is organized by artist with
# "ARTIST\t\tCREDIT" for the first credit (though sometimes w a single tab) and
# subsequent records are "\t\t\tCREDIT"
#this script makes all rows "ARTIST\tCREDIT" and drops blank rows
#the resulting file is about 20% larger than the original but has a simpler structure that is easier for other programs (eg Stata) to read
#further cleaning would parse the "CREDIT" field but the contents of "CREDIT" 
#vary by personnel file
#in all files "CREDIT" begins with "FILM TITLE (YEAR)" but has further info
# eg, writers.list distinguishes screenplay vs story, etc and actors.list gives character name, etc

use warnings; use strict;
die "usage: imdb_personnel.pl <IMDB personnel file>\n" unless @ARGV==1;
my $rawdata = shift(@ARGV);

# if $_ matches leading non-tab, redefine the "artist" variable
# if $_ matches 3 leading tabs, drop the tabs and prepend the current "artist"
my $artist ;
open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">$rawdata.tsv") or die "error creating $rawdata.tsv\n";
print OUT "artist\tcredit\n";
while (<IN>) {
	#match beginning of artist's credits by looking for lines NOT beginning with a tab
	if($_=~ /^[^\t].+\t.+/) {
		$artist = $_; 
		$artist =~ s/\015?\012//; #manual chomp
		$artist =~ s/\t.+$//; #drop the tab(s) and all else after it 
		$_ =~ s/\t\t/\t/; #go from two tabs to one
		print OUT "$_";
	}
	#match subsequent credits (three leading tabs)
	if ($_ =~ m/^\t\t\t/) {
		$_ =~ s/^\t\t\t//; #drop leading tabs
		print OUT "$artist\t$_";
	}
	#when matching blank line, clear "artist"
	if ($_ =~ m/^$/) {
		$artist = "";
	}
}
close IN;
close OUT;
print "\ndone\n";
#have a nice day
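
As the comments at the top note, further cleaning would parse the "CREDIT" field itself. As a rough illustration of what that might look like (this is a hypothetical sketch, not part of the script above; the regex is loose and real IMDb data has edge cases, such as unknown years and episode markers, that would need more care):

#!/usr/bin/perl
#hypothetical sketch: split "CREDIT" into work title vs contribution details
#reads toy rows in the two-column format produced by imdb_personnel.pl
use warnings; use strict;

while (my $line = <DATA>) {
	chomp $line;
	my ($artist, $credit) = split /\t/, $line, 2;
	#the title runs through "(YEAR)" plus an optional (TV)/(V)/(VG) flag;
	#whatever is left over is detail on the artist's contribution
	if ($credit =~ m/^(.+?\(\d{4}\)(?:\s*\((?:TV|V|VG)\))?)\s*(.*)$/) {
		print "$artist\t$1\t$2\n";
	}
}
__DATA__
Gilligan, Vince	Hancock (2008)  (written by)  <1,2,1>
Gilligan, Vince	The X Files: Revelations (2008) (V)  (written by) (segment "Bad Blood")  <8,1,1>
Gilligan, Vince	"Breaking Bad" (2008)  (creator)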

July 26, 2010 at 4:13 am

wos2tab.pl

| Gabriel |

One of my grad students is doing some citation network analysis, for which the Python script (and .exe wrapper) wos2pajek is very well-suited. (Since most network packages can read “.net”, this is a good idea even if you’re not using Pajek).

However, the student is also interested in node-level attributes, not just the network. Unfortunately, WOS queries are field-tagged, which is kind of a pain to work with, and the grad student horrified me by expressing the willingness to spend weeks reshaping the data by hand in Excel. (Even in grad school your time is a lot more valuable than that). To get the data into tab-delimited text, I modified an earlier script I wrote for parsing field-tagged IMDb files (in my case business.list, but most of the film-level IMDb files are structured similarly). The basic approach is to read a file line-by-line and match its contents by field-tag, saving the contents in a variable named after the tag. Then when you get to the new record delimiter (in this case, a blank line), dump the contents to disk and wipe the variables. Note that since the “CR” (cited reference) field has internal carriage returns, it would take a little doing to integrate it into this script, which is one of the reasons you’re better off relying on wos2pajek for that functionality.
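
To make the structure concrete, a field-tagged WOS record looks roughly like this (the values below are mocked up, and real exports carry more tags than the five the script keeps; the blank line between records is the delimiter the script keys on):

AU Rossman, G
TI A Hypothetical Article Title
J9 AM SOCIOL REV
DT Article
PY 2010

AU Other, A
TI Another Hypothetical Title
J9 SOC FORCES
DT Article
PY 2009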

#!/usr/bin/perl
#wos2tab.pl by ghr
#this script converts field-tagged Web Of Science queries to tab-delimited text
#for creating a network from the "CR" field, see wos2pajek
#note, you can use the info extracted by this script to replicate a wos2pajek key and thus merge

use warnings; use strict;
die "usage: wos2tab.pl <wos data>\n" unless @ARGV==1;

my $rawdata = shift(@ARGV);

my $au ; #author
my $ti ; #title
my $py ; #year
my $j9 ; #j9 coding of journal title
my $dt ; #document type

# to extract another field, work it in along the lines of the existing vars
# each var must be
# 1. declared with a "my" statement (eg, lines 12-16)
# 2. added to the header with the "print OUT" statement (ie, line 29)
# 3. written into a search and store loop following an "if" statement (eg, lines 37-41)
# 4. inside the blank line match loop (ie, lines 59-66)
#  4a. add to the print statement (ie, line 60)
#  4b. add a clear statement (eg, lines 61-65)

open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">$rawdata.tsv") or die "error creating $rawdata.tsv\n";
print OUT "au\tdt\tpy\tti\tj9\n";
while (<IN>) {
	if($_ =~ m/^AU/) {
		$au = $_;
		$au =~ s/\015?\012//; #manual chomp
		$au =~ s/^AU //; #drop leading tag
		$au =~ s/,//; #drop comma -- author only
	}
	if($_ =~ m/^DT/) {
		$dt = $_;
		$dt =~ s/\015?\012//; #manual chomp
		$dt =~ s/^DT //; #drop leading tag
	}
	if($_ =~ m/^TI/) {
		$ti = $_;
		$ti =~ s/\015?\012//; #manual chomp
		$ti =~ s/^TI //; #drop leading tag
	}
	if($_ =~ m/^J9/) {
		$j9 = $_;
		$j9 =~ s/\015?\012//; #manual chomp
		$j9 =~ s/^J9 //; #drop leading tag
	}
	if($_ =~ m/^PY/) {
		$py = $_;
		$py =~ s/\015?\012//; #manual chomp
		$py =~ s/^PY //; #drop leading tag
	}
	
	#when blank line is reached, write out and clear memory 
	if($_=~ /^$/) {
		print OUT "$au\t$dt\t$py\t$ti\t$j9\n";
		$au = "" ;
		$dt = "" ;
		$ti = "" ;
		$py = "" ;
		$j9 = "" ;
	}
}
close IN;
close OUT;
print "\ndone\n";

July 19, 2010 at 2:13 pm

An approximate rant

| Gabriel |

<rant>

In response to the First Amendment case US v Stevens, Eugene Volokh comments on the majority’s statement that hunting videos are “several orders of magnitude” more popular than crush fetish videos. [This is important because the law at issue was motivated by fetish videos, which count as obscenity, but was drafted by Congress and used by prosecutors to also apply to non-obscene content]. Volokh presents the results of a recent reader poll showing that about a third of lawyers don’t know what “order of magnitude” means, and on this basis he counsels that it might have been better for SCOTUS to say “more than a thousand times greater.” Given that a lot of lawyers don’t understand the phrase, Volokh is probably right for that audience. [It’s worth noting that Volokh himself is hardly innumerate, having earned a BA in math and computer science at the age of 15].

Similarly, the ASA Style Guide says that “order of magnitude” is too wordy and that it’s better to use the word “ten.” I’m generally sensitive to the problem of wordiness (for instance, see Mark Twain’s fisking of Cooper); however, the problem with the ASA’s advice is that “order of magnitude” does not mean “ten” but something more like “about ten.”

OK, so why not just say “about ten”? Because people will ignore the “about.” In a comment thread discussion I noted that even people who know better tend to reify point estimates. For instance, the advertising market is sensitive to measured changes in audience size that are well within the margin of error for audience surveys (Napoli 2003). What I like about the phrase “order of magnitude” is that, unlike “ten” or even “about ten,” the term makes it impossible to ignore the error bars.

The reason I agree with Volokh as regards lawyers but disagree with the ASA as regards sociology is that lawyers can be forgiven for having priorities other than appreciating the uncertainty inherent in numbers. That appreciation should be a core issue for people who aspire to be scientists, and if it isn’t, then why the fuck do we bother making statistics a requirement at both the undergrad and grad levels?

</rant>

July 5, 2010 at 2:41 pm

