Posts tagged ‘cleaning’

Using R to parse (a lot of) HTML tables

| Gabriel |

For a few months I've been doing a daily scrape of a website, but I had put off actually parsing the data until a colleague ran into a similar problem and solving his reminded me of mine. The scrape creates a folder named after the date with several dozen HTML files in it. So basically, the data is stored like this:

project/raw/
  20100601/
    page1.htm
    page2.htm
  20100602/
    page1.htm
    page2.htm

Each html file has one main table along with a couple of sidebar tables. For each html file, I want to extract the main table and write it to a text file. These text files will be put in a “clean” directory that mirrors the “raw” directory.

This is the kind of thing most people would do in Perl (or Python). I had trouble getting the Perl HTML libraries to load, though I probably could have coded it from scratch since HTML table structure is pretty simple (push the contents of each <td> tag to an array, then write the row out and clear the memory when you hit a </tr> tag). In any case, I ended up using R's XML library, which is funny because usually I clean data in Perl or Stata and use R only as a last resort. Nonetheless, in what is undoubtedly a sign of the end times, here I am using R for data cleaning. Forty years of darkness; the dead rising from the grave; cats and dogs living together; mass hysteria!
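
For what it's worth, a bare-bones version of that from-scratch approach might look something like the following sketch (hypothetical and untested against my files; it assumes a simple, well-formed table with no nested tables):

#!/usr/bin/perl
#hypothetical sketch of the from-scratch table parse described above
#reads an html file and prints one tab-delimited line per table row
use warnings; use strict;
my $html = do { local $/; <> }; #slurp the whole file
while ($html =~ m{<tr[^>]*>(.*?)</tr>}gis) { #grab each row
	my $row = $1;
	my @cells;
	while ($row =~ m{<td[^>]*>(.*?)</td>}gis) { #push the contents of each <td> to the array
		my $cell = $1;
		$cell =~ s/<[^>]+>//g; #strip any markup nested inside the cell
		push @cells, $cell;
	}
	print join("\t", @cells), "\n"; #write the row out; the array resets on the next pass
}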

Anyway, the first step is to get a list of the directories in “raw” and use that to seed the top level loop. (Though note that R’s XML library can also read data directly off the web). Within this loop I create a clean subdirectory to mirror the raw subdirectory. I then get a list of every file in the raw subdirectory and seed the lower level loop. The lower level loop reads each file with “readHTMLTable” and writes it out to the mirroring clean subdirectory. Then I come out of both loops and don’t really care if the top is still spinning.

# File-Name:       websiteclean.R
# Date:            2010-07-28
# Author:          Gabriel Rossman
# Purpose:         parse the scraped files
# Packages Used:   XML

timestamp()
library(XML)
parentpath<-"~/Documents/project"
rawdir<-paste(parentpath,"/raw",sep="")
setwd(rawdir)
dirlist <- list.files()
for (dir in dirlist) {
	setwd(rawdir)
	setwd(dir)
	filenames <- list.files()
	cleandir<-paste(parentpath,'/clean/',dir, sep="") #create ../../clean/`dir' and call `cleandir'
	shellcommand<-paste("mkdir ",cleandir, sep="") #note, dir.create(cleandir) would be the more portable way to do this
	system(shellcommand)
	print(cleandir) #progress report
	for (targetfile in filenames) {
		setwd(rawdir)
		setwd(dir)
		datafromtarget <- readHTMLTable(targetfile, header=FALSE) #returns a list with one element per table in the page
		outputfile<-paste(targetfile,'.txt', sep="")
		setwd(cleandir)
		write.table(datafromtarget[1], file = outputfile , sep = "\t", quote=TRUE)  #when writing out, limit to subobject 1 to avoid the sidebar tables
	}
}

# have a nice day

July 29, 2010 at 4:58 am

imdb_personnel.pl

| Gabriel |

As previously remarked, IMDb files have a weird structure that ain't exactly ready to rock. I already posted a file for dealing with business.list (which could also be modified to work with files like certificates.list). The personnel files (actors.list, actresses.list, directors.list, writers.list, etc.) look like this:

Gilligan, Vince		2-Face (2013)  (screenplay)
			A.M.P.E.D. (2007) (TV)  (writer)
			Hancock (2008)  (written by)  <1,2,1>
			Home Fries (1998)  (written by)  <1,1,1>
			The X Files: Revelations (2008) (V)  (written by) (segment "Bad Blood")  <8,1,1>
			The X Files: Revelations (2008) (V)  (written by) (segment "Memento Mori")  <6,1,3>
			Wilder Napalm (1993)  (written by)  <1,1,1>
			"Breaking Bad" (2008)  (creator)
			"Breaking Bad" (2008) {(#3.12)}  (creator)  <1,1,1>
			"Breaking Bad" (2008) {(#3.13)}  (creator)  <1,1,1>

Whereas we’re used to data that looks like this:

Gilligan, Vince	2-Face (2013)  (screenplay)
Gilligan, Vince	A.M.P.E.D. (2007) (TV)  (writer)
Gilligan, Vince	Hancock (2008)  (written by)  <1,2,1>
Gilligan, Vince	Home Fries (1998)  (written by)  <1,1,1>
Gilligan, Vince	The X Files: Revelations (2008) (V)  (written by) (segment "Bad Blood")  <8,1,1>
Gilligan, Vince	The X Files: Revelations (2008) (V)  (written by) (segment "Memento Mori")  <6,1,3>
Gilligan, Vince	Wilder Napalm (1993)  (written by)  <1,1,1>
Gilligan, Vince	"Breaking Bad" (2008)  (creator)
Gilligan, Vince	"Breaking Bad" (2008) {(#3.12)}  (creator)  <1,1,1>
Gilligan, Vince	"Breaking Bad" (2008) {(#3.13)}  (creator)  <1,1,1>

Of course that's still not complete since ideally you want to parse the title of the work (eg, "Breaking Bad" (2008)) from the details of the artist's contribution to the work (eg, (creator)). Likewise, depending on what your analysis is about, you might want to drop certain kinds of works entirely. (I usually drop the porn, television, and direct-to-video ASAP.) However, you can do all that from within Stata (assuming memory isn't an issue, which it might be), and this script will suffice to get you that far (see the sketch after the script for one way you might start on the credit parsing):

#!/usr/bin/perl
#imdb_personnel.pl by ghr
#this script cleans IMDB personnel files (eg, writers.list)
#works best if you delete the header (about the first 300 lines)
#raw data is organized by artist with
# "ARTIST\t\tCREDIT" for the first credit (though sometimes w a single tab) and
# subsequent records are "\t\t\tCREDIT"
#this script makes all rows "ARTIST\tCREDIT" and drops blank rows
#the resulting file is about 20% larger than the original but has a simpler structure that is easier for other programs (eg Stata) to read
#further cleaning would parse the "CREDIT" field but the contents of "CREDIT" 
#vary by personnel file
#in all files "CREDIT" begins with "FILM TITLE (YEAR)" but has further info
# eg, writers.list distinguishes screenplay vs story, etc and actors.list gives character name, etc

use warnings; use strict;
die "usage: imdb_personnel.pl <IMDB personnel file>\n" unless @ARGV==1;
my $rawdata = shift(@ARGV);

# if $_ matches leading non-tab, redefine the "artist" variable
# if $_ matches 3 leading tabs, drop them and prepend the current "artist" plus a tab
my $artist ;
open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">$rawdata.tsv") or die "error creating $rawdata.tsv\n";
print OUT "artist\tcredit\n";
while (<IN>) {
	#match beginning of artist's credits by looking for lines NOT beginning with a tab
	if($_=~ /^[^\t].+\t.+/) {
		$artist = $_; 
		$artist =~ s/\015?\012//; #manual chomp
		$artist =~ s/\t.+$//; #drop the tab(s) and all else after it 
		$_ =~ s/\t\t/\t/; #go from two tabs to one
		print OUT "$_";
	}
	#match subsequent credits (three leading tabs)
	if ($_ =~ m/^\t\t\t/) {
		$_ =~ s/^\t\t\t//; #drop leading tabs
		print OUT "$artist\t$_";
	}
	#when matching blank line, clear "artist"
	if ($_ =~ m/^$/) {
		$artist = "";
	}
}
close IN;
close OUT;
print "\ndone\n";
#have a nice day
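
Picking up the thread from above, if you did want to push the credit parsing a step further before handing things off to Stata, a hypothetical follow-up pass might split the CREDIT field on the year and flag the TV and direct-to-video releases, roughly like this (a sketch I haven't tested against the real .list files):

#!/usr/bin/perl
#hypothetical sketch, separate from the script above: split a CREDIT string
#into the work's title and the details of the contribution, and flag droppable works
use warnings; use strict;
my $credit = 'The X Files: Revelations (2008) (V)  (written by) (segment "Bad Blood")  <8,1,1>';
if ($credit =~ m/^(.+?\(\d{4}\S*\))\s*(.*)$/) { #title runs through the year, details follow
	my ($title, $details) = ($1, $2);
	my $tv    = ($credit =~ m/^"|\(TV\)/) ? 1 : 0; #series titles are quoted, TV movies get "(TV)"
	my $video = ($credit =~ m/\(V\)/) ? 1 : 0;     #direct to video gets "(V)"
	print "$title\t$details\ttv=$tv\tvideo=$video\n";
}

From there, dropping whole categories of works is just a matter of filtering on those flags.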

July 26, 2010 at 4:13 am

wos2tab.pl

| Gabriel |

One of my grad students is doing some citation network analysis, for which the Python script (and .exe wrapper) wos2pajek is very well-suited. (Since most network packages can read ".net" files, this is a good idea even if you're not using Pajek.)

However, the student is also interested in node-level attributes, not just the network. Unfortunately, Web of Science (WOS) queries are field-tagged, which is kind of a pain to work with, and the grad student horrified me by expressing a willingness to spend weeks reshaping the data by hand in Excel. (Even in grad school your time is a lot more valuable than that.) To get the data into tab-delimited text, I modified an earlier script I wrote for parsing field-tagged IMDb files (in my case business.list, but most of the film-level IMDb files are structured similarly). The basic approach is to read the file line by line and match its contents by field tag, saving the contents in a variable named after the tag. Then when you get to the new-record delimiter (in this case, a blank line), dump the contents to disk and wipe the variables. Note that since the "CR" (cited reference) field has internal carriage returns, it would take a little doing to integrate it into this script, which is one of the reasons you're better off relying on wos2pajek for that functionality.

#!/usr/bin/perl
#wos2tab.pl by ghr
#this script converts field-tagged Web of Science queries to tab-delimited text
#for creating a network from the "CR" field, see wos2pajek
#note, you can use the info extracted by this script to replicate a wos2pajek key and thus merge

use warnings; use strict;
die "usage: wos2tab.pl <wos data>\n" unless @ARGV==1;

my $rawdata = shift(@ARGV);

my $au ; #author
my $ti ; #title
my $py ; #year
my $j9 ; #j9 coding of journal title
my $dt ; #document type

# to extract another field, work it in along the lines of the existing vars
# each var must be
# 1. declared with a "my" statement (like the declarations above)
# 2. added to the header in the first "print OUT" statement
# 3. given its own search-and-store "if" block inside the while loop
# 4. handled in the blank-line match at the end of the loop
#  4a. added to that block's "print OUT" statement
#  4b. given its own clear statement

open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">$rawdata.tsv") or die "error creating $rawdata.tsv\n";
print OUT "au\tdt\tpy\tti\tj9\n";
while (<IN>) {
	if($_ =~ m/^AU/) {
		$au = $_;
		$au =~ s/\015?\012//; #manual chomp
		$au =~ s/^AU //; #drop leading tag
		$au =~ s/,//; #drop comma -- author only
	}
	if($_ =~ m/^DT/) {
		$dt = $_;
		$dt =~ s/\015?\012//; #manual chomp
		$dt =~ s/^DT //; #drop leading tag
	}
	if($_ =~ m/^TI/) {
		$ti = $_;
		$ti =~ s/\015?\012//; #manual chomp
		$ti =~ s/^TI //; #drop leading tag
	}
	if($_ =~ m/^J9/) {
		$j9 = $_;
		$j9 =~ s/\015?\012//; #manual chomp
		$j9 =~ s/^J9 //; #drop leading tag
	}
	if($_ =~ m/^PY/) {
		$py = $_;
		$py =~ s/\015?\012//; #manual chomp
		$py =~ s/^PY //; #drop leading tag
	}
	
	#when blank line is reached, write out and clear memory 
	if($_=~ /^$/) {
		print OUT "$au\t$dt\t$py\t$ti\t$j9\n";
		$au = "" ;
		$dt = "" ;
		$ti = "" ;
		$py = "" ;
		$j9 = "" ;
	}
}
close IN;
close OUT;
print "\ndone\n";

July 19, 2010 at 2:13 pm

importspss.ado (requires R)

| Gabriel |

Mike Gruszczynski has a post up pointing out that you can use R to translate files, for instance from SPSS to Stata. I like this a lot because it lets you avoid using SPSS, but I'd like it even better if it let you avoid using R as well.

As such, I rewrote the script to work entirely from Stata. Mike wanted to do this in Bash but couldn't figure out how to pass arguments from the shell to R. Frankly, I don't know how to do this either, which is why my solution is to have Stata write and execute an R source file so that all the argument passing occurs within Stata. This follows my general philosophy of doing a lot of code mise en place in a user-friendly language so I can spend as little time as necessary in R. (Note that you could just as easily write this in Bash, but I figured this way you can a) make it cross-platform and b) attach it to "use" for a one-stop-shop "import" command.)

*importspss.ado
*by GHR 6/29/2010
*this script uses R to translate SPSS to Stata
*it takes as arguments the SPSS file and Stata file
*adapted from http://mikegruz.tumblr.com/post/704966440/convert-spss-to-stata-without-stat-transfer 

*DEPENDENCY: R and library(foreign) 
*if R exists but is not in PATH, change the reference to "R" in the shell command below to the specific location

capture program drop importspss
program define importspss
	set more off
	local spssfile `1'
	if "`2'"=="" {
		local statafile "`spssfile'.dta"
	}
	else {
		local statafile `2'	
	}
	local sourcefile=round(runiform()*1000)
	capture file close rsource
	file open rsource using `sourcefile'.R, write text replace
	file write rsource "library(foreign)" _n
	file write rsource `"data <- read.spss("`spssfile'", to.data.frame=TRUE)"' _n
	file write rsource `"write.dta(data, file="`statafile'")"' _n
	file close rsource
	shell R --vanilla <`sourcefile'.R
	erase `sourcefile'.R
	use `statafile', clear
end
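
Once it's saved somewhere on your ado-path, usage would look something like this (mydata.sav and mydata.dta are hypothetical file names):

importspss mydata.sav mydata.dta

Omitting the second argument just names the Stata file after the SPSS file (here, mydata.sav.dta).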

June 29, 2010 at 3:01 pm

Grepmerge

| Gabriel |

Over at the Orgtheory mothership, Fabio asked how to do a partial string match in Stata, specifically to see if certain keywords appear in scientific abstracts. This turns out to be hard, not because there are no appropriate functions in Stata (both strmatch() and regexm() can do it) but because Stata can only handle 244 characters in a string variable. Many of the kinds of data we’d want to do content analysis on are much bigger than this. For instance, scientific abstracts are about 2000 characters and news stories are about 10000 characters.

OW suggested using SPSS, and her advice is well-taken as she’s a master at ginormous content analysis projects. Andrew Perrin suggested using Perl and this is closer to my own sympathies. I agree that Perl is generally a good idea for content analysis, but in this case I think a simple grep will suffice.

grep "searchterm" filein.csv | cut -d "," -f 1 > fileout.csv

The way this works is you start with a csv file called filein.csv (or whatever) where the record ID key is in the first column. You do a grep search for "searchterm" in that file and pipe the output to the "cut" command. The -d "," option tells cut that the stream is comma-delimited and the -f 1 option tells it to keep only the first field (which is your unique record ID). The "> fileout.csv" part redirects the output to disk. (Note that in Unix ">" as a file operator means replace and ">>" means append.) You then have a text file called fileout.csv that's just a list of records where your search term appears. You can merge this into Stata and treat _merge==3 as meaning that the case includes the search term.

You can also wrap the whole thing in a Stata command that takes as arguments (in order): the term to search for, the file to look for it in, the name of the key variable in the master data, and (optionally) the name of the new variable that indicates a match. However, for some reason the Stata-wrapped version only works with literal strings and not regexp searches. Also note that all this is for Mac/Linux; you might be able to get it to work on Windows with Cygwin or PowerShell.

capture program drop grepmerge
program define grepmerge
	local searchterm	"`1'"
	local fileread	"`2'"
	local key "`3'"
	if "`4'"=="" {
		local newvar "`1'"
	}
	else {
		local newvar "`4'"
	}
	tempfile filewrite
	shell grep "`searchterm'" `fileread' | cut -d "," -f 1 > `filewrite'
	tempvar sortorder
	gen `sortorder'=[_n]
	tempfile masterdata
	save `masterdata'
	insheet using `filewrite', clear
	ren v1 `key'
	merge 1:1 `key' using `masterdata', gen(`newvar')
	sort `sortorder'
	recode `newvar' 1=.a 2=0 3=1
	notes `newvar' : "`searchterm'" appears in this case
	lab val `newvar'
end
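
With the master dataset already in memory, usage would look something like this (the search term, file, and variable names are hypothetical):

grepmerge nanotube abstracts.csv articleid nanoflag

This greps abstracts.csv for "nanotube", merges the matching record IDs back in on articleid, and leaves a dummy called nanoflag coded 1 if the abstract mentions the term and 0 if it doesn't.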

April 29, 2010 at 12:45 pm

Using grep (or mdfind) to reshape data

| Gabriel |

Sometimes you have cross-class data that's arranged the opposite of how you want. For instance, suppose I have a bunch of files organized by song, and I'm interested in finding all the song files that mention a particular radio station, say KIIS-FM. I can run the following command, which finds all the song files in my song directory (or its subdirectories) and puts the names of these files in a text file called "kiis.txt":

grep -l -r 'KIIS' ~/Documents/book/stata/rawsongs/ > kiis.txt

Of course to run it from within Stata I can prefix it with “shell”. By extension, I could then write a program around this shell command that will let me query station data from my song files (or vice versa). You could do something similar to see what news stories saved from Lexis-Nexis or scraped web pages contain a certain keyword.

Unfortunately grep is pretty slow, but you can do it faster by accessing your desktop search index. It's basically the difference between reading a book looking for a reference versus looking the reference up in the book's index. This is especially important if you're searching over a lot of data: grep is fine for a few dozen files, but you want indexed search if you're looking over thousands of files, let alone your whole file system. On a Mac, you can access your Spotlight index from shell scripts (or the Terminal) with "mdfind". The syntax is a little different from grep, so the example above should be rewritten as

mdfind -onlyin ~/Documents/book/stata/rawsongs/ "KIIS" > kiis.txt

While grep is slower than mdfind, it's also more flexible. Fortunately (as described here), you can get the best of both worlds by doing a broad search with mdfind and then piping the results to grep for more refined work.
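
For instance, something like this (a sketch I haven't run against my own files, and the xargs step assumes none of the file names contain spaces):

mdfind -onlyin ~/Documents/book/stata/rawsongs/ "KIIS" | xargs grep -l "KIIS-FM" > kiis.txt

Here mdfind does the fast indexed pass for anything mentioning KIIS and grep -l then narrows the list to the files matching the more specific string.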

April 7, 2010 at 5:13 am

Mail.app and server logs

| Gabriel |

I recently started scraping a website using curl and cron (for earlier thoughts on this see here). Because I don't leave my Mac turned on at 2am, I'm hosting the scrape on one of the UCLA servers. I get a daily log of the scrape by email, but I know myself well enough to know that I'll get bored with reading the logs after a few days.

As such, I added a "rule" to Mail.app that looks for error messages. When curl fails for any reason, the standard error message is "Warning: Failed to create the file …" Using the "rules" tab of the Mail.app preferences, I told Mail to turn any message red if it has my server's log boilerplate in the subject line and contains a curl error message anywhere in the text. Now when I open my email in the morning and see a black message I know everything is fine, whereas a red message (or no message at all) means there's a problem.

April 2, 2010 at 5:10 am


