Archive for July, 2010
Using R to parse (a lot of) HTML tables
| Gabriel |
For a few months I’ve been doing a daily scrape of a website, but I’d put off actually parsing the data until a colleague ran into a similar problem, and solving his problem reminded me of my problem. The scrape creates a folder named after the date with several dozen html files in it. So basically, the data is stored like this:
project/raw/
    20100601/
        page1.htm
        page2.htm
    20100602/
        page1.htm
        page2.htm
Each html file has one main table along with a couple of sidebar tables. For each html file, I want to extract the main table and write it to a text file. These text files will be put in a “clean” directory that mirrors the “raw” directory.
This is the kind of thing most people would do in Perl (or Python). I had trouble getting the Perl HTML libraries to load, although I probably could have coded it from scratch since HTML table structure is pretty simple (push the contents of <td> tags to an array, then write the row out and clear the memory when you hit a </tr> tag; see the sketch below). In any case, I ended up using R’s XML library, which is funny because usually I clean data in Perl or Stata and use R only as a last resort. Nonetheless, in what is undoubtedly a sign of the end times, here I am using R for cleaning. Forty years of darkness; The dead rising from the grave; Cats and dogs living together; Mass hysteria!
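For the curious, here’s a minimal sketch of that from-scratch approach in Perl (a hypothetical illustration, not the script I actually used; it assumes simple, well-formed tables with no nesting and no cells spanning lines):

#!/usr/bin/perl
# hypothetical sketch: push <td> contents to an array,
# then write the row out and clear memory at each </tr>
# assumes simple well-formed HTML; real pages may need a proper parser
use warnings;
use strict;

my @row;
while (<>) {
	push @row, $1 while m{<td[^>]*>(.*?)</td>}gi;  # grab each cell on this line
	if (m{</tr>}i) {                               # end of row: dump and clear
		print join("\t", @row), "\n";
		@row = ();
	}
}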
Anyway, the first step is to get a list of the directories in “raw” and use that to seed the top level loop. (Though note that R’s XML library can also read data directly off the web). Within this loop I create a clean subdirectory to mirror the raw subdirectory. I then get a list of every file in the raw subdirectory and seed the lower level loop. The lower level loop reads each file with “readHTMLTable” and writes it out to the mirroring clean subdirectory. Then I come out of both loops and don’t really care if the top is still spinning.
# File-Name: websiteclean.R
# Date: 2010-07-28
# Author: Gabriel Rossman
# Purpose: parse the scraped files
# Packages Used: XML
timestamp()
library(XML)
parentpath <- "~/Documents/project"
rawdir <- paste(parentpath, "/raw", sep="")
setwd(rawdir)
dirlist <- list.files()
for (dir in dirlist) {
	setwd(rawdir)
	setwd(dir)
	filenames <- list.files()
	#create ../../clean/`dir' and call it `cleandir'
	cleandir <- paste(parentpath, '/clean/', dir, sep="")
	shellcommand <- paste("mkdir ", cleandir, sep="")
	system(shellcommand)
	print(cleandir) #progress report
	for (targetfile in filenames) {
		setwd(rawdir)
		setwd(dir)
		datafromtarget <- readHTMLTable(targetfile, header=FALSE)
		outputfile <- paste(targetfile, '.txt', sep="")
		setwd(cleandir)
		#when writing out, limit to subobject 1 to avoid the sidebar tables
		write.table(datafromtarget[1], file=outputfile, sep="\t", quote=TRUE)
	}
}
# have a nice day
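One design note on the script: shelling out to mkdir works fine on a Mac, but base R’s dir.create() would accomplish the same thing without the system() call and would be more portable across operating systems.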
Diffusion Simulation [updated]
| Gabriel |
A few days ago, I posted a diffusion model written in NetLogo. I’ve now uploaded a copy to the NetLogo Community Models, which means you can run it in your browser.
imdb_personnel.pl
| Gabriel |
As previously remarked, IMDb files have a weird structure that ain’t exactly ready to rock. I already posted a file for dealing with business.list (which could also be modified to work with files like certificates.list). The personnel files (actors.list, actresses.list, directors.list, writers.list, etc) look like this:
Gilligan, Vince         2-Face (2013)  (screenplay)
                        A.M.P.E.D. (2007) (TV)  (writer)
                        Hancock (2008)  (written by)  <1,2,1>
                        Home Fries (1998)  (written by)  <1,1,1>
                        The X Files: Revelations (2008) (V)  (written by) (segment "Bad Blood")  <8,1,1>
                        The X Files: Revelations (2008) (V)  (written by) (segment "Memento Mori")  <6,1,3>
                        Wilder Napalm (1993)  (written by)  <1,1,1>
                        "Breaking Bad" (2008)  (creator)
                        "Breaking Bad" (2008) {(#3.12)}  (creator)  <1,1,1>
                        "Breaking Bad" (2008) {(#3.13)}  (creator)  <1,1,1>
Whereas we’re used to data that looks like this:
Gilligan, Vince   2-Face (2013)  (screenplay)
Gilligan, Vince   A.M.P.E.D. (2007) (TV)  (writer)
Gilligan, Vince   Hancock (2008)  (written by)  <1,2,1>
Gilligan, Vince   Home Fries (1998)  (written by)  <1,1,1>
Gilligan, Vince   The X Files: Revelations (2008) (V)  (written by) (segment "Bad Blood")  <8,1,1>
Gilligan, Vince   The X Files: Revelations (2008) (V)  (written by) (segment "Memento Mori")  <6,1,3>
Gilligan, Vince   Wilder Napalm (1993)  (written by)  <1,1,1>
Gilligan, Vince   "Breaking Bad" (2008)  (creator)
Gilligan, Vince   "Breaking Bad" (2008) {(#3.12)}  (creator)  <1,1,1>
Of course that’s still not complete since ideally you want to parse the title of the work (eg, “Breaking Bad” (2008)) from the details of the artist’s contribution to the work (eg, (creator)). Likewise, depending on what your analysis is about you might want to drop certain kinds of works entirely. (I usually drop the porn, television, and direct-to-video stuff ASAP.) However, you can do all that from within Stata (assuming memory isn’t an issue, which it might be) and this script will suffice to get you that far:
#!/usr/bin/perl
#imdb_personnel.pl by ghr
#this script cleans IMDB personnel files (eg, writers.list)
#works best if you delete the header (about the first 300 lines)
#raw data is organized by artist with
# "ARTIST\t\tCREDIT" for the first credit (though sometimes w a single tab) and
# subsequent records are "\t\t\tCREDIT"
#this script makes all rows "ARTIST\tCREDIT" and drops blank rows
#the resulting file is about 20% larger than the original but has a simpler
# structure that is easier for other programs (eg Stata) to read
#further cleaning would parse the "CREDIT" field but the contents of "CREDIT"
# vary by personnel file
#in all files "CREDIT" begins with "FILM TITLE (YEAR)" but has further info
# eg, writers.list distinguishes screenplay vs story, etc and actors.list gives character name, etc
use warnings;
use strict;
die "usage: imdb_personnel.pl <IMDB personnel file>\n" unless @ARGV == 1;
my $rawdata = shift(@ARGV);

# if $_ matches leading non-tab, redefine the "artist" variable
# if $_ matches 3 leading tabs, drop two tabs and add current "artist"
my $artist;
open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">$rawdata.tsv") or die "error creating $rawdata.tsv\n";
print OUT "artist\tcredit\n";
while (<IN>) {
	#match beginning of artist's credits by looking for lines NOT beginning with a tab
	if ($_ =~ /^[^\t].+\t.+/) {
		$artist = $_;
		$artist =~ s/\015?\012//;  #manual chomp
		$artist =~ s/\t.+$//;      #drop the tab(s) and all else after it
		$_ =~ s/\t\t/\t/;          #go from two tabs to one
		print OUT "$_";
	}
	#match subsequent credits (three leading tabs)
	if ($_ =~ m/^\t\t\t/) {
		$_ =~ s/^\t\t\t//;         #drop leading tabs
		print OUT "$artist\t$_";
	}
	#when matching a blank line, clear "artist"
	if ($_ =~ m/^$/) {
		$artist = "";
	}
}
close IN;
close OUT;
print "\ndone\n";
#have a nice day
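Usage is just perl imdb_personnel.pl writers.list, which writes writers.list.tsv alongside the original. And if you do want to push the parsing further before handing off to Stata, here’s a hypothetical sketch of splitting a CREDIT field into title, year, and role (the quirks vary by personnel file, so treat it as a starting point rather than a finished parser):

#!/usr/bin/perl
# hypothetical sketch: split a CREDIT field into title, year, and role
# the exact format varies by personnel file, so this is only a starting point
use warnings;
use strict;

my $credit = '"Breaking Bad" (2008) {(#3.12)}  (creator)  <1,1,1>';
if ($credit =~ m/^(.+?)\s+\((\d{4})\)/) {
	my ($title, $year) = ($1, $2);  # title and year lead the field
	# role is the last parenthetical, possibly followed by a <#,#,#> billing code
	my ($role) = $credit =~ m/\(([^()]+)\)\s*(?:<[\d,]+>)?\s*$/;
	print "title: $title\nyear: $year\nrole: ", (defined $role ? $role : "?"), "\n";
}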
Diffusion Simulation (now with Network Structure)
| Gabriel |
In an earlier post I posted a simple NetLogo implementation of the Bass model. Of course the whole point of NetLogo is that it makes it easy to build in network structure. As such I’ve revised the model to also allow a network contagion effect (with a preferential attachment network structure). Here’s a screenshot of the model, mid-run. [Update: click on the screenshot to go to a live Java version of the simulation hosted by the NetLogo Community Models].
I still haven’t figured out how to embed a live NetLogo sim in a post, so to run it you’ll need to download the code below and save it as a text file with the “.nlogo” extension.
(more…)
wos2tab.pl
| Gabriel |
One of my grad students is doing some citation network analysis, for which the Python script (and .exe wrapper) wos2pajek is very well-suited. (Since most network packages can read “.net” this is a good idea even if you’re not using Pajek).
However, the student is also interested in node-level attributes, not just the network. Unfortunately WOS queries are field-tagged, which is kind of a pain to work with, and the grad student horrified me by expressing the willingness to spend weeks reshaping the data by hand in Excel. (Even in grad school your time is a lot more valuable than that). To get the data into tab-delimited text, I modified an earlier script I wrote for parsing field-tagged IMDb files (in my case business.list, but most of the film-level IMDb files are structured similarly). The basic approach is to read the file line by line and match its contents by field-tag, saving the contents in a variable named after the tag. Then when you get to the new-record delimiter (in this case, a blank line), dump the contents to disk and wipe the variables. Note that since the “CR” (cited reference) field has internal carriage returns it would take a little doing to integrate it into this script, which is one of the reasons you’re better off relying on wos2pajek for that functionality.
#!/usr/bin/perl
#wos2tab.pl by ghr
#this script converts field-tagged Web Of Science queries to tab-delimited text
#for creating a network from the "CR" field, see wos2pajek
#note, you can use the info extracted by this script to replicate a wos2pajek key and thus merge
use warnings;
use strict;
die "usage: wos2tab.pl <wos data>\n" unless @ARGV == 1;
my $rawdata = shift(@ARGV);

my $au; #author
my $ti; #title
my $py; #year
my $j9; #j9 coding of journal title
my $dt; #document type

# to extract another field, work it in along the lines of the existing vars
# each var must be
# 1. declared with a "my" statement (like the five declarations above)
# 2. added to the header in the "print OUT" statement below
# 3. given its own search-and-store "if" block inside the while loop
# 4. handled in the blank-line match at the bottom:
#  4a. added to the print statement
#  4b. given a clear statement
open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">$rawdata.tsv") or die "error creating $rawdata.tsv\n";
print OUT "au\tdt\tpy\tti\tj9\n";
while (<IN>) {
	if ($_ =~ m/^AU/) {
		$au = $_;
		$au =~ s/\015?\012//; #manual chomp
		$au =~ s/^AU //;      #drop leading tag
		$au =~ s/,//;         #drop comma -- author only
	}
	if ($_ =~ m/^DT/) {
		$dt = $_;
		$dt =~ s/\015?\012//; #manual chomp
		$dt =~ s/^DT //;      #drop leading tag
	}
	if ($_ =~ m/^TI/) {
		$ti = $_;
		$ti =~ s/\015?\012//; #manual chomp
		$ti =~ s/^TI //;      #drop leading tag
	}
	if ($_ =~ m/^J9/) {
		$j9 = $_;
		$j9 =~ s/\015?\012//; #manual chomp
		$j9 =~ s/^J9 //;      #drop leading tag
	}
	if ($_ =~ m/^PY/) {
		$py = $_;
		$py =~ s/\015?\012//; #manual chomp
		$py =~ s/^PY //;      #drop leading tag
	}
	#when blank line is reached, write out and clear memory
	if ($_ =~ /^$/) {
		print OUT "$au\t$dt\t$py\t$ti\t$j9\n";
		$au = "";
		$dt = "";
		$ti = "";
		$py = "";
		$j9 = "";
	}
}
close IN;
close OUT;
print "\ndone\n";
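Usage mirrors the IMDb script: perl wos2tab.pl myquery.txt (the filename is just a placeholder) writes myquery.txt.tsv, with one row per record and columns for au, dt, py, ti, and j9.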
NetLogo
| Gabriel |
As part of my prep work for attending James Kitts’ “Modeling Emergence” workshop at ASA I’ve been playing with NetLogo and there’s some really cool stuff in there. The basic “models library” includes biology models like wolf-sheep as well as some social science models that I already lecture on but didn’t have a dynamic illustration for, most notably Barabasi’s preferential attachment model and the team assembly model Uzzi et al developed on Broadway musicals.
It’s also pretty easy to learn. After just a few hours of reading the tutorial, I was able to write the script below, which is kind of a replication of an older Stata script. The really exciting thing though is that NetLogo makes it much easier to do simulations with network and/or spatial structure so I could refine the model to have structure. (Actually the model library already has this in the form of the “virus on a network” model but the point is that it’s flexible). So while this sim doesn’t do anything new it’s nice that I can extend it to do new things.
Note that in theory you can embed a NetLogo simulation in a webpage, but I don’t know how to get WordPress to do this. Even doing it locally on my Mac, the embed doesn’t work in Safari (though it does work in Chrome, Firefox, and Opera). In lieu of embedding the whole simulation, here’s the “procedures” source code, which will only work if you hook the variables “constant-hazard” and “endogenous-hazard” and the plot-pen “Totals:adoptions” to GUI elements.
to setup
  clear-all
  setup-turtles
end

to setup-turtles
  crt 1000
  ask turtles [ set color white ]
  ask turtles [ setxy random-xcor random-ycor ]
end

to decide-adopt
  let nadopt count turtles with [color = red]
  ask turtles [
    if color = white [
      if random 100 <= constant-hazard + ( (nadopt / 1000) * endogenous-hazard )
        [ set color red ]
      ;; adopt if random number (0-100) is lower than
      ;; constant hazard + proportion adopted * endogenous hazard
    ]
  ]
end

to go
  if ticks >= 100 [stop]
  decide-adopt
  tick
  do-plots
end

to do-plots
  set-current-plot "Totals"
  set-current-plot-pen "adoptions"
  plot count turtles with [color = red]
end
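For what it’s worth, the decide-adopt rule is just a discrete-time Bass hazard. Writing $F_t$ for the proportion of turtles that have adopted by tick $t$, each remaining non-adopter’s per-tick adoption probability is roughly

$h_t = p + q F_t$

where $p$ corresponds to the constant-hazard slider and $q$ to the endogenous-hazard slider (both expressed in percentage points, since the code compares against random 100).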
Update: I realized that a .nlogo file is just a text file, so I can put the whole thing in the post, including the GUI elements. To use it, just save what’s below as a text file with the “.nlogo” extension. Not quite as nice as embedding, but it’ll work. Code below the jump.
An approximate rant
| Gabriel |
<rant>
In response to the First Amendment case US v Stevens, Eugene Volokh comments on the majority’s statement that hunting videos are “several orders of magnitude” more popular than crush fetish videos. [This is important because the law at issue was motivated by fetish videos, which are obscenity, but was drafted by Congress and used by prosecutors to also apply to non-obscene content]. Volokh presents the results of a reader poll he did recently showing that about a third of lawyers don’t know what “order of magnitude” means, and on this basis he counsels that it might have been better for SCOTUS to say “more than a thousand times greater.” Given that a lot of lawyers don’t understand the phrase, Volokh is probably right for that audience. [It’s worth noting that Volokh himself is hardly innumerate, having earned a BA in math and computer science at the age of 15].
Similarly, the ASA Style Guide says that “order of magnitude” is too wordy and that it’s better to use the word “ten.” I’m generally sensitive to the problem of wordiness (for instance, see Mark Twain’s fisking of Cooper); however, the problem with the ASA’s advice is that “order of magnitude” does not mean “ten” but something more like “about ten.”
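To put the distinction in symbols (this is just the standard definition, not anything from the Style Guide): two quantities $x$ and $y$ differ by $n$ orders of magnitude when $\log_{10}(x/y) \approx n$, so “an order of magnitude” means a factor of roughly ten, ten up to rounding rather than ten exactly, which is precisely the uncertainty the phrase is built to carry.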
OK, so why not just say “about ten”? Because people will ignore the “about.” In a comment thread discussion I noted that even people who know better tend to reify point estimates. For instance, the advertising market is sensitive to measured changes in audience size that are well within the margin of error for audience surveys (Napoli 2003). What I like about the phrase “order of magnitude” is that unlike “ten,” or even “about ten,” the term makes it impossible to ignore the error bars.
The reason I agree with Volokh as regards lawyers but disagree with the ASA as regards sociologists is that lawyers can be forgiven for having priorities other than appreciating the uncertainty inherent in numbers. This should be a core issue for people who aspire to be scientists, and if it isn’t, why the fuck do we bother making statistics a requirement at both the undergrad and grad level?
</rant>