Using R to parse (a lot of) HTML tables

July 29, 2010 at 4:58 am

| Gabriel |

For a few months I’ve been doing a daily scrape of a website, but I had put off actually parsing the data until a colleague was dealing with a similar problem, and solving his problem reminded me of my problem. The scrape creates a folder named after the date with several dozen HTML files in it. So basically, the data is stored like this:

project/raw/
  20100601/
    page1.htm
    page2.htm
  20100602/
    page1.htm
    page2.htm

Each HTML file has one main table along with a couple of sidebar tables. For each HTML file, I want to extract the main table and write it to a text file. These text files will be put in a “clean” directory that mirrors the “raw” directory.

This is the kind of thing most people would do in Perl (or Python). I had trouble getting the Perl HTML libraries to load, although I probably could have coded it from scratch since HTML table structure is pretty simple (push the contents of <td> tags to an array, then write it out and clear the memory when you hit a </tr> tag). In any case, I ended up using R’s XML library, which is funny because usually I clean data in Perl or Stata and use R only as a last resort. Nonetheless, in what is undoubtedly a sign of the end times, here I am using R for cleaning. Forty years of darkness; The dead rising from the grave; Cats and dogs living together; Mass hysteria!
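For the record, that from-scratch approach wouldn’t be hard in R either. Here’s a rough sketch using the XML package’s XPath functions (the file name is just a placeholder, and it only grabs the first table it finds):

library(XML)

# hypothetical file; walk the rows of the first table in the page and
# collect the text of each <td>, i.e. the push-per-row logic described above
doc   <- htmlParse("page1.htm")
rows  <- getNodeSet(doc, "(//table)[1]//tr")
cells <- lapply(rows, function(tr) xpathSApply(tr, ".//td", xmlValue))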

Anyway, the first step is to get a list of the directories in “raw” and use that to seed the top level loop. (Though note that R’s XML library can also read data directly off the web). Within this loop I create a clean subdirectory to mirror the raw subdirectory. I then get a list of every file in the raw subdirectory and seed the lower level loop. The lower level loop reads each file with “readHTMLTable” and writes it out to the mirroring clean subdirectory. Then I come out of both loops and don’t really care if the top is still spinning.
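Before the full script, here is the core call on a single file. readHTMLTable() returns a list with one data frame per <table> in the page, which is why the script below keeps only the first element; a URL also works in place of a local file name. (The file name here is just a placeholder.)

library(XML)

# hypothetical file; readHTMLTable() returns a list of data frames, one per
# <table>, and the main table is assumed to parse first
tables <- readHTMLTable("page1.htm", header=FALSE)
maintable <- tables[[1]]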

# File-Name:       websiteclean.R
# Date:            2010-07-28
# Author:          Gabriel Rossman
# Purpose:         parse the scraped files
# Packages Used:   XML

timestamp()
library(XML)
parentpath<-"~/Documents/project"
rawdir<-paste(parentpath,"/raw",sep="")
setwd(rawdir)
dirlist <- list.files()
for (dir in dirlist) {
	setwd(rawdir)
	setwd(dir)
	filenames <- list.files()
	cleandir<-paste(parentpath,'/clean/',dir, sep="") #create ../../clean/`dir' and call `cleandir'
	dir.create(cleandir) #more portable than shelling out to mkdir
	print(cleandir) #progress report
	for (targetfile in filenames) {
		setwd(rawdir)
		setwd(dir)
		datafromtarget <- readHTMLTable(targetfile, header=FALSE)
		outputfile<-paste(targetfile,'.txt', sep="")
		setwd(cleandir)
		write.table(datafromtarget[[1]], file = outputfile, sep = "\t", quote=TRUE)  #when writing out, limit to the first table to avoid the sidebar tables
	}
}

# have a nice day
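As a quick sanity check, one of the cleaned files can be read back in with read.delim(); the path below just follows the directory layout sketched above.

# assumes the clean/20100601/page1.htm.txt file produced by the script
cleaned <- read.delim("~/Documents/project/clean/20100601/page1.htm.txt")
head(cleaned)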



2 Comments

  • 1. jgm  |  August 4, 2010 at 1:23 pm

    Nice work. I had no idea that R could handle HTML.

    I’m in the process of scraping some complex data from the web, which sadly isn’t in tables, although it is somewhat structured. After messing about with Perl libraries, iMacros, and Java XML parsing libraries (and some more), I came to the conclusion that unless it is well-formed HTML (which it never appears to be) it isn’t worth the effort.

    SO, I went back to the shell:
    lynx -dump -width=4096 [URL]

    which dumps a nice, clean text file that I can parse more easily. Of course, if the site structure changes radically I’m pooched, but this data collection is a one-time deal (famous last words, I know).

    Of course, after all this work you tell me that R can parse HTML… hmm.

    BTW – great site. Always nice to hear what other people are doing.

    • 2. gabrielrossman  |  August 4, 2010 at 1:42 pm

      it never occurred to me to scrape with lynx. i had been scraping with “curl” but more recently i’ve had good results with “wget”

      good luck with your scraping and cleaning. i don’t know what you’re trying to get out of the files but if it’s not a table then “grep” is a surprisingly useful tool. for more advanced query needs you might want something like the tool that Mark Kennedy is developing.

