Posts tagged ‘scraping’

Scraping Twitter with Python

| Gabriel |

As long-time readers will remember, I have been collecting Twitter data with the R library twitteR. Unfortunately that workflow has proven to be buggy, mostly for reasons having to do with authentication. As such I decided to learn Python and migrate my project to the Twython module. Overall, I've been very impressed by the language and the module. I haven't had any dependency problems and authentication works pretty smoothly. On the other hand, it requires a lot more manual coding to get around rate limits than twitteR does, and that is a big part of what my scripts are doing.

I'll let you follow the standard instructions for installing Python 3 and the Twython module before showing you my workflow. Note that all of my code was run on Python 3.5.1 and OSX 10.9. You want to use Python 3, not Python 2, as tweets are UTF-8 and Python 3 handles Unicode natively. If you're a Mac person, OSX comes with 2.7, so you will need to install Python 3 yourself. For the same reason, use Stata 14 or later for tweets, since earlier versions of Stata don't support UTF-8.
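As a quick illustration of why this matters (a toy snippet, not part of the scripts): in Python 3, strings are Unicode by default, so non-ASCII tweet text round-trips cleanly between characters and UTF-8 bytes.

```python
# Python 3 strings are sequences of code points; encoding to UTF-8 is explicit
tweet = "caf\u00e9 \u2615"  # "café" plus a hot-beverage symbol, the kind of thing tweets are full of
encoded = tweet.encode("utf-8")
assert encoded.decode("utf-8") == tweet  # lossless round-trip
print(len(tweet), len(encoded))  # 6 9 -- six code points, nine bytes
```

In Python 2 the same text would be a byte string by default, which is where the mangling starts.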

One tip on installation: pip tends to default to 2.7, so use this syntax in bash.

python3 -m pip install twython

I use three py scripts: one to write Twython queries to disk, one to query information about a set of Twitter users, and one to query tweets from a particular user. Note that the query scripts can be slow to execute, which is deliberate, as otherwise you end up hitting rate limits. (Twitter's API allows fifteen queries per fifteen minutes.) I call the two query scripts from bash with argument passing. The disk-writing script is called by the query scripts and doesn't require user intervention, though you do need to be sure Python knows where to find it (usually by keeping it in the current working directory). Note that you will need to adjust things like file paths and authentication keys. (When accessing Twitter through scripts instead of your phone, you don't use usernames and passwords but keys and secrets; you can generate the keys by registering an application.)
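The rate-limit arithmetic is worth making explicit. A back-of-envelope sketch (the scripts sleep 90 seconds rather than the bare minimum, to leave some slack):

```python
RATE_WINDOW = 15 * 60    # Twitter's rate-limit window: 15 minutes, in seconds
CALLS_PER_WINDOW = 15    # queries allowed per window
min_pause = RATE_WINDOW / CALLS_PER_WINDOW
print(min_pause)  # 60.0 -- one call per minute is the theoretical floor
```

Sleeping 90 seconds instead of 60 means you can never trip the limit even if a query internally costs a bit more than one call.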

I am discussing this script first, even though it is not directly called by the user, because it is the most natural place to discuss Twython's somewhat complicated data structure. A Twython data object is a list of dictionaries. (I adapted this script for exporting lists of dictionaries.) You can get a pretty good feel for what these objects look like by using type() and the pprint module. In the sample code below, I explore a data object called users, as returned by a user lookup query.

type(users) #shows that users is a list
type(users[0]) #shows that each element of users is a dictionary
#the raw objects are a bunch of brackets and commas; use pprint to make a dictionary (sub)object human-readable with whitespace
import pprint
pprint.pprint(users[0]['status']) #you can also zoom in on daughter objects, in this case the user's most recent tweet object. Note that this tweet is a sub-object within the user object, but may itself have sub-objects

As you can see if you use the pprint command, some of the dictionary values are themselves dictionaries. It’s a real fleas upon fleas kind of deal. In the script I pull some of these objects out and delete others for the “clean” version of the data. Also note that tw2csv defaults to writing these second-level fields as one first-level field with escaped internal delimiters. So if you open a file in Excel, some of the cells will be really long and have a lot of commas in them. While Excel automatically parses the escaped commas correctly, Stata assumes you don’t want them escaped unless you use this command:

import delimited "foo.csv", delimiter(comma) bindquote(strict) varnames(1) asdouble encoding(UTF-8) clear
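If you're curious what that escaping looks like from the Python side, the csv module quotes any field that contains the delimiter. A toy illustration (hypothetical field values, not part of the workflow):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
# the middle field contains commas, so csv wraps it in quotes rather than splitting it
writer.writerow(["12345", "a tweet, with, internal, commas", "7"])
print(buf.getvalue())  # 12345,"a tweet, with, internal, commas",7
```

Excel and Stata's bindquote(strict) both understand this convention; Stata's default import does not, hence the command above.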

Another tricky thing about Twython data is that there can be a variable number of dictionary entries (i.e., some fields are missing from some cases). For instance, if a tweet is not a retweet it will be missing the "retweeted_status" dictionary within a dictionary. This was the biggest problem with reusing the Stack Overflow code and required adapting another piece of code for getting the union set of dictionary keys. Note this will give you all the keys used in any entry from the current query, but not those found only in past or future queries. Likewise, Python returns dictionary keys in arbitrary order. For these two reasons, I hard-coded tw2csv to overwrite rather than append, and built a timestamp into the query scripts. If you tweak the code to append, you will run into problems with the fields not lining up.

Anyway, here’s the actual tw2csv code.
def tw2csv(twdata,csvfile_out):
    import csv
    import functools
    #union of keys across all entries, since some fields are missing from some cases
    allkey = functools.reduce(lambda x, y: x.union(y.keys()), twdata, set())
    with open(csvfile_out,'wt') as output_file:
        dict_writer = csv.DictWriter(output_file, fieldnames=sorted(allkey))
        dict_writer.writeheader()
        dict_writer.writerows(twdata) #entries missing a field get an empty cell
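Here is the union-of-keys pattern on toy data, self-contained so you can see what happens to the ragged fields (the field names are hypothetical, but the pattern is the same):

```python
import csv
import functools
import io

rows = [{"id": 1, "text": "hello"},
        {"id": 2, "text": "rt!", "retweet_count": 5}]  # second row has an extra field
# union of every key appearing in any row
allkey = functools.reduce(lambda x, y: x.union(y.keys()), rows, set())
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=sorted(allkey))
writer.writeheader()
writer.writerows(rows)  # rows missing a field get an empty cell, not an error
print(buf.getvalue())
```

The first data row comes out as `1,,hello`: DictWriter fills missing fields with blanks, which is exactly the behavior you want for ragged Twitter data.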

One of the queries I like to run is getting basic information like date created, description, and follower counts. Basically, all the stuff that shows up on a user's profile page. The Twitter API allows you to do this for 100 users simultaneously, and the script batches the queries accordingly. It assumes that your list of target users is stored in a text file, but there's a commented-out line that lets you hard-code the users, which may be easier if you're doing it interactively. Likewise, it's designed to only query 100 users at a time, but there's a commented-out line that's much simpler in interactive use if you're only querying a few users.
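The batching logic is simple to express on its own. A sketch of the pattern (the script does the equivalent by deleting 100-handle slices off the front of the list):

```python
def batches(seq, size=100):
    """Yield successive size-length slices of seq."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

handles = ["user%d" % i for i in range(250)]  # pretend list of 250 screennames
print([len(b) for b in batches(handles)])  # [100, 100, 50]
```

Each batch then becomes one API call, with a sleep between calls.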

You can call it from the command line and it takes as an argument the location of the input file. I hard-coded the location of the output. Note the “3” in the command-line call is important as operating systems like OSX default to calling Python 2.7.

python3 list.txt

And here’s the actual script. Note that I’ve taken out my key and secret. You’ll have to register as an “application” and generate these yourself.
from twython import Twython
import sys
import time
from math import ceil
import tw2csv #custom module

targetlist=sys.argv[1] #text file listing feeds to query, one per line. full path ok.
today = time.strftime("%Y%m%d")

APP_KEY='' #25 alphanumeric characters
APP_SECRET='' #50 alphanumeric characters
twitter=Twython(APP_KEY,APP_SECRET,oauth_version=2) #simple authentication object

handles = [line.rstrip() for line in open(targetlist)] #read from text file given as cmd-line argument
#handles=("gabrielrossman,sociologicalsci,twitter") #alternately, hard-code the list of handles

#API allows 100 users per query. Cycle through, 100 at a time
#users = twitter.lookup_user(screen_name=handles) #this one line is all you need if len(handles) < 100
users=[] #initialize data object
#unlike a get_user_timeline query, there is no need to cap total cycles
cycles = ceil(len(handles) / 100)
for i in range(0, cycles): ## iterate through the handles, 100 per query
    h = handles[0:100] #take the next batch of up to 100
    del handles[0:100]
    incremental = twitter.lookup_user(screen_name=",".join(h)) #API expects a comma-separated string
    users.extend(incremental)
    time.sleep(90) ## 90 second rest between api calls. The API allows 15 calls per 15 minutes so this is conservative
tw2csv.tw2csv(users,'userinfo_'+today+'.csv') #output location is hard-coded; adjust to taste


This last script collects tweets for a specified user. The tricky thing about this code is that the Twitter API allows you to query the last 3200 tweets per user, but only 200 at a time, so you have to cycle over them. Moreover, you have to build in a delay so you don't get rate-limited. I adapted the script from code I found online but made some tweaks.

One change I made was to only scrape as deep as necessary for any given user. For instance, as of this writing, @SociologicalSci has 1192 tweets, so it cycles six times, but if you run it in a few weeks @SociologicalSci would have over 1200 and so it would run at least seven cycles. This change makes the script run faster, but ultimately gets you to the same place.
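The cycle arithmetic works out as in this sketch (the 16-cycle cap mirrors the API's 3200-tweet depth limit; the helper name is mine, not the script's):

```python
from math import ceil

def cycles_needed(statuses_count, per_page=200, max_tweets=3200):
    """How many 200-tweet pages to request for a user, capped at the API's depth limit."""
    return min(ceil(statuses_count / per_page), max_tweets // per_page)

print(cycles_needed(1192))   # 6 -- six cycles cover @SociologicalSci's 1192 tweets
print(cycles_needed(50000))  # 16 -- no point going deeper than 200*16
```

At 90 seconds per cycle, the difference between 6 and 16 cycles is a quarter hour of wall-clock time per user, which adds up over a long target list.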

The other change I made is that I save two versions of the file, one as is and the other that pulls out some objects from the subdictionaries and deletes the rest. If for some reason you don’t care about retweet count but are very interested in retweeting user’s profile background color, go ahead and modify the code. See above for tips on exploring the data structure interactively so you can see what there is to choose from.

As above, you’ll need to register as an application and supply a key and secret.

You call it from bash with the target screenname as an argument.

python3 sociologicalsci
from twython import Twython
import sys
import time
import simplejson
from math import ceil
import tw2csv #custom module

handle=sys.argv[1] #takes target twitter screenname as command-line argument
today = time.strftime("%Y%m%d")

APP_KEY='' #25 alphanumeric characters
APP_SECRET='' #50 alphanumeric characters
twitter=Twython(APP_KEY,APP_SECRET,oauth_version=2) #simple authentication object

#user_timeline=twitter.get_user_timeline(screen_name=handle,count=200) #if doing 200 or less, just do this one line
user_timeline=twitter.get_user_timeline(screen_name=handle,count=1) #get most recent tweet
lis=user_timeline[0]['id']-1 #tweet id # for most recent tweet
#only query as deep as necessary
tweetsum= user_timeline[0]['user']['statuses_count']
cycles=ceil(tweetsum / 200)
if cycles>16:
    cycles=16 #API only allows depth of 3200 so no point trying deeper than 200*16
for i in range(0, cycles): ## iterate through all tweets up to max of 3200
    incremental = twitter.get_user_timeline(screen_name=handle,
    count=200, include_retweets=True, max_id=lis)
    user_timeline.extend(incremental) #append the new batch to the running list
    lis = user_timeline[-1]['id'] - 1 #walk max_id back so the next call picks up where this one stopped
    time.sleep(90) ## 90 second rest between api calls. The API allows 15 calls per 15 minutes so this is conservative
tw2csv.tw2csv(user_timeline, handle+'_'+today+'.csv') #save the as-is version; output location is hard-coded


#clean the file and save it
for i, val in enumerate(user_timeline):
    if 'retweeted_status' in user_timeline[i].keys():
        user_timeline[i]['rt_count'] = user_timeline[i]['retweeted_status']['retweet_count']
        user_timeline[i]['rt_id'] = user_timeline[i]['retweeted_status']['id']
        user_timeline[i]['rt_created'] = user_timeline[i]['retweeted_status']['created_at']
        user_timeline[i]['rt_user_screenname'] = user_timeline[i]['retweeted_status']['user']['screen_name']
        user_timeline[i]['rt_user_id'] = user_timeline[i]['retweeted_status']['user']['id']
        user_timeline[i]['rt_user_followers'] = user_timeline[i]['retweeted_status']['user']['followers_count']
        del user_timeline[i]['retweeted_status']
    if 'quoted_status' in user_timeline[i].keys():
        user_timeline[i]['qt_created'] = user_timeline[i]['quoted_status']['created_at']
        user_timeline[i]['qt_id'] = user_timeline[i]['quoted_status']['id']
        user_timeline[i]['qt_text'] = user_timeline[i]['quoted_status']['text']
        user_timeline[i]['qt_user_screenname'] = user_timeline[i]['quoted_status']['user']['screen_name']
        user_timeline[i]['qt_user_id'] = user_timeline[i]['quoted_status']['user']['id']
        user_timeline[i]['qt_user_followers'] = user_timeline[i]['quoted_status']['user']['followers_count']
        del user_timeline[i]['quoted_status']
    if user_timeline[i]['entities']['urls']: #list
        for j, val in enumerate(user_timeline[i]['entities']['urls']):
            user_timeline[i]['url'+str(j)] = user_timeline[i]['entities']['urls'][j]['expanded_url']
    if user_timeline[i]['entities']['user_mentions']: #list
        for j, val in enumerate(user_timeline[i]['entities']['user_mentions']):
            user_timeline[i]['mention'+str(j)] = user_timeline[i]['entities']['user_mentions'][j]['screen_name']
    if user_timeline[i]['entities']['hashtags']: #list
        for j, val in enumerate(user_timeline[i]['entities']['hashtags']):
            user_timeline[i]['hashtag'+str(j)] = user_timeline[i]['entities']['hashtags'][j]['text']
    if user_timeline[i]['coordinates'] is not None:  #NoneType or Dict
        user_timeline[i]['coord_long'] = user_timeline[i]['coordinates']['coordinates'][0]
        user_timeline[i]['coord_lat'] = user_timeline[i]['coordinates']['coordinates'][1]
    del user_timeline[i]['coordinates']
    del user_timeline[i]['user']
    del user_timeline[i]['entities']
    if 'place' in user_timeline[i].keys():  #NoneType or Dict
        del user_timeline[i]['place']
    if 'extended_entities' in user_timeline[i].keys():
        del user_timeline[i]['extended_entities']
    if 'geo' in user_timeline[i].keys():
        del user_timeline[i]['geo']
tw2csv.tw2csv(user_timeline, handle+'_'+today+'_clean.csv') #save the cleaned version (output location hard-coded)


January 19, 2016 at 8:10 am

After Surveys

The following is a guest post from Trey Causey, a long-time reader of codeandculture and a grad student at Washington who does a lot of work with web scraping. This is his second guest post here, his first was on natural language processing SNAFUs.

| Trey |

“We all know that survey methods are becoming increasingly outdated and clunky.” While at the ASA Annual Meeting in Denver, I attended a methods panel where a graduate student opened with this remark; I’m paraphrasing here (but not by much). This remark was met with sympathetic laughter, mock(?) indignation, and some groans. Subsequent presenters, whose papers were mostly based on survey research, duly referred back to this comment. Survey research has been the workhorse of social scientists for generations and large portions of our methods training are usually survey-oriented in one way or another.

While the phrasing may have been indelicate, the graduate student is on to something important. I later spoke to a faculty member from the same department as the aforementioned graduate student. We were discussing the rise of “big data” and non-survey-based methods in the social sciences and he commented that he felt that the discipline was inevitably headed towards these methods, as survey response rates continue to fall.

I was reminded of both of these anecdotes earlier this week when Claude Fischer blogged about the crisis of survey response rates. He writes that Pew is averaging an astonishing 9% completion rate these days. He closes by speculating on what we can use instead of surveys to measure public opinion: “…letters to the editor, Facebook “likes,” calls to congressional offices, tweet vocabulary, street demonstrations — or the most common way we do it, assuming that what we and our friends think is typical. We might track Americans’ health by admissions to the hospitals, death rates, consumption of Lipitor. We could estimate changes in poverty by counting beggars on the street, malnourished kids at school. We could try to figure out the “dark” crime number by… I don’t know.”

Fischer is right to be pessimistic about the use of surveys. It’s not clear that the dwindling numbers of participants in many surveys are representative of the larger population (for many reasons). I think the writing is on the wall for all but the most well-funded and well-staffed survey organizations to provide large, representative samples of the American public.

However, I’m optimistic that new forms of data and methods are waiting to pick up the slack. As readers of this blog are no doubt aware, I am a proponent of using data scraped from the web; this could include Facebook likes or Twitter vocabulary, as Fischer points out, or newspaper articles, message board posts, or music sales data. Although debates inevitably (and rightly) arise when studies using these kinds of data are discussed amongst social scientists, the representativeness argument wielded (mostly) by survey researchers is overblown and perhaps deserves to be redirected back at survey research.

Are data scraped from the web representative of a larger population? That depends on which population we’re talking about. Questions that take the form of “does subgroup A differ significantly on average from subgroup B or population P on some attitudinal measure Y” usually require some kind of representative sample of a general population. However, many questions don’t take this form and social scientists are often interested in studying specific subpopulations, often underrepresented in probability samples.

Data scraped from the web have some real benefits. They are often naturalistic and intentional — individuals decided to write something of their own accord, unprompted by any researcher, posted it online for others to read, hoping to communicate something. They are generated in real-time. Reading message board threads like this one from Metafilter on September 11, 2001 gives us real-time insight into how people were making sense of an unfolding tragedy. No retrospective bias, no need to get a survey team on the ground a day or two after things happen. We can combine what individuals say with behavioral data. Scraping is one of the more unobtrusive forms of data collection — the act of observation by the researcher does not alter the individual’s behavior. It’s cheap and fast–even by sociologists’ standards.

Obviously, as people become more savvy about what companies and governments track about them from their online activities, this will probably change. However, it’s still more naturalistic than answers to closed-form survey questions over the phone. Rather than trying to figure out if survey responses represent a stable attitude or if they are epiphenomenal, the individual generates the data independently. While presentation management will always be an underlying concern when trying to figure out what people “really” mean or “really” want, it is not obvious to me that this problem was ever solved in survey research. Amazingly, people offer up copious information about sensitive topics–their sex lives, their drug habits, and their personal prejudices–without being asked and without the questions being cleared by an IRB (although analyzing this information will still require IRB approval). Combined with new methods to analyze unstructured text, advances in network analysis, and increases in computing power, the possibilities are really quite amazing.

It’s also important to note that data scraped from the web are not limited to the activities of computer users themselves — governments store loads of information online, media outlets often have full transcripts online, corporations publish earnings reports, and universities post course schedules and enrollment figures. They aren’t posted on the ICPSR as “datasets”, but they’re easily obtainable.

Arguing that data from the internet are somehow less real, less valid, or less reliable does nothing to further the cause of social science. We all know about the Literary Digest moments for survey research. But we didn't give up on surveys right away; we figured out how to make them better. Surveys allowed us to peer into a section of the public's mind and figure out what people were thinking on an unprecedented scale. New forms of data and new ways to analyze those data offer some of the same promise. Note that I'm not a "big data" Pollyanna; bigger data don't spell the end of science. But updating our methods of inference to deal with big data and new kinds of data will allow us to study social interaction on a new scale yet again.

Is Twitter representative of the overall American or world population? Of course not. Will it ever be? Who knows. The key is to figure out the sources of bias and correct for them — the return on investment to doing so would most likely exceed that of trying to figure out how to salvage an ailing survey research patient.

September 12, 2012 at 11:44 am 5 comments

Is Facebook “Naturally Occurring”?

| Gabriel |

Lewis, Gonzalez, and Kaufman have a forthcoming paper in PNAS on “Social selection and peer influence in an online social network.” The project uses Facebook data from the entire college experience of a single cohort of undergrads at one school in order to pick at the perennial homophily/influence question. (Also see earlier papers from this project).

Overall it's an excellent study. The data collection and modeling efforts are extremely impressive. Moreover I'm very sympathetic to (and plan to regularly cite) the conclusion that contagious diffusion is over-rated and we need to consider the micro-motives and mechanisms underlying contagion. I especially liked how they synthesize the Bourdieu tradition with diffusion to argue that diffusion is most likely for taste markers that are distinctive in both senses of the term. As is often the case with PNAS or Science, the really good stuff is in the appendix, and in this case it gets downright comical as they apply some very heavy analytical firepower to trying to understand why hipsters are such pretentious assholes before giving up and delegating the issue to ethnography.

The thing that really got me thinking though was a claim they make in the methods section:

Because data on Facebook are naturally occurring, we avoided interviewer effects, recall limitations, and other sources of measurement error endemic to survey-based network research.

That is, the authors are reifying Facebook as “natural.” If all they mean is that they’re taking a fly on the wall observational approach, without even the intervention of survey interviews, then yes, this is naturally occurring data. However I don’t think that observational necessarily means natural. If researchers themselves imposed reciprocity, used a triadic closure algorithm to prime recall, and discouraged the deletion of old ties; we’d recognize this as a measurement issue. It’s debatable whether it’s any more natural if Mark Zuckerberg is the one making these operational measurement decisions instead of Kevin Lewis.

Another way to put this is to ask where does social reality end and observation of it begin? In asking the question I’m not saying that there’s a clean answer. On one end of the spectrum we might have your basic random-digit dialing opinion survey that asks people to answer ambiguously-worded Likert-scale questions about issues they don’t otherwise think about. On the other end of the spectrum we might have well-executed ethnography. Sure, scraping Facebook isn’t as unnatural as the survey but neither is it as natural as the ethnography. Of course, as the information regimes literature suggests to us, you can’t really say that polls aren’t natural either insofar as their unnatural results leak out of the ivory tower and become a part of society themselves. (This is most obviously true for things like the unemployment rate and presidential approval ratings).

At a certain point something goes from figure to ground and it becomes practical, and perhaps even ontologically valid, to treat it as natural. You can make a very good argument that market exchange is a social construction that was either entirely unknown or only marginally important for most of human history. However at the present the market so thoroughly structures and saturates our lives that it’s practical to more or less take it for granted when understanding modern societies and only invoke the market’s contingent nature as a scope condition to avoid excessive generalization of economics beyond modern life and into the past, across cultures, and the deep grammar of human nature.

We are, God help us, rapidly approaching a situation where online social networks structure and constitute interaction. Once we do, the biases built into these systems are no longer measurement issues but will be constitutive of social structure. During the transitional period we find ourselves in though, let’s recognize that these networks are human artifices that are in the process of being incorporated into social life. We need a middle ground between “worthless” and “natural” for understanding social media data.

December 22, 2011 at 11:07 am 16 comments

Scraping Using twitteR (updated)

| Gabriel |

Last time I described using the twitteR library for R. In that post I had R itself loop over the list of feeds. In this post I have the looping occur in Bash, with argument passing to R through the commandArgs() function.

First, one major limitation of using Twitter is that it times you out for an hour after 150 queries. (You can double this if you use OAuth but I’ve yet to get that to work). For reasons I don’t really understand, getting one feed can mean multiple queries, especially if you’re trying to go far back in the timeline. For this reason you need to break up your list into a bunch of small lists and cron them at least 80 minutes apart. This bit of Bash code will split up a file called “list.txt” into several files. Also, to avoid trouble later on, it makes sure you have Unix EOL.

split -l 50 list.txt short_tw
perl -pi -e 's/\r\n/\n/g' short_tw*

The next thing to keep in mind is that you’ll need to pass arguments to R. Argument passing is when a script takes input from outside the script and processes it as variables. The enthusiastic use of argument passing in Unix is the reason why there is a fine line between a file and a command in that operating system.

In theory you could have R read the target list itself but this crashes when you hit your first dead URL. Running the loop from outside R makes it more robust but this requires passing arguments to R. I'd previously solved this problem by having Stata write an entire R script, which Stata understood as having variables (or "macros") but which from R's perspective was hard-coded. However I was recently delighted to discover that R can accept command-line arguments with the commandArgs() function. Not surprisingly, this is more difficult than $1 in Bash, @ARGV in Perl, or `1' in Stata, but it's not that bad. To use it you have to use the "--args" option when invoking R and then inside of R you use the commandArgs() function to pass arguments to an array object, which behaves just like the @ARGV array in Perl.
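For comparison with the other languages above, the Python equivalent really is just sys.argv. A toy demonstration that launches a one-line script with an argument (the screenname is arbitrary):

```python
import subprocess
import sys

# run a tiny inline script that echoes its first command-line argument
out = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", "asanews"],
    capture_output=True, text=True)
print(out.stdout.strip())  # asanews
```

R's commandArgs(trailingOnly = TRUE) fills the same role, it just takes more ceremony to invoke.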

Here’s an R script that accepts a Twitter screenname as a command-line argument, uses the twitteR library to collect that feed, and then saves it as a tab-delimited text file of the same name. (It appends if there’s an existing file). Also note that (thanks to commenters on the previous post) it turns internal EOL into regular spaces. It’s currently set to collect the last 200 tweets but you can adjust this with the third line (or you could rewrite the script to make this a command-line argument as well).

library(twitteR)

args <- commandArgs(trailingOnly = TRUE)
howmany <- 200 #how many past tweets to collect

user <- args[1]
outputfile <- paste('~/project/feeds/',user,'.txt',sep="")

tmptimeline <- userTimeline(user,n=as.character(howmany))
tmptimeline.df <- twListToDF(tmptimeline)
tmptimeline.df$text <- gsub("\\n|\\r|\\t", " ", tmptimeline.df$text) #zap internal EOL characters
write.table(tmptimeline.df,file=outputfile,append=TRUE,sep="\t",col.names=FALSE)


To use the script to get just a single feed, you invoke it like this from the command-line.

R --vanilla --args asanews < datacollection.R

Of course the whole reason to write the script this way is to loop it over the lists. Here it is for the list “short_twaa”.

for i in `cat short_twaa`; do R --vanilla --args $i < datacollection.R ; done

Keep in mind that you'll probably want to cron this, either because you want a running scrape or because it makes it easier to space out the "short_tw*" files so you don't get timed out.

December 20, 2011 at 11:04 pm 1 comment

Scraping Using twitteR

| Gabriel |

Previously I'd discussed scraping Twitter using Bash and Perl. Then yesterday on an orgtheory thread Trey mentioned the R library twitteR, and with some help from Trey I worked out a simple script that replaces the scripts from the earlier workflow. The advantage of this script is that it's a lot shorter, it can get an arbitrary number of tweets instead of just 20, and it captures some of the meta-text that could be useful for constructing social networks.

To use it you need a text file that consists of a list of Twitter feeds, one per line. The location of this file is given in the “inputfile” line.

The “howmany” line controls how many tweets back it goes in each feed.

The “outputfile” line says where the output goes. Note that it treats it as append. As such you can get some redundant data, which you can fix by running this bash code:

sort mytweets.txt | uniq > tmp
mv tmp mytweets.txt

The outputfile has no headers, but they are as follows:

#v1 ignore field, just shows number w/in query
#v2 text of the Tweet
#v3 favorited dummy
#v4 replytosn (mention screenname)
#v5 created (date in YMDhms)
#v6 truncated dummy
#v7 replytosid
#v8 id
#v9 replytouid
#v10 statussource (Twitter client)
#v11 screenname

Unfortunately, the script doesn't handle multi-line tweets very well, but I'm not sufficiently good at R to regexp out internal EOL characters. I'll be happy to work this in if anyone cares to post some code to the comments on how to do a find and replace that zaps the internal EOL in the field tmptimeline.df$text.
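For what it's worth, the substitution itself is a one-line regular expression. Here is the idea sketched in Python (an R gsub on the same pattern does the equivalent):

```python
import re

tweet = "first line\nsecond line\r\nthird line"
# collapse any run of carriage returns, newlines, or tabs into a single space
clean = re.sub(r"[\r\n\t]+", " ", tweet)
print(clean)  # first line second line third line
```

The `+` matters: it turns a Windows-style `\r\n` pair into one space rather than two.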

library(twitteR)

howmany <- 30 #how many past tweets to collect
inputfile <- "~/feedme.txt"
outputfile <- "~/mytweets.txt"

feeds <- as.vector(t(read.table(inputfile)))
for (user in feeds) {
	tmptimeline <- userTimeline(user,n=as.character(howmany))
	tmptimeline.df <- twListToDF(tmptimeline)
	write.table(tmptimeline.df,file=outputfile,append=TRUE,sep="\t",col.names=FALSE)
}

Finally, if you import it into Stata you’ll probably want to run this:

drop v1
ren v2 text
ren v3 favorited 
ren v4 replytosn
ren v5 created
ren v6 truncated 
ren v7 replytosid
ren v8 id
ren v9 replytouid
ren v10 statussource
ren v11 screenname
gen double timestamp=clock(subinstr(created,"-","/",.),"YMDhms")
format timestamp %tc

December 13, 2011 at 1:21 pm 5 comments

Seven-inch heels, natural language processing, and sociology

The following is a guest post from Trey Causey, a long-time reader of codeandculture and a grad student at Washington who does a lot of work with web scraping. We got to discussing a dubious finding and at my request he graciously wrote up his thoughts into a guest post.

| Trey |

Recently, Gabriel pointed me to a piece in Ad Age (and original press release) about IBM researchers correlating the conversations of fashion bloggers with the state of the economy (make sure you file away the accompanying graph for the next time you teach data visualization). Trevor Davis, a “consumer-products expert” with IBM, claimed that as economic conditions improve, the average height of high heels mentioned by these bloggers decreases. Similarly, as economic conditions worsen, the average height would increase. As Gabriel pointed out, these findings seemed to lack any sort of face validity — how likely does it seem that, at any level of economic performance, the average high heel is seven inches tall (even among fashionistas)? I’ll return to the specific problems posed by what I’ll call the “seven-inch heel problem” in a moment, but first some background on the methods that most likely went into this study.

While amusing, if not very credible, the IBM study is part of a growing area (dubbed by some “computational social science”) situated at the intersection of natural language processing and machine learning. By taking advantage of the explosion of available digital text and computing power, researchers in this area are attempting to model structure in and test theories of large-scale social behavior. You’ve no doubt seen some of this work in the media, ranging from “predicting the Arab Spring” to using Twitter to predict GOP primary frontrunners. Many of these works hew towards the style end of the style-substance divide and are not typically motivated by any recognizable theory. However, this is changing as linguists use Twitter to discover regional dialect differences and model the daily cycle of positive and negative emotions.

Much of this work is being met by what I perceive to be reflexive criticism (as in automatic, rather than in the more sociological sense) from within the academy. The Golder and Macy piece in particular received sharp criticism in the comments on orgtheory, labeled variously “empiricism gone awry”, non-representative, and even fallacious (and in which yours truly was labeled “cavalier”). Some of this criticism is warranted, although as is often the case with new methods and data sources, much of the criticism seems rooted in misunderstanding. I suspect part of this is the surprisingly long-lived skepticism of scholarly work on “the internet” which, with the rise of Facebook and Twitter, seems to have been reinvigorated.

However, sociologists are doing themselves a disservice by seeing this work as research on the internet qua internet. Incredible amounts of textual and relational data are out there for the analyzing — and we all know if there’s one thing social scientists love, it’s original data. And these data are not limited to blog posts, status updates, and tweets. Newspapers, legislation, historical archives, and more are rapidly being digitized, providing pristine territory for analysis. Political scientists are warming to the approach, as evidenced by none other than the inimitable Gary King and his own start-up Crimson Hexagon, which performs sentiment analysis on social media using software developed for a piece in AJPS. Political Analysis, the top-ranked journal in political science and the methodological showcase for the discipline, devoted an entire issue in 2008 to the “text-as-data” approach. Additionally, a group of historians and literary scholars have adopted these methods, dubbing the new subfield the “digital humanities.”

Sociologists of culture and diffusion have already warmed to many of these ideas, but the potential for other subfields is significant and largely unrealized. Social movement scholars could find ways to empirically identify frames in wider public discourse. Sociologists of stratification have access to thousands of public- and private-sector reports, the texts of employment legislation, and more to analyze. Race, ethnicity, and immigration researchers can model changing symbolic boundaries across time and space. The real mistake, in my view, is dismissing these methods as an end in and of themselves rather than as a tool for exploring important and interesting sociological questions. Although many of the studies hitting the mass media seem more “proof of concept” than “test of theory,” this is changing; sociologists will not want to be left behind. Below, I will outline the basics of some of these methods and then return to the seven-inch heels problem.

The use of simple scripts or programs to scrape data from the web or Twitter has been featured several times on this blog. The data that I collected for my dissertation were crawled and then scraped from multiple English and Arabic news outlets that post their archives online, including Al Ahram, Al Masry Al Youm, Al Jazeera, and Asharq al Awsat. The actual scrapers are written in Python using the Scrapy framework.

Obtaining the data is the first and least interesting step (to sociologists). Using the scraped data, I am creating chains of topic models (specifically using Latent Dirichlet Allocation) to model latent discursive patterns in the media from the years leading up to the so-called "Arab Spring." In doing so, I am trying to identify the convergence and divergence in discourse across and within sources to understand how contemporary actors were making sense of their social, political, and economic contexts prior to a major social upheaval. Estimating common knowledge prior to contentious political events is often problematic: hindsight biases intervene, surveys are hard to conduct in non-democracies, and, most obviously, we usually don't know when a major social upheaval is about to happen even if we know which places are more susceptible.

Topic modeling is a method that will look familiar in its generalities to anyone who has seen a cluster analysis. Essentially, topic models use unstructured text — i.e., text without labeled fields from a database or from a forced-choice survey — to model the underlying topical components that make up a document or set of documents. For instance, one modeled topic might be composed of the words "protest", "revolution", "dictator", and "tahrir". The model attempts to find the words that have the highest probability of being found with one another and the lowest probability of being found with other words. The generated topics are devoid of meaning, however, without theoretically informed interpretation. This is analogous to survey researchers who perform cluster or factor analyses to find items that "hang together" and then attempt to figure out what latent construct links them.

Collections of documents (a corpus) are usually represented as a document-term matrix, where each row is a document and the columns are all of the words that appear in your set of documents (the vocabulary). The contents of the individual cells are the per-document word frequencies. This produces a very sparse matrix, so some pre-processing is usually performed to reduce the dimensionality. The majority of all documents from any source are filled with words that convey little to no information — prepositions, articles, common adjectives, etc. (see Zipf's law). Words that appear in every document or in a very small number of documents provide little explanatory power and are usually removed. The texts are often pre-processed using tools such as the Natural Language Toolkit for Python or RTextTools (which is developed in part here at the University of Washington) to remove these words and punctuation. Further, words are often "stemmed" or "lemmatized" so that variants sharing a root but differing in suffixes or prefixes are collapsed into a single term. For example, "run", "runner", "running", and "runs" might all be reduced to "run".
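That whole pipeline — per-document counts, stopword removal, crude stemming — can be sketched with nothing but the Python standard library. The stopword list and the suffix-stripping "stemmer" below are deliberately toy stand-ins for the real tools (NLTK, RTextTools) mentioned above:

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "and"}  # toy list; real lists run to hundreds of words

def crude_stem(word):
    # Stand-in for a real stemmer (e.g., Porter): strip a few common suffixes.
    for suffix in ("ning", "ner", "ing", "ers", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def doc_term_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in doc i."""
    counts = [
        Counter(
            crude_stem(w)
            for w in doc.lower().split()
            if w not in STOPWORDS
        )
        for doc in docs
    ]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix

docs = ["The runner was running runs", "A run in the park"]
vocab, m = doc_term_matrix(docs)
print(vocab)  # "runner", "running", and "runs" all collapse to "run"
```

Note how even this toy version delivers the dimensionality reduction described above: four surface forms of "run" become one column.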

This approach is known as a "bag-of-words" approach in that the order and context of the words are assumed to be unimportant (obviously a contentious assumption, but perhaps a debate for another blog). Researchers who are uncomfortable with this assumption can use n-grams, groupings of two or more words, rather than single words. However, as n increases, the number of possible combinations and the accompanying computing power required grow rapidly. You may be familiar with the Google Ngram Viewer. Most of the models are extendable to other languages and are indifferent to the actual content of the text, although obviously the researcher needs to be able to read and make sense of the output.
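Extracting the n-grams themselves is trivial; the cost of raising n is in the size of the resulting vocabulary, not the code. A quick sketch:

```python
def ngrams(text, n):
    """Return the list of n-grams (as tuples) from whitespace-tokenized text."""
    tokens = text.lower().split()
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

sentence = "the people want the fall of the regime"
print(ngrams(sentence, 2))  # bigrams: ('the', 'people'), ('people', 'want'), ...
```

A vocabulary of V distinct words yields at most V possible unigram columns but up to V squared possible bigram columns, which is why the document-term matrix explodes as n grows.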

Other methods require different assumptions. If you are interested in parts of speech, a part-of-speech tagger is required, which assumes that the document is fairly coherent and not riddled with typos. Tracking exact or near-exact phrases is difficult as well, as evidenced by the formidable team of computer scientists working on MemeTracker. The number of possible variations on even a short phrase quickly becomes unwieldy and requires substantial computational resources — which brings us back to the seven-inch heels.

Although IBM now develops the oft-maligned SPSS, it also produced Watson, which is why the apparent lack of validity of the fashion-blogging results is surprising. If one were seriously going to track the heights of heels mentioned and correlate them with economic conditions, then to have any confidence of capturing an unbiased sample of mentions, one would need, at a minimum, to address the following:

  • Identifying possible combinations of size metrics and words for heels: seven-inch heels, seven inch heels, seven inch high heels, seven-inch high-heels, seven inch platforms, and so on. This is further complicated by the fact that many text-processing algorithms will treat "seven-inch" as a single word.
  • Dealing with punctuational abbreviations for these metrics: 7″ heels, 7″ high heels, 7 and a 1/2 inch heels, etc. Since punctuation is usually stripped out, it would be necessary to leave it in, but then how do you distinguish quotation marks that serve as size abbreviations from those that appear in other contexts?
  • Do we include all of these variations with “pumps?” Is there something systematic such as age, location, etc. about individuals that refer to “pumps” rather than “heels?”
  • Are there words or descriptions for heels that I’m not even aware of? Probably.

None of these is an insurmountable problem, and I have no doubt that IBM researchers have easy access to substantial computing power. However, each of them requires careful thought prior to and following data collection, and in combination they quickly complicate matters. Since IBM is unlikely to reveal their methods, though, I have serious doubts as to the validity of their findings.
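To make the first two bullet points concrete, here is what even a first-pass pattern might look like. The pattern and test phrases are my own illustrations, not IBM's method, and the pattern deliberately misses "pumps" and fractional heights, exactly the coverage problems the bullets describe:

```python
import re

# Matches e.g. "seven-inch heels", "7 inch high heels", '7" heels'.
# Deliberately incomplete: no "pumps", no fractions like "7 1/2 inch".
HEEL_RE = re.compile(
    r"""\b
    (?P<height>\d+|one|two|three|four|five|six|seven|eight|nine|ten)
    [\s-]*
    (?:inch(?:es)?|")          # unit, as a word or a quote-mark abbreviation
    [\s-]*
    (?:high[\s-])?             # optional "high"
    heels?
    \b""",
    re.IGNORECASE | re.VERBOSE,
)

phrases = [
    "I love my seven-inch heels",
    'she wore 7" heels last night',
    "those seven inch high heels",
    "classic black pumps",        # silently missed, per the third bullet
]
hits = [bool(HEEL_RE.search(p)) for p in phrases]
print(hits)  # [True, True, True, False]
```

Note that making this work at all requires keeping the quote-mark punctuation in the text, which is precisely the tension with standard pre-processing flagged in the second bullet.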

As any content analyst can tell you, text is a truly unique data source as it is intentional language and is one of the few sources of observational data for which the observation process is totally unobtrusive. In some cases, the authors are no longer alive! Much of the available online text of interest to social scientists was not produced for scholarly inquiry and was not generated from survey responses. However, the sheer volume of the text requires some (but not much!) technical sophistication to acquire and make sense of and, like any other method, these analyses can produce results that are essentially meaningless. Just as your statistics package of choice will output meaningless regression results from just about any data you feed into it, automated and semi-automated text analysis produces its own share of seven-inch heels.

November 21, 2011 at 8:46 am 11 comments

Scraping Twitter

| Gabriel |

I recently got interested in a set of communities that have a big Twitter presence and so I wrote some code to collect it. I started out creating OPML bookmark files which I can give to RSSOwl,* but ultimately decided to do everything in wget so I can cron it on my server. Nonetheless, I’m including the code for creating OPML in case people are interested.

Anyway, here’s the directory structure for the project:


I have a list of Twitter web URLs for each community that I store in “projects/lists” and call “twitterlist_a.txt”, “twitterlist_b.txt”, etc. I collected these lists by hand but if there’s a directory listing on the web you could parse it to get the URLs. Each of these files is just a list and looks like this:

I then run these lists through a Bash script, run from "project/", that takes the name of the community as an argument. It collects the Twitter page from each URL and extracts the RSS feed and "Web" link for each. It writes the RSS feeds both to an OPML file (an XML file that RSS readers treat as a bookmark list) and to a plain text file. The script also finds the "Web" link in the Twitter feeds and saves those links as a file suitable for later use with wget.

#take list of twitter feeds
#extract rss feed links and convert to OPML (XML feed list) format
#extract weblinks 

#get current first page of twitter feeds
cd $1
wget -N --input-file=../lists/twitterlist_$1.txt
cd ..

#parse feeds for RSS, gen opml file (for use with RSS readers like RSSOwl)
echo -e "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<opml version=\"1.0\">\n\t<head>\n\t</head>\n\t<body>" > lists/twitrss_$1.opml
grep -r 'xref rss favorites' ./$1 | perl -pe 's/.+\/(.+):     .+href="\/favorites\/(.+)" class.+\n/\t\t<outline text="$1" type="rss" xmlUrl="http:\/\/\/statuses\/user_timeline\/$2"\/>\n/' >> lists/twitrss_$1.opml
echo -e "\t</body>\n</opml>\n" >> lists/twitrss_$1.opml

#make simple text list out of OPML (for use w wget)
grep 'http\:\/' lists/twitrss_$1.opml | perl -pe 's/\s+\<outline .+(http.+\.rss).+$/\1/' > lists/twitrss_$1.txt

#parse Twitter feeds for link to real websites (for use w wget)
grep -h -r '>Web</span>' ./$1 | perl -pe 's/.+href="(.+)" class.+\n/$1\n/' > lists/web_$1.txt

echo -e "\nIf using GUI RSS, please remember to import the OPML feed into RSSOwl or Thunderbird\nIf cronning, set up\n"

#have a nice day

This basically gives you a list of RSS feeds (in both OPML and TXT), but you still need to scrape them daily (or however often). If you’re using RSSOwl, “import” the OPML file. I started by doing this, but decided to cron it instead with two scripts.

The script collects the RSS files, calls the perl script to do some of the cleaning, combines the cleaned information into a cumulative file, and then deletes the temp files. Note that Twitter only lets you get 150 RSS feeds within a short amount of time — any more and it cuts you off. As such you’ll want to stagger the cron jobs. To see whether you’re running into trouble, the file project/masterlog.txt counts how many “400 Error” messages turn up per run. Usually these are Twitter turning you down because you’ve already collected a lot of data in a short amount of time. If you get this a lot, try splitting a large community in half and/or spacing out your crons a bit more and/or changing your IP address.

#collect twitter feeds, reshape from individual rss/html files into single tab-delimited text file

#adjust these two paths for your own setup
PARENTPATH=~/Documents/project
TEMPDIR=$PARENTPATH/temp_$1

DATESTAMP=`date '+%Y%m%d'`

#get current first page of twitter feeds, downloading into a scratch directory
mkdir $TEMPDIR
cd $TEMPDIR
wget -N --random-wait --output-file=log.txt --input-file=$PARENTPATH/lists/twitrss_$1.txt

#count "400" errors (ie, server refusals) in log.txt, report to master log file
echo "$1  $DATESTAMP" >> $PARENTPATH/masterlog.txt
grep 'ERROR 400\: Bad Request' log.txt | wc -l >> $PARENTPATH/masterlog.txt

#(re)create simple list of files
sed -e 's/http:\/\/\/statuses\/user_timeline\///' $PARENTPATH/lists/twitrss_$1.txt > $PARENTPATH/lists/twitrssfilesonly_$1.txt

for i in $(cat $PARENTPATH/lists/twitrssfilesonly_$1.txt); do perl $PARENTPATH/ $i; done
for i in $(cat $PARENTPATH/lists/twitrssfilesonly_$1.txt); do cat $TEMPDIR/$i.txt >> $PARENTPATH/$1/cumulativefile_$1.txt ; done

#delete the individual feeds (keep only "cumulativefile") to save disk space
#alternately, could save as tgz
rm -r $TEMPDIR

#delete duplicate lines
sort $PARENTPATH/$1/cumulativefile_$1.txt | uniq > $PARENTPATH/$1/tmp 
mv $PARENTPATH/$1/tmp $PARENTPATH/$1/cumulativefile_$1.txt

#have a nice day

Most of the cleaning is accomplished by the Perl script below. It's unnecessary to cron this script as it's called by the Bash script above, but it should be kept in the same directory.

#!/usr/bin/perl
# by ghr
#this script cleans RSS files scraped by WGET 
#usually run automatically by the shell script above

use warnings; use strict;
die "usage: $0 <foo.rss>\n" unless @ARGV==1;

my $rawdata = shift(@ARGV);

my $channelheader = 1 ; #flag for in the <channel> (as opposed to <item>)
my $feed = "" ;   #name of the twitter feed <channel><title>
my $title = "" ;  #item title/content <item><title> (or <item><description> for Twitter)
my $date = "" ;   #item date <item><pubDate>
my $source = "" ; #item source (aka, iphone, blackberry, web, etc) <item><twitter:source>

print "starting to read $rawdata\n";

open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">$rawdata.txt") or die "error creating $rawdata.txt\n";
while (<IN>) {
	#find if in <item> (ie, have left <channel>)
	if($_ =~ m/^\s+\<item\>/) {
		$channelheader = 0;
	}
	#find title of channel
	if($channelheader==1) {
		if($_ =~ m/\<title\>/) {
			$feed = $_;
			$feed =~ s/\s+\<title\>(.+)\<\/title\>\n/$1/; #drop tags and EOL
			print "feed identified as: $feed\n";
		}
	}
	#find all <item> info and write out at </item>
	if($channelheader==0) {
		#note, cannot handle internal LF characters
		#doesn't crash but leaves in the leading tag
		#only an issue for title/description; ignore for now
		if($_ =~ m/\<title\>/) {
			$title = $_;
			$title =~ s/\015?\012?//g; #manual chomp, global to allow internal \n
			$title =~ s/\s+\<title\>//; #drop leading tag
			$title =~ s/\<\/title\>//; #drop closing tag
		}
		if($_ =~ m/\<pubDate\>/) {
			$date = $_;
			$date =~ s/\s+\<pubDate\>(.+)\<\/pubDate\>\n/$1/; #drop tags and EOL
		}
		if($_ =~ m/\<twitter\:source\>/) {
			$source = $_;
			$source =~ s/\s+\<twitter\:source\>(.+)\<\/twitter\:source\>\n/$1/; #drop tags and CRLF
			$source =~ s/&lt;a href=&quot;http:\/\/twitter\.com\/&quot; rel=&quot;nofollow&quot;&gt;(.+)&lt;\/a&gt;/$1/; #cleanup long sources
		}
		#when item close tag is reached, write out then clear memory
		if($_ =~ m/\<\/item\>/) {
			print OUT "\"$feed\"\t\"$date\"\t\"$title\"\t\"$source\"\n";
			#clear memory (for <item> fields)
			$title = "" ;
			$date = "" ;
			$source = "" ;
		}
	}
}
close IN;
close OUT;
print "done writing $rawdata.txt \n";

*In principle you could use Thunderbird instead of RSSOwl, but its RSS has some annoying bugs. If you do use Thunderbird, you’ll need to use this plug-in to export the mailboxes. Also note that by default, RSSOwl only keeps the 200 most recent posts. You want to disable this setting, either globally in the “preferences” or specifically in the “properties” of the particular feed.

March 1, 2011 at 4:51 am 4 comments

Scraping for Event History

| Gabriel |

As I’ve previously mentioned, there’s a lot of great data out there but much of it is ephemeral so if you’re interested in change (which given our obsession with event history, many sociologists are) you’ve got to know how to grab it. I provided a script (using cron and curl) for grabbing specific pages and timestamping them but this doesn’t scale up very well to getting entire sites, both because you need to specify each specific URL and because it saves a complete copy each time rather than the diff. I’ve recently developed another approach that relies on wget and rsync and is much better for scaling up to a more ambitious scraping project.

Note that because of subtle differences between dialects of Unix, I’m assuming Linux for the data collection but Mac for the data cleaning.* Using one or the other for everything requires some adjustments. Also note that because you’ll want to “cron” this, I don’t recommend running it on your regular desktop computer unless you leave it on all night. If you don’t have server space (or an old computer on which you can install Linux and then treat as a server), your cheapest option is probably to run it on a wall wart computer for about $100 (plus hard drive).

Wget is similar to curl in that it's a tool for downloading internet content, but it has several useful features, some of which aren't available in curl. First, wget can do recursion, which means it will automatically follow links and thus can get an entire site rather than just a page. Second, it reads links from a text file a bit better than curl. Third, it has a good time-stamping feature whereby you can tell it to only download new or modified files. Fourth, you can exclude files (e.g., video files) that are huge and that you're unlikely to ever make use of. Put these all together and it means that wget is scalable — it's very good at getting and updating several websites.

Unfortunately, wget is good at updating, but not at archiving. It assumes that you only want the current version, not the current version and several archival copies. Of course this is exactly what you do need for any kind of event history analysis. That’s where rsync comes in.

Rsync is, as the name implies, a syncing utility. It’s commonly used as a backup tool (both remote and local). However the simplest use for it is just to sync several directories and we’ll be applying it to a directory structure like this:


In this set up, wget only ever works on the “current” directory, which it freely updates. That is, whatever is in “current” is a pretty close reflection of the current state of the websites you’re monitoring. The timestamped stuff, which you’ll eventually be using for event history analysis, goes in the “backup” directories. Every time you run wget you then run rsync after it so that next week’s wget run doesn’t throw this week’s wget run down the memory hole.

The first time you do a scrape you basically just copy current/ to backup/t0. However if you were to do this for each scrape it would waste a lot of disk space since you’d have a lot of identical files. This is where incremental backup comes in, which Mac users will know as Time Machine. You can use hard links (similar to aliases or shortcuts) to get rsync to accomplish this.** The net result is that backup/t0 takes the same disk space as current/ but each subsequent “backup” directory takes only about 15% as much space. (A lot of web pages are generated dynamically and so they show up as “recently modified” every time, even if there’s no actual difference with the “old” file.) Note that your disk space requirements get big fast. If a complete scrape is X, then the amount of disk space you need is approximately 2 * X + .15 * X * number of updates. So if your baseline scrape is 100 gigabytes, this works out to a full terabyte after about a year of weekly updates.
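As a sanity check, the back-of-the-envelope formula from that last step works out like this (the 15% incremental figure is the post's own estimate and will vary with how dynamic your target sites are):

```python
def disk_needed(baseline_gb, updates, incremental_share=0.15):
    """Approximate space: two full copies plus one increment per update."""
    return 2 * baseline_gb + incremental_share * baseline_gb * updates

# 100 GB baseline, weekly updates for a year:
print(disk_needed(100, 52))  # 980.0 GB, i.e. roughly a terabyte
```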

Finally, when you’re ready to analyze it, just use mdfind (or grep) to search the backup/ directory (and its subdirectories) for the term whose diffusion you’re trying to track and pipe the results to a text file. Then use a regular expression to parse each line of this query into the timestamp and website components of the file path to see on which dates each website used your query term — exactly the kind of data you need for event history. Furthermore, you can actually read the underlying files to get the qualitative side of it.

So on to the code. The wget part of the script looks like this

DATESTAMP=`date '+%Y%m%d'`
cd ~/Documents/project
mkdir logs/$DATESTAMP
cd current
wget -S --output-file=../logs/$DATESTAMP/wget.log --input-file=../links.txt -r --level=3 -R mpg,mpeg,mp4,au,mp3,flv,jpg,gif,swf,wmv,wma,avi,m4v,mov,zip --tries=10 --random-wait --user-agent=""

That’s what it looks like the first time you run it. When you’re just trying to update “current/” you need to change “wget -S” to “wget -N” but aside from that this first part is exactly the same. Also note that if links.txt is long, I suggest you break it into several parts. This will make it easier to rerun only part of a large scrape, for instance if you’re debugging, or there’s a crash, or if you want to run the scrape only at night but it’s too big to completely run in a single night. Likewise it will also allow you to parallelize the scraping.

Now for the rsync part. After your first run of wget, run this code.

cd ..
rsync -a current/ backup/baseline/

After your update wget runs, you do this.

cd ..
cp -al backup/baseline/ backup/$DATESTAMP/
rsync -av --delete current/ backup/$DATESTAMP/

* The reason to use Linux for data collection is that OS X doesn’t include wget and has an older version of the cp command, though it’s possible to solve both issues by using Fink to install wget and by rewriting cp in Mac/BSD syntax. The reason to use Mac for data analysis is that mdfind is faster (at least once it builds an index) and can read a lot of important binary file formats (like “.doc”) out of the box, whereas grep only likes to read text-based formats (like “.htm”). There are apparently Linux programs (e.g., Beagle) that allow indexed search of many file formats, but I don’t have personal experience with using them as part of a script.

** I adapted this use of hard links and rsync from this tutorial, but note that there are some important differences. He's interested in a rolling "two weeks ago," "last week," "this week" type of thing, whereas I'm interested in absolute dates and don't want to overwrite them after a few weeks.

September 28, 2010 at 4:26 am 5 comments

(A lot of) Misc Links

| Gabriel |

  • A few years ago Jeremy revealed that his incredibly nerdy hobby was writing interactive fiction (think "Zork") about a procrastinating grad student trying to lash himself to the mast so he can get some work done instead of playing around on the internet. This gem of 21st-century literature might never have existed if Jeremy had just spent $10 on Freedom, a program that disables your internet connection for a few hours. I'd like to see my university buy a site license for this thing — we'd shoot to the top of the NRC rankings within a few years.
  • Internet culture has now produced the second derivative of awesome with the North Carolina A&T marching band doing an awesome instrumental cover of the Gregory Brothers’ awesome remix of Antoine Dodson’s awesome rapped news interview about his awesome defense of his sister from a decidedly not awesome attempted rapist.
  • Somebody at Coach handbags has been reading Joel Podolny (ht MR)
  • Good stuff on scraping websites with Python (ht Gelman’s blog)
  • Speaking of data mining, Ad Age has another article on how major corporations use it to assess the health of their brands.
  • One of my pet peeves is political journalism that explains statistically insignificant blips in opinion polls on the basis of obscure events, you know, stories like “Is Obama dropping in the polls because his American flag lapel pin was a little crooked last week?” Meanwhile back in reality, most voters can’t name a single Supreme Court justice. As such I thoroughly enjoyed this satire showing what political journalism would look like were it written by academic political scientists.
  • Matt Salganik has released the data from his amazingly good dissertation (i.e., the MP3 experiment with manipulated download counts). Also in there are the actual MP3s so you can finally hear what Stunt Monkey sounds like.
  • Rethinking Markets has a good account of how the writers at Scienceblogs revolted when the site tried to create a nutrition blog brought to you by Pepsi (yes, really). Most people see this as a glass half empty thing in that structural factors such as dependence on revenue skewed science journalism to promote corporate interests. Myself, I see the glass half full since labor went on strike when its professional ideology was threatened. More generally, the strength of the journalist ideology of “objectivity” is the main reason why I don’t believe in Marxist political economy, either structural or instrumental, about the mass media. I mean, how is the boss man supposed to exercise his theoretical power if every time he tries the journalists raise hell? In the long-run this could theoretically change, especially as the objectivity ideology is conspicuously weakening, but in the short to medium run I’m pretty confident about this.
  • Felix Salmon has a good column on how Hollywood managed to squash the HSX (i.e., betting on motion picture box office). I was personally disappointed by this because I was hoping to analyze the data, which would be a lot more interesting if it switched from Monopoly money to cash.
  • Rudix is a minimalist Unix package manager for Mac. It has a very limited selection of packages but could be good for people who have trouble getting Fink or MacPorts to work. The two packages of most interest to sociologists are probably gnuplot (statistical graphs) and wget (web scraping).
  • As previously expressed, I refuse to buy Kindle e-books because the architecture allows the server to delete files from the client, for instance if Amazon gets a court order to this effect. While I more or less trust US courts to respect a free press, I am profoundly cynical about foreign courts and the libel and hate/blasphemy laws they adjudicate. Likewise, I have no confidence in Amazon’s willingness to defy court orders to throw my books down the memory hole if a court can credibly threaten to freeze its foreign assets. You’d have to be nuts to buy a Kindle edition of, for instance, Orianna Fallaci’s last few books. Fortunately Congress passed a law barring US courts from enforcing foreign (read: British) libel judgements. Now if they’d only extend this to cover blasphemy/hate speech I can begin to wean myself off of paper. Better yet, Amazon could just change the architecture of the device so that it’s impossible to delete anything from it without client-side approval.

September 1, 2010 at 4:17 am

Using R to parse (a lot of) HTML tables

| Gabriel |

For a few months I’ve been doing a daily scrape of a website but I’ve put off actually parsing the data until a colleague was dealing with a similar problem, and solving his problem reminded me of my problem. The scrape creates a folder named after the date with several dozen html files in it. So basically, the data is stored like this:


Each html file has one main table along with a couple of sidebar tables. For each html file, I want to extract the main table and write it to a text file. These text files will be put in a “clean” directory that mirrors the “raw” directory.

This is the kind of thing most people would do in Perl (or Python). I had trouble getting the Perl HTML libraries to load although I probably could have coded it from scratch since HTML table structure is pretty simple (push the contents of <td> tags to an array, then write it out and clear the memory when you hit a </tr> tag). In any case, I ended up using R’s XML library, which is funny because usually I clean data in Perl or Stata and use R only as a last resort. Nonetheless, in what is undoubtedly a sign of the end times, here I am using R for cleaning. Forty years of darkness; The dead rising from the grave; Cats and dogs living together; Mass hysteria!
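For what it's worth, the from-scratch approach sketched above (push the contents of <td> tags to an array, write the row out at </tr>) really is only a few lines with Python's built-in html.parser, for anyone who'd rather not fight the Perl libraries either. This is my illustration, not the code I actually used:

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect rows of a <table>: accumulate cell text, flush the row at </tr>."""
    def __init__(self):
        super().__init__()
        self.rows, self.row = [], []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self.in_cell = True
            self.row.append("")

    def handle_data(self, data):
        if self.in_cell:
            self.row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False
        elif tag == "tr" and self.row:
            self.rows.append(self.row)  # flush row, then clear memory
            self.row = []

html = "<table><tr><td>a</td><td>b</td></tr><tr><td>1</td><td>2</td></tr></table>"
p = TableParser()
p.feed(html)
print(p.rows)  # [['a', 'b'], ['1', '2']]
```

Unlike readHTMLTable, this naive version makes no attempt to separate a main table from sidebar tables, which is the part R's XML library handles for free below.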

Anyway, the first step is to get a list of the directories in “raw” and use that to seed the top level loop. (Though note that R’s XML library can also read data directly off the web). Within this loop I create a clean subdirectory to mirror the raw subdirectory. I then get a list of every file in the raw subdirectory and seed the lower level loop. The lower level loop reads each file with “readHTMLTable” and writes it out to the mirroring clean subdirectory. Then I come out of both loops and don’t really care if the top is still spinning.

# File-Name:       websiteclean.R
# Date:            2010-07-28
# Author:          Gabriel Rossman
# Purpose:         parse the scraped files
# Packages Used:   XML

library(XML)

parentpath <- "~/Documents/project" #adjust to your own directory structure
rawdir <- paste(parentpath,'/raw', sep="")
setwd(rawdir)
dirlist <- list.files()
for (dir in dirlist) {
	rawsubdir<-paste(rawdir,'/',dir, sep="")
	setwd(rawsubdir)
	filenames <- list.files()
	cleandir<-paste(parentpath,'/clean/',dir, sep="") #create ../../clean/`dir' and call it `cleandir'
	shellcommand<-paste("mkdir ",cleandir, sep="")
	system(shellcommand)
	print(cleandir) #progress report
	for (targetfile in filenames) {
		setwd(rawsubdir)
		datafromtarget = readHTMLTable(targetfile, header=FALSE)
		setwd(cleandir)
		outputfile<-paste(targetfile,'.txt', sep="")
		write.table(datafromtarget[1], file = outputfile , sep = "\t", quote=TRUE) #limit to subobject 1 to avoid the sidebar tables
	}
}

# have a nice day

July 29, 2010 at 4:58 am 2 comments
