Scraping 101

June 9, 2009 at 5:59 am 4 comments

| Gabriel |

One of the great things about being a cultural sociologist is that there’s so much data out there for the taking. However, much of it is ephemeral, so you need to know how to get it quickly and effortlessly.

In grad school I spent hundreds of hours dragging and dropping radio data from IE to Excel. While this was relaxing and gave me a Sisyphean sense of accomplishment, it was otherwise a waste of time, as there are much more efficient ways to do this. The two most basic things to know are the Unix commands “cron” (a scheduling daemon) and “curl” (a command-line tool for downloading web pages). This will be most effective if you have a machine that’s always on, such as a server. Also note that while I don’t think there’s anything dangerous involved here, it does involve going “sudo” (i.e., taking the safety off of Unix), so to be cautious I suggest bracketing it from your main computer, either by doing it inside a virtual machine or by putting it on a dedicated user account created only for this purpose. (Even though all this should work on my Mac, I created and debugged it in an Ubuntu virtual machine and plan to ultimately run it on my old Ubuntu desktop.)

First you need to make sure curl is installed. In Ubuntu go to the terminal and type

sudo apt-get install curl

Then you need to create a directory to store the data and keep the script. I decided to call the directory “~/scrape” and the script “~/scrape/”. You can create the shell script with any text editor. After the comments, the first thing in the script should be to create a timestamp variable (I’m using “YYYYMMDD” format to make the files easier to sort). This will let you create a new version of each file each time rather than overwriting the old one. Then you use a series of curl commands to scrape various websites.

#!/bin/sh
# this script scrapes the following URLs and saves them to timestamped files
# (the example.com addresses are placeholders; substitute the pages you actually want)
TIMESTAMP=`date '+%Y%m%d_%H%M'`
curl -o ~/scrape/cc_$TIMESTAMP.htm http://example.com/first-page
curl -o ~/scrape/se_$TIMESTAMP.htm http://example.com/second-page

The next step is to use the terminal to make the shell script executable.

sudo chmod +x ~/scrape/
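Assuming the script ended up with a name like scrape.sh (a stand-in here; use whatever you actually called it), it’s worth running it once by hand before handing it over to cron:

```shell
# make the script executable and do one manual test run;
# "scrape.sh" is a placeholder name for your actual script
chmod +x ~/scrape/scrape.sh
~/scrape/scrape.sh
ls ~/scrape    # freshly timestamped .htm files should appear
```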

Finally, you need to cron it so it runs by itself at regular intervals. In Ubuntu you go to the terminal and type

crontab -e

You then get a choice of terminal-based text editors. If you’re not a masochist you’ll choose nano (which will look very familiar if you’ve ever used Pine mail). Then you have to add a line scheduling the job. For instance, if you want it to run once a day, just before midnight, you would enter

59 23 * * * ~/scrape/
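For reference, the five fields before the command are minute, hour, day of month, month, and day of week, with an asterisk meaning “every.” A few variants, again using scrape.sh as a placeholder script name:

```shell
# minute hour day-of-month month day-of-week   command
59 23 * * *   ~/scrape/scrape.sh   # every day at 11:59 pm
0  *  * * *   ~/scrape/scrape.sh   # at the top of every hour
30 2  * * 0   ~/scrape/scrape.sh   # Sundays at 2:30 am
```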

That’s all you really need, but if you don’t already back up the entire account you might want to add a cp or rsync command to cron to back up your scrapes every day or every week.
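A minimal version of that backup, assuming the scrapes sit in ~/scrape and you want copies in a ~/scrape_backup directory (both paths are placeholders), would be:

```shell
# copy all scraped pages into a backup directory
mkdir -p ~/scrape_backup
cp ~/scrape/*.htm ~/scrape_backup/
```

Drop those two lines into a script and cron it weekly, or swap cp for rsync if you want it to skip files that haven’t changed.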

Once you have all this data collected you’ll need to clean it, probably with regular expressions in Perl, though TextWrangler (on the Mac) is about as good as an interactive GUI program can be for such a thing. Also note that this is going to produce a lot of files, so after you clean them you’re going to want a Stata script that can recognize a large batch of files and loop over them.
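As a crude sketch of the batch idea (the real cleaning will be site-specific and much uglier), a shell loop can run every scraped file through sed, stripping HTML tags and writing a .txt version alongside each original:

```shell
# strip HTML tags from every scraped page as a rough first pass at cleaning;
# real cleaning will need site-specific regular expressions
for f in ~/scrape/*.htm; do
  sed -e 's/<[^>]*>//g' "$f" > "${f%.htm}.txt"
done
```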




  • 1. Vincent  |  June 9, 2009 at 8:12 pm

    Off-topic, but I wanted to say that I enjoy the blog tremendously. It is one of the best additions I’ve made to my rss reader in a looong time.



  • 2. flanuese  |  June 16, 2009 at 2:03 am

    Great post. Very useful. Rare to find a cultural sociologist familiar with unix.

  • 3. and server logs « Code and Culture  |  April 2, 2010 at 5:15 am

    […] recently started scraping a website using curl and cron (for earlier thoughts on this see here). Because I don’t leave my mac turned on at 2am, I’m hosting the scrape on one of the […]

  • 4. Scraping for Event History « Code and Culture  |  September 28, 2010 at 4:31 am

    […] interested in change (which given our obsession with event history, many sociologists are) you’ve got to know how to grab it. I provided a script (using cron and curl) for grabbing specific pages and timestamping them but […]

