Gabriel
One of the great things about being a cultural sociologist is that there’s so much data out there for the taking. However, much of it is ephemeral, so you need to know how to get it quickly and effortlessly.
In grad school I spent hundreds of hours dragging and dropping radio data from IE to Excel. While this was relaxing and gave me a Sisyphean sense of accomplishment, it was otherwise a waste of time, as there are much more efficient ways to do this. The two most basic things to know are the Unix commands “cron” (a scheduling daemon) and “curl” (a command-line tool for downloading web pages). This works best if you have a computer that’s always on, such as a server. Also note that while I don’t think there’s anything dangerous involved here, it does involve going “sudo” (i.e., taking the safety off of Unix), so to be cautious I suggest bracketing it from your main computer, either by doing it inside a virtual machine or by putting it on a dedicated user account created only for this purpose. (Even though all this should work on my Mac, I created and debugged it in an Ubuntu virtual machine and plan to ultimately run it on my old Ubuntu desktop.)
First you need to make sure curl is installed. In Ubuntu, go to the terminal and type
sudo apt-get install curl
Then you need to create a directory to store the data and keep the script. I decided to call the directory “~/scrape” and the script “~/scrape/scrapescript.sh”. You can create the shell script with any text editor. After the comments, the first thing in the script should be to create a timestamp variable. (I’m using “YYYYMMDD_HHMM” so the files sort chronologically.) This lets you create a new version of each file rather than overwriting the old one each time. Then you use a series of curl commands to scrape various websites.
#!/bin/bash
# this script scrapes the following URLs and saves them to timestamped files
TIMESTAMP=`date '+%Y%m%d_%H%M'`
curl -o ~/scrape/cc_$TIMESTAMP.htm https://codeandculture.wordpress.com
curl -o ~/scrape/se_$TIMESTAMP.htm http://soc2econ.wordpress.com
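If you want to preview what the timestamp (and hence the filenames) will look like, you can run the same substitution by hand:

```shell
# same backtick substitution as in the script; TIMESTAMP comes out as
# YYYYMMDD_HHMM so the resulting filenames sort chronologically
TIMESTAMP=`date '+%Y%m%d_%H%M'`
echo "cc_$TIMESTAMP.htm"
```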
The next step is to use the terminal to make the shell script executable.
sudo chmod +x ~/scrape/scrapescript.sh
Finally, you need to cron it so it runs by itself at regular intervals. In Ubuntu you go to the terminal and type
crontab -e
You then get a choice of terminal-based text editors. If you’re not a masochist you’ll choose nano (which is very familiar if you’ve ever used pine mail). Then you have to add a line scheduling the job. For instance, if you want it to run once a day just before midnight, you would enter
59 23 * * * ~/scrape/scrapescript.sh
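For reference, the five fields are minute, hour, day of month, month, and day of week. A couple of variations on the same job (paths as above) would look like:

```
# m   h  dom mon dow  command
59   23   *   *   *   ~/scrape/scrapescript.sh   # daily at 11:59 PM
0     3   *   *   0   ~/scrape/scrapescript.sh   # weekly, Sundays at 3:00 AM
```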
That’s all you really need, but if you don’t already back up the entire account, you might want to add a cp or rsync command to cron to back up your scrapes every day or every week.
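As a sketch of what that backup script might look like (the destination ~/scrape_backup is my own invention, substitute whatever you like), using cp since it’s always installed:

```shell
#!/bin/bash
# copy everything in the scrape directory into a backup directory
# (~/scrape_backup is an assumption -- point it wherever you like)
mkdir -p ~/scrape ~/scrape_backup
cp -r ~/scrape/. ~/scrape_backup/
```

Save it as its own .sh file, chmod +x it, and give it its own crontab line just like the scraper.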
Once you have all this data collected you’ll need to clean it, probably with regular expressions in Perl, though TextWrangler (on the Mac) is about as good as an interactive GUI program can be for such a thing. Also note that this is going to produce a lot of files, so after you clean them you’re going to want a Stata script that can recognize a large batch of files and loop over them.