Posts tagged ‘shell’

Scraping for Event History

| Gabriel |

As I’ve previously mentioned, there’s a lot of great data out there, but much of it is ephemeral, so if you’re interested in change (which, given our obsession with event history, many sociologists are) you’ve got to know how to grab it. I provided a script (using cron and curl) for grabbing specific pages and timestamping them, but this doesn’t scale up well to getting entire sites, both because you have to specify every URL and because it saves a complete copy each time rather than just the diff. I’ve recently developed another approach that relies on wget and rsync and is much better suited to scaling up to a more ambitious scraping project.

Note that because of subtle differences between dialects of Unix, I’m assuming Linux for the data collection but Mac for the data cleaning.* Using one or the other for everything requires some adjustments. Also note that because you’ll want to “cron” this, I don’t recommend running it on your regular desktop computer unless you leave it on all night. If you don’t have server space (or an old computer on which you can install Linux to treat as a server), your cheapest option is probably to run it on a wall-wart computer for about $100 (plus hard drive).

Wget is similar to curl in that it’s a tool for downloading internet content, but it has several useful features, some of which aren’t available in curl. First, wget can do recursion, which means it will automatically follow links and thus can get an entire site rather than just a page. Second, it reads links from a text file a bit better than curl does. Third, it has a good time-stamping feature that lets you tell it to download only new or modified files. Fourth, you can exclude files (e.g., video files) that are huge and that you’re unlikely to ever use. Put these all together and it means that wget is scalable — it’s very good at getting and updating several websites.

Unfortunately, wget is good at updating, but not at archiving. It assumes that you only want the current version, not the current version and several archival copies. Of course this is exactly what you do need for any kind of event history analysis. That’s where rsync comes in.

Rsync is, as the name implies, a syncing utility. It’s commonly used as a backup tool (both remote and local). However, the simplest use for it is just to sync several directories, and we’ll be applying it to a directory structure like this:

project/
  current/
  backup/
    t0/
    t1/
    t2/
  logs/

In this setup, wget only ever works on the “current” directory, which it freely updates. That is, whatever is in “current” is a pretty close reflection of the current state of the websites you’re monitoring. The timestamped stuff, which you’ll eventually be using for event history analysis, goes in the “backup” directories. Every time you run wget, you then run rsync after it so that next week’s wget run doesn’t throw this week’s run down the memory hole.

The first time you do a scrape you basically just copy current/ to backup/t0. However, if you did this for every scrape it would waste a lot of disk space, since you’d have a lot of identical files. This is where incremental backup comes in, which Mac users will know as Time Machine. You can use hard links (similar to aliases or shortcuts) to get rsync to accomplish this.** The net result is that backup/t0 takes the same disk space as current/ but each subsequent “backup” directory takes only about 15% as much space. (A lot of web pages are generated dynamically and so they show up as “recently modified” every time, even if there’s no actual difference from the “old” file.) Note that your disk space requirements get big fast. If a complete scrape is X, then the amount of disk space you need is approximately 2 * X + .15 * X * number of updates. So if your baseline scrape is 100 gigabytes, this works out to nearly a full terabyte after about a year of weekly updates (2 * 100 + .15 * 100 * 52 ≈ 980 gigabytes).

Finally, when you’re ready to analyze the data, just use mdfind (or grep) to search the backup/ directory (and its subdirectories) for the term whose diffusion you’re trying to track and pipe the results to a text file. Then use a regular expression to parse each line of this query into the timestamp and website components of the file path to see on which dates each website used your query term — exactly the kind of data you need for event history. Furthermore, you can actually read the underlying files to get the qualitative side of it.
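For instance, here’s a minimal sketch, assuming the backup directories are datestamped as in the rsync code below and that “myterm” stands in for whatever term you’re actually tracking:

# list every archived file that mentions the term (run from the project directory)
grep -rl 'myterm' backup/ > hits.txt
# on a Mac you could use Spotlight instead, which also reads binary formats:
# mdfind -onlyin backup/ 'myterm' > hits.txt
# paths look like backup/20100928/www.example.org/page.htm, so keep just the
# timestamp and website components, one spell per line
sed -nE 's|^backup/([0-9]+)/([^/]+)/.*|\1 \2|p' hits.txt | sort -u > spells.txt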

So on to the code. The wget part of the script looks like this:

DATESTAMP=`date '+%Y%m%d'`
cd ~/Documents/project
mkdir logs/$DATESTAMP
cd current
wget -S \
  --output-file=../logs/$DATESTAMP/wget.log \
  --input-file=../links.txt \
  -r --level=3 \
  -R mpg,mpeg,mp4,au,mp3,flv,jpg,gif,swf,wmv,wma,avi,m4v,mov,zip \
  --tries=10 --random-wait --user-agent=""

That’s what it looks like the first time you run it. When you’re just trying to update “current/” you need to change “wget -S” to “wget -N”, but aside from that this first part is exactly the same. Also note that if links.txt is long, I suggest you break it into several parts. This will make it easier to rerun only part of a large scrape, for instance if you’re debugging, if there’s a crash, or if you want to run the scrape only at night but it’s too big to finish in a single night. Likewise it will also allow you to parallelize the scraping.
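Here’s a minimal sketch of that idea, run from the project directory; the 500-line chunk size and the “links_” prefix are arbitrary choices:

# break links.txt into 500-line chunks named links_aa, links_ab, ...
split -l 500 links.txt links_
# then point each wget run (or each night's cron job) at its own chunk, e.g.
# wget -N --output-file=../logs/$DATESTAMP/wget_aa.log --input-file=../links_aa ...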

Now for the rsync part. After your first run of wget, run this code.

cd ..
# seed the archive; "baseline" plays the role of t0 in the directory sketch above
rsync -a current/ backup/baseline/

After your update wget runs, you do this.

cd ..
# make a hard-linked copy of the previous snapshot (fast and nearly free of disk space)
cp -al backup/baseline/ backup/$DATESTAMP/
# then overwrite just the files that actually changed; rsync breaks the hard link
# when it replaces a file, so the older snapshots are left intact
rsync -av --delete current/ backup/$DATESTAMP/

* The reason to use Linux for data collection is that OS X doesn’t include wget and has an older version of the cp command, though it’s possible to solve both issues by using Fink to install wget and by rewriting cp in Mac/BSD syntax. The reason to use Mac for data analysis is that mdfind is faster (at least once it builds an index) and can read a lot of important binary file formats (like “.doc”) out of the box, whereas grep only likes to read text-based formats (like “.htm”). There are apparently Linux programs (e.g., Beagle) that allow indexed search of many file formats, but I don’t have personal experience with using them as part of a script.

** I adapted this use of hard links and rsync from this tutorial, but note that there are some important differences. He’s interested in a rolling “two weeks ago,” “last week,” “this week” type of thing, whereas I’m interested in absolute dates and don’t want to overwrite them after a few weeks.

September 28, 2010 at 4:26 am 5 comments

Hide iTunes Store/Ping

| Gabriel |

As a cultural sociologist who has published research on music as cultural capital, I understand how my successful presentation of self depends on me making y’all believe that I only listen to George Gershwin, John Adams, Hank Williams, the Raveonettes, and Sleater-Kinney, as compared to what I actually listen to 90% of the time, which is none of your fucking business. For this reason, along with generally being a web 2.0 curmudgeon, I’m not exactly excited about iTunes’ new social networking feature “Ping.” Sure, it’s opt-in and I appreciate that, but I am so actively uninterested in it that I don’t even want to see its icon in the sidebar. Hence I wrote this one-liner that hides it (along with the iTunes Store).

defaults write com.apple.iTunes disableMusicStore -bool TRUE

My preferred way of running it is to use Automator to run it as a service in iTunes that takes no input.

Unfortunately it also hides the iTunes Store, which I actually use a couple times a month. To get the store (and unfortunately, Ping) back, either click through an iTunes store link in your web browser or run the command again but with the last word as “FALSE”.
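That is:

defaults write com.apple.iTunes disableMusicStore -bool FALSE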

September 13, 2010 at 4:44 am 2 comments

Some ways Stata is an unusual language

| Gabriel |

As I’ve tried to learn other languages, I’ve realized that part of the difficulty isn’t that they’re hard (although in some cases they are) but that I’m used to Stata’s very distinctive paradigm and nomenclature. Some aspects of Stata are pretty standard (e.g., “while”/”foreach”/”forvalues” loops, log files, and the “file” syntax for using text files on disk), but other bits are pretty strange. Or rather, they’re strange from a computer science perspective but intuitive from a social science perspective.

Stata seems to have been designed to make sense to social scientists, and if this makes it confusing to programmers, then so be it. A simple example of this is that Stata uses the word “variable” in the sense meant by social scientists. More broadly, Stata is pretty bold about defaults so as to make things easy for beginners. It presumes that anything you’re doing applies to the dataset (aka the master data) — which is always a flat-file database. Other things that might be held in memory have a secondary status, and beginning users don’t even know that they’re there. Likewise, commands distinguish between the important arguments (usually variables) and the secondary arguments, which Stata calls “options.” There are also very sensible assumptions about what to report and what to put in ephemeral data objects that can be accessed immediately after the primary command (but need not be stored as part of the original command, as they would be in most other languages).

Note, I’m not complaining about any of this. Very few of Stata’s quirks are pointlessly arbitrary. (The only arbitrary deviation I can think of is using “*” instead of “#” for commenting.) Most of Stata’s quirks are necessary in order to make it so user-friendly to social scientists. In a lot of ways R is a more conventional language than Stata, but most social scientists find Stata much easier to learn. In part because Stata is willing to deviate from the conventions of general purpose programming languages, running and interpreting a regression in Stata looks like “reg y x” instead of “summary(lm(y~x))”, and loading a dataset looks like “use mydata, clear” instead of “data <- read.table("mydata.txt")”. Stata has some pretty complicated syntax (e.g., the entire Mata language) but you can get a lot done with just a handful of simple commands like “use,” “gen,” and “reg”.

Nonetheless all this means that when Stata native speakers like me learn a second programming language it can be a bit confusing. And FWIW, I worry that rumored improvements to Stata (such as allowing relational data in memory) will detract from its user-friendliness. Anyway, the point is that I love Stata and I think it’s entirely appropriate for social scientists to learn it first. I do most of my work in Stata and I teach/mentor my graduate students in Stata unless there’s a specific reason for them to learn something else. At the same time I know that many social scientists would benefit a lot from also learning other languages. For instance, people into social networks should learn R, people who want to do content analysis should learn Perl or Python, and people who want to do simulations should learn NetLogo or Java. The thing is that when you do, you’re in for a culture shock and so I’m making explicit some ways in which Stata is weird.

Do-files and Ado-files. In any other language a do-file would be called a script and an ado-file would be called a library. Also note that Stata very conveniently reads all your ado-files automatically, whereas most other languages require you to specifically load the relevant libraries into memory at the beginning of each script.

Commands, Programs, and Functions. In Stata a program is basically just a command that you wrote yourself. Stata is somewhat unusual in drawing a distinction between a command/program and a function. So in Stata a function usually means some kind of transformation that attaches its output to a variable or macro, as in “gen ln_income=log(income)”. In contrast a command/program is pretty much anything that doesn’t directly attach to an operator, and includes all file operations (e.g., “use”) and estimations (e.g., “regress”). Other languages don’t really draw this distinction but consider everything a function, no matter what it does and whether the user wrote it or not. (Some languages use “primitive” to mean something like the Stata command vs. program distinction, but it’s not terribly important.)

Because most languages only have functions, pretty much everything has to be assigned to an object via an operator. Hence Stata users would usually type “reg y x” whereas R users would usually type “myregression <- lm(y~x)”. This is because “regress” in Stata is a command whereas “lm()” in R is a function. Also note that Stata distinguishes commands from everything else by word order, with the command being the first word. In contrast, functions in other languages (just like Stata functions) put the function name outside the parentheses and all of the arguments, both data objects and options, inside.

The Dataset. Stata is one of the only languages where it’s appropriate to use the definite article in reference to data. (NetLogo is arguably another case of this). In other languages it’s more appropriate to speak of “a data object” than “the dataset,” even if there only happens to be one data object in memory. For the same reason, most languages don’t “use” or “open” data, but “read” the data and assign it to an object. Another way to think about it is that only Stata has a “dataset” whereas other languages only have “matrices.” Of course, Stata/Mata also has matrices but most Stata end users don’t bother with them as they tend to be kind of a backend thing that’s usually handled by ado-files. Furthermore, in other languages (e.g., Perl) it’s common to not even load a file into memory but to process it line-by-line, which in Stata terms is kind of like a cross between the “file read/write” syntax and a “while” loop.

Variables. Stata uses the term “variable” in the statistical or social scientific meaning of the term. In other languages this would usually be called a field or vector.

Macros. What most other languages call variables, Stata calls local and global “macros.” Stata’s usage of the local vs global distinction is standard. In other languages the concept of “declaring” a variable is usually a little more explicit than it is in Stata.

Stata is extremely good about expanding macros in situ and this can spoil us Stata users. In other languages you often have to do a crude workaround: first use some kind of concatenation function to create a string object containing the expansion, and only then use that string object. For instance, if you wanted to access a series of numbered files in Stata you could just loop over this:

use ~/project/file`i', clear 

In other languages you’d have to add a separate line for the expansion. So in R you’d loop over:

filename <- paste('~/project/file',i, sep="")
data <- read.table(filename)

[Update: Also see this Statalist post by Nick Cox on the distinction between variables and macros]

Reporting. Stata allows you to pass estimations on for further work (that’s what return macros, ereturn matrices, and postestimation commands are all about), but it assumes you probably won’t, and so it is unusually generous in reporting most of the really interesting things after a command. In other languages you usually have to specifically ask for this level of reporting. Another way to put it is that in Stata verbosity is the default and can be suppressed with “quietly,” whereas in R silence is the default and verbosity can be invoked by wrapping the estimation (or an object saving the estimation) in the “summary()” function.

August 6, 2010 at 4:36 am 12 comments

importspss.ado (requires R)

| Gabriel |

Mike Gruszczynski has a post up pointing out that you can use R to translate files, for instance from SPSS to Stata. I like this a lot because it lets you avoid using SPSS, but I’d like it even better if it let you avoid using R as well.

As such I rewrote the script to work entirely from Stata. Mike wanted to do this in Bash but couldn’t figure out how to pass arguments from the shell to R. Frankly, I don’t know how to do this either, which is why my solution is to have Stata write and execute an R source file so that all the argument passing occurs within Stata. This follows my general philosophy of doing a lot of code mise en place in a user-friendly language so I can spend as little time as necessary in R. (Note that you could just as easily write this in Bash, but I figured this way you can a) make it cross-platform and b) attach it to “use” for a one-stop-shop “import” command.)
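For what it’s worth, a Bash version might look something like the following sketch. It uses a here-document so that the shell expands the arguments into the R source file, which is the same trick the ado-file below plays from within Stata; the R calls are the ones used below, but the script itself (and the names importspss.sh and convert.R) are hypothetical and untested:

#!/bin/bash
# importspss.sh -- usage: importspss.sh spssfile statafile
# write a throwaway R script with the filenames already expanded, run it, clean up
cat > convert.R <<EOF
library(foreign)
data <- read.spss("$1", to.data.frame=TRUE)
write.dta(data, file="$2")
EOF
R --vanilla < convert.R
rm convert.R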

*importspss.ado
*by GHR 6/29/2010
*this script uses R to translate SPSS to Stata
*it takes as arguments the SPSS file and Stata file
*adapted from http://mikegruz.tumblr.com/post/704966440/convert-spss-to-stata-without-stat-transfer 

*DEPENDENCY: R and library(foreign) 
*if R exists but is not in PATH, change the reference to "R" in the "shell" line below to the specific location

capture program drop importspss
program define importspss
	set more off
	local spssfile `1'
	if "`2'"=="" {
		local statafile "`spssfile'.dta"
	}
	else {
		local statafile `2'	
	}
	local sourcefile=round(runiform()*1000)
	capture file close rsource
	file open rsource using `sourcefile'.R, write text replace
	file write rsource "library(foreign)" _n
	file write rsource `"data <- read.spss("`spssfile'", to.data.frame=TRUE)"' _n
	file write rsource `"write.dta(data, file="`statafile'")"' _n
	file close rsource
	shell R --vanilla <`sourcefile'.R
	erase `sourcefile'.R
	use `statafile', clear
end
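Once the program is loaded, usage would be something like this (hypothetical filenames; the second argument is optional and defaults to the SPSS filename with “.dta” appended):

importspss mydata.sav mydata.dta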

June 29, 2010 at 3:01 pm 6 comments

Gnuplot [updated]

| Gabriel |

Here’s an updated version of my old script for piping from Stata to Gnuplot. Gnuplot is a generic graphing utility. The main thing it can do that Stata can’t is good 3D graphs (although the ado-files “surface” and “tddens” make a good effort at adding this capability). In this version I set it to a coarser smooth, set the horizontal plane to the minimum value for Z, and allow either contour or surface.

Note that the dgrid3d smoothing algorithm in Gnuplot 4.2 (which is still the version maintained in Fink and Ubuntu) has some problems — the most annoying of which is that it can go haywire at the borders. I decided to handle this by setting it to a coarse smooth (5,5,1). However, if unlike me you’re lucky enough to have Gnuplot 4.3, you have access to several much better smoothing algorithms and you may want to experiment with increasing the resolution and using one of the new algorithms by editing the “set dgrid3d” line of this script.

*GHR 5/25/2010
*gnuplotpm3d 1.1
*pipes data to gnuplot's splot command, useful for color surface graphs
capture program drop gnuplotpm3d
program define gnuplotpm3d
syntax varlist(min=3 max=3 numeric) [ , title(string asis) xlabel(string asis) ylabel(string asis) using(string asis) contour rotate(string asis)]

disp "`rotate'"

*parse the varlist; assumes the variables are ordered x y z (the column order splot reads)
tokenize `varlist'
if "`xlabel'"=="" {
local xlabel="`1'"
}
if "`ylabel'"=="" {
local ylabel="`2'"
}
tempfile gnuplotpm3cmd
tempfile data

preserve
keep `varlist'
order `varlist'
disp "`using'"
outsheet using `data'.txt
restore

*assumes that the software is running on a mac using Gnuplot.app
*if installed with fink or macports, set to the default (Unix) mode of just "gnuplot"
local gnuplot="gnuplot"
if "`c(os)'"=="MacOSX" {
	local gnuplot="exec '/Applications/Gnuplot.app/Contents/Resources/bin/gnuplot'"  /*assumes binary install, set to simply "gnuplot" if fink or macports install*/
}
if "`c(os)'"=="Windows" {
	*not tested
	local gnuplot="gnuplot.exe"  
}

file open gpcmd using `gnuplotpm3cmd', write text
file write gpcmd "cd '`c(pwd)'' ; " _n
file write gpcmd "set terminal postscript color ; set output '`using'.eps' ; set palette color positive ; " _n
file write gpcmd "set auto ; set parametric ; " _n
file write gpcmd "set dgrid3d 5,5,1 ; " _n  /*interpolation algorithm, allows tolerance for data irregularities -- set to low numbers for smooth graph and high numbers for bumpier graph*/
file write gpcmd `"set title "`title'" ; set ylabel "`ylabel'"; set xlabel "`xlabel'"; "' _n
file write gpcmd "unset contour; unset surface; " _n
if "`contour'"=="contour" {
	file write gpcmd "set view map; " _n
}
if "`rotate'"!="" & "`contour'"!="contour" {
	file write gpcmd "set view `rotate'; " _n
}
if "`contour'"=="contour" & "`rotate'"!="" {
	disp "rotate options ignored as they are incompatible with contour view"
}
file write gpcmd "set xyplane 0; " _n  /*put the xy plane right at min(z) */
file write gpcmd "set pm3d;" _n
file write gpcmd "set pm3d interpolate 10,10; " _n
file write gpcmd `"splot "`data'.txt"; "' _n
file close gpcmd

shell `gnuplot' `gnuplotpm3cmd'
end
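Once loaded, a call might look like this (hypothetical variable and file names; as noted in the code, the variables are assumed to be ordered x y z):

gnuplotpm3d x y z, title("my surface") using(mygraph) contour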

May 25, 2010 at 3:01 pm 2 comments

Cited reference search time-series

| Gabriel |

[Update: it looks like Google redirects you to a captcha after the first 100 pages, so the script won’t work for mega-cites like DiMaggio and Powell 1983.]

I was recently talking to somebody who suspected an article he wrote 30 years ago was something of a “sleeper hit” and wanted to see an actual time-series. I wrote this little script to read Google Scholar and extract the dates. You have to tell it the Google Scholar serial number for the focal cite and how many pages to collect.

For instance, if you search GS for Strang and Soule’s ARS article and click where it says “Cited by 493” you get the URL “http://scholar.google.com/scholar?cites=3071200965662451019&hl=en&as_sdt=2000”. The important part of the URL is the number between “cites=” and “&”. To figure out how many pages to collect, divide the number of citations by 10 and round down. So the syntax to scrape for this cite would be:

bash gscholarscrape.sh 3071200965662451019 49

Here’s the time-series for citations to that article.

Here’s the code. Note that with fairly little modification you could get it to also give the names of the citing journals or books and their authors.

#!/bin/bash
# gscholarscrape.sh
# this script scrapes google scholar for references to a given cite
# GHR, rossman@soc.ucla.edu

#takes as arguments the serial number of the cite followed by the number of pages deep to scrape (# of cites / 10)
#eg, for DiMaggio and Powell ASR 1983 the syntax is
#gscholarscrape.sh 11439231157488236678 1103
for (( i = 0; i < $2; i++ )); do
	j=$(($i*10))
	curl -A "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.307.11 Safari/532.9" -o gs$1_page$i.htm "http://scholar.google.com/scholar?start=$j&hl=en&cites=$1"
done

echo "date" >  $1.txt
for (( i = 0; i < $2; i++ )); do
	perl -nle 'print for m/ ([0-9][0-9][0-9][0-9]) - /g' gs$1_page$i.htm >> $1.txt
done

# have a nice day

May 13, 2010 at 4:58 am 2 comments
