Archive for September, 2010

Scraping for Event History

| Gabriel |

As I’ve previously mentioned, there’s a lot of great data out there, but much of it is ephemeral, so if you’re interested in change (which, given our obsession with event history, many sociologists are) you’ve got to know how to grab it. I provided a script (using cron and curl) for grabbing specific pages and timestamping them, but this doesn’t scale up very well to getting entire sites, both because you need to specify each URL and because it saves a complete copy each time rather than the diff. I’ve recently developed another approach that relies on wget and rsync and is much better for scaling up to a more ambitious scraping project.

Note that because of subtle differences between dialects of Unix, I’m assuming Linux for the data collection but Mac for the data cleaning.* Using one or the other for everything requires some adjustments. Also note that because you’ll want to “cron” this, I don’t recommend running it on your regular desktop computer unless you leave it on all night. If you don’t have server space (or an old computer on which you can install Linux and then treat as a server), your cheapest option is probably to run it on a wall wart computer for about $100 (plus hard drive).

Wget is similar to curl in that it’s a tool for downloading internet content but it has several useful features, some of which aren’t available in curl. First, wget can do recursion, which means it will automatically follow links and thus can get an entire site as compared to just a page. Second, it reads links from a text file a bit better than curl. Third, it has a good time-stamping feature where you can tell it to only download new or modified files. Fourth, you can exclude files (e.g., video files) that are huge and that you’re unlikely to ever make use of. Put these all together and it means that wget is scalable: it’s very good at getting and updating several websites.

Unfortunately, wget is good at updating, but not at archiving. It assumes that you only want the current version, not the current version and several archival copies. Of course this is exactly what you do need for any kind of event history analysis. That’s where rsync comes in.

Rsync is, as the name implies, a syncing utility. It’s commonly used as a backup tool (both remote and local). However the simplest use for it is just to sync several directories and we’ll be applying it to a directory structure like this:

project/
  current/
  backup/
    t0/
    t1/
    t2/
  logs/

In this setup, wget only ever works on the “current” directory, which it freely updates. That is, whatever is in “current” is a pretty close reflection of the current state of the websites you’re monitoring. The timestamped stuff, which you’ll eventually be using for event history analysis, goes in the “backup” directories. Every time you run wget you then run rsync after it so that next week’s wget run doesn’t throw this week’s wget run down the memory hole.

The first time you do a scrape you basically just copy current/ to backup/t0. However if you were to do this for each scrape it would waste a lot of disk space since you’d have a lot of identical files. This is where incremental backup comes in, which Mac users will know as Time Machine. You can use hard links (similar to aliases or shortcuts) to get rsync to accomplish this.** The net result is that backup/t0 takes the same disk space as current/ but each subsequent “backup” directory takes only about 15% as much space. (A lot of web pages are generated dynamically and so they show up as “recently modified” every time, even if there’s no actual difference with the “old” file.) Note that your disk space requirements get big fast. If a complete scrape is X, then the amount of disk space you need is approximately 2 * X + .15 * X * number of updates. So if your baseline scrape is 100 gigabytes, this works out to a full terabyte after about a year of weekly updates.
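To make the arithmetic concrete, here is a minimal Python sketch of that rule of thumb. The function name and the 15% default are illustrative assumptions based on the rough figures above, not measurements.

```python
def disk_needed_gb(baseline_gb, n_updates, share_per_update=0.15):
    """Rough disk estimate: current/ plus the baseline copy (2 * X),
    plus an assumed share of X for each later incremental backup."""
    return 2 * baseline_gb + share_per_update * baseline_gb * n_updates


if __name__ == "__main__":
    # A 100 GB baseline scrape followed by a year of weekly updates.
    print(disk_needed_gb(100, 52))
```

With a 100 gigabyte baseline and 52 weekly updates this comes to roughly 980 gigabytes, which is where the “full terabyte” figure comes from.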

Finally, when you’re ready to analyze it, just use mdfind (or grep) to search the backup/ directory (and its subdirectories) for the term whose diffusion you’re trying to track and pipe the results to a text file. Then use a regular expression to parse each line of this query into the timestamp and website components of the file path to see on which dates each website used your query term — exactly the kind of data you need for event history. Furthermore, you can actually read the underlying files to get the qualitative side of it.
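The path-parsing step might look something like the following Python sketch. It assumes each line of the search results is a file path of the form backup/<timestamp>/<website>/..., which is the layout you get when wget’s recursive mode saves each site under a directory named for its host. The function name and the example paths are hypothetical.

```python
import re
from collections import defaultdict

# Assumed layout: .../backup/<timestamp>/<website>/<rest of path>
PATH_RE = re.compile(r"(?:^|/)backup/(?P<stamp>[^/]+)/(?P<site>[^/]+)/")


def dates_by_site(paths):
    """Map each website to the sorted list of timestamps in which a hit appears."""
    hits = defaultdict(set)
    for path in paths:
        match = PATH_RE.search(path)
        if match:  # skip lines that are not paths under backup/
            hits[match.group("site")].add(match.group("stamp"))
    return {site: sorted(stamps) for site, stamps in hits.items()}


if __name__ == "__main__":
    # Hypothetical search results, one file path per line.
    example = [
        "backup/20100921/www.example.org/index.html",
        "backup/20100928/www.example.org/news/story.html",
        "backup/20100928/www.example.com/index.html",
    ]
    for site, stamps in sorted(dates_by_site(example).items()):
        print(site, stamps)
```

The timestamps are sorted as strings, so the first entry for each site is its earliest hit only if every backup directory is named with a YYYYMMDD stamp; a directory called “baseline” would need to be mapped to its date first.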

So on to the code. The wget part of the script looks like this:

DATESTAMP=`date '+%Y%m%d'`
cd ~/Documents/project
mkdir logs/$DATESTAMP
cd current
wget -S --output-file=../logs/$DATESTAMP/wget.log --input-file=../links.txt -r --level=3 -R mpg,mpeg,mp4,au,mp3,flv,jpg,gif,swf,wmv,wma,avi,m4v,mov,zip --tries=10 --random-wait --user-agent=""

That’s what it looks like the first time you run it. When you’re just trying to update “current/” you need to change “wget -S” to “wget -N” but aside from that this first part is exactly the same. Also note that if links.txt is long, I suggest you break it into several parts. This will make it easier to rerun only part of a large scrape, for instance if you’re debugging, or there’s a crash, or if you want to run the scrape only at night but it’s too big to completely run in a single night. Likewise it will also allow you to parallelize the scraping.

Now for the rsync part. After your first run of wget, run this code.

cd ..
rsync -a current/ backup/baseline/

After your update wget runs, you do this.

cd ..
cp -al backup/baseline/ backup/$DATESTAMP/
rsync -av --delete current/ backup/$DATESTAMP/

* The reason to use Linux for data collection is that OS X doesn’t include wget and has an older version of the cp command, though it’s possible to solve both issues by using Fink to install wget and by rewriting cp in Mac/BSD syntax. The reason to use Mac for data analysis is that mdfind is faster (at least once it builds an index) and can read a lot of important binary file formats (like “.doc”) out of the box, whereas grep only likes to read text-based formats (like “.htm”). There are apparently Linux programs (e.g., Beagle) that allow indexed search of many file formats, but I don’t have personal experience with using them as part of a script.

** I adapted this use of hard links and rsync from this tutorial, but note that there are some important differences. He’s interested in a rolling “two weeks ago,” “last week,” “this week” type of thing, whereas I’m interested in absolute dates and don’t want to overwrite them after a few weeks.

September 28, 2010 at 4:26 am 5 comments

Stata 11 Factor Variable / Margins Links

| Gabriel |

Michael Mitchell posted a brief tutorial on factor variables. (This has nothing to do with the “factor” command; think interaction terms and “xi,” not eigenvectors and structural equations.) People who are already familiar with FV won’t really get anything out of the tutorial, but it should serve as a very good introduction for people who are new to this syntax. Like all of Mitchell’s writing (e.g., the graphics book) it is very clearly written and generally well-suited to the user who is transitioning from novice to advanced usage.

The new Stata newsletter has a good write-up of “margins” syntax, which is useful in conjunction with factor variables for purposes of interpreting the betas (especially when there’s some nonlinearity involved). In a recent post explaining a comparable approach, I said that my impression was that margins only really works with categorical independent variables, but I’m pleased to see that I was mistaken and it in fact works with continuous variables as well.

September 23, 2010 at 4:01 am 1 comment

Zotero’s Bibtex export filter

| Gabriel |

Two issues with Zotero’s export filter to BibTex.

First, in the new version they broke backwards compatibility (a little) so you can get some missing citation errors if you try to use Lyx/Latex files that you originally wrote based on BibTex files generated by older versions of Zotero. Specifically, the new version handles colons in the title differently when generating the BibTex key. The old version left the colon out, the new version keeps it in. I had to go through my book manuscript and change all the BibTex keys for which this makes a difference. An alternate approach would be to freeze your Zotero library folder and use the last export created with the old version and start a new Zotero library folder for citations you collect from here on out.

Second, while it is possible to create a BibTex “author” field that TeX doesn’t read as “Last, First,” you can’t do it from Zotero. This is a problem as you end up getting ridiculous bibliography entries like “Bureau, US Census”. To get these and similar cases formatted correctly, the easiest thing is just to delete them from Zotero and put them in a hand-coded BibTex file. (Don’t worry, a single TeX document can draw citations from multiple .bib files so you don’t have to commit to hand-coding everything or merging your hand-coded file into the Zotero-generated file). The trick is that you use quotes (not curly brackets as with Zotero’s export) for the field delimiters and then use curly brackets inside quotes to tell BibTex “don’t break this.” For instance, here’s my entry for the Statistical Abstract of the United States:

@book{u.s._census_bureau_statistical_2007,
	address = "Washington, DC",
	edition = "127",
	title = "Statistical Abstract of the United States, 2008.",
	isbn = "9780160795848",
	publisher = "{U.S.} Census Bureau",
	author = "{U.S. Census Bureau}",
	year = "2007"
}

September 21, 2010 at 11:31 pm

We have a protractor

| Gabriel |

Neal Stephenson’s Anathem opens with the worst instrument in the history of survey research. A monastery of cloistered science-monks is about to open its gates to the outside world for a brief decennial festival and they are interviewing one of the few ordinary people with whom they have regular contact about what they can expect outside. The questions are as vague and etic as imaginable and the respondent has a hard time interpreting them. The reason the questions are so bad is that the monks are almost completely cut off from society and the instrument has been in continuous use for millennia.

The monks call society the “secular world,” which sounds strange given that these monks are atheists but makes sense if you remember that “secular” means “in time” and that in English we use this word to mean “non-religious” because St. Augustine argued that God exists outside of time and Pope Gelasius elaborated this argument to develop a theory of the Church’s magisterium. Anyway, the monks in Anathem are so separate from society that in a very real sense they too exist outside of time. To the extent that the outside world does impinge on their experience, it is mostly with the threat of a “sack,” an anti-intellectual pogrom that tends to happen every few centuries.

Thus the novel, especially the first third, is primarily a thought experiment about what it would look like if we were to take the ivory tower as a serious aspiration. I mean, imagine never struggling to figure out what the broader impacts are of your research for purposes of a grant proposal because you’re opposed in principle to the very idea of broader impacts and strive for such perfect lack of them that you asymptotically approach “causal domain shear,” meaning that nothing in the monastery affects the outside world and vice versa. Also, you never go through tenure review because you can stay in the monastery as long as you want (intellectual deadweight is gently shifted to full-time semi-skilled manual labor). OK, there are some pretty big downsides compared to academia here on Earth. You have to do all your calculations by hand as you are forbidden computers and most other modern technology. You spend half the day chanting and gardening. Your only possessions are a bathrobe and a magic beach ball. When you break the rules they punish you by sending you into solitary for a few months to study for a field exam on gibberish. And as previously mentioned, once or twice every thousand years the yokels storm your office and lynch you.

The most easily ridiculed and stereotypically science-fiction-y aspect of the book is the abundance of neologisms. When I started reading it, I found the whole alternative vocabulary very distracting and I did a lot of on-the-fly translation from Stephenson to English. I mean, I understand the need to coin terms like “Centenarian” (members of a monastery whose cloistered status is relaxed only once a century) when there is no good English equivalent, but it’s mostly* gratuitous to talk about a “Counter-Bazian ark” instead of a “Protestant church,” a “jeejah” instead of an “iPhone,” “Syntactic Faculty” instead of “nominalism,” “Semantic Faculty” instead of “naturalism,” or “Orth” instead of “Latin.” Likewise, I found myself constantly interpreting the dates by adding 2200.** Fortunately after a few hundred pages this didn’t bother me, not so much because I thought it was justified, but because I was sufficiently acclimated to it and enjoying the novel that I didn’t notice anymore. Still, I would have preferred it if he just set the book in the future of an alternate Earth which had science monks but didn’t have a bunch of silly vocabulary.

Also, for better or worse, several of the secondary characters were basically recycled from Cryptonomicon. So the outdoorsman Yul and his tomboy girlfriend Cord are basically the same people as the outdoorsman Doug Shaftoe and his daughter Amy. Likewise, the asshole master Procian Fraa Lodoghir is basically the same person as the asshole celebrity postmodernist GEB Kivistik.

Oh, and there’s kung fu.

*If you want to know what I mean by “mostly,” read the book’s spoiler-tastic Wikipedia page and figure it out.

**Aside from the whole science monasteries thing, the book’s backstory closely parallels actual political and intellectual history through the “Praxic” (read: modern) age. Their dating system is pegged to a horrific nuclear war in about the year 2200 AD rather than to the foundation of the Bazian church (read: the birth of Christ). The novel’s present is 3700 years after the nuclear apocalypse, or about the equivalent of the year 6000 AD.

September 20, 2010 at 2:52 pm 1 comment

Status, Sorting, and Meritocracy

| Gabriel |

Over at OrgTheory, Fabio asked about how much turnover we expect to see in the NRC rankings. In the comments, a few other people and I discussed the analysis of the rankings in Burris 2004 ASR. Kieran raised the interpretation that the pattern in the data could all be sorting.

To see how plausible this is I wrote a simulation with 500 grad students, each of whom has a latent amount of talent that can only be observed with some noise. The students are admitted in cohorts of 15 each to 34 PhD granting departments (the lowest-ranked department gets only a partial cohort) and are strictly sorted so the (apparently) best students go to the best schools. There they work on their dissertations, the quality of which is a function of their talent, luck, and (to represent the possibility that top departments teach you more) a parameter proportional to the inverse square root of the department’s rank. There is then a job market, with one job line per PhD granting department, and again, strict sorting (without even an exception for the incest taboo). I then summarize the amount of reproduction as the proportion of top 10 jobs that are taken by grad students from the top 10 schools.
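The setup just described can be sketched compactly in Python. This is not the Stata program that produced the results in this post, just a minimal re-expression of the same assumptions (normal talent, one noise level for both admissions and dissertations, value added equal to the learning parameter divided by the square root of department rank). The function and argument names are made up for illustration.

```python
import math
import random


def top10_closure(noise, quality, n_students=500, cohort_size=15,
                  lines_per_dept=1, top=10, seed=None):
    """Share of the top-`top` jobs filled by students from top-`top` grad schools."""
    rng = random.Random(seed)
    talent = [rng.gauss(0, 1) for _ in range(n_students)]
    # Admissions: strict sorting on talent observed with noise.
    apparent = [t + rng.gauss(0, noise) for t in talent]
    by_apparent = sorted(range(n_students), key=lambda i: -apparent[i])
    grad_rank = [0] * n_students
    for position, student in enumerate(by_apparent):
        grad_rank[student] = 1 + position // cohort_size
    # Dissertation quality: talent + luck + value added by the department.
    diss = [talent[i] + rng.gauss(0, noise) + quality / math.sqrt(grad_rank[i])
            for i in range(n_students)]
    # Job market: strict sorting again, best dissertations to best departments.
    by_diss = sorted(range(n_students), key=lambda i: -diss[i])
    top_hires = by_diss[:top * lines_per_dept]
    return sum(grad_rank[i] <= top for i in top_hires) / len(top_hires)


if __name__ == "__main__":
    # Average closure over 100 runs for one combination of assumptions.
    runs = [top10_closure(noise=1, quality=0, seed=s) for s in range(100)]
    print(sum(runs) / len(runs))
```

The exact averages will not match the Stata output digit for digit, since the random draws differ.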

So how plausible is the meritocracy explanation? It turns out it’s pretty plausible. This table shows the closure for the top 10 jobs averaged over 100 runs each for several combinations of assumptions. Each cell shows, on average, what proportion of the top 10 jobs we expect to be taken by students from the top 10 schools if we take as assumptions the row and column parameters. The rows represent different assumptions about how noisy our observation of talent is when we read an application to grad school or conduct a job search. The columns represent a scaling parameter for how much you learn at different ranked schools. For instance, if we assume a learning parameter of “1.5,” a student at the 4th highest-ranked school would learn 1.5/(4^0.5), or .75. It turns out that unless you assume noise to be very high (something like a unit signal:noise ratio or worse), meritocracy is pretty plausible. Furthermore, if you assume that the top schools actually educate grad students better, then meritocracy looks very plausible even if there’s a lot of noise.

P of top 10 jobs taken by students from top 10 schools
--------------------------------------------
Noisiness of  |
Admissions and|How Much More Do You Learn at
Diss / Job    |         Top Schools
Market        |    0    .5     1   1.5     2
--------------+-----------------------------
            0 |    1     1     1     1     1
           .1 |    1     1     1     1     1
           .2 |    1     1     1     1     1
           .3 | .999     1     1     1     1
           .4 | .997     1     1     1     1
           .5 | .983  .995  .999     1     1
           .6 | .966   .99  .991  .999  .999
           .7 | .915   .96  .982  .991  .995
           .8 | .867  .932  .963  .975  .986
           .9 | .817  .887  .904  .957  .977
            1 | .788  .853  .873  .919   .95
--------------------------------------------

Of course, keep in mind this is all in a world of frictionless planes and perfectly spherical cows. If we assume that lots of people are choosing on other margins, or that there’s not a strict dual queue of positions and occupants (e.g., because searches are focused rather than “open”), then it gets a bit looser. Furthermore, I’m still not sure that the meritocracy model has a good explanation for the fact that academic productivity figures (citation counts, etc.) have only a loose correlation with ranking.

Here’s the code; knock yourself out using different metrics of reproduction, inputting different assumptions, etc.

[Update: also see Jim Moody’s much more elaborate/realistic simulation, which gives similar results].

capture program drop socmeritocracy
program define socmeritocracy
	local gre_noise=round(`1',.001) /* size of error term, relative to standard normal, for apparenttalent=f(talent) */
	local diss_noise=round(`2',.001) /* size of error term, relative to standard normal, for dissquality=f(talent) */
	local quality=round(`3',.001) /* scaling parameter for valueadded (by quality grad school) */
	local cohortsize=round(`4',.001) /* size of annual graduate cohort (for each program) */
	local facultylines=round(`5',.001) /* number of faculty lines (for each program)*/
	local batch `6'

	clear
	quietly set obs 500 /*create 500 BAs applying to grad school*/
	quietly gen talent=rnormal() /* draw talent from normal */
	quietly gen apparenttalent=talent + rnormal(0,`gre_noise') /*observe talent w error */
	*grad school admissions follows strict dual queue by apparent talent and dept rank
	gsort -apparenttalent
	quietly gen gradschool=1 + floor(([_n]-1)/`cohortsize')
	lab var gradschool "dept rank of grad school"
	*how much more do you actually learn at prestigious schools
	quietly gen valueadded=`quality'*(1/(gradschool^0.5))
	*how good is dissertation, as f(talent, gschool value added, noise)
	quietly gen dissquality=talent+rnormal(0,`diss_noise') + valueadded
	*job market follows strict dual queue of diss quality and dept rank (no incest taboo/preference)
	gsort -dissquality
	quietly gen placement=1 + floor(([_n]-1)/`facultylines')
	lab var placement "dept rank of 1st job"
	quietly sum gradschool
	quietly replace placement=. if placement>`r(max)' /*those not placed in PhD granting departments do not have research jobs (and may not even have finished PhD)*/
	*recode outcomes in a few ways for convenience of presentation
	quietly gen researchjob=placement
	quietly recode researchjob 0/999=1 .=0
	lab var researchjob "finished PhD and has research job"
	quietly gen gschool_type= gradschool
	quietly recode gschool_type 1/10=1 11/999=2 .=3
	quietly gen job_type= placement
	quietly recode job_type 1/10=1 11/999=2 .=3
	quietly gen job_top10= placement
	quietly recode job_top10 1/10=1 11/999=0
	lab def typology 1 "top 10" 2 "lower ranked" 3 "non-research"
	lab val gschool_type job_type typology
	if "`batch'"=="1" {
		quietly tab gschool_type job_type, matcell(xtab)
		local p_reproduction=xtab[1,1]/(xtab[1,1]+xtab[2,1])
		shell echo "`gre_noise' `diss_noise' `quality' `cohortsize' `facultylines' `p_reproduction'" >> socmeritocracyresults.txt
	}
	else {
		twoway (lowess researchjob gradschool), ytitle(Proportion Placed) xtitle(Grad School Rank)
		tab gschool_type job_type, chi2
	}
end

shell echo "gre_noise diss_noise quality cohortsize facultylines p_reproduction" > socmeritocracyresults.txt

forvalues gnoise=0(.1)1 {
	local dnoise=`gnoise'
	forvalues qualitylearning=0(.5)2 {
		forvalues i=1/100 {
			disp "`gnoise' `dnoise' `qualitylearning' 15 1 1 tick `i'"
			socmeritocracy `gnoise' `dnoise' `qualitylearning' 15 1 1
		}
	}
}

insheet using socmeritocracyresults.txt, clear delim(" ")
lab var gre_noise "Noisiness of Admissions and Diss / Job Market"
lab var quality "How Much More Do You Learn at Top Schools"
table gre_noise quality, c(m p_reproduction)

September 15, 2010 at 4:51 am 1 comment

Seeing like a state used to see

| Gabriel |

The Census Bureau website has not only the current edition of the Statistical Abstract of the United States, but most of the old editions going back to 1878. A lot of the time you just need the basic summary statistics, not the raw microdata, so this is a great resource.

It’s also just plain interesting, especially if you’re interested in categorical schema. For instance, in the 1930s revenues for the radio industry are in the section on “National Government Finances” (because this is the tax base), whereas the same figure is now in a section on “Information & Communications.” This suggests a very different conception about what the information is there for and who is meant to consume it for what purposes.

What really surprised me though was treating the deaf or blind as commensurable with convicted felons, but there they are in the “Defectives and Delinquents” chapter of editions through the early 1940s. The logic of such a category seems to be based not on a logic of “people who might deserve help vs people who might hurt us” but on a causal model of deviations from normality that assumes a eugenics/phrenology conception of crime as based on a malformed brain. Given that we now use MRI to study the brains of murderers, the main thing that’s really shocking about the category is the casual harshness of the word “defective” rather than applying a materialist etiology to both social deviance and physical disability.

September 14, 2010 at 4:45 am

Hide iTunes Store/Ping

| Gabriel |

As a cultural sociologist who has published research on music as cultural capital, I understand how my successful presentation of self depends on me making y’all believe that I only listen to George Gershwin, John Adams, Hank Williams, the Raveonettes, and Sleater-Kinney, as compared to what I actually listen to 90% of the time, which is none of your fucking business. For this reason, along with generally being a web 2.0 curmudgeon, I’m not exactly excited about iTunes’ new social networking feature “Ping.” Sure, it’s opt-in and I appreciate that, but I am so actively uninterested in it that I don’t even want to see its icon in the sidebar. Hence I wrote this one-liner that hides it (along with the iTunes store).

defaults write com.apple.iTunes disableMusicStore -bool TRUE

My preferred way of running it is to use Automator to run it as a service in iTunes that takes no input.

Unfortunately it also hides the iTunes Store, which I actually use a couple times a month. To get the store (and unfortunately, Ping) back, either click through an iTunes store link in your web browser or run the command again but with the last word as “FALSE”.

September 13, 2010 at 4:44 am 2 comments
