Cited reference search time-series

May 13, 2010 at 4:58 am 2 comments

| Gabriel |

[Update: It looks like Google redirects you to a captcha after the first 100 pages, so the script won’t work for mega-cites like DiMaggio and Powell 1983.]

I was recently talking to somebody who suspected an article he wrote 30 years ago was something of a “sleeper hit” and wanted to see an actual time-series. I wrote this little script to read Google Scholar and extract the dates. You have to tell it the Google Scholar serial number for the focal cite and how many pages to collect.

For instance, if you search Google Scholar for Strang and Soule’s ARS and click where it says “Cited by 493”, you get the URL “http://scholar.google.com/scholar?cites=3071200965662451019&hl=en&as_sdt=2000”. The important part of the URL is the number between “cites=” and “&”. To figure out how many pages to collect, divide the number of citations by 10 and round down (this skips the final partial page, here the last 3 cites). So the syntax to scrape for this cite would be:

bash gscholarscrape.sh 3071200965662451019 49
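If you'd rather not pick the serial number out of the URL and do the division by hand, something like the following could build the command for you. This is only a sketch: `cite_id` and `page_count` are hypothetical helper names of my own, not part of gscholarscrape.sh, and it assumes the URL contains a single all-digit “cites=” parameter like the one above.

```shell
#!/bin/bash
# Hypothetical helpers (assumptions for illustration, not part of gscholarscrape.sh).

# Print the run of digits that follows "cites=" in a Google Scholar URL.
cite_id() {
	printf '%s\n' "$1" | sed -n 's/.*[?&]cites=\([0-9][0-9]*\).*/\1/p'
}

# Print the number of citations divided by 10, rounded down
# (shell arithmetic is integer-only, so the remainder is dropped).
page_count() {
	echo $(( $1 / 10 ))
}

url="http://scholar.google.com/scholar?cites=3071200965662451019&hl=en&as_sdt=2000"
echo "bash gscholarscrape.sh $(cite_id "$url") $(page_count 493)"
```

Run as is, it prints the same command line shown above.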

Here’s the time-series for citations to that article

Here’s the code. Note that with fairly little modification you could get it to also give the names of the citing journal or book and the authors.

#!/bin/bash
# gscholarscrape.sh
# this script scrapes google scholar for references to a given cite
# GHR, rossman@soc.ucla.edu

#takes as arguments the serial number of the cite followed by the number of pages deep to scrape (# of cites / 10)
#eg, for DiMaggio and Powell ASR 1983 the syntax is
#gscholarscrape.sh 11439231157488236678 1103
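# download each results page; there are 10 cites per page, so page i starts at offset i*10
for (( i = 0; i < $2; i++ )); do
	j=$(( i * 10 ))
	curl -A "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.307.11 Safari/532.9" -o "gs${1}_page${i}.htm" "http://scholar.google.com/scholar?start=${j}&hl=en&cites=${1}"
done

# pull every four-digit year that appears as " YYYY - " in the saved pages, one per line
echo "date" > "${1}.txt"
for (( i = 0; i < $2; i++ )); do
	perl -nle 'print for m/ ([0-9][0-9][0-9][0-9]) - /g' "gs${1}_page${i}.htm" >> "${1}.txt"
done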

# have a nice day
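The script leaves you with one year per line, so getting to an actual time-series still means counting how often each year appears. One way to do that without leaving the shell is sketched below. It assumes the output file has the “date” header on its first line followed by one year per line, as the script writes it; `year_counts` is a hypothetical helper name of my own, not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper (an assumption for illustration, not part of gscholarscrape.sh).
# Reads a file with a "date" header followed by one year per line and
# prints "year count" pairs, sorted by year.
year_counts() {
	tail -n +2 "$1" | sort -n | uniq -c | awk '{ print $2, $1 }'
}

# If a file name is given on the command line, tabulate it.
if [ "$#" -gt 0 ]; then
	year_counts "$1"
fi
```

For the Strang and Soule example you would point it at 3071200965662451019.txt, and the two-column output can then be plotted in whatever package you like.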


2 Comments

  • 1. A world class sycophant...  |  May 15, 2010 at 2:49 am

    …who is stopping in to say “Thank you for making your Sociology of Mass Communications course available on iTunes. Awesome listen, very interesting.”

  • 2. Bioinformatics  |  September 9, 2011 at 3:52 pm

    Thanks for the code.

