Cited reference search time-series
| Gabriel |
[Update: it looks like Google redirects you to a captcha after the first 100 pages, so the script won’t work for mega-cites like DiMaggio and Powell 1983].
I was recently talking to somebody who suspected an article he wrote 30 years ago was something of a “sleeper hit” and wanted to see an actual time-series. I wrote this little script to read Google Scholar and extract the dates. You have to tell it the Google Scholar serial number for the focal cite and how many pages to collect.
For instance if you search GS for Strang and Soule’s ARS and click where it says “Cited by 493” you get the URL “http://scholar.google.com/scholar?cites=3071200965662451019&hl=en&as_sdt=2000”. The important part of the URL is the number between “cites=” and “&”. To figure out how many pages to collect divide the number of citations by 10 and round down. So the syntax to scrape for this cite would be:
bash gscholarscrape.sh 3071200965662451019 49
Here’s the time-series for citations to that article
Here’s the code. Note that with fairly little modification you could get it to also give the names citing journal or book and authors.
#!/bin/bash # gscholarscrape.sh # this script scrapes google scholar for references to a given cite # GHR, firstname.lastname@example.org #takes as arguments the serial number of the cite followed by the number of pages deep to scrape (# of cites / 10) #eg, for DiMaggio and Powell ASR 1983 the syntax is #gscholarscrape.sh 11439231157488236678 1103 for (( i = 0; i < $2; i++ )); do j=$(($i*10)) curl -A "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.307.11 Safari/532.9" -o gs$1_page$i.htm "http://scholar.google.com/scholar?start=$j&hl=en&cites=$1" done echo "date" > $1.txt for (( i = 0; i < $2; i++ )); do perl -nle 'print for m/ ([0-9][0-9][0-9][0-9]) - /g' gs$1_page$i.htm >> $1.txt done # have a nice day