Scraping Using twitteR (updated)
| Gabriel |
Last time I described using the twitteR library for R. In that post I had R itself loop over the list of feeds. In this post the looping happens in Bash, which passes each screenname to R as an argument via the commandArgs() function.
First, one major limitation of using Twitter is that it times you out for an hour after 150 queries. (You can double this if you use OAuth, but I’ve yet to get that to work.) For reasons I don’t really understand, collecting one feed can require multiple queries, especially if you’re trying to go far back in the timeline. For this reason you need to break your list up into several small lists and cron them at least 80 minutes apart. This bit of Bash code splits a file called “list.txt” into several files of 50 lines each. It also converts Windows line endings to Unix EOL, which avoids trouble later on.
split -l 50 list.txt short_tw
perl -pi -e 's/\r\n/\n/g' short_tw*
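As a quick sanity check (a sketch, assuming a 120-line list.txt for illustration), `split -l 50` produces files named short_twaa, short_twab, short_twac, and so on, each at most 50 lines:

```shell
# With a 120-line list.txt, split yields short_twaa and short_twab
# (50 lines each) plus short_twac (the remaining 20 lines).
split -l 50 list.txt short_tw
wc -l short_tw*
```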
The next thing to keep in mind is that you’ll need to pass arguments to R. Argument passing is when a script takes input from outside the script and processes it as variables. The enthusiastic use of argument passing in Unix is the reason why there is a fine line between a file and a command in that operating system.
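In shell terms, argument passing looks like this (a minimal sketch; `get_feed` is a hypothetical name, standing in for any script or function):

```shell
# A script or function receives its inputs as the positional
# parameters $1, $2, ... rather than reading them from a file.
get_feed() {
  screenname="$1"                       # first argument from the caller
  echo "fetching feed for: $screenname"
}

get_feed asanews                        # prints "fetching feed for: asanews"
```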
In theory you could have R read the target list itself, but this crashes when you hit your first dead URL. Running the loop from outside R makes the process more robust, but it requires passing arguments to R. I’d previously solved this problem by having Stata write an entire R script, which Stata understood as containing variables (or “macros”) but which from R’s perspective was hard-coded. However, I was recently delighted to discover that R can accept command-line arguments through the commandArgs() function. Not surprisingly, this is more involved than $1 in Bash, @ARGV in Perl, or `1' in Stata, but it’s not that bad. You invoke R with the “--args” option and then, inside R, call commandArgs() to capture the arguments in a character vector, which behaves much like Perl’s @ARGV array.
Here’s an R script that accepts a Twitter screenname as a command-line argument, uses the twitteR library to collect that feed, and saves it as a tab-delimited text file of the same name. (It appends if the file already exists.) Also note that (thanks to commenters on the previous post) it turns internal EOL characters into regular spaces. It’s currently set to collect the last 200 tweets, but you can adjust this with the third line (or rewrite the script to take this as a command-line argument as well).
args <- commandArgs(trailingOnly = TRUE)
library(twitteR)
howmany <- 200 #how many past tweets to collect
user <- args[1] #first (and only) command-line argument
outputfile <- paste('~/project/feeds/',user,'.txt',sep="")
print(user)
print(outputfile)
tmptimeline <- userTimeline(user,n=howmany)
tmptimeline.df <- twListToDF(tmptimeline)
tmptimeline.df$text <- gsub("\\n|\\r|\\t", " ", tmptimeline.df$text) #internal EOL and tabs become spaces
write.table(tmptimeline.df,file=outputfile,append=TRUE,sep="\t",col.names=FALSE)
quit()
To collect just a single feed, invoke the script like this from the command line.
R --vanilla --args asanews < datacollection.R
Of course the whole reason to write the script this way is to loop it over the lists. Here it is for the list “short_twaa”.
for i in `cat short_twaa`; do R --vanilla --args $i < datacollection.R ; done
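A slightly more defensive variant of that loop (a sketch; `run_list` is a hypothetical wrapper name) uses `while read` instead of backticks, strips any stray Windows carriage returns, skips blank lines, and logs rather than dies when one feed fails:

```shell
# Read the list line by line and call the R script once per screenname.
run_list() {
  while IFS= read -r i; do
    i=${i%$'\r'}                                  # drop any leftover \r
    [ -z "$i" ] && continue                       # skip blank lines
    R --vanilla --args "$i" < datacollection.R \
      || echo "failed: $i" >&2                    # keep going on error
  done < "$1"
}
```

You would then call it as `run_list short_twaa`.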
Keep in mind that you’ll probably want to cron this, either because you want a running scrape or because it makes it easier to space out the “short_tw*” runs so you don’t get timed out.
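For example, crontab entries like these (a sketch only; “scrapefeeds.sh” is a hypothetical wrapper script containing the for-loop shown earlier, and the paths are assumptions) would run the split lists 90 minutes apart, comfortably clear of the 80-minute minimum:

```
# m  h   dom mon dow  command
0    0   *   *   *    cd ~/project && ./scrapefeeds.sh short_twaa
30   1   *   *   *    cd ~/project && ./scrapefeeds.sh short_twab
0    3   *   *   *    cd ~/project && ./scrapefeeds.sh short_twac
```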