Scraping Using twitteR
December 13, 2011 at 1:21 pm gabrielrossman 5 comments
| Gabriel |
Previously I’d discussed scraping Twitter using Bash and Perl. Then yesterday on an orgtheory thread Trey mentioned the R library twitteR and with some help from Trey I worked out a simple script that replaces the twitterscrape_daily.sh and twitterparse.pl scripts from the earlier workflow. The advantage of this script is that it’s a lot shorter, it can get an arbitrary number of tweets instead of just 20, and it captures some of the meta-text that could be useful for constructing social networks.
To use it you need a text file that consists of a list of Twitter feeds, one per line. The location of this file is given in the “inputfile” line.
The “howmany” line controls how many tweets back it goes in each feed.
The “outputfile” line says where the output goes. Note that the script appends to this file rather than overwriting it, so repeated runs can produce redundant rows, which you can fix by running this bash code:
sort mytweets.txt | uniq > tmp
mv tmp mytweets.txt
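One caveat: v1 is just the tweet's position within each query, so the same tweet fetched on two different days can land on two different lines, and sort | uniq will keep both. Here is a minimal sketch of deduplicating on the tweet id instead, in R. The helper name dedupe_tweets and the toy data are made up for illustration, and it assumes the file has no multi-line tweets or embedded quotes.

```r
# Hypothetical helper: keep the first copy of each tweet, matching on the id
# column (v8 in the real output file) instead of on the whole line.
dedupe_tweets <- function(path, id_col = 8) {
  # Read every column as character so the long tweet ids are not rounded.
  tweets <- read.table(path, sep = "\t", header = FALSE, quote = "\"",
                       comment.char = "", colClasses = "character")
  tweets[!duplicated(tweets[[id_col]]), , drop = FALSE]
}

# Toy check with made-up data: two appended runs that overlap on both tweets,
# but with the tweets in a different position (so v1 differs between runs).
toy <- data.frame(text = c("first", "second"),
                  id = c("146662141613977600", "146662141613977601"),
                  stringsAsFactors = FALSE)
toy2 <- toy[2:1, ]
rownames(toy2) <- NULL

tmp <- tempfile()
write.table(toy,  file = tmp, append = TRUE, sep = "\t", col.names = FALSE)
write.table(toy2, file = tmp, append = TRUE, sep = "\t", col.names = FALSE)

# The toy file has only three columns: row name, text, id.
deduped <- dedupe_tweets(tmp, id_col = 3)
print(deduped)
```

On the real file you would leave id_col at 8 and write the result back with write.table(..., row.names=FALSE, col.names=FALSE) so the v1 to v11 layout is preserved.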
The outputfile has no headers, but they are as follows:
#v1 ignore field, just shows number w/in query
#v2 text of the Tweet
#v3 favorited dummy
#v4 replytosn (mention screenname)
#v5 created (date in YMDhms)
#v6 truncated dummy
#v7 replytosid
#v8 id
#v9 replytouid
#v10 statussource (Twitter client)
#v11 screenname
Unfortunately, the script doesn’t handle multi-line tweets very well, but I’m not sufficiently good at R to regexp out internal EOL characters. I’ll be happy to work this in if anyone cares to post some code to the comments on how to do a find and replace that zaps the internal EOL in the field tmptimeline.df$text.
library(twitteR)
howmany <- 30 #how many past tweets to collect
inputfile <- "~/feedme.txt"
outputfile <- "~/mytweets.txt"
feeds <- as.vector(t(read.table(inputfile)))
for (user in feeds) {
  tmptimeline <- userTimeline(user,n=as.character(howmany))
  tmptimeline.df <- twListToDF(tmptimeline)
  write.table(tmptimeline.df,file=outputfile,append=TRUE,sep="\t",col.names=FALSE)
}
Finally, if you import it into Stata you’ll probably want to run this:
drop v1
ren v2 text
ren v3 favorited
ren v4 replytosn
ren v5 created
ren v6 truncated
ren v7 replytosid
ren v8 id
ren v9 replytouid
ren v10 statussource
ren v11 screenname
gen double timestamp=clock(subinstr(created,"-","/",.),"YMDhms")
format timestamp %tc
1.
Trey | December 13, 2011 at 1:59 pm
Glad to see this worked out. A couple of follow-up points to potential users — if you want to do this as a one-off, you can set the col.names argument to TRUE and the append argument to FALSE and you’ll get variable names in the output. If you want to keep a running file (as I suspect almost everyone will want to do), then you’ll want to leave these options as listed above.
Quick and dirty fix to deal with multiple lines: Insert the following after the twListToDF line, before the write.table call:
tmptimeline.df$text <- gsub("\\n", "", tmptimeline.df$text)
For what it’s worth, right now you can make 150 unauthenticated calls to Twitter an hour (or 350 OAuth calls), although you can make more calls than that using the Search API. More information is available in Twitter’s Rate Limiting FAQ.
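To stay under the lower limit, one option is to pause between feeds. This is a rough sketch, assuming each feed costs one call (which may not hold for large values of howmany). The helper throttled_each and the stand-in fetch function are hypothetical; in the script from the post, fetch would wrap the userTimeline call.

```r
# 150 calls an hour works out to one call every 24 seconds.
calls_per_hour <- 150
pause_seconds <- 3600 / calls_per_hour

# Hypothetical helper: apply fetch() to each item, sleeping between calls.
throttled_each <- function(items, fetch, pause = pause_seconds) {
  results <- vector("list", length(items))
  for (i in seq_along(items)) {
    results[[i]] <- fetch(items[[i]])
    # No need to wait after the last item.
    if (i < length(items)) Sys.sleep(pause)
  }
  names(results) <- items
  results
}

# Toy run with a stand-in fetch function and no pause, just to show the shape.
demo <- throttled_each(c("feed_a", "feed_b"), fetch = nchar, pause = 0)
print(demo)
```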
2.
gabrielrossman | December 13, 2011 at 2:55 pm
Thanks, both for this and your earlier help by email.
One trick with “col.names=TRUE” is that all the names are offset by one. Don’t ask me why, I just work here.
Also good to know about the 150 v 350 issue.
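For what it's worth, the offset looks like standard write.table behavior: row names are written by default, but the header row gets no entry for them. Here is a small sketch on a made-up data frame, showing the mismatch and two ways around it (col.names=NA or row.names=FALSE).

```r
# Made-up stand-in for the tweet data frame.
toy <- data.frame(text = c("a", "b"), id = c("1", "2"),
                  stringsAsFactors = FALSE)
tmp <- tempfile()

# Default row names plus col.names=TRUE: header is one field short.
write.table(toy, file = tmp, sep = "\t", col.names = TRUE)
lines_default <- readLines(tmp)
n_header <- length(strsplit(lines_default[1], "\t")[[1]])  # 2 names
n_fields <- length(strsplit(lines_default[2], "\t")[[1]])  # 3 fields

# Option 1: col.names=NA adds a blank name for the row-name column.
write.table(toy, file = tmp, sep = "\t", col.names = NA)
n_header_na <- length(strsplit(readLines(tmp)[1], "\t")[[1]])  # 3 names

# Option 2: drop the row names altogether.
write.table(toy, file = tmp, sep = "\t", row.names = FALSE)
lines_norow <- readLines(tmp)
n_fields_norow <- length(strsplit(lines_norow[2], "\t")[[1]])  # 2 fields

print(c(n_header, n_fields, n_header_na, n_fields_norow))
```

Note that dropping the row names removes the v1 column, so the Stata renames in the post would need to shift down by one.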
3.
Geoffrey Fojtasek | December 13, 2011 at 3:17 pm
You may still have some newlines if you don’t sub out “\r”, and there are also likely to be some tabs that end up being read as delimiters, so the gsub pattern should be “\\n|\\r|\\t”.
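A minimal sketch of that pattern on a made-up data frame. Replacing with a space rather than an empty string is my assumption here, so that words on either side of a line break don't get fused; the second version is a variant that collapses runs such as "\r\n" into a single space.

```r
# Made-up tweets with an internal newline, a tab, and a Windows line break.
tweets <- data.frame(text = c("line one\nline two",
                              "tab\there",
                              "windows\r\nbreak"),
                     stringsAsFactors = FALSE)

# Pattern from the comment above, swapping each character for a space.
cleaned <- gsub("\\n|\\r|\\t", " ", tweets$text)

# Variant: treat any run of these characters as one break.
collapsed <- gsub("[\r\n\t]+", " ", tweets$text)

print(cleaned)
print(collapsed)
```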
4.
Trey | December 13, 2011 at 3:23 pm
Yep, good point (hence the quick and dirty).
5. Scraping Using twitteR (updated) « Code and Culture | December 20, 2011 at 11:04 pm
[...] Last time I described using the twitteR library for R. In that post I had R itself read over a list to loop. In this post I have the looping occur in Bash with argument passing to R through the commandArgs() function. [...]