Scraping Using twitteR

December 13, 2011 at 1:21 pm

| Gabriel |

Previously I’d discussed scraping Twitter using Bash and Perl. Then yesterday on an orgtheory thread Trey mentioned the R library twitteR, and with his help I worked out a simple script that replaces the twitterscrape_daily.sh and twitterparse.pl scripts from the earlier workflow. The advantages of this script are that it’s a lot shorter, it can get an arbitrary number of tweets instead of just 20, and it captures some of the metadata that could be useful for constructing social networks.

To use it you need a text file consisting of a list of Twitter feeds (i.e., screen names), one per line. The location of this file is given in the “inputfile” line.
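
For example, feedme.txt might look like this (these screen names are just placeholders):

somehandle1
somehandle2
somehandle3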

The “howmany” line controls how many tweets back it goes in each feed.

The “outputfile” line says where the output goes. Note that the script opens this file in append mode, so if you run it repeatedly you can get some redundant data, which you can fix by running this bash code:

sort mytweets.txt | uniq > tmp
mv tmp mytweets.txt

The outputfile has no headers, but they are as follows (a sketch for reading the file back into R appears after the list):

#v1 ignore field, just shows number w/in query
#v2 text of the Tweet
#v3 favorited dummy
#v4 replytosn (mention screenname)
#v5 created (date in YMDhms)
#v6 truncated dummy
#v7 replytosid
#v8 id
#v9 replytouid
#v10 statussource (Twitter client)
#v11 screenname
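
If you are staying in R rather than moving to Stata, a minimal sketch for reading the accumulated file back in with these names (assuming the tab-delimited, headerless format written by the script below, and “tweets” as a hypothetical object name):

tweets <- read.delim("~/mytweets.txt", header=FALSE, stringsAsFactors=FALSE)
names(tweets) <- c("rownum","text","favorited","replytosn","created","truncated","replytosid","id","replytouid","statussource","screenname")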

Unfortunately, the script doesn’t handle multi-line tweets very well, as I’m not sufficiently good at R to regexp out the internal EOL characters. I’ll be happy to work this in if anyone cares to post some code to the comments on how to do a find and replace that zaps the internal EOL characters in the field tmptimeline.df$text.
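
Update: Trey and Geoffrey supply exactly such a fix in the comments below. Combining their two suggestions, inserting this line right after the twListToDF() call zaps the embedded newlines, carriage returns, and tabs:

tmptimeline.df$text <- gsub("\\n|\\r|\\t", "", tmptimeline.df$text)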

library(twitteR)
howmany <- 30 #how many past tweets to collect per feed
inputfile <- "~/feedme.txt" #list of Twitter feeds, one per line
outputfile <- "~/mytweets.txt" #tab-delimited output, opened in append mode

#read the feed list in as a character vector
feeds <- as.vector(t(read.table(inputfile)))
for (user in feeds) {
	#pull this user's most recent tweets
	tmptimeline <- userTimeline(user,n=howmany)
	#flatten the list of status objects into a data frame
	tmptimeline.df <- twListToDF(tmptimeline)
	#append to the running output file, tab-delimited, no header row
	write.table(tmptimeline.df,file=outputfile,append=TRUE,sep="\t",col.names=FALSE)
}
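
One caveat: if any feed in the list is protected, suspended, or misspelled, userTimeline() may throw an error and stop the whole loop. A minimal defensive variant of the loop, sketched with base R’s try() (untested against the API, so treat it as a starting point):

for (user in feeds) {
	#wrap the call so one bad feed doesn't kill the run
	tmptimeline <- try(userTimeline(user,n=howmany), silent=TRUE)
	if (inherits(tmptimeline, "try-error")) next #skip this feed and move on
	tmptimeline.df <- twListToDF(tmptimeline)
	write.table(tmptimeline.df,file=outputfile,append=TRUE,sep="\t",col.names=FALSE)
}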

Finally, if you import the output file into Stata you’ll probably want to run this:

drop v1
ren v2 text
ren v3 favorited 
ren v4 replytosn
ren v5 created
ren v6 truncated 
ren v7 replytosid
ren v8 id
ren v9 replytouid
ren v10 statussource
ren v11 screenname
gen double timestamp=clock(subinstr(created,"-","/",.),"YMDhms")
format timestamp %tc
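
(The subinstr() call swaps the hyphens in created for slashes, and clock() then parses the result with the “YMDhms” mask into a Stata datetime, which the %tc format displays readably.)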



5 Comments

  • 1. Trey  |  December 13, 2011 at 1:59 pm

    Glad to see this worked out. A couple of follow-up points for potential users: if you want to do this as a one-off, you can set the col.names argument to TRUE and the append argument to FALSE and you’ll get variable names in the output. If you want to keep a running file (as I suspect almost everyone will), then you’ll want to leave these options as listed above.

    Quick and dirty fix to deal with multi-line tweets: insert the following right after the twListToDF() line:

    tmptimeline.df$text <- gsub("\\n", "", tmptimeline.df$text)

    For what it’s worth, right now you can make 150 unauthenticated calls to Twitter an hour (or 350 authenticated OAuth calls), although you can make more calls than that using the Search API. More information is in Twitter’s Rate Limiting FAQ.

    • 2. gabrielrossman  |  December 13, 2011 at 2:55 pm

      Thanks, both for this and your earlier help by email.
      One trick with the “col.names=TRUE” is that all the names are offset by one. Don’t ask me why, I just work here.
      Also good to know about the 150 v 350 issue.

  • 3. Geoffrey Fojtasek  |  December 13, 2011 at 3:17 pm

    You may still have some newlines if you don’t sub out “\r”, and there are also likely to be some tabs that end up being read as delimiters, so the gsub pattern should be “\\n|\\r|\\t”.

    • 4. Trey  |  December 13, 2011 at 3:23 pm

      Yep, good point (hence the quick and dirty).

  • 5. Scraping Using twitteR (updated) « Code and Culture  |  December 20, 2011 at 11:04 pm

    […] Last time I described using the twitteR library for R. In that post I had R itself read over a list to loop. In this post I have the looping occur in Bash, with argument passing to R through the commandArgs() function. […]

