Scraping Twitter

March 1, 2011 at 4:51 am

| Gabriel |

I recently got interested in a set of communities that have a big Twitter presence, so I wrote some code to collect their tweets. I started out creating OPML bookmark files that I could give to RSSOwl,* but ultimately decided to do everything with wget so I can cron it on my server. Nonetheless, I’m including the code for creating OPML in case people are interested.

Anyway, here’s the directory structure for the project:

project/
  a/
  b/
  c/
  lists/

I have a list of Twitter web URLs for each community that I store in “project/lists” and call “twitterlist_a.txt”, “twitterlist_b.txt”, etc. I collected these lists by hand, but if there’s a directory listing on the web you could parse it to get the URLs. Each of these files is just a list of URLs, one per line, and looks like this:

http://twitter.com/adage
http://twitter.com/NotGaryBusey
http://twitter.com/science

I then run these lists through a Bash script called “twitterscrape.sh”, which is run from “project/” and takes the name of the community as an argument. It collects the Twitter page from each URL and extracts the RSS feed and “Web” link for each. It writes the RSS feeds both to an OPML file (an XML file that RSS readers treat as a bookmark list) and to a plain text file. It also saves each account’s “Web” link to a file suitable for later use with wget.

#!/bin/bash
#twitterscrape.sh
#take list of twitter feeds
#extract rss feed links and convert to OPML (XML feed list) format
#extract weblinks 

#get current first page of twitter feeds
cd $1
wget -N --input-file=../lists/twitterlist_$1.txt
cd ..

#parse feeds for RSS, gen opml file (for use with RSS readers like RSSOwl)
echo -e "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<opml version=\"1.0\">\n\t<head>\n\t</head>\n\t<body>" > lists/twitrss_$1.opml
grep -r 'xref rss favorites' ./$1 | perl -pe 's/.+\/(.+):     .+href="\/favorites\/(.+)" class.+\n/\t\t<outline text="$1" type="rss" xmlUrl="http:\/\/twitter.com\/statuses\/user_timeline\/$2"\/>\n/' >> lists/twitrss_$1.opml
echo -e "\t</body>\n</opml>\n" >> lists/twitrss_$1.opml

#make simple text list out of OPML (for use w wget)
grep 'http\:\/' lists/twitrss_$1.opml | perl -pe 's/\s+\<outline .+(http.+\.rss).+$/$1/' > lists/twitrss_$1.txt

#parse Twitter feeds for link to real websites (for use w wget)
grep -h -r '>Web</span>' ./$1 | perl -pe 's/.+href="(.+)" class.+\n/$1\n/' > lists/web_$1.txt

echo -e "\nIf using GUI RSS, please remember to import the OPML feed into RSSOwl or Thunderbird\nIf cronning, set up twitterscrape_daily.sh\n"

#have a nice day
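
For reference, running the script from project/ as, e.g., “bash twitterscrape.sh a” gives you an OPML file that looks roughly like the sketch below. The outline text and user_timeline targets here are hypothetical placeholders; your own run will contain whatever the regexes pull out of the downloaded pages.

<?xml version="1.0" encoding="UTF-8"?>
<opml version="1.0">
	<head>
	</head>
	<body>
		<outline text="adage" type="rss" xmlUrl="http://twitter.com/statuses/user_timeline/adage.rss"/>
		<outline text="science" type="rss" xmlUrl="http://twitter.com/statuses/user_timeline/science.rss"/>
	</body>
</opml>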

This basically gives you a list of RSS feeds (in both OPML and TXT), but you still need to scrape them daily (or however often). If you’re using RSSOwl, “import” the OPML file. I started by doing this, but decided to cron it instead with two scripts.

The script twitterscrape_daily.sh collects the RSS files, calls the Perl script to do some of the cleaning, appends the cleaned information to a cumulative file, and then deletes the temp files. Note that Twitter only lets you get 150 RSS feeds within a short amount of time; any more and it cuts you off. As such, you’ll want to stagger the cron jobs (a sample crontab sketch appears after the script). To see whether you’re running into trouble, the file project/masterlog.txt counts how many “400 Error” messages turn up per run. Usually these are Twitter turning you down because you’ve already collected a lot of data in a short amount of time. If you get this a lot, try splitting a large community in half, spacing out your crons a bit more, and/or changing your IP address.

#!/bin/bash
#twitterscrape_daily.sh
#collect twitter feeds, reshape from individual rss/html files into single tab-delimited text file

DATESTAMP=`date '+%Y%m%d'`
PARENTPATH=~/project
TEMPDIR=$PARENTPATH/$1/$DATESTAMP

#get current first page of twitter feeds
mkdir $TEMPDIR
cd $TEMPDIR
wget -N --random-wait --output-file=log.txt --input-file=$PARENTPATH/lists/twitrss_$1.txt

#count "400" errors (ie, server refusals) in log.txt, report to master log file
echo "$1  $DATESTAMP" >> $PARENTPATH/masterlog.txt
grep 'ERROR 400\: Bad Request' log.txt | wc -l >> $PARENTPATH/masterlog.txt

#(re)create simple list of files
sed -e 's/http:\/\/twitter.com\/statuses\/user_timeline\///' $PARENTPATH/lists/twitrss_$1.txt > $PARENTPATH/lists/twitrssfilesonly_$1.txt

for i in $(cat $PARENTPATH/lists/twitrssfilesonly_$1.txt); do perl $PARENTPATH/twitterparse.pl $i; done
for i in $(cat $PARENTPATH/lists/twitrssfilesonly_$1.txt); do cat $TEMPDIR/$i.txt >> $PARENTPATH/$1/cumulativefile_$1.txt ; done

#delete the individual feeds (keep only "cumulativefile") to save disk space
#alternately, could save as tgz
rm -r $TEMPDIR

#delete duplicate lines
sort $PARENTPATH/$1/cumulativefile_$1.txt | uniq > $PARENTPATH/$1/tmp 
mv $PARENTPATH/$1/tmp $PARENTPATH/$1/cumulativefile_$1.txt

#have a nice day
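
Staggering is easy to handle in cron itself. Here is a hypothetical crontab sketch, assuming the scripts live in ~/project and that the three communities run a couple of hours apart; the times and the cron.log path are placeholders to adjust for your own setup (edit with “crontab -e”):

#stagger the communities so no single window requests too many feeds
0 3 * * * bash $HOME/project/twitterscrape_daily.sh a >> $HOME/project/cron.log 2>&1
0 5 * * * bash $HOME/project/twitterscrape_daily.sh b >> $HOME/project/cron.log 2>&1
0 7 * * * bash $HOME/project/twitterscrape_daily.sh c >> $HOME/project/cron.log 2>&1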

Most of the cleaning is accomplished by twitterparse.pl. There’s no need to cron this script since twitterscrape_daily.sh calls it, but it does need to sit in the project/ directory, where that script looks for it.

#!/usr/bin/perl
#twitterparse.pl by ghr
#this script cleans RSS files scraped by WGET 
#usually run automatically by twitterscrape_daily.sh

use warnings; use strict;
die "usage: twitterparse.pl <foo.rss>\n" unless @ARGV==1;

my $rawdata = shift(@ARGV);

my $channelheader = 1 ; #flag for in the <channel> (as opposed to <item>)
my $feed = "" ;   #name of the twitter feed <channel><title>
my $title = "" ;  #item title/content <item><title> (or <item><description> for Twitter)
my $date = "" ;   #item date <item><pubDate>
my $source = "" ; #item source (aka, iphone, blackberry, web, etc) <item><twitter:source>

print "starting to read $rawdata\n";

open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">$rawdata.txt") or die "error creating $rawdata.txt\n";
while (<IN>) {
	#find if in <item> (ie, have left <channel>)
	if($_ =~ m/^\s+\<item\>/) {
		$channelheader = 0;
	}
		
	#find title of channel
	if($channelheader==1) {	
		if($_ =~ m/\<title\>/) {
			$feed = $_;
			$feed =~ s/\s+\<title\>(.+)\<\/title\>\n/$1/; #drop tags and EOL
			print "feed identified as: $feed\n";
		}
	}

	#find all <item> info and write out at </item>
	if($channelheader==0) {	
		#note: cannot handle internal LF characters
		#doesn't crash, but it leaves in the leading tag
		#only an issue for title/description, so ignore for now
		if($_ =~ m/\<title\>/) {
			$title = $_;
			$title =~ s/\015?\012?//g; #manual chomp, global to allow internal \n
			$title =~ s/\s+\<title\>//; #drop leading tag
			$title =~ s/\<\/title\>//; #drop closing tag
		}
		if($_ =~ m/\<pubDate\>/) {
			$date = $_;
			$date =~ s/\s+\<pubDate\>(.+)\<\/pubDate\>\n/$1/; #drop tags and EOL
		}
		if($_ =~ m/\<twitter\:source\>/) {
			$source = $_;
			$source =~ s/\s+\<twitter\:source\>(.+)\<\/twitter\:source\>\n/$1/; #drop tags and CRLF
			$source =~ s/&lt;a href=&quot;http:\/\/twitter\.com\/&quot; rel=&quot;nofollow&quot;&gt;(.+)&lt;\/a&gt;/$1/; #cleanup long sources
		}
		#when item close tag is reached, write out then clear memory
		if($_ =~ m/\<\/item\>/) {
			print OUT "\"$feed\"\t\"$date\"\t\"$title\"\t\"$source\"\n";
			#clear memory (for <item> fields) 
			$title = "" ;
			$date = "" ;
			$source = "" ;
		}
	}
}
close IN;
close OUT;
print "done writing $rawdata.txt \n";
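
Each tweet ends up in cumulativefile_$1.txt as four tab-separated, quoted fields: feed title, date, tweet text, and source. Just to make the format concrete, a made-up line might look like the following (the feed title, timestamp, and text are invented; the feed title is whatever Twitter puts in the channel’s <title>):

"adage"	"Tue, 01 Mar 2011 14:02:11 +0000"	"an example tweet would go here"	"web"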

*In principle you could use Thunderbird instead of RSSOwl, but its RSS support has some annoying bugs. If you do use Thunderbird, you’ll need to use this plug-in to export the mailboxes. Also note that by default, RSSOwl only keeps the 200 most recent posts. You’ll want to disable this setting, either globally in the “preferences” or specifically in the “properties” of the particular feed.


4 Comments

  • 1. elias  |  March 1, 2011 at 8:09 am

    Nice job, using bash is quite a decision… I love bash but for this purpose I preferred python…

    Blogs on philosophy of science

  • 2. Neal  |  March 1, 2011 at 10:00 am

    Nice work. You might want to check out the Twitter API Documentation. I haven’t done anything with the search API, but my gut is that you will get a lot fewer error messages. Also, you can get up to the last 1,500 tweets per user this way.

  • 3. gabrielrossman  |  March 1, 2011 at 1:49 pm

    elias and neal,

    agreed on both counts that it would be preferable to do this with Python and/or through the API. that i didn’t reflects that i’m a self-taught and somewhat idiosyncratic programmer. the larger archive through the API is especially tempting though.

  • 4. Scraping Using TwitteR « Code and Culture  |  December 13, 2011 at 1:21 pm

    […] I’d discussed scraping Twitter using Bash and Perl. Then yesterday on an orgtheory thread Trey mentioned the R library TwitteR and with some help from […]

