Memetracker into Stata

February 8, 2010 at 4:31 am

| Gabriel |

A few months ago I mentioned the Memetracker project, which scrapes the internet looking for the diffusion of (various variants of) catchphrases. I wanted to play with the dataset but there were a few tricks. First, the dataset is really, really big. The summary file is 862 megabytes stored as text and would no doubt be bigger in Stata (because of how Stata allocates memory to string variables). Second, the data is in a moderately complicated hierarchical format, with “C” specific occurrences nested within “B” phrase variants, which are in turn nested within “A” phrase families. You can immediately identify whether a row is A, B, or C by the number of leading tabs (0, 1, and 2, respectively).
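Schematically, a cluster looks something like this (tab-delimited; these rows are made up for illustration and are not actual records from the file):

37	1088	i can see russia from my house	36543
	580	300	i can see russia from my house	36544
		2008-09-15 10:01:56	1	B	http://exampleblog.example.com/palin
		2008-09-15 11:23:08	1	M	http://news.example.com/palin-quote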

I figured that the best way to get this data into Stata would be to create two flat files: one a record of all the “A” records, which I call “key”, and the other a simplified version of all the “C” records, but with the key variable to allow merging with the “A” records. Rather than do this all in Stata, I figured it would be better to pre-process it in perl, which reads text one line at a time and is thus well-suited to handling very large files. The easy part was making a first pass through the file with grep to create the “key” file by copying all the “A” rows (i.e., those with no leading tabs).

Slightly harder was culling the “C” rows. If I had just wanted the “C” rows this would have been easy, but I wanted to associate them with the cluster key variable from the “A” rows. This requires looking for “A” rows, copying the key, and keeping it in memory until the next “A” row; meanwhile, every time the script hits a “C” row, it copies it but adds in the key variable from the most recent “A” row. Both for debugging and because I get nervous when a program doesn’t give any output for several minutes, I have it print every new “A” key to the screen. Finally, to keep the file size down, I set a floor to eliminate reasonably rare phrase clusters (anything with fewer than 500 occurrences total).

At that point I had two text files: “key”, which associates the phrase cluster serial number with the actual phrase string, and “data”, which records occurrences of the phrases. The reason I didn’t merge them is that doing so would massively bloat the file size, and it’s not necessary for analytic purposes. Anyway, at this point I could easily get both the key and data files into Stata and do whatever I want with them. As a first pass, I graphed the time series for each catchphrase, with and without special attention drawn to mentions occurring in the top ten news websites.
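If you ever do need a single merged file, it’s a one-line merge on the key variable. A minimal sketch (assuming the m:1 merge syntax introduced in Stata 11 and the two .dta files created by the code below):

use data, clear
merge m:1 clid using key, keep(match) nogenerate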

Here’s a sample graph. [Figure: histogram of one phrase cluster’s occurrences over time.]

Here’s the perl file:

#!/usr/bin/perl
#mt_clean.pl by ghr
#this script cleans the memetracker.org "phrase cluster" data
#http://snap.stanford.edu/data/d/quotes/Old-UniqUrls/clust-qt08080902w3mfq5.txt.gz
#script takes the (local and unzipped) location of this file as an argument
#throws out much of the data, saves as two tab flatfiles
#"key.txt" which associates cluster IDs with phrases
#"data.txt" which contains individual observations of the phrases
# input
# A:  <ClSz>  <TotFq>  <Root>  <ClId>
# B:          <QtFq>   <Urls>  <QtStr>  <QtId>
# C:                   <Tm>    <Fq>     <UrlTy>  <Url>
# output, key file
# A:  <ClSz>  <TotFq>  <Root>  <ClId>
# output, data file
# C:<ClID>	<Tm>	<UrlTy>	<URL>
# make two passes.

use warnings; use strict;
die "usage: mt_clean.pl <phrase cluster data>\n" unless @ARGV==1;

#define minimum number of occurrences a phrase must have
my $minfreq = 500;

my $rawdata = shift(@ARGV);
# use bash grep to write out the "key file"
system("grep '^[0-9]' $rawdata > key.txt");

# read again, and write out the "data file"
# if line=A, redefine the "clid" variable
#  (stop once TotFq falls below $minfreq -- this assumes clusters
#  are sorted in descending order of frequency)
# if line=B, skip
# if line=C, write out with "clid" in front
my $clid;
open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">data.txt") or die "error creating data.txt\n";
print OUT "clid\ttm\turlty\turl\n";
while (<IN>) {
	#match "A" lines by looking for numbers in field 0
	if($_=~ /^\d/) {
		my @fields = split("\t", $_); #parse as tab-delimited text
		if($fields[1] < $minfreq) { last;} #quit at the first rare phrase (assumes descending frequency sort)
		$clid = $fields[3]; #record the ClID
		$clid =~ s/\015?\012//; #manual chomp
		print "$clid ";
	}
	#match "C" lines, write out with clid
	if ($_ =~ m/^\t\t/) {
		chomp;
		my @fields = split("\t", $_);
		print OUT "$clid\t$fields[2]\t$fields[4]\t$fields[5]\n";
	}
}
close IN;
close OUT;
print "\ndone\n";

And here’s the Stata file:

clear
set mem 500m
set more off
cd ~/Documents/Sjt/memetracker/
*import key, or "A" records
insheet using key.txt, clear
ren v1 clsz
ren v2 totfq
ren v3 root
ren v4 clid
sort clid
lab var clsz "cluster size, n phrases"
lab var totfq "total frequency"
lab var root "phrase"
lab var clid "cluster id"
compress
save key, replace
*import data, or "C" records
insheet using data.txt, clear
drop if clid==.
gen double timestamp=clock(tm,"YMDhms")
format timestamp %tc
drop tm
gen hostname=regexs(1) if regexm(url, "http://([^/]+)") /*get the website, leaving out the filepath*/
drop url
gen blog=0
replace blog=1 if urlty=="B"
replace blog=1 if hostname=="blog.myspace.com"
gen technoratitop10=0 /*note, as of 2/3/2010, some mismatch with late 2008 memetracker data*/
foreach site in huffingtonpost.com engadget.com gizmodo.com mashable.com techcrunch.com boingboing.net gawker.com corner.nationalreview.com thedailybeast.com tmz.com {
	replace technoratitop10=1 if hostname=="`site'"
}
gen alexanews10=0 /*as with technorati, anachronistic*/
foreach site in news.yahoo.com bbc.co.uk cnn.com news.bbc.co.uk news.google.com nytimes.com msnbc.msn.com foxnews.com {
	replace alexanews10=1 if hostname=="`site'"
}
drop urlty
sort clid timestamp
contract _all /*eliminate redundant "C" records (from different "B" branches)*/
drop _freq
save data, replace
*draw a graph of each meme's occurrences
levelsof clid, local(clidvalues)
foreach clid in `clidvalues' {
	disp "`clid'"
	quietly use key, clear
	quietly keep if clid==`clid'
	local title=root[1]
	quietly use data, clear
	histogram timestamp if clid==`clid', frequency xlabel(#5, labsize(small) angle(forty_five)) title(`title', size(medsmall))
	graph export graphs/`clid'.png, replace
	twoway (histogram timestamp if clid==`clid') (line alexanews10 timestamp if clid==`clid', yaxis(2)), legend(off) xlabel(#5, labsize(small) angle(forty_five)) title(`title', size(medsmall))
	graph export graphs_alexa/`clid'.png, replace
}
*have a nice day



7 Comments

  • 1. mike3550  |  February 8, 2010 at 10:27 pm

    One thing that I have thought about playing with recently is using relational databases to store very large files like this. It is not efficient if you are analyzing the whole file at the same time, but if you are interested in cluster-by-cluster analyses, then the hooks in Stata aren’t that bad and it has the advantage that you can use SQL code to efficiently run searches. I think that they are much quicker than anything Stata can do, especially with very large datasets with lots of variables.
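    For instance (a hypothetical sketch; the DSN, table name, and cluster ID here are made up, not from the post), a single cluster could be pulled from such a database into Stata with:

    odbc load, exec("SELECT clid, tm, url FROM data WHERE clid = 36543") dsn("memetracker") clear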

  • 2. [pingback]  |  […] actually already done this kind of thing twice, with my code for cleaning Memetracker and the IMDb business file. However those two datasets had the convenient property that the record […]

  • 3. John  |  July 28, 2012 at 1:40 pm

    Question about your time series graph: on the time axis, the spacing of your dates doesn’t seem to be systematically placed (at least from the point of view of a human). Do you know how to control the axis labels of %tc-formatted variables in time series graphs?

    • 4. Nick Cox  |  August 15, 2012 at 6:52 pm

      Looking at the time labels on the graph shows that they are “nice numbers” only in terms of the underlying clock time measured in milliseconds:

      di %18.0f clock("25jun2008 08:00:00", "DMY hms")
      1530000000000

      di %18.0f clock("22aug2008 4:53:20", "DMY hms")
      1535000000000

      Nicer dates can be calculated from the dates you want to show using the clock() function. First put what you want to show in a local macro:

      local dates `" "1 Aug 2008" "1 Sep 2008" "1 Oct 2008" "1 Nov 2008" "1 Dec 2008" "1 Jan 2009" "1 Feb 2009" "'

      Then loop over those dates and calculate the clock time for each:

      foreach d of local dates {
          local xdates `xdates' `=clock("`d'", "DMY")' `"`d'"'
      }

      Now what you have is a list that you can insert in the -xlabel()- option as -xlabel(`xdates')-.

      di `"`xdates'"'

      1533168000000 `"1 Aug 2008"' 1535846400000 `"1 Sep 2008"' 1538438400000 `"1 Oct 2008"' 1541116800000 `"1 Nov 2008"' 1543708800000 `"1 Dec 2008"' 1546387200000 `"1 Jan 2009"' 1549065600000 `"1 Feb 2009"'

      The principles are very similar to those in
      http://www.stata.com/support/faqs/graphics/date-labels/
      Despite what it says, that FAQ is not completely superseded.

      See also documentation of the -tlabel()- option.

      • 5. gabrielrossman  |  August 15, 2012 at 7:00 pm

        thanks nick

      • 6. Nick Cox  |  August 16, 2012 at 4:11 am

        I’ve now edited -mylabels- (SSC) so that this works

        mylabels "1 Aug 2008" "1 Sep 2008" "1 Oct 2008" "1 Nov 2008" "1 Dec 2008" "1 Jan 2009" "1 Feb 2009", myscale(clock("@", "DMY")) local(labels)

        So, the sequence would be

        1. You look at an initial graph, or the data, and decide what axis labels you want.

        2. -mylabels- has one and only one role: to do the fiddly little calculations of exactly where to put the text that will be the labels, and to bundle the positions and labels in a local macro. There has to be a way of calculating the positions from the labels. “Christmas 2008” wouldn’t work, for example, with -clock()- as above.

        3. The local macro is what you then use in the graph call.

        I don’t think that there is a way to do this with -tlabel()-, although I would be happy to be corrected. The issue arises with timestamped data spanning even a few weeks, let alone a few months or years: the timestamp scale is what the data come in, but is unfriendly for graphics.

        It may be a short while before the revised -mylabels- files show up on SSC.

  • 7. Nick Cox  |  August 16, 2012 at 3:09 am

    The x axis labels show times of day as well as dates, which I guess are not interesting.

    xla(, format(%tcD_m_CY))

    would trim off the times.

    This comment is independent of #4 above.

