Posts tagged ‘perl’

Getting long flat-files out of field-tagged data

| Gabriel |

Some field-tagged data can be pretty unproblematically reshaped into flat-files. However one of the reasons people like field-tagged data is that they can have internal long structures and this creates problems for reshaping. For instance in Web of Science, the “CR” field (works that are being cited by the main work) usually has dozens of references separated by carriage returns. To get this into most statistic packages it has to be reshaped into a long flat-file. In other words, you need a script to turn this:

LA English
DT Article
   ROSENHAN DL, 1973, SCIENCE, V179, P250
   SCHEFF TJ, 1964, SOC PROBL, V11, P401
NR 5
TC 5
PA 1722 N ST NW, WASHINGTON, DC 20036-2981
SN 0092-055X
JI Teach. Sociol.
PY 1975
VL 2
IS 2
BP 207
EP 211
PG 5
SC Education & Educational Research; Sociology
GA AD732
UT ISI:A1975AD73200007

Into this:

ISI:A1975AD73200007	TEACH SOCIOL	SCHEFF TJ, 1964, SOC PROBL, V11, P401

Note that the other fields usually lack carriage returns or other internal long delimiters so they can be cleaned like this. The two approaches can then be merged (in a flat-file like Stata) or linked (in a relational like R or Access) using the key.

I’ve actually already done this kind of thing twice, with my code for cleaning Memetracker and the IMDb business file. However those two datasets had the convenient property that the record key appears in the first row of the record. With data structures like these, you just remember the key then every time you come across a long entry, write it out along with the key.

Unfortunately, Web of Science has the record key appear towards the end of the record, a data structure that I propose to call “a huge pain in the ass.” This means that you have to collect all the long values in an array, then record the record key, then loop over the array to write out.

#!/usr/bin/perl by ghr
#this script converts field-tagged WOS queries to tab-delimited text
#it extracts the CR field and attaches to it the fields UT and J9
#unlike wos2tab, this file outputs long-formatted data
#the two types of output can be merged with the UT field

#sice CR comes /before/ UT and J9, must save as array, then loop over array at ER

use warnings; use strict;
die "usage: <wos data>\n" unless @ARGV==1;

my $rawdata = shift(@ARGV);

my $ut = "" ; #unique article identifier
my $j9 = "" ; #j9 coding of journal title
my $cr = "" ; #cited work
my @crlist = () ; #list of cited works
my $cr_continued = 0 ; #flag for recently hitting "^CR"

print "starting to read $rawdata\n";

open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">$rawdata.long") or die "error creating $rawdata.long\n";
print OUT "ut\tj9\tcr\n";
while (<IN>) {
	# if begins with non-whitespace character other than CR, flag cr_continued as not
	if($_ =~ m/^\S/) {
		if($_ =~ m/^[^CR]/) {$cr_continued = 0}
	if($_ =~ m/^J9/) {
		$j9 = $_;
		$j9 =~ s/\015?\012//; #manual chomp
		$j9 =~ s/^J9 //; #drop leading tag
	if($_ =~ m/^UT/) {
		$ut = $_;
		$ut =~ s/\015?\012//; #manual chomp
		$ut =~ s/^UT //; #drop leading tag
	#first line of a CR field 
	if($_ =~ m/^CR/) {
		$cr = $_;
		$cr =~ s/\015?\012//; #manual chomp
		$cr =~ s/^CR //; #drop leading tag
		$cr_continued = 1 ; #flag to allow for multi-line CR field
		push @crlist, $cr; #add the new cite to the list
	#subsequent lines of a CR field
	if($_ =~ m/^  /) {
		if($cr_continued==1) {
			$cr = $_ ;
			$cr =~ s/\015?\012//; #manual chomp
			$cr =~ s/^   //; #drop leading tag
			push @crlist, $cr; #add the new cite to the list
	#when "end record" code is reached, loop over array to write out as long file, then clear memory
	if($_=~ /^ER/) {
		#loop over CRLIST array, printing for each value so as to have long file
		for (my $i = 0; $i < @crlist; $i++) {
			print OUT "$ut\t$j9\t$crlist[$i]\n"; #write out
		#clear memory, just do once per "end record"
		$j9 = "" ;
		$ut = "" ;
		$cr = "" ;
		$cr_continued = 0 ;
		@crlist = () ;
close IN;
close OUT;
print "done writing $rawdata.long \n";

August 12, 2010 at 4:08 am 2 comments

Some ways Stata is an unusual language

| Gabriel |

As I’ve tried to learn other languages, I’ve realized that part of the difficulty isn’t that they’re hard (although in some cases they are) but that I’m used to Stata’s very distinctive paradigm and nomenclature. Some aspects of Stata are pretty standard (e.g., “while”/”foreach”/”forvalues” loops, log files, and the “file” syntax for using text files on disk), but other bits are pretty strange. Or rather, they’re strange from a computer science perspective but intuitive from a social science perspective.

Stata seems to have been designed to make sense to social scientists and if this makes it confusing to programmers, then so be it. A simple example of this is that Stata uses the word “variable” in the sense meant by social scientists. More broadly, Stata is pretty bold about defaults so as to make things easy for beginners. It presumes that anything you’re doing applies to the dataset (aka the master data)  — which is always a flat-file database. Other things that might be held in memory have a secondary status and beginning users don’t even know that they’re there. Likewise, commands distinguish between the important arguments (usually variables) and the secondary arguments, which Stata calls “options”. There’s also the very sensible assumptions about what to report and what to put in ephemeral data objects that can be accessed immediately after the primary command (but need not be stored as part of the original command, as they would in most other languages).

Note, I’m not complaining about any of this. Very few of Stata’s quirks are pointlessly arbitrary. (The only arbitrary deviation I can think of is using “*” instead of “#” for commenting). Most of Stata’s quirks are necessary in order to make it so user-friendly to social scientists. In a lot of ways R is a more conventional language than Stata, but most social scientists find Stata much easier to learn. In part because Stata is willing to deviate from the conventions of general purpose programming languages, running and interpreting a regression in Stata looks like this “reg y x” instead of this “summary(lm(y~x))” and loading a dataset looks like this “use mydata, clear” instead of this “data <- read.table(mydata.txt)”. Stata has some pretty complicated syntax (e.g., the entire Mata language) but you can get a lot done with just a handful of simple commands like “use,” “gen,” and “reg”.

Nonetheless all this means that when Stata native speakers like me learn a second programming language it can be a bit confusing. And FWIW, I worry that rumored improvements to Stata (such as allowing relational data in memory) will detract from its user-friendliness. Anyway, the point is that I love Stata and I think it’s entirely appropriate for social scientists to learn it first. I do most of my work in Stata and I teach/mentor my graduate students in Stata unless there’s a specific reason for them to learn something else. At the same time I know that many social scientists would benefit a lot from also learning other languages. For instance, people into social networks should learn R, people who want to do content analysis should learn Perl or Python, and people who want to do simulations should learn NetLogo or Java. The thing is that when you do, you’re in for a culture shock and so I’m making explicit some ways in which Stata is weird.

Do-files and Ado-files. In any other language a do-file would be called a script and an ado-file would be called a library. Also note that Stata very conveniently reads all your ado-files automatically, whereas most other languages require you to specifically load the relevant libraries into memory at the beginning of each script.

Commands, Programs, and Functions. In Stata a program is basically just a command that you wrote yourself. Stata is somewhat unusual in drawing a distinction between a command/program and a function. So in Stata a function usually means some kind of transformation that attaches its output to a variable or macro, as in “gen ln_income=log(income)”. In contrast a command/program is pretty much anything that doesn’t directly attach to an operator and includes all file operations (e.g., “use”) and estimations (e.g, “regress”). Other languages don’t really draw this distinction but consider everything a function, no matter what it does and whether the user wrote it or not. (Some languages use “primitive” to mean something like the Stata command vs. program distinction, but it’s not terribly important).

Because most languages only have functions this means that pretty much everything has to be assigned to an object via an operator. Hence Stata users would usually type “reg y x” whereas R users would usually type “myregression <- lm(y~x)”. This is because “regress” in Stata is a command whereas “lm()” in R is a function. Also note that Stata distinguishes between commands and everything else by word order syntax with the command being the first word. In contrast functions in other languages (just like Stata functions) have the function being the thing outside the parentheses and inside the parentheses goes all of the arguments, both data objects and options.

The Dataset. Stata is one of the only languages where it’s appropriate to use the definite article in reference to data. (NetLogo is arguably another case of this). In other languages it’s more appropriate to speak of “a data object” than “the dataset,” even if there only happens to be one data object in memory. For the same reason, most languages don’t “use” or “open” data, but “read” the data and assign it to an object. Another way to think about it is that only Stata has a “dataset” whereas other languages only have “matrices.” Of course, Stata/Mata also has matrices but most Stata end users don’t bother with them as they tend to be kind of a backend thing that’s usually handled by ado-files. Furthermore, in other languages (e.g., Perl) it’s common to not even load a file into memory but to process it line-by-line, which in Stata terms is kind of like a cross between the “file read/write” syntax and a “while” loop.

Variables. Stata uses the term “variable” in the statistical or social scientific meaning of the term. In other languages this would usually be called a field or vector.

Macros. What most other languages call variables, Stata calls local and global “macros.” Stata’s usage of the local vs global distinction is standard. In other languages the concept of “declaring” a variable is usually a little more explicit than it is in Stata.

Stata is extremely good about expanding macros in situ and this can spoil us Stata users. In other languages you often have to do some kind of crude work around by first using some kind of concatenate function to create a string object containing the expansion and then you use that string object. For instance, if you wanted to access a series of numbered files in Stata you could just loop over this:

use ~/project/file`i', clear 

In other languages you’d have to add a separate line for the expansion. So in R you’d loop over:

filename <- paste('~/project/file',i, sep="")
data <- read.table(filename)

[Update: Also see this Statalist post by Nick Cox on the distinction between variables and macros]

Reporting. Stata allows you to pass estimations on for further work (that’s what return macros, ereturn matrices, and postestimation commands are all about), but it assumes you probably won’t and so it is unusually generous in reporting most of the really interesting things after a command. In other languages you usually have to specifically ask to get this level of reporting. Another way to put it is that in Stata verbosity is assumed by default and can be suppressed with “quietly,” whereas in R silence is assumed by default and verbosity can be invoked by wrapping the estimation (or an object saving the estimation) in the “summary()” function.

August 6, 2010 at 4:36 am 12 comments

| Gabriel |

As previously remarked, IMDb files have a weird structure that ain’t exactly ready to rock. I already posted a file for dealing with business.list (which could also be modified to work with files like certificates.list). The personnel files (actors.list, actresses.list, directors.list, writers.list, etc) look like this:

Gilligan, Vince		2-Face (2013)  (screenplay)
			A.M.P.E.D. (2007) (TV)  (writer)
			Hancock (2008)  (written by)  <1,2,1>
			Home Fries (1998)  (written by)  <1,1,1>
			The X Files: Revelations (2008) (V)  (written by) (segment "Bad Blood")  <8,1,1>
			The X Files: Revelations (2008) (V)  (written by) (segment "Memento Mori")  <6,1,3>
			Wilder Napalm (1993)  (written by)  <1,1,1>
			"Breaking Bad" (2008)  (creator)
			"Breaking Bad" (2008) {(#3.12)}  (creator)  <1,1,1>
			"Breaking Bad" (2008) {(#3.13)}  (creator)  <1,1,1>

Whereas we’re used to data that looks like this:

Gilligan, Vince	2-Face (2013)  (screenplay)
Gilligan, Vince	A.M.P.E.D. (2007) (TV)  (writer)
Gilligan, Vince	Hancock (2008)  (written by)  <1,2,1>
Gilligan, Vince	Home Fries (1998)  (written by)  <1,1,1>
Gilligan, Vince	The X Files: Revelations (2008) (V)  (written by) (segment "Bad Blood")  <8,1,1>
Gilligan, Vince	The X Files: Revelations (2008) (V)  (written by) (segment "Memento Mori")  <6,1,3>
Gilligan, Vince	Wilder Napalm (1993)  (written by)  <1,1,1>
Gilligan, Vince	"Breaking Bad" (2008)  (creator)
Gilligan, Vince	"Breaking Bad" (2008) {(#3.12)}  (creator)  <1,1,1>

Of course that’s still not complete since ideally you want to parse the title of the work (eg “Breaking Bad” (2008) ) from details of the artist’s contribution to the work (eg (creator) ). Likewise, depending on what your analysis is about you might want to drop certain kinds of works entirely. (I usually drop the porn, television, and direct to video ASAP). However you can do all that from within Stata (assuming memory isn’t an issue, which it might be) and this script will suffice to get you that far:

#!/usr/bin/perl by ghr
#this script cleans IMDB personnel files (eg, writers.list)
#works best if you delete the header (about the first 300 lines)
#raw data is organized by artist with
# "ARTIST\t\tCREDIT" for the first credit (though sometimes w a single tab) and
# subsequent records are "\t\t\tCREDIT"
#this script makes all rows "ARTIST\tCREDIT" and drops blank rows
#the resulting file is about 20% larger than the original but has a simpler structure that is easier for other programs (eg Stata) to read
#further cleaning would parse the "CREDIT" field but the contents of "CREDIT" 
#vary by personnel file
#in all files "CREDIT" begins with "FILM TITLE (YEAR)" but has further info
# eg, writers.list distinguishes screenplay vs story, etc and actors.list gives character name, etc

use warnings; use strict;
die "usage: <IMDB personnel file>\n" unless @ARGV==1;
my $rawdata = shift(@ARGV);

# if $_ matches leading non-tab, redefine the "artist" variable
# if $_ matches 3 leading tabs, drop two tabs and add current "artist"
my $artist ;
open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">$rawdata.tsv") or die "error creating $rawdata.tsv\n";
print OUT "artist\tcredit\n";
while (<IN>) {
	#match beginning of artist's credits by looking for lines NOT beginning with a tab
	if($_=~ /^[^\t].+\t.+/) {
		$artist = $_; 
		$artist =~ s/\015?\012//; #manual chomp
		$artist =~ s/\t.+$//; #drop the tab(s) and all else after it 
		$_ =~ s/\t\t/\t/; #go from two tabs to one
		print OUT "$_";
	#match subsequent credits (three leading tabs)
	if ($_ =~ m/^\t\t\t/) {
		$_ =~ s/^\t\t\t//; #drop leading tabs
		print OUT "$artist\t$_";
	#when matching blank line, clear "artist"
	if ($_ =~ m/^$/) {
		$artist = "";
close IN;
close OUT;
print "\ndone\n";
#have a nice day

July 26, 2010 at 4:13 am 2 comments

| Gabriel |

One of my grad students is doing some citation network analysis, for which the Python script (and .exe wrapper) wos2pajek is very well-suited. (Since most network packages can read “.net” this is a good idea even if you’re not using Pajek).

However the student is also interested in node level attributes, not just the network. Unfortunately WOS queries are field-tagged which is kind of a pain to work with and the grad student horrified me by expressing the willingness to spend weeks reshaping the data by hand in Excel. (Even in grad school your time is a lot more valuable than that). To get the data into tab-delimited text, I modified an earlier script I wrote for parsing field-tagged IMDb files (in my case business.list but most of the film-level IMDb files are structured similarly). The basic approach is to read a file line-by-line and match its contents by field-tag, saving the contents in a variable named after the tag. Then when you get to the new record delimiter (in this case, a blank line), dump the contents to disk and wipe the variables. Note that since the “CR” (cited reference) field has internal carriage returns it would require a little doing to integrate into this script, which is one of the reasons you’re better off relying on wos2pajek for that functionality.

#!/usr/bin/perl by ghr
#this script converts field-tagged Web Of Science queries to tab-delimited text
#for creating a network from the "CR" field, see wos2pajek
#note, you can use the info extracted by this script to replicate a wos2pajek key and thus merge

use warnings; use strict;
die "usage: <wos data>\n" unless @ARGV==1;

my $rawdata = shift(@ARGV);

my $au ; #author
my $ti ; #title
my $py ; #year
my $j9 ; #j9 coding of journal title
my $dt ; #document type

# to extract another field, work it in along the lines of the existing vars
# each var must be
# 1. declared with a "my statement" (eg, lines 12-16)
# 2. added to the header with the "print OUT" statement (ie, line 29)
# 3. written into a search and store loop following an "if" statement (eg, lines 37-41)
# 4. inside the blank line match loop (ie, lines 59-66)
#  4a. add to the print statement (ie, line 60)
#  4b. add a clear statement (eg, lines 61-65)

open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">$rawdata.tsv") or die "error creating $rawdata.tsv\n";
print OUT "au\tdt\tpy\tti\tj9\n";
while (<IN>) {
	if($_ =~ m/^AU/) {
		$au = $_;
		$au =~ s/\015?\012//; #manual chomp
		$au =~ s/^AU //; #drop leading tag
		$au =~ s/,//; #drop comma -- author only
	if($_ =~ m/^DT/) {
		$dt = $_;
		$dt =~ s/\015?\012//; #manual chomp
		$dt =~ s/^DT //; #drop leading tag
	if($_ =~ m/^TI/) {
		$ti = $_;
		$ti =~ s/\015?\012//; #manual chomp
		$ti =~ s/^TI //; #drop leading tag
	if($_ =~ m/^J9/) {
		$j9 = $_;
		$j9 =~ s/\015?\012//; #manual chomp
		$j9 =~ s/^J9 //; #drop leading tag
	if($_ =~ m/^PY/) {
		$py = $_;
		$py =~ s/\015?\012//; #manual chomp
		$py =~ s/^PY //; #drop leading tag
	#when blank line is reached, write out and clear memory 
	if($_=~ /^$/) {
		print OUT "$au\t$dt\t$py\t$ti\t$j9\n";
		$au = "" ;
		$dt = "" ;
		$ti = "" ;
		$py = "" ;
		$j9 = "" ;
close IN;
close OUT;
print "\ndone\n";

July 19, 2010 at 2:13 pm 6 comments

| Gabriel |

A few months ago I talked about reshaping field-tagged data and gave some clumsy advice for doing so. I’ve now written a perl script that does this more elegantly. It’s written to extract movie title (“MV”) and domestic box office (“GR”) from the IMDB file business.list, but you could adapt it to get other variables and/or work on other field-tagged data.
Basically, the script will turn this:

MV: Little Shop of Horrors (1986)

AD: 118,418 (Sweden) 

BT: USD 30,000,000 

GR: USD 34,656,704 (USA) (8 February 1987) 
GR: USD 33,126,503 (USA) (1 February 1987) 
GR: USD 30,810,276 (USA) (25 January 1987) 
GR: USD 27,781,027 (USA) (18 January 1987) 
GR: USD 23,727,232 (USA) (11 January 1987) 
GR: USD 19,546,049 (USA) (4 January 1987) 
GR: USD 11,412,248 (USA) (28 December 1986) 
GR: USD 3,659,884 (USA) (21 December 1986) 
GR: USD 38,747,385 (USA) 
GR: SEK 4,318,255 (Sweden) 

OW: USD 3,659,884 (USA) (21 December 1986) (866 screens) 

RT: USD 19,300,000 (USA) 

SD: 21 October 1985 - ? 

WG: USD 1,112,016 (USA) (8 February 1987) (871 screens) 
WG: USD 1,719,329 (USA) (1 February 1987) 
WG: USD 2,093,847 (USA) (25 January 1987) 
WG: USD 3,222,066 (USA) (18 January 1987) 
WG: USD 3,057,666 (USA) (11 January 1987) (858 screens) 
WG: USD 4,004,838 (USA) (4 January 1987) (866 screens) 
WG: USD 5,042,682 (USA) (28 December 1986) (866 screens) 
WG: USD 3,659,884 (USA) (21 December 1986) (866 screens) 


Into this:

Little Shop of Horrors (1986)	34,656,704 (USA) (8 February 1987) 
Little Shop of Horrors (1986)	33,126,503 (USA) (1 February 1987) 
Little Shop of Horrors (1986)	30,810,276 (USA) (25 January 1987) 
Little Shop of Horrors (1986)	27,781,027 (USA) (18 January 1987) 
Little Shop of Horrors (1986)	23,727,232 (USA) (11 January 1987) 
Little Shop of Horrors (1986)	19,546,049 (USA) (4 January 1987) 
Little Shop of Horrors (1986)	11,412,248 (USA) (28 December 1986) 
Little Shop of Horrors (1986)	3,659,884 (USA) (21 December 1986) 
Little Shop of Horrors (1986)	38,747,385 (USA) 

Here’s the code:

#!/usr/bin/perl by ghr
#this script cleans the IMDB file business.list
#raw data is field-tagged, key tags are "MV" (movie title) and "GR" (gross)
#record can have multiple "gross" fields, only interested in those with "(USA)"
#MV: Astronaut's Wife, The (1999)
#GR: USD 10,654,581 (USA) (7 November 1999) 
#find "MV" tag, keep in memory, go to "GR" tag and write out as "GR\tMV"

use warnings; use strict;
die "usage: <IMDB business file>\n" unless @ARGV==1;
my $rawdata = shift(@ARGV);

# if line=MV, redefine the "title" variable
# if line=GR, write out with "title" in front
#optional, screen out non "USA" gross, parse GR into 
#"currency, quantity, country, date"
my $title ;
my $gross ;
open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">gross.txt") or die "error creating gross.txt\n";
print OUT "title\tgross\n";
while (<IN>) {
	#match "MV" lines by looking for lines beginning "MV: "
	if($_=~ /^MV: /) {
		$title = $_; 
		$title =~ s/\015?\012//; #manual chomp
		$title =~ s/^MV: //; #drop leading tag
		print "$title ";
	#match "GR" lines, write out with clid
	if ($_ =~ m/^GR: USD .+\(USA\)/) {
		$gross = $_; 
		$gross =~ s/\015?\012//; #manual chomp
		$gross =~ s/^GR: USD //; #drop leading tag
		print OUT "$title\t$gross\n";
close IN;
close OUT;
print "\ndone\n";

March 31, 2010 at 3:40 pm 4 comments

Keep the best 5

| Gabriel |

In my undergrad lecture class I give pop quizzes, which are intended to measure both attendance and reading comprehension. Since I think an absence or two is understandable, I told the students I’d only count their best five quizzes out of seven offered. Of course this meant I had to figure out a way to calculate this, which is a lot harder to do than a simple sum. After some failed experimentation with Stata, I found this was easy to do in perl. This is because perl likes to process data row by row and its “sort()” function sorts elements in an array (rather than rows in a spreadsheet, like the “sort” commands in Stata or Excel). To view it from a Stata-centric or Excel-centric perspective, perl finds it quite natural to sort columns/variables within a row/record.

[Update: also see the comments for some solutions using Excel or R.]

Here’s the perl script

# Created by Gabriel Rossman, 2010-03-10
# this file cleans the grades by sorting the quizzes. this makes it easy to find the best 5

use strict; use warnings;
die "usage: <in.txt> <out.txt>\n" unless @ARGV == 2;

#read-write boilerplate
my $infile = shift (@ARGV);
my $outfile = shift (@ARGV);
open(IN, "<$infile") or die "error reading $infile";
open(OUT, ">$outfile") or die "error creating $outfile";
#loop to read, process, and write line by line
while (<IN>) {
	chomp; #drop \n
	my @fields = split("\t", $_); #parse STDIN into fields
	my $uid = shift (@fields); #lop off the first field, call it "uid" or University ID, which is my key variable
	my @sortedlist = sort {$b <=> $a} @fields; #descending sort other fields
	print OUT "$uid @sortedlist\n"; #write to disk as space-delimited text
close IN;
close OUT;

To execute it from the command line you’d type
perl dirtygrades.txt cleangrades.txt

You could easily use this script with, say, Excel, but I wrap it in a Stata script.

cd "~/Documents/w10_m176/exams/myuclagradebook/"
insheet using grades.csv, comma clear

drop if uid==. & midterm==.

sort uid
save grades.dta, replace
keep uid q1-q7
outsheet using grades.txt, nonames noquote replace

*send to perl to sort quizzes by score (by row)
shell perl grades.txt gradesclean.txt

insheet using gradesclean.txt, clear delim(" ")
ren v1 uid
ren v2 qg1
ren v3 qg2
ren v4 qg3
ren v5 qg4
ren v6 qg5
ren v7 qg6
ren v8 qg7
sort uid
merge 1:1 uid using grades
drop _merge q1-q7
gen q_top5=(qg1+qg2+qg3+qg4+qg5)/5

*have a nice day

March 12, 2010 at 5:05 am 11 comments

Memetracker into Stata

| Gabriel |

A few months ago I mentioned the Memetracker project to scrape the internet and look for the diffusion of (various variants of) catchphrases. I wanted to play with the dataset but there were a few tricks. First, the dataset is really, really, big. The summary file is 862 megabytes when stored as text and would no doubt be bigger in Stata (because of how Stata allocates memory to string variables). Second, the data is in a moderately complicated hierarchical format, with “C” specific occurrences, nested within “B” phrase variants, which are in turn nested within “A” phrase families. You can immediately identify whether a row is A, B, or C by the numer of leading tabs (0, 1, and 2, respectively).

I figured that the best way to interpret this data in Stata would be two create two flat-files, one a record of all the “A” records that I call “key”, and the other a simplified version of all the “C” records but with the key variable to allow merging with the “A” records. Rather than do this all in Stata, I figured it would be good to pre-process it in perl, which reads text one line at a time and thus is well-suited for handling very large files. The easy part was to make a first pass through the file with grep to create the “key” file by copying all the “A” rows (i.e., those with no leading tabs).

Slightly harder was to cull the “C” rows. If I just wanted the “C” rows this would be easy, but I wanted to associate them with the cluster key variable from the “A” rows. This required looking for “A” rows, copying the key, and keeping it in memory until the next “A” row. Meanwhile, every time I hit a “C” row, I copy it but add in the key variable from the most recent “A” row. Both for debugging and because I get nervous when a program doesn’t give any output for several minutes, I have it print to screen every new “A” key. Finally, to keep the file size down, I set a floor to eliminate reasonably rare phrase clusters (anything with less than 500 occurrences total).

At that point I had two text files, “key” which associates the phrase cluster serial number with the actual phrase string and “data” which records occurrences of the phrases. The reason I didn’t merge them is that it would massively bloat the file size and it’s not necessary for analytic purposes. Anyway, at this point I could easily get both the key and data files into Stata and do whatever I want with them. As a first pass, I graphed the time-series for each catchphrase, with and without special attention drawn to mentions occurring in the top 10 news websites.

Here’s a sample graph.

Here’s the perl file:

#!/usr/bin/perl by ghr
#this script cleans the "phrase cluster" data
#script takes the (local and unzipped) location of this file as an argument
#throws out much of the data, saves as two tab flatfiles
#"key.txt" which associates cluster IDs with phrases
#"data.txt" which contains individual observations of the phrases
# input
# A:  <ClSz>  <TotFq>  <Root>  <ClId>
# B:          <QtFq>   <Urls>  <QtStr>  <QtId>
# C:                   <Tm>    <Fq>     <UrlTy>  <Url>
# output, key file
# A:  <ClSz>  <TotFq>  <Root>  <ClId>
# output, data file
# C:<ClID>	<Tm>	<UrlTy>	<URL>
# make two passes.

use warnings; use strict;
die "usage: <phrase cluster data>\n" unless @ARGV==1;

#define minimum number of occurences a phrase must have
my $minfreq = 500;

my $rawdata = shift(@ARGV);
# use bash grep to write out the "key file"
system("grep '^[0-9]' $rawdata > key.txt");

# read again, and write out the "data file"
# if line=A, redefine the "clid" variable
# optional, if second field of "A" is too small, (eg, below 100), break the loop?
# if line=B, skip
# if line=C, write out with "clid" in front
my $clid  ;
open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">data.txt") or die "error creating data.txt\n";
print OUT "clid\ttm\turlty\turl\n";
while (<IN>) {
	#match "A" lines by looking for numbers in field 0
	if($_=~ /^\d/) {
		my @fields = split("\t", $_); #parse as tab-delimited text
		if($fields[1] < $minfreq) { last;} #quit when you get to a rare phrase
		$clid = $fields[3]; #record the ClID
		$clid =~ s/\015?\012//; #manual chomp
		print "$clid ";
	#match "C" lines, write out with clid
	if ($_ =~ m/^\t\t/) {
		my @fields = split("\t", $_);
		print OUT "$clid\t$fields[2]\t$fields[4]\t$fields[5]\n";
close IN;
close OUT;
print "\ndone\n";

And here’s the Stata file:

set mem 500m
set more off
cd ~/Documents/Sjt/memetracker/
*import key, or "A" records
insheet using key.txt, clear
ren v1 clsz
ren v2 totfq
ren v3 root
ren v4 clid
sort clid
lab var clsz "cluster size, n phrases"
lab var totfq "total frequency"
lab var root "phrase"
lab var clid "cluster id"
save key, replace
*import data, or "C" records
insheet using data.txt, clear
drop if clid==.
gen double timestamp=clock(tm,"YMDhms")
format timestamp %tc
drop tm
gen hostname=regexs(1) if regexm(url, "http://([^/]+)") /*get the website, leaving out the filepath*/
drop url
gen blog=0
replace blog=1 if urlty=="B"
replace blog=1 if hostname==""
gen technoratitop10=0 /*note, as of 2/3/2010, some mismatch with late 2008 memetracker data*/
foreach site in {
	replace technoratitop10=1 if hostname=="`site'"
gen alexanews10=0 /*as w technorati, anachronistic*/
foreach site in {
	replace alexanews10=1 if hostname=="`site'"
drop urlty
sort clid timestamp
contract _all /*eliminate redundant "C" records (from different "B" branches)*/
drop _freq
save data, replace
*draw a graph of each meme's occurrences
levelsof clid, local(clidvalues)
foreach clid in `clidvalues' {
	disp "`clid'"
	quietly use key, clear
	quietly keep if clid==`clid'
	local title=root in 1
	quietly use data, clear
	histogram timestamp if clid==`clid', frequency xlabel(#5, labsize(small) angle(forty_five)) title(`title', size(medsmall))
	graph export graphs/`clid'.png, replace
	twoway (histogram timestamp if clid==`clid') (line alexanews10 timestamp if clid==`clid', yaxis(2)), legend(off) xlabel(#5, labsize(small) angle(forty_five)) title(`title', size(medsmall))
	graph export graphs_alexa/`clid'.png, replace
*have a nice day

February 8, 2010 at 4:31 am 7 comments

Older Posts

The Culture Geeks