Posts tagged ‘perl’
Getting long flat-files out of field-tagged data
| Gabriel |
Some field-tagged data can be reshaped into flat-files pretty unproblematically. However, one of the reasons people like field-tagged data is that it can contain internal long structures, and this creates problems for reshaping. For instance, in Web of Science the “CR” field (works cited by the main work) usually has dozens of references separated by carriage returns. To get this into most statistics packages it has to be reshaped into a long flat-file. In other words, you need a script to turn this:
PT J
AU LEVINSON, RM
TI TEACHING LABELING THEORY - 4 EXPERIENCES IN ILLNESS ATTRIBUTION
SO TEACHING SOCIOLOGY
LA English
DT Article
C1 EMORY UNIV,ATLANTA,GA 30322.
CR BECKER HS, 1963, OUTSIDERS
   MENDEL WM, 1969, ARCH GEN PSYCHIAT, V20, P321
   ROSENHAN DL, 1973, SCIENCE, V179, P250
   SCHEFF TJ, 1964, SOC PROBL, V11, P401
   SCHEFF TJ, 1966, BEING MENTALLY ILL
NR 5
TC 5
PU AMER SOCIOLOGICAL ASSOC
PI WASHINGTON
PA 1722 N ST NW, WASHINGTON, DC 20036-2981
SN 0092-055X
J9 TEACH SOCIOL
JI Teach. Sociol.
PY 1975
VL 2
IS 2
BP 207
EP 211
PG 5
SC Education & Educational Research; Sociology
GA AD732
UT ISI:A1975AD73200007
ER
Into this:
ISI:A1975AD73200007	TEACH SOCIOL	BECKER HS, 1963, OUTSIDERS
ISI:A1975AD73200007	TEACH SOCIOL	MENDEL WM, 1969, ARCH GEN PSYCHIAT, V20, P321
ISI:A1975AD73200007	TEACH SOCIOL	ROSENHAN DL, 1973, SCIENCE, V179, P250
ISI:A1975AD73200007	TEACH SOCIOL	SCHEFF TJ, 1964, SOC PROBL, V11, P401
ISI:A1975AD73200007	TEACH SOCIOL	SCHEFF TJ, 1966, BEING MENTALLY ILL
Note that the other fields usually lack carriage returns or other internal long delimiters, so they can be cleaned into an ordinary wide flat-file (as wos2tab.pl, below, does). The two approaches can then be merged (in a flat-file package like Stata) or linked (in a relational setup like R or Access) using the key.
I’ve actually already done this kind of thing twice, with my code for cleaning Memetracker and the IMDb business file. However, those two datasets had the convenient property that the record key appears in the first row of the record. With data structures like these, you just remember the key; then every time you come across a long entry, you write it out along with the key, as in the sketch below.
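To make that concrete, here is a minimal sketch of the key-first pattern. It is an illustration only, not one of the actual Memetracker or IMDb scripts, and the field tags “ID” and “XX” are made up for the example:

#!/usr/bin/perl
#keyfirst_long.pl -- illustrative sketch of the "key comes first" pattern,
#not one of the actual Memetracker or IMDb scripts
#hypothetical tags: "ID" marks the record key, "XX" marks a repeatable long field
use warnings; use strict;

my $key = "";
while (<>) {
	s/\015?\012//;             #manual chomp (handles Windows or Unix line endings)
	if (m/^ID /) {             #the key comes first, so just remember it...
		$key = $_;
		$key =~ s/^ID //;      #drop leading tag
	}
	if (m/^XX /) {             #...and write out each long entry as soon as it appears
		my $entry = $_;
		$entry =~ s/^XX //;    #drop leading tag
		print "$key\t$entry\n";
	}
}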
Unfortunately, Web of Science has the record key appear towards the end of the record, a data structure that I propose to call “a huge pain in the ass.” This means that you have to collect all the long values in an array, then record the record key, then loop over the array to write out.
#!/usr/bin/perl
#wos2tab_cr.pl by ghr
#this script converts field-tagged WOS queries to tab-delimited text
#it extracts the CR field and attaches to it the fields UT and J9
#unlike wos2tab, this file outputs long-formatted data
#the two types of output can be merged with the UT field
#since CR comes /before/ UT and J9, must save as array, then loop over array at ER

use warnings; use strict;
die "usage: wos2tab_cr.pl <wos data>\n" unless @ARGV==1;
my $rawdata = shift(@ARGV);

my $ut = "" ; #unique article identifier
my $j9 = "" ; #j9 coding of journal title
my $cr = "" ; #cited work
my @crlist = () ; #list of cited works
my $cr_continued = 0 ; #flag for recently hitting "^CR"

print "starting to read $rawdata\n";

open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">$rawdata.long") or die "error creating $rawdata.long\n";
print OUT "ut\tj9\tcr\n";
while (<IN>) {
	# if begins with non-whitespace character other than CR, flag cr_continued as not
	if($_ =~ m/^\S/) {
		if($_ =~ m/^[^CR]/) {$cr_continued = 0}
	}
	if($_ =~ m/^J9/) {
		$j9 = $_;
		$j9 =~ s/\015?\012//; #manual chomp
		$j9 =~ s/^J9 //; #drop leading tag
	}
	if($_ =~ m/^UT/) {
		$ut = $_;
		$ut =~ s/\015?\012//; #manual chomp
		$ut =~ s/^UT //; #drop leading tag
	}
	#first line of a CR field
	if($_ =~ m/^CR/) {
		$cr = $_;
		$cr =~ s/\015?\012//; #manual chomp
		$cr =~ s/^CR //; #drop leading tag
		$cr_continued = 1 ; #flag to allow for multi-line CR field
		push @crlist, $cr; #add the new cite to the list
	}
	#subsequent lines of a CR field
	if($_ =~ m/^ /) {
		if($cr_continued==1) {
			$cr = $_ ;
			$cr =~ s/\015?\012//; #manual chomp
			$cr =~ s/^ //; #drop leading tag
			push @crlist, $cr; #add the new cite to the list
		}
	}
	#when "end record" code is reached, loop over array to write out as long file, then clear memory
	if($_=~ /^ER/) {
		#loop over CRLIST array, printing for each value so as to have long file
		for (my $i = 0; $i < @crlist; $i++) {
			print OUT "$ut\t$j9\t$crlist[$i]\n"; #write out
		}
		#clear memory, just do once per "end record"
		$j9 = "" ;
		$ut = "" ;
		$cr = "" ;
		$cr_continued = 0 ;
		@crlist = () ;
	}
}
close IN;
close OUT;
print "done writing $rawdata.long \n";
Some ways Stata is an unusual language
| Gabriel |
As I’ve tried to learn other languages, I’ve realized that part of the difficulty isn’t that they’re hard (although in some cases they are) but that I’m used to Stata’s very distinctive paradigm and nomenclature. Some aspects of Stata are pretty standard (e.g., “while”/”foreach”/”forvalues” loops, log files, and the “file” syntax for using text files on disk), but other bits are pretty strange. Or rather, they’re strange from a computer science perspective but intuitive from a social science perspective.
Stata seems to have been designed to make sense to social scientists, and if this makes it confusing to programmers, then so be it. A simple example of this is that Stata uses the word “variable” in the sense meant by social scientists. More broadly, Stata is pretty bold about defaults so as to make things easy for beginners. It presumes that anything you’re doing applies to the dataset (aka the master data), which is always a flat-file database. Other things that might be held in memory have a secondary status, and beginning users don’t even know that they’re there. Likewise, commands distinguish between the important arguments (usually variables) and the secondary arguments, which Stata calls “options.” There are also very sensible assumptions about what to report and what to put in ephemeral data objects that can be accessed immediately after the primary command (but need not be stored as part of the original command, as they would be in most other languages).
Note, I’m not complaining about any of this. Very few of Stata’s quirks are pointlessly arbitrary. (The only arbitrary deviation I can think of is using “*” instead of “#” for commenting). Most of Stata’s quirks are necessary in order to make it so user-friendly to social scientists. In a lot of ways R is a more conventional language than Stata, but most social scientists find Stata much easier to learn. In part because Stata is willing to deviate from the conventions of general purpose programming languages, running and interpreting a regression in Stata looks like this “reg y x” instead of this “summary(lm(y~x))” and loading a dataset looks like this “use mydata, clear” instead of this “data <- read.table(mydata.txt)”. Stata has some pretty complicated syntax (e.g., the entire Mata language) but you can get a lot done with just a handful of simple commands like “use,” “gen,” and “reg”.
Nonetheless all this means that when Stata native speakers like me learn a second programming language it can be a bit confusing. And FWIW, I worry that rumored improvements to Stata (such as allowing relational data in memory) will detract from its user-friendliness. Anyway, the point is that I love Stata and I think it’s entirely appropriate for social scientists to learn it first. I do most of my work in Stata and I teach/mentor my graduate students in Stata unless there’s a specific reason for them to learn something else. At the same time I know that many social scientists would benefit a lot from also learning other languages. For instance, people into social networks should learn R, people who want to do content analysis should learn Perl or Python, and people who want to do simulations should learn NetLogo or Java. The thing is that when you do, you’re in for a culture shock and so I’m making explicit some ways in which Stata is weird.
Do-files and Ado-files. In any other language a do-file would be called a script and an ado-file would be called a library. Also note that Stata very conveniently reads all your ado-files automatically, whereas most other languages require you to specifically load the relevant libraries into memory at the beginning of each script.
Commands, Programs, and Functions. In Stata a program is basically just a command that you wrote yourself. Stata is somewhat unusual in drawing a distinction between a command/program and a function. So in Stata a function usually means some kind of transformation that attaches its output to a variable or macro, as in “gen ln_income=log(income)”. In contrast a command/program is pretty much anything that doesn’t directly attach to an operator and includes all file operations (e.g., “use”) and estimations (e.g, “regress”). Other languages don’t really draw this distinction but consider everything a function, no matter what it does and whether the user wrote it or not. (Some languages use “primitive” to mean something like the Stata command vs. program distinction, but it’s not terribly important).
Because most languages only have functions, pretty much everything has to be assigned to an object via an operator. Hence Stata users would usually type “reg y x” whereas R users would usually type “myregression <- lm(y~x)”. This is because “regress” in Stata is a command whereas “lm()” in R is a function. Also note that Stata distinguishes commands from everything else by word order, with the command being the first word. In contrast, functions in other languages (just like Stata functions) put the function name outside the parentheses, and all of the arguments, both data objects and options, go inside the parentheses.
The Dataset. Stata is one of the only languages where it’s appropriate to use the definite article in reference to data. (NetLogo is arguably another case of this). In other languages it’s more appropriate to speak of “a data object” than “the dataset,” even if there only happens to be one data object in memory. For the same reason, most languages don’t “use” or “open” data, but “read” the data and assign it to an object. Another way to think about it is that only Stata has a “dataset” whereas other languages only have “matrices.” Of course, Stata/Mata also has matrices but most Stata end users don’t bother with them as they tend to be kind of a backend thing that’s usually handled by ado-files. Furthermore, in other languages (e.g., Perl) it’s common to not even load a file into memory but to process it line-by-line, which in Stata terms is kind of like a cross between the “file read/write” syntax and a “while” loop.
Variables. Stata uses the term “variable” in the statistical or social scientific meaning of the term. In other languages this would usually be called a field or vector.
Macros. What most other languages call variables, Stata calls local and global “macros.” Stata’s usage of the local vs global distinction is standard. In other languages the concept of “declaring” a variable is usually a little more explicit than it is in Stata.
Stata is extremely good about expanding macros in situ, and this can spoil us Stata users. In other languages you often have to do a cruder workaround: first use some kind of concatenate function to create a string object containing the expansion, and then use that string object. For instance, if you wanted to access a series of numbered files in Stata you could just loop over this:
use ~/project/file`i', clear
In other languages you’d have to add a separate line for the expansion. So in R you’d loop over:
filename <- paste('~/project/file', i, sep="")
data <- read.table(filename)
[Update: Also see this Statalist post by Nick Cox on the distinction between variables and macros]
Reporting. Stata allows you to pass estimations on for further work (that’s what return macros, ereturn matrices, and postestimation commands are all about), but it assumes you probably won’t and so it is unusually generous in reporting most of the really interesting things after a command. In other languages you usually have to specifically ask to get this level of reporting. Another way to put it is that in Stata verbosity is assumed by default and can be suppressed with “quietly,” whereas in R silence is assumed by default and verbosity can be invoked by wrapping the estimation (or an object saving the estimation) in the “summary()” function.
imdb_personnel.pl
| Gabriel |
As previously remarked, IMDb files have a weird structure that ain’t exactly ready to rock. I already posted a file for dealing with business.list (which could also be modified to work with files like certificates.list). The personnel files (actors.list, actresses.list, directors.list, writers.list, etc) look like this:
Gilligan, Vince		2-Face (2013) (screenplay)
			A.M.P.E.D. (2007) (TV) (writer)
			Hancock (2008) (written by) <1,2,1>
			Home Fries (1998) (written by) <1,1,1>
			The X Files: Revelations (2008) (V) (written by) (segment "Bad Blood") <8,1,1>
			The X Files: Revelations (2008) (V) (written by) (segment "Memento Mori") <6,1,3>
			Wilder Napalm (1993) (written by) <1,1,1>
			"Breaking Bad" (2008) (creator)
			"Breaking Bad" (2008) {(#3.12)} (creator) <1,1,1>
			"Breaking Bad" (2008) {(#3.13)} (creator) <1,1,1>
Whereas we’re used to data that looks like this:
Gilligan, Vince	2-Face (2013) (screenplay)
Gilligan, Vince	A.M.P.E.D. (2007) (TV) (writer)
Gilligan, Vince	Hancock (2008) (written by) <1,2,1>
Gilligan, Vince	Home Fries (1998) (written by) <1,1,1>
Gilligan, Vince	The X Files: Revelations (2008) (V) (written by) (segment "Bad Blood") <8,1,1>
Gilligan, Vince	The X Files: Revelations (2008) (V) (written by) (segment "Memento Mori") <6,1,3>
Gilligan, Vince	Wilder Napalm (1993) (written by) <1,1,1>
Gilligan, Vince	"Breaking Bad" (2008) (creator)
Gilligan, Vince	"Breaking Bad" (2008) {(#3.12)} (creator) <1,1,1>
Of course that’s still not complete since ideally you want to parse the title of the work (eg “Breaking Bad” (2008)) from details of the artist’s contribution to the work (eg (creator)); a sketch of one way to start that parsing follows the script. Likewise, depending on what your analysis is about you might want to drop certain kinds of works entirely. (I usually drop the porn, television, and direct-to-video stuff ASAP.) However, you can do all that from within Stata (assuming memory isn’t an issue, which it might be) and this script will suffice to get you that far:
#!/usr/bin/perl
#imdb_personnel.pl by ghr
#this script cleans IMDB personnel files (eg, writers.list)
#works best if you delete the header (about the first 300 lines)
#raw data is organized by artist with
# "ARTIST\t\tCREDIT" for the first credit (though sometimes w a single tab) and
# subsequent records are "\t\t\tCREDIT"
#this script makes all rows "ARTIST\tCREDIT" and drops blank rows
#the resulting file is about 20% larger than the original but has a simpler structure that is easier for other programs (eg Stata) to read
#further cleaning would parse the "CREDIT" field but the contents of "CREDIT"
#vary by personnel file
#in all files "CREDIT" begins with "FILM TITLE (YEAR)" but has further info
# eg, writers.list distinguishes screenplay vs story, etc and actors.list gives character name, etc

use warnings; use strict;
die "usage: imdb_personnel.pl <IMDB personnel file>\n" unless @ARGV==1;
my $rawdata = shift(@ARGV);

# if $_ matches leading non-tab, redefine the "artist" variable
# if $_ matches 3 leading tabs, drop two tabs and add current "artist"
my $artist ;
open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">$rawdata.tsv") or die "error creating $rawdata.tsv\n";
print OUT "artist\tcredit\n";
while (<IN>) {
	#match beginning of artist's credits by looking for lines NOT beginning with a tab
	if($_=~ /^[^\t].+\t.+/) {
		$artist = $_;
		$artist =~ s/\015?\012//; #manual chomp
		$artist =~ s/\t.+$//; #drop the tab(s) and all else after it
		$_ =~ s/\t\t/\t/; #go from two tabs to one
		print OUT "$_";
	}
	#match subsequent credits (three leading tabs)
	if ($_ =~ m/^\t\t\t/) {
		$_ =~ s/^\t\t\t//; #drop leading tabs
		print OUT "$artist\t$_";
	}
	#when matching blank line, clear "artist"
	if ($_ =~ m/^$/) {
		$artist = "";
	}
}
close IN;
close OUT;
print "\ndone\n";
#have a nice day
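If you did want to start parsing the credit field in perl rather than Stata, something along these lines might work. This is only a sketch: the script name is hypothetical and the regex reflects my guesses about the credit layout (title through the parenthesized year, then contribution details, then an optional <x,y,z> billing code), so check it against the actual files before trusting it:

#!/usr/bin/perl
#parse_credit.pl -- illustrative sketch, not part of imdb_personnel.pl
#reads the "artist\tcredit" output of imdb_personnel.pl and tries to split
#the credit into title/year, contribution details, and the trailing <x,y,z> billing code
use warnings; use strict;

while (<>) {
	s/\015?\012//;                           #manual chomp
	next if $. == 1;                         #skip the header row
	my ($artist, $credit) = split("\t", $_, 2);
	next unless defined $credit;
	my $billing = "";
	if ($credit =~ s/\s*<([^>]+)>\s*$//) {   #peel off a trailing <1,2,1> style code, if any
		$billing = $1;
	}
	#guess: the title runs through the "(YYYY)" year, the rest is contribution detail
	if ($credit =~ m/^(.+?\(\d{4}(?:\/[IVX]+)?\))\s*(.*)$/) {
		print "$artist\t$1\t$2\t$billing\n";
	}
}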
wos2tab.pl
| Gabriel |
One of my grad students is doing some citation network analysis, for which the Python script (and .exe wrapper) wos2pajek is very well-suited. (Since most network packages can read “.net” this is a good idea even if you’re not using Pajek).
However, the student is also interested in node-level attributes, not just the network. Unfortunately, WOS queries are field-tagged, which is kind of a pain to work with, and the grad student horrified me by expressing a willingness to spend weeks reshaping the data by hand in Excel. (Even in grad school your time is a lot more valuable than that.) To get the data into tab-delimited text, I modified an earlier script I wrote for parsing field-tagged IMDb files (in my case business.list, but most of the film-level IMDb files are structured similarly). The basic approach is to read the file line by line and match its contents by field-tag, saving the contents in a variable named after the tag. Then when you get to the new-record delimiter (in this case, a blank line), dump the contents to disk and wipe the variables. Note that since the “CR” (cited reference) field has internal carriage returns it would take a little doing to integrate into this script, which is one of the reasons you’re better off relying on wos2pajek for that functionality.
#!/usr/bin/perl
#wos2tab.pl by ghr
#this script converts field-tagged Web Of Science queries to tab-delimited text
#for creating a network from the "CR" field, see wos2pajek
#note, you can use the info extracted by this script to replicate a wos2pajek key and thus merge

use warnings; use strict;
die "usage: wos2tab.pl <wos data>\n" unless @ARGV==1;

my $rawdata = shift(@ARGV);

my $au ; #author
my $ti ; #title
my $py ; #year
my $j9 ; #j9 coding of journal title
my $dt ; #document type

# to extract another field, work it in along the lines of the existing vars
# each var must be
# 1. declared with a "my statement" (eg, lines 12-16)
# 2. added to the header with the "print OUT" statement (ie, line 29)
# 3. written into a search and store loop following an "if" statement (eg, lines 37-41)
# 4. inside the blank line match loop (ie, lines 59-66)
#  4a. add to the print statement (ie, line 60)
#  4b. add a clear statement (eg, lines 61-65)

open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">$rawdata.tsv") or die "error creating $rawdata.tsv\n";
print OUT "au\tdt\tpy\tti\tj9\n";
while (<IN>) {
	if($_ =~ m/^AU/) {
		$au = $_;
		$au =~ s/\015?\012//; #manual chomp
		$au =~ s/^AU //; #drop leading tag
		$au =~ s/,//; #drop comma -- author only
	}
	if($_ =~ m/^DT/) {
		$dt = $_;
		$dt =~ s/\015?\012//; #manual chomp
		$dt =~ s/^DT //; #drop leading tag
	}
	if($_ =~ m/^TI/) {
		$ti = $_;
		$ti =~ s/\015?\012//; #manual chomp
		$ti =~ s/^TI //; #drop leading tag
	}
	if($_ =~ m/^J9/) {
		$j9 = $_;
		$j9 =~ s/\015?\012//; #manual chomp
		$j9 =~ s/^J9 //; #drop leading tag
	}
	if($_ =~ m/^PY/) {
		$py = $_;
		$py =~ s/\015?\012//; #manual chomp
		$py =~ s/^PY //; #drop leading tag
	}
	#when blank line is reached, write out and clear memory
	if($_=~ /^$/) {
		print OUT "$au\t$dt\t$py\t$ti\t$j9\n";
		$au = "" ;
		$dt = "" ;
		$ti = "" ;
		$py = "" ;
		$j9 = "" ;
	}
}
close IN;
close OUT;
print "\ndone\n";
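To make the four-step recipe in the comments concrete, here is a stripped-down sketch that applies the same pattern to just two fields, “UT” and “SO” (the full journal title, visible in the sample WOS record at the top of this page). It is an illustration of the recipe, not a replacement for wos2tab.pl, and the file name is made up:

#!/usr/bin/perl
#wos2tab_mini.pl -- stripped-down illustration of the wos2tab.pl recipe, not the real script
use warnings; use strict;
die "usage: wos2tab_mini.pl <wos data>\n" unless @ARGV==1;
my $rawdata = shift(@ARGV);

my $ut = "" ; #unique article identifier (step 1: declare the variable)
my $so = "" ; #full journal title

open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">$rawdata.mini.tsv") or die "error creating $rawdata.mini.tsv\n";
print OUT "ut\tso\n";                 #step 2: add the field to the header
while (<IN>) {
	if($_ =~ m/^UT/) {
		$ut = $_;
		$ut =~ s/\015?\012//;     #manual chomp
		$ut =~ s/^UT //;          #drop leading tag
	}
	if($_ =~ m/^SO/) {                #step 3: one search-and-store block per field
		$so = $_;
		$so =~ s/\015?\012//;     #manual chomp
		$so =~ s/^SO //;          #drop leading tag
	}
	if($_ =~ /^$/) {                  #step 4: write out and clear at the blank line
		print OUT "$ut\t$so\n";
		$ut = "" ;
		$so = "" ;
	}
}
close IN;
close OUT;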
Gross.pl
| Gabriel |
A few months ago I talked about reshaping field-tagged data and gave some clumsy advice for doing so. I’ve now written a perl script that does this more elegantly. It’s written to extract movie title (“MV”) and domestic box office (“GR”) from the IMDB file business.list, but you could adapt it to get other variables and/or work on other field-tagged data.
Basically, the script will turn this:
-------------------------------------------------------------------------------
MV: Little Shop of Horrors (1986)
AD: 118,418 (Sweden)
BT: USD 30,000,000
GR: USD 34,656,704 (USA) (8 February 1987)
GR: USD 33,126,503 (USA) (1 February 1987)
GR: USD 30,810,276 (USA) (25 January 1987)
GR: USD 27,781,027 (USA) (18 January 1987)
GR: USD 23,727,232 (USA) (11 January 1987)
GR: USD 19,546,049 (USA) (4 January 1987)
GR: USD 11,412,248 (USA) (28 December 1986)
GR: USD 3,659,884 (USA) (21 December 1986)
GR: USD 38,747,385 (USA)
GR: SEK 4,318,255 (Sweden)
OW: USD 3,659,884 (USA) (21 December 1986) (866 screens)
RT: USD 19,300,000 (USA)
SD: 21 October 1985 - ?
WG: USD 1,112,016 (USA) (8 February 1987) (871 screens)
WG: USD 1,719,329 (USA) (1 February 1987)
WG: USD 2,093,847 (USA) (25 January 1987)
WG: USD 3,222,066 (USA) (18 January 1987)
WG: USD 3,057,666 (USA) (11 January 1987) (858 screens)
WG: USD 4,004,838 (USA) (4 January 1987) (866 screens)
WG: USD 5,042,682 (USA) (28 December 1986) (866 screens)
WG: USD 3,659,884 (USA) (21 December 1986) (866 screens)
-------------------------------------------------------------------------------
Into this:
Little Shop of Horrors (1986)	34,656,704 (USA) (8 February 1987)
Little Shop of Horrors (1986)	33,126,503 (USA) (1 February 1987)
Little Shop of Horrors (1986)	30,810,276 (USA) (25 January 1987)
Little Shop of Horrors (1986)	27,781,027 (USA) (18 January 1987)
Little Shop of Horrors (1986)	23,727,232 (USA) (11 January 1987)
Little Shop of Horrors (1986)	19,546,049 (USA) (4 January 1987)
Little Shop of Horrors (1986)	11,412,248 (USA) (28 December 1986)
Little Shop of Horrors (1986)	3,659,884 (USA) (21 December 1986)
Little Shop of Horrors (1986)	38,747,385 (USA)
Here’s the code:
#!/usr/bin/perl
#gross.pl by ghr
#this script cleans the IMDB file business.list
#raw data is field-tagged, key tags are "MV" (movie title) and "GR" (gross)
#record can have multiple "gross" fields, only interested in those with "(USA)"
#ex
#MV: Astronaut's Wife, The (1999)
#GR: USD 10,654,581 (USA) (7 November 1999)
#find "MV" tag, keep in memory, go to "GR" tag and write out as "GR\tMV"

use warnings; use strict;
die "usage: gross.pl <IMDB business file>\n" unless @ARGV==1;
my $rawdata = shift(@ARGV);

# if line=MV, redefine the "title" variable
# if line=GR, write out with "title" in front
#optional, screen out non "USA" gross, parse GR into
#"currency, quantity, country, date"
my $title ;
my $gross ;
open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">gross.txt") or die "error creating gross.txt\n";
print OUT "title\tgross\n";
while (<IN>) {
	#match "MV" lines by looking for lines beginning "MV: "
	if($_=~ /^MV: /) {
		$title = $_;
		$title =~ s/\015?\012//; #manual chomp
		$title =~ s/^MV: //; #drop leading tag
		print "$title ";
	}
	#match "GR" lines, write out with title in front
	if ($_ =~ m/^GR: USD .+\(USA\)/) {
		$gross = $_;
		$gross =~ s/\015?\012//; #manual chomp
		$gross =~ s/^GR: USD //; #drop leading tag
		print OUT "$title\t$gross\n";
	}
}
close IN;
close OUT;
print "\ndone\n";
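As for the comment about parsing “GR” into currency, quantity, country, and date, a regex along these lines might do it. This is a sketch under my assumptions about the line layout (“GR: CUR 1,234,567 (Country) (Date)”, with the date optional), not part of gross.pl itself:

#!/usr/bin/perl
#parse_gr.pl -- sketch of splitting a "GR" line into components, not part of gross.pl
#reads business.list-style lines and prints currency, amount, country, date
use warnings; use strict;

while (<>) {
	s/\015?\012//;                                              #manual chomp
	next unless m/^GR: /;
	#guess at the layout: "GR: CUR 1,234,567 (Country) (Date)" with the date optional
	if (m/^GR: ([A-Z]{3}) ([\d,]+) \(([^)]+)\)(?: \(([^)]+)\))?/) {
		my ($cur, $amt, $country, $date) = ($1, $2, $3, $4);
		$date = "" unless defined $date;
		$amt =~ s/,//g;                                     #strip thousands separators
		print "$cur\t$amt\t$country\t$date\n";
	}
}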
Keep the best 5
| Gabriel |
In my undergrad lecture class I give pop quizzes, which are intended to measure both attendance and reading comprehension. Since I think an absence or two is understandable, I told the students I’d only count their best five quizzes out of seven offered. Of course this meant I had to figure out a way to calculate this, which is a lot harder to do than a simple sum. After some failed experimentation with Stata, I found this was easy to do in perl. This is because perl likes to process data row by row and its “sort()” function sorts elements in an array (rather than rows in a spreadsheet, like the “sort” commands in Stata or Excel). To view it from a Stata-centric or Excel-centric perspective, perl finds it quite natural to sort columns/variables within a row/record.
[Update: also see the comments for some solutions using Excel or R.]
Here’s the perl script
#!/usr/bin/perl
# bestquiz.pl
# Created by Gabriel Rossman rossman@soc.ucla.edu, 2010-03-10
# this file cleans the grades by sorting the quizzes. this makes it easy to find the best 5

use strict; use warnings;
die "usage: bestquiz.pl <in.txt> <out.txt>\n" unless @ARGV == 2;

#read-write boilerplate
my $infile = shift (@ARGV);
my $outfile = shift (@ARGV);
open(IN, "<$infile") or die "error reading $infile";
open(OUT, ">$outfile") or die "error creating $outfile";

#loop to read, process, and write line by line
while (<IN>) {
	chomp; #drop \n
	my @fields = split("\t", $_); #parse the line into fields
	my $uid = shift (@fields); #lop off the first field, call it "uid" or University ID, which is my key variable
	my @sortedlist = sort {$b <=> $a} @fields; #descending sort other fields
	print OUT "$uid @sortedlist\n"; #write to disk as space-delimited text
}
close IN;
close OUT;
To execute it from the command line you’d type
perl bestquiz.pl dirtygrades.txt cleangrades.txt
You could easily use this script with, say, Excel, but I wrap it in a Stata script.
cd "~/Documents/w10_m176/exams/myuclagradebook/" insheet using grades.csv, comma clear drop if uid==. & midterm==. sort uid save grades.dta, replace keep uid q1-q7 outsheet using grades.txt, nonames noquote replace *send to perl to sort quizzes by score (by row) shell perl bestquiz.pl grades.txt gradesclean.txt insheet using gradesclean.txt, clear delim(" ") ren v1 uid ren v2 qg1 ren v3 qg2 ren v4 qg3 ren v5 qg4 ren v6 qg5 ren v7 qg6 ren v8 qg7 sort uid merge 1:1 uid using grades drop _merge q1-q7 gen q_top5=(qg1+qg2+qg3+qg4+qg5)/5 *have a nice day
Memetracker into Stata
| Gabriel |
A few months ago I mentioned the Memetracker project to scrape the internet and look for the diffusion of (various variants of) catchphrases. I wanted to play with the dataset but there were a few tricks. First, the dataset is really, really big. The summary file is 862 megabytes when stored as text and would no doubt be bigger in Stata (because of how Stata allocates memory to string variables). Second, the data is in a moderately complicated hierarchical format, with “C” specific occurrences nested within “B” phrase variants, which are in turn nested within “A” phrase families. You can immediately identify whether a row is A, B, or C by the number of leading tabs (0, 1, and 2, respectively).
I figured that the best way to interpret this data in Stata would be to create two flat-files: one a record of all the “A” records, which I call “key,” and the other a simplified version of all the “C” records but with the key variable to allow merging with the “A” records. Rather than do this all in Stata, I figured it would be good to pre-process it in perl, which reads text one line at a time and thus is well-suited for handling very large files. The easy part was to make a first pass through the file with grep to create the “key” file by copying all the “A” rows (i.e., those with no leading tabs).
Slightly harder was to cull the “C” rows. If I just wanted the “C” rows this would be easy, but I wanted to associate them with the cluster key variable from the “A” rows. This required looking for “A” rows, copying the key, and keeping it in memory until the next “A” row. Meanwhile, every time I hit a “C” row, I copy it but add in the key variable from the most recent “A” row. Both for debugging and because I get nervous when a program doesn’t give any output for several minutes, I have it print to screen every new “A” key. Finally, to keep the file size down, I set a floor to eliminate reasonably rare phrase clusters (anything with less than 500 occurrences total).
At that point I had two text files, “key” which associates the phrase cluster serial number with the actual phrase string and “data” which records occurrences of the phrases. The reason I didn’t merge them is that it would massively bloat the file size and it’s not necessary for analytic purposes. Anyway, at this point I could easily get both the key and data files into Stata and do whatever I want with them. As a first pass, I graphed the time-series for each catchphrase, with and without special attention drawn to mentions occurring in the top 10 news websites.
Here’s a sample graph.
Here’s the perl file:
#!/usr/bin/perl
#mt_clean.pl by ghr
#this script cleans the memetracker.org "phrase cluster" data
#http://snap.stanford.edu/data/d/quotes/Old-UniqUrls/clust-qt08080902w3mfq5.txt.gz
#script takes the (local and unzipped) location of this file as an argument
#throws out much of the data, saves as two tab flatfiles
#"key.txt" which associates cluster IDs with phrases
#"data.txt" which contains individual observations of the phrases
# input
# A: <ClSz> <TotFq> <Root> <ClId>
# B: <QtFq> <Urls> <QtStr> <QtId>
# C: <Tm> <Fq> <UrlTy> <Url>
# output, key file
# A: <ClSz> <TotFq> <Root> <ClId>
# output, data file
# C: <ClID> <Tm> <UrlTy> <URL>
# make two passes.

use warnings; use strict;
die "usage: mt_clean.pl <phrase cluster data>\n" unless @ARGV==1;

#define minimum number of occurrences a phrase must have
my $minfreq = 500;

my $rawdata = shift(@ARGV);

# use bash grep to write out the "key file"
system("grep '^[0-9]' $rawdata > key.txt");

# read again, and write out the "data file"
# if line=A, redefine the "clid" variable
# optional, if second field of "A" is too small, (eg, below 100), break the loop?
# if line=B, skip
# if line=C, write out with "clid" in front
my $clid ;
open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">data.txt") or die "error creating data.txt\n";
print OUT "clid\ttm\turlty\turl\n";
while (<IN>) {
	#match "A" lines by looking for numbers in field 0
	if($_=~ /^\d/) {
		my @fields = split("\t", $_); #parse as tab-delimited text
		if($fields[1] < $minfreq) { last;} #quit when you get to a rare phrase
		$clid = $fields[3]; #record the ClID
		$clid =~ s/\015?\012//; #manual chomp
		print "$clid ";
	}
	#match "C" lines, write out with clid
	if ($_ =~ m/^\t\t/) {
		chomp;
		my @fields = split("\t", $_);
		print OUT "$clid\t$fields[2]\t$fields[4]\t$fields[5]\n";
	}
}
close IN;
close OUT;
print "\ndone\n";
And here’s the Stata file:
clear
set mem 500m
set more off

cd ~/Documents/Sjt/memetracker/

*import key, or "A" records
insheet using key.txt, clear
ren v1 clsz
ren v2 totfq
ren v3 root
ren v4 clid
sort clid
lab var clsz "cluster size, n phrases"
lab var totfq "total frequency"
lab var root "phrase"
lab var clid "cluster id"
compress
save key, replace

*import data, or "C" records
insheet using data.txt, clear
drop if clid==.
gen double timestamp=clock(tm,"YMDhms")
format timestamp %tc
drop tm
gen hostname=regexs(1) if regexm(url, "http://([^/]+)") /*get the website, leaving out the filepath*/
drop url
gen blog=0
replace blog=1 if urlty=="B"
replace blog=1 if hostname=="blog.myspace.com"
gen technoratitop10=0 /*note, as of 2/3/2010, some mismatch with late 2008 memetracker data*/
foreach site in huffingtonpost.com engadget.com gizmodo.com mashable.com techcrunch.com boingboing.net gawker.com corner.nationalreview.com thedailybeast.com tmz.com {
	replace technoratitop10=1 if hostname=="`site'"
}
gen alexanews10=0 /*as w technorati, anachronistic*/
foreach site in news.yahoo.com bbc.co.uk cnn.com news.bbc.co.uk news.google.com nytimes.com msnbc.msn.com foxnews.com {
	replace alexanews10=1 if hostname=="`site'"
}
drop urlty
sort clid timestamp
contract _all /*eliminate redundant "C" records (from different "B" branches)*/
drop _freq
save data, replace

*draw a graph of each meme's occurrences
levelsof clid, local(clidvalues)
foreach clid in `clidvalues' {
	disp "`clid'"
	quietly use key, clear
	quietly keep if clid==`clid'
	local title=root in 1
	quietly use data, clear
	histogram timestamp if clid==`clid', frequency xlabel(#5, labsize(small) angle(forty_five)) title(`title', size(medsmall))
	graph export graphs/`clid'.png, replace
	twoway (histogram timestamp if clid==`clid') (line alexanews10 timestamp if clid==`clid', yaxis(2)), legend(off) xlabel(#5, labsize(small) angle(forty_five)) title(`title', size(medsmall))
	graph export graphs_alexa/`clid'.png, replace
}
*have a nice day
Perl text library
| Gabriel |
I found this very useful library of perl scripts for text cleaning. You can use them even if you can’t code perl yourself. For instance, to transpose a dataset, just download the “transpose.pl” script to your ~/scripts directory and enter the shell command:
perl ~/scripts/transpose.pl row_col.txt > col_row.txt
The transpose script is particularly useful to me, as I’ve never gotten Excel’s transpose function to work and for some bizarre reason Stata’s “xpose” command only works with numeric variables. You can even use these scripts directly from a do-file, like so:
tempfile foo1
tempfile foo2
outsheet using `foo1'.txt
shell perl ~/scripts/transpose.pl `foo1'.txt > `foo2'.txt
insheet using `foo2'.txt, clear
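For what it’s worth, the core of a transpose script like that is only a few lines of perl. The sketch below is just to show the general idea (read all the rows into memory, then walk the columns); the actual transpose.pl in the linked library is presumably more careful about ragged rows and delimiters:

#!/usr/bin/perl
#transpose_sketch.pl -- rough sketch of transposing tab-delimited text,
#not the transpose.pl from the library linked above
use warnings; use strict;

my @rows;
while (<>) {
	s/\015?\012//;                       #manual chomp
	push @rows, [ split("\t", $_) ];     #store each row as an array reference
}
my $ncols = @{$rows[0]};
for my $j (0 .. $ncols - 1) {            #walk the columns of the input...
	my @newrow;
	for my $i (0 .. $#rows) {            #...collecting the j-th field of every row
		push @newrow, defined $rows[$i][$j] ? $rows[$i][$j] : "";
	}
	print join("\t", @newrow), "\n";     #...and print it as a row of the output
}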
Pajek_labelvector.pl
| Gabriel |
A few months ago I wrote some notes on using a text editor to get output out of Pajek or Network Workbench and into a rows and columns dataset. Now that I’ve learned Perl from the course notes my UC Davis colleagues posted, I wrote up a perl script that will automate this and create a tab-delimited ascii file (or files if you give it multiple .vec files).
I’d like to put the code directly in the post, but when I try, wordpress drops some of the characters (eg, backslash-zero-one-five renders as just “15”), so I put the properly-formatted script here. [Update: the new “sourcecode” tag properly escapes all this stuff, so I’ve updated the post to include the script at the bottom of the post. The external link still works but is now unnecessary.]
It takes the labels from a “.net” data file and merges them (by sort order) onto a “.vec” output file, which lets you merge the metric back onto your main (non-network) dataset. Read my older post for an explanation of why this is necessary. Note that if the sort order is different for the .vec and .net files it will get screwy, so be sure to spot check the values. The syntax is simply:
perl pajek_labelvector.pl myfile.net netmetric_1.vec netmetric_k.vec
Between this perl script and stata2pajek.ado it should be fairly easy to integrate network data into Stata.
#!/usr/bin/perl
# pajek_labelvector.pl
# Gabriel Rossman, UCLA, 2009-09-22
# this file extracts the vertice labels from a .net file and merges it (by sort order) with one or more .vec files
# take filenames as arguments
# file 1 is .net, files 2-k are .vec
# writes out foo.txt as tab delimited text
# note, this is dependent on an unchanged sort order

use strict; use warnings;
die "usage: pajek_labelvector.pl ... \n" unless @ARGV > 1;
my $netfile = shift (@ARGV);
my @labels=();
#read the vertice labels from .net file
open(NETIN, "<$netfile") or die "error reading $netfile for reading";
while (<NETIN>) {
	if ($_ =~ m/"/) { #only use the vertice label lines, which include quote chars
		$_ =~ /^[0-9]+ "(.*)"/; #search for quoted text
		push @labels, $1; #return match, push to array
	}
}
close NETIN;
#read netfile
foreach my $vecfile (@ARGV) {
	open(VECIN, "<$vecfile") or die "error reading $vecfile";
	open(VECOUT, ">$vecfile.txt") or die "error creating $vecfile.txt";
	my @vec=();
	while (<VECIN>) {
		$_ =~ s/\015?\012//; #manual chomp to allow windows or unix text
		if ($_ !~ m/^\*/) {
			push @vec, $_;
		}
	}
	close VECIN;
	my $veclength = @vec - 1;
	my $lablength = @labels - 1;
	die "error, $vecfile is different length than $netfile" unless $veclength==$lablength;
	for my $i (0..$veclength) {
		print VECOUT "$labels[$i]\t$vec[$i]\n";
	}
	close VECOUT;
	@vec=();
}
print "WARNING: this script assumes that the .vec and .net have the same sort order\nplease spot check the values to avoid error\n";
Texcount.pl
| Gabriel |
Somebody recently asked me for a projected word count of my manuscript (which is in Lyx) and to answer this question I found the amazingly useful script texcount.pl. If you just run “wc” (or the equivalent in a text editor) on a tex or lyx file you count all the plain text and the markup code. Not only does this script screen out the meta-text, but it can give you detailed breakdowns of words, figures, and captions — all broken out by section.
I like to keep scripts in “~/scripts/” so to make this script readily accessible from the command-line I entered the command:
echo "alias texcount='perl ~/scripts/TeXcount_2_2/texcount.pl'" >> ~/.bashrc
Now to run the command I just go to the terminal and type
texcount foo.tex
You should really check out the options if you have a long and complex document. My favorite option is “-sub”. This gives a detailed breakdown of word count, figure count, etc, by chapter, section, or whatever.
texcount -sub foo.tex
Remember that if you always use a certain option, you can write it into the alias command.
Lyx has a similar basic command built in (Tools/Statistics), but it doesn’t give as much information and doesn’t break out the data by section. To use texcount with lyx files, you first need to export Lyx to Latex which you can do from the GUI (File/Export/Latex), but if you’re using texcount anyway you should just use the command line.
lyx --export latex foo.lyx
That works for Linux but on a Mac this will work more consistently
exec '/Applications/Lyx.app/Contents/MacOS/lyx' --export latex foo.lyx
That’s a long command, so on my Mac I created an alias as “lyx2tex”
echo "alias lyx2tex='exec /Applications/Lyx.app/Contents/MacOS/lyx --export latex'" >> ~/.bashrc
Note that all this works on POSIX but may require some modification to work with Windows (unless it has CygWin).