Getting long flat-files out of field-tagged data

August 12, 2010 at 4:08 am 2 comments

| Gabriel |

Some field-tagged data can be pretty unproblematically reshaped into flat-files. However one of the reasons people like field-tagged data is that they can have internal long structures and this creates problems for reshaping. For instance in Web of Science, the “CR” field (works that are being cited by the main work) usually has dozens of references separated by carriage returns. To get this into most statistic packages it has to be reshaped into a long flat-file. In other words, you need a script to turn this:

PT J
AU LEVINSON, RM
TI TEACHING LABELING THEORY - 4 EXPERIENCES IN ILLNESS ATTRIBUTION
SO TEACHING SOCIOLOGY
LA English
DT Article
C1 EMORY UNIV,ATLANTA,GA 30322.
CR BECKER HS, 1963, OUTSIDERS
   MENDEL WM, 1969, ARCH GEN PSYCHIAT, V20, P321
   ROSENHAN DL, 1973, SCIENCE, V179, P250
   SCHEFF TJ, 1964, SOC PROBL, V11, P401
   SCHEFF TJ, 1966, BEING MENTALLY ILL
NR 5
TC 5
PU AMER SOCIOLOGICAL ASSOC
PI WASHINGTON
PA 1722 N ST NW, WASHINGTON, DC 20036-2981
SN 0092-055X
J9 TEACH SOCIOL
JI Teach. Sociol.
PY 1975
VL 2
IS 2
BP 207
EP 211
PG 5
SC Education & Educational Research; Sociology
GA AD732
UT ISI:A1975AD73200007
ER

Into this:

ISI:A1975AD73200007	TEACH SOCIOL	BECKER HS, 1963, OUTSIDERS
ISI:A1975AD73200007	TEACH SOCIOL	MENDEL WM, 1969, ARCH GEN PSYCHIAT, V20, P321
ISI:A1975AD73200007	TEACH SOCIOL	ROSENHAN DL, 1973, SCIENCE, V179, P250
ISI:A1975AD73200007	TEACH SOCIOL	SCHEFF TJ, 1964, SOC PROBL, V11, P401
ISI:A1975AD73200007	TEACH SOCIOL	SCHEFF TJ, 1966, BEING MENTALLY ILL

Note that the other fields usually lack carriage returns or other internal long delimiters so they can be cleaned like this. The two approaches can then be merged (in a flat-file like Stata) or linked (in a relational like R or Access) using the key.

I’ve actually already done this kind of thing twice, with my code for cleaning Memetracker and the IMDb business file. However those two datasets had the convenient property that the record key appears in the first row of the record. With data structures like these, you just remember the key then every time you come across a long entry, write it out along with the key.

Unfortunately, Web of Science has the record key appear towards the end of the record, a data structure that I propose to call “a huge pain in the ass.” This means that you have to collect all the long values in an array, then record the record key, then loop over the array to write out.

#!/usr/bin/perl
#wos2tab_cr.pl by ghr
#this script converts field-tagged WOS queries to tab-delimited text
#it extracts the CR field and attaches to it the fields UT and J9
#unlike wos2tab, this file outputs long-formatted data
#the two types of output can be merged with the UT field

#sice CR comes /before/ UT and J9, must save as array, then loop over array at ER

use warnings; use strict;
die "usage: wos2tab_cr.pl <wos data>\n" unless @ARGV==1;

my $rawdata = shift(@ARGV);

my $ut = "" ; #unique article identifier
my $j9 = "" ; #j9 coding of journal title
my $cr = "" ; #cited work
my @crlist = () ; #list of cited works
my $cr_continued = 0 ; #flag for recently hitting "^CR"

print "starting to read $rawdata\n";

open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">$rawdata.long") or die "error creating $rawdata.long\n";
print OUT "ut\tj9\tcr\n";
while (<IN>) {
	# if begins with non-whitespace character other than CR, flag cr_continued as not
	if($_ =~ m/^\S/) {
		if($_ =~ m/^[^CR]/) {$cr_continued = 0}
	}
	if($_ =~ m/^J9/) {
		$j9 = $_;
		$j9 =~ s/\015?\012//; #manual chomp
		$j9 =~ s/^J9 //; #drop leading tag
	}
	if($_ =~ m/^UT/) {
		$ut = $_;
		$ut =~ s/\015?\012//; #manual chomp
		$ut =~ s/^UT //; #drop leading tag
	}
	
	#first line of a CR field 
	if($_ =~ m/^CR/) {
		$cr = $_;
		$cr =~ s/\015?\012//; #manual chomp
		$cr =~ s/^CR //; #drop leading tag
		$cr_continued = 1 ; #flag to allow for multi-line CR field
		push @crlist, $cr; #add the new cite to the list
	}
	#subsequent lines of a CR field
	if($_ =~ m/^  /) {
		if($cr_continued==1) {
			$cr = $_ ;
			$cr =~ s/\015?\012//; #manual chomp
			$cr =~ s/^   //; #drop leading tag
			push @crlist, $cr; #add the new cite to the list
		}
	}
	
	#when "end record" code is reached, loop over array to write out as long file, then clear memory
	if($_=~ /^ER/) {
		#loop over CRLIST array, printing for each value so as to have long file
		for (my $i = 0; $i < @crlist; $i++) {
			print OUT "$ut\t$j9\t$crlist[$i]\n"; #write out
		}
		#clear memory, just do once per "end record"
		$j9 = "" ;
		$ut = "" ;
		$cr = "" ;
		$cr_continued = 0 ;
		@crlist = () ;
	}
}
close IN;
close OUT;
print "done writing $rawdata.long \n";
Advertisement

Entry filed under: Uncategorized. Tags: , .

Zero marginal productivity Life Without Walls

2 Comments

  • 1. Eric Booth  |  August 14, 2010 at 7:20 pm

    I’ve been doing a lot with field tagged data recently (some of it is in html tables which I am still struggling with), and I’ve wanted to do something similar to what you’ve described but from within Stata.

    Below is the code for a (still messy and buggy) program called removetags.ado which is a working .ado file used to pull information from field tagged data from within Stata. The biggest challenges for me have been (1) how to adapt the code when some fields span multiple rows without field tags to identify them (like your CR s) and (2) how to deal with long entries in Stata (since there is a 244 string limit).

    I think I’ve solved (1) for most cases (see lines 71-90 in the ado-file linked below), but it’s clunky and I’m sure I’ll need to adapt it further when I run across things that make it choke. However, I am currently working on solving (2).
    [[So far, my solution involves identifying the rows that are longer than 244 chars using -file read-, and then reading those rows into Stata as characteristics associated with each record. The advantage of this approach is that you can still manipulate the macros using extended_fcn s and you can keep the association of the long strings with the field tagged records when you merge or export the data…but I’m still having issues with the file read/write commands to manipulate the long strings. I’ve read your posts on grepmerge and Gabi Huiber’s posts on doing something similar in mata (http://enoriver.net/index.php/2010/05/28/using-mata-for-string-processing/) but I’m not much a programmer and I don’t know mata,so will be my attempt at a work around]]

    Ignoring that problem that long strings will be truncated in Stata, here is some code that does what you do with perl from within Stata.

    To get the ado file, visit this link from my dropbox:

    http://dl.dropbox.com/u/428249/removetags.ado

    To test this ado-file, I’ve put two sample .txt files in my dropbox. The first one is just the WOS record you provided above (but I’ve -expanded- it to 3 copies). The second example is from EndNote. When you export your citations to a text file, it becomes field tagged data, so this is a another good test example.

    If any of these links are dead, please email me for a copy (ebooth@ppri.tamu.edu).

    Once you’ve downloaded and copied the .ado file into the proper location, try these examples (there’s no help file, so some instructions are included below):


    ***************!
    Instructions:
    /*
    Data must be field tagged records where each records spans across multiple rows
    Options:
    recordstart() must contain the field tag that begins each record
    fieldtagdelimiter() must contain the char that delimits the field tag (default is a space char)
    keep() is optional list of fieldtags that will be kept in reshaped file
    removechars() is an optional list of problematic non-alphanumeric chars that need to be removed from your fieldtags
    ((note: use the -charlist- program -ascii- (from SSC) to help identify these, if they exist))
    */

    **examples**
    clear
    which removetags

    //example 1//
    removetags using "http://dl.dropbox.com/u/428249/websci.txt", rec("PT") fieldtagd(" ") keep(AU UT JI CR*)
    li

    //example 2//
    removetags using "http://dl.dropbox.com/u/428249/endnote.txt", rec("Reference Type") fieldtagd(":")
    li record Year Author Title in 1/10
    ***************!

    So, in example 1 (WOS records) each record starts with the fieldtag “PT”, is delimited by the first space characters, and I want to keep the vars AU, UT, JI, and the CR’s.

    In example 2, “Reference Type” denotes the start of each record entry, the field tags are delimited by a colon, and I keep all the vars.

    Maybe this is useful for someone, but more importantly, I’d like to hear if anyone tries this out and has any feedback, comments, or help on how to improve this.

    ~ Eric

  • 2. Brooks  |  May 17, 2012 at 8:45 pm

    There is a word for it! An ugly word:

    munging: http://www.manning.com/cross/
    munge: http://www.catb.org/jargon/html/M/munge.html


The Culture Geeks


%d bloggers like this: