wos2tab.pl
July 19, 2010 at 2:13 pm gabrielrossman 6 comments
| Gabriel |
One of my grad students is doing some citation network analysis, for which the Python script (and .exe wrapper) wos2pajek is very well-suited. (Since most network packages can read “.net”, this is a good idea even if you’re not using Pajek.)
However, the student is also interested in node-level attributes, not just the network. Unfortunately, WOS queries are field-tagged, which is kind of a pain to work with, and the grad student horrified me by expressing a willingness to spend weeks reshaping the data by hand in Excel. (Even in grad school your time is a lot more valuable than that.) To get the data into tab-delimited text, I modified an earlier script I wrote for parsing field-tagged IMDb files (in my case business.list, but most of the film-level IMDb files are structured similarly).
The basic approach is to read a file line by line and match its contents by field tag, saving the contents in a variable named after the tag. Then, when you get to the record delimiter (in this case, a blank line), dump the contents to disk and wipe the variables. Note that since the “CR” (cited reference) field has internal carriage returns, it would require a little doing to integrate into this script, which is one of the reasons you’re better off relying on wos2pajek for that functionality.
#!/usr/bin/perl
#wos2tab.pl by ghr
#this script converts field-tagged Web Of Science queries to tab-delimited text
#for creating a network from the "CR" field, see wos2pajek
#note, you can use the info extracted by this script to replicate a wos2pajek key and thus merge

use warnings;
use strict;
die "usage: wos2tab.pl <wos data>\n" unless @ARGV==1;
my $rawdata = shift(@ARGV);

my $au = ""; #author
my $ti = ""; #title
my $py = ""; #year
my $j9 = ""; #j9 coding of journal title
my $dt = ""; #document type

# to extract another field, work it in along the lines of the existing vars
# each var must be
# 1. declared with a "my statement" (eg, lines 12-16)
# 2. added to the header with the "print OUT" statement (ie, line 29)
# 3. written into a search and store loop following an "if" statement (eg, lines 37-41)
# 4. inside the blank line match loop (ie, lines 59-66)
#    4a. add to the print statement (ie, line 60)
#    4b. add a clear statement (eg, lines 61-65)

open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">$rawdata.tsv") or die "error creating $rawdata.tsv\n";
print OUT "au\tdt\tpy\tti\tj9\n";
while (<IN>) {
    if($_ =~ m/^AU/) {
        $au = $_;
        $au =~ s/\015?\012//; #manual chomp
        $au =~ s/^AU //; #drop leading tag
        $au =~ s/,//; #drop comma -- author only
    }
    if($_ =~ m/^DT/) {
        $dt = $_;
        $dt =~ s/\015?\012//; #manual chomp
        $dt =~ s/^DT //; #drop leading tag
    }
    if($_ =~ m/^TI/) {
        $ti = $_;
        $ti =~ s/\015?\012//; #manual chomp
        $ti =~ s/^TI //; #drop leading tag
    }
    if($_ =~ m/^J9/) {
        $j9 = $_;
        $j9 =~ s/\015?\012//; #manual chomp
        $j9 =~ s/^J9 //; #drop leading tag
    }
    if($_ =~ m/^PY/) {
        $py = $_;
        $py =~ s/\015?\012//; #manual chomp
        $py =~ s/^PY //; #drop leading tag
    }

    #when blank line is reached, write out and clear memory
    if($_ =~ /^\s*$/) { #\s* so a line holding only a carriage return also counts as blank
        print OUT "$au\t$dt\t$py\t$ti\t$j9\n";
        $au = "" ;
        $dt = "" ;
        $ti = "" ;
        $py = "" ;
        $j9 = "" ;
    }
}
close IN;
close OUT;
print "\ndone\n";
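The caveat about the “CR” field comes down to continuation lines: a field can spill onto following lines that carry no tag of their own. The following is a minimal sketch, in Python rather than the Perl above and not a description of how wos2pajek does it, of one way to fold such lines into the field that precedes them. It assumes that continuation lines begin with whitespace, that records end at a blank line as in the script, and that FN, VR and EF are file-level tags to skip. The sample record and the names `parse_wos`, `FILE_TAGS` and `SAMPLE` are invented for illustration.

```python
# Assumption: FN/VR open the file and EF closes it, so they belong to no record.
FILE_TAGS = {"FN", "VR", "EF"}

def parse_wos(lines):
    """Yield one dict per record, mapping each two-letter tag to a list of values."""
    record, tag = {}, None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():              # blank line closes the current record
            if record:
                yield record
            record, tag = {}, None
        elif line.startswith(" "):        # indented line continues the previous tag
            if tag is not None:
                record[tag].append(line.strip())
        else:                             # a new "XX value" line
            tag, _, value = line.partition(" ")
            if tag in FILE_TAGS:
                tag = None
                continue
            record.setdefault(tag, []).append(value.strip())
    if record:                            # file ended without a trailing blank line
        yield record

# Hypothetical record, made up for this example.
SAMPLE = """\
FN hypothetical export
VR 1.0
PT J
AU Doe, J
   Roe, R
TI An invented title
PY 2009
CR SMITH A, 1999, AM J SOCIOL, V104, P1
   JONES B, 2001, SOC FORCES, V80, P2
ER

EF
"""

records = list(parse_wos(SAMPLE.splitlines()))
print(len(records[0]["CR"]))  # 2
```

Keeping every field as a list means a multi-line field such as CR or AU needs no special case; how the list is joined for a flat file (a space for wrapped titles, some other delimiter for references) is a separate decision.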
Entry filed under: Uncategorized. Tags: cleaning, macros, perl, sociology of science.
1.
Brooks | July 19, 2010 at 5:38 pm
Weeks? Come on, it took me one very tedious evening to reshape the data in Excel. But you’re right, it was horrifying, and I felt stupid during it. Thank God for TextWrangler, or it would have taken weeks.
2.
Brooks | July 19, 2010 at 6:00 pm
I should also say that while the WoS2Pajek program is useful, Gabriel’s script is also very helpful if you just want to keep things in a clean network format like a DL file. Using TextWrangler, I flatten the CR field by replacing the leading carriage return and spaces prior to each citation with tabs. You can then use Gabriel’s script to extract only the CR field. This yields undirected, relational citation data in a node-list format that will work if you add the appropriate DL header.
Because WoS doesn’t include the originating article as a coded citation in the CR field, the data doesn’t mean ARTICLE–>CITATION. At best, this gives you CITATION<–>CITATION data, with edges meaning that two citations appeared in the same article. I haven’t started using the WoS2Pajek program yet, and it remains to be seen how it codes the originating article since WoS doesn’t provide this information.
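For illustration, the co-citation reading described in this comment (an edge means two references appeared in the same article's CR field) can be sketched in Python. This is only a sketch under stated assumptions: each article's CR field has already been split into a list of reference strings, and the labels in `demo` and the function name `cocitation_edges` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cocitation_edges(reference_lists):
    """Count how often each unordered pair of references shares a citing article."""
    edges = Counter()
    for refs in reference_lists:
        unique = sorted(set(refs))             # ignore duplicates within one article
        edges.update(combinations(unique, 2))  # every unordered pair in this article
    return edges

# Hypothetical CR lists for three articles.
demo = [
    ["SMITH 1999", "JONES 2001", "LEE 2005"],
    ["JONES 2001", "LEE 2005"],
    ["LEE 2005"],
]

for (a, b), weight in sorted(cocitation_edges(demo).items()):
    print(f"{a}\t{b}\t{weight}")
```

Sorting each list before pairing keeps every pair in one fixed order, so the same two references are never counted under two different keys.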
3.
gabrielrossman | July 20, 2010 at 1:53 pm
i think you’re underestimating wos2pajek. if you read the manual you’ll see (on page 12) a discussion of how to merge a CR field and a regular record.
unfortunately i can’t download the Python source code (i just end up downloading a file called “fetch.php”) so I can’t see how they implement it, but the description of how CR relates to the other fields is pretty clear.
btw, if you did want to treat co-occurrence in a CR field as an affiliation network, you can modify this Stata script, just make sure you correctly flag the internal delimiter and also that you don’t run into trouble w the 244 character limit.
4.
mike | July 20, 2010 at 3:24 pm
I don’t work with Pajek, so this might not be appropriate. But, the other way that you can think about this is in a relational database-type framework of having multiple datasets, one that contains linkage data and one that contains node-level data. I think that it might provide an alternative way to carry the carriage returns and a smaller data footprint since only the necessary variables can be included.
The first step is to create a unique ID for each node by adding an “extra” tag on the WoS output or some way to link data with a unique identifier (since I don’t know wos2pajek, I’m not sure how this is handled).
Then, you can import the data for articles into Zotero or any other citation software that outputs to XML format. Then, in Python, Perl, or any other language of choice, you can use standard XML processing libraries designed to handle carriage returns and other types of data inside of the XML tags. This will create an article-level dataset and use standard text-parsing libraries.
Then, you can easily merge these two datasets, when necessary, for different analyses or load the data into an SQL-type database.
Like I said, I’m not sure that this would save time/energy but it might be a possibility.
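As a rough sketch of the two-dataset idea, the following Python uses the standard-library `sqlite3` module to hold a node-level table and a linkage table joined on a unique ID. The table names, column names and rows are all hypothetical, and the Zotero/XML import step is not shown.

```python
import sqlite3

def build_db(articles, citations):
    """Load node-level rows and linkage rows into an in-memory SQLite database."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE articles  (id INTEGER PRIMARY KEY, au TEXT, py INTEGER, j9 TEXT);
        CREATE TABLE citations (article_id INTEGER REFERENCES articles(id), cited TEXT);
    """)
    con.executemany("INSERT INTO articles VALUES (?, ?, ?, ?)", articles)
    con.executemany("INSERT INTO citations VALUES (?, ?)", citations)
    return con

# Hypothetical data: one row per article, one row per (article, cited reference).
articles = [(1, "Doe J", 2009, "AM J SOCIOL"), (2, "Roe R", 2010, "SOC FORCES")]
citations = [(1, "SMITH 1999"), (1, "JONES 2001")]

con = build_db(articles, citations)

# Merge the two datasets only when an analysis needs both.
rows = con.execute("""
    SELECT a.au, a.py, COUNT(c.cited)
    FROM articles AS a
    LEFT JOIN citations AS c ON c.article_id = a.id
    GROUP BY a.id
    ORDER BY a.id
""").fetchall()
print(rows)
```

The `LEFT JOIN` keeps articles that have no cited references at all, which an inner join would silently drop.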
5.
gabrielrossman | July 20, 2010 at 4:31 pm
mike,
great minds think alike as a lot of network datasets follow similar structures to what you’re describing. the Pajek .net format first has a list of nodes, each of which is assigned a serial number and then a list of edges as described by pairs of node serial numbers. iGraph’s native format is similar but a bit more complicated.
anyway, i think the XML/SQL approach might be overkill for our needs but it’s good to keep in mind if it scales up
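The .net layout described above (a numbered node list followed by pairs of node numbers) can be sketched in a few lines of Python. This is an illustration only: the function name `to_pajek` and the labels are made up, it writes directed `*Arcs` (an undirected network would use `*Edges` instead), and it assumes labels contain no double quotes.

```python
def to_pajek(edges):
    """Return Pajek .net text for a list of (source, target) label pairs."""
    ids = {}
    for pair in edges:
        for label in pair:
            ids.setdefault(label, len(ids) + 1)   # serial numbers start at 1
    lines = [f"*Vertices {len(ids)}"]
    lines += [f'{number} "{label}"' for label, number in ids.items()]
    lines.append("*Arcs")                         # directed ties
    lines += [f"{ids[source]} {ids[target]}" for source, target in edges]
    return "\n".join(lines) + "\n"

# Hypothetical article -> cited-reference pairs.
print(to_pajek([("DOE 2009", "SMITH 1999"), ("DOE 2009", "JONES 2001")]))
```

Because nodes are numbered in order of first appearance, the node list doubles as the key for merging node-level attributes back onto the network.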
6. Getting long flat-files out of field-tagged data « Code and Culture | August 12, 2010 at 4:22 am
[…] field-tagged data can be pretty unproblematically reshaped into flat-files. However one of the reasons people like field-tagged data is that they can have internal long […]