Posts Tagged cleaning
Stata2Pajek w vertice colors
| Gabriel |
My Stata program stata2pajek let’s you export an edge list or arc list from Stata to Pajek. However I recently wanted to make color-coded vertices and this required tweaking the syntax a bit. Because this is a relational database problem (and Stata likes flat-file), I do it by merging in a vertice level file on disk. Although I’ve written it to merge on colors it should be fairly easy to rewrite to deal with other vertice-level variables. Unfortunately the syntax is pretty awkward, which is why I’m forking it and posting it to the blog instead of updating the main stata2pajek file at SSC.
Note that the script assumes that nodes which appear in the arc list but not the vertice attributes file should be color-coded yellow. If you want yellow to have a substantive meaning you should change line 61 to be something else.
The syntax is the same as stata2pajek except there are a few more options. If you omit these options it behaves just like stata2pajek classic:
- attributefile() — the file where the vertice level variables are stored
- attributekey() — the merge key, defaults to “ego”
- color() — the variable storing the color-codes
For instance, to color-code a graph of radio stations based on when they first played “My Humps” (yes, really), I use this command:
stata2pajekalt ego alter, filename(ties_bounded_humpcolor) attributefile(humps_color) attributekey(ego) color(color)
It says to treat the data in memory as an arc list, to merge on station-level data on adoption time where the station name is stored as the variable “ego” in the file “humps_color”, and to write out the whole mess as a Pajek formatted text file called “ties_bounded_humpscolor.net”.
*1.2b GHR Dec 10, 2009
*forked from stata2pajek 1.2
capture program drop stata2pajekalt
program define stata2pajekalt
version 10
set more off
syntax varlist(min=2 max=2) [ , tiestrength(string asis) filename(string asis) EDGEs attributefile(string asis) attributekey(string asis) color(string asis)]
tempfile dataset
quietly save `dataset'
gettoken ego alter : varlist
local starchar="*"
if "`filename'"=="" {
local pajeknetfile "mypajekfile"
}
else {
local pajeknetfile "`filename'"
}
if "`attributekey'"=="" {
local attributekey "`ego'"
}
capture file close pajeknetfile
file open pajeknetfile using `pajeknetfile'.net, write text replace
use `dataset', clear
drop `ego'
ren `alter' `ego'
append using `dataset'
keep `ego'
contract `ego'
ren `ego' verticelabel
drop _freq
sort verticelabel
gen number=[_n]
order number verticelabel
if "`attributefile'"!="" {
ren verticelabel `attributekey'
merge `attributekey' using `attributefile'
drop if _merge==2
ren `attributekey' verticelabel
drop _merge
keep number verticelabel `color'
sort verticelabel
}
tempfile verticelabels
quietly save `verticelabels', replace
local nvertices=[_N]
file write pajeknetfile "`starchar'Vertices `nvertices'" _n
if "`attributefile'"=="" {
forvalues x=1/`nvertices' {
local c2=verticelabel in `x'
file write pajeknetfile `"`x' "`c2'""' _n
}
}
else {
forvalues x=1/`nvertices' {
local c2=verticelabel in `x'
local colvalue=`color' in `x'
if "`colvalue'"=="" {
local colvalue "Yellow"
}
file write pajeknetfile `"`x' "`c2'" ic `colvalue'"' _n
}
}
use `dataset', clear
ren `ego' verticelabel
sort verticelabel
quietly merge verticelabel using `verticelabels'
quietly keep if _merge==3
drop _merge verticelabel
ren number `ego'
ren `alter' verticelabel
sort verticelabel
quietly merge verticelabel using `verticelabels'
quietly keep if _merge==3
drop _merge verticelabel
ren number `alter'
order `ego' `alter' `tiestrength'
keep `ego' `alter' `tiestrength'
local narcs=[_N]
if "`edges'"=="edges" {
file write pajeknetfile `"`starchar'Edges"' _n
}
else {
file write pajeknetfile `"`starchar'Arcs"' _n
}
sort `ego' `alter'
if "`tiestrength'"~="" {
forvalues x=1/`narcs' {
local c1=`ego' in `x'
local c2=`alter' in `x'
local c3=`tiestrength' in `x'
file write pajeknetfile "`c1' `c2' `c3'" _n
}
}
else {
forvalues x=1/`narcs' {
local c1=`ego' in `x'
local c2=`alter' in `x'
file write pajeknetfile "`c1' `c2'" _n
}
}
file close pajeknetfile
*ensure that it's windows (CRLF) text format
if "$S_OS"~="Windows" {
filefilter `pajeknetfile'.net tmp, from(\M) to(\W) replace
shell mv tmp `pajeknetfile'.net
filefilter `pajeknetfile'.net tmp, from(\U) to(\W) replace
shell mv tmp `pajeknetfile'.net
}
use `dataset', clear
disp "Your output is saved as"
disp "`c(pwd)'`c(dirsep)'`pajeknetfile'.net"
end
2 comments December 10, 2009
Drop the entirely missing variables
| Gabriel |
One of the purely technical frustrations with GSS is that it’s hard to figure out if a particular question was in a particular wave. The SDA server is pretty good about kicking these from extracts, but it still leaves some in. The other day I was playing with the 2008 wave and got sick of saying “oooh, that looks interesting” only to find the variable was missing for my wave, which happened repeatedly. (For instance, I thought it would be fun to run Biblical literalism against the Israel feeling thermometer, but no dice, at least for the 2008 wave).
To get rid of these phantom variables, I wrote this little loop that drops variables with entirely missing data (or that are coded as strings, see below):
foreach varname of varlist * {
quietly sum `varname'
if `r(N)'==0 {
drop `varname'
disp "dropped `varname' for too much missing data"
}
}
Unfortunately “sum” thinks that string variables have no observations so this will also drop strings. There’s a workaround, but it involves the “ds” command, which works in Stata 11 but has been deprecated and so may not work in future versions.
ds, not(type string)
foreach varname of varlist `r(varlist)' {
quietly sum `varname'
if `r(N)'==0 {
drop `varname'
disp "dropped `varname' for too much missing data"
}
}
7 comments December 2, 2009
Perl text library
| Gabriel |
I found this very useful library of perl scripts for text cleaning. You can use them even if you can’t code perl yourself, for instance to transpose a dataset just download “transpose.pl” script to your ~/scripts directory and enter the shell command:
perl ~/scripts/transpose.pl row_col.txt > col_row.txt
The transpose script is particularly useful to me as I’ve never gotten Excel’s transpose function to work and for some bizarre reason Stata’s “xpose” command only works with numeric variables. You can even use these scripts from directly in a do-file like so:
tempfile foo1
tempfile foo2
outsheet using `foo1'.txt
shell perl ~/scripts/transpose.pl `foo1'.txt > `foo2'.txt
insheet using `foo2'.txt, clear
1 comment November 30, 2009
Pajek_labelvector.pl
| Gabriel |
A few months ago I wrote some notes on using a text editor to get output out of Pajek or Network Workbench and into a rows and columns dataset. Now that I’ve learned Perl from the course notes my UC Davis colleagues posted, I wrote up a perl script that will automate this and create a tab-delimited ascii file (or files if you give it multiple .vec files).
I’d like to put the code directly in the post but when I try, wordpress drops some of the characters (eg, backslash-zero-one-five renders as just “15″) so I put the properly-formatted script here .[Update, the new "sourcecode" tag properly escapes all this stuff so I've updated the post to include the script at the bottom of the post. The external link still works but is now unnecessary].
It takes the labels from a “.net” data file and merges them (by sort order) onto a “.vec” output file which let’s you merge it back onto your main (non-network) dataset. Read my older post for an explanation of why this is necessary. Note that if the sort order is different for the .vec and .net files it will get screwy so be sure to spot check the values. The syntax is simply:
perl pajek_labelvector.pl myfile.net netmetric_1.vec netmetric_k.vec
Between this perl script and stata2pajek.ado it should be fairly easy to integrate network data into Stata.
#!/usr/bin/perl
# pajek_labelvector.pl
# Gabriel Rossman, UCLA, 2009-09-22
# this file extracts the vertice labels from a .net file and merges it (by sort order) with one or more .vec files
# take filenames as arguments
# file 1 is .net, files 2-k are .vec
# writes out foo.txt as tab delimited text
# note, this is dependent on an unchanged sort order
use strict; use warnings;
die "usage: pajek_labelvector.pl ... \n" unless @ARGV > 1;
my $netfile = shift (@ARGV);
my @labels=();
#read the vertice labels from .net file
open(NETIN, "<$netfile") or die "error reading $netfile for reading";
while (<NETIN>) {
if ($_ =~ m/"/) { #only use the vertice label lines, which include quote chars
$_ =~ /^[0-9]+ "(.*)"/; #search for quoted text
push @labels, $1; #return match, push to array
}
}
close NETIN;
#read netfile
foreach my $vecfile (@ARGV) {
open(VECIN, "<$vecfile") or die "error reading $vecfile"; open(VECOUT, ">$vecfile.txt") or die "error creating $vecfile.txt";
my @vec=();
while (<VECIN>) {
$_ =~ s/\015?\012//; #manual chomp to allow windows or unix text
if ($_ !~ m/^\*/) {
push @vec, $_;
}
}
close VECIN;
my $veclength = @vec - 1;
my $lablength = @labels -1;
die "error, $vecfile is different length than $netfile" unless $veclength==$lablength;
for my $i (0..$veclength) {
print VECOUT "$labels[$i]\t$vec[$i]\n";
}
close VECOUT;
@vec=();
}
print "WARNING: this script assumes that the .vec and .net have the same sort order\nplease spot check the values to avoid error\n";
Add comment September 29, 2009
Stata2Pajek (update)
I fixed two bugs in Stata2Pajek and improved the documentation. To get it, type (from within Stata)
ssc install stata2pajek
If you already have the old version, the above command will update it but even better is to update all of your ado files with:
adoupdate
Add comment September 18, 2009
Bash/Perl tutorial
| Gabriel |
I’m a big fan of the idea of using Unix tools like Perl to script the cleaning of the massive text-based datasets that social scientists (especially sociologists of culture) often use. Unfortunately there’s something of a learning curve to this so even though I like the idea in principle and increasingly in practice, I still sometimes clean data interactively with TextWrangler and just try to keep good notes.
Fortunately two of my UC system colleagues have posted the course materials for a “Unix and Perl Primer for Biologists.” I’m about halfway through the materials and it’s great, in part because (unlike the llama) they assume no prior familiarity with programming or Unix. Although the examples involve genetics, it’s well-suited for social scientists as, like us, biologists are not computer scientists but are reasonably technically competent and they often deal with large text based data sets. Basically, if you can write a Stata do-file, you should be able to follow their course guide and if you use things like scalars and loops it should be pretty easy.
I highly recommend the course to any social scientist who deals with large dirty datasets, in other words, basically anyone who is a quant but doesn’t just download clean ICPSR or Census data. This is especially relevant for anyone who wants to scrape data off the web, use IMDB, do large-scale content analysis, etc.
Some notes:
- They assume you will a) be running the materials off a stick and b) using Mac OS X. If you’re keeping the material on the hard drive, get used to typing “chmod u+x foo.pl” to make the perl script “foo” executable. (This step is unnecessary for files on a stick because unlike HFS+ or EXT3, the FAT filesystem doesn’t do permissions). If you’re using a different version of Unix, most of it should work similarly with only a few minor differences, such as that you’ll want to use Kate instead of Smultron and on a Mac a USB stick is in /Volumes/ whereas in Linux it’s in /Media/ and in BSD it’s in /mnt/. If you’re using Windows you’ll either need to a) install CygWin b) install a virtual machine c) run off a live cd or bootable stick or d) dual boot with Wubi.
- If you’re really used to Stata, some of the nomenclature may seem backwards, mostly because Perl doesn’t keep a dataset in memory but processes it on disk, one command at a time. So, in Perl and Bash a “variable” is the equivalent to what Stata calls a (global or local) “macro”. The closest Perl equivalent to what Stata calls a “variable” would be a “field” in a tab-delimited text file.
[Update: Although they suggest Smultron, I find TextMate works even better as it can execute scripts entirely within the editor, so you don't have to constantly cmd-tab to Terminal.app and back.]
Add comment September 15, 2009
Stata2Pajek
| Gabriel |
[update #4 (9/16), I also fixed a problem with tie strength, improved the help file, and submitted it to SSC. the SSC version should be updated in a few days but in the meantime the updated version is in this post]
[update #3, thanks to "T" in the comments, i fixed a bug in the code. the updated code is in this post but not yet uploaded to SSC]
[update #2, the ado file is now hosted at SSC. type "ssc install stata2pajek" to install it]
[update #1, I cleaned the code a bit]
I wait impatiently wait for somebody more talented than I am to write a native Stata social network package (which should be feasible with Mata). In the meantime, to do anything much more ambitious than estimate density, us Stata folks have to export the data to other packages, basically all of which can read the “.net” format. Currently the best way to do this is to use the command “outsheet” and then clean it with the Windows program “Txt2Pajek,” but I decided to do natively in Stata. My way is faster, scriptable, and works better cross-platform. (Running Txt2Pajek on Unix-like computers requires Crossover/Wine, although it does run smoothly).
The program takes the arguments dataset, ego, alter, and an optional “tie strength” argument.
*1.2 GHR Sept 16, 2009
capture program drop stata2pajek
program define stata2pajek
version 10
set more off
syntax varlist(min=2 max=2) [ , tiestrength(string asis) filename(string asis) EDGEs]
tempfile dataset
quietly save `dataset'
gettoken ego alter : varlist
local starchar="*"
if "`filename'"=="" {
local pajeknetfile "mypajekfile"
}
else {
local pajeknetfile "`filename'"
}
capture file close pajeknetfile
file open pajeknetfile using `pajeknetfile'.net, write text replace
use `dataset', clear
drop `ego'
ren `alter' `ego'
append using `dataset'
keep `ego'
contract `ego'
ren `ego' verticelabel
drop _freq
sort verticelabel
gen number=[_n]
order number verticelabel
tempfile verticelabels
quietly save `verticelabels', replace
local nvertices=[_N]
file write pajeknetfile "`starchar'Vertices `nvertices'" _n
forvalues x=1/`nvertices' {
local c2=verticelabel in `x'
file write pajeknetfile `"`x' "`c2'""' _n
}
use `dataset', clear
ren `ego' verticelabel
sort verticelabel
quietly merge verticelabel using `verticelabels'
quietly keep if _merge==3
drop _merge verticelabel
ren number `ego'
ren `alter' verticelabel
sort verticelabel
quietly merge verticelabel using `verticelabels'
quietly keep if _merge==3
drop _merge verticelabel
ren number `alter'
order `ego' `alter' `tiestrength'
keep `ego' `alter' `tiestrength'
local narcs=[_N]
if "`edges'"=="edges" {
file write pajeknetfile `"`starchar'Edges"' _n
}
else {
file write pajeknetfile `"`starchar'Arcs"' _n
}
sort `ego' `alter'
if "`tiestrength'"~="" {
forvalues x=1/`narcs' {
local c1=`ego' in `x'
local c2=`alter' in `x'
local c3=`tiestrength' in `x'
file write pajeknetfile "`c1' `c2' `c3'" _n
}
}
else {
forvalues x=1/`narcs' {
local c1=`ego' in `x'
local c2=`alter' in `x'
file write pajeknetfile "`c1' `c2'" _n
}
}
file close pajeknetfile
*ensure that it's windows (CRLF) text format
if "$S_OS"~="Windows" {
filefilter `pajeknetfile'.net tmp, from(\M) to(\W) replace
shell mv tmp `pajeknetfile'.net
filefilter `pajeknetfile'.net tmp, from(\U) to(\W) replace
shell mv tmp `pajeknetfile'.net
}
use `dataset', clear
disp "Your output is saved as"
disp "`c(pwd)'`c(dirsep)'`pajeknetfile'.net"
end
I also wrote a help file:
{smcl}
{* 16sep2009}{...}
{hline}
help for {hi:stata2pajek}
{hline}
{title:Export data to Pajek .net format}
{p 8 17 2}
{cmd:stata2pajek} {it:ego alter}[, {cmdab:edges tiestrength() filename()}]
{title:Description}
{p 4 4 2}
{cmd:stata2pajek} exports data to the ".net" format read by Pajek, Network
Workbench, and many other social network analysis packages.
{title:Remarks}
{p 4 4 2}
The program assumes that you already have used Stata to create an edge list
or arc list. You specify which (string or numeric) variable identifies ego and
which alter. Specifying a tie strength variable is optional. {cmd:stata2pajek}
converts this to .net format, which is a Windows-formatted text file beginning
with a list of vertices (aka nodes) and their labels, followed by a list of ties
(arcs or edges).
{p 4 4 2}
As an alternative to this program, you may wish to use {cmd:outsheet} then
process the saved output with the Windows program txt2pajek.
{p 4 4 2}
Note that {cmd:stata2pajek} treats the Stata versions of your id variables as
labels even if they are numeric. Thus if you have a node called #15 in Stata
it will not necessarily also be called #15 in Pajek. Please see the vertice
section of the .net file (which is human-readable text) to see the
correspondence.
{title:Options}
{p 4 8 2}
{cmd:edges} specifies that ties should be treated as edges (symmetrical ties).
The default is to treat ties as arcs (directed ties).
{p 4 8 2}
{cmd:filename()} allows you to name the output file. By default it is named
mypajekfile.net
{title:Examples}
{p 4 8 2}{cmd:. *create a random network with 10 nodes and export it as pajek}{p_end}
{p 4 8 2}{cmd:. clear}{p_end}
{p 4 8 2}{cmd:. set obs 200}{p_end}
{p 4 8 2}{cmd:. gen i=int(uniform()*10)+1}{p_end}
{p 4 8 2}{cmd:. gen j=int(uniform()*10)+1}{p_end}
{p 4 8 2}{cmd:. contract i j, freq(strength)}{p_end}
{p 4 8 2}{cmd:. drop if i==j}{p_end}
{p 4 8 2}{cmd:. sort i j}{p_end}
{p 4 8 2}{cmd:. stata2pajek i j, tiestrength(strength) filename(samplerandomnetwork)}{p_end}
{title:Author}
{p 4 4 2}Gabriel Rossman, UCLA{break}
rossman@soc.ucla.edu
{title:Also see}
{p 4 13 2}On-line:
help for {help outsheet}
A few notes:
- My code assumes you want arcs but it would be pretty easy to modify so it gives edges, or even lets you mix and match. [update -- edges are now an option]
- Escaped quotes choke the syntax highlighting of TextMate and TextWrangler. I wouldn’t be surprised if there were similar problems with UltraEdit and/or TextPad. However Smultron works great, which I plan to use in the future for any file that uses escaped quotes (though generally I like the TextMate for the code-folding).
- Stata ignores a literal “*” in the “file” command. My workaround is to define a local containing the asterix, then use that local.
- I can’t figure out how to dump the entire dataset into “file.” My extremely ugly workaround is to write it to file one line at a time using locals to ferry the values between the dataset in memory and the text file on disk. This works OK for me but might be slow for large datasets.
Now I just need to figure out how to invoke Pajek from the command line and I’ll be able to completely script everything rather than having do-file lines that say things like “Note, after this runs remember to play around with the Pajek GUI for ten minutes.” I know this is technically possible because many Pajek commands are scriptable in Windows, but I’m using a Mac and would have to figure out how to pipe it from Stata to the shell to Crossover/Wine to Pajek, which I’ve never done. Another alternative is to use Network Workbench.
Also see this post on getting Pajek output back into Stata.
4 comments September 1, 2009
If it ain’t broke
| Gabriel |
I was running some code today and noticed the return message
(note: you are using old merge syntax; see [R] merge for new syntax)
I looked this up and saw that the syntax had changed from this in Stata 10 and earlier:
use macro, clear merge macro_id using micro
to this in Stata 11:
use macro, clear merge 1:m macro_id using micro
My first thought was, aha, they must have added the “1:1″ and “1:m” options as a way to make a more general syntax that will combine the functionality of merge, joinby, and cross into one command. However on RTFM I see that:
if you think you need to perform an m:m merge, then we suspect you are wrong. If you would like to match every observation in the master to every observation in the using with the same values of the key variable(s), then you should be using joinby
So it turns out that the what “m:m” does is not to replicate joinby, but to allow a sort order merge within matches, a prospect that the Stata people warn is probably a bad idea (and I agree).
So the new version of merge doesn’t add any real functionality and it seems to be changed only as a reliability check, so that you don’t accidentally do a one-to-many when you think you’re doing many-to-one. Since that worry has never kept me up at night in the first place, I’m sticking with the old syntax (which has not been deprecated and still works, even without the “version” command).
(That aside I’ve been very happy with Stata 11 so far. It’s running really fast and most of the changes are clear improvements).
Add comment August 20, 2009
SPPA 2008
| Gabriel |
The 2008 wave for the Survey of Public Participation in the Arts is now available at CPANDA. In the WSJ, Terry Teachout noticed one basic thing in the data, which is that nobody born since the Ford administration likes jazz. I’ve been waiting for this dataset for awhile because several years ago Pete Peterson and I noticed some weird differences between 92 to 02 (particularly as relates to the omnivore hypothesis) and we need a third data point to help us figure it out. Another cool thing about the dataset is that they now ask questions about literature by genre, which as seen in the literature based on SPPA music questions, is a good way to get at cultural capital type issues.
Anyway, one of the minor annoyances about SPPA is that it uses a convention of “1=Yes 2=No” whereas any native Stata speaker knows that this is an abomination and contrary to the divine rule that in all binary variables, 0 shall equal “no” and 1 shall equal “yes.” (For one thing, this makes it easier to sum the dummies into a count). As such I’ve written this code to fix these perverse variables. Just add it to the end of the do-file that CPANDA generates for you when you download the file.
*change all the yes/no vars to Stata convention where 0 is no and 1 is yes
*all variables that are similar to yes/no but slightly different (eg, PEDWWNTO) are left alone
*to avoid confusion by plugging into scripts that assume SPSS yes/no, rename these variables with suffix "r"
global yesnovars "PEX4A PEX4B PEX5 PEQ1A PEQ2A PEQ3A PEQ4A PEQ5A PEQ6A PEQ7A PEQ8AA PEQ9A PEQ10A PEQ10B PEQ11A PEQ12A PEQ13AA1 PEQ13AA2 PEQ13AA3 PEA1A PEA1B PEA2 PEA31 PEA32 PEA33 PEA34 PEA35 PEA36 PEA37 PEA38 PEA39 PEA310 PEA311 PEA41 PEA42 PEA43 PEA44 PEA45 PEA46 PEA47 PEA48 PEA49 PEA410 PEA411 PEA412 PEA413 PEA414 PEB1A PEB2A PEB3A PEB4A PEB5A PEB6 PEB7 PEB8 PEB9 PEB10 PEB11 PEB12 PEB13 PEB14 PEC2A PEC3A PEC4A PEC5A PEC6A PEC7A PEC8A PEC9A PEC10A PEC11A PEC12A PEC13A PEC14A PEC15A PEC15B PEC16A PEC16B PEC16C PEC17A PEC18A PEC19A PEC20A PEC21A PEC25A PEC26A PEC27A PED1A PED1C PED1D PED2A PED2C PED2D PED3A PED3C PED3D PED4A PED4C PED4D PED5A PED5C PED5D PED6A PED6C PED6D PED7A PED7C PED7D HETELAVL HETELHHD HUBUS PEABSPDO PEAFEVER PEAFNOW PEDW4WK PEDWAVL PEDWLKO PEDWLKWK PEDWWK PEERNCOV PEERNLAB PEERNRT PEERNUOT PEHRAVL PEJHWKO PELAYAVL PELAYFTO PELAYLK PELKAVL PEMJOT PENLFRET PESCHENR PUBUS1 PUBUS2OT PUDIS1 PUDIS2 PUHROFF1 PUHROT1 PUIODP1 PUIODP2 PUIODP3 PUJHDP1O PULAY6M PULAYDT "
sum $yesnovars
*check that range is (1,2)
lab def yesno 0 "N" 1 "Y"
foreach var in $yesnovars {
recode `var' 2=0 1=1 .=.
lab val `var' yesno
ren `var' `var'r
}
1 comment August 18, 2009
Merging Pajek vertices into Stata
| Gabriel |
Sometimes I use Pajek (or something that behaves similarly like Mathematica or Network Workbench) to generate a variable which I then want to merge back onto Stata. However the problem is that the output requires a little cleaning because it’s not as if the first column is your “id” variable as it exists in Stata and the second column the metric and you can just merge on “id.” Instead they tend to encode your Stata id variable, which means you have to merge twice, first to associate the Stata id variable with the Pajek id variable, second to associate the new data with your main dataset.
So the first step is to create a merge file to associate the encoded key with the Stata id variable. You get this from the Pajek “.net” file (ie, the data file). The first part of this file is the encoding of the nodes, the rest (which you don’t care about for these purposes) is the connections between these nodes. In other words you want to go from this:
*Vertices 3 1 "tom" 2 "dick" 3 "harry" *Edges 1 2 2 3
to this:
pajek_id stata_id 1 Tom 2 Dick 3 Harry
The thing that makes this a pain is that “.net” files are usually really big so if you try to just select the “vertices” part of the file you may be holding down the mouse button for a really long time. My solution is to open the file in a text editor (I prefer TextWrangler for this) and put the cursor at the end of what I want. I then enter the regular expression search pattern “^.+$\r” (or “^.+$\n”) to be replaced with nothing, which has the effect of erasing everything after the cursor. Note that the search should start at the cursor and not wrap so don’t check “start at top” or “wrap around.” You’ll then be left with just the labels, the edge list having been deleted. Another way to do it is to search the whole file and tell it to delete lines that do not include quotes marks.
Having eliminated the edge list and kept only the encoding key, at this point you still need to get the vertice labels into a nice tab-delimited format, which is easily accomplished with this pattern.
“(.+)”$
\t\1
Note the leading space in the search regular expression. Also note that if the labels have embedded spaces there should be quotes around \1 in the replacement regular expression.
Manually name the first column “pajek_id” and the second column “stata_id” (or better yet, whatever you call your id variable in Stata) and save the file as something like “pajekmerge.txt”. Now go to Stata and use “insheet,” “sort,” and “merge” to add the “pajek_id” variable into Stata. You’re now ready to import the foreign data. Use “insheet” to get it into Stata. Some of these programs include an id variable, if so name it “pajek_id.” Others (eg Mathematica) don’t and just rely on ordering. If so, enter the command “gen mathematica_id=[_n]“. You’re now ready to merge the foreign data into Stata.
This is obviously a tricky process and there are a lot of stupid ways it could go wrong. Therefore it is absolutely imperative that you spot-check the results. There are usually some cases where you intuitively know about what the new metric should be. Likewise, you may have another variable native to your Stata dataset that should have a reasonably high (positive or negative) correlation with the new metric imported from Pajek. Check this correlation as when things should be correlated but ain’t it often means a merge error.
Note that it’s not necessarily a problem if some cases in your Stata dataset don’t have corresponding entries in your Pajek output. This is because isolates are often dropped from your Pajek data. However you should know who these isolates are and be able to spot-check that the right people are missing. If you’re doing an inter-locking board studies and you see that an investment bank in your Stata data doesn’t appear in your Pajek data then you probably have a merge error.
2 comments July 22, 2009