Posts Tagged cleaning

Stata2Pajek w vertice colors

| Gabriel |

My Stata program stata2pajek let’s you export an edge list or arc list from Stata to Pajek. However I recently wanted to make color-coded vertices and this required tweaking the syntax a bit. Because this is a relational database problem (and Stata likes flat-file), I do it by merging in a vertice level file on disk. Although I’ve written it to merge on colors it should be fairly easy to rewrite to deal with other vertice-level variables. Unfortunately the syntax is pretty awkward, which is why I’m forking it and posting it to the blog instead of updating the main stata2pajek file at SSC.

Note that the script assumes that nodes which appear in the arc list but not the vertice attributes file should be color-coded yellow. If you want yellow to have a substantive meaning you should change line 61 to be something else.

The syntax is the same as stata2pajek except there are a few more options. If you omit these options it behaves just like stata2pajek classic:

  • attributefile() — the file where the vertice level variables are stored
  • attributekey() — the merge key, defaults to “ego”
  • color() — the variable storing the color-codes

For instance, to color-code a graph of radio stations based on when they first played “My Humps” (yes, really), I use this command:

stata2pajekalt ego alter, filename(ties_bounded_humpcolor) attributefile(humps_color) attributekey(ego) color(color)

It says to treat the data in memory as an arc list, to merge on station-level data on adoption time where the station name is stored as the variable “ego” in the file “humps_color”, and to write out the whole mess as a Pajek formatted text file called “ties_bounded_humpscolor.net”.

*1.2b GHR Dec 10, 2009
*forked from stata2pajek 1.2
capture program drop stata2pajekalt
program define stata2pajekalt
	version 10
	set more off
	syntax varlist(min=2 max=2) [ , tiestrength(string asis) filename(string asis) EDGEs attributefile(string asis) attributekey(string asis) color(string asis)]

	tempfile dataset
	quietly save `dataset'

	gettoken ego alter : varlist

	local starchar="*"
	if "`filename'"=="" {
		local pajeknetfile "mypajekfile"
	}
	else {
		local pajeknetfile "`filename'"
	}
	if "`attributekey'"=="" {
		local attributekey "`ego'"
	}
	capture file close pajeknetfile
	file open pajeknetfile using `pajeknetfile'.net, write text replace
	use `dataset', clear
	drop `ego'
	ren `alter' `ego'
	append using `dataset'
	keep `ego'
	contract `ego'
	ren `ego' verticelabel
	drop _freq
	sort verticelabel
	gen number=[_n]
	order number verticelabel
	if "`attributefile'"!="" {
		ren verticelabel `attributekey'
		merge `attributekey' using `attributefile'
		drop if _merge==2
		ren `attributekey' verticelabel
		drop _merge
		keep number verticelabel `color'
		sort verticelabel
	}
	tempfile verticelabels
	quietly save `verticelabels', replace
	local nvertices=[_N]
	file write pajeknetfile "`starchar'Vertices `nvertices'" _n
	if "`attributefile'"=="" {
		forvalues x=1/`nvertices' {
			local c2=verticelabel in `x'
			file write pajeknetfile `"`x' "`c2'""' _n
		}
	}
	else {
		forvalues x=1/`nvertices' {
			local c2=verticelabel in `x'
			local colvalue=`color' in `x'
			if "`colvalue'"=="" {
				local colvalue "Yellow"
			}
			file write pajeknetfile `"`x' "`c2'" ic `colvalue'"' _n
		}

	}
	use `dataset', clear
	ren `ego' verticelabel
	sort verticelabel
	quietly merge verticelabel using `verticelabels'
	quietly keep if _merge==3
	drop _merge verticelabel
	ren number `ego'
	ren `alter' verticelabel
	sort verticelabel
	quietly merge verticelabel using `verticelabels'
	quietly keep if _merge==3
	drop _merge verticelabel
	ren number `alter'
	order `ego' `alter' `tiestrength'
	keep  `ego' `alter' `tiestrength'
	local narcs=[_N]
	if "`edges'"=="edges" {
		file write pajeknetfile `"`starchar'Edges"' _n
	}
	else {
		file write pajeknetfile `"`starchar'Arcs"' _n
	}
	sort `ego' `alter'
	if "`tiestrength'"~="" {
		forvalues x=1/`narcs' {
		        local c1=`ego' in `x'
		        local c2=`alter' in `x'
		        local c3=`tiestrength' in `x'
		        file write pajeknetfile "`c1' `c2' `c3'" _n
		}
	}
	else {
		forvalues x=1/`narcs' {
		        local c1=`ego' in `x'
		        local c2=`alter' in `x'
		        file write pajeknetfile "`c1' `c2'" _n
		}
	}
	file close pajeknetfile
	*ensure that it's windows (CRLF) text format
	if "$S_OS"~="Windows" {
		filefilter `pajeknetfile'.net tmp, from(\M) to(\W) replace
		shell mv tmp `pajeknetfile'.net
		filefilter `pajeknetfile'.net tmp, from(\U) to(\W) replace
		shell mv tmp `pajeknetfile'.net
	}
	use `dataset', clear
	disp "Your output is saved as"
	disp "`c(pwd)'`c(dirsep)'`pajeknetfile'.net"
end

2 comments December 10, 2009

Drop the entirely missing variables

| Gabriel |

One of the purely technical frustrations with GSS is that it’s hard to figure out if a particular question was in a particular wave. The SDA server is pretty good about kicking these from extracts, but it still leaves some in. The other day I was playing with the 2008 wave and got sick of saying “oooh, that looks interesting” only to find the variable was missing for my wave, which happened repeatedly. (For instance, I thought it would be fun to run Biblical literalism against the Israel feeling thermometer, but no dice, at least for the 2008 wave).

To get rid of these phantom variables, I wrote this little loop that drops variables with entirely missing data (or that are coded as strings, see below):

foreach varname of varlist * {
 quietly sum `varname'
 if `r(N)'==0 {
  drop `varname'
  disp "dropped `varname' for too much missing data"
 }
}

Unfortunately “sum” thinks that string variables have no observations so this will also drop strings. There’s a workaround, but it involves the “ds” command, which works in Stata 11 but has been deprecated and so may not work in future versions.

ds, not(type string)
foreach varname of varlist `r(varlist)' {
 quietly sum `varname'
 if `r(N)'==0 {
  drop `varname'
  disp "dropped `varname' for too much missing data"
 }
}

7 comments December 2, 2009

Perl text library

| Gabriel |

I found this very useful library of perl scripts for text cleaning. You can use them even if you can’t code perl yourself, for instance to transpose a dataset just download “transpose.pl” script to your ~/scripts directory and enter the shell command:
perl ~/scripts/transpose.pl row_col.txt > col_row.txt

The transpose script is particularly useful to me as I’ve never gotten Excel’s transpose function to work and for some bizarre reason Stata’s “xpose” command only works with numeric variables. You can even use these scripts from directly in a do-file like so:

tempfile foo1
tempfile foo2
outsheet using `foo1'.txt
shell perl ~/scripts/transpose.pl `foo1'.txt > `foo2'.txt
insheet using `foo2'.txt, clear

1 comment November 30, 2009

Pajek_labelvector.pl

| Gabriel |

A few months ago I wrote some notes on using a text editor to get output out of Pajek or Network Workbench and into a rows and columns dataset. Now that I’ve learned Perl from the course notes my UC Davis colleagues posted, I wrote up a perl script that will automate this and create a tab-delimited ascii file (or files if you give it multiple .vec files).

I’d like to put the code directly in the post but when I try, wordpress drops some of the characters (eg, backslash-zero-one-five renders as just “15″) so I put the properly-formatted script here .[Update, the new "sourcecode" tag properly escapes all this stuff so I've updated the post to include the script at the bottom of the post. The external link still works but is now unnecessary].

It takes the labels from a “.net” data file and merges them (by sort order) onto a “.vec” output file which let’s you merge it back onto your main (non-network) dataset. Read my older post for an explanation of why this is necessary. Note that if the sort order is different for the .vec and .net files it will get screwy so be sure to spot check the values. The syntax is simply:

perl pajek_labelvector.pl myfile.net netmetric_1.vec netmetric_k.vec

Between this perl script and stata2pajek.ado it should be fairly easy to integrate network data into Stata.

#!/usr/bin/perl
# pajek_labelvector.pl
# Gabriel Rossman, UCLA, 2009-09-22
# this file extracts the vertice labels from a .net file and merges it (by sort order) with one or more .vec files
# take filenames as arguments
# file 1 is .net, files 2-k are .vec
# writes out foo.txt as tab delimited text
# note, this is dependent on an unchanged sort order

use strict; use warnings;
die "usage: pajek_labelvector.pl   ... \n" unless @ARGV > 1;

my $netfile = shift (@ARGV);
my @labels=();
#read the vertice labels from .net file
open(NETIN, "<$netfile") or die "error reading $netfile for reading";
while (<NETIN>) {
	if ($_ =~ m/"/) { #only use the vertice label lines, which include quote chars
		$_ =~ /^[0-9]+ "(.*)"/; #search for quoted text
		push @labels, $1; #return match, push to array
	}
}
close NETIN;
#read netfile
foreach my $vecfile (@ARGV) {
	open(VECIN, "<$vecfile") or die "error reading $vecfile"; 	open(VECOUT, ">$vecfile.txt") or die "error creating $vecfile.txt";
	my @vec=();
	while (<VECIN>) {
		$_ =~ s/\015?\012//; #manual chomp to allow windows or unix text
		if ($_ !~ m/^\*/) {
			push @vec, $_;
		}
	}
	close VECIN;
	my $veclength = @vec - 1;
	my $lablength = @labels -1;
	die "error, $vecfile is different length than $netfile" unless $veclength==$lablength;
	for my $i (0..$veclength) {
		print VECOUT "$labels[$i]\t$vec[$i]\n";
	}
	close VECOUT;
	@vec=();
}

print "WARNING: this script assumes that the .vec and .net have the same sort order\nplease spot check the values to avoid error\n";

Add comment September 29, 2009

Stata2Pajek (update)

I fixed two bugs in Stata2Pajek and improved the documentation. To get it, type (from within Stata)

ssc install stata2pajek

If you already have the old version, the above command will update it but even better is to update all of your ado files with:

adoupdate

Add comment September 18, 2009

Bash/Perl tutorial

| Gabriel |

I’m a big fan of the idea of using Unix tools like Perl to script the cleaning of the massive text-based datasets that social scientists (especially sociologists of culture) often use. Unfortunately there’s something of a learning curve to this so even though I like the idea in principle and increasingly in practice, I still sometimes clean data interactively with TextWrangler and just try to keep good notes.

Fortunately two of my UC system colleagues have posted the course materials for a “Unix and Perl Primer for Biologists.” I’m about halfway through the materials and it’s great, in part because (unlike the llama) they assume no prior familiarity with programming or Unix. Although the examples involve genetics, it’s well-suited for social scientists as, like us, biologists are not computer scientists but are reasonably technically competent and they often deal with large text based data sets. Basically, if you can write a Stata do-file, you should be able to follow their course guide and if you use things like scalars and loops it should be pretty easy.

I highly recommend the course to any social scientist who deals with large dirty datasets, in other words, basically anyone who is a quant but doesn’t just download clean ICPSR or Census data. This is especially relevant for anyone who wants to scrape data off the web, use IMDB, do large-scale content analysis, etc.

Some notes:

  • They assume you will a) be running the materials off a stick and b) using Mac OS X. If you’re keeping the material on the hard drive, get used to typing “chmod u+x foo.pl” to make the perl script “foo” executable. (This step is unnecessary for files on a stick because unlike HFS+ or EXT3, the FAT filesystem doesn’t do permissions). If you’re using a different version of Unix, most of it should work similarly with only a few minor differences, such as that you’ll want to use Kate instead of Smultron and on a Mac a USB stick is in /Volumes/ whereas in Linux it’s in /Media/ and in BSD it’s in /mnt/. If you’re using Windows you’ll either need to a) install CygWin b) install a virtual machine c) run off a live cd or bootable stick or d) dual boot with Wubi.
  • If you’re really used to Stata, some of the nomenclature may seem backwards, mostly because Perl doesn’t keep a dataset in memory but processes it on disk, one command at a time. So, in Perl and Bash a “variable” is the equivalent to what Stata calls a (global or local) “macro”. The closest Perl equivalent to what Stata calls a “variable” would be a “field” in a tab-delimited text file.

[Update: Although they suggest Smultron, I find TextMate works even better as it can execute scripts entirely within the editor, so you don't have to constantly cmd-tab to Terminal.app and back.]

    Add comment September 15, 2009

    Stata2Pajek

    | Gabriel |

    [update #4 (9/16), I also fixed a problem with tie strength, improved the help file, and submitted it to SSC. the SSC version should be updated in a few days but in the meantime the updated version is in this post]

    [update #3, thanks to "T" in the comments, i fixed a bug in the code. the updated code is in this post but not yet uploaded to SSC]

    [update #2, the ado file is now hosted at SSC. type "ssc install stata2pajek" to install it]

    [update #1, I cleaned the code a bit]

    I wait impatiently wait for somebody more talented than I am to write a native Stata social network package (which should be feasible with Mata). In the meantime, to do anything much more ambitious than estimate density, us Stata folks have to export the data to other packages, basically all of which can read the “.net” format. Currently the best way to do this is to use the command “outsheet” and then clean it with the Windows program “Txt2Pajek,” but I decided to do natively in Stata. My way is faster, scriptable, and works better cross-platform. (Running Txt2Pajek on Unix-like computers requires Crossover/Wine, although it does run smoothly).

    The program takes the arguments dataset, ego, alter, and an optional “tie strength” argument.

    *1.2 GHR Sept 16, 2009
    capture program drop stata2pajek
    program define stata2pajek
    	version 10
    	set more off
    	syntax varlist(min=2 max=2) [ , tiestrength(string asis) filename(string asis) EDGEs]
    
    	tempfile dataset
    	quietly save `dataset'
    
    	gettoken ego alter : varlist
    
    	local starchar="*"
    	if "`filename'"=="" {
    		local pajeknetfile "mypajekfile"
    	}
    	else {
    		local pajeknetfile "`filename'"
    	}
    	capture file close pajeknetfile
    	file open pajeknetfile using `pajeknetfile'.net, write text replace
    	use `dataset', clear
    	drop `ego'
    	ren `alter' `ego'
    	append using `dataset'
    	keep `ego'
    	contract `ego'
    	ren `ego' verticelabel
    	drop _freq
    	sort verticelabel
    	gen number=[_n]
    	order number verticelabel
    	tempfile verticelabels
    	quietly save `verticelabels', replace
    	local nvertices=[_N]
    	file write pajeknetfile "`starchar'Vertices `nvertices'" _n
    	forvalues x=1/`nvertices' {
    		local c2=verticelabel in `x'
    		file write pajeknetfile `"`x' "`c2'""' _n
    	}
    	use `dataset', clear
    	ren `ego' verticelabel
    	sort verticelabel
    	quietly merge verticelabel using `verticelabels'
    	quietly keep if _merge==3
    	drop _merge verticelabel
    	ren number `ego'
    	ren `alter' verticelabel
    	sort verticelabel
    	quietly merge verticelabel using `verticelabels'
    	quietly keep if _merge==3
    	drop _merge verticelabel
    	ren number `alter'
    	order `ego' `alter' `tiestrength'
    	keep  `ego' `alter' `tiestrength'
    	local narcs=[_N]
    	if "`edges'"=="edges" {
    		file write pajeknetfile `"`starchar'Edges"' _n
    	}
    	else {
    		file write pajeknetfile `"`starchar'Arcs"' _n
    	}
    	sort `ego' `alter'
    	if "`tiestrength'"~="" {
    		forvalues x=1/`narcs' {
    			local c1=`ego' in `x'
    			local c2=`alter' in `x'
    			local c3=`tiestrength' in `x'
    			file write pajeknetfile "`c1' `c2' `c3'" _n
    		}
    	}
    	else {
    		forvalues x=1/`narcs' {
    			local c1=`ego' in `x'
    			local c2=`alter' in `x'
    			file write pajeknetfile "`c1' `c2'" _n
    		}
    	}
    	file close pajeknetfile
    	*ensure that it's windows (CRLF) text format
    	if "$S_OS"~="Windows" {
    		filefilter `pajeknetfile'.net tmp, from(\M) to(\W) replace
    		shell mv tmp `pajeknetfile'.net
    		filefilter `pajeknetfile'.net tmp, from(\U) to(\W) replace
    		shell mv tmp `pajeknetfile'.net
    	}
    	use `dataset', clear
    	disp "Your output is saved as"
    	disp "`c(pwd)'`c(dirsep)'`pajeknetfile'.net"
    end

    I also wrote a help file:

    {smcl}
    {* 16sep2009}{...}
    {hline}
    help for {hi:stata2pajek}
    {hline}
    
    {title:Export data to Pajek .net format} 
    
    {p 8 17 2}
    {cmd:stata2pajek} {it:ego alter}[, {cmdab:edges tiestrength() filename()}]
    
    {title:Description} 
    
    {p 4 4 2}
    {cmd:stata2pajek} exports data to the ".net" format read by Pajek, Network
    Workbench, and many other social network analysis packages.
    
    {title:Remarks} 
    
    {p 4 4 2}
    The program assumes that you already have used Stata to create an edge list
    or arc list. You specify which (string or numeric) variable identifies ego and
    which alter. Specifying a tie strength variable is optional. {cmd:stata2pajek}
    converts this to .net format, which is a Windows-formatted text file beginning
    with a list of vertices (aka nodes) and their labels, followed by a list of ties
    (arcs or edges).
    
    {p 4 4 2}
    As an alternative to this program, you may wish to use {cmd:outsheet} then
    process the saved output with the Windows program txt2pajek.
    
    {p 4 4 2}
    Note that {cmd:stata2pajek} treats the Stata versions of your id variables as
    labels even if they are numeric. Thus if you have a node called #15 in Stata
    it will not necessarily also be called #15 in Pajek. Please see the vertice
    section of the .net file (which is human-readable text) to see the
    correspondence.
    
    {title:Options} 
    
    {p 4 8 2}
    {cmd:edges} specifies that ties should be treated as edges (symmetrical ties).
    The default is to treat ties as arcs (directed ties).
    
    {p 4 8 2}
    {cmd:filename()} allows you to name the output file. By default it is named
    mypajekfile.net
    
    {title:Examples}
    
    {p 4 8 2}{cmd:. *create a random network with 10 nodes and export it as pajek}{p_end}
    {p 4 8 2}{cmd:. clear}{p_end}
    {p 4 8 2}{cmd:. set obs 200}{p_end}
    {p 4 8 2}{cmd:. gen i=int(uniform()*10)+1}{p_end}
    {p 4 8 2}{cmd:. gen j=int(uniform()*10)+1}{p_end}
    {p 4 8 2}{cmd:. contract i j, freq(strength)}{p_end}
    {p 4 8 2}{cmd:. drop if i==j}{p_end}
    {p 4 8 2}{cmd:. sort i j}{p_end}
    {p 4 8 2}{cmd:. stata2pajek i j, tiestrength(strength) filename(samplerandomnetwork)}{p_end}
    
    {title:Author}
    
    {p 4 4 2}Gabriel Rossman, UCLA{break}
    rossman@soc.ucla.edu
    
    {title:Also see}
    
    {p 4 13 2}On-line:
    help for {help outsheet}

    A few notes:

    1. My code assumes you want arcs but it would be pretty easy to modify so it gives edges, or even lets you mix and match. [update -- edges are now an option]
    2. Escaped quotes choke the syntax highlighting of TextMate and TextWrangler. I wouldn’t be surprised if there were similar problems with UltraEdit and/or TextPad. However Smultron works great, which I plan to use in the future for any file that uses escaped quotes (though generally I like the TextMate for the code-folding).
    3. Stata ignores a literal “*” in the “file” command. My workaround is to define a local containing the asterix, then use that local.
    4. I can’t figure out how to dump the entire dataset into “file.” My extremely ugly workaround is to write it to file one line at a time using locals to ferry the values between the dataset in memory and the text file on disk. This works OK for me but might be slow for large datasets.

    Now I just need to figure out how to invoke Pajek from the command line and I’ll be able to completely script everything rather than having do-file lines that say things like “Note, after this runs remember to play around with the Pajek GUI for ten minutes.” I know this is technically possible because many Pajek commands are scriptable in Windows, but I’m using a Mac and would have to figure out how to pipe it from Stata to the shell to Crossover/Wine to Pajek, which I’ve never done. Another alternative is to use Network Workbench.

    Also see this post on getting Pajek output back into Stata.

    4 comments September 1, 2009

    If it ain’t broke

    | Gabriel |

    I was running some code today and noticed the return message

    (note: you are using old merge syntax; see [R] merge for new syntax)

    I looked this up and saw that the syntax had changed from this in Stata 10 and earlier:

    use macro, clear
    merge macro_id using micro

    to this in Stata 11:

    use macro, clear
    merge 1:m macro_id using micro

    My first thought was, aha, they must have added the “1:1″ and “1:m” options as a way to make a more general syntax that will combine the functionality of merge, joinby, and cross into one command. However on RTFM I see that:

    if you think you need to perform an m:m merge, then we suspect you are wrong. If you would like to match every observation in the master to every observation in the using with the same values of the key variable(s), then you should be using joinby

    So it turns out that the what “m:m” does is not to replicate joinby, but to allow a sort order merge within matches, a prospect that the Stata people warn is probably a bad idea (and I agree).

    So the new version of merge doesn’t add any real functionality and it seems to be changed only as a reliability check, so that you don’t accidentally do a one-to-many when you think you’re doing many-to-one. Since that worry has never kept me up at night in the first place, I’m sticking with the old syntax (which has not been deprecated and still works, even without the “version” command).

    (That aside I’ve been very happy with Stata 11 so far. It’s running really fast and most of the changes are clear improvements).

    Add comment August 20, 2009

    SPPA 2008

    | Gabriel |

    The 2008 wave for the Survey of Public Participation in the Arts is now available at CPANDA. In the WSJ, Terry Teachout noticed one basic thing in the data, which is that nobody born since the Ford administration likes jazz. I’ve been waiting for this dataset for awhile because several years ago Pete Peterson and I noticed some weird differences between 92 to 02 (particularly as relates to the omnivore hypothesis) and we need a third data point to help us figure it out. Another cool thing about the dataset is that they now ask questions about literature by genre, which as seen in the literature based on SPPA music questions, is a good way to get at cultural capital type issues.

    Anyway, one of the minor annoyances about SPPA is that it uses a convention of “1=Yes 2=No” whereas any native Stata speaker knows that this is an abomination and contrary to the divine rule that in all binary variables, 0 shall equal “no” and 1 shall equal “yes.” (For one thing, this makes it easier to sum the dummies into a count). As such I’ve written this code to fix these perverse variables. Just add it to the end of the do-file that CPANDA generates for you when you download the file.

    *change all the yes/no vars to Stata convention where 0 is no and 1 is yes
    *all variables that are similar to yes/no but slightly different (eg, PEDWWNTO) are left alone
    *to avoid confusion by plugging into scripts that assume SPSS yes/no, rename these variables with suffix "r"
    global yesnovars "PEX4A PEX4B PEX5 PEQ1A PEQ2A PEQ3A PEQ4A PEQ5A PEQ6A PEQ7A PEQ8AA PEQ9A PEQ10A PEQ10B PEQ11A PEQ12A PEQ13AA1 PEQ13AA2 PEQ13AA3 PEA1A PEA1B PEA2 PEA31 PEA32 PEA33 PEA34 PEA35 PEA36 PEA37 PEA38 PEA39 PEA310 PEA311 PEA41 PEA42 PEA43 PEA44 PEA45 PEA46 PEA47 PEA48 PEA49 PEA410 PEA411 PEA412 PEA413 PEA414 PEB1A PEB2A PEB3A PEB4A PEB5A PEB6 PEB7 PEB8 PEB9 PEB10 PEB11 PEB12 PEB13 PEB14 PEC2A PEC3A PEC4A PEC5A PEC6A PEC7A PEC8A PEC9A PEC10A PEC11A PEC12A PEC13A PEC14A PEC15A PEC15B PEC16A PEC16B PEC16C PEC17A PEC18A PEC19A PEC20A PEC21A PEC25A PEC26A PEC27A PED1A PED1C PED1D PED2A PED2C PED2D PED3A PED3C PED3D PED4A PED4C PED4D PED5A PED5C PED5D PED6A PED6C PED6D PED7A PED7C PED7D HETELAVL HETELHHD HUBUS PEABSPDO PEAFEVER PEAFNOW PEDW4WK PEDWAVL PEDWLKO PEDWLKWK PEDWWK PEERNCOV PEERNLAB PEERNRT PEERNUOT PEHRAVL PEJHWKO PELAYAVL PELAYFTO PELAYLK PELKAVL PEMJOT PENLFRET PESCHENR PUBUS1 PUBUS2OT PUDIS1 PUDIS2 PUHROFF1 PUHROT1 PUIODP1 PUIODP2 PUIODP3 PUJHDP1O PULAY6M PULAYDT "
    sum $yesnovars
    *check that range is (1,2)
    
    lab def yesno 0 "N" 1 "Y"
    
    foreach var in $yesnovars {
    	recode `var' 2=0 1=1 .=.
    	lab val `var' yesno
    	ren `var' `var'r
    }

    1 comment August 18, 2009

    Merging Pajek vertices into Stata

    | Gabriel |

    Sometimes I use Pajek (or something that behaves similarly like Mathematica or Network Workbench) to generate a variable which I then want to merge back onto Stata. However the problem is that the output requires a little cleaning because it’s not as if the first column is your “id” variable as it exists in Stata and the second column the metric and you can just merge on “id.” Instead they tend to encode your Stata id variable, which means you have to merge twice, first to associate the Stata id variable with the Pajek id variable, second to associate the new data with your main dataset.

    So the first step is to create a merge file to associate the encoded key with the Stata id variable. You get this from the Pajek “.net” file (ie, the data file). The first part of this file is the encoding of the nodes, the rest (which you don’t care about for these purposes) is the connections between these nodes. In other words you want to go from this:

    *Vertices 3
    1 "tom"
    2 "dick"
    3 "harry"
    *Edges
    1 2
    2 3

    to this:

    pajek_id	stata_id
    1	Tom
    2	Dick
    3	Harry

    The thing that makes this a pain is that “.net” files are usually really big so if you try to just select the “vertices” part of the file you may be holding down the mouse button for a really long time. My solution is to open the file in a text editor (I prefer TextWrangler for this) and put the cursor at the end of what I want. I then enter the regular expression search pattern “^.+$\r” (or “^.+$\n”) to be replaced with nothing, which has the effect of erasing everything after the cursor. Note that the search should start at the cursor and not wrap so don’t check “start at top” or “wrap around.” You’ll then be left with just the labels, the edge list having been deleted. Another way to do it is to search the whole file and tell it to delete lines that do not include quotes marks.

    Having eliminated the edge list and kept only the encoding key, at this point you still need to get the vertice labels into a nice tab-delimited format, which is easily accomplished with this pattern.

    “(.+)”$
    \t\1

    Note the leading space in the search regular expression. Also note that if the labels have embedded spaces there should be quotes around \1 in the replacement regular expression.

    Manually name the first column “pajek_id” and the second column “stata_id” (or better yet, whatever you call your id variable in Stata) and save the file as something like “pajekmerge.txt”. Now go to Stata and use “insheet,” “sort,” and “merge” to add the “pajek_id” variable into Stata. You’re now ready to import the foreign data. Use “insheet” to get it into Stata. Some of these programs include an id variable, if so name it “pajek_id.” Others (eg Mathematica) don’t and just rely on ordering. If so, enter the command “gen mathematica_id=[_n]“. You’re now ready to merge the foreign data into Stata.

    This is obviously a tricky process and there are a lot of stupid ways it could go wrong. Therefore it is absolutely imperative that you spot-check the results. There are usually some cases where you intuitively know about what the new metric should be. Likewise, you may have another variable native to your Stata dataset that should have a reasonably high (positive or negative) correlation with the new metric imported from Pajek. Check this correlation as when things should be correlated but ain’t it often means a merge error.

    Note that it’s not necessarily a problem if some cases in your Stata dataset don’t have corresponding entries in your Pajek output. This is because isolates are often dropped from your Pajek data. However you should know who these isolates are and be able to spot-check that the right people are missing. If you’re doing an inter-locking board studies and you see that an investment bank in your Stata data doesn’t appear in your Pajek data then you probably have a merge error.

    2 comments July 22, 2009

    Previous Posts


    The Culture Geeks

    Tags

    bayesian cleaning culture diffusion economics economic sociology ethnomethodology financial crisis graphs history IMDB loops lyx macros networks phenomenology philosophy of science R random variables regular expressions resampling shell sociology of organizations sociology of science st Stata superstar text editor typesetting

    Archives

    Recent Posts

    Recent Comments

    Blogroll