Merging Pajek vertices into Stata
July 22, 2009 at 5:02 am GR 2 comments
| Gabriel |
Sometimes I use Pajek (or something that behaves similarly like Mathematica or Network Workbench) to generate a variable which I then want to merge back onto Stata. However the problem is that the output requires a little cleaning because it’s not as if the first column is your “id” variable as it exists in Stata and the second column the metric and you can just merge on “id.” Instead they tend to encode your Stata id variable, which means you have to merge twice, first to associate the Stata id variable with the Pajek id variable, second to associate the new data with your main dataset.
So the first step is to create a merge file to associate the encoded key with the Stata id variable. You get this from the Pajek “.net” file (ie, the data file). The first part of this file is the encoding of the nodes, the rest (which you don’t care about for these purposes) is the connections between these nodes. In other words you want to go from this:
*Vertices 3 1 "tom" 2 "dick" 3 "harry" *Edges 1 2 2 3
to this:
pajek_id stata_id 1 Tom 2 Dick 3 Harry
The thing that makes this a pain is that “.net” files are usually really big so if you try to just select the “vertices” part of the file you may be holding down the mouse button for a really long time. My solution is to open the file in a text editor (I prefer TextWrangler for this) and put the cursor at the end of what I want. I then enter the regular expression search pattern “^.+$\r” (or “^.+$\n”) to be replaced with nothing, which has the effect of erasing everything after the cursor. Note that the search should start at the cursor and not wrap so don’t check “start at top” or “wrap around.” You’ll then be left with just the labels, the edge list having been deleted. Another way to do it is to search the whole file and tell it to delete lines that do not include quotes marks.
Having eliminated the edge list and kept only the encoding key, at this point you still need to get the vertice labels into a nice tab-delimited format, which is easily accomplished with this pattern.
“(.+)”$
\t\1
Note the leading space in the search regular expression. Also note that if the labels have embedded spaces there should be quotes around \1 in the replacement regular expression.
Manually name the first column “pajek_id” and the second column “stata_id” (or better yet, whatever you call your id variable in Stata) and save the file as something like “pajekmerge.txt”. Now go to Stata and use “insheet,” “sort,” and “merge” to add the “pajek_id” variable into Stata. You’re now ready to import the foreign data. Use “insheet” to get it into Stata. Some of these programs include an id variable, if so name it “pajek_id.” Others (eg Mathematica) don’t and just rely on ordering. If so, enter the command “gen mathematica_id=[_n]”. You’re now ready to merge the foreign data into Stata.
This is obviously a tricky process and there are a lot of stupid ways it could go wrong. Therefore it is absolutely imperative that you spot-check the results. There are usually some cases where you intuitively know about what the new metric should be. Likewise, you may have another variable native to your Stata dataset that should have a reasonably high (positive or negative) correlation with the new metric imported from Pajek. Check this correlation as when things should be correlated but ain’t it often means a merge error.
Note that it’s not necessarily a problem if some cases in your Stata dataset don’t have corresponding entries in your Pajek output. This is because isolates are often dropped from your Pajek data. However you should know who these isolates are and be able to spot-check that the right people are missing. If you’re doing an inter-locking board studies and you see that an investment bank in your Stata data doesn’t appear in your Pajek data then you probably have a merge error.
Entry filed under: Uncategorized. Tags: cleaning, networks, regular expressions, Stata, text editor.
1. Stata2Pajek « Code and Culture | September 1, 2009 at 4:59 am
[…] Also see this post on getting Pajek output back into Stata. […]
2. Pajek_labelvector.pl « Code and Culture | September 29, 2009 at 5:37 am
[…] few months ago I wrote some notes on using a text editor to get output out of Pajek or Network Workbench and into a rows and columns […]