Posts tagged ‘text editor’

Merging Pajek vertices into Stata

| Gabriel |

Sometimes I use Pajek (or something that behaves similarly, like Mathematica or Network Workbench) to generate a variable which I then want to merge back into Stata. The problem is that the output requires a little cleaning. It’s not as if the first column is your “id” variable as it exists in Stata, the second column is the metric, and you can just merge on “id.” Instead these programs tend to encode your Stata id variable, which means you have to merge twice: first to associate the Stata id variable with the Pajek id variable, second to associate the new data with your main dataset.

So the first step is to create a merge file to associate the encoded key with the Stata id variable. You get this from the Pajek “.net” file (ie, the data file). The first part of this file is the encoding of the nodes, the rest (which you don’t care about for these purposes) is the connections between these nodes. In other words you want to go from this:

*Vertices 3
1 "tom"
2 "dick"
3 "harry"
1 2
2 3

to this:

pajek_id	stata_id
1	tom
2	dick
3	harry

The thing that makes this a pain is that “.net” files are usually really big, so if you try to just select the “vertices” part of the file you may be holding down the mouse button for a really long time. My solution is to open the file in a text editor (I prefer TextWrangler for this) and put the cursor at the end of what I want to keep. I then enter the regular expression search pattern “^.+$\r” (or “^.+$\n”) to be replaced with nothing, which has the effect of erasing everything after the cursor. Note that the search should start at the cursor and not wrap, so don’t check “start at top” or “wrap around.” You’ll then be left with just the labels, the edge list having been deleted. Another way to do it is to search the whole file and tell it to delete lines that do not include quotation marks.
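The quotation-mark approach can also be scripted instead of done in the editor. Here is a minimal Python sketch, assuming every vertex label is quoted and no other line in the file contains a quote mark. The function name `vertex_lines` is made up for illustration.

```python
def vertex_lines(net_text):
    """Return the vertex-label lines of a Pajek .net file as a list of strings."""
    # Vertex lines carry a quoted label; the "*Vertices" header and the
    # edge list do not, so a simple quote test separates them.
    return [line for line in net_text.splitlines() if '"' in line]


# The toy file from the example above.
sample = '*Vertices 3\n1 "tom"\n2 "dick"\n3 "harry"\n1 2\n2 3\n'
for line in vertex_lines(sample):
    print(line)
```

For a really big file you would read it from disk rather than paste it in as a string, but the filtering logic stays the same.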

Having eliminated the edge list and kept only the encoding key, you still need to get the vertex labels into a nice tab-delimited format, which is easily accomplished with this pattern.


Note the leading space in the search regular expression. Also note that if the labels have embedded spaces there should be quotes around \1 in the replacement regular expression.
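For anyone who would rather script this step, here is a rough Python equivalent. It is a sketch under assumptions, not the editor pattern itself: the regular expression, the function name `to_tab_delimited`, and the default column name are all my own choices for illustration. It tolerates leading spaces and keeps embedded spaces inside a label.

```python
import re

# Optional leading spaces, the numeric Pajek id, then the quoted label.
VERTEX = re.compile(r'^\s*(\d+)\s+"([^"]*)"')


def to_tab_delimited(lines, id_name="stata_id"):
    """Turn lines like ' 1 "tom"' into tab-delimited rows under a header row."""
    rows = ["pajek_id\t" + id_name]
    for line in lines:
        match = VERTEX.match(line)
        if match:  # silently skip anything that is not a vertex line
            rows.append(match.group(1) + "\t" + match.group(2))
    return "\n".join(rows) + "\n"


print(to_tab_delimited([' 1 "tom"', ' 2 "dick"', ' 3 "harry"']), end="")
```

Because the header row is written by the script, there is no need to type the column names in by hand afterwards.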

Manually name the first column “pajek_id” and the second column “stata_id” (or better yet, whatever you call your id variable in Stata) and save the file as something like “pajekmerge.txt”. Now go to Stata and use “insheet,” “sort,” and “merge” to add the “pajek_id” variable into Stata. You’re now ready to import the foreign data. Use “insheet” to get it into Stata. Some of these programs include an id variable; if so, name it “pajek_id.” Others (eg Mathematica) don’t and just rely on ordering. In that case, enter the command “gen mathematica_id=[_n]”. You’re now ready to merge the foreign data into Stata.
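The logic of the double merge may be easier to see outside Stata. The following Python sketch only illustrates the idea and is not a substitute for the “insheet” and “merge” commands. The function name `merge_metric` and the metric values are hypothetical.

```python
def merge_metric(stata_ids, key_rows, metric_rows):
    """Attach a Pajek metric to Stata ids by way of the pajek_id key.

    key_rows:    (pajek_id, stata_id) pairs from the merge file
    metric_rows: (pajek_id, metric) pairs from the Pajek output
    Returns {stata_id: metric}; None marks a case with no match.
    """
    pajek_of = {stata_id: pajek_id for pajek_id, stata_id in key_rows}  # first merge
    metric_of = dict(metric_rows)                                       # second merge
    return {sid: metric_of.get(pajek_of.get(sid)) for sid in stata_ids}


key_rows = [(1, "tom"), (2, "dick"), (3, "harry")]

# Output that has no id column and relies on ordering: number the rows
# from 1, which plays the same role as generating an id from _n in Stata.
values = [0.5, 1.0, 0.5]  # made-up metric values
metric_rows = list(enumerate(values, start=1))

print(merge_metric(["harry", "tom", "sally"], key_rows, metric_rows))
```

Here “sally” is in the main dataset but not in the network, so she comes back with no metric.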

This is obviously a tricky process and there are a lot of stupid ways it could go wrong. Therefore it is absolutely imperative that you spot-check the results. There are usually some cases where you intuitively know roughly what the new metric should be. Likewise, you may have another variable native to your Stata dataset that should have a reasonably high (positive or negative) correlation with the new metric imported from Pajek. Check this correlation, because when things should be correlated but ain’t, it often means a merge error.
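The correlation check is a one-liner in Stata, but to make the idea concrete here is a small Python sketch of a Pearson correlation between a native variable and the imported metric. The function name `pearson` and the numbers are invented for illustration, and the function assumes neither variable is constant.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)  # fails if either variable is constant


native = [3, 1, 4, 2]            # made-up variable already in the dataset
imported = [0.9, 0.2, 1.0, 0.5]  # made-up metric merged in from Pajek
print(round(pearson(native, imported), 2))
```

A value near zero where you expected a strong relationship is the cue to go back and inspect the merge.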

Note that it’s not necessarily a problem if some cases in your Stata dataset don’t have corresponding entries in your Pajek output. This is because isolates are often dropped from your Pajek data. However you should know who these isolates are and be able to spot-check that the right people are missing. If you’re doing an interlocking board study and you see that an investment bank in your Stata data doesn’t appear in your Pajek data then you probably have a merge error.
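One way to make that spot-check routine is to list exactly which ids failed to match and eyeball the list. A minimal Python sketch, where `unmatched_ids` is a hypothetical helper and the ids are the toy ones from the example above:

```python
def unmatched_ids(stata_ids, key_rows):
    """Return, sorted, the Stata ids that have no entry in the Pajek key."""
    in_pajek = {stata_id for _pajek_id, stata_id in key_rows}
    return sorted(sid for sid in set(stata_ids) if sid not in in_pajek)


key_rows = [(1, "tom"), (2, "dick"), (3, "harry")]
print(unmatched_ids(["tom", "dick", "harry", "sally"], key_rows))
```

If the names on this list are plausible isolates, fine. If a name you know to be well connected shows up, suspect the merge.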

July 22, 2009 at 5:02 am 2 comments

Collaborative code

| Gabriel |

A friend recently told me about a collaborative text editor. I’m perfectly happy having an entirely local text editor because I tend to do most of my coding by myself (I’m a loner, a rebel). Although I co-author a lot, there tends to be a sufficiently clear division of labor that I can just send data files and output to my co-authors and vice versa. For instance on one project I did all the cleaning and my co-author did the analysis. Nonetheless, not everyone works like this so I figured I’d pass it along.

The program my friend told me about is Etherpad. This is a totally cloud solution and is very quick to set up. Unfortunately it’s really bare bones, for instance it highlights by author (which is good) but doesn’t highlight syntax for anything but Java.

There are also local clients with remote sync. A popular solution for collaborative coding on the Mac is SubEthaEdit. On the plus side there is Stata syntax highlighting. On the downside both authors need to have Macs and buy the software (30 euros).

A cross-platform, free, and open-source solution is Gobby. Although there is no Stata syntax file it uses a well documented highlighting standard so it should be feasible to write one. In principle Gobby works on the Mac but there’s no binary so good luck getting it to compile. If you’re a Mac person who can’t get Fink to work my suggestion is to use the Linux or Windows version through virtualization.

May 19, 2009 at 5:30 am

Choosing a (Mac) text editor with Stata

| Gabriel |

I love TextWrangler (free) but I was a little frustrated that it doesn’t allow code folding. (I should note that it does support everything else on my text editor wish list, most notably regular expressions, syntax highlighting, and pushing). I’d seen code folding in action with HTML editors like Bluefish and this struck me as a great feature. If you’re not familiar with it, code folding is when you hide some block of code, usually a subroutine or loop. TextWrangler’s big brother, BBEdit ($49 educational, $125 commercial) does offer code folding but it’s only really useful if the syntax files are written to make it work because the program has to be able to recognize what a loop looks like. Unfortunately BBEdit doesn’t come with Stata syntax files and the excellent TextWrangler Stata language file written by dataninja doesn’t support BBEdit-only features like code folding. I looked to see if I could figure out how to write code folding syntax into dataninja’s language module but I couldn’t find any documentation about code folding in the developer kit and in any case I’m not that talented. Aaargh!

In desperation I’ve considered using another text editor, even though I really like TextWrangler. Apparently Kate (free) works pretty well with Stata but there’s no Mac version available. (In theory I could recompile it using Fink but that never works for me). Likewise Notepad++ (free) has an excellent Stata syntax file and I highly recommend it to people who use Windows, but there’s no Mac version so I’d have to run it through Crossover/Wine and again that’s a hassle (the key bindings are different and you lose access to the native file browser and AppleScript integration). UltraEdit ($49) also has good Stata support and apparently it will be ported to Mac/Linux, but it’s not going to be out for a few months.

Editra (free) is a very well-featured and cross-platform editor, but there’s no Stata syntax file yet, nor can I figure out how to write one. One minor limitation I’ve noticed is that Editra can’t handle extremely long rows, but I only ever used extremely long rows for file list globals and there’s a better way to do that. A nice feature is that the language support is in the app package (as compared to “~/Library/Application Support”) which makes it easier to run off a key. Likewise Smultron keeps the syntax in the app package and is well-suited to run off a key. It has excellent Stata highlighting but no code folding. Smultron is the only editor I’ve seen that comes with the Stata syntax file included so it might be a good choice for beginners who don’t want to fiddle with the language preferences, libraries, and that sort of thing to install a user-written file.

Currently the best option is looking like TextMate ($45 educational, $53 commercial). Timothy Beatty at York University has put together a bundle that integrates it beautifully with Stata (note that the bundle assumes you have MP and requires some light editing for some of the features to work with other versions). Something I didn’t expect to like as much as I did is that every open file has its own window (rather than a tab drawer like TW) and this makes it much easier to compare two similar files, though it would get unwieldy with dozens of open files. On the other hand I still prefer certain features of TextWrangler. For instance, it’s much easier to execute a multi-file find/replace in TextWrangler than it is in TextMate (which requires you to first set up a “project” then apply the batch to the project). Both the tab thing and the batch thing have something in common, which is that TextWrangler is better suited for cleaning multiple data files, something I do a lot of. However for coding it’s looking like TextMate. I’ve been using it for about a week while working with a very complex file and so far I’ve been very happy with its code folding, (limited) syntax completion, (excellent) syntax highlighting, etc. Some of the other editors I’ve mentioned could be this good in principle (and already are for some languages), but they would need the as-of-yet unwritten syntax files to do so for Stata.

(btw, here are the definitive thoughts on using text editors with Stata for various platforms).

April 15, 2009 at 6:30 am 2 comments

