Archive for September, 2009

| Gabriel |

A few months ago I wrote some notes on using a text editor to get output out of Pajek or Network Workbench and into a rows-and-columns dataset. Now that I’ve learned Perl from the course notes my UC Davis colleagues posted, I wrote a Perl script that will automate this and create a tab-delimited ASCII file (or files, if you give it multiple .vec files).

I’d like to put the code directly in the post but when I try, wordpress drops some of the characters (e.g., backslash-zero-one-five renders as just “15”) so I put the properly-formatted script here. [Update: the new “sourcecode” tag properly escapes all this stuff, so I’ve updated the post to include the script at the bottom. The external link still works but is now unnecessary.]

It takes the labels from a “.net” data file and merges them (by sort order) onto a “.vec” output file, which lets you merge the result back onto your main (non-network) dataset. Read my older post for an explanation of why this is necessary. Note that if the sort order differs between the .vec and .net files the merge will be scrambled, so be sure to spot-check the values. The syntax is simply the script name, then the .net file, then one or more .vec files:

perl scriptname.pl foo.net netmetric_1.vec netmetric_k.vec

Between this perl script and stata2pajek.ado it should be fairly easy to integrate network data into Stata.

# Gabriel Rossman, UCLA, 2009-09-22
# this file extracts the vertex labels from a .net file and merges them (by sort order) with one or more .vec files
# take filenames as arguments
# file 1 is .net, files 2-k are .vec
# writes out foo.txt as tab delimited text
# note, this is dependent on an unchanged sort order

use strict; use warnings;
die "usage:   ... \n" unless @ARGV > 1;

my $netfile = shift (@ARGV);
my @labels=();
#read the vertex labels from the .net file
open(NETIN, "<$netfile") or die "error opening $netfile for reading";
while (<NETIN>) {
	if ($_ =~ m/"/) { #only use the vertex label lines, which include quote chars
		$_ =~ /^[0-9]+ "(.*)"/; #search for quoted text
		push @labels, $1; #return match, push to array
	}
}
close NETIN;
#read each .vec file and write it back out with labels attached
foreach my $vecfile (@ARGV) {
	open(VECIN, "<$vecfile") or die "error reading $vecfile";
	open(VECOUT, ">$vecfile.txt") or die "error creating $vecfile.txt";
	my @vec=();
	while (<VECIN>) {
		$_ =~ s/\015?\012//; #manual chomp to allow windows or unix text
		if ($_ !~ m/^\*/) { #skip the header line
			push @vec, $_;
		}
	}
	close VECIN;
	my $veclength = @vec - 1;
	my $lablength = @labels - 1;
	die "error, $vecfile is a different length than $netfile" unless $veclength==$lablength;
	for my $i (0..$veclength) {
		print VECOUT "$labels[$i]\t$vec[$i]\n";
	}
	close VECOUT;
}

print "WARNING: this script assumes that the .vec and .net have the same sort order\nplease spot check the values to avoid error\n";

September 29, 2009 at 5:25 am

Categorical distinctions: special victims unit

| Gabriel |

Alcibiades: But Socrates, what is it that you mean by justice?
Socrates: Suppose that there was a poet, Polkurgus, who gave a young girl unwatered wine and then once she was drunk made love to her both as a girl and as an eremenos. Certainly it would be just for the city authorities to make Polkurgus suffer?
Alcibiades: Indeed it would be just for Polkurgus to suffer. But I know this sad story and in fact Polkurgus took refuge in the Lydian court. His friends could not see him in his home city and he could not enter his works in the city’s poetry competitions. Over the course of a lifetime of exile certainly you would agree that Polkurgus suffered greatly.
Socrates: Indeed, Polkurgus suffered but suffering is not justice. Justice consists in righteous punishment by the city authorities. Even if it created greater suffering, Polkurgus’ exile was mere suffering and not justice, indeed it was the denial of justice.
Alcibiades: You are wise indeed Socrates.

Alcibiades: But Socrates, what is it that you mean by justice?

Socrates: Remember that Polkurgus gave a young girl unwatered wine and then once she was drunk forcefully made love to her both as one would a girl and as one would an eremenos. Certainly it would be just for the city authorities to punish Polkurgus?

Alcibiades: But Polkurgus was and remains a great poet, certainly the beauty of his poetry is greater than the suffering of his crime.

Socrates: Did not Agamemnon deserve to suffer for his crimes, even though in conquering Troy he brought greater glory to Greece than he would have by preserving Iphigenia?

Alcibiades: I see, indeed it would be as just for Polkurgus to suffer as for any criminal among the thetes. But I know this sad story and in fact Polkurgus fled his city and took refuge in the Lydian court. His friends could not see him without a very inconvenient voyage and he could not enter his works in the city’s poetry competitions. Furthermore, while in Lydia, he violated no more girls. Over the course of a lifetime away from his city, certainly you would agree that Polkurgus suffered greatly.

Socrates: Indeed, Polkurgus suffered but suffering is not justice. Justice consists in righteous punishment by the city authorities. Even if it created greater suffering, Polkurgus’ exile was mere suffering and not justice, indeed it was the denial of justice.

Alcibiades: You are wise indeed Socrates.


(If you don’t know what I’m talking about, see Kieran’s post at CT, including the mostly odious comment thread).

September 28, 2009 at 3:20 pm

so random

| Gabriel |

Nate Silver at 538 has accused Strategic Vision of fudging their numbers and his argument is simply that few of their estimates end in “0” or “5” and a lot of them end in “7.” The reason this is meaningful is that there’s a big difference between random and the perception of random. A true random number generator will give you nearly equal frequency of trailing digits “0” and “7,” but to a human being a number ending in “7” seems more random than one ending in “0.” Likewise clusters occur in randomness but human beings see clustering as suspicious. A scatterplot of two random variables drawn from a uniform has a lot of dense and sparse patches but people expect it to look like a slightly off-kilter lattice. That is, we intuitively can’t understand that there is a difference between a uniform distribution and a random variable drawn from a uniform distribution.
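The near-uniformity of trailing digits is easy to check for yourself. This quick sketch draws 10,000 pseudo-random digits and tallies them:

```shell
# Draw 10,000 pseudo-random digits and tally them: a decent generator
# gives each trailing digit (0 and 7 alike) close to 10% of the draws,
# even though clusters and streaks appear along the way.
awk 'BEGIN { srand(); for (i = 0; i < 10000; i++) print int(10 * rand()) }' |
  sort | uniq -c | sort -k2n
```

Each of the ten counts should come out near 1,000; a batch of poll results whose trailing digits look very different from this is what set off Silver’s alarm.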

This reminded me of two passages from literature. One is in Silence of the Lambs, when Hannibal Lecter tells Clarice that the locations of Buffalo Bill’s crime scenes are “desperately random, like the elaborations of a bad liar.” The other is from Stephenson’s Cryptonomicon, where a mathematician explains how he broke a theoretically perfect encryption scheme:

That is true in theory, … In practice, this is only true if the letters that make up the one-time pad are chosen perfectly randomly … An English speaker is accustomed to a certain frequency distribution of letters. He expects to see a great many e’s, t’s, and a’s, and not so many z’s and q’s and x’s. So if such a person were using some supposedly random algorithm to generate the letters, he would be subconsciously irritated every time a z or an x came up, and, conversely, soothed by the appearance of e or t. Over time, this might skew the frequency distribution.
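The skew Stephenson’s mathematician describes is easy to see on any scrap of English text, for instance:

```shell
# Count letter frequencies in a sample sentence: e tops the list,
# while z, q, and x barely register.
echo "the quick brown fox jumps over the lazy dog and then the slow red hen" |
  grep -o '[a-z]' | sort | uniq -c | sort -rn
```

A one-time pad generated by a human subconsciously nudging the letters toward this “natural” distribution is exactly the kind of non-randomness a cryptanalyst can exploit.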

Going a little bit further afield, in a recent bloggingheads, Knobe and Morewedge discuss the latter’s psych lab research on various issues, including how people tend to ascribe misfortune to malicious agency but fortune to chance. They then note that this is the opposite of how we tend to talk about God, seeing fortune as divine agency and misfortune as random. This is true for Americans, but this has less to do with human nature than with the unusual nature of the Abrahamic religions.*

Ironically, the lab research is pretty consistent with the modal human religious experience — animism organized around a “do ut des” relationship with innumerable spirits that control every aspect of the natural world. Most noteworthy is that much of this worship appears aimed not at some special positive favor but at getting the gods to leave you alone. So the Romans had sacrifices and festivals to appease gods like Robigus, the god of mold, and Cato the Elder’s De Agricultura explains things like how when you clear a grove of trees you need to sacrifice a pig to the fairies who lived in the trees so they don’t haunt the farm. These religious practices seem pretty clearly derived from a human tendency to treat misfortune as the result of agency and to generalize this to supernatural agency, absent cultural traditions to the contrary.


*I generally get pretty frustrated with people who talk about religion and human nature proceeding from the assumption that ethical monotheism and atheism are the basic alternatives. Appreciating that historically and pre-historically most human beings have been animists makes the spandrel theory of hyper-sensitive agency-detection much more plausible than the group-selectionist theory of solidarity and intra-group altruism.

September 28, 2009 at 1:10 pm

Stata console mode

| Gabriel |

I just realized that Stata SE/MP includes the console version of Stata.

Since the Stata GUI adds only about 15 or 16 megs of RAM and a comparably light load on the CPU, console mode doesn’t really improve performance that much for most things, but I still thought it was pretty cool in a dorky ASCII art kind of way. The only place where I notice a substantial performance jump is with one do-file that generates hundreds of graphs (and saves them to disk) — not only is console mode much faster, but it’s less distracting as graphs aren’t constantly popping up.

To invoke console mode on a mac, go to the terminal and write:


To get the GUI you’d do the same thing but change the last bit to “stataMP” (note the case-sensitivity). Of course both versions can take a do-file as an argument and you can add the path as an alias to ~/.bashrc like this:

echo "alias stataconsole='exec /Applications/Stata/'" >> ~/.bashrc
echo "alias statagui='exec /Applications/Stata/'" >> ~/.bashrc

You could similarly change text editor push scripts to use console, but I think it’s a good idea to use the GUI while you’re still debugging because it’s easier to spot error messages (the GUI has syntax highlighting) and experiment with alternate usage (the GUI menus can be useful for learning syntax).

September 24, 2009 at 4:45 am

Shell vs “Shell”

| Gabriel |

Two thoughts on Stata’s “shell” command (which lets Stata access the OS command line).

First, I just discovered the “ashell” command (“ssc install ashell”), which pipes the shell’s “standard out” to Stata return macros. For short output this can be a lot more convenient than what I had been doing, which was to pipe stdout to a file, then use “file” or “insheet” to get that into Stata. For instance, my “do it to everything in a directory” script is a lot simpler if I rewrite it to use “ashell” instead of “shell”.

ashell ls *.dta
* note that ashell tokenizes stdout so to use it as one string you need to reconstruct it
forvalues stringno=1/`r(no)' {
  local stdout "`stdout' `r(o`stringno')'"
}
*because i used "ls", stdout is now a list of files, suitable for looping
*as an example, i'll load each file and export it to excel 2007 format
foreach file in `stdout' {
  use `file', clear
  xmlsave `file'.xlsx, doctype(excel) replace
}

The second thing is a minor frustration I’ve had with the Stata “shell” command. Unlike a true shell in the terminal, it has no concept of a “session” but treats each shell command as coming ex nihilo. A related problem is that it doesn’t read any of your preference files (e.g., ~/.bashrc). Since shell preference files are just shell scripts read automatically at the beginning of a session, the latter is a logical corollary of the former. Ignoring the preference files is arguably a feature, not a bug, as it forces you to write do-files that will travel between machines (at least if both are Windows or both are POSIX).

Anyway, here’s a simple example of what I mean. Compare running this (working) bash script in a terminal session:

alias helloworld="echo 'hello world'"
helloworld

with the (failing) equivalent through Stata’s shell command:

shell alias helloworld="echo 'hello world'"
shell helloworld

Anyway, I think the best work-around is to use multiple script files that either invoke each other as a daisy chain or are all invoked by a master shell script. So, say you needed Stata to do some stuff, then process the data with some Unix tools, then do some more stuff with Stata, then typeset the output as PDF. One way to do this would be to have a shell script that says to use the first do file, then the perl script, then the second do-file, then a latex command. Alternately you could make the last line of the first do-file a “shell” command to invoke the perl script, the last line of the perl script a “system” or “exec” command to invoke the second do-file, and the last line of the second do-file is a “shell” command to invoke ghostscript or lyx.

Also note that if you’re doing the master shell script approach you can do some interesting stuff with “make” so as to ensure that the dependencies in a complicated workflow execute in the right order. See here and here for more info.
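As a sketch of the make approach (all file names here are hypothetical, and I’m assuming the console binary is on your path as “stata-se”), a Makefile lists each step’s output along with the inputs it depends on, and “make report.pdf” then runs only the steps whose inputs have changed, in the right order:

```make
# Hypothetical workflow: raw data -> perl munging -> Stata analysis -> PDF.
# Each rule reruns only when one of its prerequisites is newer than its target.
report.pdf: results.log report.tex
	pdflatex report.tex

results.log: clean.txt analysis.do
	stata-se -b do analysis.do

clean.txt: raw.txt munge.pl
	perl munge.pl raw.txt > clean.txt
```

This buys you the daisy-chain ordering without hard-coding the chain into each script, and a failed step stops the whole pipeline rather than silently feeding stale data downstream.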

Finally, if you just want to look up a long path that you usually “alias,” the simplest thing is to copy the full path from .bashrc; you can do this directly from Stata by typing “view ~/.bashrc”

September 23, 2009 at 5:29 am

So long (distinctive) Notre Dame economics

| Gabriel |

For many years Notre Dame had a “heterodox” economics department known for approaches that were less rosy on laissez-faire policy and formal modeling methods than your typical econ department. A few years ago the university created a conventional econ department alongside the heterodox department. Now it is closing down the old heterodox department and spreading the faculty among sister departments.

Even though I’m an economic sociologist (and thus by definition think conventional economics leaves out a lot of interesting stuff about markets and exchange), I think overall Chicago style economics has a more accurate and parsimonious model of exchange than does the style of economics that we can no longer associate with Notre Dame. Nonetheless even though I don’t agree with many of their models and conclusions, I think it’s unequivocally a shame both for economics as a discipline and Notre Dame as an institution that a department providing a distinctive perspective has been turned into yet another second-tier conventional department.

First, what this means for economics as a discipline. The development of ideas comes out of meaningful intellectual diversity, especially when there are viable circles where people who have novel ideas can bounce them off of each other. So long as the departure from the discipline’s dominant paradigm is not truly ridiculous (e.g., a biology or geology department known for young Earth creationism), I think it’s healthy for the field to have a few circles that break from the consensus.

Second, for Notre Dame as an institution. Notre Dame is embracing the strategy of middle-range conformity and this strikes me as a bad idea. Let’s take it as granted that there is no way that Notre Dame economics is going to crack the US News or NRC top 10 in the medium-term. If so, it seems like Notre Dame is better off having a deviant but distinctive department than a conventional department without any reputation at all (in the Zuckerman “focused identity” sense). This is the exact opposite of the famous GMU strategy of raising its third-tier law school and econ department into the top half of the second tier by developing a high-profile and distinctive identity emphasizing “law and economics” and the Austrian school. This was incredibly successful and managed to put GMU on the map (and raise the profile of these ideas, see point #1). GMU administrators describe their strategy as “moneyball,” meaning they found a niche with a lot of unappreciated and underpriced faculty talent, but I think of equal or greater importance was that they created a very focused and distinctive identity. [See the Crooked Timber book symposium on The Rise of the Conservative Legal Movement for some very interesting thoughts on this.]

The only way it could possibly benefit Notre Dame is in undergraduate education, if we feel (plausibly but not certainly) that undergraduates benefit from learning the conventional theories and (plausibly but not certainly) that most high school seniors aren’t aware of the specific intellectual reputations of specific departments. However if undergraduates know broadly what to expect of an institution (perhaps because Catholic universities are known for embracing the church’s “social doctrine” of center-left economic policy) then they’ll understand what they are signing up for and I think it’s good that they have the option of choosing an unconventional economics education.

Finally, I think it’s ironic that conventional economics assumptions of perfect information and free competition make a more plausible case that Notre Dame should keep a heterodox econ department (for reasons of undergrad pedagogy) than do the heterodox assumptions of bounded rationality and captive audiences. On the other hand, heterodox econ and econ soc are much better at predicting that Notre Dame would conform to disciplinary conventions.

September 22, 2009 at 4:56 am

| Gabriel |

Somebody recently asked me for a projected word count of my manuscript (which is in Lyx) and to answer this question I found the amazingly useful TeXcount script. If you just run “wc” (or the equivalent in a text editor) on a tex or lyx file you count all the plain text and the markup code. Not only does this script screen out the meta-text, but it can give you detailed breakdowns of words, figures, and captions — all broken out by section.

I like to keep scripts in “~/scripts/” so to make this script readily accessible from the command-line I entered the command:

echo "alias texcount='perl ~/scripts/TeXcount_2_2/'" >> ~/.bashrc

Now to run the command I just go to the terminal and type

texcount foo.tex

You should really check out the options if you have a long and complex document. My favorite option is “-sub”. This gives a detailed breakdown of word count, figure count, etc, by chapter, section, or whatever.

texcount -sub foo.tex

Remember that if you always use a certain option, you can write it into the alias command.

Lyx has a similar basic command built in (Tools/Statistics), but it doesn’t give as much information and doesn’t break out the data by section. To use texcount with lyx files, you first need to export Lyx to Latex which you can do from the GUI (File/Export/Latex), but if you’re using texcount anyway you should just use the command line.

lyx --export latex foo.lyx

That works for Linux, but on a Mac this will work more consistently:

exec '/Applications/' --export latex foo.lyx

That’s a long command, so on my Mac I created an alias as “lyx2tex”

echo "alias lyx2tex='exec /Applications/ --export latex'" >> ~/.bashrc

Note that all this works on POSIX but may require some modification to work with Windows (unless it has CygWin).

September 21, 2009 at 5:31 am

Uncertainty, the CBO, and health coverage

| Gabriel |

[update. #1. i’ve been thinking about these ideas for awhile in the context of the original Orszag v. CBO thing, but was spurred to write and post it by these thoughts by McArdle. #2. MR has an interesting post on risk vs uncertainty in the context of securities markets]

Over at OT, Katherine Chen mentions that IRB seems to be a means for universities to try to tame uncertainty. The risk/uncertainty dichotomy is generally a very interesting issue. It played a huge part in the financial crash in that most of the models and instruments based on them were much better at dealing with (routine) risk than with uncertainty (aka, “systemic risk”). Everyone was aware of the uncertainty but the really sophisticated technologies for risk provided enough comfort to help us ignore that so much was unknowable.

Currently one of the main ways we’re seeing uncertainty in action is with the CBO’s role in health finance reform. The CBO’s cost estimates are especially salient given the poor economy and Orszag/Obama’s framing of the issue as about cost. The CBO’s practice is to score bills based on a) the quantifiable parts of a bill and b) the assumption that the bill will be implemented as written. Of course qualitative parts of a bill and the possibility of time inconsistency are huge elements of uncertainty on the likely fiscal impact of any legislation. The fun thing is that this is a bipartisan frustration.

When the CBO scored an old version of the bill it said it would be a budget buster, which made Obama’s cost framing look ridiculous and scared the hell out of the blue dogs. This infuriated the pro-reform people who (correctly) noted that the CBO had not included in its estimates that IMAC would “bend the cost curve,” and thus decrease the long-term growth in health expenditures by some unknowable but presumably large amount. That is to say, the CBO balked at the uncertainty inherent in evaluating a qualitative change and so ignored the issue, thereby giving a cost estimate that was biased upwards.

More recently the CBO scored another version of the bill as being reasonably cheap, which goes a long way to repairing the political damage of its earlier estimate. This infuriates anti-reform people who note (correctly) that the bill includes automatic spending cuts and historically Congress has been loath to let automatic spending cuts in entitlements (or for that matter, scheduled tax hikes) go into effect. That is to say, the CBO balked at the uncertainty inherent in considering whether Congress suffers time inconsistency and so ignored the issue, thereby giving a cost estimate that was biased downwards.

That is to say, what looks like a straightforward accounting exercise is only partly knowable, and the really interesting questions are inherently qualitative ones, like whether we trust IMAC to cut costs and whether we trust Congress to stick to a diet. And that’s not even getting into real noodle-scratchers like pricing in the possibility that an initially cost-neutral plan chartered as a GSE would eventually get general-fund subsidies, or what will happen to the tax base when you factor in that making coverage less tightly coupled to employment should yield improvements in labor productivity.

September 18, 2009 at 5:18 pm

Stata2Pajek (update)

I fixed two bugs in Stata2Pajek and improved the documentation. To get it, type (from within Stata)

ssc install stata2pajek

If you already have the old version, the above command will update it, but even better is to update all of your ado files with:

adoupdate, update
September 18, 2009 at 2:07 pm

If at first you don’t succeed, try a different specification

| Gabriel |

Cristobal Young (with whom I overlapped at Princeton for a few years) has an article in the latest ASR on model uncertainty, with an empirical application to religion and development. This is similar to the issue of publication bias but more complicated and harder to formally model. (You can simulate the model uncertainty problem with respect to control variables, but beyond that it gets intractable.)
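To get a feel for the scale of the control-variable version of the problem, here is a back-of-the-envelope sketch: if each specification had an independent 5% chance of producing a spurious “significant” finding (independence is a simplifying assumption, since real specifications are correlated), the odds that at least one spec “works” climb quickly with the number of specs tried:

```shell
# P(at least one spurious p < .05) = 1 - 0.95^k for k independent specifications
for k in 1 5 10 20 50; do
  awk -v k="$k" 'BEGIN { printf "%2d specs: %.2f\n", k, 1 - 0.95 ^ k }'
done
```

With twenty specifications the chance of a publishable “finding” from pure noise is already roughly two in three.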

In classic publication bias, the assumption is that the model is always the same and it is applied to multiple datasets. This is somewhat realistic in fields like psychology where many studies are analyses of original experimental data. However in macro-economics and macro-sociology there is just one world and so to a first approximation what happens is that there is basically just one big dataset that people just keep analyzing over and over. To a lesser extent this is true of micro literatures that rely heavily on secondary analyses of a few standard datasets (e.g., GSS and NES for public opinion; PSID and ADD-health for certain kinds of demography; SPPA for cultural consumption). What changes between these analyses is the models, most notably assumptions about the basic structure (distribution of dependent variable, error term, etc), the inclusion of control variables, and the inclusion of interaction terms.

Although Cristobal doesn’t put it like this, my interpretation is that if there were no measurement error, this wouldn’t be a bad thing as it would just involve people groping towards better specifications. However if there is error then these specifications may just be fitting the error rather than fitting the model. Cristobal shows this pretty convincingly by showing that the analysis is sensitive to the inclusion of data points suspected to be of low quality.

I think it’s also worth honoring Robert Barro for being willing to cooperate with a young unknown researcher seeking to debunk one of his findings. A lot of established scientists are complete assholes about this kind of thing and not only won’t cooperate but will do all sorts of power plays to prevent publication.

Finally, see this poli sci paper, which does a meta-analysis of the discipline’s two flagship journals and finds a suspicious number of papers that are just barely significant. Although they describe the issue as “publication bias,” I think the issue is really model uncertainty.

September 17, 2009 at 3:30 pm

