Archive for May, 2009


| Gabriel |

Sometimes you want to use a tool that’s not available for your operating system. For instance, I use a Mac but I sometimes want to use Windows software (eg Pajek) or Unix software (eg Dia). Likewise, Windows users might envy the system tools provided by POSIX systems. Since POSIX has a lot of very powerful text-processing tools I think this should be particularly appealing to culture quants. My own basic solution is to use API emulation for Windows and a full-blown virtual machine for Unix applications.

A lot of people use dual-boot solutions for this, but I'm less fond of them. The way this works is that when you turn on the computer you choose which OS you want to use. For instance, OS X Leopard includes "Boot Camp," and programs like GRUB and Wubi let you choose between Windows and Linux. Once you're in an environment you can only use the software native to it, and you sometimes have limited access to the other environment's file system. The upside is that once you get them working they make minimal demands on system resources. There are really two problems with this approach. One is that you have to reboot to switch between, say, Windows apps and Mac apps. The other is that some of these solutions are a little dangerous, as most of them involve repartitioning the disk. This is particularly the case with Macs, which have unusual firmware and partition tables, so dual-boot solutions other than Boot Camp don't work very smoothly on them. I've tried twice to get my Mac running as dual boot with Linux, and both times I ended up having to reinstall OS X and restore from Time Machine. (On the other hand, I've had no problem installing dual boots on Wintel machines, including the old clunker I used to write my dissertation, which is now mostly running Xubuntu because it's faster than XP.)

Since OS X already has the full panoply of POSIX tools and can in principle run any Unix software, at first glance it doesn't make sense that I'd want to run Linux on my Mac. The problem is that in practice this usually only works if the software is pre-compiled, most of it is not, and I've had a lot of trouble getting Fink to work properly. It seems like it's always missing some package or compiler and won't compile the application. I'm so frustrated with Fink that I find it much easier to just use VirtualBox to run an Xubuntu virtual machine and rely on Ubuntu's native package manager, which never gives me any hassle. VirtualBox is a free virtual machine manager that can run just about anything from just about anything. The main reason I use virtualization instead of dual boot is that it's impossible to damage your main OS by installing a virtual machine. Even if you can't get it to work, the worst-case scenario is that you wasted your time, unlike a dual-boot attempt where you may have to start thinking about how good your most recent backup is and whether you still have all your installation discs. Of course the main downside to virtualization is that you split the system resources. The first way to handle this is to avoid a bloated guest OS. You can get really small with DSL or Puppy, but the best ratio of user-friendliness to compactness is probably Xubuntu. Likewise, if you're installing a Windows guest you'd rather use XP than Vista. The second way to handle it is to buy more RAM. This is cheaper than you'd think because OEMs in general, but especially Apple, use configuration as a form of price discrimination. Apple charges $100 to upgrade a new computer from 2GB to 4GB, but you can get 4GB of RAM on Amazon for $50, and it takes about five minutes to install if you have a jeweler's-size Phillips-head screwdriver. (Their hard drives are even more overpriced.)

For Windows software I don't keep a virtual machine, in part because I don't want to buy a Windows license and in part because I worry about the performance hit of running Windows as a VM. Instead I use CrossOver, a proprietary build of Wine with better tech support. CrossOver/Wine is a Windows API emulator, which is basically a minimalist virtual machine. It both runs much faster than full-blown virtualization and doesn't require a license for the guest OS. On the other hand it can be slightly buggier for some things, but in my experience CrossOver works great with my old Microsoft Office 2003 for Windows license as well as Pajek.

May 18, 2009 at 5:47 am 1 comment

Herd immunity, again

| Gabriel |

Recently I talked about herd immunity in computer viruses. Yesterday Slashdot linked to an article on a potential vaccine that kills mosquitoes after they've bitten you; that is, it has a herd-immunity effect but no individual benefit at all. Although the article doesn't mention it, traditional residential DDT spraying works exactly the same way. (After the mosquito bites you she rests on your wall, takes in DDT, and dies.)

It's interesting to think about whether people will adopt these sorts of vaccines, since the discrepancy between the marginal and average benefit is even greater than with, say, the measles vaccine. In an article on DDT, Gladwell noted that dictatorships tended to be more effective at DDT campaigns than democracies, but I'd like to imagine that carrots would work as well as sticks to encourage people to contribute to the public good of herd immunity. Of course sociology has a lot to say about the best ways to get people to contribute to the health of strangers.

May 15, 2009 at 5:12 am

Remove the double-spacing from estout

| Gabriel |
As mentioned before, I love estout. However I dislike some of its features, such as that it leaves a blank line between rows. I wrote this code to go at the end of a do-file where it could clean all my tables and remove the gratuitous lines. You could also use it to accomplish other cleaning tasks or to clean other text-based files, including log files.

esttab using table_x.txt , b(3) se(3) scalars(ll rho) nodepvars nomtitles  label title(Table X: THE NEW VARIABLES. 1936-2005) replace fixed
shell cp table_x.txt $tabledir
eststo clear
set mem 250m
*clean the tables, eliminate extra lines
cd $parentpath
shell touch tmpfile
shell mv tmpfile filelist_text.txt
cd $tabledir
shell ls *.txt >"$parentpath/filelist_text.txt"
shell awk '{ gsub("\.csv", ""); print $0;}' "$parentpath/filelist_text.txt" > tmp
shell mv tmp "$parentpath/filelist_text.txt"
shell perl -pe 's/\n/ /g'  "$parentpath/filelist_text.txt" > tmp
shell mv tmp "$parentpath/filelist_text.txt"
capture file close myfile
file open myfile using "$parentpath/filelist_text.txt", read
file read myfile line
global filelist `line'
foreach file in $filelist  {
	shell perl -ne 'print unless /^$/'  `file' > tmp
	copy tmp `file', replace
	erase tmp
}
*each of several banks of regression commands throughout the code end with some variant on:
esttab using table_k.txt
shell cp table_k.txt $tabledir
*note, i prefer "shell cp" over the Stata command "copy" because "cp" assumes the target filename

*at the end of the do-files clean the tables
*this part merely seeds the loop and is adapted from my "do it to everything in the directory" post
cd $parentpath
shell touch tmpfile
shell mv tmpfile filelist_text.txt
cd $tabledir
shell ls *.txt >"$parentpath/filelist_text.txt"
shell perl -pe 's/\n/ /g'  "$parentpath/filelist_text.txt" > tmp
shell mv tmp "$parentpath/filelist_text.txt"
capture file close myfile
file open myfile using "$parentpath/filelist_text.txt", read
file read myfile line
global filelist `line'
*this is where the actual cleaning occurs
foreach file in $filelist  {
	shell perl -ne 'print unless /^$/'  `file' > tmp
	copy tmp `file', replace
	erase tmp
}

May 14, 2009 at 5:48 am 2 comments


| Gabriel |

I experimented today with Gretl, a cross-platform, free and open-source statistical package that is designed to be user-friendly. I found it worked very well and was very intuitive. Furthermore it was fast, running a logit on a large dataset in under a second. Although it is described as being for econometrics, the package will work just as well for sociology. It has a pretty good variety of regression commands, not just OLS but things like multinomial logit, Poisson, censored data, time series, and seemingly unrelated regression.

The interface is similar to SPSS or Stata's interactive mode, and there is also a scripting mode accessible from the program's console which uses a "command y x1 x2" syntax very similar to Stata's. The only real difference is that options are prefixed with "--" as in shell scripting, instead of following a comma as in Stata. I recommend it to people who are teaching statistics so that each student can have a free package on his or her own computer. This is especially so if you're teaching students who lack access to a lab equipped with Stata and the long-term interest in statistics to justify buying Stata themselves. On the other hand it may not be the best package for full-blown quants, as it encourages the bad habit of relying on interactive mode and treating scripting as a secondary activity. There's also the fact that it's not very popular, which will make collaboration and other support (such as finding text-editor language modules) difficult.

One note for Stata people: the import filter for Stata files doesn't recognize the newest Stata format, so use the Stata command "saveold" instead of "save" if you plan to have Gretl import your files.

Some Mac-specific notes. I was very glad that they provide binaries, as I think Fink is so difficult and buggy as to be useless for most people (including me), but of these binaries only the "snapshot" build worked for me. Also, I found that I have to launch X11 before launching Gretl.

May 13, 2009 at 12:05 pm

The “by” prefix

| Gabriel |
In the referrer logs, somebody apparently reached us by googling "tag last record for each individual in stata". This is pretty easy to do with the "by" prefix, which applies a command separately within each cluster. This syntax is very useful for all sorts of cleaning tasks on multilevel data. Suppose we have data where "i" is nested within "j" and the values of "i" are ordinally meaningful. For instance, it could be longitudinal data where "j" is a person and "i" is a person-interview, so the highest value of "i" is the most recent interview of "j".

. use "/Users/rossman/Documents/oscars/IMDB_tiny_sqrt.dta"
. browse
. sort film
. browse
. gen x=0
. by film: replace x=1 if _n==_N
(16392 real changes made)

*in general, to tag the last observation in each cluster j:
sort j i
gen lastobs=0
by j: replace lastobs=1 if _n==_N

May 12, 2009 at 8:29 am 1 comment

Reply to All: Unsubscribe!

| Gabriel |

I subscribe to an academic listserv that’s usually very low traffic. Yesterday between 12:30 and 2:00 pm EDT there were a grand total of three messages discussing an issue within the list’s purview but not of interest to everyone on it. This was apparently too much for one reader of the list who at 2:54 pm EDT hit “reply to all” and wrote “Please remove me from this email list.  Thanks.”

And that’s when all Hell broke loose.

What ensued over the next few hours was 47 messages (on a list that usually gets maybe 10 messages a month), most of which consisted of some minor variation of “unsubscribe.” A few messages were people explaining that this wouldn’t work and providing detailed instructions on how one actually could unsubscribe (a multi-step process). Two others were from foundation officers pleading with people to stay on the list so they could use it to disseminate RFPs (“take my grants, please!”). Finally at about 9:15 pm EDT the listserv admin wrote and said he was pulling the plug on the whole list for a cooling off period until things could get sorted out.

To most of the people on the list this must have been a very unpleasant experience, either because they were bothered by a flood of messages all saying "unsubscribe" or (as with the foundation officers) because they valued the list and were dismayed by the mass defections from it. I mostly found it intellectually fascinating, since I was watching an epidemic occur in real time, and this is my favorite subject.

I went through each of the messages and recorded the time it was sent. Because the messages are bounced through a central server, the timestamps are all on the same clock. Here's the time series, in minutes since the first "unsubscribe" message:

0 12 19 26 29 30 30 30 30 33 33 33 41 43 49 51 55 58 58 59 60 65 67 68
68 76 79 81 83 85 86 87 98 107 116 122 125 131 137 169 287 311 317 345
355 383 390
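As a quick sanity check on the shape of the curve (a sketch using standard shell tools, not part of the original analysis), the minute marks can be bucketed into half-hour windows:

```shell
# Count messages per 30-minute window from the minute offsets above;
# the burst in the 30-89 minute windows is the steep part of the s-curve.
printf '%s\n' 0 12 19 26 29 30 30 30 30 33 33 33 41 43 49 51 55 58 58 59 \
  60 65 67 68 68 76 79 81 83 85 86 87 98 107 116 122 125 131 137 169 \
  287 311 317 345 355 383 390 |
awk '{ bucket = int($1 / 30) * 30; count[bucket]++ }
     END { for (b in count) print b, count[b] }' | sort -n
```

The first three windows come out as "0 5", "30 15", and "60 12", which matches the eyeballed tipping point at about the 30-minute mark.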

Here’s the graph of the cumulative count.


The first 150 minutes or so of this is a classic s-curve, which tips at about the 30 minute mark, increases rapidly for about an hour, then starts to go asymptotic around 90 minutes.

OK, so there's some kind of contagious process going on, but what kind? I'm thinking that it has to be mostly network externalities. That is, it's unpleasant to get a bunch of emails that all say "unsubscribe" or "remove." Some people may stoically delete them (or take a perverse pleasure in watching an epidemic unfold) whereas others may be very attached to the list and willing to put up with a lot of garbage to keep what they like. That is, there is a distribution of tolerance for annoying emails. People with a weak attachment to the list (many apparently didn't even remember subscribing) and little patience are going to want to escape as soon as they get a few annoying emails, and they're not going to think carefully about the correct (and fairly elaborate) way to do it. So they hit reply to all. This of course makes the list even more annoying, and so people who are slightly less impatient hit reply to all in turn. My favorite example of the unthinking panic this can involve is one message whose body was "Unsubscribe me!" and whose subject was "RE: abc-listserv: Please DO NOT email the whole list serve to be REMOVED from the mailing list."

Another thing to note is that the tipping point occurs really early and outliers trickle in really late. If you ignore the late outliers coming in after 150 minutes, the curve is almost a perfect fit for the Gompertz function, described on pp 19-21 of the Mahajan and Peterson RSF green book as:
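(The displayed formula did not survive the archiving. As a reconstruction in standard notation, not the book's exact typesetting, the Gompertz diffusion model is usually written:)

$$\frac{dN(t)}{dt} = b\,N(t)\left[\ln \bar{N} - \ln N(t)\right]$$

where \(N(t)\) is the cumulative number of adopters at time \(t\), \(\bar{N}\) is the ceiling of potential adopters, and \(b\) governs the rate of adoption.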


What the logarithms do is move the tipping point up a little earlier so that the diffusion is not symmetrical but the laggards trickle in over a long time. Note this is the opposite of the curve Ryan and Gross found for hybrid corn, where it took forever to reach the tipping point but once it did there were very few laggards. It’s nice to have a formula for it, but why does it follow this pattern? My guess is that it is not that some people read 30 annoying emails in the space of an hour, ignore them, and then an hour later two more emails are the straw that breaks the camel’s back. Rather I think that what’s going on is that some people are away from their email for a few hours, they get back, and what on Earth is this in my inbox? So there are really two random variables to consider, a distribution of thresholds of tolerance for annoying emails and a distribution of times for when people will go on the internet and become exposed to those emails. Diffusion models, especially as practiced by sociologists, tend to be much more attentive to the first kind of effect and much less to the second. However there are lots of situations where both may be going on and failing to account for the latter may give skewed views of the former.

May 7, 2009 at 6:21 am 4 comments

R, Stata and descriptive stats

| Pierre |

It’s amazing how R can make complicated things look simple and simple things look complicated.

I tried to explain in my previous post that R could have important advantages over Stata when it came to managing weird, “non-rectangular” datasets that need to be transformed/combined/reshaped in non-trivial ways: R makes it much easier to work on several datasets at the same time, and different types of objects can be used in consistent ways.

Still, I haven’t completely stopped using Stata: one of the things that bother me when I use R is the lack of nice and quick descriptive statistics functions like “tabulate” or “tabstat”. Of course, it is possible to use standard R functions to get about the same desired output, but they tend to be quite a bit more cumbersome. Here’s an example:

tabstat y, by(group) stats(n mean p10 median p90)

could be translated into R as:

tapply(levels(group), levels(group), function(i)
  cbind(N=length(y[group == i]),
    mean=mean(y[group == i]),
    quantile(y[group == i], c(.1,.5,.9))))

or, for a more concise version:

by(y, group, function(x)
  c(N=length(x), mean=mean(x), quantile(x, c(.1,.5,.9))))

That's quite ugly compared to the simple tabstat command, but I could deal with it… Now suppose I am working on survey data where observations have sampling weights: the syntax gets even more complicated, and I'd have to think about it for a few minutes, when all Stata would need is a quick [fw=weight] statement added before the comma.

True, R can deal with survey weights, but it almost never matches the simplicity of Stata when all I am trying to do is get a few simple descriptive statistics on survey data:

One of my latest problems with R involved trying to make a two-way table of relative frequencies by column with weighted data… yes, a simple contingency table! The table() function cannot even compare with Stata’s tabulate twoway command, since:

  1. it does not handle weights;
  2. it does not report marginal distributions in the last row and column of the table (which I always find helpful);
  3. it calculates cell frequencies but not relative frequencies by row or column.

Luckily, writing an R function that can achieve this is not too hard:

col.table <- function(var1, var2, weights=rep(1,length(var1)), margins=TRUE){
# Creating table of (weighted) relative frequencies by column, and adding row variable margins as the last column
crosstab <- prop.table(xtabs(weights ~ var1 + var2), margin=2)
t <- cbind(crosstab, Total=prop.table(xtabs(weights ~ var1)))
# Adding column sums in the last row
t <- rbind(t, Total=colSums(t))
# Naming rows and columns of the table after var1 and var2 used, and returning result
names(dimnames(t)) <- c(deparse(substitute(var1)), deparse(substitute(var2)))
t
}

col.table(x,y,w) gives the same output as Stata's "tabulate x y [fw=w], col nofreq". Note that the weight argument is optional, so col.table(x,y) is equivalent to "tabulate x y, col nofreq".

Here’s the same function, but for relative distributions by row:

row.table <- function(var1, var2, weights=rep(1,length(var1)), margins=TRUE){
t <- rbind(prop.table(xtabs(weights ~ var1 + var2), margin=1),
Total=prop.table(xtabs(weights ~ var2)))
t <- cbind(t, Total=rowSums(t))
names(dimnames(t)) <- c(deparse(substitute(var1)), deparse(substitute(var2)))
t
}

May 6, 2009 at 7:41 pm 11 comments

