Archive for May, 2009

Gretl

| Gabriel |

I experimented today with Gretl, a cross-platform, free and open source statistical package that is designed to be user-friendly. I found it worked very well and was very intuitive. Furthermore, it was fast, estimating a logit on a large dataset in under a second. Although it is described as being for econometrics, the package will work just as well for sociology. It has a pretty good variety of regression commands: not just OLS but things like multinomial logit, Poisson, censored data, time-series, and seemingly unrelated regression.

The interface is similar to SPSS or Stata’s interactive mode, and there is also a scripting mode accessible from the program’s console, which uses a “command y x1 x2” syntax very similar to Stata’s. The only real difference is that options are prefixed by “--” as in shell scripting, instead of being set off by a comma as in Stata. I recommend it to people who are teaching statistics so that each student can have a free package on his or her computer. This is especially so if your students lack both access to a lab equipped with Stata and the long-term interest in statistics that would justify buying Stata themselves. On the other hand, it may not be the best package for full-blown quants, as it encourages the bad habit of relying on interactive mode and treating scripting as a secondary activity. There’s also the fact that it’s not very popular, which will make collaboration and other support (such as finding text-editor language modules) difficult.

One note for Stata people: the import filter for Stata files doesn’t recognize the newest Stata format, so use the Stata command “saveold” instead of “save” if you plan to have Gretl import your files.

Some Mac-specific notes. I was very glad that they provide binaries, as I think Fink is so difficult and buggy as to be useless for most people (including me), but of these binaries only the “snapshot” build worked for me. Also, I found that I have to launch X11 before launching Gretl.

May 13, 2009 at 12:05 pm

The “by” prefix

| Gabriel |
In the referrer logs, somebody apparently reached us by googling “tag last record for each individual in stata”. This is pretty easy to do with the “by” syntax, which applies a command separately within each cluster. This syntax is very useful for all sorts of cleaning tasks on multilevel data. Suppose we have data where “i” is nested within “j” and the numbers of “i” are ordinally meaningful. For instance, it could be longitudinal data where “j” is a person and “i” is a person-interview, so the highest value of “i” is the most recent interview of “j”.

. use "/Users/rossman/Documents/oscars/IMDB_tiny_sqrt.dta"
. browse
. sort film
. browse
. gen x=0
. by film: replace x=1 if _n==_N
(16392 real changes made)

More generically, to tag the last observation of “i” within each “j”:

sort j i
gen lastobs=0
by j: replace lastobs=1 if _n==_N  // within each j, _n==_N only for the last observation

May 12, 2009 at 8:29 am 1 comment

Reply to All: Unsubscribe!

| Gabriel |

I subscribe to an academic listserv that’s usually very low traffic. Yesterday between 12:30 and 2:00 pm EDT there were a grand total of three messages discussing an issue within the list’s purview but not of interest to everyone on it. This was apparently too much for one reader of the list who at 2:54 pm EDT hit “reply to all” and wrote “Please remove me from this email list.  Thanks.”

And that’s when all Hell broke loose.

What ensued over the next few hours was 47 messages (on a list that usually gets maybe 10 messages a month), most of which consisted of some minor variation of “unsubscribe.” A few messages were people explaining that this wouldn’t work and providing detailed instructions on how one actually could unsubscribe (a multi-step process). Two others were from foundation officers pleading with people to stay on the list so they could use it to disseminate RFPs (“take my grants, please!”). Finally at about 9:15 pm EDT the listserv admin wrote and said he was pulling the plug on the whole list for a cooling off period until things could get sorted out.

To most of the people on the list this must have been a very unpleasant experience, either because they were bothered by a flood of messages all saying “unsubscribe” or (as with the foundation officers) because they valued the list and were dismayed to see mass defections from it. I mostly found it intellectually fascinating, since I was seeing an epidemic occur in real time and this is my favorite subject.

I went through each of the messages and recorded the time it was sent. Because the messages are bounced through a central server the timestamps are on the same clock. Here’s the time-series, counting from the first “unsubscribe” message:

0 12 19 26 29 30 30 30 30 33 33 33 41 43 49 51 55 58 58 59 60 65 67 68
68 76 79 81 83 85 86 87 98 107 116 122 125 131 137 169 287 311 317 345
355 383 390
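For anyone who wants to play along at home, here’s a quick sketch in R that turns those send times into the cumulative count (the vector is just the numbers above):

minutes <- c(0, 12, 19, 26, 29, 30, 30, 30, 30, 33, 33, 33, 41, 43, 49, 51,
             55, 58, 58, 59, 60, 65, 67, 68, 68, 76, 79, 81, 83, 85, 86, 87,
             98, 107, 116, 122, 125, 131, 137, 169, 287, 311, 317, 345, 355,
             383, 390)
# step plot of the cumulative count of "unsubscribe" messages
plot(minutes, seq_along(minutes), type = "s",
     xlab = "minutes since the first unsubscribe", ylab = "cumulative messages")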

Here’s the graph of the cumulative count.

[Figure: cumulative count of “unsubscribe” messages over time]

The first 150 minutes or so of this is a classic s-curve, which tips at about the 30 minute mark, increases rapidly for about an hour, then starts to go asymptotic around 90 minutes.

OK, so there’s some kind of contagious process going on, but what kind? I’m thinking that it has to be mostly network externalities. That is, it’s unpleasant to get a bunch of emails that all say “unsubscribe” or “remove.” Some people may stoically delete them (or take a perverse pleasure in watching an epidemic unfold), whereas others may be very attached to the list and willing to put up with a lot of garbage to keep what they like. In other words, there is a distribution of tolerance for annoying emails. People with a weak attachment to the list (many apparently didn’t even remember subscribing) and little patience are going to want to escape as soon as they get a few annoying emails, and they’re not going to think that carefully about the correct (and fairly elaborate) way to do it. So they hit reply-to-all. This of course makes the flood even more annoying, and so people who are slightly more patient will hit reply-to-all in turn. My favorite example of the unthinking panic this can involve is one message, the body of which was “Unsubscribe me!” and the subject was “RE: abc-listserv: Please DO NOT email the whole list serve to be REMOVED from the mailing list.”

Another thing to note is that the tipping point occurs really early and outliers trickle in really late. If you ignore the late outliers coming in after 150 minutes, the curve is almost a perfect fit for the Gompertz function, described on pp 19-21 of the Mahajan and Peterson RSF green book as:

delta=[b*extantdiffusion]*[log(pop)-log(extantdiffusion)]
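To see the shape this implies, here’s a minimal R sketch that just iterates the formula forward in one-minute steps (b and the starting value are made up for illustration):

b <- 0.08                # made-up diffusion rate
pop <- 47                # eventual number of messages
N <- 1                   # start from the first message
traj <- numeric(200)
for (t in 1:200) {
  N <- N + b * N * (log(pop) - log(N))    # the Gompertz increment above
  traj[t] <- N
}
plot(traj, type = "l", xlab = "minutes", ylab = "cumulative messages")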

What the logarithms do is move the tipping point up a little earlier so that the diffusion is not symmetrical but the laggards trickle in over a long time. Note this is the opposite of the curve Ryan and Gross found for hybrid corn, where it took forever to reach the tipping point but once it did there were very few laggards. It’s nice to have a formula for it, but why does it follow this pattern? My guess is that it is not that some people read 30 annoying emails in the space of an hour, ignore them, and then an hour later two more emails are the straw that breaks the camel’s back. Rather I think that what’s going on is that some people are away from their email for a few hours, they get back, and what on Earth is this in my inbox? So there are really two random variables to consider, a distribution of thresholds of tolerance for annoying emails and a distribution of times for when people will go on the internet and become exposed to those emails. Diffusion models, especially as practiced by sociologists, tend to be much more attentive to the first kind of effect and much less to the second. However there are lots of situations where both may be going on and failing to account for the latter may give skewed views of the former.
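Here’s a minimal simulation sketch of that two-random-variable story (every distribution and number in it is made up): each list member gets a tolerance for junk mail and a random time at which they next check their inbox, and anyone who checks after the pile has exceeded their tolerance hits reply-to-all.

set.seed(1)
n <- 200                              # list members (made up)
tolerance <- rexp(n, rate = 1/5)      # how many junk emails each member will put up with
next_check <- rexp(n, rate = 1/60)    # minutes until each member next checks email
replied <- rep(FALSE, n)
count <- 1                            # the first "unsubscribe" message
traj <- numeric(400)
for (t in 1:400) {
  checking <- !replied & next_check <= t
  replying <- checking & tolerance < count
  count <- count + sum(replying)
  replied <- replied | replying
  holding <- checking & !replying     # checked, held their tongue, will look again later
  next_check[holding] <- t + rexp(sum(holding), rate = 1/60)
  traj[t] <- count
}
plot(traj, type = "s", xlab = "minutes", ylab = "cumulative unsubscribe messages")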

May 7, 2009 at 6:21 am 4 comments

R, Stata and descriptive stats

| Pierre |

It’s amazing how R can make complicated things look simple and simple things look complicated.

I tried to explain in my previous post that R could have important advantages over Stata when it came to managing weird, “non-rectangular” datasets that need to be transformed/combined/reshaped in non-trivial ways: R makes it much easier to work on several datasets at the same time, and different types of objects can be used in consistent ways.

Still, I haven’t completely stopped using Stata: one of the things that bother me when I use R is the lack of nice and quick descriptive statistics functions like “tabulate” or “tabstat”. Of course, it is possible to use standard R functions to get about the same desired output, but they tend to be quite a bit more cumbersome. Here’s an example:

tabstat y, by(group) stats(n mean p10 median p90)

could be translated into R as:

tapply(levels(group), levels(group), function(i)
  cbind(N=length(y[group == i]),
        mean(y[group == i]),
        quantile(y[group == i], c(.1,.5,.9))))

or, for a more concise version:

by(y, group, function(x)
  c(N=length(x), mean=mean(x), quantile(x, c(.1,.5,.9))))

That’s quite ugly compared to the simple tabstat command, but I could deal with it… Now suppose I am working on survey data where observations have sampling weights, and the syntax has to get even more complicated: I’d have to think about it for a few minutes, when all Stata would need is a quick [fw=weight] statement added before the comma.
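For frequency weights specifically, one workaround is to expand the data and reuse the unweighted summary. Here is a hypothetical little helper along those lines (wasteful on big datasets, and only sensible for integer fw-style weights):

fw.stats <- function(y, group, w) {
  yy <- rep(y, times = w)             # replicate each observation w times
  gg <- rep(group, times = w)
  by(yy, gg, function(x)
    c(N = length(x), mean = mean(x), quantile(x, c(.1, .5, .9))))
}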

True, R can deal with survey weights, but it almost never matches the simplicity of Stata when all I am trying to do is get a few simple descriptive statistics on survey data:

One of my latest problems with R involved trying to make a two-way table of relative frequencies by column with weighted data… yes, a simple contingency table! The table() function cannot even compare with Stata’s tabulate twoway command, since:

  1. it does not handle weights;
  2. it does not report marginal distributions in the last row and column of the table (which I always find helpful);
  3. it calculates cell frequencies but not relative frequencies by row or column.

Luckily, writing an R function that can achieve this is not too hard:

col.table <- function(var1, var2, weights=rep(1,length(var1)), margins=TRUE){
  # Creating table of (weighted) relative frequencies by column, and adding row variable margins as the last column
  crosstab <- prop.table(xtabs(weights ~ var1 + var2), margin=2)
  t <- cbind(crosstab, Total=prop.table(xtabs(weights ~ var1)))
  # Adding column sums in the last row
  t <- rbind(t, Total = colSums(t))
  # Naming rows and columns of the table after var1 and var2 used, and returning result
  names(dimnames(t)) <- c(deparse(substitute(var1)), deparse(substitute(var2)))
  return(round(100*t,2))
}

col.table(x,y,w) gives the same output as Stata’s “tabulate x y [fw=w], col nofreq”. Note that the weight argument is optional, so col.table(x,y) is equivalent to “tabulate x y, col nofreq”.

Here’s the same function, but for relative distributions by row:

row.table <- function(var1, var2, weights=rep(1,length(var1)), margins=TRUE){
  t <- rbind(prop.table(xtabs(weights ~ var1 + var2), margin=1),
             Total=prop.table(xtabs(weights ~ var2)))
  t <- cbind(t, Total = rowSums(t))
  names(dimnames(t)) <- c(deparse(substitute(var1)), deparse(substitute(var2)))
  return(round(100*t,2))
}
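And a toy example of both functions, with made-up data and weights:

x <- factor(c("a", "a", "b", "b"))
y <- factor(c("u", "v", "u", "v"))
w <- c(10, 20, 30, 40)
col.table(x, y, w)    # column percentages, as in tabulate x y [fw=w], col nofreq
row.table(x, y, w)    # row percentages, as in tabulate x y [fw=w], row nofreq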

May 6, 2009 at 7:41 pm 11 comments

Computer viruses, herd immunity, and public goods

| Gabriel |

Today Slashdot has a post arguing that anti-virus software has both a private character (your computer works) and positive externalities (you don’t spread the virus to others), and this latter quality implies a public goods character that may make anti-virus software worthy of some kind of subsidy, including the indirect subsidy of a public service propaganda campaign. (I don’t see how much value a “the more you know” PSA would have, given that Windows is already obnoxiously in your face if you fail to install anti-virus software or let your subscription lapse.) Something the article doesn’t mention, but which is entirely consistent with its argument, is that there are subnational units that benefit from widespread immunity as a club good, and so almost all corporations and universities buy a site license for anti-virus software. This is perfectly rational behavior, as they end up capturing most of the private and public (or rather, club) benefits of anti-virus in the form of less IT support, as well as avoiding things like lowered productivity, bandwidth siphoned by botnets, and exposure to corporate espionage.

Similarly, Megan McArdle occasionally talks about how parents who refuse to vaccinate their children are not just endangering their own children but creating a public health problem. This is something I take seriously, not only as someone with a professional interest in diffusion but also as someone with a personal stake, given that I have a toddler and autism junk science is very popular in west LA. So basically my daughter has an elevated risk of measles because these crackpots are terrified of a vaccine additive that is a) harmless and b) hasn’t even been in use for over ten years.

Both of these cases rely on the logic of diffusion. The big picture is that in an endogenous growth process the hazard is a function of extant saturation: the more infected there are, the more at risk any given uninfected person is, and by implication anything that lowers the overall infection rate lowers the hazard. A more complex and more micro version comes from the mixed contagion / threshold diffusion model of the sort modeled in Abrahamson and Rosenkopf’s 1997 Organization Science paper. (I’ve mentioned it before, but I really do love that paper.) In these models individuals have a frailty level drawn from a random distribution and are exposed to (network) contagion and (generalized) cascades from their environment. When the contagion effect exceeds the individual’s threshold, the individual becomes infected and starts spreading the infection, thereby increasing the cascade and contributing to the social network effect on alters. What immunization does is raise the individual’s threshold appreciably, but not to infinity. This makes the individual less likely to be infected, and the public good aspect comes in because this protection matters most when the individual faces only low to medium contagion pressure from the environment.
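A tiny sketch of that last point (with made-up numbers): if thresholds are exponentially distributed and vaccination shifts the mean tolerance up, the infection probability at a given level of contagion pressure looks like this, and the gap between the two curves (the individual payoff to vaccinating) is biggest at low to medium pressure and shrinks as the pressure becomes overwhelming.

pressure <- seq(0, 10, by = .1)            # ambient contagion pressure
p_unvacc <- pexp(pressure, rate = 1)       # mean tolerance of 1 (made up)
p_vacc   <- pexp(pressure, rate = 1/5)     # vaccination raises mean tolerance to 5 (made up)
plot(pressure, p_unvacc, type = "l", xlab = "contagion pressure", ylab = "P(infected)")
lines(pressure, p_vacc, lty = 2)
legend("bottomright", legend = c("unvaccinated", "vaccinated"), lty = 1:2)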

Another way to think of it is as the flipside of public goods, the tragedy of the commons. If everyone else in the world but you installed Symantec and got a measles shot, it would actually be very safe for you as an individual to forgo these protections because everyone else is healthy and won’t infect you or your computer. On the other hand if nobody else in the world had this protection you would be constantly bombarded by both corporeal and computer viruses and so personal immunization would be more attractive, though because your threshold is finite even with immunization you would still be much more vulnerable than if immunization were widespread. The irony is that in the high-vaccination and low-vaccination contexts there are very different individual benefits on the margin vs on average.

May 6, 2009 at 12:20 pm 3 comments

Strip the spaces from a string

| Gabriel |

Because both Stata and the OS treat (non-quote-escaped) whitespace as a delimiter when parsing syntax, I try to keep spaces out of strings when possible and either just run the words together or put in underscores. I especially like to do this for anything having to do with file paths. On the other hand, I sometimes want to keep the spaces. For instance, if I have a file with lots of pop songs (many of which have spaces in their titles) and I want to graph them, I like to have the regular spelling in the title (as displayed in the graph) but take the spaces out for the filename. I wrote a little program called “nospaces” to strip the spaces out of a string and return it as a global.

capture program drop nospaces
program define nospaces
	set more off
	local x=lower("`1'")
	*the second argument is an optional character to put where the spaces were
	local char "`2'"
	local nspaces=wordcount("`x'")-1
	forvalues ns=1/`nspaces' {
		*regexm greedily matches the last remaining space; regexs() recovers the two halves
		quietly disp regexm("`x'","(.+) (.+)")
		local x=regexs(1)+"`char'"+regexs(2)
	}
	quietly disp "`x'"
	*return the despaced string as the global "x" since the program doesn't return a local
	global x = "`x'"
end

Note that I’m too clumsy of a programmer to figure out how to get it to return a local so there’s the somewhat clumsy workaround of having it return a global called “x.” There’s no reason to use this program interactively but it could be a good routine to work into a do-file. Here’s an example of how it would work. This little program (which would only be useful if you looped it) opens a dataset, keeps a subset, and saves that subset as a file named after the keep criteria.

local thisartist "Dance Hall Crashers"
local thissong "Fight All Night"
use alldata, clear
keep if artist=="`thisartist'" & song=="`thissong'"
nospaces "`thisartist'"
local thisartist "$x"
nospaces "`thissong'"
save `thisartist'_$x.dta, replace

May 5, 2009 at 3:23 am 2 comments

Soo-Wee!

| Gabriel |

In the face of the swine flu France has suggested suspending all EU flights to Mexico and you likewise occasionally hear calls for the US to temporarily close the border as a public health measure. Of course France has nothing close to the levels of social and economic integration with Mexico that we do so it’s a little easier for them to consider this than it would be for us. President Obama was asked about closing the border and said it would be “akin to closing the barn door after the horses are out, because we already have cases here in the United States.” Instead the US government has emphasized measures to curtail the domestic transmission of this disease through things like public transportation, schools, etc.

Having a completely one-track mind, I heard all this and thought it’s all about diffusion from without vs within a bounded population, I know this! Assume for the sake of argument that (absent action by American authorities) Mexico has a constant impact on the hazard rate of infection for Americans. This can either be because Mexico has a stable number of infections or, more realistically, because an increasing number of infections in Mexico are offset by a completely voluntary reduction in border traffic. We can thus treat Mexico as an exogenous influence. Of course Americans infecting each other is an endogenous influence. Now assume that there are two public health measures available, close the border and reduce the (domestic) transmission rate. The latter would involve things like face masks, encouraging sick people to stay home, closing schools that have an infection, etc. Further imagine that each measure would respectively cut the effect of exogenous and endogenous diffusion in half. What is the projected trajectory of the disease under various scenarios?

I’ve plotted some projections below, but first a few caveats:

  • For simplicity, I’ve assumed a linear baseline hazard rather than the more realistic Gompertz hazard. The projection is basically robust to this.
  • I’m assuming “no exit,” i.e. once infected, people remain contagious rather than getting better, being quarantined, or dying. This assumption is realistic over the short-run but absurd over the medium- to long-run.
  • I’m assuming that at this point 1% of the potential American risk pool is already infected. I also tried it with 5% and it works out the same.
  • Most importantly of all, I know nothing at all about the substantive issues of infectious disease, the efficacy of public health measures, and all of that sort of thing. Both the baseline numbers and the projected impacts of the public health measures are totally made up, and not even made up on an informed basis. I’m more interested in the math (which I know) than in plausible assumptions to feed into it (which I don’t know).

Anyway, here are the projected number of infections, which again assume that public health measures suppress the relevant disease vector by 50%.

[Figure: projected number of infections under the baseline, closed-border, and reduced-domestic-transmission scenarios]

As can be seen, in the very short run closing the border is more effective but in the medium-run, measures to reduce domestic transmission are more effective. This is just efficacy, not efficiency (i.e., cost-benefit).

The code is below the fold.

(more…)
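For the gist of it, here’s a rough sketch of this kind of two-vector projection (it is not the original code, and every rate in it is made up):

project <- function(exo = .01, endo = .2, periods = 30, p0 = .01) {
  # exo: constant hazard contribution from abroad; endo: domestic contribution per share infected
  p <- numeric(periods)
  p[1] <- p0
  for (t in 2:periods) {
    hazard <- exo + endo * p[t - 1]
    p[t] <- p[t - 1] + (1 - p[t - 1]) * hazard    # no exit: the infected stay infected
  }
  p
}
baseline     <- project()
close_border <- project(exo = .005)     # halve the exogenous vector
cut_domestic <- project(endo = .1)      # halve the endogenous vector
matplot(cbind(baseline, close_border, cut_domestic), type = "l", lty = 1:3,
        xlab = "time", ylab = "share of risk pool infected")
legend("topleft", legend = c("do nothing", "close border", "reduce domestic transmission"), lty = 1:3)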

May 3, 2009 at 2:57 pm

A threshold model for gay marriage

| Gabriel |
In a post at Volokh Conspiracy, Dale Carpenter notes that many states have recently made a push towards gay marriage and this may reflect a “bandwagon” effect. Although most of my work is on pop culture, I’ve experimented with applying these models to state policy and it looks like there’s rather a lot of the kind of bandwagon thing that Carpenter is describing. In my secondary analysis of the Walker data, I found that the typical law spreads from state to state via a mixed-influence curve very similar to that which Bass found for consumer appliances.
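For those who haven’t seen it, the Bass mixed-influence model just says the adoption hazard has a constant (external) piece plus a piece proportional to prior adoption; a quick sketch with made-up parameters:

p <- .01; q <- .3; m <- 50     # external influence, imitation, and potential adopters (made up)
N <- numeric(40)               # cumulative adopters, starting at zero
for (t in 2:40) N[t] <- N[t - 1] + (p + q * N[t - 1] / m) * (m - N[t - 1])
plot(N, type = "l", xlab = "time", ylab = "cumulative adopters")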

I think Carpenter’s analysis is basically accurate as far as it goes, but some of the details are a bit fuzzier than they ought to be. First, “bandwagon” is a vague term which can mean any cumulative advantage process, be it cohesive contagion, structural equivalence contagion, information cascades, or network externalities. In quoting a post by Ryan Sager, Carpenter implies that it’s mostly cascades. Second, Carpenter talks a lot about public opinion, but this isn’t really the issue; rather, what really matters are the opinions of policy makers. For a long time it has been apparent that courts are much more open to gay marriage than democratic policy institutions, but increasingly we are now seeing a gap open between small-d democratic plebiscites and small-r republican state legislatures. For instance in California, the gay marriage issue lines up as the courts and the state legislature (pro) versus plebiscites and the governor (con). It seems that part of the reason for the gap between public opinion and policy maker opinion is that educated people are more cosmopolitan, and part has to do with the coalition politics of the Democratic party (for instance, in California many Democratic legislators whose districts voted for Prop 8 nonetheless voted for AB 849; likewise I would be very surprised if gay marriage is as popular with DC residents as with the DC city council).

When you combine these two vagaries (what exactly the cumulative advantage mechanism is, and among whom it operates), you come to a very interesting synthesis about how this may be working. I would suggest that a very large part of the issue is not an information cascade but a network externality among policy makers. These mechanisms are subtly different. In an information cascade we don’t know the value of things and so we figure that the consensus about them is informative. With network externalities the consensus itself implies value, so the important thing is to be with the consensus.

A simple recent example of network externality dynamics is the format war between HD-DVD and Blu-Ray. Aside from Sony no movie studio really cared about the differences between the formats (and to the extent they did care, they preferred HD-DVD which was cheaper to manufacture) but they cared a lot about making sure they didn’t commit to the wrong format because nobody wants to own a bunch of equipment and a big disc inventory for a format that consumers have rejected. The studios dithered about making a big commitment to either format until Sony basically sent its Playstation brand on a suicide mission to build a critical mass of Blu-Ray players at which point the remaining studios abandoned HD-DVD almost immediately.

Likewise at a certain point gay marriage began to seem inevitable (a prediction shared even by many people who see this as unfortunate). Now many ordinary people would say that popular or not is irrelevant: “I’d support [marriage equality / traditional marriage] even if everyone disagreed with me.” However there is another way to think about it as “being on the right side of history,” a concern made more salient by the frequent analogies drawn to Jim Crow and especially to miscegenation laws. The Sager piece alludes to some pro-segregation pieces published by National Review in the 1950s and this is interesting. At the time these were not considered crackpot ideas (they were probably more mainstream than NR‘s pro-drug-legalization pieces in the 1990s) but in retrospect they are repulsive. I think this is a big part of what’s going on here: policy makers are not just judging themselves against public opinion today but against what they project public opinion to be in the future. Since they (probably accurately) perceive that gay marriage will become more popular over time, they are calibrating their actions to this future metric rather than to current opinion, which is basically divided (at present the median voter opposes gay marriage per se but favors the Solomonic “civil unions” compromise). In contrast, some voters care about “being on the right side of history” but many do not, in part because, unlike legislators, their votes are not recorded, and thus if they change their minds in the future (or if they remain the same but their opinions become less popular than they are currently) they will suffer little from the inter-temporal contradiction.

(Note: I’m interested in this as a question of diffusion, not a substantive one of morality or policy, and will enforce this in the comments.)

May 1, 2009 at 5:00 pm 3 comments

Stata 64 for Mac

| Gabriel |

I’m way late to the party on this, but a couple of months ago Stata released a free 64-bit upgrade for users who already have licenses of Stata 10 for Mac. Note that this upgrade does not download with the usual “update all” command; you need to follow the instructions on the website.

The major advantage of this is that it lets you access more than 2 GB of memory. For reasons that they explain thoroughly, it performs calculations faster on some variable types but slower on others. If you’re worried about this, open your favorite large dataset and type “desc” to see how most of your variables are stored. If most of them are “double” you definitely want 64-bit; if most of them are “byte,” maybe not. (They don’t explain how it handles strings.) Likewise, think about how often you use things like “forvalues” loops, which involve the kind of little numbers that run faster in the old version of Stata.

May 1, 2009 at 3:42 am
