Archive for August, 2012
| Gabriel |
Yglesias has a good post on Samsung but he kind of buried the lede with this passage:
A second reason [that supporting incumbents through patents is problematic] is that we may not be able to recombine ideas. What if pinch-to-zoom (iPhone) is a good idea, but so are Android-style widgets that display live information? Well if nobody can copy each other’s ideas, then nobody will ever be able to buy a phone that has them both.
That was interesting to me because it’s exactly the same argument that biologists give as to why sexual reproduction is important at the macro level (the explanations at the micro level are different and mostly have to do with avoiding a monoculture for parasites). This is literally the textbook explanation, as seen in this passage from Futuyma’s Evolution, 2nd ed.: “Recombination breaks down linkage disequilibrium, so that combinations of deleterious alleles on the one hand, and of advantageous alleles on the other, arise and thus increase the variance in fitness, so that selection can increase fitness more effectively” (p. 391). In plain English, Futuyma is saying that with asexual reproduction advantageous mutations have to occur in the same lineage, whereas with sexual reproduction they can occur in different lineages and then combine through sex. Over the course of a few generations this radically increases the odds of getting more fit organisms. Another way to think of it is to imagine how much easier it would be to build a winning poker hand if you could pair up with another player, xerox their cards, and build several hands out of the copies. It is noteworthy that asexual lineages tend to be very recent lineages descended from sexual lineages, which in turn suggests that over the long run asexual lineages tend to go extinct at much higher rates than sexual lineages.
Patents are actually even worse than asexual reproduction since unlike US patent law, nature allows for the (unlikely) possibility of “independent invention” in two distinct asexual lineages. Let’s put that aside though and assume that patents are no worse for technical innovation than parthenogenesis is for biological innovation — it’s still pretty bad. That nature is dominated by organisms that recombined advantageous traits from independent lineages and that parthenogenetic lineages are so prone to extinction is at least suggestive that we want an intellectual property regime that facilitates borrowing through relatively short patent/copyright terms, narrow patents, and a more robust (and low transaction cost) fair use doctrine to facilitate “derived works.” As is though, we are letting our IP regime drift its way towards the amoeba.
[Update: My colleague Bill Roy observed that my argument assumes a lack of patent licensing. This is true, but it may or may not matter depending on how efficiently licensing works. It is obviously true that patent licensing does not follow perfect Coasean bargaining but equally obvious that there is some licensing. I tend to be a skeptic of the power of licensing to implement anything like perfect Coasean bargaining for three reasons: lack of indexability, transaction costs to negotiations, and prisoner’s dilemma among licensors. However it is worth noting that the recombination issue loses its power to the extent that we assume a relatively efficient licensing regime.]
| Gabriel |
Last time we talked about Social Security names data and how to query particular names. Today I want to talk about the big picture of the variety of names and the question of whether names are getting more diverse over time. Well, this is basically a question of entropy, so let’s do a time trend of the Gini by cohort.
To me, the main thing the graph demonstrates is actually how sensitive these things are to measurement, to the extent that I’m unwilling to make substantive claims without better understanding the history of how the data were collected. As you can see, there’s a huge jump in the Gini starting with birth cohorts dating to WWI. This predates the enactment of Social Security (which I’ve drawn as a black vertical line) by about twenty years, so my best guess is that it corresponds to cohorts that were aging into the labor force around the time the act passed. Alternately it could be something about WWI accelerating assimilation in naming practices, but when I see a sharp discontinuity like that my instincts tell me it’s a measurement artifact, not a real social change.
Putting that aside, let’s think about the index itself. The Gini coefficient was developed to study social inequality and as such it’s sensitive to both the top and the bottom. Gini is basically a better version of taking the ratio of a high percentile and a low percentile. If you have exactly two people with exactly equal wealth (or exactly two names with equal numbers of babies) then you’d have a very low Gini.
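To make the two-name case concrete, here is a minimal sketch of a Gini calculation over name counts. It is in Python rather than the Stata I used for the graphs, the function name `gini` is just for illustration, and it uses the standard rank-weighted formula, which may differ slightly from whatever small-sample correction a given stats package applies.

```python
def gini(counts):
    """Gini coefficient of non-negative counts (0 = perfectly equal)."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive count")
    # Rank-weighted formula: G = 2 * sum(rank_i * x_i) / (n * sum(x)) - (n + 1) / n,
    # with ranks starting at 1 for the smallest value.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Two names with equal numbers of babies: no inequality at all.
print(gini([500, 500]))
# Five names, one of which takes nearly every baby (made-up counts).
print(gini([1, 1, 1, 1, 96]))
```

The first call gives 0 and the second gives about 0.76, so the index is driven by how unevenly the babies are spread across names, not by how many names there are.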
Two names sounds ridiculous, but not as much as you’d think. Consider Republican-era Rome. We have a pretty good idea of Roman names, at least in the upper classes, because they kept lists called “fasti consulares” of every man who served as consul. I previously did a post showing how a few clans dominated, but for today I want to just use these lists to show how few first names there were. These lists show only 29 male first names, of which only 17 were popular.* (In contrast, the Social Security data lists thousands of male names in circulation in any given year.) The Gini coefficient for praenomina on the Republican fasti consulares is .72, which is not that far below the pre-1910 Social Security data. If you’re wondering, the most popular praenomina were Lucius, Gaius, Marcus, and Quintus. Here’s a kernel density plot for praenomen frequency.
As you can see, Roman names follow a count distribution, but it’s not ridiculously steep like American names in any arbitrary year (like this graph of 1920). The fact that 29 names following a fairly shallow count distribution could show a Gini comparable to that of thousands of names following an extremely steep one suggests to me that there is something unsatisfying about the metric for our purposes.
Another entropy index we can use is the Herfindahl-Hirschman Index (HHI). HHI is meant to measure the potential for monopolies and cartels and as such it’s only really sensitive to the top. HHI is basically a better version of taking the share held by the top-4 (or top-8 or top-k) actors in the system. If you have exactly two people with exactly equal wealth (or exactly two names with equal numbers of babies) then you’d have a very high HHI.
A thought experiment that reveals the difference between Gini and HHI is that if the United States were to suddenly add a few million desperately poor people, for instance by annexing Haiti, this would barely change our income HHI but it would drive our income Gini up appreciably. Nonetheless, under a wide range of circumstances the Gini and HHI will be correlated because both measure inequality; they just have different emphases.
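The thought experiment is easy to run on toy numbers. The sketch below is Python, not the code behind the graphs in this post, and the incomes are invented for illustration. It computes HHI on the 0 to 10,000 scale (the sum of squared percentage shares) alongside a rank-weighted Gini, before and after adding a block of people with almost no income.

```python
def gini(values):
    """Gini coefficient via the rank-weighted formula (0 = perfect equality)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def hhi(values):
    """Herfindahl-Hirschman Index: sum of squared percentage shares (0-10,000)."""
    total = sum(values)
    return sum((100 * x / total) ** 2 for x in values)


# Made-up incomes for a tiny "country" of four people.
base = [10, 20, 30, 40]
# "Annex" four more people who have almost no income.
merged = base + [0.1] * 4

print(f"before: Gini={gini(base):.3f}  HHI={hhi(base):.0f}")
print(f"after:  Gini={gini(merged):.3f}  HHI={hhi(merged):.0f}")
```

With these numbers the Gini jumps from 0.25 to about 0.62 while the HHI slips only from 3000 to roughly 2976, because the newcomers hold almost none of the total and so add almost nothing to the sum of squared shares. For reference, `hhi([1, 1])` is 5000, which is the "very high HHI" of the two-equal-names case.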
In the case of names, HHI will capture the dominance of stock names like “Jake” and “Mary” whereas Gini is better at capturing how common weird names are. So that said, let’s do the time trend again, but this time with HHI.
It’s very interesting that we now don’t see a precipitous change in the late teens but rather a gradual shift leading up to that time. For comparison, the HHI of consular praenomina is 1152, which is off the charts compared to the Social Security data. Finally, let’s note that HHI and Gini agree that girls’ names show more entropy than boys’ names.
* I say male names because there were no female consuls. Roman women took the feminized form of their father’s clan name. Hence, most of the women of the Julio-Claudian dynasty were (by adoption) descendants of Gaius Julius Caesar and were named “Julia,” which is the feminine form of “Julius.”
| Gabriel |
There’s a lot of great research on names and I’ve been a big fan of it for years, although it’s hard to reconcile with my own favorite diffusion models, since names are a flow whereas the stuff I’m interested in generally concerns diffusion across a finite population.
Anyway, I was inspired to play with this data by two things in conversation. The one I’ll discuss today is that somebody repeated a story about a girl named “Lah-d,” which is pronounced “La dash da” since “the dash is not silent.”
This appears to be a slight variation on an existing apocryphal story, but it reflects three real social facts that are well documented in the name literature. First, black girls have the most eclectic names of any demographic group, with a high premium put on creativity and about 30% having unique names. Second, even when their names are unique coinages they still follow systematic rules, as with the characteristic prefix “La” and consonant pair “sh.” Third, these distinctly black names are an object of bewildered mockery (and a basis for exclusion) by others, which is the appeal in retelling this and other urban legends on the same theme.*
To tell if there was any evidence for this story I checked the Social Security data, but the web searchable interface only includes the top 1000 names per year. Thus checking on very rare names requires downloading the raw text files. There’s one file per year, but you can efficiently search all of them from the command line by going to the directory where you unzipped the archive and grepping.
cd ~/Downloads/names
grep '^Lah-d' *.txt
grep '^Lahd' *.txt
As you can see, this name does not appear anywhere in the data. Case closed? Well, there’s a slight caveat in that for privacy reasons the data only include names that occur at least five times in a given birth year. So while it includes rare names, it misses extremely rare names. For instance, you also get a big fat nothing if you do this search:
grep '^Reihan' *.txt
This despite the fact that I personally know an American named Reihan. (Actually I’ve never asked him to show me a photo ID so I should remain open to the possibility that “Reihan Salam” is just a memorable nom de plume and his birth certificate really says “Jason Miller” or “Brian Davis”).
For names that do meet the minimal threshold, though, you can use grep as the basis for a quick and dirty time series. To automate this I wrote a little Stata script called grepnames. To call it, you give it two arguments: the (case-sensitive) name you’re looking for and the directory where you put the name files. It gives you back a time series of how many births had that name.
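capture program drop grepnames
program define grepnames
	* arguments: 1 = case-sensitive name, 2 = directory holding the yobYYYY.txt files
	local name "`1'"
	local directory "`2'"
	tempfile namequery
	* grep returns lines of the form path/yobYYYY.txt:Name,Sex,Count
	shell grep -r '^`name'' "`directory'" > `namequery'
	insheet using `namequery', clear
	* pull the four-digit year and the name out of the filename:name field
	gen year=real(regexs(1)) if regexm(v1,"`directory'yob([0-9][0-9][0-9][0-9])\.txt")
	gen name=regexs(1) if regexm(v1,"`directory'yob[0-9][0-9][0-9][0-9]\.txt:(.+)")
	* grep matches prefixes, so drop longer names that merely start with the query
	keep if name=="`name'"
	ren v3 frequency
	ren v2 sex
	* years missing from the files (fewer than five births) count as zero
	fillin sex year
	recode frequency .=0
	sort year sex
	twoway (line frequency year if sex=="M") (line frequency year if sex=="F"), legend(order(1 "Male" 2 "Female")) title(`"Time Series for "`name'" by Birth Cohort"')
end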
grepnames Gabriel "/Users/rossman/Documents/codeandculture/names/"
Note that these numbers are not scaled for the size of the cohorts, either in reality or as observed by the Social Security Administration. (Their data is noticeably worse for cohorts prior to about 1920.) Still, it’s pretty obvious that my first name has grown more popular over time.
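If you did want to scale by cohort size, one rough approach is to divide each year’s count by the total recorded births of that sex in the same file. The sketch below is Python rather than Stata and is not part of grepnames; the function names are made up for illustration, and it assumes the files are named yobYYYY.txt with one `Name,Sex,Count` record per line. Note that the denominator only covers names that clear the five-birth threshold, so it is recorded births, not all births.

```python
import csv
from pathlib import Path


def name_share(lines, name, sex):
    """Return (births with this name, total recorded births) for one sex.

    `lines` is any iterable of 'Name,Sex,Count' strings, such as an open file.
    """
    hits = 0
    total = 0
    for row in csv.reader(lines):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        row_name, row_sex, count = row
        if row_sex != sex:
            continue
        total += int(count)
        if row_name == name:
            hits += int(count)
    return hits, total


def share_series(directory, name, sex):
    """Map each year to the name's share of recorded births of that sex."""
    series = {}
    for path in sorted(Path(directory).glob("yob*.txt")):
        year = int(path.stem[3:])  # "yob1880" -> 1880
        with path.open(newline="") as handle:
            hits, total = name_share(handle, name, sex)
        series[year] = hits / total if total else 0.0
    return series
```

Calling something like `share_series("names", "Gabriel", "M")` would then give a year-to-share dictionary that can be plotted in place of the raw counts, where `"names"` stands in for wherever you unzipped the archive.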
We can also replicate a classic example from Lieberson of a name that became less popular over time, for rather obvious reasons.
grepnames Adolph "/Users/rossman/Documents/codeandculture/names/"
Next time, how diverse are names over time with thoughts on entropy indices.
(Also see Jay’s thoughts on names, as well as taking inspiration from my book to apply Bass models to film box office).
* Yes, I know that one of those stories is true but the interesting thing is that people like to retell it (and do so with mocking commentary), not that the underlying incident is true. It is also true that yesterday I had eggs and coffee for breakfast, but nobody is likely to forward an e-mail to their friends repeating that particular banal but accurate nugget.