Now These Are the Names, Pt 2

August 23, 2012 at 6:27 am 2 comments

| Gabriel |

Last time we talked about Social Security names data and how to query particular names. Today I want to talk about the big picture of the variety of names and the question of whether names are getting more diverse over time. This is basically a question of entropy, so let’s do a time trend of the Gini by cohort.

To me, the main thing the graph demonstrates is how sensitive these things are to measurement, to the extent that I’m unwilling to make substantive claims without better understanding the history of how the data were collected. As you can see, there’s a huge jump in the Gini starting with birth cohorts dating to WWI. This predates the enactment of Social Security (which I’ve drawn as a black vertical line) by about twenty years, so my best guess is that it corresponds to cohorts that were aging into the labor force around the time the act passed. Alternatively, it could be something about WWI accelerating assimilation in naming practices, but when I see a sharp discontinuity like that my instincts tell me it’s a measurement artifact, not a real social change.

Putting that aside, let’s think about the index itself. The Gini coefficient was developed to study social inequality and as such it’s sensitive to both the top and the bottom. Gini is basically a better version of taking the ratio of a high percentile and a low percentile. If you have exactly two people with exactly equal wealth (or exactly two names with equal numbers of babies) then you’d have the lowest possible Gini, which is zero.
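To make the index concrete, here is a minimal sketch of a Gini calculation in Python. It is an illustration under stated assumptions, not necessarily the code behind the graphs in this post: the function name `gini` and the toy counts are hypothetical, and the formula is the common rank-based one with no small-sample correction.

```python
def gini(counts):
    """Gini coefficient of a list of non-negative counts (e.g. babies per name).

    Uses the rank-based formula on the sorted values:
        G = 2 * sum(rank_i * x_i) / (n * sum(x)) - (n + 1) / n
    with no small-sample correction (an assumption of this sketch).
    """
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive count")
    # Rank each value from 1 (smallest) to n (largest) and weight by rank.
    ranked = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * ranked / (n * total) - (n + 1) / n


# Toy data, made up for illustration only.
two_equal_names = [500, 500]
one_dominant_name = [0, 1000]

print(gini(two_equal_names))    # perfectly equal split
print(gini(one_dominant_name))  # everything held by one name
```

With only two names the index can never exceed 0.5 under this formula, which is one reason the number of categories matters when comparing Gini values across very different naming systems.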

Two names sounds ridiculous but not as much as you’d think. Consider Republican era Rome. We have a pretty good idea of Roman names, at least in the upper classes, because they kept lists called “fasti consulares” of every man who served as consul. I previously did a post showing how a few clans dominated, but for today I want to just use these lists to show how few first names there were. These lists show only 29 male first names, of which only 17 were popular.* (In contrast, the Social Security data lists thousands of male names in circulation in any given year.) The Gini coefficient for praenomen on Republican fasti consulares is .72, which is not that far below the pre-1910 Social Security data. If you’re wondering, the most popular praenomen were Lucius, Gaius, Marcus, and Quintus. Here’s a kernel density plot for praenomen frequency.

As you can see, Roman names follow a count distribution, but it’s not ridiculously steep like American names in any arbitrary year (like this graph of 1920). The fact that 29 names following a fairly shallow distribution could show a comparable Gini to thousands of names following an extremely steep one suggests to me that there is something unsatisfying about the metric for our purposes.

Another entropy index we can use is the Herfindahl-Hirschman Index (HHI). HHI is meant to measure the potential for monopolies and cartels and as such it’s only really sensitive to the top. HHI is basically a better version of taking the share held by the top-4 (or top-8 or top-k) actors in the system. If you have exactly two people with exactly equal wealth (or exactly two names with equal numbers of babies) then you’d have a very high HHI (5,000 on the usual 0 to 10,000 scale).
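For comparison, a minimal sketch of HHI in Python. It assumes the conventional definition, the sum of squared percentage shares on a 0 to 10,000 scale, which matches the magnitude of the HHI figures quoted in this post; the function name `hhi` and the toy counts are hypothetical.

```python
def hhi(counts):
    """Herfindahl-Hirschman Index: sum of squared percentage shares.

    Ranges from near 0 (many tiny shares) to 10,000 (a single actor
    holds everything). The 0 to 10,000 scaling is an assumption here.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("need at least one positive count")
    return sum((100 * c / total) ** 2 for c in counts)


# Toy data, made up for illustration only.
print(hhi([500, 500]))   # two names with equal numbers of babies
print(hhi([10] * 100))   # one hundred equally popular names
```

Because each share is squared, small shares contribute almost nothing, so the index is driven by whatever sits at the top of the distribution.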

A thought experiment that reveals the difference between Gini and HHI is that if the United States were to suddenly add a few million desperately poor people, for instance by annexing Haiti, this would barely change our income HHI but it would drive our income Gini up appreciably. Nonetheless, under a wide range of circumstances the Gini and HHI will be correlated, since both measure inequality; they just have different emphases.
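The thought experiment can be checked with a toy calculation. The sketch below is illustrative only: the incomes are made-up numbers, the helper names `gini` and `hhi` are hypothetical, and the formulas assumed are the rank-based Gini without small-sample correction and HHI as the sum of squared percentage shares.

```python
def gini(values):
    # Rank-based Gini on sorted values, no small-sample correction.
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    ranked = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * ranked / (n * total) - (n + 1) / n


def hhi(values):
    # Sum of squared percentage shares (0 to 10,000 scale).
    total = sum(values)
    return sum((100 * v / total) ** 2 for v in values)


# Made-up incomes: an existing population, then the same population
# plus two newcomers with almost no income.
before = [20, 40, 60, 80, 100]
after = before + [1, 1]

for label, incomes in (("before", before), ("after", after)):
    print(f"{label}: gini={gini(incomes):.3f} hhi={hhi(incomes):.1f}")
```

In this toy case the Gini rises from roughly 0.27 to roughly 0.47, while HHI moves by a little over one percent, because the newcomers add almost nothing to the total and their squared shares are negligible.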

In the case of names, HHI will capture the dominance of stock names like “Jake” and “Mary” whereas Gini is better at capturing how common weird names are. So that said, let’s do the time trend again, but this time with HHI.

It’s very interesting that we now don’t see a precipitous change in the late teens but rather a gradual shift leading up to that time. For comparison, the HHI of consular praenomen is 1152, which is off the charts compared to the Social Security data. Finally, let’s note that HHI and Gini agree that girls’ names show more entropy than boys’ names.

* I say male names because there were no female consuls. Roman women took the feminized form of their father’s clan name. Hence, most of the women of the Julio-Claudian dynasty were (by adoption) descendants of Gaius Julius Caesar and were named “Julia,” which is the feminine form of “Julius.”

  • 1. Jon Hersh  |  August 28, 2012 at 7:06 am

    Fantastic post! I liked the discussion of comparing the Gini and HHI metrics. Migration inflows vary widely during this period, both in magnitude and country of origin. How do those facts affect the above? How much of the trend is a result of changing tastes versus compositional effects? What would be a good way to decompose the two?

    • 2. gabrielrossman  |  August 28, 2012 at 8:33 am

      I think the first step is to factor out the measurement effects of Social Security itself. This strikes me as nontrivial, but let’s assume we can do it and then move on to your question. Unfortunately we don’t have child name by personal or parental nativity in this dataset, but you could probably find it in other datasets. For example, both Lieberson and Fryer+Levitt used birth certificate data that included ethnicity. Even within just the Social Security data, you can break it down by both state and year, which you could combine with Census data to get a sense at the ecological level. I think you’d like Lieberson’s work, both the book and the associated articles, as he makes some pretty good stabs at answering these questions. Also see Sue and Telles 2007.

