Archive for July, 2009

The death spiral

| Gabriel |

Two years ago EMI fired all its middle management. That was pretty ominous to cultural sociologists because research by Paul Lopes and Tim Dowd has established that record label middle management was the only thing that let the industry stay creative despite formal consolidation. Since then other labels, book publishers, and film studios have all similarly cut loose their semi-autonomous divisions aimed at niche tastes, instead centralizing management and focusing on mass-appeal, general-interest products.

This cost-cutting was apparently insufficient, as EMI is no longer doing business with mom-and-pop record stores. They actually told these stores to send someone to Best Buy to get CDs, bring them back to the store, and mark them up. The reason was that the transaction costs of processing these stores’ orders are just too high; EMI feels it’s cheaper to ship CDs to Walmart by the pallet-load than to rinky-dink independent stores one at a time. (This makes me wonder how CD Baby stays in business, as low-volume orders are its entire business model.)

Taking these two issues in combination means you can kiss the long tail goodbye, or more precisely, the big cultural firms are telling the long tail not to let the door hit its ass on the way out. Yes, we have enormous access to the depths of the catalog through iTunes and Amazon, but the big cultural firms are increasingly uninterested in producing that kind of material in the first place. We seem to be returning to an older cultural ecology, where the big firms focused on the cultural mainstream and left niche tastes to the little firms, rather than the roughly 1975-2007 cultural system in which the major firms coopted niche tastes within semi-autonomous internal fiefdoms.

———-update (8/3/09)———–

Also see this NYT op-ed on the decline of the music industry. The gist (that digital has not been kind to recorded music sales) is familiar, but the infographic is very good at illustrating it.


July 30, 2009 at 12:55 pm 1 comment

Priceless wisdom

| Gabriel |

I was double-checking a reference on Amazon and I learned that they are offering Tilly’s Durable Inequality for just $15,221.23. The good news is that it qualifies for super saver shipping.


July 30, 2009 at 1:08 am

Lights, cameras, corporate welfare!

| Gabriel |

The current episode of KCRW’s radio show The Business (or as I call it, Production of Culture: The Series) is all about state tax incentives for film production. They talk to Bill Gerber (a producer who chose to shoot in Michigan over Minnesota because of tax breaks) and host a debate between Steve D’Amico (a Massachusetts legislator who killed a local tax break proposal) and Cameron Henry (a Louisiana legislator who has helped build up his state’s program). The first thing to note is that Gerber tells us that he (a) shopped his production around between states and (b) imported 70% of the film’s workforce from Los Angeles. From that it’s pretty easy to infer that Representative D’Amico is right and Representative Henry is wrong about whether states benefit from production credits, but it’s worth unpacking why.

A few decades ago the Canadians wanted to create a jobs program and were scared of American cultural imperialism, so they fostered their domestic film and television industry with tax incentives. At the time (though no longer) the loonie was weak against the greenback, so between the tax breaks and the exchange rate a fair amount of production took a two-hour flight from LA to Vancouver. Although the commodity boom of 2007 hurt Vancouver production in the last few years (a case of Dutch disease), the city has developed a viable film industry. Largely imitating the Vancouver example, an increasing number of jurisdictions, including most American states, have created film incentives of their own. Of course this results in the situation seen in Gerber’s example, where states end up bidding against each other for one or two shoots, with none of them becoming the next Hollywood, or even the next Vancouver.

Throughout this, California has refused to match the incentives, and only in the last year has the state begun seriously considering offering film incentives of its own. In the meantime, although a lot of film production has left California, a lot of it has not. Why not is actually an interesting question, since it’s not like the film industry has a lot of big factories that would be difficult to move. We’re talking about a project-based industry with equipment that is routinely moved by truck on a daily basis, so you might expect that if some state offered tax breaks the industry would run for it. The answer is that as the incumbent, Los Angeles benefits from cluster economies and so it doesn’t have to match incentives to remain an attractive place to shoot. Some of this is simply because there are so many stage sets, subcontractors, skilled workers, etc., within an hour’s drive. (And it actually is fair to say “an hour” because location shoots often have really long days that start before the morning rush hour and end after the evening rush hour.) There’s also the fact that many elite cultural workers live in Los Angeles and prefer not to travel. Most famously, David Duchovny insisted that The X-Files relocate from Vancouver to LA because he wanted to spend more time with his family (why exactly his wife wanted him to stay home so badly became apparent a few years later). Of course it is itself endogenous that most of these elite workers live in Los Angeles. Give it a few years and you may find that elite workers are insisting that shooting stay not in Los Angeles, but in New Orleans, or Austin, or Vancouver, or wherever it may be that has become an attractive place for elite film workers to live because it has a critical mass of production (and perhaps personal income tax shelters).

Representative Henry’s favorite talking point on the program was that every dollar the program costs Louisiana is returned to the economy sevenfold. Although this sounds like pure magical thinking, it’s known in the policy literature as the “fiscal impact multiplier.” The idea is that if a local government subsidizes a farm growing magic beans, not only does the local economy get the payroll for the farmhands, but the farmhands buy services and (locally-produced) goods, so the local economy gets not just farmhand jobs but jobs serving farmhands. Plus, you may get tourists coming in to gawk at the magic beans, and the economy gets their tourism expenditures too. This kind of crap is endemic to policy discussions of things like sports stadiums, and academic economists mostly agree that it’s total flim-flam. The way it works is that somebody will propose that the local government use public funds to build a stadium, and instead of just saying “because I like sports,” they hire consultants to make a utilitarian case that we can’t afford not to subsidize sports. Typically the argument goes that Magic Beans Stadium will attract thousands of out-of-town visitors, none of whom would have visited anyway, none of whom are displacing other visitors, and all of whom stay at the Ritz and eat nothing but foie gras during their visit.

So here’s how it works for film. If a movie shoots in New Orleans there are basically two things that can occur. One is that they use local labor. However, because the volume of film production is extremely variable, these workers will have a lot of downtime between shoots, during which the local economy will be underutilizing skilled labor. The other is that the film producers can bring in labor from Los Angeles, who will take up an enormous amount of tourism resources (hotel rooms, air travel, restaurant meals, etc.), which are serviced by locals. If this tourist infrastructure is to always be available for visiting film productions, then it means that the rest of the time these hotel rooms, flights, etc., will be empty and you have a peak load problem. Now the peak load problem is one of hotel rooms (and hotel maids) rather than sound stages (and key assistant grips). Also note that there is a huge deadweight loss involved, since it’s cheaper and more comfortable for that key assistant grip to live with his wife in Santa Clarita and drive to shooting locations listed by their Thomas Guide grid coordinates than it is for him to fly to Detroit and stay in a motel for three months, go home for a few months, then fly to New Orleans and stay in a motel, etc. On the other hand, if the hotels are continuously full, taken up by tourists and non-film business travelers whenever there is no shooting, then the film workers are just displacing these other travelers. This is an important point because fiscal impact multipliers usually fail to factor in the (public and private) opportunity cost of developing the subsidized sector. The only real solution to the deadweight loss of the peak load problem is risk-pooling, but then you’re back to cluster economies and you could have stayed in Los Angeles to get that.

So, aside from simply being mistaken about the merits, why would a politician support tax breaks? Representative D’Amico’s theory is that they are mesmerized by the glamor of Hollywood. This is plausible, as Hollywood has always been able to attract (this is a technical term) “sucker money.” However I think equally important is the basic dynamic of public choice theory — note that Representative Henry has three film facilities in his district. The tax breaks may very well be a good thing for his district and yet still be a bad thing for both the state of Louisiana and America as a whole. On the other hand, Representative D’Amico seems to have been arguing against interest, as his district is near the proposed film facility in Plymouth and his constituency includes craftworker unions. You might be asking what my stake is in this. Given that I live in LA, it’s in my personal interest for the state of California to give tax breaks to Hollywood, as this represents a net transfer of wealth from Northern California to SoCal. Better yet for me if no jurisdiction offers tax incentives and cluster economy dynamics alone keep production in LA, so you can consider this post to itself be rent-seeking.

July 23, 2009 at 5:07 am

Merging Pajek vertices into Stata

| Gabriel |

Sometimes I use Pajek (or something that behaves similarly, like Mathematica or Network Workbench) to generate a variable which I then want to merge back into Stata. The problem is that the output requires a little cleaning: it’s not as if the first column is your “id” variable as it exists in Stata and the second column is the metric, such that you could just merge on “id.” Instead these programs tend to encode your Stata id variable, which means you have to merge twice: first to associate the Stata id variable with the Pajek id variable, and second to associate the new data with your main dataset.

So the first step is to create a merge file to associate the encoded key with the Stata id variable. You get this from the Pajek “.net” file (i.e., the data file). The first part of this file is the encoding of the nodes; the rest (which you don’t care about for these purposes) is the connections between these nodes. In other words you want to go from this:

*Vertices 3
1 "tom"
2 "dick"
3 "harry"
1 2
2 3

to this:

pajek_id	stata_id
1	tom
2	dick
3	harry

The thing that makes this a pain is that “.net” files are usually really big, so if you try to just select the “vertices” part of the file you may be holding down the mouse button for a really long time. My solution is to open the file in a text editor (I prefer TextWrangler for this) and put the cursor at the end of what I want. I then enter the regular expression search pattern “^.+$\r” (or “^.+$\n”) to be replaced with nothing, which has the effect of erasing everything after the cursor. Note that the search should start at the cursor and not wrap, so don’t check “start at top” or “wrap around.” You’ll then be left with just the labels, the edge list having been deleted. Another way to do it is to search the whole file and tell it to delete lines that do not include quote marks.

Having eliminated the edge list and kept only the encoding key, at this point you still need to get the vertex labels into a nice tab-delimited format, which is easily accomplished with this pattern.


Note the leading space in the search regular expression. Also note that if the labels have embedded spaces there should be quotes around \1 in the replacement regular expression.
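For those who prefer to script the cleaning instead of doing it interactively, both steps (keeping only the lines with quote marks and converting them to tab-delimited id/label pairs) can be sketched in Python. The function name and the exact regular expression here are my own assumptions, not the pattern from the post:

```python
import re

# A vertex line is an integer id followed by a quoted label, e.g. ' 1 "tom"'.
VERTEX = re.compile(r'^\s*(\d+)\s+"(.+)"')

def vertices_to_tsv(net_lines):
    """Keep only the vertex lines of a Pajek .net file and return
    tab-delimited (pajek_id, stata_id) rows with a header."""
    rows = ["pajek_id\tstata_id"]
    for line in net_lines:
        m = VERTEX.match(line)
        if m:  # the "*Vertices" header and the edge list don't match
            rows.append(m.group(1) + "\t" + m.group(2))
    return "\n".join(rows)

# The toy .net file from above:
net = ['*Vertices 3', '1 "tom"', '2 "dick"', '3 "harry"', '1 2', '2 3']
print(vertices_to_tsv(net))
```

The output can be saved as “pajekmerge.txt” and read into Stata with “insheet” as in the next step.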

Manually name the first column “pajek_id” and the second column “stata_id” (or better yet, whatever you call your id variable in Stata) and save the file as something like “pajekmerge.txt”. Now go to Stata and use “insheet,” “sort,” and “merge” to add the “pajek_id” variable into Stata. You’re now ready to import the foreign data. Use “insheet” to get it into Stata. Some of these programs include an id variable; if so, name it “pajek_id.” Others (eg Mathematica) don’t and just rely on ordering; if so, enter the command “gen mathematica_id=_n”. You’re now ready to merge the foreign data into Stata.
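The double-merge logic itself can be illustrated outside of Stata. Here is a hypothetical pandas sketch of the same two merges, with all variable names and values made up: first attach the Stata id to the Pajek output via the key file, then attach the new metric to the main dataset.

```python
import pandas as pd

# Made-up stand-ins for the three files involved.
key = pd.DataFrame({"pajek_id": [1, 2, 3],
                    "stata_id": ["tom", "dick", "harry"]})   # pajekmerge.txt
pajek_out = pd.DataFrame({"pajek_id": [1, 2, 3],
                          "centrality": [0.5, 0.9, 0.1]})    # Pajek output
main = pd.DataFrame({"stata_id": ["tom", "dick", "harry", "mo"],
                     "income": [10, 20, 30, 40]})            # main dataset

# First merge: associate the Stata id with the Pajek id.
pajek_out = pajek_out.merge(key, on="pajek_id")
# Second merge: associate the new metric with the main dataset.
# how="left" keeps cases (such as isolates) with no Pajek entry.
main = main.merge(pajek_out[["stata_id", "centrality"]],
                  on="stata_id", how="left")
```

The `how="left"` option plays the role of keeping unmatched master observations, as a Stata `merge` does by default.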

This is obviously a tricky process and there are a lot of stupid ways it could go wrong. Therefore it is absolutely imperative that you spot-check the results. There are usually some cases where you intuitively know about what the new metric should be. Likewise, you may have another variable native to your Stata dataset that should have a reasonably high (positive or negative) correlation with the new metric imported from Pajek. Check this correlation, as when things should be correlated but ain’t, it often means a merge error.

Note that it’s not necessarily a problem if some cases in your Stata dataset don’t have corresponding entries in your Pajek output. This is because isolates are often dropped from your Pajek data. However you should know who these isolates are and be able to spot-check that the right people are missing. If you’re doing an interlocking-board study and you see that an investment bank in your Stata data doesn’t appear in your Pajek data then you probably have a merge error.
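One cheap way to automate this spot-check is a merge diagnostic that lists exactly which cases failed to match; in pandas the `indicator` option does this (the data here are made up for illustration):

```python
import pandas as pd

main = pd.DataFrame({"stata_id": ["tom", "dick", "harry", "mo"]})
pajek = pd.DataFrame({"stata_id": ["tom", "dick", "harry"],
                      "degree": [2, 2, 1]})

merged = main.merge(pajek, on="stata_id", how="left", indicator=True)
# Cases present in the Stata data but absent from the Pajek output.
# Eyeball this list and make sure everyone on it is a plausible isolate.
unmatched = merged.loc[merged["_merge"] == "left_only", "stata_id"]
print(list(unmatched))
```

This is the same information Stata's `_merge` variable gives you after a `merge`.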

July 22, 2009 at 5:02 am 2 comments

The Institutional Logic of War

| Gabriel |

We lost Robert McNamara last week, although I was too busy to blog it at the time. (Organizations and Markets was more timely). McNamara began as an academic, educated at Berkeley and working at Harvard. During WWII he advised on strategic bombing and later on went to work at Ford, rising through the ranks until being tapped by JFK as Secretary of Defense. From there he went on to have an important role shaping the World Bank. McNamara was interesting in part for his particular accomplishments but mostly for what he represented as an elite moving between the heights of academia, military, industry, and international NGOs. He was the very best and very brightest of the best and the brightest; one of those Harvard faculty whose rule William F. Buckley judged inferior to persons drawn at random from the phone book; the mid-20th century technocratic liberal consensus made flesh and sent to dwell among us.

I highly recommend that any social scientist interested in organizations, elites, or politics rent the documentary The Fog of War, in which McNamara describes his experiences in the 40s-60s at Harvard, the Air Corps, Ford, and Defense. In all of these areas McNamara rigorously applied the methods of systems analysis and operations research, treating organizational problems as technical engineering problems where measurement and mathematical analysis are the tasks of the manager. The film is organized as a series of lessons he offers drawn from his own life, but I suggest that the really interesting way to view the film is to treat his experience as data but get some critical distance from his own interpretation of it.

McNamara is thoroughly penitent, almost flagellantly so, about the moral aspects of his decisions. However he is thoroughly convinced of the technical wisdom of his decisions. In particular, his position on the Viet Nam war seems to have been something like “if I couldn’t have won that war then nobody could have and since I couldn’t win it then I should have put my foot down and insisted that we not get into the war in the first place.” At no point does he consider the argument, even for purposes of rebutting it, that it might have been smarter to treat the war as a counter-insurgency campaign waged over the control of territory and populations rather than a war of attrition waged over resources.

Consider two books that together make this argument, one of them by a sociologist (and a dove) and the other by a historian (and a hawk). James W. Gibson’s The Perfect War: Technowar in Vietnam describes the McNamara/Westmoreland phase of the Viet Nam war in all of its bureaucratic absurdity, which reminds me of nothing so much as the Jonathan Pryce character in the Adventures of Baron Munchausen. Hence the notorious obsession with body counts, or better yet, ratios of body counts. However it was worse than that. For instance, one infantry division ran a scored competition where the junior officers competed on the basis of a formula where (among other things) a captured enemy machinegun or mortar counted for 100 points, a captured enemy tactical radio 200 points, and a wounded American negative fifty points.

The Pentagon Papers reveal an obsession with using strategic bombing to disrupt North Viet Namese imports and supply lines from the North to forward NVA units and the VC. These documents show formulas demonstrating that if the enemy consumed so many tons of supplies per day, and we disrupted their supply lines such that they could not ship a quantity equal to or greater than that amount, then mathematically the enemy eventually had to lose. However the theory was that long before such attrition reached its actual mathematical solution, the enemy would extrapolate the function and surrender. Gibson writes that “For the Americans, the question became one of estimating when the foreign Other would recognize U.S. technical superiority and consequently give up all hopes of victory.”

A key component of this was aerial bombing, which was not so much tactical or strategic as symbolic. In WWII we leveled entire German and Japanese cities with aerial bombing, but we dropped an even larger quantity of bombs in Indochina — mostly in wilderness and rural areas (often leafleted in advance with evacuation warnings). In theory this bombing was targeted at enemy units and supply lines, but Gibson argues that we often had no real idea what was below the canopy and were basically just blowing up huge batches of countryside (and more than a few peasants) for symbolic effect, to send what he sarcastically calls “bomb-o-grams,” reminding the enemy in case he had forgotten that, yes, the American military was indeed backed up by the largest economy in the history of the world.

Gibson does not explicitly limit his critique of the war to the McNamara/Westmoreland era and I think it’s fair to say that he’d probably agree with McNamara’s recent assessment of the war as intrinsically immoral and unwinnable, but I think it’s still telling that his attention focuses on 1965-1968. For the last phase of the war you can consider Lewis Sorley’s A Better War, which describes Abrams’ use of more traditional counter-insurgency doctrine and contrasts it favorably with the McNamara/Westmoreland strategy that Gibson calls “technowar.” Sorley argues that Abrams abandoned the use of hyper-rationality, attrition, etc. and emphasized controlling populations and controlling territory.

You needn’t go so far as to agree with Sorley that the Abrams strategy was winnable to say that it was more suited (or perhaps, less poorly suited) to the war than the McNamara strategy of treating Viet Nam like an assembly line at Ford or an operations case study at Harvard Business School. Assuming this mismatch is an accurate description, then the interesting thing is the transposition of an institutional logic of bureaucratic rationality from some domains where it works pretty well (academia, manufacturing, bombing Japan into the stone age, nuclear deterrence) to a domain where it is just flat out absurd (counter-insurgency).

Furthermore there is the issue that even outside of a comparative context, military doctrine is just plain interesting. Ian Roxborough had a 2004 review essay in Sociological Forum in which he argues that sociologists do not give enough attention to the military, and when we do we often treat war as an extremely abstract macro-historical problem or (as the DOD itself likes to hire us to do) as an organization with human resources issues. What we are less likely to do is focus on the main thing that the military exists to do, that is, use organized violence to coerce the enemy to our will. This is a key issue for organizational theory, and especially for institutionalism, because making war is organized around doctrine. The McNamara case described above shows the failure of importing doctrine to a situation where it really doesn’t work, but there are other interesting issues as well. For instance, how did the concerns of internal stakeholders promote the development of strategic air doctrine before WWII? Why was the Air Force so much more interested in the fact that nuclear bombs blow stuff up than in the fact that they also create firestorms? Why did the admiralty of the Royal Navy resist innovations like adapting gunnery to the waves that junior officers had developed? All of these are questions of military doctrine that are best understood by understanding the military as an organization that is subject to all the other pressures, constraints, and bounded rationality of any other organization.

July 20, 2009 at 5:21 am

Moonshine by 2nd Day Air

| Gabriel |

Jonathan Adler @ Volokh relays a very interesting and somewhat unpleasant story of how the regulatory sausage gets made that reveals breathtaking cynicism from several parties. Basically, UPS is subject to a more pro-union regulatory jurisdiction than FedEx and so UPS is cooperating with the unions to try to push FedEx into the same regulatory jurisdiction. Assuming for the sake of argument that pro-union regulations are a good thing, this is a classic “baptist and bootlegger” coalition in which an ideologically motivated party (pro-labor people) ally with a selfishly interested party (UPS) to reach a common end (unionizing FedEx, and thereby raising its prices to be less attractive relative to UPS).

Here’s where it gets ugly. The American Conservative Union sent a letter to FedEx offering to help defend FedEx for $2 million. Fair enough, the ACU probably isn’t swimming in fungible funds and it’s trying to create a baptist and bootlegger coalition of its own between its own open shop ideological interests and FedEx’s selfish interests in keeping its labor costs low. What’s nasty is that when FedEx declined the offer, the ACU turned around and publicly sided with UPS. That is, the ACU wasn’t merely proposing an alliance of convenience on an issue on which it had a pre-existing principled position, but it was basically auctioning its services. This shouldn’t be surprising after the Abramoff scandal, which at its core was about the religious right soliciting bribes from gambling interests to freeze out competition, but it’s still amazing to see the “baptists” acting as cravenly as the “bootleggers.”

July 17, 2009 at 12:30 pm 1 comment

To the philosopher equally false

| Gabriel |

Mark Kleiman and Robert Wright posted a bloggingheads diavlog (here’s the mp3 link*). In it Mark describes the UCLA faculty Tanakh discussion group and I can confirm that it’s exactly as he describes it and is really good. Although I haven’t actually attended in a few years I enjoyed it very much when I did and since then I have followed it vicariously through the excellent set of notes that Mark circulates every week.

Since Wright is obsessed with the evolution of cooperation, and his new book is about the social contingencies of religion supporting inter-group cooperation, Wright and Kleiman share a few thoughts on the “intolerant monotheism” thesis. This reminded me of Gibbon’s Decline and Fall of the Roman Empire. The thesis of the book is that Rome was destroyed by “immoderate greatness” and “superstition” (read: Christianity). The latter is often interpreted by people who haven’t read the book as meaning that Gibbon is arguing that the Christianized Romans took all that “turn the other cheek” stuff seriously and became a bunch of pussies. Of course, Gibbon wasn’t that stupid and was well aware that, for instance, the (Christian) Byzantine emperors especially weren’t shy about having their rivals murdered or blinded. What he was really arguing was that Christianity is a religion of orthodoxy, which implies conflict with heretics. Indeed, Constantine had scarcely legalized Christianity when bishops started asking him to take sides in various theological disputes. In contrast the concept of “heresy” was absurd to the pagan Roman mind. The pagan Romans acknowledged different versions of myth and ceremony, but they just kind of bracketed them and moved on as being kind of the same thing, kind of different, but who cares, we’ll do it both ways if we have to.

To this day the parishioners at a Catholic mass still recite “Lord Jesus Christ … begotten not made, being of one substance with the Father” and “we acknowledge one baptism for the forgiveness of sins.” Most of them don’t know that these two clauses are references to extremely violent 4th century church controversies.

The “begotten” phrase is part of the perennially controversial “Christological” question as to exactly what sort of entity Jesus was. The orthodox answer is, as it says in John, “In the beginning was the Word, and the Word was with God, and the Word was God. … The Word became flesh and made his dwelling among us.” Among the many heretical answers are the good and abstract god as compared to the evil and material father (Gnosticism), a single person with the father (Unitarianism), and a subordinate spiritual entity created by the father (Arianism). The last of these in particular caused a lot of trouble, as Arianist missionaries got to the various German tribes before the Catholic Church did, and the fact that the German foederati were heretics created all sorts of headaches for Roman diplomacy for centuries.

The “one baptism” language is mostly a reference to the Donatist controversy. During the Diocletian persecution there were some very famous martyrs but a much larger number of collaborators. After the Edict of Milan the official policy of the church was amnesty, but the followers of the bishop Donatus disagreed and did things like trying to impeach collaborator bishops. Long story short, the legions marched through the province of Africa massacring Donatists but even a century later they were still a problem for Augustine.

Gibbon’s thesis as to Christianity is thus that religion created a source of cleavage within the empire. Of course this can’t be the whole story because even before the birth of Christ, Rome saw plenty of civil wars and secession movements brought on by such fractious figures as: Tiberius and Gaius Gracchus, Marius, Sulla, Quintus Sertorius, Pompey Magnus, Julius Caesar, Marcus Brutus, Mark Antony, and Octavian. Likewise in the Kleiman-Wright conversation Wright flat out asserts that “wars of religion aren’t really about religion.” I half agree.

On the one hand, Rome was almost always a fractious place — especially in the century immediately preceding Christianization. Nonetheless, I don’t think it’s fair to say that the interjection of accusations of “heresy” was only super-structure or window-dressing. Basically I think that a cognitive toolkit approach is a useful way to approach the issue. Roman civil wars continued to have terrestrial motives, sometimes ethnic/provincial separatism and other times the personal ambitions of usurpers. Nonetheless, the accusation of heresy (or as the heretics themselves might put it, a new conception of orthodoxy) provided an ideological rallying point for faction. I think it’s telling to compare the accounts of pre-Christian strife in Plutarch with those of Christian strife in Eusebius or Augustine. The civil wars of the (pagan) late Republic and the principate were almost exclusively about the personal ambitions of noblemen, with their followers mostly being organized around a mix of patron-client ties and social class interests (I think the easiest way to understand Julius Caesar is to imagine Hugo Chavez in a toga). In contrast, the religious wars of the dominate were less personality-driven and more ideological and ethnic in character, often providing a unifying agenda to revolt of the sort that is very recognizable to us modern people used to ideological wars between, say, fascists and communists or anti-colonial wars of national liberation. Note that Arius himself was an Egyptian and the Germans he converted were across the frontier. The Donatists were mostly Berbers. Thus underlying the theological disputes about Christology or reconciliation were essentially political differences. However, this is not to say that the theological disputes did not matter: they provided an ideological tool for framing the struggle in a way that changed its character.

Likewise you see similar issues at play in early modern history. For instance, on one level the English civil war was about social class, with the emerging middle class opposed to the aristocracy, while on another it was about Calvinism versus high church Anglicanism. I think it’s fair to say that the roundheads would not have been nearly so tenacious if the war were only about the power of parliament versus that of the king. Calvinism served as an organizing toolkit to impose sense on the underlying class and political issues in a way that changed the character of those issues.


*I love text because I can skim it, and I love audio because it’s conducive to multi-tasking (while driving, etc), but I really don’t get the point of videos where the visual element adds essentially no entertainment or information. I can’t even imagine having so much time (or attention span) that I’d sit in front of my computer staring at a lo-resolution image of a couple of bush-league pundits for an hour. As far as I’m concerned it could be instead of

July 17, 2009 at 11:13 am 1 comment

News diffusion

| Gabriel |

The New York Times and Slashdot both have stories on an ambitious paper by a team of computer scientists that studies the diffusion of news stories. The website for the paper and supplementary materials (including the cleaned dataset) is at

The most impressive thing about the project is the data collection / cleaning effort. The team scraped basically all of the mainstream media and blogosphere for the last three months of the 2008 election. They then used a fuzzy algorithm to identify temporally-unusual phrases, so “palling around with terrorists” would count whereas “in a press release” would not. What’s really impressive is that they not only identify exact phrases (which is pretty easy to code) but also paraphrases (which are really hard to code). For instance, they identify about 40 versions of Sarah Palin’s “palling around with terrorists” characterization of Barack Obama’s relationship with Bill Ayers. They then identify and time-stamp every usage of each phrase in their scraped data. The dataset is human-readable and is arranged as a three-level hierarchy of time-stamped news items within specific phrasings within broad phrasings. This nesting of paraphrases within general phrases goes a long way towards solving the problem of “reinvention,” which might otherwise obscure that several “different” phrases are really only minimally distinct versions of the same phrase. Here’s a sample of the dataset:

  2  8  we're not commenting on that story i'm afraid   2131865
     3  3  we're not commenting on that    489007
        2008-08-18 14:23:05  1  M
        2008-11-26 01:27:13  1  B
        2008-11-27 18:55:30  1  B
     5  2  we're not commenting on that story      2131864
        2008-12-08 14:50:18  3  B
        2008-12-08 19:35:31  2  B

Their analysis was also very good (but nowhere near as amazing as the cleaning effort). Basically their findings were entirely consistent with the diffusion literature. They found that the popularity of different phrases followed a power law and the distribution of new mentions of a phrase followed a bell curve (which is equivalent to saying that the cumulative mentions of a phrase follow an s-curve). Both of these findings are consistent with a cumulative advantage process, and indeed, they model the process as a tension between “imitation” and “recency.”

This “two forces in tension” thing is typical of many endogenous models. It’s actually very easy to figure out a model that results in a stable equilibrium of “everything” or “nothing,” for instance the Schelling segregation model. However, it’s much harder to work out a model that has a more moderate equilibrium. So in cumulative advantage models like this, the trick is to explain why popularity is “only” described by a power-law when it’s easier to see how it could be described by a step function (one target gets all the popularity, everyone else gets absolutely none). Because the memetracker data has a temporal element they use a time decay function. Other similar models have used things like reciprocity (Gould), cost of linking (Podolny), and heterogeneity in taste (Rosen).
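The imitation-versus-recency tension is easy to see in a toy simulation. This is a deliberately simplified sketch, not the paper’s actual model: each new mention either coins a fresh phrase or imitates an existing one in proportion to its past mentions, discounted by how long ago it was last used (the `innovate` and `decay` parameters are invented for illustration).

```python
import random
from collections import Counter

def simulate(steps, innovate=0.05, decay=0.99, seed=42):
    """Toy cumulative-advantage process with a recency discount."""
    rng = random.Random(seed)
    counts = Counter()   # total mentions per phrase
    last_seen = {}       # time of each phrase's most recent mention
    for t in range(steps):
        if not counts or rng.random() < innovate:
            phrase = len(counts)  # coin a brand-new phrase
        else:
            # imitation weight = past popularity x recency decay
            weights = [counts[p] * decay ** (t - last_seen[p])
                       for p in counts]
            phrase = rng.choices(list(counts), weights)[0]
        counts[phrase] += 1
        last_seen[phrase] = t
    return counts
```

Without the decay term, early phrases never stop compounding their lead; with it, popularity comes out unequal but not winner-take-all, which is the qualitative point of the tension.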

In addition to “imitation” and “recency,” they also note that some kind of intrinsic “attractiveness” of the phrase might be an issue, though they bracket this largely because it would require a lot of hard human content analysis. From perusing the most popular stories, my guess is that the ideal phrase, from a newsworthiness perspective, is an ambiguous but potentially inflammatory quote from a prominent person describing another prominent person that should be comprehensible to someone with minimal background knowledge. So something like “lipstick on a pig” (which Barack Obama said about part of the McCain-Palin platform, though it was often interpreted as being a personal insult to Sarah Palin) is just about perfect in all these respects. Slightly less wonderful is the “palling around with terrorists” quote, because this requires some background knowledge about Bill Ayers and the Weather Underground (or, under the dog whistle theory of this quote, comparable familiarity with the “secret Muslim” theory).

The most novel finding was that most of the action for any given phrase occurs within an eight hour window more or less symmetrically centered around the peak popularity, and indeed they describe the process as being qualitatively different (for one thing, it’s more sparse) outside of this window than it is within it. This struck me as a place where there could be some profit in treating the issue not just as a disembodied complexity science problem, but as the result of a particular social process involving real human beings about whom we know something. One of the things we know about people is that we tend to go to work during the day, usually for eight hours, and this characterizes not just journalists but many of the most important political bloggers. One simple prediction based on this is that the eight hours should be similar for most innovations, my guess being roughly 8am to 4pm east coast time. (Why, you might ask, don’t I just test this myself given that they make their data available? Because the files are too big to open with a text editor and it would take me hours to figure out a way to cull out the aspects of the data that aren’t relevant to this purpose, though someone who was good at “sed” or “awk” could probably write a script to do it in five minutes.)
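For what it’s worth, here is roughly what that five-minute script would look like, sketched in Python rather than sed/awk so it can stream an arbitrarily large file without ever opening it in an editor. The timestamp layout is assumed from the sample shown above; mention counts are tallied by hour of day.

```python
import re
from collections import Counter

# Timestamp lines in the sample look like "2008-08-18 14:23:05  1  M";
# capture the hour and the mention count that follows the timestamp.
TS = re.compile(r'^\s+\d{4}-\d{2}-\d{2} (\d{2}):\d{2}:\d{2}\s+(\d+)')

def mentions_by_hour(lines):
    """Tally total mentions per hour of day, streaming line by line."""
    hours = Counter()
    for line in lines:
        m = TS.match(line)
        if m:
            hours[int(m.group(1))] += int(m.group(2))
    return hours

# with open("quotes_memetracker.txt") as f:   # filename hypothetical
#     print(mentions_by_hour(f).most_common())
```

If the workday story is right, the resulting tally should bulge between roughly 8am and 4pm eastern.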

While I think the team made excellent operational decisions, these decisions may nonetheless imply (probably small) biases. The team acknowledges that catch phrases are only somewhat intrinsically interesting, but they are mostly using them as a proxy for something even more interesting, what you might call “stories” or “ideas.” This probably has the effect of giving the study a disproportionate emphasis on gotchas and insults like “lipstick on a pig” or “palling around with terrorists” rather than on more complex ideas, which may be less likely to be consistently described with the same or similar strings of words (though even very complex ideas eventually develop a shorthand, as with “cap and trade”). Similarly, the algorithm itself selects temporally unusual phrases, which may imply a selection for short-run issues rather than perennial debates.

Overall, a very impressive paper that’s well worth reading for anyone interested in diffusion, news, blogs, or just really high quality database work. Even better is that they provide their data (which as of now they are continuing to update) and a detailed description of how they processed it, thereby providing a platform on which other people can build, perhaps by focusing on more substantive concerns.

July 14, 2009 at 5:47 pm 2 comments

Friending race

| Gabriel |

Noah pointed out to me that some of Eszter’s work got a plug in the NYT. In her UIC freshmen survey, she found that Hispanics were still mostly on MySpace but others had mostly moved to Facebook. The argument for this differential shift by race is that these kinds of things benefit tremendously from network externalities and since underlying social networks are segregated, the social network websites come to reflect this.

Although the NYT article mentions “white flight,” mostly in the context of discussing another researcher, this characterization doesn’t seem exactly right to me, either empirically or theoretically. Eszter’s work shows that blacks have mostly moved to Facebook, but in most types of interaction (residential segregation, marriage, etc.) Anglo whites are more likely to associate with Hispanics than with blacks. Likewise, in classic white flight models whites are fleeing the presence of blacks, but what seems to be going on here is that whites are drawn by other whites (or more specifically, by their friends, who are mostly white). Unlike housing, where you know who your neighbors are, on a social networking site you only associate with the people you choose. In other words, it’s the pull of being drawn by your friends, not the push of avoiding people you look down upon. It’s interesting to contrast the types of differences (I hesitate to use the word “segregation”) that can result entirely from the pull of homophily rather than the push of heteroantipathy (is that a word?).

July 10, 2009 at 3:19 pm 1 comment


Continuing my discussion of Long’s Workflow

| Gabriel |

One of the things that Long is appropriately insistent on is good archiving for the long term. First, he notes that the most serious issue is the physical storage medium and the need to migrate data whenever you get a new system, given that even media that were popular in recent memory, like Zip disks and 3.5-inch floppies, are now almost impossible to find hardware for. I think in the future this should become easier as hard drives get so ginormous that it’s increasingly feasible to keep everything on one disk rather than pushing your archives to removable media that can get lost. When it’s all on one (or a few) internal disks you tend to migrate and back up, unlike removable media that get lost in your file cabinet until they are obsolete and/or physically corroded. Of course, in those increasingly rare instances where IRB or proprietary data provision issues are not a concern, the best way to handle this is to use ICPSR, CPANDA, or a similar public archive.

Even if you can access the files, the question is whether you can read them. Long appropriately stresses keeping data in several formats, but I think he’s a bit too agnostic about which formats are likely to last. As I see it there are basically two issues: popularity and opacity. Popularity is simply how many people use the format. For this reason Long endorses SAS Transport because it’s the official format of the FDA. However, Long overlooks the other key issue of opacity, which basically comes down to the two related issues of being proprietary and being binary (as compared to text).

The more popular and the less opaque a format is, the more likely it is that you’ll be able to read your data in the future. So looking into my crystal ball twenty years or so ahead, I think it’s fair to guess that Stata binary will not be readable with any ease and that uncompressed tab-delimited ASCII will remain the lingua franca of data. I say tab-delimited instead of fixed-width because dictionary files get lost, tab-delimited instead of csv because embedded literal commas are common whereas embedded literal tabs are virtually nonexistent, and uncompressed because compressed files are more vulnerable to corruption.
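To make the tab-versus-comma point concrete, here is a minimal sketch using Python’s standard csv module (the two-row dataset is made up): a field with embedded literal commas round-trips through tab-delimited text with no quoting at all, which is exactly the property you want in an archival format.

```python
import csv
import io

# Hypothetical two-row dataset; the artist field contains literal commas.
rows = [["id", "artist", "title"],
        ["1", "Crosby, Stills & Nash", "Helplessly Hoping"]]

buf = io.StringIO()
csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)

# Because the delimiter is a tab, the embedded commas need no quoting,
# and the file reads back exactly as written.
recovered = list(csv.reader(io.StringIO(buf.getvalue()), delimiter="\t"))
```

Had the delimiter been a comma, the artist field would have had to be quoted, and a future parser would need to understand that quoting convention to recover the data.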

The problem with ASCII is that if (like Long) you find value labels and variable labels to be crucial then ASCII loses a lot of value. I think a good compromise is the Stata XML format. As you can see by opening it in a text editor, XML is human-readable text so even if no off-the-shelf import filters exist (which is unlikely as XML is increasingly the standard) you could with relatively little effort/cost write a filter yourself in a text-processing language like perl — or whatever the equivalent of perl will be in a generation.
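As a rough illustration of how little effort such a filter takes, here is a sketch with Python’s standard library parser. The element names below are invented for the example, not Stata’s actual XML schema; the point is just that any human-readable XML layout yields to a few lines of code.

```python
import xml.etree.ElementTree as ET

# A made-up XML dump of a tiny dataset (tag names are illustrative only).
doc = """<dataset>
  <obs><var name="year">1975</var><var name="sales">12</var></obs>
  <obs><var name="year">2007</var><var name="sales">7</var></obs>
</dataset>"""

# Recover each observation as a {variable name: value} dictionary.
rows = [{v.get("name"): v.text for v in obs}
        for obs in ET.fromstring(doc).iter("obs")]
```

Variable and value labels could be carried the same way, as additional elements or attributes, which is what makes XML a better archival compromise than bare ASCII.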

Because it’s smaller and faster than XML, I still use Stata binary for day to day usage but I’m going to make a point of periodically making uncompressed XML archives, especially when I finish a project.

July 1, 2009 at 5:01 am 2 comments
