Gabriel
The New York Times and Slashdot both have stories on an ambitious paper by a team of computer scientists that studies the diffusion of news stories. The website for the paper and supplementary materials (including the cleaned dataset) is at Memetracker.org.
The most impressive thing about the project is the data collection / cleaning effort. The team scraped basically all of the mainstream media and blogosphere for the last three months of the 2008 election. They then used a fuzzy algorithm to identify temporally unusual phrases, so “palling around with terrorists” would count whereas “in a press release” would not. What’s really impressive is that they identify not only exact phrases (which is pretty easy to code) but also paraphrases (which is really hard to code). For instance, they identify about 40 versions of Sarah Palin’s “palling around with terrorists” characterization of Barack Obama’s relationship with Bill Ayers. They then identify and time-stamp every usage of each phrase in their scraped data. The dataset is human-readable and is arranged as a three-level hierarchy of time-stamped news items within specific phrasings within broad phrasings. This nesting of paraphrases within general phrases goes a long way towards solving the problem of “reinvention,” which might otherwise obscure that several “different” phrases are really only minimally distinct versions of the same phrase. Here’s a sample of the dataset:
2 8 we're not commenting on that story i'm afraid 2131865
    3 3 we're not commenting on that 489007
        2008-08-18 14:23:05 1 M http://business.theage.com.au/business/bb-chief-set-to-walk-plank-20080818-3xp7.html
        2008-11-26 01:27:13 1 B http://sfweekly.com/2008-11-26/news/buy-line
        2008-11-27 18:55:30 1 B http://aconstantineblacklist.blogspot.com/2008/11/re-researcher-matt-janovic.html
    5 2 we're not commenting on that story 2131864
        2008-12-08 14:50:18 3 B http://videogaming247.com/2008/12/08/home-in-10-days-were-not-commenting-on-that-story-says-scee
        2008-12-08 19:35:31 2 B http://jplaystation.com/2008/12/08/home-in-10-days-were-not-commenting-on-that-story-says-scee
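To make the three-level hierarchy concrete, here is a minimal Python sketch of a reader for records like the sample. It rests on assumptions inferred from the sample rather than from the project's documentation: that the file has one record per line, that nested levels are indented, that unindented lines are broad phrasings, that indented lines starting with two numbers are specific phrasings, and that lines starting with a timestamp are individual news items. The class and field names are invented for illustration, and "M" and "B" presumably mark mainstream media and blogs.

```python
import re
from dataclasses import dataclass, field

# Assumed item line: "<timestamp> <count> <M|B> <url>", indented.
MENTION = re.compile(
    r"^\s+(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(\d+)\s+([MB])\s+(\S+)\s*$"
)
# Assumed phrase line: "<number> <number> <phrase text> <id>".
PHRASE = re.compile(r"^(\s*)(\d+)\s+(\d+)\s+(.+?)\s+(\d+)\s*$")


@dataclass
class Mention:
    timestamp: str
    count: int
    source_type: str  # "M" or "B" in the sample
    url: str


@dataclass
class Variant:  # a specific phrasing
    text: str
    total: int
    mentions: list = field(default_factory=list)


@dataclass
class Cluster:  # a broad phrasing
    text: str
    total: int
    variants: list = field(default_factory=list)


def parse_clusters(lines):
    """Build the cluster > variant > mention hierarchy from an iterable of lines."""
    clusters = []
    for line in lines:
        m = MENTION.match(line)
        if m:
            # A news item belongs to the most recently seen specific phrasing.
            clusters[-1].variants[-1].mentions.append(
                Mention(m[1], int(m[2]), m[3], m[4])
            )
            continue
        p = PHRASE.match(line)
        if not p:
            continue  # blank or unrecognized line
        indent, first, second, text = p[1], int(p[2]), int(p[3]), p[4]
        if indent == "":
            # In the sample the second number (8) equals the summed mentions.
            clusters.append(Cluster(text=text, total=second))
        else:
            # In the sample the first number (3, 5) equals the summed mentions.
            clusters[-1].variants.append(Variant(text=text, total=first))
    return clusters


SAMPLE = [
    "2 8 we're not commenting on that story i'm afraid 2131865",
    "    3 3 we're not commenting on that 489007",
    "        2008-08-18 14:23:05 1 M http://business.theage.com.au/business/bb-chief-set-to-walk-plank-20080818-3xp7.html",
    "        2008-11-26 01:27:13 1 B http://sfweekly.com/2008-11-26/news/buy-line",
    "    5 2 we're not commenting on that story 2131864",
    "        2008-12-08 14:50:18 3 B http://videogaming247.com/2008/12/08/home-in-10-days-were-not-commenting-on-that-story-says-scee",
]

for cluster in parse_clusters(SAMPLE):
    print(cluster.text, cluster.total)
    for variant in cluster.variants:
        print("   ", variant.text, variant.total, len(variant.mentions))
```

Deciding the level from the shape of each line rather than from an exact tab count keeps the sketch tolerant of either tabs or spaces, though a malformed file (an item line before any phrase line) would raise an error.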
Their analysis was also very good (but nowhere near as amazing as the cleaning effort). Basically their findings were entirely consistent with the diffusion literature. They found that the popularity of different phrases followed a power law and the distribution of new mentions of a phrase followed a bell curve (which is equivalent to saying that the cumulative mentions of a phrase follow an s-curve). Both of these findings are consistent with a cumulative advantage process, and indeed, they model the process as a tension between “imitation” and “recency.”
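The link between the bell curve and the S-curve is just a running total, which a few lines of Python can illustrate. The Gaussian shape, the 48-hour span, and the peak and width values below are illustrative assumptions, not figures from the paper.

```python
import math
from itertools import accumulate


def bell_curve(hours, peak, width):
    """New mentions per hour as a Gaussian-shaped bump centered on `peak`."""
    return [math.exp(-((h - peak) ** 2) / (2 * width ** 2)) for h in hours]


hours = list(range(49))
new_mentions = bell_curve(hours, peak=24, width=4)

# Cumulative mentions are the running sum of new mentions.
cumulative = list(accumulate(new_mentions))
share = [c / cumulative[-1] for c in cumulative]

for h in (0, 12, 20, 24, 28, 36, 48):
    print(f"hour {h:2d}: new={new_mentions[h]:.3f}  cumulative share={share[h]:.3f}")
```

The cumulative share is flat at first, rises fastest at the hour where new mentions peak, and flattens again afterward, which is the S shape.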
This “two forces in tension” thing is typical of many endogenous models. It’s actually very easy to figure out a model that results in a stable equilibrium of “everything” or “nothing,” for instance the Schelling segregation model. However, it’s much harder to work out a model that has a more moderate equilibrium. So in cumulative advantage models like this, the trick is to explain why popularity is “only” described by a power law when it’s easier to see how it could be described by a step function (one target gets all the popularity, everyone else gets absolutely none). Because the memetracker data has a temporal element, they use a time decay function. Other similar models have used things like reciprocity (Gould), cost of linking (Podolny), and heterogeneity in taste (Rosen).
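A toy simulation can show how a recency penalty holds imitation in check. To be clear, this is not the team's model: the one-new-phrase-per-step setup, the exponential decay, and the `half_life` parameter are all assumptions chosen to keep the sketch short. In this toy version imitation alone does not produce a literal winner-take-all outcome, since new phrases keep arriving, but the oldest phrases keep accumulating mentions indefinitely.

```python
import random


def simulate(steps, half_life=None, seed=0):
    """Return per-phrase mention counts after `steps` rounds.

    Each round a new phrase appears with one mention, then one more mention
    goes to a phrase chosen with probability proportional to its current
    count (imitation), optionally discounted by its age (recency).
    """
    rng = random.Random(seed)
    births, counts = [], []
    for t in range(steps):
        births.append(t)
        counts.append(1)
        if half_life is None:
            weights = counts  # imitation only
        else:
            # Assumed decay: a phrase's pull halves every `half_life` rounds.
            weights = [
                c * 0.5 ** ((t - b) / half_life) for c, b in zip(counts, births)
            ]
        winner = rng.choices(range(len(counts)), weights=weights)[0]
        counts[winner] += 1
    return counts


for label, half_life in (("imitation only", None), ("imitation + recency", 5)):
    counts = simulate(2000, half_life=half_life)
    print(f"{label}: top phrase has {max(counts)} of {sum(counts)} mentions")
```

With the decay switched on, even a phrase that gets an early lead loses its pull as it ages, so attention keeps rotating to newer phrases instead of piling up on the early ones.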
In addition to “imitation” and “recency,” they also note that some kind of intrinsic “attractiveness” of the phrase might be an issue, though they bracket this largely because it would require a lot of hard human content analysis. From perusing the most popular stories, my guess is that the ideal phrase, from a newsworthiness perspective, is an ambiguous but potentially inflammatory quote from a prominent person describing another prominent person that should be comprehensible to someone with minimal background knowledge. So something like “lipstick on a pig” (which Barack Obama said about part of the McCain-Palin platform, though it was often interpreted as being a personal insult to Sarah Palin) is just about perfect in all these respects. Slightly less wonderful is the “palling around with terrorists” quote, because this requires some background knowledge about Bill Ayers and the Weather Underground (or, under the dog whistle theory of this quote, comparable familiarity with the “secret Muslim” theory).
The most novel finding was that most of the action for any given phrase occurs within an eight-hour window more or less symmetrically centered around the peak popularity, and indeed they describe the process as being qualitatively different (for one thing, it’s more sparse) outside of this window than it is within it. This struck me as a case where there could be some profit in treating the issue not just as some disembodied complexity science issue, but as the result of a particular social process involving real human beings about whom we know something. One of the things we know about people is that we tend to go to work during the day, usually for eight hours, and this characterizes not just journalists but many of the most important political bloggers. One simple prediction based on this is that the eight hours should be similar for most innovations, my guess being that they would be roughly 8am to 4pm east coast time. (Why, you might ask, don’t I just test this myself given that they make their data available? Because the files are too big to open with a text editor and it would take me hours to figure out a way to cull out the aspects of the data that aren’t relevant to this purpose, though someone who was good at “sed” or “awk” could probably write a script to do it in five minutes.)
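As a rough sketch of what such a culling script might look like, here it is in Python rather than sed or awk. It reads one line at a time, so the size of the file should not matter, keeps only lines that look like time-stamped news items, and tallies mentions by hour of day. The line layout is inferred from the sample above. The time zone of the timestamps is not stated here, so `utc_offset_hours` is a placeholder to set once that is known, and the filename in the usage note is hypothetical.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Assumed item line: "<timestamp> <count> <M|B> <url>".
ITEM = re.compile(r"^\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(\d+)\s+[MB]\s+\S+")


def mentions_by_hour(lines, utc_offset_hours=0):
    """Tally mention counts by clock hour, shifting timestamps by a fixed offset."""
    hours = Counter()
    for line in lines:
        m = ITEM.match(line)
        if not m:
            continue  # phrase lines, blank lines, anything else
        stamp = datetime.strptime(m[1], "%Y-%m-%d %H:%M:%S")
        stamp += timedelta(hours=utc_offset_hours)
        hours[stamp.hour] += int(m[2])
    return hours


SAMPLE = [
    "3 3 we're not commenting on that 489007",
    "2008-08-18 14:23:05 1 M http://business.theage.com.au/business/bb-chief-set-to-walk-plank-20080818-3xp7.html",
    "2008-11-26 01:27:13 1 B http://sfweekly.com/2008-11-26/news/buy-line",
    "2008-12-08 14:50:18 3 B http://videogaming247.com/2008/12/08/home-in-10-days-were-not-commenting-on-that-story-says-scee",
]

tally = mentions_by_hour(SAMPLE)
for hour in sorted(tally):
    print(f"{hour:02d}:00  {tally[hour]}")
```

On a real file the call would be something like `with open("clusters.txt", encoding="utf-8") as f: tally = mentions_by_hour(f)`. An aggregate tally is only a first cut at the prediction; checking whether each phrase's own peak window falls in the same clock hours would require grouping the timestamps by phrase first.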
While I think the team made excellent operational decisions, these decisions may nonetheless imply (probably small) biases. The team acknowledges that catch phrases are only somewhat intrinsically interesting but they are mostly using them as a proxy for something even more interesting, what you might call “stories” or “ideas.” This probably has the effect of giving the study a disproportionate emphasis on gotchas and insults like “lipstick on a pig” or “palling around with terrorists” rather than more complex ideas which may be less likely to be consistently described with the same or similar strings of words (though even very complex ideas eventually develop a shorthand, as with “cap and trade”). Similarly, the algorithm itself selects temporally unusual phrases, which may imply a selection for short-run issues rather than perennial debates.
Overall a very impressive paper that’s well worth reading for anyone interested in diffusion, news, blogs, or just really high-quality database work. Even better is that they provide their data (which as of now they are continuing to update) and a detailed description of how they processed it, thereby providing a platform on which other people can build, perhaps by focusing on substantive concerns.