News diffusion

July 14, 2009 at 5:47 pm 2 comments

| Gabriel |

The New York Times and Slashdot both have stories on an ambitious paper by a team of computer scientists that studies the diffusion of news stories. The website for the paper and supplementary materials (including the cleaned dataset) is at

The most impressive thing about the project is the data collection / cleaning effort. The team scraped basically all of the mainstream media and blogosphere for the last three months of the 2008 election. They then used a fuzzy algorithm to identify temporally-unusual phrases so “palling around with terrorists” would count whereas “in a press release” would not. What’s really impressive is that they not only identify exact phrases (which is pretty easy to code) but paraphrases (which is really hard to code). For instance, they identify about 40 versions of Sarah Palin’s “palling around with terrorists” characterization of Barack Obama’s relationship with Bill Ayers. They then identify and time-stamp every usage of each phrase in their scraped data. The dataset is human-readable and is arranged as a three level hierarchy of time-stamped news items within specific phrasings within broad phrasings. This nesting of paraphrases within general phrases goes a long way towards solving the problem of “reinvention” which might otherwise obscure that several “different” phrases are really only minimally distinct versions of the same phrase. Here’s a sample of the dataset:

  2  8  we're not commenting on that story i'm afraid   2131865
     3  3  we're not commenting on that    489007
        2008-08-18 14:23:05  1  M
        2008-11-26 01:27:13  1  B
        2008-11-27 18:55:30  1  B
     5  2  we're not commenting on that story      2131864
        2008-12-08 14:50:18  3  B
        2008-12-08 19:35:31  2  B
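
A minimal sketch of how this three-level layout could be read into nested records. The column meanings are assumptions inferred from the sample (two counts, then the phrase text, then an id on the phrase lines; a timestamp, a count, and a one-letter source code on the item lines), and the function and field names are hypothetical rather than anything the project ships.

```python
import re
from datetime import datetime

# Item lines start with a timestamp; other lines are broad or specific phrasings.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")
# Fields are assumed to be separated by tabs or by runs of two or more spaces.
SEPARATOR = re.compile(r"\t+|\s{2,}")

SAMPLE = """\
  2  8  we're not commenting on that story i'm afraid   2131865
     3  3  we're not commenting on that    489007
        2008-08-18 14:23:05  1  M
        2008-11-26 01:27:13  1  B
        2008-11-27 18:55:30  1  B
     5  2  we're not commenting on that story      2131864
        2008-12-08 14:50:18  3  B
        2008-12-08 19:35:31  2  B
"""

def parse_clusters(lines):
    """Return a list of broad phrasings, each holding specific phrasings and their items."""
    clusters = []
    cluster_indent = None  # indentation of the outermost level, learned from the first line
    for raw in lines:
        if not raw.strip():
            continue
        indent = len(raw) - len(raw.lstrip())
        fields = SEPARATOR.split(raw.strip())
        if TIMESTAMP.match(fields[0]):
            if not clusters or not clusters[-1]["phrases"]:
                continue  # stray item with no parent phrase; skip it
            clusters[-1]["phrases"][-1]["items"].append({
                "time": datetime.strptime(fields[0], "%Y-%m-%d %H:%M:%S"),
                "count": int(fields[1]),   # assumed: mentions in this item
                "source": fields[2],       # assumed: M and B are source types
            })
        elif cluster_indent is None or indent <= cluster_indent:
            cluster_indent = indent
            clusters.append({"text": fields[2], "id": fields[3], "phrases": []})
        else:
            clusters[-1]["phrases"].append({"text": fields[2], "id": fields[3], "items": []})
    return clusters

clusters = parse_clusters(SAMPLE.splitlines())
```

Passing an open file object instead of `SAMPLE.splitlines()` should work the same way, since the function only iterates over lines.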

Their analysis was also very good (but nowhere near as amazing as the cleaning effort). Basically their findings were entirely consistent with the diffusion literature. They found that the popularity of different phrases followed a power law and the distribution of new mentions of a phrase followed a bell curve (which is equivalent to saying that the cumulative mentions of a phrase follow an s-curve). Both of these findings are consistent with a cumulative advantage process, and indeed, they model the process as a tension between “imitation” and “recency.”
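
The equivalence between a bell curve of new mentions and an s-curve of cumulative mentions can be checked with a few lines of arithmetic. This sketch assumes a logistic s-curve with arbitrary parameters chosen purely for illustration (they are not estimates from the paper or the data) and differences it to recover the per-period new mentions.

```python
import math

def logistic(t, ceiling=1000.0, rate=0.8, midpoint=12.5):
    """Cumulative mentions at time t under an assumed logistic s-curve."""
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))

# Cumulative mentions at the start of each of 25 periods.
cumulative = [logistic(t) for t in range(25)]

# New mentions in each period are the differences between successive totals.
new_mentions = [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]

# The new-mention series rises to a single peak at the midpoint, then falls.
peak = new_mentions.index(max(new_mentions))
```

Going the other way, a running sum of a bell-shaped series of new mentions gives back the s-curve, so the two descriptions carry the same information.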

This “two forces in tension” thing is typical of many endogenous models. It’s actually very easy to figure out a model that results in a stable equilibrium of “everything” or “nothing,” for instance the Schelling segregation model. However, it’s much harder to work out a model that has a more moderate equilibrium. So in cumulative advantage models like this, the trick is to explain why popularity is “only” described by a power-law when it’s easier to see how it could be described by a step function (one target gets all the popularity, everyone else gets absolutely none). Because the memetracker data has a temporal element they use a time decay function. Other similar models have used things like reciprocity (Gould), cost of linking (Podolny), and heterogeneity in taste (Rosen).
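
To see why a recency term keeps imitation from running away, here is a toy simulation. It is a sketch under assumed functional forms, not the authors' specification: one new phrase appears per step, and each mention picks a phrase with weight `(mentions so far + 1) * decay ** age`. The function names and parameter values are hypothetical.

```python
import random

def simulate(n_steps, picks_per_step, decay, seed=0):
    """Return the mention count of each phrase after n_steps.

    decay = 1.0 means pure imitation; decay < 1.0 adds a recency penalty
    that shrinks a phrase's weight geometrically with its age.
    """
    rng = random.Random(seed)
    counts = []  # mentions received by each phrase
    born = []    # step at which each phrase first appeared
    for step in range(n_steps):
        counts.append(0)
        born.append(step)
        for _ in range(picks_per_step):
            weights = [(c + 1) * decay ** (step - b) for c, b in zip(counts, born)]
            chosen = rng.choices(range(len(counts)), weights=weights, k=1)[0]
            counts[chosen] += 1
    return counts

def top_share(counts):
    """Fraction of all mentions captured by the single most popular phrase."""
    return max(counts) / sum(counts)

no_recency = simulate(200, 5, decay=1.0)
with_recency = simulate(200, 5, decay=0.5)
```

Comparing `top_share(no_recency)` with `top_share(with_recency)` shows the tension: without decay the earliest phrases hold on to a large share of attention, while with decay attention keeps turning over to newer phrases.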

In addition to “imitation” and “recency,” they also note that some kind of intrinsic “attractiveness” of the phrase might be an issue, though they bracket this largely because it would require a lot of hard human content analysis. From perusing the most popular stories, my guess is that the ideal phrase, from a newsworthiness perspective, is an ambiguous but potentially inflammatory quote from a prominent person describing another prominent person that should be comprehensible to someone with minimal background knowledge. So something like “lipstick on a pig” (which Barack Obama said about part of the McCain-Palin platform, though it was often interpreted as being a personal insult to Sarah Palin) is just about perfect in all these respects. Slightly less wonderful is the “palling around with terrorists” quote, because this requires some background knowledge about Bill Ayers and the Weather Underground (or, under the dog whistle theory of this quote, comparable familiarity with the “secret Muslim” theory).

The most novel finding was that most of the action for any given phrase occurs within an eight-hour window more or less symmetrically centered on the peak of its popularity, and indeed they describe the process as being qualitatively different (for one thing, it’s sparser) outside of this window than within it. This struck me as a case where there could be some profit in treating the pattern not just as some disembodied complexity science issue, but as the result of a particular social process involving real human beings about whom we know something. One of the things we know about people is that we tend to go to work during the day, usually for eight hours, and this characterizes not just journalists but many of the most important political bloggers. One simple prediction based on this is that the eight hours should be similar for most innovations, my guess being that they would be roughly 8am to 4pm east coast time. (Why, you might ask, don’t I just test this myself given that they make their data available? Because the files are too big to open with a text editor and it would take me hours to figure out a way to cull the aspects of the data that aren’t relevant to this purpose, though someone who was good at “sed” or “awk” could probably write a script to do it in five minutes.)
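
A rough sketch of that culling step, written in Python rather than sed or awk. It streams the file one line at a time, so nothing has to fit in a text editor, keeps only the time-stamped item lines, and tallies mentions by hour of day. It assumes item lines look like the ones in the sample above (timestamp followed by a count) and says nothing about the time zone of the timestamps, which would need to be known before comparing against 8am to 4pm east coast time. The function names are hypothetical.

```python
import re
from collections import Counter

# Matches an item line: optional indentation, a timestamp, then a mention count.
ITEM = re.compile(r"^\s*\d{4}-\d{2}-\d{2} (\d{2}):\d{2}:\d{2}\s+(\d+)")

def mentions_by_hour(lines):
    """Tally mentions by hour of day, ignoring every line that is not an item."""
    hourly = Counter()
    for line in lines:
        match = ITEM.match(line)
        if match:
            hourly[int(match.group(1))] += int(match.group(2))
    return hourly

def busiest_window(hourly, width=8):
    """Return the starting hour of the busiest run of `width` hours, wrapping at midnight."""
    def total(start):
        return sum(hourly[(start + offset) % 24] for offset in range(width))
    return max(range(24), key=total)

# The five item lines from the sample above.
sample_items = [
    "        2008-08-18 14:23:05  1  M",
    "        2008-11-26 01:27:13  1  B",
    "        2008-11-27 18:55:30  1  B",
    "        2008-12-08 14:50:18  3  B",
    "        2008-12-08 19:35:31  2  B",
]
hourly = mentions_by_hour(sample_items)
start = busiest_window(hourly)
```

On the full dataset, an open file object could be passed to `mentions_by_hour` in place of the list. Note that this pools all phrases into one time-of-day profile; checking whether each phrase's own window lines up with the workday would need the tally to be kept per phrase.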

While I think the team made excellent operational decisions, these decisions may nonetheless imply (probably small) biases. The team acknowledges that catch phrases are only somewhat intrinsically interesting; they are mostly using them as a proxy for something even more interesting, what you might call “stories” or “ideas.” This probably has the effect of giving the study a disproportionate emphasis on gotchas and insults like “lipstick on a pig” or “palling around with terrorists” rather than more complex ideas, which may be less likely to be consistently described with the same or similar strings of words (though even very complex ideas eventually develop a shorthand, as with “cap and trade”). Similarly, the algorithm itself selects temporally unusual phrases, which may imply a selection for short-run issues rather than perennial debates.

Overall, a very impressive paper that’s well worth reading for anyone interested in diffusion, news, blogs, or just really high quality database work. Even better is that they provide their data (which as of now they are continuing to update) and a detailed description of how they processed it, thereby providing a platform on which other people can build, perhaps by focusing on more substantive concerns.



  • 1. Noah  |  July 14, 2009 at 6:51 pm

    Other research I have done suggests that quotes which can be understood on their own (without needing to know prior events in the news cycle) would help increase how widely a quote could spread. On a more general level, this is a way to try and maximize the audience. Local TV news, a genre which tends to emphasize the raw number of viewers over any more detailed demographics, tends to present news events in a de-contextualized way where viewers do not need any background to understand the current story. This extends to designing a TV program where viewers do not need to have seen prior segments of the broadcast to understand the current one. (This helps stations to attract new viewers coming home from work, because it is easier to catch up halfway through a local TV news broadcast than an entertainment program.)

    As far as ambiguity in quotes, it’s hard to say whether that really matters, because there’s no comparison. If Obama said “I believe that the Republicans cannot govern, because they picked someone as stupid and unqualified as Sarah Palin to be on the ticket,” that quote would spread like wildfire. But mainstream US candidates are not going to be that direct in their criticisms.

    (This made up quote would also be a great example of how useful this study’s fuzzy logic is for tracing quotes. Adding “I believe” to the statement would probably make it more quotable, but the phrase itself would likely be truncated.)

    The eight hour window makes a lot of sense when placed in the context of how news gets produced. Political news is usually a 9-5 job, especially for the sources who give journalists news. This is a part of why coverage of any event held after 5 PM gets really weird. (If you want inane details about how prime time press conferences are covered, e-mail me.)

  • 2. Memetracker into Stata « Code and Culture  |  February 8, 2010 at 4:41 am

    […] few months ago I mentioned the Memetracker project to scrape the internet and look for the diffusion of (various variants of) […]
