Archive for July, 2009

News diffusion

| Gabriel |

The New York Times and Slashdot both have stories on an ambitious paper by a team of computer scientists that studies the diffusion of news stories. The website for the paper and supplementary materials (including the cleaned dataset) is at

The most impressive thing about the project is the data collection and cleaning effort. The team scraped basically all of the mainstream media and blogosphere for the last three months of the 2008 election. They then used a fuzzy algorithm to identify temporally unusual phrases, so “palling around with terrorists” would count whereas “in a press release” would not. What’s really impressive is that they not only identify exact phrases (which is pretty easy to code) but paraphrases (which is really hard to code). For instance, they identify about 40 versions of Sarah Palin’s “palling around with terrorists” characterization of Barack Obama’s relationship with Bill Ayers. They then identify and time-stamp every usage of each phrase in their scraped data. The dataset is human-readable and is arranged as a three-level hierarchy of time-stamped news items within specific phrasings within broad phrasings. This nesting of paraphrases within general phrases goes a long way towards solving the problem of “reinvention,” which might otherwise obscure that several “different” phrases are really only minimally distinct versions of the same phrase. Here’s a sample of the dataset:

  2  8  we're not commenting on that story i'm afraid   2131865
     3  3  we're not commenting on that    489007
        2008-08-18 14:23:05  1  M
        2008-11-26 01:27:13  1  B
        2008-11-27 18:55:30  1  B
     5  2  we're not commenting on that story      2131864
        2008-12-08 14:50:18  3  B
        2008-12-08 19:35:31  2  B
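To make the structure concrete, here is a quick Python sketch that parses the excerpt above into per-phrase mention lists. The field layout (leading counts, phrase text, trailing numeric id, with timestamp lines belonging to the most recently seen phrase) is reverse-engineered from these eight lines alone, not from the project’s documentation, so treat it purely as an illustration:

```python
import re
from collections import defaultdict

# Lines containing a timestamp are mentions of the most recently seen
# phrase; other lines introduce a phrase as "<count> <count> <text> <id>".
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

sample = """\
  2  8  we're not commenting on that story i'm afraid   2131865
     3  3  we're not commenting on that    489007
        2008-08-18 14:23:05  1  M
        2008-11-26 01:27:13  1  B
        2008-11-27 18:55:30  1  B
     5  2  we're not commenting on that story      2131864
        2008-12-08 14:50:18  3  B
        2008-12-08 19:35:31  2  B
"""

mentions = defaultdict(list)   # phrase variant -> list of time stamps
current = None
for line in sample.splitlines():
    m = TIMESTAMP.search(line)
    if m:
        mentions[current].append(m.group(0))
    else:
        parts = line.split()
        current = " ".join(parts[2:-1])   # drop leading counts, trailing id

print({phrase: len(stamps) for phrase, stamps in mentions.items()})
```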

Their analysis was also very good (but nowhere near as amazing as the cleaning effort). Basically their findings were entirely consistent with the diffusion literature. They found that the popularity of different phrases followed a power law and the distribution of new mentions of a phrase followed a bell curve (which is equivalent to saying that the cumulative mentions of a phrase follow an s-curve). Both of these findings are consistent with a cumulative advantage process, and indeed, they model the process as a tension between “imitation” and “recency.”
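The bell-curve/s-curve equivalence is just the observation that the running sum of a bell-shaped curve of new mentions traces out a sigmoid. A quick numerical illustration, with made-up numbers rather than their estimates:

```python
import math

# Bell-shaped curve of new mentions per period: a discretized Gaussian
times = [t / 10 for t in range(-50, 51)]
new_mentions = [math.exp(-t * t / 2) for t in times]

# Cumulative mentions: a running sum, normalized to end at 1
total = sum(new_mentions)
cumulative, running = [], 0.0
for n in new_mentions:
    running += n
    cumulative.append(running / total)

# Slow start, steep middle, saturation: the classic s-curve
print(round(cumulative[10], 3), round(cumulative[50], 3), round(cumulative[90], 3))
```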

This “two forces in tension” thing is typical of many endogenous models. It’s actually very easy to figure out a model that results in a stable equilibrium of “everything” or “nothing,” for instance the Schelling segregation model. However, it’s much harder to work out a model that has a more moderate equilibrium. So in cumulative advantage models like this, the trick is to explain why popularity is “only” described by a power-law when it’s easier to see how it could be described by a step function (one target gets all the popularity, everyone else gets absolutely none). Because the memetracker data has a temporal element they use a time decay function. Other similar models have used things like reciprocity (Gould), cost of linking (Podolny), and heterogeneity in taste (Rosen).
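To make the tension concrete, here is a toy simulation in the spirit of, but not identical to, their model: each new mention imitates an existing phrase in proportion to its past mentions (cumulative advantage), discounted exponentially by the phrase’s age (recency). The decay rate and birth rate are invented for illustration:

```python
import random

random.seed(2009)

DECAY = 0.95       # per-period recency discount (invented)
BIRTH_RATE = 0.05  # chance a brand-new phrase appears each period (invented)

phrases = []                      # each entry: [mention_count, birth_time]
for t in range(2000):
    if not phrases or random.random() < BIRTH_RATE:
        phrases.append([1, t])    # a fresh phrase enters with one mention
        continue
    # imitation: weight by past mentions; recency: discount by age
    weights = [m * DECAY ** (t - born) for m, born in phrases]
    idx = random.choices(range(len(phrases)), weights=weights)[0]
    phrases[idx][0] += 1

counts = sorted((m for m, _ in phrases), reverse=True)
print(counts[:5], "...", counts[-5:])
```

With DECAY set to 1 the same loop drifts toward much heavier concentration in the earliest phrases; the recency discount is what keeps the distribution merely skewed rather than degenerate.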

In addition to “imitation” and “recency,” they also note that some kind of intrinsic “attractiveness” of the phrase might be an issue, though they bracket this largely because it would require a lot of hard human content analysis. From perusing the most popular stories, my guess is that the ideal phrase, from a newsworthiness perspective, is an ambiguous but potentially inflammatory quote from a prominent person describing another prominent person that should be comprehensible to someone with minimal background knowledge. So something like “lipstick on a pig” (which Barack Obama said about part of the McCain-Palin platform, though it was often interpreted as being a personal insult to Sarah Palin) is just about perfect in all these respects. Slightly less wonderful is the “palling around with terrorists” quote, because this requires some background knowledge about Bill Ayers and the Weather Underground (or, under the dog whistle theory of this quote, comparable familiarity with the “secret Muslim” theory).

The most novel finding was that most of the action for any given phrase occurs within an eight-hour window more or less symmetrically centered on the peak popularity, and indeed they describe the process as being qualitatively different (for one thing, it’s sparser) outside of this window than within it. This struck me as a place where there could be some profit in treating the phenomenon not just as disembodied complexity science, but as the result of a particular social process involving real human beings about whom we know something. One of the things we know about people is that we tend to go to work during the day, usually for eight hours, and this characterizes not just journalists but many of the most important political bloggers. One simple prediction based on this is that the eight hours should be similar for most innovations, my guess being roughly 8am to 4pm east coast time. (Why, you might ask, don’t I just test this myself given that they make their data available? Because the files are too big to open with a text editor and it would take me hours to figure out a way to cull out the aspects of the data that aren’t relevant to this purpose, though someone who was good at “sed” or “awk” could probably write a script to do it in five minutes.)
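For what it’s worth, the culling script really is only a few lines in any scripting language. Here is a sketch in Python that streams the file line by line (so file size is no obstacle) and tallies mentions by hour of the day, assuming timestamps look like those in the sample excerpt above; whether any peak lines up with an 8-to-4 Eastern workday would also depend on what time zone the stamps are recorded in:

```python
import re
from collections import Counter

TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} (\d{2}):\d{2}:\d{2}")

def mentions_by_hour(path):
    """Stream a (possibly huge) phrase-cluster file and tally
    time-stamped mentions by hour of the day."""
    hours = Counter()
    with open(path) as f:
        for line in f:
            m = TIMESTAMP.search(line)
            if m:
                hours[int(m.group(1))] += 1
    return hours
```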

While I think the team made excellent operational decisions, these decisions may nonetheless imply (probably small) biases. The team acknowledges that catchphrases are only somewhat intrinsically interesting, but they are mostly using them as a proxy for something even more interesting, what you might call “stories” or “ideas.” This probably has the effect of giving the study a disproportionate emphasis on gotchas and insults like “lipstick on a pig” or “palling around with terrorists” rather than more complex ideas, which may be less likely to be consistently described with the same or similar strings of words (though even very complex ideas eventually develop a shorthand, as with “cap and trade”). Similarly, the algorithm itself selects temporally unusual phrases, which may imply a selection for short-run issues rather than perennial debates.

Overall a very impressive paper that’s well worth reading for anyone interested in diffusion, news, blogs, or just really high-quality database work. Even better is that they provide their data (which as of now they are continuing to update) and a detailed description of how they processed it, thereby providing a platform on which other people can build, perhaps by focusing on more substantive concerns.

July 14, 2009 at 5:47 pm

Friending race

| Gabriel |

Noah pointed out to me that some of Eszter’s work got a plug in the NYT. In her survey of UIC freshmen, she found that Hispanics were still mostly on MySpace but others had mostly moved to Facebook. The argument for this differential shift by race is that these kinds of sites benefit tremendously from network externalities, and since underlying social networks are segregated, the social network websites come to reflect this.

Although the NYT article mentions “white flight,” mostly in the context of discussing another researcher, this characterization doesn’t seem exactly right to me, either empirically or theoretically. Eszter’s work shows that blacks have mostly moved to Facebook, but in most types of interaction (residential segregation, marriage, etc.) Anglo whites are more likely to associate with Hispanics than with blacks. Likewise, in classic white-flight models whites are fleeing the presence of blacks, but what seems to be going on here is that whites are drawn by other whites (or more specifically, by their friends, who are mostly white). Unlike housing, where you know who your neighbors are, on a social networking site you only associate with the people you choose. In other words, it’s the pull of being drawn by your friends, not the push of avoiding people you look down upon. It’s interesting to consider the types of differences (I hesitate to use the word “segregation”) that can result entirely from the pull of homophily rather than the push of heteroantipathy (is that a word?).

July 10, 2009 at 3:19 pm


Continuing my discussion of Long’s Workflow

One of the things that Long is appropriately insistent on is good archiving for the long term. First, he notes that the most serious issue is the physical storage medium and the need to migrate data whenever you get a new system, given that even storage formats that were popular in recent memory, like Zip disks and 3.5″ floppies, are now almost impossible to find hardware for. I think in the future this should become easier as hard drives get so ginormous that it’s increasingly feasible to keep everything on one disk rather than pushing your archives to removable media that can get lost. When it’s all on one (or a few) internal disks you tend to migrate and back up, unlike removable media that get lost in your file cabinet until they are obsolete and/or physically corroded. Of course, in those increasingly rare instances where IRB or proprietary data provision issues are not a concern, the best way to handle this is to use ICPSR, CPANDA, or a similar public archive.

Even if you can access the files, the question is whether you can read them. Long appropriately stresses keeping data in several formats, but I think he’s a bit too agnostic about which formats are likely to last. As I see it there are basically two issues: popularity and opacity. Popularity is simply how many people use the format. For this reason Long endorses SAS Transport because it’s the official format of the FDA. However, Long overlooks the other key issue, opacity, which basically comes down to the two related issues of being proprietary and being binary (as compared to text).

The more popular and the less opaque, the more likely it is that you’ll be able to read your data in the future. So looking into my crystal ball twenty years or so I think it’s fair to guess that Stata binary will not be readable with any ease and uncompressed tab-delimited ASCII will remain the lingua franca of data. I say tab-delimited instead of fixed-width because dictionary files get lost, tab-delimited instead of csv because embedded literal commas are common whereas embedded literal tabs are nonexistent, and I say uncompressed because compressed files are more vulnerable to corruption.
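The embedded-comma problem is easy to demonstrate; the record below is made up for illustration:

```python
# A record whose text field contains a literal comma
record = ["2131865", "8", "we're not commenting on that, i'm afraid"]

# Naive comma-delimited output: the comma inside the text corrupts the
# field count on re-import
csv_line = ",".join(record)
assert len(csv_line.split(",")) == 4   # three fields went in, four came out

# Tab-delimited output round-trips cleanly, since literal tabs are
# essentially never embedded in field values
tab_line = "\t".join(record)
assert tab_line.split("\t") == record
```

Quoting conventions can rescue csv, of course, but that is one more layer a future import filter has to get right.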

The problem with ASCII is that if (like Long) you find value labels and variable labels to be crucial then ASCII loses a lot of value. I think a good compromise is the Stata XML format. As you can see by opening it in a text editor, XML is human-readable text so even if no off-the-shelf import filters exist (which is unlikely as XML is increasingly the standard) you could with relatively little effort/cost write a filter yourself in a text-processing language like perl — or whatever the equivalent of perl will be in a generation.
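For instance, here is what recovering value labels from an XML file might look like. The fragment and its tag and attribute names are invented for illustration; this is not Stata’s actual xmlsave schema:

```python
import xml.etree.ElementTree as ET

# A made-up fragment in the spirit of a labeled dataset
doc = """<dataset>
  <variable name="party" label="Respondent party identification">
    <value code="1" label="Democrat"/>
    <value code="2" label="Republican"/>
  </variable>
</dataset>"""

# Even with no off-the-shelf import filter, a few lines recover the
# value labels from the human-readable text
root = ET.fromstring(doc)
labels = {v.get("code"): v.get("label") for v in root.iter("value")}
print(labels)
```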

Because it’s smaller and faster than XML, I still use Stata binary for day to day usage but I’m going to make a point of periodically making uncompressed XML archives, especially when I finish a project.

July 1, 2009 at 5:01 am


The Culture Geeks