Posts tagged ‘diffusion’

Social Structures

| Gabriel |

Shortly before ASA, I finished John Levi Martin’s Social Structures and I loved it, loved it, loved it. (Also see thoughts from Paul DiMaggio, Omar Lizardo, Neil Gross, Fabio Rojas, and Science). I find myself hoping I have to prep contemporary theory just so I can inflict it on unsuspecting undergrads. The book is all about emergence and how fairly minor changes in the nature of social mechanisms can create quite different macro social structures.* It’s just crying out for someone to write a companion suite in NetLogo, chapter by chapter. In addition, JLM knows an enormous amount of history, anthropology, and even animal behavior and uses it all very well both to illustrate his points and to show how they work when the friction of reality enters. For instance, he notes that balance theory breaks down to the extent that people have some agency in defining the nature of ties and/or keeping some relations “neutral” rather than accepting the ally-versus-enemy dichotomy.**

An interesting contrast is Francis Fukuyama’s Origins of Political Order, which I also liked. The two books are broadly similar in scope, giving a sweeping comparative overview of history that starts with animals and attempts to work up to the early modern era. (There are also some similarities in detail, such as their very similar understandings of the “big man” system and their shared observation that domination is more likely in bounded populations). There is an obvious difference of style in that Fukuyama is easier to read and goes into more extended historical discussions, but the more important differences are thematic and theoretical. One such difference is that Fukuyama follows Polybius in seeing the three major socio-political classes as the people, the aristocracy, and the monarch, with the people and the monarch often combining against the aristocracy (as seen in the Roman Revolution and in early modern absolute monarchies). In contrast, JLM’s model tends to see the monarch as just the top aristocrat, though his emphasis on the development of transitivity in command effectively accomplishes some of the same work as the Fukuyama/Polybius model.

The most important difference is that Fukuyama is inspired by Weber whereas JLM uses Simmel, a distinction that becomes especially pronounced as they move from small tribal bands to early modern societies. Fukuyama’s book is fundamentally about the tension between kinship and law as the organizing principle of society. In Fukuyama’s account both have very old roots and modernity represents the triumph of law. In contrast, JLM sees kinship (and analogous structures like patronage) as the fundamental logic of society, with modernity being similar in kind but grander in scale. In the last chapter and a half JLM discusses the early modern era, and here he sounds a bit more like Fukuyama, but he’s clearly more interested in, for instance, the origins of political parties than in their transformation into modern ideological actors.

In part this is because, as Duncan Watts observed at the “author meets critics” session at ASA, JLM is mostly interested in that which can be derived from micro-macro emergence and tends to downplay issues that do not fit into this framework.*** This is seen most clearly in the fact that the book winds down around the year 1800 after noting that (a) institutionalization can partially decouple mature structures from their micro origins and (b) ideology can in effect form a sort of bipartite network structure through which otherwise disconnected factions and patronage structures can be united (usually in order to provide a heuristic through which elites can practice balance theory), as with the formation of America’s original party system of Federalists and Democratic-Republicans, which JLM discusses in detail. Of course, as I said in the “critics” Q&A, at present most politically active Americans have a primarily ideological attachment to their party without things like ward bosses, and, perhaps more interestingly, a role for ideology as a bridge is not restricted to the transition from early modern to modern. As is known to any reader of Gibbon, there was a similar pattern in late antiquity, in which esoteric theological disputes over adoptionist Christology and the reconciliation of sinners provided rallying points for core-versus-periphery political struggles in the late Roman empire. Since this is largely a dispute over emphasis, it’s not surprising that JLM was sympathetic to this, but he noted that there are limits to what ideological affinity can accomplish and that when it comes to costly action you really need micro structures. (He is of course entirely right about this, as seen most clearly in the military importance of unit cohesion, but it’s still interesting that ideology has waxed and patronage waned in the party systems of advanced democracies).

There are a few places in the book where JLM seemed to be arguing from end states back to micro-mechanisms and I couldn’t tell whether he meant that the micro-mechanisms necessarily exist (i.e., functionalism) or that such demanding specifications of micro-mechanisms implied that the end state was inherently unstable (i.e., emergence). For instance, in chapter three he discusses the exchange of women between patrilineal lineages and notes that if there is not simple reciprocity (usually through cross-cousin marriage) then there must either be some form of generalized reciprocity or else the bottom-ranked male lineages will go extinct. On reading this I was reminded of this classic exchange:

That is, I think it is entirely possible that powerful male lineages could have asymmetric marital exchange with less powerful male lineages and if the latter are eventually driven into extinction then that sucks for them. (The reason this wouldn’t lead to just a single male lineage clan is that, as Fukuyama notes, large clans can fissure and tracing descent back past the 5th or 6th generation is usually more political than genealogical). This is the sort of thing that can actually be answered empirically by contrasting Y chromosomes with mitochondrial DNA. For instance, a recent much publicized study showed that pretty much all ethnically English men carry the Germanic “Frisian Y” chromosome. The authors’ interpretation of this is that a Saxon mass migration displaced the indigenous Romano-British population, but I don’t see how this is at all inconsistent with the older elite transfer model of the Saxon invasion if we assume that the transplanted foreign elite hoarded women, including indigenous women. A testable implication of the elite transfer model is that the English would have the same Y as the Danes and Germans but similar mitochondrial DNA to the Irish and Welsh. Similarly, a 2003 study showed that 8% of men in East and Central Asia show descent on the male line from Genghis Khan, but nobody has suggested that this reflects a mass migration. Rather, in the 12th and 13th centuries the Mongols used rape and polygamy to impregnate women of many Asian nations and didn’t really give a damn if this meant the extinction of the indigenous male lineages.

A very minor point, but one that is important to me as a diffusion guy, is that chapter five uses the technical jargon of diffusion in non-standard ways, or, to be more neutral about it, he and I use the terms differently. That said, it’s a good chapter; it just needs to be read carefully to avoid semantic confusion.

This post may read like I’m critical of the book but that’s only because I prefer to react to and puzzle out the book rather than summarize it. What reservations I have are fairly minor and unconfident. My overall assessment is that this is a tremendously important book that should be read carefully by anyone interested in social networks, political sociology, social psychology, or economic sociology. For instance, I wish it had been published before my paper with Esparza and Bonacich, as the chapter on pecking orders would have allowed us to develop the finding about credit ranking networks in more depth. (That and it would have given us a pretext to compare Hollywood celebrities to poultry and small children). Despite the book’s foundation in graph theory, this interest should span the qualitative/quantitative divide — at ASA Randy Collins praised the book enthusiastically and gave a very thoughtful reading, and from personal conversation I know that Alice Goffman was also very impressed. I think this is because JLM’s relentless focus on interaction between people is a much thinner but nonetheless similar approach to the kinds of issues that qualitative researchers tend to engage with. Indeed, at a deep level Social Structures has more in common with ethnography than with anything that uses regression to try to describe society as a series of slope-intercept equations.

————-

* Technically, it’s about weak emergence, not strong emergence. At “author meets critics” JLM was very clear that he rejects the idea of sui generis social facts that have an independent ontological status rather than being just a summary or aggregation of micro structure.

** One of the small delights in the early parts of the book is that he notes how our understanding of network structure is driven in part by the ways we measure and record it. So networks based on observation of proximity are necessarily symmetric whereas networks based on sociometric surveys highlight the contingent nature of reciprocity; networks based on balance theory tend to be coded positive/negative whereas matrices emphasize presence/absence and are often sparse; etc. I might add to his observations in this line that the extremely common practice of projecting bipartite networks into unipartite space (as with studies of Hollywood, Broadway, corporate boards, and technical consortia) has its own set of biases, most obviously exaggerating the importance and scalability of cliques. Also, I’ve previously remarked on a similar issue in Saller’s Personal Patronage as to how we need to be careful about directed ties being euphemistically described as symmetric ties in some of our data.

*** Watts also observed that JLM’s approach is very much a sort of 1960s sociometry and doesn’t use the recent advances in social network analysis driven by the availability of big data about computer-mediated communication (such as Watts’ current work on Twitter). JLM responded with what was essentially a performativity critique of naive reliance on web 2.0 data, noting for instance that Facebook encourages triadic closure, enforces reciprocity, and discourages deletion of old ties.

August 24, 2011 at 3:54 pm

Misc Links

| Gabriel |

  • Useful, detailed overview of Lion. The user interface stuff doesn’t interest me nearly as much as the tight integration of version control and “resume.” Also, worth checking if your apps are compatible. (Stata and Lyx are supposed to work fine. TextMate is supposed to run OK with some minor bugs. No word on R. Fink doesn’t work yet). It sounds good but I’m once again sitting it out for a few months until the compatibility bugs get worked out. Also, as with Snow Leopard, many of the features won’t really do anything until developers implement them in their applications.
  • I absolutely loved the NPR Planet Money story on the making of Rihanna’s “Man Down.” (Not so fond of the song itself, which reminds me of Bing Crosby and David Bowie singing “Little Drummer Boy” in matching cardigans). If you have any interest at all in production of culture read the blog post and listen to the long form podcast (the ATC version linked from the blog post is the short version).
  • Good explanation of e, which comes up surprisingly often in sociology (logit regression, diffusion models, etc.). I like this a lot, as in my own pedagogy I really try to emphasize the intuitive meaning of mathematical concepts rather than just the plug-and-chug formulae on the one hand or the proofs on the other.
  • People are using “bimbots” to scrape Facebook. And to think that I have ethical misgivings about forging a user-agent string so wget looks like Firefox.

July 20, 2011 at 3:46 pm

Misc Links

  • Lisa sends along this set of instructions for doing a wide-long reshape in R. Useful, and I’m passing it along for the benefit of R users, but the relative intuition and simplicity of “reshape wide stub, i(i) j(j)” is why I still do my mise en place in Stata whenever I use R. Ideally though, as my grad student Brooks likes to remind me, we really should be doing this kind of data mise en place in a dedicated database and using the Stata and R ODBC commands/functions to read it in.
  • “The days change at night, change in an instant.”
  • Anyone interested in replicating this paper should be paying close attention to this pending natural experiment. In particular I hope the administrators of this survey are smart enough to oversample California in the next wave. I’d consider doing the replication myself but I’m too busy installing a new set of deadbolts and adopting a dog from a pit bull rescue center.
  • In Vermont, a state government push to get 100% broadband penetration is using horses to wire remote areas that are off the supply curve beaten path. I see this as a nice illustration both of cluster economies and of the different logics used by markets (market clearing price) and states (fairness, which often cashes out as universal access) in the provision of resources. (h/t Slashdot)
  • Yglesias discusses some poll results showing that voters in most of the states that recently elected Republican governors would now elect the Democrats. There are no poll results for California, the only state that switched to the Democrats last November. Repeat after me: REGRESSION TO THE MEAN. I don’t doubt that some of this is substantive backlash to overreach on the part of politically ignorant swing voters who didn’t really understand the GOP platform, but really, you’ve still got to keep in mind REGRESSION TO THE MEAN.
  • Speaking of Yglesias, the ThinkProgress redesign only allows commenting from Facebook users, which is both a pain for those of us who don’t wish to bear the awesome responsibility of adjudicating friend requests and a nice illustration of how network externalities can become coercive as you reach the right side of the s-curve.

May 31, 2011 at 10:22 am

Zeno’s Webmail Security Team Account Confirmation

| Gabriel |

Last year I described how a “reply to all” cascade follows an s-curve. Now (via Slashdot) I see that another pathology of email results in the other classic diffusion process. That is, the number of hits received by phishing scams follows a constant hazard function, otherwise known as an “external influence” diffusion curve or Zeno’s paradox of Achilles and the tortoise.


link to original story

This is of course entirely predictable from theory. Once you realize that people aren’t forwarding links to phishing scams, but only clicking on links spammed to them directly, then it’s obvious that there will not be an endogenous hazard function. Furthermore, con artists know that the good guys will shut down their site ASAP, which means that it is in their interest to send out all their spam messages essentially simultaneously. Thus you have a force that people are exposed to simultaneously and react to individualistically. Under these scope conditions it is necessarily the case that you’d get this diffusion curve, with a majority of fraud victims arriving within the first hour.
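
To put the contrast in the usual diffusion-model notation (this is my gloss, not anything in the original story): let F(t) be the proportion of potential victims who have clicked by time t. Then the two classic processes are

dF/dt = a(1 - F)        =>   F(t) = 1 - e^(-at)    (external influence, constant hazard a)
dF/dt = b F (1 - F)     =>   logistic s-curve      (internal influence)

In the external influence model each interval converts a constant fraction of whoever is left, so the cumulative curve always covers a fixed proportion of the remaining distance to the ceiling without ever quite getting there, which is the Zeno joke; in the internal influence model the hazard grows with the number of prior adopters, which is what gives you the s-curve.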

This is at all surprising only because we’re so enamored of s-curves that we forget that sometimes people open their umbrellas because it’s raining. (Which is not to say that such behavior is asocial in a broader sense).

December 3, 2010 at 12:14 am

Scraping for Event History

| Gabriel |

As I’ve previously mentioned, there’s a lot of great data out there but much of it is ephemeral, so if you’re interested in change (which, given our obsession with event history, many sociologists are) you’ve got to know how to grab it. I provided a script (using cron and curl) for grabbing specific pages and timestamping them, but this doesn’t scale up very well to getting entire sites, both because you need to specify each specific URL and because it saves a complete copy each time rather than just the diff. I’ve recently developed another approach that relies on wget and rsync and is much better for scaling up to a more ambitious scraping project.

Note that because of subtle differences between dialects of Unix, I’m assuming Linux for the data collection but Mac for the data cleaning.* Using one or the other for everything requires some adjustments. Also note that because you’ll want to “cron” this, I don’t recommend running it on your regular desktop computer unless you leave it on all night. If you don’t have server space (or an old computer on which you can install Linux and treat as a server), your cheapest option is probably to run it on a wall wart computer for about $100 (plus hard drive).

Wget is similar to curl in that it’s a tool for downloading internet content, but it has several useful features, some of which aren’t available in curl. First, wget can do recursion, which means it will automatically follow links and thus can get an entire site rather than just a page. Second, it reads links from a text file a bit better than curl. Third, it has a good time-stamping feature where you can tell it to only download new or modified files. Fourth, you can exclude files (e.g., video files) that are huge and you’re unlikely to ever make use of. Put all these together and it means that wget is scalable — it’s very good at getting and updating several websites.

Unfortunately, wget is good at updating, but not at archiving. It assumes that you only want the current version, not the current version and several archival copies. Of course this is exactly what you do need for any kind of event history analysis. That’s where rsync comes in.

Rsync is, as the name implies, a syncing utility. It’s commonly used as a backup tool (both remote and local). However the simplest use for it is just to sync several directories and we’ll be applying it to a directory structure like this:

project/
  current/
  backup/
    t0/
    t1/
    t2/
  logs/

In this setup, wget only ever works on the “current” directory, which it freely updates. That is, whatever is in “current” is a pretty close reflection of the current state of the websites you’re monitoring. The timestamped stuff, which you’ll eventually be using for event history analysis, goes in the “backup” directories. Every time you run wget you then run rsync after it so that next week’s wget run doesn’t throw this week’s wget run down the memory hole.

The first time you do a scrape you basically just copy current/ to backup/t0. However, if you were to do this for each scrape it would waste a lot of disk space since you’d have a lot of identical files. This is where incremental backup comes in, which Mac users will know as Time Machine. You can use hard links (similar to aliases or shortcuts) to get rsync to accomplish this.** The net result is that backup/t0 takes the same disk space as current/ but each subsequent “backup” directory takes only about 15% as much space. (A lot of web pages are generated dynamically and so they show up as “recently modified” every time, even if there’s no actual difference from the “old” file.) Note that your disk space requirements get big fast. If a complete scrape is X, then the amount of disk space you need is approximately 2 * X + .15 * X * number of updates. So if your baseline scrape is 100 gigabytes, this works out to a full terabyte after about a year of weekly updates.

Finally, when you’re ready to analyze it, just use mdfind (or grep) to search the backup/ directory (and its subdirectories) for the term whose diffusion you’re trying to track and pipe the results to a text file. Then use a regular expression to parse each line of this query into the timestamp and website components of the file path to see on which dates each website used your query term — exactly the kind of data you need for event history. Furthermore, you can actually read the underlying files to get the qualitative side of it.
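
For instance, a rough sketch of that search-and-parse step (the query term is obviously a placeholder, and I’m assuming the directory layout above) might look like this:

# from the project directory: list every archived file containing the term (grep shown here; mdfind works similarly on a Mac)
grep -rl "my query term" backup/ > hits.txt
# each line of hits.txt is a path like backup/20100928/www.example.com/page.htm,
# so pull the snapshot name and website out of the path and dedupe
sed -E 's|^backup/([^/]+)/([^/]+)/.*|\1,\2|' hits.txt | sort -u > events.csv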

So on to the scraping script itself. The wget part of it looks like this:

# datestamp for this run, e.g. 20100928
DATESTAMP=`date '+%Y%m%d'`
cd ~/Documents/project
mkdir logs/$DATESTAMP
cd current
# -S logs the server responses, -r --level=3 follows links up to three levels deep,
# -R skips big media files, and --input-file reads the list of starting URLs from links.txt
wget -S --output-file=../logs/$DATESTAMP/wget.log --input-file=../links.txt -r --level=3 -R mpg,mpeg,mp4,au,mp3,flv,jpg,gif,swf,wmv,wma,avi,m4v,mov,zip --tries=10 --random-wait --user-agent=""

That’s what it looks like the first time you run it. When you’re just trying to update “current/” you need to change “wget -S” to “wget -N”, but aside from that this first part is exactly the same. Also note that if links.txt is long, I suggest you break it into several parts. This will make it easier to rerun only part of a large scrape, for instance if you’re debugging, or there’s a crash, or if you want to run the scrape only at night but it’s too big to run completely in a single night. Likewise, it will also allow you to parallelize the scraping.
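
If you want to do that, a minimal sketch of the splitting (the chunk size and file names are arbitrary, and I’ve trimmed the wget flags for readability) would be:

# from the project directory: break links.txt into chunks of 50 URLs each (links_aa, links_ab, ...)
split -l 50 links.txt links_
# then give each chunk its own wget job and run them in parallel
cd current
for f in ../links_??; do
  wget -N --output-file=../logs/$DATESTAMP/wget_$(basename $f).log --input-file=$f -r --level=3 --random-wait --user-agent="" &
done
wait    # let all the jobs finish before moving on to rsync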

Now for the rsync part. After your first run of wget, run this code.

cd ..
# copy the whole first scrape into an archival baseline directory
rsync -a current/ backup/baseline/

After your update wget runs, you do this.

cd ..
# make a hard-link copy of the baseline so that unchanged files don't take up any extra space
cp -al backup/baseline/ backup/$DATESTAMP/
# then overwrite only the files that actually changed; --delete drops pages that have disappeared from the site
rsync -av --delete current/ backup/$DATESTAMP/
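
Since the whole point is to run this on a schedule, once the script works you can hand it to cron. A minimal crontab entry (the script name and paths are just placeholders) that runs the whole scrape-then-sync script every Sunday at 2am would be:

# m h dom mon dow  command
0 2 * * 0 /home/username/scrape_and_sync.sh >> /home/username/Documents/project/logs/cron.log 2>&1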

* The reason to use Linux for data collection is that OS X doesn’t include wget and has an older version of the cp command, though it’s possible to solve both issues by using Fink to install wget and by rewriting cp in Mac/BSD syntax. The reason to use Mac for data analysis is that mdfind is faster (at least once it builds an index) and can read a lot of important binary file formats (like “.doc”) out of the box, whereas grep only likes to read text-based formats (like “.htm”). There are apparently Linux programs (e.g., Beagle) that allow indexed search of many file formats, but I don’t have personal experience with using them as part of a script.

** I adapted this use of hard links and rsync from this tutorial, but note that there are some important differences. He’s interested in a rolling “two weeks ago,” “last week,” “this week” type of thing, whereas I’m interested in absolute dates and don’t want to overwrite them after a few weeks.

September 28, 2010 at 4:26 am
