Archive for February, 2010

More R headaches

| Gabriel |

I’ve continued to play with the R package igraph and I think I’ve gotten the hang of igraph itself but R itself is still pretty opaque to me and that limits my ability to use igraph. Specifically, I’m trying to merge diffusion data onto a network graph and plot it as kind of a slideshow, where each successive image is a new period. I can actually do this, but I’m having trouble looping it and so my code is very repetitive. [Update, thanks to Brian Rubineau I’ve gotten it to work]

First the two tricks that I have solved.

1. Merging the diffusion data (a vertex-level trait) onto the network.

The preferred way to do this is to read the network file into R and then merge in the vertex-level trait (or, more technically, read the vertex data and associate it with the network data). This has proven really difficult for me, so instead I wrote an alternate version of stata2pajek that lets me do this within Stata. The upside is that I spend more time in Stata and less time in R; the downside is that I need a separate “.net” file for every version of the graph.

2. Getting the different versions of the graph to appear comparable.

It does no good to have multiple versions of a graph if they don’t look at all similar, which is what you get if you generate the layout on the fly as part of each plot. The solution to this is to first generate a layout (the object “la”) then apply it to each graph by defining the “layout” parameter to “=la” within the “plot.igraph()” function.

So instead of this:

plot.igraph(chrnetbounded, layout=layout.fruchterman.reingold, vertex.size=4, vertex.label=NA, vertex.color="red", edge.color="gray20", edge.arrow.size=0.3, margin=0)

Do this:

la = layout.fruchterman.reingold(chrnetbounded)
plot.igraph(chrnetbounded, layout=la, vertex.size=4, vertex.label=NA, vertex.color="red", edge.color="gray20", edge.arrow.size=0.3, margin=0)

Now the problem I can’t solve, at least with anything approaching elegance.

3. Looping it.

Currently my code is really long and repetitive. I want one graph per week for a whole year. To make 52 versions of the graph, I need about two hundred lines of code, whereas I should be able to just loop my core four lines of code 52 times: pdf(); read.graph(); plot.igraph(). The repetitive version works, but it is ridiculous.

The thing is that I can’t figure out how to make a loop in R where the looping local feeds to be part of a filename. This kind of thing is trivially easy in Stata since Stata expands locals and then interprets them. For example, here’s how I would do this in Stata. (I’m using “twoway scatter” as a placeholder for “plot.igraph()”, which of course doesn’t exist in Stata).

forvalues i=0/52 {
	use ties_bounded`i', clear
	twoway scatter x y
	graph export chrnet_hc`i'.png, replace
}

R not only doesn’t let you do this directly, but I can’t even figure out how to do it by adding an extra step where the looped local feeds into a new object to write the filename. (The problem is that the “paste()” function adds whitespace). So basically I’m at an impasse and I see three ways to do it:

  • Wait for somebody to tell me in the comments how to do this kind of loop in R. (hint, hint).
  • Give up on writing this kind of loop and just resign myself to writing really repetitive R code.
  • Give up on using igraph from within R and learn to use it from within Python. I have basically zero experience using Python but it has a good reputation for usability. In fact, it only took me, a complete Python-noob, about ten minutes to figure out what’s been a real stumper in R. I haven’t yet worked igraph into this loop but I’m thinking it can’t be that hard.
for i in range(0,52):
     datafile='ties_bounded%d' % i
     # igraph code here that treats the python object "datafile" as a filename

Update: Brian Rubineau suggested a better way to use the “paste()” function, which makes it functionally equivalent to the second line of the Python code above. The catch is that you have to add a “sep=""” parameter to suppress the whitespace that had been annoying me. I thought I had tried this already, but apparently not. Anyway, the Python / R method of first defining a new object and then calling it is an extra step compared to Stata loops (where the looping local can expand directly), but it’s still reasonably easy. Here’s my complete R code, which I’m now very happy with.

# File-Name:       chrnetwork.R                 
# Date:            2010-02-26
# Created Date:    2009-11-24                               
# Author:          Gabriel Rossman                                       
# Purpose:         graph CHR station network
# Data Used:
# Packages Used:   igraph    
#ties bounded to only top 40, includes adoption time color-codes, but use is optional
chrnetbounded <- read.graph("", c("pajek"))
la = layout.fruchterman.reingold(chrnetbounded)  #create layout for use on several related graphs
#graph structure only
 plot.igraph(chrnetbounded, layout=la, vertex.size=4, vertex.label=NA, vertex.color="red", edge.color="gray20", edge.arrow.size=0.3, margin=0)
#graph color coded diffusion
 plot.igraph(chrnetbounded, layout=la, vertex.size=4, vertex.label=NA, edge.color="gray60", edge.arrow.size=0.3, margin=0)
#do as flipbook
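for(i in 0:52) {
	datafile<-paste('ties_bounded_hc',i,'.net', sep="")
	pngfile<-paste('~/Documents/book/images/chrnet_hc',i,'.png', sep="")
	chrnetbounded <- read.graph(datafile, c("pajek"))
	png(pngfile)  #assumed step: open a png device so the plot is written to pngfile
	plot.igraph(chrnetbounded, layout=la, vertex.size=4, vertex.label=NA, edge.color="gray60", edge.arrow.size=0.3, margin=0)
	dev.off()  #assumed step: close the device so each week gets its own file
}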
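
For comparison, here is a minimal Python sketch of just the filename-building step from the loop above. The helper name week_files and its image_dir argument are made up for illustration; the file name patterns are copied from the R code, and the igraph calls are left as a comment since I haven't worked them out in Python.

```python
def week_files(i, image_dir="~/Documents/book/images"):
    """Return the (network file, image file) names for week i.

    week_files and image_dir are hypothetical names; the patterns
    mirror the paste(..., sep="") calls in the R loop.
    """
    datafile = "ties_bounded_hc%d.net" % i
    pngfile = "%s/chrnet_hc%d.png" % (image_dir, i)
    return datafile, pngfile


# Weeks 0 through 52, matching 0:52 in the R loop
for i in range(0, 53):
    datafile, pngfile = week_files(i)
    # igraph code would go here, reading datafile and writing pngfile
```

The string formatting plays the same role as “paste()” with “sep=""”: the loop counter is spliced into the name with no whitespace around it.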

February 28, 2010 at 1:23 pm 9 comments


| Gabriel |

I wrote this little program to create a symmetric id for a dyad so that it will have the same value regardless of who is ego and who is alter. I wrote it because one of my grad students was having trouble with a dataset on mergers in that it was hard to get the computer to understand that “exxon_mobil” is the same thing as “mobil_exxon”. The solution is to just get it so that the two components always combine in alphabetical order (or any other order, it doesn’t matter, as long as it’s consistent). If you’re using numeric key variables you might want to “string()” them before running it.

capture program drop dyadkey
program define dyadkey
	local a_var `1'
	local b_var `2'
	if "`3'"=="" {
		local dyadkey "dyadkey"
	}
	else {
		local dyadkey `3'
	}
	gen `dyadkey'=""
	local lastrow=[_N]
	forvalues row=1/`lastrow' {
		local aval=`a_var' in `row'
		local bval=`b_var' in `row'
		local ab="`aval'"+"_"+"`bval'"
		local ba="`bval'"+"_"+"`aval'"
		if "`ab'">"`ba'" {
			quietly replace `dyadkey'="`ab'" in `row'
		}
		if "`ba'">="`ab'" {
			quietly replace `dyadkey'="`ba'" in `row'
		}
	}
end
It takes as arguments the ego id, the alter id, and (optional) the desired name for the new dyad id. For example:
dyadkey part1_cusip part2_cusip joint_cusip
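
The same idea can be sketched in a few lines of Python. This is only an illustration: the function name dyad_key is made up, and it puts the two ids in alphabetical order, whereas the Stata program above keeps whichever concatenation sorts last. Either rule works, since all that matters is that the order is consistent.

```python
def dyad_key(a, b, sep="_"):
    """Return an id for the pair that does not depend on argument order.

    Both ids are converted to strings first, much like running
    string() on numeric keys before calling the Stata program.
    """
    first, second = sorted([str(a), str(b)])
    return first + sep + second


# Both orderings of the merger example give the same key
print(dyad_key("exxon", "mobil"))
print(dyad_key("mobil", "exxon"))
```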

February 26, 2010 at 4:47 am

Soc of Mass Media, week 8

| Gabriel |

In Monday’s lecture I talked about social networks, especially creativity and status. On Wednesday I discussed genre, including producer coordination, audience reception, and trajectory. Next week I’m doing gatekeeping and the origins and practice of journalistic objectivity.

February 25, 2010 at 2:38 pm 3 comments

Misc Links, etc

| Gabriel |

  • David Grazian has a two part interview with the rock critic Chuck Klosterman at the Contexts podcast (part one and part two). There’s a lot of neat stuff in there about cultural reception, but the thing that really grabbed me was that Klosterman has become cynical about the interview as being anything more than an arena for cultural scripts, a suspicion I’ve shared.
  • There was just a big lawsuit over age discrimination in Hollywood (Variety and KCRW’s The Business). I haven’t seen it mentioned anywhere, but my bet is that at least one Bielby testified given that they published on exactly this issue in this industry and they do the expert witness thing very effectively.
  • Slate has just started a series of stories on how the army used social network analysis to get Saddam Hussein. So far it looks well worth reading but the “never before told” puffery is more than a little exaggerated. I remember hearing this story years ago and when David Petraeus rewrote the army field manual on counter-insurgency he added a (very good) chapter on social network analysis. Furthermore, Mark Bowden tells a similar story (albeit without the formal analysis) in his amazing book on the decline and fall of Pablo Escobar.
  • On a less grim, but equally sociological, note, Slate had a very cute video slide show on how films use class markers to make glamorous actresses look and sound working class.
  • In yet another Slate article, Reihan Salam writes a whimsical tongue-in-cheek rant about how white the Winter Olympics is. I think the article is funny, but what’s really interesting is the comments thread, most of which has completely missed the sarcasm of the article and imagines Salam as some kind of ethnic grievance monger rather than what he actually is, which is a Republican policy wonk with a weird sense of humor. There’s got to be a story in this about framing, political discourse, and all that Bill Gamson type stuff.
  • The Wall Street Journal has an article on autism research that relies heavily on Peter Bearman’s ginormous autism project. It’s a good article but if you’ll permit me my own grievance mongering, I think it’s interesting how the scientific pecking order comes through. The article doesn’t include the words “sociology” or “sociologist,” instead identifying him only as “Dr. Bearman” from “Columbia University.” In contrast, the other experts interviewed for the piece are identified as “a child psychiatrist at the UCLA Center for Autism Research and Treatment” and “a CDC epidemiologist,” or generically as “medical experts.” That is, it seems like the journalist, probably correctly, thought it would detract from the authority of the report to attribute it to somebody who doesn’t own a lab coat.
  • Turns out the reliability problems of the cloud aren’t just an issue with airplanes but, you know, at my desk. For the last few weeks I’ve had a lot of trouble reliably connecting to any Google service from UCLA. (No problems from home). This is just annoying when I want to read my RSS feeds but is a real problem when I’m trying to do things like check my calendar to make appointments. As such I’ve increasingly been migrating my stuff off the cloud and onto local applications on my laptop (which I have with me pretty much all of the time), treating the cloud as little more than a syncing platform. For instance, I access GMail through which lets me compose and read old mail even when I can’t connect to the service. For search I’ve mostly been using Bing for the simple reason that it’s more reliable, even though I prefer Google. The promise of the cloud was supposed to be that you can access your resources from any computer but it’s turning out that I can’t access them from the place I work most. I had been considering getting an ARM netbook running Android or Chrome, but what’s the point if it would turn into a paperweight whenever the server is lagging?

February 23, 2010 at 4:43 am 2 comments

Soc of Mass Media, weeks 6 and 7

| Gabriel |

Monday of last week I talked about starving artists and Baumol’s disease. (Note: there was some technical trouble last Monday, so fast forward to 2:30 to skip it.) Wednesday I talked about long-term contracts (and, following Gary Becker, whether human capital development is carried by the firm or the worker) and about Galenson’s conceptualist/experimentalist dichotomy. There was no class this Monday for Presidents’ Day. This Wednesday I talked about teamwork, especially role combination/separation (Baker and Faulkner), spillovers (myself, Esparza, and Bonacich), and sorting (Faulkner and Anderson).

Next week, I’m doing social networks and genre. The latter lecture will be a new prep for me but I’ve been increasingly interested in it because of Jenn and Pete’s paper, Greta Hsu’s work on Hollywood, Ezra’s work on typecasting, and my own work on reggaeton. It’s older, but I also see Becker’s Art Worlds as being mostly about genre, especially the “conventions” chapter. Overall, the course was originally organized around the soc 101 trifecta of capital, labor, and state, but I’ve gradually moved away from that framework and made it increasingly organized around active research concerns in sociology (especially econ soc) rather than our discipline’s quaint pedagogical obsession with our imagined past.

Just a reminder, the best way to get the course is iTunes U, as I include a pdf of the slides in the RSS and I mark up the meta-text of the MP3s. There’s also a generic RSS feed for non-iTunes users, but unfortunately I don’t have the admin rights to mark it up and add non-audio content.

February 18, 2010 at 5:25 pm

A literary style, darkly

| Gabriel |

The core theoretical statement of my recent Oscars article with Esparza and Bonacich (ungated version) is:

This mismatch between the essentially collaborative nature of most arts and the essentially individualistic nature of most awards provides us analytic leverage to see how individual achievements are assessed when critical observers have access only to the collaborative efforts within which these achievements are embedded. We do not see individual talent face to face, but through the glass darkly of social context and team effort.

In the course of getting this article to publication, I’ve found a lot of people don’t like the phrase “glass darkly.” I think the real issue is simply that they aren’t familiar with the reference to the KJV of Paul’s first letter to the Corinthians (Chapter 13):

Charity never faileth: but whether there be prophecies, they shall fail; whether there be tongues, they shall cease; whether there be knowledge, it shall vanish away. For we know in part, and we prophesy in part. But when that which is perfect is come, then that which is in part shall be done away. When I was a child, I spake as a child, I understood as a child, I thought as a child: but when I became a man, I put away childish things. For now we see through a glass, darkly; but then face to face: now I know in part; but then shall I know even as also I am known.

Given that many people don’t get the reference, why use it? Basically, I believe in literary style, which is why I didn’t ditch it the first time somebody misinterpreted it or said it sounds weird. I can break down the style issue into a few parts:

  1. It powerfully expresses the idea that our awareness is limited, whether that be a limitation on the ability to perceive the kingdom of God in Paul’s example or the limited ability to perceive individual merit in my example. This idea of the limits of perception is the usual meaning of the phrase “glass darkly” and is why Philip K. Dick, among many others, has worked with the reference.
  2. One of the running themes in the paper is the charismatic nature of art, and the several religious allusions (“consecration,” “Matthew Effect,” “Advent,” “glass darkly”) in the paper help set the tone that we’re dealing with something transcendent and not wholly rational. This is why we didn’t use the slightly more familiar secular reference of “shadows in a cave.”
  3. 1 Corinthians 13 KJV is a beautiful and relatively familiar passage from one of the core documents of the Western tradition and I have a normative and aesthetic belief that both the Bible and the secular classics ought to remain in the cultural repertoire. In this sense, if some people don’t recognize the reference it makes it all the more important to use it, as an act of cultural preservation.

February 16, 2010 at 5:35 am 5 comments

Memetracker into Stata

| Gabriel |

A few months ago I mentioned the Memetracker project to scrape the internet and look for the diffusion of (various variants of) catchphrases. I wanted to play with the dataset but there were a few tricks. First, the dataset is really, really big. The summary file is 862 megabytes when stored as text and would no doubt be bigger in Stata (because of how Stata allocates memory to string variables). Second, the data is in a moderately complicated hierarchical format, with “C” specific occurrences, nested within “B” phrase variants, which are in turn nested within “A” phrase families. You can immediately identify whether a row is A, B, or C by the number of leading tabs (0, 1, and 2, respectively).

I figured that the best way to interpret this data in Stata would be to create two flat files, one a record of all the “A” records that I call “key”, and the other a simplified version of all the “C” records but with the key variable to allow merging with the “A” records. Rather than do this all in Stata, I figured it would be good to pre-process it in perl, which reads text one line at a time and thus is well-suited for handling very large files. The easy part was to make a first pass through the file with grep to create the “key” file by copying all the “A” rows (i.e., those with no leading tabs).

Slightly harder was to cull the “C” rows. If I just wanted the “C” rows this would be easy, but I wanted to associate them with the cluster key variable from the “A” rows. This required looking for “A” rows, copying the key, and keeping it in memory until the next “A” row. Meanwhile, every time I hit a “C” row, I copy it but add in the key variable from the most recent “A” row. Both for debugging and because I get nervous when a program doesn’t give any output for several minutes, I have it print to screen every new “A” key. Finally, to keep the file size down, I set a floor to eliminate reasonably rare phrase clusters (anything with fewer than 500 occurrences total).

At that point I had two text files, “key” which associates the phrase cluster serial number with the actual phrase string and “data” which records occurrences of the phrases. The reason I didn’t merge them is that it would massively bloat the file size and it’s not necessary for analytic purposes. Anyway, at this point I could easily get both the key and data files into Stata and do whatever I want with them. As a first pass, I graphed the time-series for each catchphrase, with and without special attention drawn to mentions occurring in the top 10 news websites.

Here’s a sample graph.

Here’s the perl file:

#!/usr/bin/perl
# by ghr
#this script cleans the "phrase cluster" data
#script takes the (local and unzipped) location of this file as an argument
#throws out much of the data, saves as two tab flatfiles
#"key.txt" which associates cluster IDs with phrases
#"data.txt" which contains individual observations of the phrases
# input
# A:  <ClSz>  <TotFq>  <Root>  <ClId>
# B:          <QtFq>   <Urls>  <QtStr>  <QtId>
# C:                   <Tm>    <Fq>     <UrlTy>  <Url>
# output, key file
# A:  <ClSz>  <TotFq>  <Root>  <ClId>
# output, data file
# C:<ClID>	<Tm>	<UrlTy>	<URL>
# make two passes.

use warnings; use strict;
die "usage: <phrase cluster data>\n" unless @ARGV==1;

#define minimum number of occurences a phrase must have
my $minfreq = 500;

my $rawdata = shift(@ARGV);
# use bash grep to write out the "key file"
system("grep '^[0-9]' $rawdata > key.txt");

# read again, and write out the "data file"
# if line=A, redefine the "clid" variable
# optional, if second field of "A" is too small, (eg, below 100), break the loop?
# if line=B, skip
# if line=C, write out with "clid" in front
my $clid  ;
open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">data.txt") or die "error creating data.txt\n";
print OUT "clid\ttm\turlty\turl\n";
while (<IN>) {
	#match "A" lines by looking for numbers in field 0
	if($_=~ /^\d/) {
		my @fields = split("\t", $_); #parse as tab-delimited text
		if($fields[1] < $minfreq) { last;} #quit when you get to a rare phrase
		$clid = $fields[3]; #record the ClID
		$clid =~ s/\015?\012//; #manual chomp
		print "$clid ";
	}
	#match "C" lines, write out with clid
	if ($_ =~ m/^\t\t/) {
		my @fields = split("\t", $_);
		print OUT "$clid\t$fields[2]\t$fields[4]\t$fields[5]\n";
	}
}
close IN;
close OUT;
print "\ndone\n";
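
The carry-forward logic in the second pass may be easier to see in a short Python sketch. This is an illustration under assumptions, not a translation I have run on the real file: the function name flatten_occurrences and the sample rows are made up, and, like the perl script, it assumes the “A” rows are sorted so that it is safe to stop at the first rare cluster.

```python
def flatten_occurrences(lines, minfreq=500):
    """Yield (clid, tm, urlty, url) for each "C" row.

    Each "C" row is tagged with the ClId of the most recent "A" row.
    """
    clid = None
    for line in lines:
        line = line.rstrip("\r\n")
        if line[:1].isdigit():
            # "A" row: no leading tab, fields are ClSz, TotFq, Root, ClId
            fields = line.split("\t")
            if int(fields[1]) < minfreq:
                break  # stop at the first rare cluster
            clid = fields[3]
        elif line.startswith("\t\t"):
            # "C" row: two leading tabs, then Tm, Fq, UrlTy, Url
            fields = line.split("\t")
            yield clid, fields[2], fields[4], fields[5]
        # "B" rows (one leading tab) fall through and are skipped


# Invented rows in the A/B/C layout described in the header comments
sample = [
    "2\t600\tsome phrase\t17\n",
    "\t400\t3\tsome phrase\t101\n",
    "\t\t2008-09-01 12:00:00\t1\tM\thttp://example.com/a\n",
    "1\t10\trare phrase\t18\n",
    "\t\t2008-09-02 12:00:00\t1\tB\thttp://example.org/b\n",
]
rows = list(flatten_occurrences(sample))
```

Because it is a generator, it only ever holds one line and one cluster id in memory, which is the same reason perl copes with a file this size.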

And here’s the Stata file:

set mem 500m
set more off
cd ~/Documents/Sjt/memetracker/
*import key, or "A" records
insheet using key.txt, clear
ren v1 clsz
ren v2 totfq
ren v3 root
ren v4 clid
sort clid
lab var clsz "cluster size, n phrases"
lab var totfq "total frequency"
lab var root "phrase"
lab var clid "cluster id"
save key, replace
*import data, or "C" records
insheet using data.txt, clear
drop if clid==.
gen double timestamp=clock(tm,"YMDhms")
format timestamp %tc
drop tm
gen hostname=regexs(1) if regexm(url, "http://([^/]+)") /*get the website, leaving out the filepath*/
drop url
gen blog=0
replace blog=1 if urlty=="B"
replace blog=1 if hostname==""
gen technoratitop10=0 /*note, as of 2/3/2010, some mismatch with late 2008 memetracker data*/
foreach site in {
	replace technoratitop10=1 if hostname=="`site'"
}
gen alexanews10=0 /*as w technorati, anachronistic*/
foreach site in {
	replace alexanews10=1 if hostname=="`site'"
}
drop urlty
sort clid timestamp
contract _all /*eliminate redundant "C" records (from different "B" branches)*/
drop _freq
save data, replace
*draw a graph of each meme's occurrences
levelsof clid, local(clidvalues)
foreach clid in `clidvalues' {
	disp "`clid'"
	quietly use key, clear
	quietly keep if clid==`clid'
	local title=root in 1
	quietly use data, clear
	histogram timestamp if clid==`clid', frequency xlabel(#5, labsize(small) angle(forty_five)) title(`title', size(medsmall))
	graph export graphs/`clid'.png, replace
	twoway (histogram timestamp if clid==`clid') (line alexanews10 timestamp if clid==`clid', yaxis(2)), legend(off) xlabel(#5, labsize(small) angle(forty_five)) title(`title', size(medsmall))
	graph export graphs_alexa/`clid'.png, replace
}
*have a nice day
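
As a cross-check on the hostname step in the Stata file, here is a minimal Python sketch of the same regular expression. The function name hostname and the example URLs are made up; the pattern is the one passed to regexm() above, and a URL that doesn't match gives an empty string, roughly as the Stata code leaves hostname missing.

```python
import re

# Same pattern as in the Stata regexm() call
HOST_RE = re.compile(r"http://([^/]+)")


def hostname(url):
    """Return the site part of a URL, leaving out the file path."""
    match = HOST_RE.search(url)
    return match.group(1) if match else ""


print(hostname("http://www.example.com/2008/09/story.html"))
```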

February 8, 2010 at 4:31 am 7 comments

Older Posts

The Culture Geeks