Affiliation network 2 edge list

March 17, 2010 at 5:30 am

| Gabriel |

The grad student for whom I wrote dyadkey came back after realizing he had a more complicated problem than we thought. Long story short, he had a dataset where each record was an affiliation, with one of the variables being a space-delimited list of all the members of the affiliation. That is, his dataset was “wide” and basically looked like this:

j	i
a	"1 2 3 4"
b	"2 5 6"

Note that in most citation databases the data is field-tagged, but the author tag itself is “wide.” Hence this code would be useful for making a co-authorship network, but you’d first have to rearrange the data so that the publication key is “j” and the author list is “i”.

In contrast, “long” affiliation data like IMDB is structured like this:

j	i
a	1
a	2
a	3
a	4
b	2
b	5
b	6
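
To make the difference between the two layouts concrete, here is a minimal Python sketch (not part of the Stata program below; the function name `wide_to_long` is made up for illustration) that expands the wide records above into long ones and keeps each member's position in the original list.

```python
def wide_to_long(wide_rows):
    """Expand (affiliation, "m1 m2 ...") records into (affiliation, member, position) rows."""
    long_rows = []
    for affiliation, members in wide_rows:
        # split() with no argument splits on any run of whitespace
        for position, member in enumerate(members.split(), start=1):
            long_rows.append((affiliation, member, position))
    return long_rows


# The two example affiliations from the wide table above
wide = [("a", "1 2 3 4"), ("b", "2 5 6")]
long_rows = wide_to_long(wide)
for row in long_rows:
    print(row)
```

The position column plays the same role as "membernumber" in the Stata code: it records where each member sat in the original list.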

My student wanted to project the affiliation network into an edge list at the “i” level. As before, he only wanted each edge in once (so if we have “1 & 2” we don’t also want “2 & 1”). To accomplish this, I wrote him a program that takes as arguments the name of the “affiliation” variable and the name of the “members list” variable. To do this it first reshapes to a long file (mostly lines 11-18), then uses joinby against itself to create all permutations (mostly lines 22-24), and finally drops redundant cases by only keeping dyads where ego was listed after alter in the original list of affiliation members (mostly lines 25-32). With minor modifications, the script would also work with affiliation data that starts out as long, like IMDB. Also note that it should work well in combination with stata2pajek classic (“ssc install stata2pajek”) or the version that lets you save vertex traits like color.

capture program drop affiliation2edges
program define affiliation2edges
	local affiliation `1'
	local memberslist `2'
	tempfile dataset
	tempvar sortorder
	quietly gen `sortorder'=_n /*remember the original row order*/
	sort `affiliation'
	quietly save "`dataset'"
	keep `affiliation' `memberslist'
	gen wc=wordcount(`memberslist')
	quietly sum wc
	local maxsize=`r(max)'
	forvalues word=1/`maxsize' {
		quietly gen member`word'=word(`memberslist',`word')
	}
	reshape long member, i(`affiliation') j(membernumber)
	quietly drop if member=="" /*drop the empty cell, n_dropped ~ sd(wc) */
	sort `affiliation'
	tempfile datasetlong
	quietly save "`datasetlong'"
	ren member member_a
	joinby `affiliation' using "`datasetlong'" /*all pairs of members within an affiliation*/
	ren member member_b
	ren membernumber membernumber_a
	quietly drop if member_a==member_b /*drop loops*/
	quietly gen membernumber_b=.
	forvalues w=1/`maxsize' {
		quietly replace membernumber_b=`w' if member_b==word(`memberslist',`w')
	}
	quietly drop if membernumber_a < membernumber_b /*keep only one version of a combination; ie, treat as "edges"*/
	drop membernumber_a membernumber_b `memberslist'
	lab var wc "n members in affiliation"
	sort `affiliation'
	merge m:1 `affiliation' using "`dataset'"
	disp "'from using' only are those with only one member in affiliation's memberlist"
	quietly sort `sortorder'
	quietly drop `sortorder'
end
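
For readers outside Stata, the same projection can be sketched in a few lines of Python. This is an illustration under assumptions, not a port of the program above: the function name `affiliation_to_edges` is hypothetical, it uses `itertools.combinations` in place of the joinby-and-drop steps, and which member of each pair comes first may differ from the Stata output.

```python
from itertools import combinations


def affiliation_to_edges(wide_rows):
    """Return one (affiliation, ego, alter) row per unordered pair of co-members."""
    edges = []
    for affiliation, members in wide_rows:
        # dict.fromkeys drops repeated names while keeping list order
        unique_members = list(dict.fromkeys(members.split()))
        # combinations() yields each pair once, so "1 & 2" never reappears as "2 & 1"
        for ego, alter in combinations(unique_members, 2):
            edges.append((affiliation, ego, alter))
    return edges


# Affiliation "c" has a single member and therefore contributes no edges
edges = affiliation_to_edges([("a", "1 2 3 4"), ("b", "2 5 6"), ("c", "7")])
for edge in edges:
    print(edge)
```

Generating each pair once avoids building every ordered permutation and then discarding half of them, which matters when a few affiliations are very large.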



  • 1. Brooks  |  July 20, 2010 at 10:09 pm

    My concern with this approach is that it is bimodal. Where we don’t lose information we should reduce to unimodal data. This would mean that affiliations are edges, not nodes.

    In the case of corporate alliances, we make a theoretical distinction by claiming that a node can be either an alliance or a corporation in a bimodal fashion. This grants an ontological status to the alliance as a distinct thing that relates to corporations. Unimodally, we would say an alliance is equal to a set of relations among corporations.

    Neither configuration is necessarily “correct,” however the distinction is not trivial, as the unimodal and bimodal versions will have different network characteristics. In a bimodal configuration, an affiliation among 5 units would be represented as 6 nodes and 5 relations. In a unimodal configuration it would be represented as 5 nodes and 25 relations, as affiliations are represented as cliques among all members of the affiliation.

    Thus somewhat paradoxically a unimodal configuration has more complexity to it because a reduction in nodes yields a proliferation of ties. In terms of attribute data the problem is how to keep track of the edges. Matrix format treats edges as quantitative and thus aggregable values. A unimodal approach may yield more meaningful network information, but under normal circumstances we would lose qualitative distinctions among alliances.

    Of course it matters what one is trying to accomplish analytically. In my situation, I’m choosing to go the unimodal route because I’m concerned with clustering and partitioning algorithms and it is hard for me to interpret the meaning of bimodal clusters. One concern is that there can be no triangles in a bimodal network; the smallest cycles would have four edges as units of the same mode can only connect to each other indirectly. Converting the network allows triangles and thus allows me to calculate clustering coefficients and local densities. More to the point of my research, though I haven’t tested it yet, my hunch is that community detection algorithms will be thrown off by a bimodal configuration because there will be less relational information to draw on. Briefly, I expect a bimodal configuration of the same data to exhibit greater variability for a given node in partition assignments when running community detection algorithms. Easily testable, so more on this later.

    • 2. gabrielrossman  |  July 21, 2010 at 2:39 pm

      the code quoted above takes bimodal data and projects it to being unimodal.

      you have a very interesting point that in some bimodal data (alliances and firms, artists and creative collaborations, etc) there is a clear ontological difference between the partitions whereas in a citation network they are all papers but are partitioned nonetheless because of time.

      btw, the hyper-clustering implied by projection can be really problematic if there are some large collaborations. for instance in IMDb, General Hospital has about 1000 credited actors whereas the typical film has about 25. since the number of edges in a collaboration clique is (i^2 - i) / 2, this means that having 40 times the typical number of actors implies about 1700 times the typical number of edges. fortunately i don’t think this is as much of an issue for citation networks since even review articles seldom have more than 200 cites.
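
      The arithmetic can be checked with a short Python sketch, reading the clique edge count as i(i-1)/2, the number of unordered pairs among i members. The cast sizes 25 and 1000 are the rough figures quoted in this comment, and `clique_edges` is a name made up for illustration.

```python
from math import comb


def clique_edges(n_members):
    """Number of undirected edges in a clique of n_members nodes: n(n-1)/2."""
    return comb(n_members, 2)


typical = clique_edges(25)    # a typical film
large = clique_edges(1000)    # a very large cast
ratio = large / typical
print(typical, large, round(ratio))  # prints: 300 499500 1665
```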

  • 3. Brooks  |  July 21, 2010 at 7:53 pm

    Whoops, got my edge formula way wrong, not the simple square. 5 nodes, 10 edges undirected.

    • 4. Brooks  |  July 21, 2010 at 8:18 pm

      What I meant about the ontological status is that it’s a thorny question whether we conceptualize the affiliation as an actor or a set of relationships. I’d like to see some concerted theoretical work on that question. For IMDB, it’s easy to “see” the affiliation independently of the actors, because it’s media and the media is a discernible “thing,” albeit a cultural and not a social object. Marx refused to grant ontological status to commodities, demanding that commodities could not have relationships with each other as only social objects could relate. I think it’s ok for us to represent social actors as relating to cultural objects, but cultural objects can’t relate to each other independently of actors.

      Interestingly, formatting our data requires us to confront these questions. It’s a tricky decision whether a movie should be given the status of a node like the actors in it. The reductionist would steadfastly refuse to grant ontological status to the superordinate unit, conceptualizing it only as the units that make it up.

      I’m constructing a cultural unit using citation data, that is, I want to isolate the cultural object to which social actors relate independent of the social relations among the actors themselves. I know what an article is as a cultural object, but the question of how to conceptualize the citation is a puzzle. I am betraying Marx by allowing cultural objects to relate to each other, which is the definition of fetishism.

  • 5. Brooks  |  January 19, 2013 at 5:27 pm

    I’m back on this again, and trying to do it with an extra-large bimodal edgelist, 2.5 million edges, the vast majority of which are pendants that become isolates when converting to a single mode. I wrote what I thought was a pretty straightforward loop to do this in R, but after running all night it was still going. Methinks that there is a memory problem, so perhaps a perl solution is in order?

    This basically operates on a sorted list and records an edge between two sources if the following line has the same target (which it will if the list is sorted). It skips isolates, which is an added bonus. But I thought it would run relatively quickly. I don’t understand the memory architecture of the different languages very well, but maybe this isn’t the job for R. Am debugging now to see if calculating the length of the list is what is slowing things down.

    for(i in 1:(d-1)){
    else break
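
    The R loop above did not survive intact, so the following is not a reconstruction of it. It is a minimal Python sketch of one way to project a two-column edge list that is already sorted by target, assuming rows are (source, target) pairs. The function name `project_sorted_edgelist` is hypothetical. It streams through the rows, holds only one target's sources in memory at a time, and records every pair of sources sharing a target rather than only adjacent lines.

```python
from itertools import combinations, groupby
from operator import itemgetter


def project_sorted_edgelist(rows):
    """Yield (source_1, source_2, target) for each pair of sources sharing a target.

    `rows` must be (source, target) pairs already sorted by target.
    """
    for target, group in groupby(rows, key=itemgetter(1)):
        sources = sorted({source for source, _ in group})
        # A target with a single source (a pendant) yields no pairs, so it is skipped
        for s1, s2 in combinations(sources, 2):
            yield (s1, s2, target)


rows = [("x", "t1"), ("y", "t1"), ("z", "t1"), ("x", "t2"), ("w", "t3"), ("y", "t3")]
projected = list(project_sorted_edgelist(rows))
for edge in projected:
    print(edge)
```

    Because the function is a generator, the projected edges can be written to disk one at a time instead of accumulating in a growing in-memory list.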
