Drop the entirely missing variables

| Gabriel |

One of the purely technical frustrations with GSS is that it’s hard to figure out if a particular question was in a particular wave. The SDA server is pretty good about kicking questions that weren’t asked out of extracts, but it still leaves some in. The other day I was playing with the 2008 wave and got sick of repeatedly saying “oooh, that looks interesting” only to find the variable was missing for my wave. (For instance, I thought it would be fun to run Biblical literalism against the Israel feeling thermometer, but no dice, at least for the 2008 wave.)

To get rid of these phantom variables, I wrote this little loop that drops variables with entirely missing data (or that are coded as strings, see below):

foreach varname of varlist * {
    quietly sum `varname'
    if `r(N)'==0 {
        drop `varname'
        disp "dropped `varname' for too much missing data"
    }
}

Unfortunately “sum” thinks that string variables have no observations so this will also drop strings. There’s a workaround, but it involves the “ds” command, which works in Stata 11 but has been deprecated and so may not work in future versions.

ds, not(type string)
foreach varname of varlist `r(varlist)' {
    quietly sum `varname'
    if `r(N)'==0 {
        drop `varname'
        disp "dropped `varname' for too much missing data"
    }
}
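
For readers who don't use Stata, the same idea can be sketched in Python. This is only an illustration under stated assumptions: the rows are plain dicts, None stands in for Stata's missing value, and both the helper name drop_all_missing and the toy variable names are made up for the example. Because it checks for non-missing values directly rather than counting numeric observations, string columns survive.

```python
# Hypothetical helper: drop columns that are missing in every row.
# Rows are plain dicts; None stands in for Stata's missing value.

def drop_all_missing(rows):
    """Return (cleaned_rows, dropped_names) for a list of dict records."""
    if not rows:
        return [], []
    names = list(rows[0].keys())
    # A column survives if at least one row has a non-missing value,
    # whether that value is numeric or a string.
    keep = [n for n in names if any(r.get(n) is not None for r in rows)]
    dropped = [n for n in names if n not in keep]
    cleaned = [{n: r.get(n) for n in keep} for r in rows]
    return cleaned, dropped


# Toy extract: the variable names and values are invented for illustration.
extract = [
    {"year": 2008, "bible": 1, "israel": None, "region": "pacific"},
    {"year": 2008, "bible": 2, "israel": None, "region": "south"},
    {"year": 2008, "bible": None, "israel": None, "region": "south"},
]
cleaned, dropped = drop_all_missing(extract)
for name in dropped:
    print(f"dropped {name}: missing in every row")
```

Note that a partially missing column (bible in the toy data) is kept; only columns with no data at all are dropped.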

7 comments December 2, 2009

R and TextMate

| Gabriel |

Now that I’ve started dabbling in R, I figured I needed to get my text editor to highlight the Klingon-esque syntax. TextWrangler and Smultron already support R, but getting it for TextMate requires the Terminal:

cd "~/Library/Application Support/TextMate/Bundles"
svn co http://svn.textmate.org/trunk/Bundles/R.tmbundle/

Note that 64-bit R is buggy, so if you have trouble piping scripts from TextMate to Rdaemon (i.e., the command line R running in the background), you can use the bundle editor to redirect it to “R32” instead of just “R”, which will force it to use the slightly slower but more reliable 32-bit R. Or if that’s too hard, just stick to piping to R.app instead of Rdaemon.

Also, as long as you’re playing with the TextMate library, you might as well install “GetBundles,” a GUI frontend for browsing the TextMate bundle server.
svn co http://svn.textmate.org/trunk/Review/Bundles/GetBundles.tmbundle/

Note that GetBundles (with “s”) supersedes the now defunct GetBundle (without “s”) that you might see mentioned if you google things like “TextMate Bundle” or “TextMate R syntax”.

Add comment December 1, 2009

Perl text library

| Gabriel |

I found this very useful library of Perl scripts for text cleaning. You can use them even if you can’t code Perl yourself. For instance, to transpose a dataset, just download the “transpose.pl” script to your ~/scripts directory and enter the shell command:
perl ~/scripts/transpose.pl row_col.txt > col_row.txt

The transpose script is particularly useful to me as I’ve never gotten Excel’s transpose function to work and for some bizarre reason Stata’s “xpose” command only works with numeric variables. You can even use these scripts directly from within a do-file like so:

tempfile foo1
tempfile foo2
outsheet using `foo1'.txt
shell perl ~/scripts/transpose.pl `foo1'.txt > `foo2'.txt
insheet using `foo2'.txt, clear
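
If Perl isn't handy, the transpose step itself is easy to sketch in Python. This is not the transpose.pl script from the library, just a minimal stand-in under my own assumptions: the input is tab-delimited text, short rows are padded with empty strings, and the function name transpose_delimited and the sample rows are made up for illustration. Since everything is treated as text, string variables are no problem.

```python
import csv
import io


def transpose_delimited(text, delimiter="\t"):
    """Swap rows and columns of delimited text; every cell is treated as a string."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    # Pad ragged rows so that zip() does not silently drop trailing cells.
    width = max((len(r) for r in rows), default=0)
    padded = [r + [""] * (width - len(r)) for r in rows]
    columns = zip(*padded)
    out = io.StringIO()
    csv.writer(out, delimiter=delimiter, lineterminator="\n").writerows(columns)
    return out.getvalue()


# Toy data, invented for illustration.
sample = "station\tcity\tformat\nKIIS\tLos Angeles\tCHR\nWHTZ\tNew York\tCHR\n"
print(transpose_delimited(sample), end="")
```

Transposing twice should give back the original rectangular table, which makes for a quick sanity check.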

1 comment November 30, 2009

some R baby steps

| Gabriel |

I’ve tried to learn R a few times but the syntax has always been opaque to my Stata-centric mind. I think the trick is to realize that two of the key differences are that:

  • whereas handles only come up in a few Stata commands (e.g., file, postfile, log), they are very important in R, what with all the “<-” statements
  • in R there’s a much fuzzier line between commands and functions than in Stata. What I mean by this is both the superficial thing of all the parentheses and also the more substantive issue that you often don’t put them one to a line where they just do something (like Stata commands) but instead put many to a line and feed them into something else (like Stata functions). Related to this is that the typical Stata line has the syntax “verb object, adverb” whereas the typical R line has the syntax “object <- verb(object2, adverb)”

The two combine in an obvious way with something as simple as opening a dataset, which is just use file in Stata but is filehandle <- read.table("file") in R. That is, read.table() is not a command but a function, and you assign what it returns to a handle. (And people say R isn’t intuitive!)

At least I think that’s a good way to think about the basic syntax — I suck at R and I really could be totally wrong about this. (Pierre or Kieran please correct me).

Anyway, I wrote my first useful R file the other day. It reads my Pajek-formatted network data on top 40 radio stations and plots the graph.

# File-Name: testgraph.R
# Date: 2009-11-20
# Author: Gabriel Rossman
# Purpose: graph CHR station network
# Data Used: ties_bounded.net
# Packages Used: igraph
library(igraph)
setwd("~/Documents/Sjt/radio/survey")
chrnet <- read.graph("ties.net", format="pajek")
pdf("~/Documents/book/images/chrnetworkbounded.pdf")
plot.igraph(chrnet, layout=layout.fruchterman.reingold, vertex.size=2, vertex.label=NA, vertex.color="red", edge.color="gray20", edge.arrow.size=0.3, margin=0)
dev.off()

The weird thing is that it works fine in R.app but breaks when I try to run R from the Terminal, regardless of whether I try to do it all in one line or first invoke R and then feed it the script. [Update: the issue is a 32-bit library and 64-bit R; the simple solution is to invoke "R32" rather than just plain "R". See the comments for details.] Here’s a session with both problems:

gabriel-rossmans-macbook-2:~ rossman$ Rscript ~/Documents/book/stata/testgraph.R
Error in dyn.load(file, DLLpath = DLLpath, ...) :
unable to load shared library '/Library/Frameworks/R.framework/Resources/library/igraph/libs/x86_64/igraph.so':
dlopen(/Library/Frameworks/R.framework/Resources/library/igraph/libs/x86_64/igraph.so, 10): Symbol not found: ___gmpz_clear
Referenced from: /Library/Frameworks/R.framework/Resources/library/igraph/libs/x86_64/igraph.so
Expected in: dynamic lookup

Error : .onLoad failed in 'loadNamespace' for 'igraph'
Error: package/namespace load failed for 'igraph'
Execution halted
gabriel-rossmans-macbook-2:~ rossman$ R 'source("~/Documents/book/stata/testgraph.R")'
ARGUMENT 'source("~/Documents/book/stata/testgraph.R")' __ignored__

R version 2.10.0 (2009-10-26)
Copyright (C) 2009 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> source("~/Documents/book/stata/testgraph.R")
Error in dyn.load(file, DLLpath = DLLpath, ...) :
unable to load shared library '/Library/Frameworks/R.framework/Resources/library/igraph/libs/x86_64/igraph.so':
dlopen(/Library/Frameworks/R.framework/Resources/library/igraph/libs/x86_64/igraph.so, 10): Symbol not found: ___gmpz_clear
Referenced from: /Library/Frameworks/R.framework/Resources/library/igraph/libs/x86_64/igraph.so
Expected in: dynamic lookup

 

Error : .onLoad failed in 'loadNamespace' for 'igraph'
Error: package/namespace load failed for 'igraph'
>

The problem seems to be that R (terminal) can’t find the igraph library. This is weird because R.app has no trouble finding it. Furthermore, I get the same error even if I make sure igraph is installed directly from R (Terminal) in the same R session:

chooseCRANmirror(graphics = FALSE)
install.packages("igraph")
source("/Users/rossman/Documents/book/stata/testgraph.R")

I guess that’s another difference with Stata: StataConsole knows where the ado library is. I’d like to be able to use the Terminal mode for R as this would let me reach my Nirvana-like goal of having a single script that does everything without any proximate human intervention. So I’ll just ask: how do I get R (Terminal) to run as reliably as R.app? Is this a naive question?

Or would it be better to try to feed R.app a “source” script from the command line? That would be much like how I can launch a do-file into the Stata GUI:
exec /Applications/Stata/StataMP.app/Contents/MacOS/stataMP ~/Documents/book/stata/import.do

10 comments November 24, 2009

I am shocked–shocked–to find scientists abusing peer review

| Gabriel |

A major climate lab in Britain was hacked (leaked?) last week and a lot of the material was really embarrassing. Stuff along the lines of obstruction of freedom of information requests, smoothing messy data, and using peer review and shunning to freeze out contradictory perspectives. From the WaPo write-up:

“I can’t see either of these papers being in the next IPCC report,” Jones writes. “Kevin and I will keep them out somehow — even if we have to redefine what the peer-review literature is!”

In another, Jones and Mann discuss how they can pressure an academic journal not to accept the work of climate skeptics with whom they disagree. “Perhaps we should encourage our colleagues in the climate research community to no longer submit to, or cite papers in, this journal,” Mann writes.

“I will be emailing the journal to tell them I’m having nothing more to do with it until they rid themselves of this troublesome editor,” Jones replies.

All I can say is:

Most people have been looking at this in terms of the science or politics of climate change, but I’m completely with Robin Hanson in thinking that those are non sequiturs and what’s really interesting about this is the (office) politics of science. I mean, is anyone who has ever been through peer review at all surprised to hear that peer reviewers can be malicious assholes willing to use power plays to effect closure against minority perspectives?

On the other hand, while I think this is an affront to decency, this doesn’t really give me severe problems as a matter of scientific epistemology. Sure, I’d rather that scientists took J.S. Mill’s ideal of the marketplace of ideas with a “let me hear you out and then if I’m still unconvinced I’ll give you my good faith rebuttal.” Nonetheless, I’m enough of a Quinean/Kuhnian to think that science isn’t about isolated findings but the big picture and the dominant perspective is probably still right, even if its adherents aren’t themselves exactly Popperians actively seeking out (and failing to find) evidence against their perspective.

Add comment November 23, 2009

12 weeks of culture

| Gabriel |

Jenn posted her draft syllabus for grad soc of culture / cultural sociology. It looks like about the best survey of the literature you could get in 12 weeks and about a thousand pages of material. Aside from simply choosing good readings, she’s managed to organize them into weeks in a way that imposes a good sense of order on an often messy and amorphous set of issues. She also does a much better job than I do of covering all parts of the field. (My syllabus is unabashedly production-centric, as it’s part of a two quarter sequence with a sister course taught by a colleague on meaning-centric approaches). I highly recommend checking it out for any grad students prepping for a field exam or faculty prepping a course.

Add comment November 23, 2009

A few stats pedagogy notes

| Gabriel |

I’ve found the OS X zoom feature to be very effective when teaching stats. Most of the time I have the projector at full resolution (so any given thing on it looks small), but when I want to show a piece of code or output I just mouse over to it and zoom (hold “control” and scroll-wheel or two-finger swipe up). This lets me keep my Mac at the regular resolution and use Stata and TextMate in class instead of setting it to a low resolution and/or putting zoomed screenshots in Powerpoint or Keynote. This both has a more improvisatory feel and cuts down on the purely technical aspects of course prep.

Speaking of Powerpoint/Keynote, one of the problems with teaching code is that you lose syntax highlighting. However, you can keep it by copying from a text editor as RTF.

Finally, via Kai Arzheimer, I see the new site Teaching With Data, which includes both sample datasets and pedagogical materials.

Add comment November 17, 2009

Programming notes

| Gabriel |

I gave a lecture to my grad stats class today on Stata programming. Regular readers of the blog will have heard most of this before but I thought it would help to have it consolidated and let more recent readers catch up.

Here are my lecture notes.

2 comments November 12, 2009

Science (esp. econ) made fun

| Gabriel |

In a review essay, Vromen talks about the (whodathunkit) popular book/magazine-column/blog genre of economics-made-fun that’s become a huge hit with the mass audience in the last 5 to 10 years. Although Vromen doesn’t mention it, this can be seen as a special case of the science-can-be-fun genre (e.g., Stephen Jay Gould’s short essays that use things like Hershey bars and Mickey Mouse to explain reasonably complex principles of evolutionary biology.)

Vromen makes a careful distinction from the older genre of economists-can-be-funny (currently exemplified by the stand-up economist), which is really a special case of the general genre of scientists doing elaborate satires of their own disciplines for the benefit of their peers. There is an entire journal of this, but my all time favorite example is a satire of mid-20th century psychology in the form of a review of the literature on when people are willing to pass the salt at the dinner table.  Two excerpts from the “references” section should suffice to convince you to click the link and read the whole thing.

  • Festinger, R. “Let’s Give Some Subjects $20 and Some Subjects $1 and See What Happens.” Journal for Predictions Contrary to Common Sense 10, 1956, pp. 1-20.
  • Milgram, R. “An Electrician’s Wiring Guide to Social Science Experiments.” Popular Mechanics 23, 1969, pp. 74-87.

If you don’t remember what Festinger and Milgram actually did in the 50s and 60s this won’t be funny, but if you do it’s hilarious. Hence, the scientists-can-be-funny genre is a self-deprecating genre for an audience of insiders that simultaneously demonstrates the joker’s mastery of the field and the field’s foibles. In contrast, the science-can-be-fun genre is targeted to a mass audience and is about demonstrating the elegance and power of the field. The former inspires humility among practitioners, the latter awe among the yokels.

One of the interesting things about the econ-made-fun literary genre is that it is largely orthogonal to any theoretical distinction within scholarly economics. The most prominent “econ made fun” practitioners span such theoretical areas as applied micro (Levitt), behavioral (Ariely), and Austrian (Cowen). In part because the “econ made fun” genre exploded at about the same time as the Kahneman Nobel and in part because “econ made fun” tends to focus on unusual substantive issues (i.e., anything but financial markets), this has led a lot of people to conflate “econ made fun” and behavioral econ. I’ve heard Steve Levitt referred to as a “behavioral economist” several times. This drives me crazy as at a theoretical level, behavioral economics is the opposite of applied micro, and in fact Levitt has done important work suggesting that behavioral econ may not generalize very well from the lab to the real world. That people (including people who ought to know better) nonetheless refer to him as a “behavioral economist” suggests to me that in the popular imagination literary genre is vastly more salient than theoretical content.

I myself occasionally do the “sociologists can be funny” genre (see here, here, and here) but these are basically elaborate deadpan in-jokes and I am under no illusions that anyone without a PhD would find them at all funny. I have no idea how to go about writing “sociology can be fun” (this is probably the closest I’ve come) along the lines of Levitt/Dubner or Harford, nor to be honest do I see any other sociologist doing it particularly well. There are plenty of sociologists who try to speak to a mass audience, but the tone tends to be professorial exposition or political exhortation rather than amusement at the surprising intricacy of social life. Fortunately Malcolm Gladwell has an intense and fairly serious interest in sociology and is very talented at making our field look fun.

1 comment November 10, 2009

Team Sorting

| Gabriel |

Tyler Cowen links to an NBER paper by Hoxby that shows that in recent decades, status sorting has gotten more intense for college. Cowen asks “is this a more general prediction in a superstars model?” The archetypal superstar system is Hollywood, and here’s my quick and dirty stab at answering Tyler’s question for that field. Faulkner and Anderson’s 1987 AJS showed that there is a lot of quality sorting in Hollywood, but they didn’t give a time trend. As shown in my forthcoming ASR with Esparza and Bonacich, there are big team spillovers so this is something we ought to care about.

I’m reusing the dataset from our paper, which is a subset of IMDB for Oscar eligible films (basically, theatrically-released non-porn) from 1936-2005. If I were doing it for publication I’d do it better (i.e., I’d allow the data to have more structure and I’d build confidence intervals from randomness), but for exploratory purposes the simplest way to measure sorting is to see if a given film had at least one prior Oscar nominee writer, director, and actor. From that I can calculate the odds-ratio of having an elite peer in the other occupation.
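
For anyone who wants the arithmetic spelled out, here is a minimal Python sketch of the odds-ratio calculation from film-level dummies. It is only an illustration: the function name odds_ratio is hypothetical and the counts in the toy data are invented, not taken from the Oscars dataset. The formula is the usual cross-product ratio of the 2x2 table, (both * neither) / (only one * only the other).

```python
def odds_ratio(films, a, b):
    """Cross-product odds ratio for two 0/1 dummies across a list of film records."""
    # n[i][j] counts films with dummy a equal to i and dummy b equal to j.
    n = [[0, 0], [0, 0]]
    for film in films:
        n[int(bool(film[a]))][int(bool(film[b]))] += 1
    off_diagonal = n[0][1] * n[1][0]
    if off_diagonal == 0:
        raise ValueError("odds ratio undefined: an off-diagonal cell is empty")
    return (n[1][1] * n[0][0]) / off_diagonal


# Invented counts, for illustration only: 40 films with both a prior nominee
# writer and director, 10 with only the writer, 20 with only the director,
# and 130 with neither.
films = (
    [{"writers": 1, "director": 1}] * 40
    + [{"writers": 1, "director": 0}] * 10
    + [{"writers": 0, "director": 1}] * 20
    + [{"writers": 0, "director": 0}] * 130
)
print(odds_ratio(films, "writers", "director"))
```

The measure is symmetric, so it doesn't matter which occupation is treated as the row and which as the column.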

Overall, a movie that has at least one prior nominee writer has 7.3 times the odds of other films of having a prior nominee director and 4.4 times the odds of having a prior nominee cast. A cast with a prior nominee has 6.5 times the odds of having a prior nominee director. Of course we already knew there was a lot of sorting from Faulkner and Anderson; the question suggested by Hoxby/Cowen is how the sorting has changed over time.

This little table shows odds-ratios for cast-director, writer-director, and writer-cast. Big numbers mean more intense sorting.

   +--------------------------------------------+
   | decade            cd         wd         wc |
   |--------------------------------------------|
1. | 1936-1945   6.545898   6.452388   4.306554 |
2. | 1946-1955   9.407476   6.425553   5.368151 |
3. | 1956-1965   12.09229   8.741302   6.720059 |
4. | 1966-1975   4.697238   5.399081   4.781106 |
5. | 1976-1985   4.113508   6.984528   4.450109 |
6. | 1986-1995   4.923809   7.599852   3.301461 |
7. | 1996-2005   4.826018   12.35915   3.641975 |
   +--------------------------------------------+

The trend is a little complicated. For collaborations between Oscar-nominated casts on the one hand and either writers or directors on the other, the sorting is most intense in the 1946-1955 decade and especially the 1956-1965 decade. My guess is that this is tied to the decline of the studio system and/or the peak power of MCA. The writer-director odds-ratio (the odds of a prior nominee director for films with versus without a prior nominee writer) also has a jump around the end of the studio system, but it seems there’s a second jump starting in the 80s. My guess is that this is an artifact of the increasing number of writer-directors (see Baker and Faulkner AJS 1991), but it’s an empirical question.

Putting aside the writer-director thing, it seems that sorting is not growing stronger in Hollywood. My guess is that ever more intense sorting is not a logical necessity of superstar markets, but has to do with contingencies, such as the rise of a national market for elite education in Hoxby’s case or the machinations of Lew Wasserman and Olivia de Havilland in my case.

The Stata code is below. (Sorry that WordPress won’t preserve the whitespace.) The dataset is film-level, with dummies for having at least one prior nominee in each of the three occupations.

global parentpath "/Users/rossman/Documents/oscars"

capture program drop makedecade
program define makedecade
gen decade=year
recode decade 1900/1935=. 1936/1945=1 1946/1955=2 1956/1965=3 1966/1975=4 1976/1985=5 1986/1995=6 1996/2005=7
capture lab drop decade
lab def decade 1 "1936-1945" 2 "1946-1955" 3 "1956-1965" 4 "1966-1975" 5 "1976-1985" 6 "1986-1995" 7 "1996-2005"
lab val decade decade
end

cd $parentpath

capture log close
log using $parentpath/sorting_analysis.log, replace

use sorting, clear
makedecade

*do odds-ratio of working w oscar nom, by own status

capture program drop allstar
program define allstar
preserve
if "`1'"!="" {
keep if decade==`1'
}
tabulate cast director, matcell(CD)
local pooled_cd=(CD[2,2]*CD[1,1])/(CD[1,2]*CD[2,1])
tabulate writers director, matcell(WD)
local pooled_wd=(WD[2,2]*WD[1,1])/(WD[1,2]*WD[2,1])
tabulate writers cast, matcell(WC)
local pooled_wc=(WC[2,2]*WC[1,1])/(WC[1,2]*WC[2,1])
shell echo "`pooled_cd' `pooled_wd' `pooled_wc' `1'" >> sortingresults.txt
restore
end

shell echo "cd wd wc decade" > sortingresults.txt
quietly allstar
forvalues t=1/7 {
quietly allstar `t'
}

insheet using sortingresults.txt, delimiter(" ") names clear
lab val decade decade

 

*have a nice day

4 comments November 8, 2009
