Archive for March, 2009

Scientific Inference, part 2 of 4

Sociologists tend to divide themselves into “positivists” and “social constructionists” (with “enlightened positivist” sometimes being a middle ground), but these terms don’t do justice to the philosophy of science, and if taken seriously neither model is very appealing. Likewise, many scientists will tell you that they follow Popper’s falsifiable hypothesis logic, but this doesn’t reflect the way science is actually done (or ought to be done) either. We’ll go over several approaches to science, all of which agree that science is possible and desirable, but differ as to exactly what this means. The central problem that all of them are directly or indirectly attempting to grapple with is that of rigorous induction: how do we translate observations about the universe into understandings of natural law without being blinded by our preconceptions?

For both historical reasons and because he is so often invoked by practicing scientists, it’s probably worth starting with Popper. Karl Popper was originally very excited by the social theories of Marx and the psychology of Freud and Adler, at one point working in Adler’s lab. He grew frustrated, though, when he saw how absolutely any evidence could be made to fit within their theories, with even facially disconfirming evidence being interpreted as the result of a previously unstated contingency or of the system’s ability to sublimate contradiction. In contrast, when Einstein stated his theory of general relativity he extrapolated from it a very specific prediction about how during a solar eclipse it would be apparent that the gravity of the sun bends starlight. Sir Arthur Eddington observed an eclipse and found that Einstein’s predictions were correct, but the point is not that Einstein was right but that he gave a specific prediction that could have been wrong. Popper found this contrast fascinating and set out to build an intellectual career out of contrasting science (exemplified by Einstein) with pseudo-science (exemplified by the Marxists and Freudians). Note that Popper thought that there was nothing inherently pseudo-scientific about social or behavioral inquiries; he just wasn’t fond of these particular examples.


Popper’s essential insight from this contrast was that confirmation is cheap. He gave the example of a theory that all swans are white. It would be fairly easy to make a long list of white swans in much the same way that Freud made a long list of people whose neuroses derived from sublimated sexuality. Popper said a much better thing to do would be to search for black swans and fail to find any. In fact (as seen in the photo I took at the Franklin Park Zoo in Boston) there are black swans so we can reject the “all swans are white” hypothesis. Freud never looked for his black swans, which in his case would be neurotics without sublimated sexuality. Worse yet, Freud didn’t really have a non-tautological measure of sublimation so his theory is literally not falsifiable. In contrast, Einstein made a very specific prediction about how the stars would appear during a solar eclipse such that any astronomer could examine a photo of an eclipse and see whether it matched Einstein’s prediction.

Popper can be summed up as emphasizing the importance of “falsifiable hypotheses” as the definitive characteristic of science. Such a definition worked for him as he was far less interested in how science works than in defining what is not science. This is why one of the worst things you can say about a scientist is that his work is “not even wrong” as it implies that the scientist has lapsed into metaphysics. Philosophers call this agenda “demarcation” and sociologists call it “boundary work.” In our time the principal demarcation problem is creationism whereas for Popper himself it was mostly about Marxism and psychoanalysis.

[Creationism comes in two major forms, a hardcore “young Earth” version and a more squishy “intelligent design.” Young Earth creationism argues that Genesis is literally true and about 6000 years ago God created the heavens and the Earth in 144 hours and a few generations later created a massive flood that essentially rebooted the world. Intelligent design accepts the broad outlines of the conventional scientific view of the age of the Earth and the procession of natural history but argues that divine intervention routinely adjusts natural history, chiefly through being responsible for speciation. One bizarre consequence of this is that intelligent design is too vague to test, whereas young Earth creationism gives very concrete predictions (all of which are demonstrably false). Thus in strict Popperian terms intelligent design is more pseudo-scientific than young Earth creationism as the latter gives testable (albeit false) hypotheses whereas the former does not.]

Positivists also use obviously ridiculous things like astrology, pyramidology, and parapsychology for calibrating the gun sights, with the assumption being that any good demarcation criterion should be able to explain why astrology is bullshit. Popper went so far as to say that the theory of natural selection is not scientific because in practice “fitness” is defined tautologically as “that which is associated with survival.” However in a famous lecture he eventually recanted and argued that even if it is tautological to label any given common allele as promoting fitness, we can make a falsifiable hypothesis that selective advantage is more important for explaining complex organs than such alternative descent processes as genetic drift. (Note that this comes very close to saying that while a specific hypothesis is not testable the paradigm taken as a whole is and so here Popper was implicitly embracing holism).

While the idea of the “falsifiable hypothesis” was Popper’s key contribution, it’s worth also reviewing the logical positivist school with which he was loosely affiliated. The positivists drew a very strong distinction between synthetic (empirical data) and analytic (math and logic). Any statement that could not be described as either synthetic or analytic they derided as metaphysics (or more whimsically, “music” or “poetry”). Popper’s work fit within the positivist framework as it assumed a sort of deduction-induction cycle where the scientist would use logic to derive falsifiable hypotheses from theory, then collect data to test these hypotheses. That’s the technical meaning of “positivist” in philosophy, but sociologists usually use the term casually as the opposite of “deconstructionist” or “postmodernist” to mean someone who believes that science is possible without being hopelessly mired by subjectivity. Our usage completely loses any philosophic notions about strong distinctions between analytic, synthetic, and metaphysical and many sociologists who describe themselves as “positivists” probably really mean only that they are empiricists or scientific realists.

March 31, 2009 at 8:06 am

Viral marketing campaigns

| Gabriel |

Ad Age has a very interesting article on viral marketing campaigns today which graphs the views over time of various videos created by corporate advertisers or social movements. The most interesting thing about it is that only one of the campaigns shown in the graph (“Durex: Get It On”) shows anything like a classic s-curve. The rest of the campaigns can broadly be lumped into two categories.

The first type of video follows a pure external influence model (e.g. “Guitar Hero: Bike Hero”) and so is viral only in aspiration, not in fact. These videos have an initial burst of views, then top out at about half a million views and stop. What seems to be happening is that the companies buy a bunch of banner ads, “featured video” slots, and that sort of thing, and a lot of people watch but almost nobody ever sends the link to their friends.

The second type of video follows Bass’s classic mixed-influence model. These videos also seem to start out with banner ads and such, but they actually do get people to forward the link to their friends and so they ultimately become extremely popular (millions of views).
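To make the two shapes concrete, here is a minimal Python sketch of a discrete-time mixed-influence process. It is not fitted to the Ad Age data: the function names and every parameter value are made up for illustration, with `a` standing for the external (advertising) push and `b` for the internal (forward-to-a-friend) effect. Setting `b` to zero gives the pure external influence case.

```python
def diffuse(a, b, nmax, periods):
    """Cumulative views under a discrete-time mixed-influence model.

    a    -- external influence (ads, featured slots); hypothetical value
    b    -- internal influence (word of mouth); b=0 is pure external
    nmax -- ceiling on total views (the asymptote)
    """
    n = 0.0
    path = []
    for _ in range(periods):
        hazard = a + b * n / nmax      # chance a remaining non-viewer watches
        n += hazard * (nmax - n)
        path.append(n)
    return path


def new_views(path):
    """Per-period additions implied by a cumulative path."""
    return [path[0]] + [later - earlier for earlier, later in zip(path, path[1:])]


# Illustrative parameters only
external = diffuse(a=0.30, b=0.0, nmax=500_000, periods=20)
mixed = diffuse(a=0.02, b=0.60, nmax=5_000_000, periods=20)

# Period in which new views peak (0 means the very first period)
peak_external = new_views(external).index(max(new_views(external)))
peak_mixed = new_views(mixed).index(max(new_views(mixed)))
```

In the external-only run the biggest period is the first one and everything after is decay, whereas in the mixed run new views build for a while before tapering, which is the s-curve in the cumulative series.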

I think there are interesting findings from this research but also an opening for further work. The findings are further support for the big seed model. Most of these campaigns did not start small like a true epidemic but had an initial marketing push. Furthermore, the research supports the hypothesis that the asymptote is correlated with the endogenous component of the hazard. The opening is to figure out, of those campaigns that had an initial push, which actually went viral and why. Was it something about where the video was hosted (e.g. YouTube vs. its own site)? The length of the video? Whether it looked slick or was a simulacrum of lo-fi “user content”?

Also see similar research on diffusion models for web videos by Riley Crane.

March 30, 2009 at 10:36 am

Scientific Inference, part 1 of 4

Last week I listened to this bloggingheads diavlog between Lipson and Stemwedel and I found it to be really interesting since a lot of philosophy of science and sociology of science is implicit in their conversation. The focus of their conversation is alternative medicine (which they agree is mostly hogwash) and particularly Senator Harkin’s advocacy of continued funding for the National Center for Complementary and Alternative Medicine (NCCAM).

Stemwedel seems to think there is some value in finding the null and disproving kooky ideas, but Lipson takes an even harder line and argues that ideas which are nonsense on the face of it don’t deserve to be dignified by an empirical test (in part because of the problem of false positives being exaggerated by publication bias). I entirely agree with them about the relative merits of traditional vs. alternative medicine and I actually get angry when I hear bullshit about thimerosal or magnets or vague “toxins” but I also had a series of reactions to this issue on a meta level.

Nonetheless, my initial reaction to this was, wow, talk about social closure. Here we have an instance of a contested field and the dominant faction in the field is arguing that the subordinate faction is ipso facto illegitimate and its ideas don’t even merit a hearing. Even though I support the dominant faction on the merits this still seems like the kind of thing that would make J.S. Mill and Karl Popper turn over in their respective graves.

Then I was thinking about how this social closure problem is particularly severe since there’s the structural bias that, unlike traditional medicine, alternative medicine can’t get funding from big pharmaceutical companies and so it’s government grants or nothing. This is a common argument you hear from alternative medicine advocates but when you think about it, it doesn’t really make sense. The argument has several implicit assumptions:

  • Alternative medicine is not commercial
  • Alternative medicine can’t be protected by intellectual property law
  • Intellectual property law is the only way to capture rents from intellectual property

All of these assumptions are false. First, alternative medicine actually involves a lot of commerce (which unlike traditional medicine is on a cash basis as it’s usually not covered by insurance and thus, if anything, it’s even more commercial). Second, some of this could be covered by intellectual property law (most of the little gadgets and doodads they produce are patentable) though much of it is not. Third, even where it is not, this doesn’t mean that the sponsors of research can’t make money off of it. You could imagine scenarios under which research could be a kind of club good allowing the sponsors of the research to capture most of the rents flowing from it. For instance, pomegranates have been around at least since Persephone ate those seeds and so it seems like they would be long since off-patent, and hence there would be no private sector incentive to sponsor research promoting pomegranates. Nonetheless a major pomegranate processing company did sponsor research into pomegranates which is why you now see all those ads talking about antioxidants. The processor owned enough of the market that they were able to capture the rents associated with research making pomegranates more attractive, even though at first glance they seem like a commodity. Likewise you might imagine that some kind of major alternative medicine facility (like the Victorian-era resort in “The Road to Wellville”) could profit by promoting techniques even if it couldn’t patent them.

My final thought was to unpack what it means to say that even if a hypothesis is testable, it is such utter nonsense that it doesn’t deserve to be tested. For instance, consider the claim that putting magnets in your shoes will align your energy field which will improve your health. It would be easy to design a very rigorous experiment that would test this hypothesis. In this sense Popper and the positivists would approve of it as a hypothesis. However the hypothesis is completely incommensurable with the body of scientific knowledge about human physiology. In other words, it falls outside the paradigm. A lot of people (especially those who haven’t actually read Kuhn) see paradigms as a bad thing, a mechanism for social closure and close-mindedness. However Kuhn makes a positive case for paradigms in that it is impossible to come to the world as a naive empiricist and accomplish anything useful. A paradigm is necessary to give order to the world sufficient to motivate hypotheses and make sense of specific findings. In this sense when Lipson argues for defunding NCCAM he’s really making a very Kuhnian argument — we have a paradigm, it seems to be a pretty good paradigm (i.e., it has few anomalies and continues to inspire productive research agendas), and let’s not get distracted with stuff that falls outside of it.

Since we’re now getting into the philosophy of science, over the next few days I’ll be posting my lecture notes on the subject in a series of three posts.

March 30, 2009 at 9:45 am

True Tales of the IMDB!

| Gabriel |

Sometimes the hardest thing is getting the data into Stata. I do some work with the raw IMDB files and these can be hard to get into Stata for all sorts of reasons, the first of which is that they are huge.

This is doubly frustrating because most of the reason the files are so huge is stuff like pornography that I plan to drop from the dataset as soon as possible. No kidding, I traced one of my data problems today to a writing credit for someone named “McNoise” for a film called “Business Ass.” (I’m presuming this is porn as I’d rather not look into the matter further).

The hugeness of these files is compounded by the fact that Stata doesn’t store data as efficiently as text files do. If you see a text file is 100 megs, you might foolishly type “set mem 120m” and expect the thing to insheet. In fact it almost certainly will not because Stata uses enough memory for each case of each string variable to have as many characters as the single longest value for that variable. In other words, if 99% of the movies in IMDB have a name that’s 20 characters or less long but a handful have names that are 244 characters long, then Stata will use as much RAM as if all of them were 244 characters. Thus the Stata memory allocation might have to be three or four times the size of the text file.
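The arithmetic is easy to check before you commit to a memory setting. Here is a rough Python sketch, assuming a single string column and ignoring numeric variables and Stata's own overhead; the function names and the toy list of titles are invented for illustration.

```python
def fixed_width_bytes(values):
    """Bytes needed if every row is padded to the longest string."""
    longest = max(len(v.encode("utf-8")) for v in values)
    return longest * len(values)


def text_bytes(values):
    """Bytes the same column takes in a text file (one newline per row)."""
    return sum(len(v.encode("utf-8")) + 1 for v in values)


# Toy column: 99 titles of 20 characters and one of 244 characters
titles = ["x" * 20] * 99 + ["x" * 244]

padded = fixed_width_bytes(titles)   # what a fixed-width string column costs
on_disk = text_bytes(titles)         # what the text file costs
ratio = padded / on_disk
```

For a single lopsided column like this the blow-up is far worse than three or four times; the overall figure for a real file is lower because most of the other columns are short or numeric.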

But even if you somehow had a terabyte of RAM it’s not like you could just type insheet and leave it at that because the files are dirty (and not just because they have so much porn). The most obvious thing is that the tabs don’t match up. The basic organization of the file is like this:

writer1{tab}1st film credit
{tab}{tab}{tab}2nd film credit
{tab}{tab}{tab}kth film credit
writer2{tab}1st film credit

This organization means that when you insheet it the first film credit shows up as v2 but subsequent film credits show up as v4 in different rows. You could fix this in Stata (replace v2=v4 if v2=="") but remembering what I said about RAM you really wouldn’t want to. You’re much better off pre-cleaning the data in a good text editor (or if you plan on doing it routinely, perl). In addition to this systematic thing of first credit, later credit, there are also idiosyncratic errors. For instance, the rapper 50 Cent has a writing credit for a direct to video project called “Before I Self Destruct” and there are two tabs between his name and the credit instead of the usual one tab.
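If you would rather script the pre-cleaning than do it by hand, the idea is to carry the writer's name down to the continuation rows and to treat any run of tabs as one separator. The following is a Python sketch rather than the perl mentioned above; the function name is hypothetical and it assumes the layout is exactly as shown, with continuation rows starting with a tab.

```python
import re


def flatten_credits(lines):
    """Yield (writer, credit) pairs, one per film credit.

    Assumes a row that starts with a tab continues the previous writer,
    and that any run of tabs (one, two, three...) is a single separator.
    """
    writer = None
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue                      # skip blank rows
        if line.startswith("\t"):
            credit = line.strip("\t")     # later credit for the same writer
        else:
            parts = re.split(r"\t+", line, maxsplit=1)
            if len(parts) < 2:
                continue                  # no credit on this row
            writer, credit = parts
        yield writer, credit.strip("\t")


sample = [
    "writer1\t1st film credit\n",
    "\t\t\t2nd film credit\n",
    "writer2\t\t1st film credit\n",       # stray double tab, as with 50 Cent
]
rows = list(flatten_credits(sample))
```

Writing `rows` back out as one writer, one tab, one credit per line gives a file where everything lands in v1 and v2.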

Now here’s the real trick. You insheet your data but half of it’s not there. Note that Stata doesn’t tell you this. You have to check it yourself by using your text editor to see how many rows are in your text file and then typing “desc” in Stata to see your n and notice if it matches. It took me about an hour to realize that the IMDB writers’ file has several hanging quotes (i.e. an odd number of " characters in a string). Because Stata uses " as a string delimiter when you insheet, Stata ignores all the rows in your text file between your first hanging quote and your second hanging quote (and then between your third and fourth, and so on). If I needed the quotes and/or were more patient I’d figure out how to write a regular expression to find hanging quotes and close them, but because I don’t need them (IMDB uses quotes for print and tv but not films and I only care about films) I just turned them all into underscores which is usually a safe character for Stata to handle.
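Both the diagnosis and the blunt fix are a few lines in a scripting language. This Python sketch is only an illustration (the function names and sample rows are made up): it flags rows with an odd number of straight double quotes, then swaps every quote for an underscore.

```python
def hanging_quote_lines(lines):
    """Return 1-based row numbers that contain an odd number of " characters."""
    return [i for i, line in enumerate(lines, start=1) if line.count('"') % 2]


def neutralize_quotes(lines):
    """Replace every double quote with an underscore."""
    return [line.replace('"', "_") for line in lines]


sample = [
    'writer1\t"A Series" (2001)\n',       # balanced quotes
    'writer2\tA Film with a "stray (1999)\n',   # hanging quote
    "writer3\tPlain Film (2005)\n",
]
bad_rows = hanging_quote_lines(sample)
cleaned = neutralize_quotes(sample)
```

Comparing `len(cleaned)` with the n that “desc” reports after the insheet is the same row-count check described above.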

Anyway, I did the cleaning in TextWrangler so there’s no script per se but I did keep notes. You could turn these notes into a perl script but it would only be worth it if you needed to do it several times. The notes show find/replace regular expression patterns. The notes are for the file “writers.list”. Because each IMDB file is formatted slightly differently (yeah, I know, isn’t that great) you’ll need different code for different files.


\)  \(

\) \(as

}  \(




the next few commands will save memory but are not necessary. use each of
them as a find pattern to be replaced with nothing. they eliminate
non-theatrical credits (TV movies, video games, direct-to-video) but only if
they are not the writer's first credit in the file. the last pattern matches
credits for tv episodes.
^\t.+ \([1-2][0-9][0-9][0-9]\) \(TV\).+\r
^\t.+ \([1-2][0-9][0-9][0-9]\) \(VG\).+\r
^\t.+ \([1-2][0-9][0-9][0-9]\) \(V\).+\r
^\t".+" \([1-2][0-9][0-9][0-9]\) {.+\r
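For anyone scripting this instead of using a text editor, the four patterns translate more or less directly. The Python sketch below is one possible translation, not the notes themselves: the names are hypothetical, the sample credits are invented, and it assumes (as the patterns do) that a later credit starts with a tab and that an episode row has the quoted series title right after a single tab.

```python
import re

# (TV), (VG) and (V) credits that are not the writer's first row
NON_THEATRICAL = re.compile(r'^\t.+ \([12][0-9]{3}\) \((?:TV|VG|V)\).+$')
# TV episodes: quoted series title, year, then {episode ...}
TV_EPISODE = re.compile(r'^\t".+" \([12][0-9]{3}\) \{.+$')


def drop_non_theatrical(lines):
    """Keep every row except later credits matching the patterns above."""
    kept = []
    for line in lines:
        body = line.rstrip("\r\n")
        if NON_THEATRICAL.match(body) or TV_EPISODE.match(body):
            continue
        kept.append(line)
    return kept


sample = [
    "Doe, Jane\tSome Show (2003) (TV)  (writer)\n",      # first credit: kept
    "\t\t\tSome Movie (2001) (V)  (written by)\n",        # dropped
    "\t\t\tSome Film (1999)  (screenplay)\n",             # theatrical: kept
    '\t"Some Series" (2005) {Pilot (#1.1)}  (writer)\n',  # episode: dropped
]
kept = drop_non_theatrical(sample)
```

As with the find/replace version, a writer whose first listed credit is non-theatrical keeps that row, so you would still need to drop those in Stata later.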

March 26, 2009 at 3:29 pm 1 comment


| Gabriel |

There have been a lot of updates lately to the completely indispensable estout package.

If you’re thinking, what is this “estout” of which he speaks? Don’t walk, but run to your copy of Stata and type:

ssc install estout

If you already have estout and are trying to install the update, try:

ssc install estout, replace

As every quant knows, getting Stata output into journal layout is really, really tedious and you have to start all over and do it from scratch anytime you change anything about a model. When I was an undergrad I thought I was so cool when I realized I could read a log file into Excel as a fixed-width text file. This and some related tricks cut down the time it takes to make a decent-sized regression table from about 40 minutes to about 20 minutes, but that’s still a pretty tedious 20 minutes.

So I was pretty happy when I learned about the various table-making commands that can do this for you. The first time somebody showed me how estout works I felt like one of the Munchkins after Dorothy killed the wicked witch of the East.

Estout cuts down table-making to between zero and five minutes, depending on how gung ho you are about tweaking the syntax. Really hardcore people have it output TeX that they embed directly in their write-up. The syntax is a little hard to learn but you generally only have to learn enough syntax to get it to work with one or two styles that you use often. Here’s my syntax to create an ASA-style table for a multi-level model with nested independent variables. I use it as fixed width because it makes it easier to import into a spreadsheet. (Excel really likes to think of parentheses as meaning “negative” rather than as literal strings).

eststo clear
eststo: xtreg y x1, re i(clusterid)
eststo: xtreg y x1 x2, re i(clusterid)
eststo: xtreg y x1 x2 x3, re i(clusterid)
esttab using table.txt , se b(3) se(3) scalars(ll rho) nodepvars nomtitles  label title(Table: REGRESSION MODELS OF SOMETHING) replace fixed

March 25, 2009 at 11:10 am 8 comments


| Gabriel |

My MDC technique is basically a multilevel version of a much older technique (as in so old it could have been used for marketing analysis at Sterling Cooper on “Mad Men”) created by Edwin Mansfield. This older technique first does a series of Bass analyses (Mansfield published the equation before Bass but the older version is under-theorized which is why we call it the “Bass” model today). It then treats the coefficients from the first stage as a dataset to itself be regressed. Although more recent work supersedes it in several ways, it’s still worth using for diagnostic purposes. However it’s a pain in the ass to use as it requires you to run a separate regression for each of your innovations and then aggregate them. As such, I wrote this code to automate it.

Even if for some bizarre reason you’re not particularly interested in diffusion models dating from the Kennedy administration, this code may be interesting for a few reasons:

  • It uses the “estout” package not for the (indispensable) usual purpose of making results meet publication style, but for the off-label purpose of creating a meta-analysis dataset.
  • It makes extensive (and extremely clumsy) use of shell-based regular expression commands to clean this output. (I am under no illusions that the “awk” code is remotely elegant).
  • It saves the cluster id variable in a local, then attaches it back using a loop.
capture program drop mansfield
program define mansfield
 *NOTE: dependency, "vallist" and "estout"

 set more off

 local caseid `1'
 local genre  `1'

 sort `caseid'
 by `caseid': drop if _N<5

 vallist `caseid', quoted 

 shell touch emptyresults
 shell mv emptyresults `genre'results.txt
 foreach case in `r(list)' {
  disp "`case'"
  quietly reg w_adds Nt Nt2 if `caseid'==`case'
  esttab using `genre'results.txt, plain append
 }

 shell awk '{ gsub(" +b/t", ""); print $0;}' `genre'results.txt > tmp ;  mv tmp `genre'results.txt
 shell awk '{ gsub(" +", "\t"); print $0;}' `genre'results.txt > tmp ;  mv tmp `genre'results.txt
 shell awk '{ gsub("\n\t.*", ""); print $0;}' `genre'results.txt > tmp ;  mv tmp `genre'results.txt
 shell awk '/.+/{print $0}' `genre'results.txt > tmp ;  mv tmp `genre'results.txt
 shell awk '{ gsub("^\t.+", ""); print $0;}' `genre'results.txt > tmp ;  mv tmp `genre'results.txt
 shell awk '{ gsub("^$", ""); print $0;}' `genre'results.txt > tmp ;  mv tmp `genre'results.txt
 shell awk '/.+/{print $0}' `genre'results.txt > tmp ;  mv tmp `genre'results.txt
 shell awk '{ gsub("Nt2\t", ""); print $0;}' `genre'results.txt > tmp ;  mv tmp `genre'results.txt
 shell awk '{ gsub("_cons\t", ""); print $0;}' `genre'results.txt > tmp ;  mv tmp `genre'results.txt
 shell awk '{ gsub("N\t", ""); print $0;}' `genre'results.txt > tmp ;  mv tmp `genre'results.txt
 shell awk '{ gsub("Nt\t", "NR\t"); print $0;}' `genre'results.txt > tmp ;  mv tmp `genre'results.txt
 shell #!/bin/sh
 shell awk -f mansfield.awk `genre'results.txt > tmp ;  mv tmp `genre'results.txt

 insheet using `genre'results.txt, clear
 drop v1 v6
 ren v4 A
 ren v2 B
 ren v3 C
 ren v5 n
 gen b    = -C
 gen nmax = (-B - ((B^2)-4*A*C)^0.5) / (2*C)
 gen a    = A / nmax

 gen `caseid'=.
 global n=1
 foreach case in `r(list)' {
  replace `caseid'=`case' in $n/$n
  global n=$n+1
 }
 save `genre'_mansfield.dta, replace
end

*note, text of mansfield.awk follows
*it should be in the same directory as the data and can be made executable
*with the command "chmod +x mansfield.awk"
*BEGIN {
*    FS="\n"
*    RS="Nt\t"
*    ORS=""
*}
*{
*        x=1
*        while ( x<NF ) {
*                print $x "\t"
*                x++
*        }
*        print $NF "\n"
*}
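The three “gen” lines near the end of the program are just the algebra for turning the first-stage quadratic back into diffusion parameters. As a check on that algebra, here is a short Python sketch. It assumes the rate form adds = (a + b*N) * (nmax - N), which is what the signs in the Stata code imply; the function name and the test values are invented for illustration.

```python
import math


def mansfield_params(A, B, C):
    """Recover (a, b, nmax) from the regression adds = A + B*N + C*N**2.

    Assumes adds = (a + b*N) * (nmax - N), so that
    A = a*nmax, B = b*nmax - a, and C = -b.
    """
    b = -C
    # nmax is the positive root of A + B*N + C*N**2 = 0 (C is negative)
    nmax = (-B - math.sqrt(B**2 - 4 * A * C)) / (2 * C)
    a = A / nmax
    return a, b, nmax


# Round trip with made-up parameters
true_a, true_b, true_nmax = 0.02, 1e-6, 1000.0
A = true_a * true_nmax
B = true_b * true_nmax - true_a
C = -true_b
a, b, nmax = mansfield_params(A, B, C)
```

If a case's fitted quadratic term comes out positive, or the discriminant negative, there is no sensible ceiling to recover, which is one reason the technique is best treated as a diagnostic.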

March 24, 2009 at 9:31 am 2 comments

Commercial visualization

| Gabriel |

Today Ad Age has a brief story on the use of data visualization techniques in marketing. Many of the things they describe/link are aesthetically appealing and entertaining and this is no small accomplishment. However if you go by Tufte’s standards they don’t accomplish the main purpose of visualization, which is to concisely convey large, complex, and often hyperplex data. In contrast, things like Social Explorer, Pajek graphs, the Herr et al. network graph of the IMDB, and some of the projects at Microsoft Research are beautiful and informative. However I don’t want to be too gung ho about insisting that visualization meet the Tufte-test. People like ducks, and if they see them in ads or music videos it may lead them to be more interested in more serious graphs. Certainly I’m all for anything that leads to popular attention and private sector employment opportunities for culture quants.

March 20, 2009 at 2:35 pm
