Archive for September, 2012
| Gabriel |
A few months ago Stanford’s sociology department was nice enough to invite me up to give a talk on chapter four of Climbing the Charts. This chapter argues that the opinion leadership hypothesis cannot be supported in radio, and in the talk I show a simulation of why we should be skeptical of this hypothesis in general. There’s no video, but here’s an enhanced audio file with slideshow. There is also a separate PDF of the slides in case you have problems with the integrated version. (A caveat: I knew I was speaking to a technically sophisticated audience so I let the jargon flow freely. The chapter itself is much easier to follow for people without a networks background.)
Also in shameless plugging news, Fabio’s review at OrgTheory.
| Gabriel |
[Update 1: From skimming Al Jazeera’s (conveniently date-stamped) blog on this issue for Saturday and Sunday it looks like the protests have slowed considerably, which would imply an s-curve.]
[Update 2: It looks like the prime mover political entrepreneur was the shock artist, who actively tried to get a reaction out of people. That is, this is more similar to Jones threatening to burn Korans than to the Danish imams going on tour with (forged versions of) the Danish cartoons. Of course, as in purely domestic culture war issues, there can be a strange symbiosis between partisans on both sides who disagree on the merits but mutually benefit from discord.]
I took the Atlantic Wire’s map (also see the KML file) of “Innocence of Muslims” protests, did my best to add dates, and graphed it as a (cumulative) diffusion curve. Pretty much as you’d expect, it shows exponential growth indicating a process of imitation. Note that the curve rises a bit above trend on Friday, but on the other hand it’s not entirely a Friday thing since you do see growth on Wednesday and Thursday too. I’m gonna split the difference and say it’s about half garden variety imitation and half the fact that Friday is the Islamic Sabbath.
Let’s hope the curve starts bumping up against the asymptote soon and goes from exponential to s-curve. On a more pessimistic note, even after this particular issue burns out, the tactic itself of drumming up outrage against an obscure blasphemy will be imitated at some point in the future by some political entrepreneur, just as this itself was almost certainly inspired by earlier similar efforts by other policy entrepreneurs. That is, there is a logic of imitation at both micro and macro, for protests within each scandal and for scandals imitating each other.
A caveat: I did my best to get the dates right, but the date often isn’t clear, even in the original news story linked in the KML file. Also, thanks to Neal Caren and Matt Frost for pointing to and showing me how to download the file.
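For readers who want to build a cumulative diffusion curve like the one described above, here is a minimal sketch in Stata. The dates are made-up toy values (deliberately set in 2000 so they can’t be mistaken for the protest data), and the variable names datestr, n_new, and cumulative are assumptions for the example, not the names in the original file.

```stata
* Made-up toy dates for illustration only, NOT the actual protest data
clear
input str10 datestr
"2000-01-01"
"2000-01-02"
"2000-01-02"
"2000-01-03"
"2000-01-03"
"2000-01-03"
"2000-01-04"
"2000-01-04"
"2000-01-04"
"2000-01-04"
end

* Convert the string to a Stata daily date
gen date = date(datestr, "YMD")
format date %td

* Collapse to one row per day, with the count of new events in n_new
contract date, freq(n_new)
sort date

* The running sum of new events is the cumulative diffusion curve
gen cumulative = sum(n_new)
twoway connected cumulative date
```

Adding yscale(log) to the twoway command puts the y-axis on a log scale, where exponential growth shows up as a roughly straight line and an s-curve bends away from it as it nears the asymptote.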
The following is a guest post from Trey Causey, a long-time reader of codeandculture and a grad student at Washington who does a lot of work with web scraping. This is his second guest post here; his first was on natural language processing SNAFUs.
| Trey |
“We all know that survey methods are becoming increasingly outdated and clunky.” While at the ASA Annual Meeting in Denver, I attended a methods panel where a graduate student opened with this remark; I’m paraphrasing here (but not by much). This remark was met with sympathetic laughter, mock(?) indignation, and some groans. Subsequent presenters, whose papers were mostly based on survey research, duly referred back to this comment. Survey research has been the workhorse of social scientists for generations and large portions of our methods training are usually survey-oriented in one way or another.
While the phrasing may have been indelicate, the graduate student is on to something important. I later spoke to a faculty member from the same department as the aforementioned graduate student. We were discussing the rise of “big data” and non-survey-based methods in the social sciences and he commented that he felt that the discipline was inevitably headed towards these methods, as survey response rates continue to fall.
I was reminded of both of these anecdotes earlier this week when Claude Fischer blogged about the crisis of survey response rates. He writes that Pew is averaging an astonishing 9% completion rate these days. He closes by speculating on what we can use instead of surveys to measure public opinion: “…letters to the editor, Facebook “likes,” calls to congressional offices, tweet vocabulary, street demonstrations — or the most common way we do it, assuming that what we and our friends think is typical. We might track Americans’ health by admissions to the hospitals, death rates, consumption of Lipitor. We could estimate changes in poverty by counting beggars on the street, malnourished kids at school. We could try to figure out the “dark” crime number by… I don’t know.”
Fischer is right to be pessimistic about the use of surveys. It’s not clear that the dwindling numbers of participants in many surveys are representative of the larger population (for many reasons). I think the writing is on the wall for all but the most well-funded and well-staffed survey organizations to provide large, representative samples of the American public.
However, I’m optimistic that new forms of data and methods are waiting to pick up the slack. As readers of this blog are no doubt aware, I am a proponent of using data scraped from the web; this could include Facebook likes or Twitter vocabulary, as Fischer points out, or newspaper articles, message board posts, or music sales data. Although debates inevitably (and rightly) arise when studies using these kinds of data are discussed amongst social scientists, the representativeness argument wielded (mostly) by survey researchers is overblown and perhaps deserves to be redirected back at survey research.
Are data scraped from the web representative of a larger population? That depends on which population we’re talking about. Questions that take the form of “does subgroup A differ significantly on average from subgroup B or population P on some attitudinal measure Y” usually require some kind of representative sample of a general population. However, many questions don’t take this form and social scientists are often interested in studying specific subpopulations, often underrepresented in probability samples.
Data scraped from the web have some real benefits. They are often naturalistic and intentional — individuals decided to write something of their own accord, unprompted by any researcher, posted it online for others to read, hoping to communicate something. They are generated in real-time. Reading message board threads like this one from Metafilter on September 11, 2001 gives us real-time insight into how people were making sense of an unfolding tragedy. No retrospective bias, no need to get a survey team on the ground a day or two after things happen. We can combine what individuals say with behavioral data. Scraping is one of the more unobtrusive forms of data collection — the act of observation by the researcher does not alter the individual’s behavior. It’s cheap and fast–even by sociologists’ standards.
Obviously, as people become more savvy about what companies and governments track about them from their online activities, this will probably change. However, it’s still more naturalistic than answers to closed-form survey questions over the phone. Rather than trying to figure out if survey responses represent a stable attitude or if they are epiphenomenal, the individual generates the data independently. While presentation management will always be an underlying concern when trying to figure out what people “really” mean or “really” want, it is not obvious to me that this problem was ever solved in survey research. Amazingly, people offer up copious information about sensitive topics–their sex lives, their drug habits, and their personal prejudices–without being asked and without the questions being cleared by an IRB (although analyzing this information will still require IRB approval). Combined with new methods to analyze unstructured text, advances in network analysis, and increases in computing power, the possibilities are really quite amazing.
It’s also important to note that data scraped from the web are not limited to the activities of computer users themselves — governments store loads of information online, media outlets often have full transcripts online, corporations publish earnings reports, and universities post course schedules and enrollment figures. They aren’t posted on the ICPSR as “datasets”, but they’re easily obtainable.
Arguing that data from the internet are somehow less real, less valid, or less reliable does nothing to further the cause of social science. We all know about the Literary Digest moments for survey research. But we didn’t give up on surveys right away; we figured out how to make them better. Surveys allowed us to peer into a section of the public’s mind and figure out what people were thinking on an unprecedented scale. New forms of data and new ways to analyze those data offer some of the same promise. Note that I’m not a “big data” Pollyanna; bigger data don’t spell the end of science. But updating our methods of inference to deal with big data and new kinds of data will allow us to study social interaction on a new scale yet again.
Is Twitter representative of the overall American or world population? Of course not. Will it ever be? Who knows. The key is to figure out the sources of bias and correct for them — the return on investment to doing so would most likely exceed that of trying to figure out how to salvage an ailing survey research patient.
| Gabriel |
All metrics and models are assumption-laden, but some are more assumption-laden than others. Among the worst offenders are smoothers, which, as the name implies, assume that the underlying reality is smooth. If the underlying reality has a discontinuity then the smoother will obfuscate this in the course of trying to smooth out “noise.” This can actually have big theoretical implications. Most notably, there was always a lot of zig-zag in the fossil record but traditionally people assumed it was just noise and so they smoothed it out. Then Gould came up with the theory of punctuated equilibrium and said that evolution substantively works through bursts, which at a data level is equivalent to saying that the zig-zags are signal, not noise.
Here’s an illustration. Let’s simulate a dataset that, by assumption, follows a step function. To keep it simple we’ll have no noise at all, just the underlying step function. Now, let’s apply a LOWESS smooth on the time-series. As you can see, the smoothed trend is basically an s-curve even though we know by assumption that the underlying structure of causation is a step.
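    * Simulate 50 time points following a noiseless step function
    clear
    set obs 50
    gen t = _n
    * x is 0 for the first 30 periods and 1 for the last 20
    gen x = 0 in 1/30
    replace x = 1 in 31/50
    * Overlay the LOWESS smooth on the raw points
    twoway (lowess x t) (scatter x t, msymbol(circle_hollow)), legend(off)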
Moral of the story: think carefully about whether the smoother is theoretically appropriate. If there are substantive reasons to expect discontinuities then it probably ain’t. For a similar reason you may want to not just assume a linear effect or even a polynomial specification in regression but compare various transformations (e.g., linear splines vs quadratics) and see what fits best.
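As a rough illustration of that kind of comparison, here is a minimal sketch in Stata using the same noiseless step data. The knot at t=30 and the names t2, t_lo, t_hi, quad, and spline are assumptions chosen for the example; with real data you would pick knots on substantive grounds.

```stata
* Rebuild the step-function data: x is 0 for t<=30 and 1 afterwards
clear
set obs 50
gen t = _n
gen x = (t > 30)

* Quadratic specification
gen t2 = t^2
regress x t t2
estimates store quad

* Linear spline with a single knot at t=30 (an assumed knot location)
mkspline t_lo 30 t_hi = t
regress x t_lo t_hi
estimates store spline

* Side-by-side AIC and BIC; lower values indicate better fit
estimates stats quad spline
```

Neither specification can reproduce a true jump, so it is also worth plotting the fitted values against the raw data rather than relying on the information criteria alone.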