The following is a guest post from Trey Causey, a long-time reader of codeandculture and a grad student at Washington who does a lot of work with web scraping. This is his second guest post here; his first was on natural language processing SNAFUs.
| Trey |
“We all know that survey methods are becoming increasingly outdated and clunky.” While at the ASA Annual Meeting in Denver, I attended a methods panel where a graduate student opened with this remark; I’m paraphrasing here (but not by much). This remark was met with sympathetic laughter, mock(?) indignation, and some groans. Subsequent presenters, whose papers were mostly based on survey research, duly referred back to this comment. Survey research has been the workhorse of social scientists for generations and large portions of our methods training are usually survey-oriented in one way or another.
While the phrasing may have been indelicate, the graduate student is on to something important. I later spoke to a faculty member from the same department as the aforementioned graduate student. We were discussing the rise of “big data” and non-survey-based methods in the social sciences and he commented that he felt that the discipline was inevitably headed towards these methods, as survey response rates continue to fall.
I was reminded of both of these anecdotes earlier this week when Claude Fischer blogged about the crisis of survey response rates. He writes that Pew is averaging an astonishing 9% completion rate these days. He closes by speculating on what we can use instead of surveys to measure public opinion: “…letters to the editor, Facebook “likes,” calls to congressional offices, tweet vocabulary, street demonstrations — or the most common way we do it, assuming that what we and our friends think is typical. We might track Americans’ health by admissions to the hospitals, death rates, consumption of Lipitor. We could estimate changes in poverty by counting beggars on the street, malnourished kids at school. We could try to figure out the “dark” crime number by… I don’t know.”
Fischer is right to be pessimistic about the use of surveys. It’s not clear that the dwindling numbers of participants in many surveys are representative of the larger population (for many reasons). I think the writing is on the wall: all but the most well-funded and well-staffed survey organizations will struggle to provide large, representative samples of the American public.
However, I’m optimistic that new forms of data and methods are waiting to pick up the slack. As readers of this blog are no doubt aware, I am a proponent of using data scraped from the web; this could include Facebook likes or Twitter vocabulary, as Fischer points out, or newspaper articles, message board posts, or music sales data. Although debates inevitably (and rightly) arise when studies using these kinds of data are discussed amongst social scientists, the representativeness argument wielded (mostly) by survey researchers is overblown and perhaps deserves to be redirected back at survey research.
Are data scraped from the web representative of a larger population? That depends on which population we’re talking about. Questions that take the form of “does subgroup A differ significantly on average from subgroup B or population P on some attitudinal measure Y” usually require some kind of representative sample of a general population. However, many questions don’t take this form and social scientists are often interested in studying specific subpopulations, often underrepresented in probability samples.
Data scraped from the web have some real benefits. They are often naturalistic and intentional: individuals decided to write something of their own accord, unprompted by any researcher, and posted it online for others to read, hoping to communicate something. They are generated in real time. Reading message board threads like this one from Metafilter on September 11, 2001 gives us real-time insight into how people were making sense of an unfolding tragedy. No retrospective bias, no need to get a survey team on the ground a day or two after things happen. We can combine what individuals say with behavioral data. Scraping is one of the more unobtrusive forms of data collection, in that the act of observation by the researcher does not alter the individual’s behavior. It’s cheap and fast, even by sociologists’ standards.
Obviously, as people become more savvy about what companies and governments track about them from their online activities, this unobtrusiveness will probably diminish. However, web data are still more naturalistic than answers to closed-ended survey questions over the phone. Rather than the researcher trying to figure out if survey responses represent a stable attitude or if they are epiphenomenal, the individual generates the data independently. While impression management will always be an underlying concern when trying to figure out what people “really” mean or “really” want, it is not obvious to me that this problem was ever solved in survey research. Amazingly, people offer up copious information about sensitive topics (their sex lives, their drug habits, and their personal prejudices) without being asked and without the questions being cleared by an IRB (although analyzing this information will still require IRB approval). Combined with new methods to analyze unstructured text, advances in network analysis, and increases in computing power, the possibilities are really quite amazing.
It’s also important to note that data scraped from the web are not limited to the activities of computer users themselves — governments store loads of information online, media outlets often have full transcripts online, corporations publish earnings reports, and universities post course schedules and enrollment figures. They aren’t posted on the ICPSR as “datasets”, but they’re easily obtainable.
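To make "easily obtainable" a bit more concrete, here is a minimal sketch of turning an HTML table into rows of data using only Python's standard library. The page fragment, course names, and enrollment numbers below are made up for illustration, and the `TableExtractor` class is my own hypothetical helper, not part of any library. Real pages are messier, and you should check a site's terms of use before scraping it.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every table cell, grouped by row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Made-up fragment standing in for a university enrollment page.
SAMPLE_HTML = """
<table>
  <tr><th>Course</th><th>Enrolled</th></tr>
  <tr><td>SOC 101</td><td>240</td></tr>
  <tr><td>SOC 401</td><td>35</td></tr>
</table>
"""

parser = TableExtractor()
parser.feed(SAMPLE_HTML)
rows = parser.rows
header, records = rows[0], rows[1:]
for record in records:
    print(dict(zip(header, record)))
```

In practice you would fetch the page first and probably reach for a dedicated parsing library, but the basic idea is the same: the "dataset" is already there, it just isn't formatted as one.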
Arguing that data from the internet are somehow less real, less valid, or less reliable does nothing to further the cause of social science. We all know about the Literary Digest moments for survey research. But we didn’t give up on surveys; we figured out how to make them better. Surveys allowed us to peer into a section of the public’s mind and figure out what people were thinking on an unprecedented scale. New forms of data and new ways to analyze those data offer some of the same promise. Note that I’m not a “big data” Pollyanna; bigger data don’t spell the end of science. But updating our methods of inference to deal with big data and new kinds of data will allow us to study social interaction on a new scale yet again.
Is Twitter representative of the overall American or world population? Of course not. Will it ever be? Who knows. The key is to figure out the sources of bias and correct for them — the return on investment to doing so would most likely exceed that of trying to figure out how to salvage an ailing survey research patient.
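One familiar way to correct for a known source of bias is post-stratification: weight each observation by the ratio of its group's share of the target population to its share of the sample. The sketch below is only an illustration of that idea, not a recipe for any particular data source. The group labels, population shares, and outcome values are invented, and the function names are my own. It also assumes you know which group each user belongs to, which is often the hard part with scraped data.

```python
from collections import Counter


def poststratification_weights(sample_groups, population_shares):
    """Return one weight per observation: population share / sample share of its group."""
    n = len(sample_groups)
    counts = Counter(sample_groups)
    unknown = set(counts) - set(population_shares)
    if unknown:
        raise ValueError(f"no population share given for groups: {sorted(unknown)}")
    return [population_shares[g] / (counts[g] / n) for g in sample_groups]


def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Invented example: the sample over-represents younger users (8 of 10),
# while the target population is assumed to be split evenly.
groups = ["young"] * 8 + ["old"] * 2
outcome = [1] * 8 + [0] * 2  # 1 = expressed some opinion, 0 = did not
population_shares = {"young": 0.5, "old": 0.5}

weights = poststratification_weights(groups, population_shares)
raw = sum(outcome) / len(outcome)
adjusted = weighted_mean(outcome, weights)
print(f"unweighted: {raw:.2f}, post-stratified: {adjusted:.2f}")
```

The adjustment only fixes imbalance on the variables you weight on. If young Twitter users differ from young people in general, the weights won't help, which is exactly why figuring out the sources of bias comes first.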