Archive for September, 2010
Scraping for Event History
| Gabriel |
As I’ve previously mentioned, there’s a lot of great data out there, but much of it is ephemeral, so if you’re interested in change (which, given our obsession with event history, many sociologists are) you’ve got to know how to grab it. I provided a script (using cron and curl) for grabbing specific pages and timestamping them, but this doesn’t scale up very well to getting entire sites, both because you need to specify each URL individually and because it saves a complete copy each time rather than the diff. I’ve recently developed another approach that relies on wget and rsync and is much better suited to a more ambitious scraping project.
Note that because of subtle differences between dialects of Unix, I’m assuming Linux for the data collection but Mac for the data cleaning.* Using one or the other for everything requires some adjustments. Also note that because you’ll want to “cron” this, I don’t recommend running it on your regular desktop computer unless you leave it on all night. If you don’t have server space (or an old computer on which you can install Linux and then treat as a server), your cheapest option is probably to run it on a wall wart computer for about $100 (plus hard drive).
Wget is similar to curl in that it’s a tool for downloading internet content, but it has several useful features, some of which aren’t available in curl. First, wget can do recursion, which means it will automatically follow links and thus can get an entire site rather than just a page. Second, it reads links from a text file a bit better than curl. Third, it has a good time-stamping feature where you can tell it to only download new or modified files. Fourth, you can exclude files (e.g., video files) that are huge and that you’re unlikely to ever make use of. Put these together and it means that wget is scalable — it’s very good at getting and updating several websites.
Unfortunately, wget is good at updating, but not at archiving. It assumes that you only want the current version, not the current version and several archival copies. Of course this is exactly what you do need for any kind of event history analysis. That’s where rsync comes in.
Rsync is, as the name implies, a syncing utility. It’s commonly used as a backup tool (both remote and local). However, the simplest use for it is just to sync several directories, and we’ll be applying it to a directory structure like this:
project/
    current/
    backup/
        t0/
        t1/
        t2/
    logs/
In this setup, wget only ever works on the “current” directory, which it freely updates. That is, whatever is in “current” is a pretty close reflection of the current state of the websites you’re monitoring. The timestamped stuff, which you’ll eventually be using for event history analysis, goes in the “backup” directories. Every time you run wget you then run rsync after it so that next week’s wget run doesn’t throw this week’s wget run down the memory hole.
The first time you do a scrape you basically just copy current/ to backup/t0. However if you were to do this for each scrape it would waste a lot of disk space since you’d have a lot of identical files. This is where incremental backup comes in, which Mac users will know as Time Machine. You can use hard links (similar to aliases or shortcuts) to get rsync to accomplish this.** The net result is that backup/t0 takes the same disk space as current/ but each subsequent “backup” directory takes only about 15% as much space. (A lot of web pages are generated dynamically and so they show up as “recently modified” every time, even if there’s no actual difference with the “old” file.) Note that your disk space requirements get big fast. If a complete scrape is X, then the amount of disk space you need is approximately 2 * X + .15 * X * number of updates. So if your baseline scrape is 100 gigabytes, this works out to a full terabyte after about a year of weekly updates.
Finally, when you’re ready to analyze it, just use mdfind (or grep) to search the backup/ directory (and its subdirectories) for the term whose diffusion you’re trying to track and pipe the results to a text file. Then use a regular expression to parse each line of this query into the timestamp and website components of the file path to see on which dates each website used your query term — exactly the kind of data you need for event history. Furthermore, you can actually read the underlying files to get the qualitative side of it.
So on to the code. The wget part of the script looks like this:
DATESTAMP=`date '+%Y%m%d'`
cd ~/Documents/project
mkdir logs/$DATESTAMP
cd current
wget -S --output-file=../logs/$DATESTAMP/wget.log --input-file=../links.txt \
  -r --level=3 \
  -R mpg,mpeg,mp4,au,mp3,flv,jpg,gif,swf,wmv,wma,avi,m4v,mov,zip \
  --tries=10 --random-wait --user-agent=""
That’s what it looks like the first time you run it. When you’re just trying to update “current/” you need to change “wget -S” to “wget -N” but aside from that this first part is exactly the same. Also note that if links.txt is long, I suggest you break it into several parts. This will make it easier to rerun only part of a large scrape, for instance if you’re debugging, or there’s a crash, or if you want to run the scrape only at night but it’s too big to completely run in a single night. Likewise it will also allow you to parallelize the scraping.
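For what it’s worth, a rough sketch of that splitting is below; the chunk size, the links_part_* names that split generates, and the trimmed-down wget options are all just placeholders to adapt to your own run, and it assumes $DATESTAMP is set as in the script above.
cd ~/Documents/project
# break the master list into 200-line chunks named links_part_aa, links_part_ab, ...
split -l 200 links.txt links_part_
cd current
# loop an update run over the chunks (each chunk gets its own log)
for PART in ../links_part_*
do
  wget -N --output-file=../logs/$DATESTAMP/wget_`basename $PART`.log --input-file=$PART \
    -r --level=3 --tries=10 --random-wait --user-agent=""
done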
Now for the rsync part. After your first run of wget, run this code:
cd ..
rsync -a current/ backup/baseline/
After your subsequent update runs of wget, you do this:
cd ..
cp -al backup/baseline/ backup/$DATESTAMP/
rsync -av --delete current/ backup/$DATESTAMP/
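Putting the analysis step from above into code, here is a minimal sketch using grep (mdfind works similarly on a Mac). It assumes the backup directories are named by datestamp and that wget created one subdirectory per hostname, so that a hit looks like backup/20100903/www.example.com/page.htm; adjust the pattern to however your directories actually end up looking.
cd ~/Documents/project
# list every archived file containing the term whose diffusion you're tracking
grep -ril "your search term" backup/ > hits.txt
# sketch: parse each matching path into "datestamp hostname", e.g. "20100903 www.example.com"
sed 's|^backup/\([^/]*\)/\([^/]*\)/.*|\1 \2|' hits.txt > eventhistory.txt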
* The reason to use Linux for data collection is that OS X doesn’t include wget and has an older version of the cp command, though it’s possible to solve both issues by using Fink to install wget and by rewriting cp in Mac/BSD syntax. The reason to use Mac for data analysis is that mdfind is faster (at least once it builds an index) and can read a lot of important binary file formats (like “.doc”) out of the box, whereas grep only likes to read text-based formats (like “.htm”). There are apparently Linux programs (e.g., Beagle) that allow indexed search of many file formats, but I don’t have personal experience with using them as part of a script.
** I adapted this use of hard links and rsync from this tutorial, but note that there are some important differences. He’s interested in a rolling “two weeks ago,” “last week,” “this week” type of thing, whereas I’m interested in absolute dates and don’t want to overwrite them after a few weeks.
Stata 11 Factor Variable / Margins Links
| Gabriel |
Michael Mitchell posted a brief tutorial on factor variables. (This has nothing to do with the “factor” command — think interaction terms and “xi,” not eigenvectors and structural equations). People who are already familiar with FV won’t really get anything out of the tutorial, but it should serve as a very good introduction for people who are new to this syntax. Like all of Mitchell’s writing (e.g., the graphics book) it is very clearly written and generally well-suited to the user who is transitioning from novice to advanced usage.
The new Stata newsletter has a good write-up of “margins” syntax, which is useful in conjunction with factor variables for purposes of interpreting the betas (especially when there’s some nonlinearity involved). In a recent post explaining a comparable approach, I said that my impression was that margins only really works with categorical independent variables, but I’m pleased to see that I was mistaken and it in fact works with continuous variables as well.
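Neither write-up is reproduced here, but as a quick, hedged illustration of what the syntax looks like (using the familiar auto dataset and arbitrarily chosen mpg values):
sysuse auto, clear
* factor variable notation: i. marks a categorical variable, c. a continuous one,
* and ## requests main effects plus the interaction (no "xi" or "gen" needed)
reg price i.foreign##c.mpg
* average marginal effect of the continuous variable
margins, dydx(mpg)
* predicted prices at chosen values of mpg, separately by foreign
margins foreign, at(mpg=(15 20 25 30))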
Zotero’s Bibtex export filter
| Gabriel |
Two issues with Zotero’s export filter to BibTex.
First, in the new version they broke backwards compatibility (a little), so you can get some missing citation errors if you try to use Lyx/Latex files that you originally wrote based on BibTex files generated by older versions of Zotero. Specifically, the new version handles colons in the title differently when generating the BibTex key: the old version left the colon out, the new version keeps it in. I had to go through my book manuscript and change all the BibTex keys for which this makes a difference. An alternate approach would be to freeze your Zotero library folder, keep using the last export created with the old version, and start a new Zotero library folder for citations you collect from here on out.
Second, while it is possible to create a BibTex “author” field that TeX doesn’t read as “Last, First,” you can’t do it from Zotero. This is a problem as you end up getting ridiculous bibliography entries like “Bureau, US Census”. To get these and similar cases formatted correctly, the easiest thing is just to delete them from Zotero and put them in a hand-coded BibTex file. (Don’t worry, a single TeX document can draw citations from multiple .bib files so you don’t have to commit to hand-coding everything or merging your hand-coded file into the Zotero-generated file). The trick is that you use quotes (not curly brackets as with Zotero’s export) for the field delimiters and then use curly brackets inside quotes to tell BibTex “don’t break this.” For instance, here’s my entry for the Statistical Abstract of the United States:
@book{u.s._census_bureau_statistical_2007,
    address = "Washington, DC",
    edition = "127",
    title = "Statistical Abstract of the United States, 2008.",
    isbn = "9780160795848",
    publisher = "{U.S.} Census Bureau",
    author = "{U.S. Census Bureau}",
    year = "2007"
}
We have a protractor
| Gabriel |
Neal Stephenson’s Anathem opens with the worst instrument in the history of survey research. A monastery of cloistered science-monks is about to open its gates to the outside world for a brief decennial festival and they are interviewing one of the few ordinary people with whom they have regular contact about what they can expect outside. The questions are as vague and etic as imaginable and the respondent has a hard time interpreting them. The reason the questions are so bad is that the monks are almost completely cut off from society and the instrument has been in continuous use for millennia.
The monks call society the “secular world” which sounds strange given that these monks are atheists, but makes sense if you remember that “secular” means “in time” and in English we use this word to mean “non-religious” because St. Augustine argued that God exists outside of time and Pope Gelasius elaborated this argument to develop a theory of the Church’s magisterium. Anyway, the monks in Anathem are so separate from society that in a very real sense they too exist outside of time. To the extent that the outside world does impinge on their experience, it is mostly with the threat of a “sack,” an anti-intellectual pogrom that tends to happen every few centuries.
Thus the novel, especially the first third, is primarily a thought experiment about what it would look like if we were to take the ivory tower as a serious aspiration. I mean, imagine never struggling to figure out what the broader impacts are of your research for purposes of a grant proposal because you’re opposed in principle to the very idea of broader impacts and strive for such perfect lack of them that you asymptotically approach “causal domain shear,” meaning that nothing in the monastery affects the outside world and vice versa. Also, you never go through tenure review because you can stay in the monastery as long as you want (intellectual deadweight is gently shifted to full-time semi-skilled manual labor). OK, there are some pretty big downsides compared to academia here on Earth. You have to do all your calculations by hand as you are forbidden computers and most other modern technology. You spend half the day chanting and gardening. Your only possessions are a bathrobe and a magic beach ball. When you break the rules they punish you by sending you into solitary for a few months to study for a field exam on gibberish. And as previously mentioned, once or twice every thousand years the yokels storm your office and lynch you.
The most easily ridiculed and stereotypically science-fiction-y aspect of the book is the abundance of neologisms. When I started reading it, I found the whole alternative vocabulary very distracting and I did a lot of on-the-fly translation from Stephenson to English. I mean, I understand the need to coin terms like “centarian” (members of a monastery whose cloistered status is relaxed only once a century) when there is no good English equivalent, but it’s mostly* gratuitous to talk about a “Counter-Bazian ark” instead of a “Protestant church,” a “jee-jaw” instead of an “iPhone,” “Syntactic Faculty” instead of “nominalism,” “Semantic Faculty” instead of “naturalism,” or “Orth” instead of “Latin.” Likewise, I found myself constantly interpreting the dates by adding 2200.** Fortunately after a few hundred pages this didn’t bother me, not so much because I thought it was justified, but because I was sufficiently acclimated to it and enjoying the novel that I didn’t notice anymore. Still, I would have preferred it if he just set the book in the future of an alternate Earth which had science monks but didn’t have a bunch of silly vocabulary.
Also, for better or worse, several of the secondary characters were basically recycled from Cryptonomicon. So the outdoorsman Yul and his tomboy girlfriend Cord are basically the same people as the outdoorsman Doug Shaftoe and his daughter Amy. Likewise, the asshole master Procian Fraa Lodoghir is basically the same person as the asshole celebrity postmodernist GEB Kivistik.
Oh, and there’s kung fu.
*If you want to know what I mean by “mostly,” read the book’s spoiler-tastic Wikipedia page and figure it out.
**Aside from the whole science monasteries thing, the book’s backstory closely parallels actual political and intellectual history through the “Praxic” (read: modern) age. Their dating system is pegged to a horrific nuclear war in about the year 2200 AD rather than to the foundation of the Bazian church (read: the birth of Christ). The novel’s present is 3700 years after the nuclear apocalypse, or about the equivalent of the year 6000 AD.
Status, Sorting, and Meritocracy
| Gabriel |
Over at OrgTheory, Fabio asked how much turnover we should expect to see in the NRC rankings. In the comments, a few other people and I discussed the analysis of the rankings in Burris 2004 ASR. Kieran mentioned the interpretation that the pattern could be entirely a matter of sorting.
To see how plausible this is I wrote a simulation with 500 grad students, each of whom has a latent amount of talent that can only be observed with some noise. The students are admitted in cohorts of 15 each to 34 PhD granting departments and are strictly sorted so the (apparently) best students go to the best schools. There they work on their dissertations, the quality of which is a function of their talent, luck, and (to represent the possibility that top departments teach you more) a parameter proportional to the inverse root of the department’s rank. There is then a job market, with one job line per PhD granting department, and again, strict sorting (without even an exception for the incest taboo). I then summarize the amount of reproduction as the proportion of top 10 jobs that are taken by grad students from the top ten schools.
So how plausible is the meritocracy explanation? It turns out it’s pretty plausible. This table shows the average closure for the top 10 jobs, averaged over 100 runs for each of several combinations of assumptions. Each cell shows, on average, what proportion of the top 10 jobs we expect to be taken by students from the top 10 schools if we take the row and column parameters as assumptions. The rows represent different assumptions about how noisy our observation of talent is when we read an application to grad school or a job search. The columns represent a scaling parameter for how much you learn at different ranked schools. For instance, if we assume a learning parameter of “1.5,” a student at the 4th highest-ranked school would learn 1.5/(4^0.5), or .75. It turns out that unless you assume noise to be very high (something like a unit signal:noise ratio or worse), meritocracy is pretty plausible. Furthermore, if you assume that the top schools actually educate grad students better, then meritocracy looks very plausible even if there’s a lot of noise.
P of top 10 jobs taken by students from top 10 schools
------------------------------------------------------
Noisiness of  |
Admissions    |
and Diss /    |  How Much More Do You Learn at
Job Market    |  Top Schools
              |     0     .5      1    1.5      2
--------------+---------------------------------------
            0 |     1      1      1      1      1
           .1 |     1      1      1      1      1
           .2 |     1      1      1      1      1
           .3 |  .999      1      1      1      1
           .4 |  .997      1      1      1      1
           .5 |  .983   .995   .999      1      1
           .6 |  .966    .99   .991   .999   .999
           .7 |  .915    .96   .982   .991   .995
           .8 |  .867   .932   .963   .975   .986
           .9 |  .817   .887   .904   .957   .977
            1 |  .788   .853   .873   .919    .95
------------------------------------------------------
Of course, keep in mind this is all in a world of frictionless planes and perfectly spherical cows. If we assume that lots of people are choosing on other margins, or that there’s not a strict dual queue of positions and occupants (e.g., because searches are focused rather than “open”), then it gets a bit looser. Furthermore, I’m still not sure that the meritocracy model has a good explanation for the fact that academic productivity figures (citation counts, etc) have only a loose correlation with ranking.
Here’s the code, knock yourself out using different metrics of reproduction, inputting different assumptions, etc.
[Update: also see Jim Moody’s much more elaborate/realistic simulation, which gives similar results].
capture program drop socmeritocracy
program define socmeritocracy
    local gre_noise=round(`1',.001) /* size of error term, relative to standard normal, for apparenttalent=f(talent) */
    local diss_noise=round(`2',.001) /* size of error term, relative to standard normal, for dissquality=f(talent) */
    local quality=round(`3',.001) /* scaling parameter for valueadded (by quality grad school) */
    local cohortsize=round(`4',.001) /* size of annual graduate cohort (for each program) */
    local facultylines=round(`5',.001) /* number of faculty lines (for each program)*/
    local batch `6'
    clear
    quietly set obs 500 /*create 500 BAs applying to grad school*/
    quietly gen talent=rnormal() /* draw talent from normal */
    quietly gen apparenttalent=talent + rnormal(0,`gre_noise') /*observe talent w error */
    *grad school admissions follows strict dual queue by apparent talent and dept rank
    gsort -apparenttalent
    quietly gen gradschool=1 + floor(([_n]-1)/`cohortsize')
    lab var gradschool "dept rank of grad school"
    *how much more do you actually learn at prestigious schools
    quietly gen valueadded=`quality'*(1/(gradschool^0.5))
    *how good is dissertation, as f(talent, gschool value added, noise)
    quietly gen dissquality=talent+rnormal(0,`diss_noise') + valueadded
    *job market follows strict dual queue of diss quality and dept rank (no incest taboo/preference)
    gsort -dissquality
    quietly gen placement=1 + floor(([_n]-1)/`facultylines')
    lab var placement "dept rank of 1st job"
    quietly sum gradschool
    quietly replace placement=. if placement>`r(max)' /*those not placed in PhD granting departments do not have research jobs (and may not even have finished PhD)*/
    *recode outcomes in a few ways for convenience of presentation
    quietly gen researchjob=placement
    quietly recode researchjob 0/999=1 .=0
    lab var researchjob "finished PhD and has research job"
    quietly gen gschool_type=gradschool
    quietly recode gschool_type 1/10=1 11/999=2 .=3
    quietly gen job_type=placement
    quietly recode job_type 1/10=1 11/999=2 .=3
    quietly gen job_top10=placement
    quietly recode job_top10 1/10=1 11/999=0
    lab def typology 1 "top 10" 2 "lower ranked" 3 "non-research"
    lab val gschool_type job_type typology
    if "`batch'"=="1" {
        quietly tab gschool_type job_type, matcell(xtab)
        local p_reproduction=xtab[1,1]/(xtab[1,1]+xtab[2,1])
        shell echo "`gre_noise' `diss_noise' `quality' `cohortsize' `facultylines' `p_reproduction'" >> socmeritocracyresults.txt
    }
    else {
        twoway (lowess researchjob gradschool), ytitle(Proportion Placed) xtitle(Grad School Rank)
        tab gschool_type job_type, chi2
    }
end

shell echo "gre_noise diss_noise quality cohortsize facultylines p_reproduction" > socmeritocracyresults.txt
forvalues gnoise=0(.1)1 {
    local dnoise=`gnoise'
    forvalues qualitylearning=0(.5)2 {
        forvalues i=1/100 {
            disp "`gnoise' `dnoise' `qualitylearning' 15 1 1 tick `i'"
            socmeritocracy `gnoise' `dnoise' `qualitylearning' 15 1 1
        }
    }
}
insheet using socmeritocracyresults.txt, clear delim(" ")
lab var gre_noise "Noisiness of Admissions and Diss / Job Market"
lab var quality "How Much More Do You Learn at Top Schools"
table gre_noise quality, c(m p_reproduction)
Seeing like a state used to see
| Gabriel |
The Census Bureau website has not only the current edition of the Statistical Abstract of the United States, but most of the old editions going back to 1878. A lot of the time you just need the basic summary statistics, not the raw micro data, so this is a great resource.
It’s also just plain interesting, especially if you’re interested in categorical schema. For instance, in the 1930s revenues for the radio industry are in the section on “National Government Finances” (because this is the tax base), whereas the same figure is now in a section on “Information & Communications.” This suggests a very different conception about what the information is there for and who is meant to consume it for what purposes.
What really surprised me though was treating the deaf or blind as commensurable with convicted felons, but there they are in the “Defectives and Delinquents” chapter of editions through the early 1940s. The logic of such a category seems to be based not on a logic of “people who might deserve help vs people who might hurt us” but on a causal model of deviations from normality that assumes a eugenics/phrenology conception of crime as based on a malformed brain. Given that we now use MRI to study the brains of murderers, what is really shocking about the category is less that it applies a materialist etiology to both social deviance and physical disability than the casual harshness of the word “defective.”
Hide iTunes Store/Ping
| Gabriel |
As a cultural sociologist who has published research on music as cultural capital, I understand how my successful presentation of self depends on me making y’all believe that I only listen to George Gershwin, John Adams, Hank Williams, the Raveonettes, and Sleater-Kinney, as compared to what I actually listen to 90% of the time, which is none of your fucking business. For this reason, along with generally being a web 2.0 curmudgeon, I’m not exactly excited about iTunes’ new social networking feature “Ping.” Sure, it’s opt-in and I appreciate that, but I am so actively uninterested in it that I don’t even want to see its icon in the sidebar. Hence I wrote this one-liner that hides it (along with the iTunes Store).
defaults write com.apple.iTunes disableMusicStore -bool TRUE
My preferred way of running it is to use Automator to run it as a service in iTunes that takes no input.
Unfortunately it also hides the iTunes Store, which I actually use a couple times a month. To get the store (and unfortunately, Ping) back, either click through an iTunes store link in your web browser or run the command again but with the last word as “FALSE”.
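That is, the un-hiding version is simply:
defaults write com.apple.iTunes disableMusicStore -bool FALSE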
Predicted vignettes
| Gabriel |
One of my favorite ways to interpret a complicated model is to make up hypothetical cases and see what predicted values they give you. For instance, at the end of my Oscars paper with Esparza and Bonacich, we compared predicted probabilities of Oscar nomination for the 2 x 2 of typical vs exceptional actors in typical vs exceptional films. Doing so helps make sense of a complicated model in a way that doesn’t boil down to p-value fetishism. Stata 11 has a very useful “margins” command for doing something comparable to this, but as best as I can tell, “margins” only works with categorical variables.
The way I like to do this kind of thing is to create some hypothetical cases representing various scenarios and use the “predict” command to see what the model predicts for such a scenario. (Note that you can do this in Excel, which is handy for trying to interpret published work, but it’s easier to use Stata for your own work). Below is a simple illustration of how this works using the 1978 cars dataset. In real life it’s overkill to do this for a model with only a few variables where everything is linear and there are no interactions, but this is just an illustration.
Also, note that this approach can get you into trouble if you feed it nonsensical independent variable values. So the example below includes the rather silly scenario of a car that accomplishes the formidable engineering feat of getting good mileage even though it’s very heavy. A conceptually related problem is that you have to be careful if you’re using interaction terms that are specified by the “gen” command (which is a good reason to use factor variables when possible instead of “gen” for interactions). Another way you can get in trouble is going beyond the observed range (e.g., trying to predict the price of a car that gets 100 mpg).
sysuse auto, clear
local indyvars "mpg headroom weight"
local n_hypothetical 10

*create the hypothetical data
*first, append some (empty) cases
local biggerdata=[_N]+`n_hypothetical'
set obs `biggerdata'
gen hypothetical=0
replace hypothetical=1 in -`n_hypothetical'/L

*for each of these hypothetical cases, set all independent variables to the
* mean
*note that you can screw up here if some of the real cases are missing data
* on some variables but not others. in such an instance the means for the
* analytic subset will not match those of the hypothetical cases
foreach var in `indyvars' {
    sum `var'
    replace `var'=`r(mean)' in -`n_hypothetical'/L
}

*change selected hypothetical values to be theoretically interesting values
*i'm using a nested loop with a tick counter to allow combinations of values
* on two dimensions
*if you only want to play with a single variable this can be a lot simpler
* alternately you don't need to loop at all, but can just "input" or "edit"
local i=-`n_hypothetical'
forvalues mpg=0/1 {
    forvalues weight=0/4 {
        quietly replace mpg=20 + 20*`mpg' in `i'
        quietly replace weight=2000 + 500*`weight' in `i'
        local i=`i'+1
    }
}

*make sure that hypothetical cases are missing on Y and therefore don't go
*into regression model
replace price=. in -`n_hypothetical'/L

*do the regression
preserve
*to be extra sure the hypotheticals don't bias the regression, drop them,
keep if hypothetical==0
reg price `indyvars'
*then bring them back
restore

*create predicted values (for all cases, real and hypothetical)
* this postestimation command does a lot of the work and you should explore
* the options, which are different for the various regression commands
predict yhat

*create table of vignette predictions (all else held at mean or median)
preserve
keep if hypothetical==1
table mpg weight, c(m yhat)
restore
How to Review a Literature
| Gabriel |
Following the reader service model of O&M’s recent “How to Read an Academic Article” and OT’s long running grad skool rulz, I figured I’d describe the proper way to review a literature for a research paper. I should start with a lawyer’s joke/parable.
Eugene Volokh recently puzzled his blawgosphere audience with the term “red cow.” As commenter “James E.” explained:
A country practitioner was retained one day by a client whose red cow had broken into his neighbor’s grain field, and litigation ensued. The practitioner went carefully over the details of the facts in the case with a student in his office, and assigned to the student the duty of “looking up the law” on the subject. Some time after he asked the student what success he had had with the authorities bearing on the case. The student replied: “‘Squire, I have searched diligently through every law book in the library, and there isn’t a red cow case in them.”
Central Law Journal, Vol. 79, p. 299 (1914)
The joke of course is that this lawyer thought the issue was red cows rather than trespassing, negligence, and other abstract legal concepts. This was a lot less funny when I realized that this kind of substantively-focused literalism was exactly how I approached doing a lit review for a research paper when I was in college and in my first year or two of grad school. I would open up Sociofile (now called “Sociological Abstracts”) and search for substantive key terms, something like “social movements AND television.” That is, I was searching for prior literature on my substantive issue.
A substantive search is worth doing to a certain extent, but it’s not nearly as important as getting (and understanding) theory. A single theory often involves wildly disparate empirical issues. For instance, Status Signals has chapters on banking, wine, and patents, as well as more fleeting references to things like jewelry. So how do you do the theoretical aspect of the review? Well, to a large extent it’s just an issue of learning a large body of literature inside out, but that takes a very long time. In the meantime, here’s the advice I give to my grad students.
1. Use Sociological Abstracts, Google Scholar, etc. for substantive queries but realize that this will only be about a quarter of the work. These databases aren’t very good at queries by theory.
2. Figure out what theoretical problems are at issue in your work. Bounce your empirical issues off your friends and mentors to see what theoretical issues they see. They may suggest theories you’ve never heard of. Also ask them for specific citations that they recommend.
3. Search for essays on the previously flagged theories in Annual Review of Sociology (and possibly Annual Reviews for adjacent disciplines or Journal of Economic Perspectives) to find a review of this literature, preferably one from the last ten years. (If you’re lucky, you’ve recently taken a graduate seminar on your target literature, which is effectively ARS as live theater.) You can also use a few empirical publications that you’ve read or which are recommended to you as providing particularly good theoretical syntheses.
4. Use these to snowball sample, both backwards and forwards in time. To snowball backwards, read the articles and whenever they mention a citation that sounds interesting, add it to your shopping list. To snowball forward, use Google Scholar to do a cited reference search of your key citations and again, take the stuff that looks good. I prefer GScholar for this over Web of Science because it includes working papers and such, giving you more of the “invisible college.” As you read these things you’ll find still more good cites.
5. Actually read all this stuff and pull out the theoretical problems involved and how they hang together. Try to find one to three important theoretical problems and use each of them to derive a proposition that can be operationalized into an empirically-testable hypothesis. Read empirical articles that you admire and note how they structure their lit review / theory section.
Note that this step is as much imposing structure on the literature as about recognizing the structure that pre-exists because, frankly, the literature is often muddled. For instance, in writing the lit review for my Oscars article, I noticed that a lot of people simply confuse different spillover models — citing Kremer QJE 1993 when what they seem to have in mind is a much better fit with Saint-Paul JPE 2001 or Stinchcombe ASR 1963.
6. Get back to me in about two years when you’ve finished doing all of this and we can talk about actually doing the empirical part of the project.
Weeks
| Gabriel |
I use a lot of data that exists at the daily level (stored as a string like “9/3/2010”), but which I prefer to handle at the weekly level. In part this is about keeping the memory manageable but it also lets me bracket the epiphenomenal issues associated with the weekly work pattern of the music industry (e.g., records drop on Tuesdays). There are a few ways to turn a date into a week in Stata: the official way, the way I used to do it, and the way I do it now.
1. Beginning with Stata 10, there is a data type (%tw) which stores dates by the week, in contrast to %td (formerly just “%d”) which stores it as a date. This is good for some purposes but it gets messy if you’re trying to look at things that cross calendar years since the last week of the year can have a funny number of days. (Another reason I don’t use these functions is simply that I’ve been working on these scripts since Stata 9). However if you want to do it, it looks like this:
gen fp_w=wofd(date(firstplayed,"MDY"))
format fp_w %tw
2a. The way I prefer to do it is to store data as %td, but to force it to count by sevens, so effectively the %td data really stands for “the week ending in this date” or “this date give or take a few days.” Until a few days ago, I’d do this by dividing by seven and forcing it to be an integer, then multiplying again.
gen fpdate=date(firstplayed,"MDY")
gen int fp1=fpdate/7
gen fp_w=fp1*7
format fp_w %td
drop fp1
This is really ugly code and I’m not proud of it. First, note that this is a lot of code considering how little it does. I could have done this more efficiently by using the “mod(x,y)” function to subtract the remainder. Second, this only works if you’re interested in rounding off to the closest Friday and not some other day of the week.
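For the record, a sketch of that mod() version (untested; note that subtracting the remainder always gives the Friday on or before, rather than the closest Friday, since day 0 of %td, 1jan1960, was a Friday):
* untested sketch of the mod() alternative mentioned above
gen fpdate=date(firstplayed,"MDY")
gen fp_w=fpdate-mod(fpdate,7)
format fp_w %td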
2b. My new approach still stores the data as “%td” but is both more flexible and slightly simpler. In particular, it lets me define “week ending on X” where X is any day of the week I choose, here specified as the local `dow’ so I can define the end of the week once in the header and have it apply throughout the several places where I do something like this. Note that Stata treats Sunday as 0, Monday as 1, etc. What I do is subtract the actual day of the week, then add the target day of the week, so it substantively means “this thing happened on or a few days before this date.”
gen fp_w=date(firstplayed,"MDY")-dow(date(firstplayed,"MDY"))+`dow'
format fp_w %td