  • There’s a very interesting discussion at AdAge comparing buzz metrics (basically, data mining blogs and Twitter) to traditional surveys. Although the context is market research, this is an issue that potentially has a lot of relevance for basic research and so I recommend it even to people who don’t particularly care about advertising. The epistemological issue is basically the old validity versus generalizability debate. Surveys are more representative of the general consumer, but they suffer from extremely low salience, so the answers are riddled with question wording effects and that sort of thing to the point of being almost meaningless. On the other hand, buzz metrics are meaningful but not representative (what kind of person tweets about laundry bleach?). The practical issue is that buzz metrics are cheaper and faster than surveys.
  • I listened to the bhtv between Fodor and Sober and I really don’t get Fodor’s argument about natural selection. He seems to think that the co-occurrence of traits is some kind of devastating problem for biology, when in fact biologists have well-articulated theories (e.g., “hitchhiking,” “spandrels,” and the “selection for vs. selection of” distinction) for understanding exactly these issues. As implied by the charge “hyper-adaptationist,” there’s already an understanding within the field that these make natural selection a little more complicated than it otherwise might be. However, the internal critics who raise these issues (e.g., the late Stephen Jay Gould) wouldn’t come anywhere close to claiming that they are an anomaly that challenges the paradigm.
  • As a related philosophy of science issue, Phil @ Gelman’s blog has some thoughts on (purposeful or inadvertent) data massaging to fit the model. He takes it as a Bayesian math issue, but I think you can agree with him on Quinean/Kuhnian philosophical grounds.
  • The essay “Why is there no Jewish Narnia?” has been much discussed lately (e.g., Douthat). The essay basically argues that this is because modern Judaism simply is not a mythic religion. The interesting thing though is that it once was, as can be seen clearly in various Babylonian cognates (e.g., the parts of Genesis and Exodus from the J source and the 41st chapter of the book of Job). However, as the essay argues, the mythic aspects were driven out by the rabbinic tradition. Myself, I would go further than that and say that the disenchantment really began with P, though I agree that the rabbinate finished it off, as evidenced by the persistence of myth well through the composition of “Daniel” in the 2nd c. BCE. This reminds me of the conclusion to The Sacred Canopy, where Berger basically says disenchantment has been a long-term trend ever since animism gave way to distinct pagan gods and especially with monotheism.
  • Of course the animism -> paganism -> henotheism -> monotheism -> atheism thing isn’t cleanly monotonic, as we sometimes see with pagan survivalism. The first episode of the new season of Breaking Bad cold opens with a couple of narcos praying at a shrine to La Santa Muerte. In a great NYer piece on narco culture, one of the worshippers says “Yes, it was true that the Catholic Church disapproved of her ‘Little Skinny One,’ she said. ‘But have you noticed how empty their churches are?’” Maybe Rodney Stark should write his next book on the market theory of religion using Mexican Satanism as a case study of a new market entrant that more effectively pandered to the needs of worshippers than the incumbent Catholic Church, what with its stodgy rules against murder. (This isn’t a critique of Stark. Since he’s fond of Chesterton’s aphorism that when people don’t believe in God they don’t believe in nothing, they believe in anything, I think he’d argue that the popularity of the Santa Muerte cult is the product of a lack of competition among decent religions.)
  • The Red Letter feature length deconstructions of the Star Wars prequels are why we have the fair use doctrine. They make dense and creative use of irony, especially with the brilliant contrasts between the narrative and the visual collage. Probably the funniest two segments are the first segment of the Episode I critique when he talks about the poor character development and the fifth segment of the Episode II critique when he plays dating coach for Anakin.

Using grep (or mdfind) to reshape data

Sometimes you have cross-class data that’s arranged the opposite of how you want. For instance, suppose I have a bunch of files organized by song, and I’m interested in finding all the song files that mention a particular radio station, say KIIS-FM. I can run the following command, which finds all the matching song files in my song directory (or its subdirectories) and puts the names of these files in a text file called “kiis.txt”:

grep -l -r 'KIIS' ~/Documents/book/stata/rawsongs/ > kiis.txt

Of course to run it from within Stata I can prefix it with “shell”. By extension, I could then write a program around this shell command that will let me query station data from my song files (or vice versa). You could do something similar to see what news stories saved from Lexis-Nexis or scraped web pages contain a certain keyword.
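As a rough sketch of what such a wrapper might look like at the shell level, here is a small function built around the same grep call. The function name files_mentioning is made up for illustration, and the -F and -e flags are my additions so that the keyword is treated as a literal string rather than a pattern.

```shell
# Hypothetical helper (not from the original post): list every file under a
# directory, including its subdirectories, that mentions a keyword.
# Arguments: $1 = keyword, $2 = directory to search
files_mentioning() {
    keyword=$1
    dir=$2
    # -r  recurse into subdirectories
    # -l  print only the names of files that contain a match
    # -F  treat the keyword as a literal string, not a regular expression
    # -e  mark the next argument as the pattern, even if it starts with "-"
    grep -r -l -F -e "$keyword" "$dir"
}
```

Calling files_mentioning KIIS ~/Documents/book/stata/rawsongs/ > kiis.txt should reproduce the command above, and swapping in a different keyword or directory gives the other queries.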

Unfortunately grep is pretty slow, but you can do it faster by accessing your desktop search index. It’s basically the difference between reading a book looking for a reference versus looking the reference up in the book’s index. This is especially important if you’re searching over a lot of data: grep is fine for a few dozen files, but you want indexed search if you’re looking over thousands of files, let alone your whole file system. On a Mac, you can access your Spotlight index from shell scripts (or the Terminal) with “mdfind”. The syntax is a little different from grep’s, so the example above should be rewritten as:

mdfind -onlyin ~/Documents/book/stata/rawsongs/ "KIIS" > kiis.txt

While grep is slower than mdfind, it’s also more flexible. Fortunately (as described here), you can get the best of both worlds by doing a broad search with mdfind then piping the results to grep for more refined work.
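A minimal sketch of that two-stage approach, assuming the broad search prints one file path per line (as mdfind does). The function name refine and the second-stage pattern 'KIIS-FM' are hypothetical, chosen only for illustration. The mdfind line is left as a comment because it only works on a Mac.

```shell
# Hypothetical second-stage filter: read candidate file paths on stdin (one
# per line) and print only those whose contents match an extended regex.
# Argument: $1 = extended regular expression
refine() {
    pattern=$1
    while IFS= read -r file; do
        # -q  stay quiet and just report match / no match via the exit status
        # Unreadable or vanished files are skipped silently.
        if grep -q -E -e "$pattern" -- "$file" 2>/dev/null; then
            printf '%s\n' "$file"
        fi
    done
}

# On a Mac, the broad indexed search could feed the filter like this:
# mdfind -onlyin ~/Documents/book/stata/rawsongs/ "KIIS" | refine 'KIIS-FM' > kiis.txt
```

Reading paths line by line keeps file names with spaces intact, which a bare pipe into xargs would not.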

I recently started scraping a website using curl and cron (for earlier thoughts on this see here). Because I don’t leave my Mac turned on at 2am, I’m hosting the scrape on one of the UCLA servers. I get a daily log of the scrape by email, but I know myself well enough to know that I’ll get bored with reading the logs after a few days.

As such, I added a “rule” to Mail that looks for error messages. When curl fails for any reason, the standard error message is “Warning: Failed to create the file …” Using the “rules” tab of the preferences, I told Mail to turn any message red if it has my server’s log boilerplate in the subject line and contains a curl error message anywhere in the text. Now when I open my email in the morning and see a black message I know everything is fine, whereas a red message (or no message at all) means there’s a problem.
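The same check could in principle run on the server before the log is ever mailed. What follows is only a sketch of that idea, not what the Mail rule does internally: the function name scrape_status and the "problem"/"ok" labels are made up, and the marker text is simply the curl message quoted above.

```shell
# Marker text taken from the curl warning quoted in the post
CURL_ERROR='Warning: Failed to create the file'

# Hypothetical helper: classify a scrape log the way the Mail rule does.
# Prints "problem" if the log is missing, empty, or contains the curl
# warning; prints "ok" otherwise.
# Argument: $1 = path to the log file
scrape_status() {
    log=$1
    if [ ! -s "$log" ]; then
        # No log (or an empty one) is treated like "no message at all"
        echo "problem"
    elif grep -q -F -e "$CURL_ERROR" -- "$log"; then
        echo "problem"
    else
        echo "ok"
    fi
}
```

The result could then be put in the subject line of the daily email, so the mail client only has to match on the subject.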

