Archive for April, 2010

Grepmerge

| Gabriel |

Over at the Orgtheory mothership, Fabio asked how to do a partial string match in Stata, specifically to see if certain keywords appear in scientific abstracts. This turns out to be hard, not because there are no appropriate functions in Stata (both strmatch() and regexm() can do it) but because Stata can only handle 244 characters in a string variable. Many of the kinds of data we’d want to do content analysis on are much bigger than this. For instance, scientific abstracts are about 2000 characters and news stories are about 10000 characters.

OW suggested using SPSS, and her advice is well-taken as she’s a master at ginormous content analysis projects. Andrew Perrin suggested using Perl and this is closer to my own sympathies. I agree that Perl is generally a good idea for content analysis, but in this case I think a simple grep will suffice.

grep "searchterm" filein.csv | cut -d "," -f 1 > fileout.csv

The way this works is you start with a csv file called filein.csv (or whatever) where the record id key is in the first column. You do a grep search for “searchterm” in that file and pipe the output to the “cut” command. The -d “,” option tells cut that the stream is comma delimited and the -f 1 option tells it to keep only the first field (which is your unique record id). The “> fileout.csv” part redirects the output to disk. (Note that in Unix “>” as a file operator means replace and “>>” means append.) You then have a text file called fileout.csv that’s just a list of the record ids where your search term appears. You can merge this into Stata and treat _merge==3 as meaning that the case includes the search term.
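Concretely, the manual version looks something like this (a minimal sketch run from within Stata, with the master data already in memory; filein.csv, fileout.csv, and the key variable id are placeholder names):

* flag matching records from the shell
shell grep "searchterm" filein.csv | cut -d "," -f 1 > fileout.csv
* save the master data, read in the list of hits, and merge
tempfile master
save `master'
insheet using fileout.csv, clear
ren v1 id
merge 1:1 id using `master'
* _merge==3 means the search term appears in that case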

You can also wrap the whole thing in a Stata command that takes as arguments (in order): the term to search for, the file to look for it in, the name of the key variable in the master data, and (optionally) the name of the new variable that indicates a match. However, for some reason the Stata-wrapped version only works with literal strings and not regular-expression searches. Also note that all this is for Mac/Linux; you might be able to get it to work on Windows with Cygwin or PowerShell.

capture program drop grepmerge
program define grepmerge
	* arguments: search term, file to search, key variable in the master data,
	* and (optionally) the name of the new indicator variable
	local searchterm "`1'"
	local fileread "`2'"
	local key "`3'"
	if "`4'"=="" {
		local newvar "`1'"
	}
	else {
		local newvar "`4'"
	}
	* grep for the term, keeping only the first (key) column
	tempfile filewrite
	shell grep "`searchterm'" `fileread' | cut -d "," -f 1 > `filewrite'
	* remember the current sort order, then set the master data aside
	tempvar sortorder
	gen `sortorder'=_n
	tempfile masterdata
	save `masterdata'
	* read the list of matching keys and merge the master data back in
	insheet using `filewrite', clear
	ren v1 `key'
	merge 1:1 `key' using `masterdata', gen(`newvar')
	sort `sortorder'
	* recode the merge codes: matched=1, master only=0, grep only=.a
	recode `newvar' 1=.a 2=0 3=1
	notes `newvar' : "`searchterm'" appears in this case
	* strip the automatic merge value label
	lab val `newvar'
end
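For instance (with a made-up example file abstracts.csv keyed on pubid), you’d flag the abstracts mentioning “nanotube” like this:

grepmerge "nanotube" abstracts.csv pubid nano
tab nano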

April 29, 2010 at 12:45 pm 8 comments

Every time you use Powerpoint, Edward Tufte calls in a targeted drone attack on a kitten

| Gabriel |

The NY Times has an article on the infamous Afghan pasta Powerpoint slide, and more broadly the military’s addiction to Powerpoint and the efforts of a few brave officers to detox (h/t Slashdot). [Also see a discussion of this slide at O&M]. This ties into a recent paper published in OS and also discussed at O&M on how Powerpoint has structured business culture.

I use Powerpoint (actually Keynote), but my style is to have it be almost entirely figures and use text very sparingly. (Here are a few examples from my undergrad class: ex 1, ex 2, ex 3). The only place I’ll use a bullet list is to sum up the hypotheses at the very end of a theory section before I move on to analysis/simulation.

I think a good rule of thumb is that if somebody could read your ppt file and understand what you were talking about, then it’s a bad ppt file. A good ppt file should be opaque out of the context of the talk itself. If you need to share it with someone who wasn’t there, give them a tape of the talk or your notes. This isn’t deliberate obscurantism but a simple heuristic for understanding if you’re falling into the common trap of your audience skimming ahead in your bullet points and getting bored as they wait for you to (verbally) move on to the next point.

I’m convinced people fill their ppt with text because they’re afraid of forgetting their talk, but I’m going to let you in on some ancient wisdom that was known to our ancestors but was forgotten sometime around 1998: keep your fucking notes on a piece of paper that only you can see. Because when you put your outline on the LCD projector itself, the terrorists have won.

April 27, 2010 at 1:04 pm 10 comments

Allegory of the quant

| Gabriel |

And now I will describe in a figure the enlightenment or unenlightenment of our nature — Imagine human beings living in a school; they have been there from childhood, having their necks and legs chained. In the school there is a computer, and between the computer and the prisoners an LCD display. Inside the computer are databases, who generate various tables and graphs. “A strange parable,” he said, “and strange captives.” They are ourselves, I replied; and they see only the shadows of the images which the computer throws on the LCD; these they give names like “variables” and “models.” Suppose now that you suddenly send them out to do field work and make them look with pain and grief to themselves at the human subjects; will they believe them to be real? Will not their eyes be dazzled, and will they not try to get away from the light to something which is already in machine-readable format with a well-documented codebook and a reasonably good sample design and response rate?

April 19, 2010 at 5:07 am 4 comments

If you were a cheese doodle, what kind of cheese doodle would you be?

| Gabriel |

April 16, 2010 at 2:40 pm 3 comments

More online interactional vandalism

| Gabriel |

A few months ago, I wrote about social spam, where the spam somehow tries to exploit your personal connections. I recently saw two new instances of this.

First, Slashdot had a story on a lawsuit against Classmates.com over phony friend requests that appeared to come from people you really did go to high school with, but who in fact were not using the site and had not requested your link. Looks like they tried a bit too hard to provide a Schelling point solution to a coordination game.

Second, I recently got this spam message, which I’ve edited to avoid publicizing the product:

Dr. Rossman:
A friend of mine recommended that I contact you. I am publishing a book on [DATE], [TITLE] with [PUBLISHER]. It is a book that is part sociology, economics, anthropology and a history of the class system in the U.S. She felt that you may have some interest in my work and may provide feedback. If you are interested you can read a description and bio on the Barnes & Noble or Amazon websites or the publisher’s website. I would be happy to know what you think.
Thanks,
[CRANK AUTHOR]

This is actually a pretty convincing message until you stop and ask, who is this friend and why exactly did she think I’d like the book (especially given that I don’t do stratification)?

April 14, 2010 at 4:43 am

Network Graphs in Native Stata Code

| Gabriel |

My approach to handling networks in Stata is to not handle them in Stata. Rather I take an approach inspired by the Unix philosophy: export the data, call an R script to do what I need, and in some cases use Perl to clean the output for importing back into Stata. Since R/igraph has great network tools, this is a very flexible and powerful workflow. However, it’s better suited to scripting than to exploratory interactive work, and it’s frustrating and slow going since R syntax is really different from Stata’s.

Fortunately, Rense Corten has created “netplot.” The upside is that it has a simple syntax and is entirely native Stata code, so there’s much less hassle than piping to R. The downside is that while it renders fast for small networks, it doesn’t scale as well to medium or large networks as the dedicated SNA packages. The current version only offers MDS and circle as layout algorithms, but future versions may add Kamada-Kawai or Fruchterman-Reingold. Here’s a sample plot of a mid-sized random graph (note this took about 30 seconds to render, whereas R/igraph could do it in less than a second).

Anyway, I congratulate Dr. Corten on the accomplishment. I’m really excited about “netplot,” in part as a proof of concept towards developing native Stata SNA capabilities, but even more because it will make it much more convenient to generate SNA graphs from Stata. For now I still prefer the rendering speed and flexibility of R/igraph, but I really like that I can do a very good initial version natively, as it makes exploratory work more convenient. Keep in mind, though, that I have a considerable sunk cost in learning to pipe to R/igraph; someone who has yet to make such an investment might very well prefer to rely on netplot exclusively for graphing, as it’s extremely easy to use and gives output that’s great for many purposes.

To install “netplot” type “ssc install netplot”. Also see this blog post and this manual.
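For what it’s worth, here’s the flavor of the syntax (a minimal sketch from memory, assuming an edge list stored as two variables, here hypothetically called ego and alter, and assuming I’m remembering the type() option correctly):

* hypothetical edge-list data: one row per tie
use edgelist.dta, clear
netplot ego alter, type(mds)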

April 13, 2010 at 5:35 am 3 comments

Cups-pdf

| Gabriel |

The “Preview” submenu in the Mac print dialogue is great, but I’ve become a big fan of cups-pdf. I originally installed it because Windows apps running in Crossover/WINE don’t have access to the Preview submenu and so I couldn’t save PDFs from them. However I’ve found that I use it a lot even with native apps, mostly for saving to a “read later” folder. It only saves a few clicks vs using the “Preview” menu but as Amazon’s patent lawyers will tell you, there’s a certain beauty to “one-click.” Note that after you install it, you need to play with it a bit to get it to work in Snow Leopard. Only the first of these changes is described on their page:

  1. run the terminal command
    sudo chmod 0700 /usr/libexec/cups/backend/cups-pdf
  2. it wants to save to a folder called “~/Desktop/cups-pdf” but I’ve had trouble getting this to work (as far as I can tell, it’s a permissions thing). My solution is to instead target “/printed_pdf/”. First, create this folder. Second, open the file “/private/etc/cups/cups-pdf.conf” and edit line 43 to read “Out /printed_pdf/”. If you want it on the Desktop, create an alias or use the Terminal command
    ln -s /printed_pdf ~/Desktop/printed_pdf
  3. The default driver is grayscale. To get color, don’t choose “Generic Postscript Printer”. Instead, when given that option go to “Select Printer Software” and then choose “Postscript Generic postscript color printer, rev3a”.

April 9, 2010 at 5:04 am 1 comment

Misc Links

| Gabriel |

  • There’s a very interesting discussion at AdAge comparing buzz metrics (basically, data mining blogs and Twitter) to traditional surveys. Although the context is market research, this is an issue that potentially has a lot of relevance for basic research and so I recommend it even to people who don’t particularly care about advertising. The epistemological issue is basically the old validity versus generalizability debate. Surveys are more representative of the general consumer but they suffer from extremely low salience and so answers are so riddled with question wording effects and that sort of thing as to be almost meaningless. On the other hand buzz metrics are meaningful but not representative (what kind of person tweets about laundry bleach?). The practical issue is that buzz metrics are cheaper and faster than surveys.
  • I listened to the bhtv between Fodor and Sober and I really don’t get Fodor’s argument about natural selection. He seems to think that the co-occurrence of traits is some kind of devastating problem for biology when in fact biologists have well-articulated theories (i.e., “hitchhiking,” “spandrels,” and the “selection for vs. selection of” distinction) for understanding exactly these issues, and as implied by the charge “hyper-adaptationist” there’s already an understanding within the field that these issues make natural selection a little more complicated than it otherwise might be. However the internal critics who raise these issues (e.g., the late Stephen Jay Gould) wouldn’t come anywhere close to claiming that they are anomalies that challenge the paradigm.
  • As a related philosophy of science issue, Phil @ Gelman’s blog has some thoughts on (purposeful or inadvertent) data massaging to fit the model. He takes it as a Bayesian math issue, but I think you can agree with him on Quinean/Kuhnian philosophical grounds.
  • The essay “Why is there no Jewish Narnia?” has been much discussed lately (e.g., Douthat). The essay basically argues that this is because modern Judaism simply is not a mythic religion. The interesting thing though is that it once was, as can be seen clearly in various Babylonian cognates (e.g., the parts of Genesis and Exodus from the J source and the 41st chapter of the book of Job). However, as the essay argues, the mythic aspects were driven out by the rabbinic tradition. Myself, I would go further than that and say that the disenchantment really began with P, though I agree that the rabbinate finished it off, as evidenced by the persistence of myth well through the composition of “Daniel” in the 2nd c. BCE. This reminds me of the conclusion to The Sacred Canopy, where Berger basically says disenchantment has been a long-term trend ever since animism gave way to distinct pagan gods, and especially with monotheism.
  • Of course the animism -> paganism -> henotheism -> monotheism -> atheism thing isn’t cleanly monotonic, as we sometimes see with pagan survivalism. The first episode of the new season of Breaking Bad cold opens with a couple of narcos praying at a shrine to La Santa Muerte. In a great NYer piece on narco culture, one of the worshippers says “Yes, it was true that the Catholic Church disapproved of her ‘Little Skinny One,’ she said. ‘But have you noticed how empty their churches are?'” Maybe Rodney Stark should write his next book on the market theory of religion using Mexican Satanism as a case study of a new market entrant that more effectively met the needs of worshippers than the incumbent Catholic church, what with its stodgy rules against murder. (This isn’t a critique of Stark. Since he’s fond of Chesterton’s aphorism that when people don’t believe in God they don’t believe in nothing, they believe in anything, I think he’d argue that the popularity of the Santa Muerte cult is the product of a lack of competition among decent religions.)
  • The Red Letter feature length deconstructions of the Star Wars prequels are why we have the fair use doctrine. They make dense and creative use of irony, especially with the brilliant contrasts between the narrative and the visual collage. Probably the funniest two segments are the first segment of the Episode I critique when he talks about the poor character development and the fifth segment of the Episode II critique when he plays dating coach for Anakin.

April 8, 2010 at 5:14 am 2 comments

Using grep (or mdfind) to reshape data

| Gabriel |

Sometimes you have cross-class data that’s arranged the opposite of how you want. For instance, suppose I have a bunch of files organized by song, and I’m interested in finding all the song files that mention a particular radio station, say KIIS-FM. I can run the following command, which finds all the song files in my song directory (or its subdirectories) that mention the station and puts the names of these files in a text file called “kiis.txt”:

grep -l -r 'KIIS' ~/Documents/book/stata/rawsongs/ > kiis.txt

Of course, to run it from within Stata I can prefix it with “shell”. By extension, I could then write a program around this shell command that would let me query station data from my song files (or vice versa), as sketched below. You could do something similar to see which news stories saved from Lexis-Nexis or scraped web pages contain a certain keyword.
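Here’s a rough sketch of what such a wrapper might look like (program and variable names are hypothetical; it just reads the matching file names into memory, from which you could clean and merge back into the song data):

capture program drop grepfiles
program define grepfiles
	* `1' = search term, `2' = directory of raw files
	tempfile hits
	shell grep -l -r "`1'" `2' > `hits'
	* read the matching file names into a variable
	insheet using `hits', clear
	ren v1 filename
end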

Unfortunately grep is pretty slow, but you can do it faster by accessing your desktop search index. It’s basically the difference between reading through a book looking for a reference versus looking the reference up in the book’s index. This is especially important if you’re searching over a lot of data: grep is fine for a few dozen files, but you want indexed search if you’re looking over thousands of files, let alone your whole file system. On a Mac, you can access your Spotlight index from shell scripts (or the Terminal) with “mdfind”. The syntax is a little different from grep’s, so the example above should be rewritten as

mdfind -onlyin ~/Documents/book/stata/rawsongs/ "KIIS" > kiis.txt

While grep is slower than mdfind, it’s also more flexible. Fortunately (as described here), you can get the best of both worlds by doing a broad search with mdfind then piping the results to grep for more refined work.
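For instance, something along these lines (a hypothetical example; the -0 flags keep file names with spaces from breaking the pipe) uses mdfind to narrow the field and then grep to do the more refined pattern matching:

mdfind -0 -onlyin ~/Documents/book/stata/rawsongs/ "KIIS" | xargs -0 grep -l "KIIS[- ]FM" > kiis.txt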

April 7, 2010 at 5:13 am 1 comment

Mail.app and server logs

| Gabriel |

I recently started scraping a website using curl and cron (for earlier thoughts on this, see here). Because I don’t leave my Mac turned on at 2am, I’m hosting the scrape on one of the UCLA servers. I get a daily log of the scrape by email, but I know myself well enough to know that I’ll get bored with reading the logs after a few days.
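For the curious, the cron side is just a crontab entry along these lines (the URL and paths are invented for illustration; note that % has to be escaped in crontab files):

# run at 2am daily; cron mails me whatever the command prints
0 2 * * * curl -sS -o $HOME/scrape/page_`date +\%Y\%m\%d`.html http://www.example.com/data.html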

As such, I added a “rule” to Mail.app that looks for error messages. When curl fails for any reason, the standard error message is “Warning: Failed to create the file …” Using the “rules” tab of the Mail.app preferences, I told Mail to turn any message red if it has my server’s log boilerplate in the subject line and contains a curl error message anywhere in the text. Now when I open my email in the morning and see a black message I know everything is fine whereas a red message (or no message at all) means there’s a problem.

April 2, 2010 at 5:10 am

