Archive for April, 2010

Grepmerge

| Gabriel |

Over at the Orgtheory mothership, Fabio asked how to do a partial string match in Stata, specifically to see if certain keywords appear in scientific abstracts. This turns out to be hard, not because there are no appropriate functions in Stata (both strmatch() and regexm() can do it) but because Stata can only handle 244 characters in a string variable. Many of the kinds of data we’d want to do content analysis on are much bigger than this. For instance, scientific abstracts are about 2000 characters and news stories are about 10000 characters.

OW suggested using SPSS, and her advice is well-taken as she’s a master at ginormous content analysis projects. Andrew Perrin suggested using Perl and this is closer to my own sympathies. I agree that Perl is generally a good idea for content analysis, but in this case I think a simple grep will suffice.

grep "searchterm" filein.csv | cut -d "," -f 1 > fileout.csv

The way this works is that you start with a csv file called filein.csv (or whatever) where the record id key is in the first column. You do a grep search for “searchterm” in that file and pipe the matching lines to the “cut” command. The -d “,” option tells cut that the stream is comma delimited and the -f 1 option tells it to keep only the first field (which is your unique record id). The “> fileout.csv” part redirects the output to a file on disk. (Note that in Unix “>” as a file operator means replace and “>>” means append.) You then have a text file called fileout.csv that’s just a list of the ids of records where your search term appears. Because grep works line by line, this assumes each record sits on a single line of the csv. You can merge this list into your Stata data and treat _merge==3 as meaning that the case includes the search term.
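To see the pipeline end to end, here is a minimal sketch on a made-up three-record file. The file contents, the ids, and the search term "network" are assumptions for illustration, not data from Fabio's question. It also assumes the id column contains no commas of its own.

```shell
# Work in a scratch directory so nothing real gets overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical input: record id in column 1, abstract text in column 2.
cat > filein.csv <<'EOF'
101,"We study network diffusion among firms"
102,"A survey of cheese doodle preferences"
103,"Radio airplay as a social network process"
EOF

# Keep only the ids of the lines that contain the search term.
grep "network" filein.csv | cut -d "," -f 1 > fileout.csv

# Show the resulting list of matching ids.
cat fileout.csv
```

With this toy input, fileout.csv ends up holding the ids 101 and 103, one per line, which is the list you would then merge into Stata.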

You can also wrap the whole thing in a Stata command that takes as arguments (in order): the term to search for, the file to look for it in, the name of the key variable in the master data, and (optionally) the name of the new variable that indicates a match. However, for some reason the Stata-wrapped version only works with literal strings and not regexp searches. Also note that all this is for Mac/Linux. You might be able to get it to work on Windows with Cygwin or PowerShell.

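capture program drop grepmerge
program define grepmerge
	* arguments: search term, file to search, key variable, (optional) new variable name
	local searchterm	"`1'"
	local fileread	"`2'"
	local key "`3'"
	if "`4'"=="" {
		* default: name the new variable after the search term
		* (so the term must then be a legal variable name)
		local newvar "`1'"
	}
	else {
		local newvar "`4'"
	}
	* write the keys of the matching lines to a temporary file
	tempfile filewrite
	shell grep "`searchterm'" "`fileread'" | cut -d "," -f 1 > "`filewrite'"
	* remember the current sort order, then set the master data aside
	tempvar sortorder
	gen `sortorder'=_n
	tempfile masterdata
	save `masterdata'
	* load the list of matching keys and merge the master data back on
	insheet using "`filewrite'", clear
	ren v1 `key'
	merge 1:1 `key' using `masterdata', gen(`newvar')
	sort `sortorder'
	* 1 = key only in the grep output, 2 = no match, 3 = match
	recode `newvar' 1=.a 2=0 3=1
	notes `newvar' : "`searchterm'" appears in this case
	* detach the _merge value label, which no longer fits the recoded values
	lab val `newvar'
end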
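
Since the wrapped version seems limited to literal strings, one workaround is to run a regexp search from the shell and merge the resulting id list by hand, as with the one-liner earlier. A rough sketch, again on made-up data: -E (extended regular expressions) and -i (ignore case) are standard grep options, while the file names, ids, and pattern are assumptions for illustration.

```shell
# Scratch directory with a hypothetical abstracts file (id, text).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > abstracts.csv <<'EOF'
201,"Organizational ecology of breweries"
202,"Institutional theory and Organisation studies"
203,"Diffusion of pop songs"
EOF

# Match either spelling of the word, in any capitalization,
# and keep only the id column of the matching lines.
grep -E -i "organi[sz]ation" abstracts.csv | cut -d "," -f 1 > matches.csv

cat matches.csv
```

Here matches.csv should list ids 201 and 202, since both spellings match and the third record has neither.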

April 29, 2010 at 12:45 pm 8 comments

Every time you use Powerpoint, Edward Tufte calls in a targeted drone attack on a kitten

| Gabriel |

The NY Times has an article on the infamous Afghan pasta Powerpoint slide, and more broadly the military’s addiction to Powerpoint and the efforts of a few brave officers to detox (h/t Slashdot). [Also see a discussion of this slide at O&M]. This ties into a recent paper published in OS and also discussed at O&M on how Powerpoint has structured business culture.

I use Powerpoint (actually Keynote), but my style is to have it be almost entirely figures and use text very sparingly. (Here are a few examples from my undergrad class: ex 1, ex 2, ex 3). The only place I’ll use a bullet list is to sum up the hypotheses at the very end of a theory section before I move on to analysis/simulation. I think a good rule of thumb is that if somebody could read your ppt file and understand what you were talking about, then it’s a bad ppt file. A good ppt file should be opaque out of the context of the talk itself. If you need to share it with someone who wasn’t there, give them a tape of the talk or your notes. This isn’t deliberate obscurantism but a simple heuristic for understanding if you’re falling into the common trap of your audience skimming ahead in your bullet points and getting bored as they wait for you to (verbally) move on to the next point. I’m convinced people fill their ppt with text because they’re afraid of forgetting their talk, but I’m going to let you in on some ancient wisdom that was known to our ancestors but was forgotten sometime around 1998: keep your fucking notes on a piece of paper that only you can see. Because when you put your outline on the LCD projector itself, the terrorists have won.

April 27, 2010 at 1:04 pm 10 comments

Allegory of the quant

| Gabriel |

And now I will describe in a figure the enlightenment or unenlightenment of our nature: imagine human beings living in a school; they have been there from childhood, having their necks and legs chained. In the school there is a computer, and between the computer and the prisoners an LCD display. Inside the computer are databases, which generate various tables and graphs. “A strange parable,” he said, “and strange captives.” They are ourselves, I replied; and they see only the shadows of the images which the computer throws on the LCD; these they give names like “variables” and “models.” Suppose now that you suddenly send them out to do field work and make them look with pain and grief to themselves at the human subjects; will they believe them to be real? Will not their eyes be dazzled, and will they not try to get away from the light to something which is already in machine-readable format with a well-documented codebook and a reasonably good sample design and response rate?

April 19, 2010 at 5:07 am 4 comments

If you were a cheese doodle, what kind of cheese doodle would you be?

| Gabriel |

April 16, 2010 at 2:40 pm 3 comments

More online interactional vandalism

| Gabriel |

A few months ago, I wrote about social spam, where the spam somehow tries to exploit your personal connections. I recently saw two new instances of this.

First, Slashdot had a story on a lawsuit involving Classmates.com over phony friend requests, supposedly from people with whom you did go to high school but who were not actually using the site or requesting your link. Looks like they tried a bit too hard to provide a Schelling point solution to a coordination game.

Second, I recently got this spam message, which I’ve edited to avoid publicizing the product:

Dr. Rossman:
A friend of mine recommended that I contact you. I am publishing a book on [DATE], [TITLE] with [PUBLISHER]. It is a book that is part sociology, economics, anthropology and a history of the class system in the U.S. She felt that you may have some interest in my work and may provide feedback. If you are interested you can read a description and bio on the Barnes & Noble or Amazon websites or the publisher’s website. I would be happy to know what you think.
Thanks,
[CRANK AUTHOR]

This is actually a pretty convincing message until you stop and ask, who is this friend and why exactly did she think I’d like the book (especially given that I don’t do stratification)?

April 14, 2010 at 4:43 am

Network Graphs in Native Stata Code

| Gabriel |

My approach to handling networks in Stata is to not handle them in Stata. Rather, I take an approach inspired by the Unix philosophy and export the data, then call an R script to do what I need, and in some cases use Perl to clean the output for importing back into Stata. Since R/igraph has great network tools, this is a very flexible and powerful workflow. However, it’s better suited for scripting than exploratory interactive work, and it’s frustrating and slow-going as R syntax is really different from Stata’s.

Fortunately, Rense Corten has created “netplot.” The upside is that it has a simple syntax and is entirely Stata native code so there’s much less hassle than piping to R. The downside is that while it renders fast for small networks, it doesn’t scale as well to medium or large networks as the dedicated SNA packages. The current version only offers MDS and circle as layout algorithms, but future versions may work in K-K or F-R. Here’s a sample plot of a mid-sized random graph (note this took about 30 seconds to render, whereas R/igraph could do it in less than a second).

Anyway, I congratulate Dr. Corten on the accomplishment. I’m really excited about “netplot” in part as a proof of concept towards developing Stata native SNA capabilities but even more because it will make it much more convenient to generate SNA graphs from Stata. For now I still prefer the rendering speed and flexibility of R/igraph, but I really like that I can do a very good initial version natively as it makes exploratory work more convenient. Keep in mind, though, that I have a considerable sunk cost in learning to pipe to R/igraph. Someone who has yet to make such an investment might very well prefer to rely on netplot exclusively for graphing, as it’s extremely easy to use and gives output that’s great for many purposes.

To install “netplot” type “ssc install netplot”. Also see this blog post and this manual.

April 13, 2010 at 5:35 am 3 comments

Cups-pdf

| Gabriel |

The “Preview” submenu in the Mac print dialogue is great, but I’ve become a big fan of cups-pdf. I originally installed it because Windows apps running in Crossover/WINE don’t have access to the Preview submenu and so I couldn’t save PDFs from them. However I’ve found that I use it a lot even with native apps, mostly for saving to a “read later” folder. It only saves a few clicks vs using the “Preview” menu but as Amazon’s patent lawyers will tell you, there’s a certain beauty to “one-click.” Note that after you install it, you need to play with it a bit to get it to work in Snow Leopard. Only the first of these changes is described on their page:

  1. Run the Terminal command
    sudo chmod 0700 /usr/libexec/cups/backend/cups-pdf
  2. It wants to save to a folder called “~/Desktop/cups-pdf” but I’ve had trouble getting this to work. (As far as I can tell, it’s a permissions thing.) My solution is to instead target “/printed_pdf/”. First, create this folder. Second, open the file “/private/etc/cups/cups-pdf.conf” and edit line 43 to read “Out /printed_pdf/”. If you want it on the Desktop, create an alias or use the Terminal command
    ln -s /printed_pdf ~/Desktop/printed_pdf
  3. The default driver is grayscale. To get color, don’t choose “Generic Postscript Printer”. Instead, when given that option go to “Select Printer Software” and then choose “Postscript Generic postscript color printer, rev3a”.
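
The second step can be scripted. Below is a cautious sketch that rehearses the edit on a scratch copy rather than the live config. The scratch file, its placeholder contents, and the variable names are assumptions for illustration. To apply it for real you would point conf at /private/etc/cups/cups-pdf.conf and target at /printed_pdf, run it with sudo, and first check that line 43 really is the Out directive in your version of the file.

```shell
# Rehearse in a scratch directory; nothing under /private/etc is touched.
workdir=$(mktemp -d)
conf="$workdir/cups-pdf.conf"      # stand-in for /private/etc/cups/cups-pdf.conf
target="$workdir/printed_pdf"      # stand-in for /printed_pdf
link="$workdir/Desktop_printed_pdf" # stand-in for ~/Desktop/printed_pdf

# Build a dummy 50-line config so the demo has a line 43 to edit.
i=1
while [ "$i" -le 50 ]; do
    echo "# placeholder line $i"
    i=$((i + 1))
done > "$conf"

# Create the output folder.
mkdir -p "$target"

# Replace line 43 wholesale. Writing to a new file and moving it back
# avoids sed's in-place flag, whose syntax differs between macOS and Linux.
sed '43s|.*|Out /printed_pdf/|' "$conf" > "$conf.new" && mv "$conf.new" "$conf"

# Optional convenience link, mirroring the ln -s command in step 2.
ln -s "$target" "$link"

# Show the edited line.
sed -n '43p' "$conf"
```

Replacing a line by number is brittle, so on a real config it may be safer to open the file and confirm what line 43 says before running anything like this.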

April 9, 2010 at 5:04 am 1 comment
