Grepmerge

April 29, 2010 at 12:45 pm 8 comments

| Gabriel |

Over at the Orgtheory mothership, Fabio asked how to do a partial string match in Stata, specifically to see if certain keywords appear in scientific abstracts. This turns out to be hard, not because there are no appropriate functions in Stata (both strmatch() and regexm() can do it) but because Stata can only handle 244 characters in a string variable. Many of the kinds of data we’d want to do content analysis on are much bigger than this. For instance, scientific abstracts are about 2000 characters and news stories are about 10000 characters.

OW suggested using SPSS, and her advice is well-taken as she’s a master at ginormous content analysis projects. Andrew Perrin suggested using Perl and this is closer to my own sympathies. I agree that Perl is generally a good idea for content analysis, but in this case I think a simple grep will suffice.

grep "searchterm" filein.csv | cut -d "," -f 1 > fileout.csv

The way this works is you start with a csv file called filein.csv (or whatever) where the record id key is in the first column. You do a grep search for “searchterm” in that file and pipe the output to the “cut” command. The -d “,” option tells cut that the stream is comma delimited and the -f 1 option tells it to keep only the first field (which is your unique record id). The “> fileout.csv” part redirects the output to disk. (Note that in Unix “>” as a file operator means replace and “>>” means append.) You then have a text file called fileout.csv that’s just a list of the ids of the records where your search term appears. You can merge this into Stata and treat a _merge==3 as meaning that the case includes the search term.
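To make the pipeline concrete, here is a minimal sketch run against a made-up three-record file. The file names (abstracts.csv, matches.csv, either.csv), the ids, and the search terms are all hypothetical, and the sketch assumes the id column contains no embedded commas, since cut splits on every comma it sees.

```shell
# Build a tiny sample file (hypothetical data): record id first, then text.
printf '%s\n' \
  '101,We study network diffusion in radio' \
  '102,A model of organizational ecology' \
  '103,Diffusion of innovations among physicians' > abstracts.csv

# ">" replaces the output file. grep is case-sensitive by default,
# so record 103 ("Diffusion") is skipped and only id 101 is kept.
grep "diffusion" abstracts.csv | cut -d "," -f 1 > matches.csv

# ">>" appends: either.csv ends up holding the ids for both terms.
grep "diffusion" abstracts.csv | cut -d "," -f 1 > either.csv
grep "ecology" abstracts.csv | cut -d "," -f 1 >> either.csv

cat matches.csv
cat either.csv
```

Note that grep matches anywhere on the line, including the id column itself, so a numeric search term could in principle hit a record id rather than the text.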

You can also wrap the whole thing in a Stata command that takes as arguments (in order): the term to search for, the file to look for it in, the name of the key variable in the master data, and (optionally) the name of the new variable that indicates a match. However, for some reason the Stata-wrapped version only works with literal strings and not regexp searches. Also note that all this is for Mac/Linux. You might be able to get it to work on Windows with Cygwin or PowerShell.
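If you do need a regexp search, one possible workaround is to run grep directly in the shell and then merge the resulting id file into Stata by hand. The following is only a sketch: the sample records, the patterns, and the output file names are hypothetical. The -E flag selects extended regular expressions and -F forces a literal match.

```shell
# Hypothetical sample data: record id first, then text.
printf '%s\n' \
  '201,Gene expression in yeast' \
  '202,A genetic algorithm for scheduling' \
  '203,Costs of a.b testing' \
  '204,Results of a-b comparison' > filein.csv

# Extended regexp: "Gene" or "gene" followed by "tic" or a space.
grep -E "[Gg]ene(tic| )" filein.csv | cut -d "," -f 1 > regex_ids.csv

# -F treats the pattern as a literal string, so "." matches only a period
# and the "a-b" in record 204 is not picked up.
grep -F "a.b" filein.csv | cut -d "," -f 1 > literal_ids.csv

cat regex_ids.csv
cat literal_ids.csv
```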

capture program drop grepmerge
program define grepmerge
	local searchterm	"`1'"
	local fileread	"`2'"
	local key "`3'"
	if "`4'"=="" {
		local newvar "`1'"
	}
	else {
		local newvar "`4'"
	}
	tempfile filewrite
	shell grep "`searchterm'" `fileread' | cut -d "," -f 1 > `filewrite'
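	tempvar sortorder
	gen `sortorder'=_n // remember the original row order
	tempfile masterdata
	save `masterdata'
	insheet using `filewrite', clear // the matching ids are now the master data
	ren v1 `key'
	merge 1:1 `key' using `masterdata', gen(`newvar') // original data is the "using" file
	sort `sortorder'
	recode `newvar' 1=.a 2=0 3=1 // 1 = id only in grep output, 2 = no match, 3 = match
	notes `newvar' : "`searchterm'" appears in this case
	lab val `newvar' // detach the _merge value label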
end


8 Comments

  • 1. ulrich  |  May 3, 2010 at 5:58 pm

    I’m quite new to Stata programming, so be kind, but shouldn’t you use the `" "' quotes to allow for these special characters in a search?

    • 2. gabrielrossman  |  May 3, 2010 at 6:08 pm

      it hadn’t occurred to me that somebody would want to search for the quote character itself but in retrospect i can see how this would be important to, for instance, content analysis of journalism.

      anyway, you’re right, i should have escaped the quotes in line 13. thanks for catching the bug

  • 3. Gabi Huiber  |  May 28, 2010 at 2:21 pm

    The 244-character limitation is easy to bypass with a call to Mata. Here’s one way:

    local checkthis "Over at the Orgtheory mothership, Fabio asked how to do a partial string match in Stata, specifically to see if certain keywords appear in scientific abstracts. This turns out to be hard, not because there are no appropriate functions in Stata (both strmatch() and regexm() can do it) but because Stata can only handle 244 characters in a string variable. Many of the kinds of data we’d want to do content analysis on are much bigger than this. For instance, scientific abstracts are about 2000 characters and news stories are about 10000 characters. Budapest."

    // as expected, this won’t find a thing, because the
    // matching string is past the 244-character limit.
    di strmatch("`checkthis'","*Budapest*")

    // but this will work:
    mata
    strmatch(st_local("checkthis"),"*Budapest*")
    end

    • 4. gabrielrossman  |  May 28, 2010 at 3:34 pm

      neat trick, thanks
      unfortunately i don’t see it scaling up very well because you still have to get it off the disk and into the macro before Mata can process it. it wouldn’t work to use “insheet” or “use” then mkmat or local because insheet feeds the master data and thus runs up against the 244 character limit. the only way I can see how to get from disk to Mata would be a rather cumbersome use of “file read” and it seems like by the time you do all that you’re better off just piping to the OS grep. of course, i know very little about Mata and it might have some direct ability to read from disk that would make using Mata a very promising approach to content analysis.

      • 5. gabrielrossman  |  May 28, 2010 at 11:34 pm

        from the trackback to your own blog i see you were able to work in reading from disk — congratulations, that’s some nice code.

  • 6. Gabi Huiber  |  May 28, 2010 at 2:30 pm

    I would love to know how you got Stata syntax highlighting in WordPress. It seems to be a work in progress, but it’s a promising start. I’ve never seen this done before.

    • 7. gabrielrossman  |  May 28, 2010 at 3:21 pm

      i use the “sourcecode” tag. there’s no Stata syntax built-in but there is a perl syntax and that’s what I use because the structural aspects of the language are the same as Stata. if you want to deal with the raw engine behind the tag you could use a Scintilla style file for Stata (or any other language) but in practice it’s much easier to go with a default language like perl.

  • 8. Using Mata for string processing | A Stata Mind  |  May 28, 2010 at 10:29 pm

    […] My friend Dan Blanchette showed me a little Mata function yesterday that he wrote for changing the case — lower, upper, proper — for strings longer than 244 characters. It was fresh in my head today as I went looking for something while babysitting my daughter — can't remember what; babysitting requires undivided attention — and ended up here. […]

