Using filefilter to make insheet happy

As someone who mostly works with text files containing a lot of strings, I often run into trouble with Stata’s insheet command, which is extremely finicky about how it takes data. Frequently it throws out half the rows because a stray character somewhere in the file makes Stata drop not only that row but everything after it. In my recent work on IMDb I’ve gotten into the habit of first reading text files into Excel, then having Stata read the xlsx files. This is tolerable if you’re dealing with a relatively small number of files that you’re only importing once, but it won’t scale to repeated imports or a large number of files.

More recently, I’ve been dealing with data collected with the R library twitteR. Since tweets sometimes contain literal quotes, twitteR escapes them with a backslash. However, Stata does not recognize this convention and chokes whenever the quote characters are unbalanced. I realized this the other night when I was trying to test for left-censoring, and I fixed it using the batch find/replace in TextWrangler. Of course this is not scriptable, so I was contemplating taking the plunge into Perl when the Modeled Behavior hive mind suggested the Stata command filefilter. Using this command I can replace the escaped literal quotes (which choke Stata’s insheet) with literal apostrophes (which Stata’s insheet can handle). In filefilter’s pattern syntax, \BS stands for a backslash, \Q for a double quote, and \RQ for a right single quote (an apostrophe).

filefilter foo.txt foo2.txt, from(\BS\Q) to(\RQ)
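
A minimal sketch of how this might sit in a do-file, assuming the raw file is called foo.txt as above and that insheet can work out the delimiter on its own. The replace option, the r(occurrences) check, and the clear option are additions for illustration, not part of the one-liner above.

```stata
* Swap each backslash-escaped double quote for an apostrophe.
* The replace option overwrites foo2.txt if it already exists,
* so the do-file can be rerun without errors.
filefilter foo.txt foo2.txt, from(\BS\Q) to(\RQ) replace

* filefilter stores the number of matches it found in r(occurrences)
display "escaped quotes replaced: " r(occurrences)

* Import the cleaned file, dropping any data currently in memory
insheet using foo2.txt, clear
```

Checking r(occurrences) is a cheap sanity test: if it comes back as zero, the file probably did not contain escaped quotes and the import problem lies elsewhere.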

Problem solved, natively in Stata, and I have about 25% more observations. Thanks, guys.


