Using filefilter to make insheet happy

June 8, 2012 at 1:24 pm

| Gabriel |

As someone who mostly works with text files that contain a lot of strings, I often run into trouble with Stata’s insheet command, which is extremely finicky about the data it accepts. Frequently it throws out half the rows because somewhere in the file there’s a stray character, and Stata discards not only that row but everything after it. In my recent work on IMDb I’ve gotten into the habit of first reading text files into Excel, then having Stata read the xlsx files. This is tolerable if you’re dealing with a relatively small number of files that you only import once, but it won’t scale to repeated imports or a large number of files.

More recently, I’ve been dealing with data collected with the R library twitteR. Since tweets sometimes contain literal quotes, twitteR escapes them with a backslash. However, Stata does not recognize this convention, and insheet chokes whenever the quote characters are unbalanced. I realized this the other night when I was trying to test for left-censorship and fixed it using the batch find/replace in TextWrangler. Of course this is not scriptable, so I was contemplating taking the plunge into Perl when the Modeled Behavior hive mind suggested the Stata command filefilter. Using this command I can replace the escaped literal quotes (which choke Stata’s insheet) with literal apostrophes (which insheet can handle). In filefilter’s pattern syntax, \BS stands for a backslash, \Q for a double quote, and \RQ for a right single quote (an apostrophe).

filefilter foo.txt foo2.txt, from(\BS\Q) to(\RQ)

Problem solved, natively in Stata, and I have about 25% more observations. Thanks guys.
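Since the whole point of doing this natively is that it can be scripted, the same fix could probably be looped over a batch of files. What follows is only a minimal sketch under some assumptions, not the code I actually ran: the tweets_*.txt file names and the clean_ prefix are hypothetical, and it assumes insheet’s default delimiter detection is adequate for the files.

```stata
* Hypothetical input files: tweets_*.txt in the working directory
local files : dir "." files "tweets_*.txt"

foreach f of local files {
    * Drop the .txt extension to name the saved dataset
    local stem = subinstr("`f'", ".txt", "", 1)

    * Replace each backslash-escaped double quote with an apostrophe;
    * the clean_ prefix (an assumption) keeps output out of the input glob
    filefilter "`f'" "clean_`f'", from(\BS\Q) to(\RQ) replace

    * filefilter stores the number of replacements in r(occurrences)
    display "`f': `r(occurrences)' escaped quotes replaced"

    * Read the cleaned file and save it as a Stata dataset
    insheet using "clean_`f'", clear
    save "`stem'.dta", replace
}
```

Writing the cleaned copy to a separate file rather than over the original leaves the raw twitteR output untouched, so the import can be rerun from scratch if the filter turns out to need another pattern.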


3 Comments

  • 1. Nick Cox  |  June 9, 2012 at 10:27 am

    There is a short Tip on -filefilter- in

    Riley, A.R. 2008. Stata tip 60: Fast and easy changes to files with filefilter, Stata Journal 8(2): 290–292

    Full text accessible to all at

    http://www.stata-journal.com/sjpdf.html?articlenum=pr0039

  • 2. JP Ferguson  |  June 9, 2012 at 8:26 pm

    Using filefilter to remove non-ASCII characters from text data isn’t the most efficient process in the world, but still it’s saved me more time than I care to think about.

  • 3. Brooks  |  July 26, 2012 at 10:44 am

    Re: TextWrangler not scriptable

    Time for BBEdit?

