Using grep (or mdfind) to reshape data

April 7, 2010 at 5:13 am

| Gabriel |

Sometimes you have cross-class data that’s arranged the opposite of how you want. For instance, suppose I have a bunch of files organized by song, and I’m interested in finding all the song files that mention a particular radio station, say KIIS-FM. I can run the following command, which searches every file in my song directory (and its subdirectories) for the string “KIIS” and writes the names of the matching files to a text file called “kiis.txt”:

grep -l -r 'KIIS' ~/Documents/book/stata/rawsongs/ > kiis.txt

Of course to run it from within Stata I can prefix it with “shell”. By extension, I could then write a program around this shell command that will let me query station data from my song files (or vice versa). You could do something similar to see what news stories saved from Lexis-Nexis or scraped web pages contain a certain keyword.
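As a rough sketch of what such a wrapper might look like on the shell side, here is a small function that takes the station, the directory, and the output file as arguments. The name find_station and its argument order are my own invention for illustration, and I've added -F so the station name is treated as a literal string rather than a regular expression.

```shell
# find_station STATION DIR OUTFILE
# Hypothetical helper (the name and arguments are assumptions, not part
# of grep): writes to OUTFILE the names of all files under DIR that
# mention STATION.
find_station() {
    fs_station=$1
    fs_dir=$2
    fs_out=$3
    # -r  recurse into subdirectories
    # -l  print only the names of files that contain a match
    # -F  treat the station name as a fixed string, not a regex
    # --  stop option parsing, in case the name starts with a dash
    grep -rlF -- "$fs_station" "$fs_dir" > "$fs_out"
}

# Example (not run here):
#   find_station KIIS ~/Documents/book/stata/rawsongs/ kiis.txt
```

Keep the output file outside the directory being searched, otherwise grep may end up reading its own results.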

Unfortunately grep is pretty slow, since it has to read through every file each time you search, but you can do it faster by accessing your desktop search index. It’s basically the difference between reading a book looking for a reference versus looking the reference up in the book’s index. This is especially important if you’re searching over a lot of data: grep is fine for a few dozen files, but you want indexed search if you’re looking over thousands of files, let alone your whole file system. On a Mac, you can access your Spotlight index from shell scripts (or the Terminal) with “mdfind”. The syntax is a little different from grep’s, so the example above should be rewritten as:

mdfind -onlyin ~/Documents/book/stata/rawsongs/ "KIIS" > kiis.txt

While grep is slower than mdfind, it’s also more flexible. Fortunately (as described here), you can get the best of both worlds by doing a broad search with mdfind then piping the results to grep for more refined work.
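Here is a minimal sketch of that two-stage search, assuming a Mac with Spotlight indexing turned on for the directory. The function names refine_hits and broad_then_refine are hypothetical, and so is the example pattern. mdfind's -0 flag and xargs -0 pass the file names NUL-separated so that paths containing spaces survive the pipe.

```shell
# refine_hits PATTERN
# Reads NUL-separated file paths on stdin and prints the ones whose
# contents match the extended regular expression PATTERN.
refine_hits() {
    xargs -0 grep -lE -- "$1"
}

# broad_then_refine DIR TERM PATTERN
# Hypothetical wrapper (macOS only): a fast, broad Spotlight search for
# TERM under DIR, then a slower but more precise grep for PATTERN over
# just the files Spotlight returned.
broad_then_refine() {
    mdfind -0 -onlyin "$1" "$2" | refine_hits "$3"
}

# Example (not run here): keep only files that say "KIIS-FM",
# "KIIS FM", or "KIISFM", rather than any mention of "KIIS".
#   broad_then_refine ~/Documents/book/stata/rawsongs/ "KIIS" 'KIIS-? ?FM' > kiis.txt
```

The point of the split is that grep only has to open the handful of files Spotlight has already flagged, rather than the whole directory tree.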


