Do it to everything in the directory

March 16, 2009 at 4:34 pm 6 comments

| Gabriel |

Lately I’ve been experimenting with Stata’s “shell” command, which gives you direct access to the command line. This is especially useful on POSIX-style systems like Mac OS X and Linux. I like it because it lets me automate my do-files even further, so I can finally eliminate clumsy comment lines like “before running this do-file remember to use text editor to clean the data with the following regular expression …”

One issue that comes up a lot in my cleaning scripts is that they amount to “run this code on about a thousand text files,” so I have to tell Stata which text files. I used to write the list directly into the do-file as a global naming all thousand files, but this was awkward because the global was usually the longest part of the do-file and I had to rewrite it every time I got new data. I realized recently that “shell” can solve both problems.

Partly inspired by a similar hack by the brilliant coders at UCLA ATS, my basic solution is to have a raw directory full of CSV-formatted text files ($raw), a clean directory where I put the Stata files ($clean), and a foreach loop or a program that imports each text file in $raw, recodes some of the variables, and saves it in $clean.

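* make sure the file list exists in the parent directory
cd "$parentpath"
shell touch tmpfile
shell mv tmpfile filelist_text.txt
* list the csv files in the raw directory, one per line
cd "$raw"
shell ls *.csv >"$parentpath/filelist_text.txt"
* strip the .csv extension
shell awk '{ gsub(/\.csv/, ""); print }' "$parentpath/filelist_text.txt" > tmp
shell mv tmp "$parentpath/filelist_text.txt"
* turn newlines into spaces so the whole list sits on one line
shell perl -pe 's/\n/ /g'  "$parentpath/filelist_text.txt" > tmp
shell mv tmp "$parentpath/filelist_text.txt"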
* read the one-line list into a global
file open myfile using "$parentpath/filelist_text.txt", read
file read myfile line
file close myfile
global filelist `line'
foreach file in $filelist  {
 insheet using "$raw/`file'.csv", clear
 gen filename="`file'"
 *drop some cases and variables
 *gen and recode some other variables
 save "$clean/`file'.dta", replace
}
cd "$clean"
clear
set obs 1
gen filename="seedfile"
foreach file in $filelist {
 append using "`file'"
}
drop if filename=="seedfile"
save _allthedata.dta, replace
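
The ls, awk, and perl passes above could probably be collapsed into a single shell function. What follows is only a minimal sketch, not the method used in this post: it swaps in a plain glob loop for the three-step pipeline, the function name list_csv_stems is hypothetical, and it assumes filenames contain no spaces or newlines.

```shell
# list_csv_stems DIR (hypothetical helper, not part of the original do-file):
# print the name, minus the .csv extension, of every CSV file in DIR,
# all on one space-separated line.
# Assumes filenames contain no spaces or newlines.
list_csv_stems() {
    (
        cd "$1" || exit 1
        for f in *.csv; do
            # when nothing matches, the pattern stays literal; skip it
            [ -e "$f" ] || continue
            printf '%s ' "${f%.csv}"
        done
    )
}
```

Saved in a script file, something like this could stand in for the listing steps, with its output redirected to filelist_text.txt so that the “file read” lines work unchanged.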



6 Comments

  • […] handle extremely long rows, but I only ever used extremely long rows for file list globals and there’s a better way to do that. A nice feature is that the language support is in the app package (as compared to […]

  • 2. Shell vs “Shell” « Code and Culture  |  September 23, 2009 at 5:38 am

    […] file, then use “file” or “insheet” to get that into Stata. For instance, my “do it to everything in a directory” script is a lot simpler if I rewrite it to use “ashell” instead of […]

  • 3. Gabi Huiber  |  October 5, 2009 at 11:29 am

    Unless you need the list of file names saved to a text file such as filelist_text.txt for some future use, you don’t really need this shell detour. I ran into a similar problem last year. I asked the Statalist for advice, and now I do this:

    local mypath "D:/path to ${raw} (with forward slashes)/"
    global filelist: dir "`mypath'" files "*.txt"
    display `"${filelist}"'

    This redirects the output of “dir” (probably “ls” on your system) from the screen to your global $filelist.

    The file mask makes it possible to read a list of text files from a folder where they might be mixed with other kinds. You can also use it to do some crude regex jobs, e.g. file “*census*.txt”.

    Finally, notice the compound quotes after the display command. You need them because Stata will put quotes around your file names. If you’re sure that your file names do not include spaces, you can just do this

    global filelist: list clean global(filelist)
    display "${filelist}"

    • 4. gabrielrossman  |  October 5, 2009 at 3:10 pm

      thanks.
      fyi, the strategies you describe are also mentioned in the text and comments of the shell vs shell post.

      • 5. Gabi Huiber  |  November 17, 2009 at 12:33 am

        Thank you. Without the comments to your “shell vs. shell” post I wouldn’t have known about the respectcase option to the “dir” extended macro function. I wrote the original version of the solution I proposed in Stata 9; respectcase came with Stata 10.

  • 6. Scraping 101 « Code and Culture  |  January 24, 2011 at 3:09 pm

    […] Once you have all this data collected you’ll need to clean it, probably with regular expressions in perl but TextWrangler (on the mac) is about as good as an interactive GUI program can be for such a thing. Also note that this is going to produce a lot of files and so after you clean them you’re going to want a Stata script that can recognize a large batch of files and loop them. […]

