Posts tagged ‘loops’

Programming notes

| Gabriel |

I gave a lecture to my grad stats class today on Stata programming. Regular readers of the blog will have heard most of this before but I thought it would help to have it consolidated and let more recent readers catch up.

Here are my lecture notes. [Updated link 10/14/2010]

November 12, 2009 at 3:33 pm

Shufflevar

| Gabriel |

[Update: I’ve rewritten the command to be more flexible and posted it to SSC. To get it, type “ssc install shufflevar”. This post may still be of interest for understanding how to apply the command.]

Sometimes you face a situation where it’s really hard to see what the null is because the data structure is really complicated and there is all sorts of nonlinearity, etc. Analyses of non-sparse square network matrices can use the quadratic assignment procedure, but you can do something similar with other data structures, including bipartite networks.

A good null keeps everything constant but shows what associations we would expect were the association random. The simplest way to do this is to keep the actual variable vectors but randomly sort one of the vectors. So for instance, you could keep the actual income distribution and the actual values of people’s education, race, etc., but randomly assign actual incomes to people.
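Outside of Stata, the idea is easy to see in a shell sketch (this assumes GNU shuf; the file and column contents are invented for illustration): keep one column fixed and randomly permute the other, which preserves both marginal distributions while destroying the row-level association.

```shell
# toy data: person id and income, one pair per row
printf '1 100\n2 200\n3 300\n4 400\n' > people.txt
# keep the ids in place, randomly permute the income column
cut -d' ' -f1 people.txt > ids.txt
cut -d' ' -f2 people.txt | shuf > incomes_shuffled.txt
paste -d' ' ids.txt incomes_shuffled.txt > shuffled.txt
# both files contain exactly the same incomes -- only the id/income pairing changed
cut -d' ' -f2 shuffled.txt | sort -n
```

The same logic is what shufflevar implements inside Stata, keeping all other variables fixed while permuting one.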

Fernandez, Castilla, and Moore used what was basically this approach to build a null distribution of the effects of employment referrals. Since then Ezra Zuckerman has used it in several papers on Hollywood to measure the strength of repeat collaboration. I myself am using it in some of my current radio work to understand how much corporate clustering we’d expect to see in the diffusion of pop songs under the null hypothesis that radio corporations don’t actually practice central coordination.

I wrote a little program that takes as its argument the variable you want shuffled. It has a similar application to bsample, and like bsample it’s best used as part of a loop.

capture program drop shufflevar
program define shufflevar
  local shufflevar `1'
  *remember the original sort order
  tempvar oldsortorder
  gen `oldsortorder'=_n
  *put the data in random order
  tempvar newsortorder
  gen `newsortorder'=uniform()
  sort `newsortorder'
  *copy the variable shifted by one row, so each case gets another case's value
  capture drop `shufflevar'_shuffled
  gen `shufflevar'_shuffled=`shufflevar'[_n-1]
  replace `shufflevar'_shuffled=`shufflevar'[_N] in 1
  *restore the original sort order
  sort `oldsortorder'
  drop `newsortorder' `oldsortorder'
end

Here’s an example to show how much clustering of “y” you’d expect to see by “clusterid” if we keep the observed distributions of “y” and “clusterid” but break any association between them:

shell echo "run rho" > _results_shuffled.txt

forvalues run=1/1000 {
  disp "iteration # `run' of 1000"
  quietly shufflevar clusterid
  quietly xtreg y, re i(clusterid_shuffled)
  shell echo "`run' `e(rho)'" >> _results_shuffled.txt
}

insheet using _results_shuffled.txt, names clear delimiter(" ")
histogram rho
sum rho

(Note that “shell echo” only works on Mac/Unix; Windows users should try postfile instead.)
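The file-building pattern here is just shell redirection: “>” creates (or overwrites) the file with the header row, and “>>” appends one line per iteration. A minimal standalone sketch, with made-up rho values:

```shell
# create the results file with a header row
echo "run rho" > _results_demo.txt
# append one line per iteration (0.5 stands in for e(rho))
for run in 1 2 3; do
  echo "$run 0.5" >> _results_demo.txt
done
cat _results_demo.txt
```

The resulting space-delimited file is exactly what insheet with delimiter(" ") expects.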

October 26, 2009 at 5:09 am

Mansfield

| Gabriel |

My MDC technique is basically a multilevel version of a much older technique (so old it could have been used for marketing analysis at Sterling Cooper on “Mad Men”) created by Edwin Mansfield. This older technique first does a series of Bass analyses (Mansfield published the equation before Bass, but the older version is under-theorized, which is why we call it the “Bass” model today). It then treats the coefficients from the first stage as a dataset that is itself regressed. Although more recent work supersedes it in several ways, it’s still worth using for diagnostic purposes. However, it’s a pain in the ass to use, as it requires you to run a separate regression for each of your innovations and then aggregate them. As such, I wrote this code to automate it.

Even if for some bizarre reason you’re not particularly interested in diffusion models dating from the Kennedy administration, this code may be interesting for a few reasons:

  • It uses the “estout” package not for the (indispensable) usual purpose of making results meet publication style, but for the off-label purpose of creating a meta-analysis dataset.
  • It makes extensive (and extremely clumsy) use of shell-based regular expression commands to clean this output. (I am under no illusions that the “awk” code is remotely elegant).
  • It saves the cluster id variable in a local, then attaches it back using a loop.
capture program drop mansfield
program define mansfield
 *NOTE: dependencies, "vallist" and "estout"

 set more off

 local caseid `1'
 local genre  `1'

 sort `caseid'
 *drop cases with too few observations to fit the curve
 by `caseid': drop if _N<5

 vallist `caseid', quoted
 *save the list now -- r(list) gets wiped by the commands below
 local caselist `r(list)'

 shell touch emptyresults
 shell mv emptyresults `genre'results.txt
 foreach case in `caselist' {
  disp "`case'"
  quietly reg w_adds Nt Nt2 if `caseid'==`case'
  esttab using `genre'results.txt, plain append
 }

 shell awk '{ gsub(" +b/t", ""); print $0;}' `genre'results.txt > tmp ;  mv tmp `genre'results.txt
 shell awk '{ gsub(" +", "\t"); print $0;}' `genre'results.txt > tmp ;  mv tmp `genre'results.txt
 shell awk '{ gsub("\n\t.*", ""); print $0;}' `genre'results.txt > tmp ;  mv tmp `genre'results.txt
 shell awk '/.+/{print $0}' `genre'results.txt > tmp ;  mv tmp `genre'results.txt
 shell awk '{ gsub("^\t.+", ""); print $0;}' `genre'results.txt > tmp ;  mv tmp `genre'results.txt
 shell awk '{ gsub("^$", ""); print $0;}' `genre'results.txt > tmp ;  mv tmp `genre'results.txt
 shell awk '/.+/{print $0}' `genre'results.txt > tmp ;  mv tmp `genre'results.txt
 shell awk '{ gsub("Nt2\t", ""); print $0;}' `genre'results.txt > tmp ;  mv tmp `genre'results.txt
 shell awk '{ gsub("_cons\t", ""); print $0;}' `genre'results.txt > tmp ;  mv tmp `genre'results.txt
 shell awk '{ gsub("N\t", ""); print $0;}' `genre'results.txt > tmp ;  mv tmp `genre'results.txt
 shell awk '{ gsub("Nt\t", "NR\t"); print $0;}' `genre'results.txt > tmp ;  mv tmp `genre'results.txt
 shell awk -f mansfield.awk `genre'results.txt > tmp ;  mv tmp `genre'results.txt

 insheet using `genre'results.txt, clear
 drop v1 v6
 ren v4 A
 ren v2 B
 ren v3 C
 ren v5 n
 gen b    = -C
 gen nmax = (-B - ((B^2)-4*A*C)^0.5) / (2*C)
 gen a    = A / nmax

 *attach the cluster id back to the meta-analysis dataset
 gen `caseid'=.
 local n=1
 foreach case in `caselist' {
  replace `caseid'=`case' in `n'
  local n=`n'+1
 }
 save `genre'_mansfield.dta, replace
end

*note, text of mansfield.awk follows
*it should be in the same directory as the data and made executable with the
*command "chmod +x mansfield.awk" (though "awk -f" will run it regardless)
*BEGIN {
*    FS="\n"
*    RS="Nt\t"
*    ORS=""
*}
*
*{
*        x=1
*        while ( x<NF ) {
*                print $x "\t"
*                x++
*        }
*        print $NF "\n"
*}
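To see what the record-separator trick in mansfield.awk does, here is a toy run of the same program inline (this assumes an awk with multi-character RS support, such as GNU awk; the sample input is invented). Each record introduced by the “Nt” marker gets flattened onto one tab-separated line; the added NF guard just skips the empty record before the first marker.

```shell
# invented sample input: two records, each introduced by the "Nt\t" marker
printf 'Nt\t1\n2Nt\t3\n4' > demo.txt
# same logic as mansfield.awk, with an NF guard for the empty leading record
awk 'BEGIN { FS="\n"; RS="Nt\t"; ORS="" }
NF { x=1; while (x < NF) { print $x "\t"; x++ }; print $NF "\n" }' demo.txt
```

This turns the two multi-line records into the two rows "1<tab>2" and "3<tab>4", which is the shape insheet can read.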

March 24, 2009 at 9:31 am

Append to nothing

| Gabriel |

One of the things that often frustrates me is that if you’re doing a loop that involves an “append” command, you need a seed file to append to. For a long time I would write one set of code for the first instance and then a second set for the append loop. This is a pain in the ass because it makes the code redundant (and thus error-prone and hard to modify) and you have to remember to remove the first case from the loop list. For these reasons it’s much better to keep everything in the loop, but since you can’t append to nothing this requires creating a seed file.

To set the seed for an append command loop, you create a dataset with one row containing a missing observation (and then remember to delete this record later):

clear
set obs 1
gen x=.
foreach file in $list {
 append using `file'
}
drop in 1 /*alternately, drop if x==. */

Likewise, you often want to use the append option for Stata commands (e.g., “log” or “esttab”) that write ASCII files. The easiest way to do this is to use the “touch” command to create an empty file then the “mv” command to write over your existing file with this blank file. You now have a blank text file suitable for appending to. Note that “touch” and “mv” are Unix commands but I’m guessing you can a) trick Windows into speaking POSIX syntax with Cygwin or b) find equivalent native MSDOS syntax.

shell touch emptyresults
shell mv emptyresults results.txt
foreach case in $list {
 disp "`case'"
 quietly reg y x if serialno==`case'
 esttab using results.txt, plain append
}
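In pure shell, by the way, the touch-then-mv dance can be collapsed into one step: redirection with “>” (here via the no-op “:” command) creates or truncates the file, leaving a blank file safe to append to. A quick sketch with made-up case ids:

```shell
# create (or empty) the results file in one step
: > results_demo.txt
# append one line per case, as the Stata loop does via esttab
for case in 101 102 103; do
  echo "result for $case" >> results_demo.txt
done
wc -l < results_demo.txt
```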

March 19, 2009 at 4:04 am

Do it to everything in the directory

| Gabriel |

Lately I’ve been experimenting with Stata’s “shell” command, which gives you direct access to the command line. This is especially useful on POSIX syntax systems like Mac and Linux. I like it because it allows me to automate my do-files even further so I can finally eliminate those clumsy comment lines like “before running this do-file remember to use text editor to clean the data with the following regular expression …”

One issue that comes up a lot in my cleaning scripts is that they amount to “run this code on about a thousand text files,” so I have to tell Stata which text files. The way I used to do it was to write the list as a global directly into the do-file, listing all thousand filenames, but this was awkward because the global was usually the longest part of the do-file and you had to rewrite it every time you got new data. I realized recently that “shell” can solve both problems.

Partly inspired by a similar hack by the brilliant coders at UCLA ATS, my basic solution is to have a raw directory full of CSV-formatted text files ($raw), a clean directory where I put the Stata files ($clean), and a foreach loop or a program that imports each text file in $raw, recodes some of the variables, and saves it in $clean.

cd $parentpath
shell touch tmpfile
shell mv tmpfile filelist_text.txt
cd $raw
shell ls *.csv >"$parentpath/filelist_text.txt"
shell awk '{ gsub("\.csv", ""); print $0;}' "$parentpath/filelist_text.txt" > tmp
shell mv tmp "$parentpath/filelist_text.txt"
shell perl -pe 's/\n/ /g'  "$parentpath/filelist_text.txt" > tmp
shell mv tmp "$parentpath/filelist_text.txt"
file open myfile using "$parentpath/filelist_text.txt", read
file read myfile line
global filelist `line'
foreach file in $filelist  {
 insheet using "$raw/`file'.csv", clear
 gen filename="`file'"
 *drop some cases and variables
 *gen and recode some other variables
 save "$clean/`file'.dta", replace
}
cd "$clean"
clear
set obs 1
gen filename="seedfile"
foreach file in $filelist {
 append using "`file'"
}
drop if filename=="seedfile"
save _allthedata.dta, replace
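For reference, the listing-and-cleaning steps above boil down to a short pure-shell pipeline (GNU/BSD tools; the directory and file names here are invented stand-ins for $raw and the real data):

```shell
# stand-in for the $raw directory of csv files
mkdir -p raw
touch raw/a.csv raw/b.csv raw/c.csv
# list the csv files, strip the extension, and join into one space-separated line
ls raw | awk '{ gsub("\\.csv$", ""); print }' | tr '\n' ' ' > filelist_text.txt
cat filelist_text.txt
```

The one-line file is then ready for Stata’s “file read” to pull into a macro.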

March 16, 2009 at 4:34 pm
