Posts Tagged loops

Drop the entirely missing variables

| Gabriel |

One of the purely technical frustrations with GSS is that it’s hard to figure out if a particular question was in a particular wave. The SDA server is pretty good about kicking these from extracts, but it still leaves some in. The other day I was playing with the 2008 wave and got sick of saying “oooh, that looks interesting” only to find the variable was missing for my wave, which happened repeatedly. (For instance, I thought it would be fun to run Biblical literalism against the Israel feeling thermometer, but no dice, at least for the 2008 wave).

To get rid of these phantom variables, I wrote this little loop that drops variables with entirely missing data (or that are coded as strings, see below):

foreach varname of varlist * {
 quietly sum `varname'
 if `r(N)'==0 {
  drop `varname'
  disp "dropped `varname' for too much missing data"
 }
}

Unfortunately “sum” reports zero observations for string variables, so this loop will also drop strings. There’s a workaround, but it involves the “ds” command, which works in Stata 11 but has been deprecated and so may not work in future versions.

ds, not(type string)
foreach varname of varlist `r(varlist)' {
 quietly sum `varname'
 if `r(N)'==0 {
  drop `varname'
  disp "dropped `varname' for too much missing data"
 }
}
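
If relying on “ds” is a worry, one possible alternative (a sketch, not something from the original post) is to count non-missing values with “count” and the missing() function, which treats an empty string as missing. Under that assumption, string variables that actually contain text survive and only wholly empty variables are dropped:

```stata
foreach varname of varlist * {
 * missing() is true for numeric missing values and for empty strings
 quietly count if !missing(`varname')
 if r(N)==0 {
  drop `varname'
  disp "dropped `varname' for too much missing data"
 }
}
```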

7 comments December 2, 2009

Programming notes

| Gabriel |

I gave a lecture to my grad stats class today on Stata programming. Regular readers of the blog will have heard most of this before but I thought it would help to have it consolidated and let more recent readers catch up.

Here are my lecture notes.

2 comments November 12, 2009

Shufflevar

| Gabriel |

Sometimes you face a situation where it’s really hard to see what the null is because the data structure is really complicated and there is all sorts of nonlinearity, etc. Analyses of non-sparse square network matrices can use the quadratic assignment procedure, but you can do something similar with other data structures, including bipartite networks.

A good null keeps everything constant, but shows what associations we would expect were association random. The simplest way to do this is to keep the actual variable vectors but randomly sort one of the vectors. So for instance, you could keep the actual income distribution and the actual values of people’s education, race, etc., but randomly assign actual incomes to people.

Fernandez, Castilla, and Moore used what was basically this approach to build a null distribution of the effects of employment referrals. Since then Ezra Zuckerman has used it in several papers on Hollywood to measure the strength of repeat collaboration. I myself am using it in some of my current radio work to understand how much corporate clustering we’d expect to see in the diffusion of pop songs under the null hypothesis that radio corporations don’t actually practice central coordination.

I wrote a little program that takes as its argument the variable you want shuffled and creates a shuffled copy with the suffix “_shuffled”. It has a similar application to bsample, and like bsample it’s best used as part of a loop.

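capture program drop shufflevar
program define shufflevar
  * usage: shufflevar varname  (creates varname_shuffled)
  local shufflevar `1'
  * remember the original sort order so it can be restored
  tempvar oldsortorder
  gen long `oldsortorder'=_n
  * sort on a random number to put the rows in random order
  tempvar newsortorder
  gen double `newsortorder'=runiform()
  sort `newsortorder'
  capture drop `shufflevar'_shuffled
  * each row takes the value of the row before it in the random order
  gen `shufflevar'_shuffled=`shufflevar'[_n-1]
  * the first row has no predecessor, so it takes the last row's value
  replace `shufflevar'_shuffled=`shufflevar'[_N] in 1
  sort `oldsortorder'
  drop `newsortorder' `oldsortorder'
end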

Here’s an example to show how much clustering of “y” you’d expect to see by “clusterid” if we keep the observed distributions of “y” and “clusterid” but break any association between them:

shell echo "run rho" > _results_shuffled.txt

forvalues run=1/1000 {
  disp "iteration # `run' of 1000"
  quietly shufflevar clusterid
  quietly xtreg y, re i(clusterid_shuffled)
  shell echo "`run' `e(rho)'" >> _results_shuffled.txt
}

insheet using _results_shuffled.txt, names clear delimiter(" ")
histogram rho
sum rho

(Note that “shell echo” only works on Mac/Unix. Windows users should try postfile.)
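
For anyone who wants to try the postfile route, here is a minimal sketch of the same loop. It assumes shufflevar is already defined as above and that the data contain “y” and “clusterid”; the handle name “results” is just an illustrative choice.

```stata
tempname results
postfile `results' run rho using _results_shuffled, replace

forvalues run=1/1000 {
  disp "iteration # `run' of 1000"
  quietly shufflevar clusterid
  quietly xtreg y, re i(clusterid_shuffled)
  * write one row per iteration to the results dataset
  post `results' (`run') (e(rho))
}

postclose `results'

* this replaces the data in memory, so save the original data first if needed
use _results_shuffled, clear
histogram rho
sum rho
```

Because postfile writes a Stata dataset directly, there is no text file to parse afterwards and nothing depends on the operating system’s shell.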

2 comments October 26, 2009

Mansfield

| Gabriel |

My MDC technique is basically a multilevel version of a much older technique (as in so old it could have been used for marketing analysis at Sterling Cooper on “Mad Men”) created by Edwin Mansfield. This older technique first does a series of Bass analyses (Mansfield published the equation before Bass but the older version is under-theorized which is why we call it the “Bass” model today). It then treats the coefficients from the first stage as a dataset to itself be regressed. Although more recent work supersedes it in several ways, it’s still worth using for diagnostic purposes. However it’s a pain in the ass to use as it requires you to run a separate regression for each of your innovations and then aggregate them. As such, I wrote this code to automate it.

Even if for some bizarre reason you’re not particularly interested in diffusion models dating from the Kennedy administration, this code may be interesting for a few reasons:

  • It uses the “estout” package not for the (indispensable) usual purpose of making results meet publication style, but for the off-label purpose of creating a meta-analysis dataset.
  • It makes extensive (and extremely clumsy) use of shell-based regular expression commands to clean this output. (I am under no illusions that the “awk” code is remotely elegant).
  • It saves the cluster id variable in a local, then attaches it back using a loop.

capture program drop mansfield
program define mansfield
 *NOTE: dependency, "vallist" and "estout"

 set more off

 local caseid `1'
 local genre  `1'

 sort `caseid'
 by `caseid': drop if [_N]<5

 vallist `caseid', quoted
 *save the list of case ids in a local, since later commands may overwrite r()
 local caselist `"`r(list)'"'

 shell touch emptyresults
 shell mv emptyresults `genre'results.txt
 foreach case of local caselist {
  disp "`case'"
  quietly reg w_adds Nt Nt2 if `caseid'==`case'
  esttab using `genre'results.txt, plain append
 }

 shell awk '{ gsub(" +b/t", ""); print $0;}' `genre'results.txt > tmp ;  mv tmp `genre'results.txt
 shell awk '{ gsub(" +", "\t"); print $0;}' `genre'results.txt > tmp ;  mv tmp `genre'results.txt
 shell awk '{ gsub("\n\t.*", ""); print $0;}' `genre'results.txt > tmp ;  mv tmp `genre'results.txt
 shell awk '/.+/{print $0}' `genre'results.txt > tmp ;  mv tmp `genre'results.txt
 shell awk '{ gsub("^\t.+", ""); print $0;}' `genre'results.txt > tmp ;  mv tmp `genre'results.txt
 shell awk '{ gsub("^$", ""); print $0;}' `genre'results.txt > tmp ;  mv tmp `genre'results.txt
 shell awk '/.+/{print $0}' `genre'results.txt > tmp ;  mv tmp `genre'results.txt
 shell awk '{ gsub("Nt2\t", ""); print $0;}' `genre'results.txt > tmp ;  mv tmp `genre'results.txt
 shell awk '{ gsub("_cons\t", ""); print $0;}' `genre'results.txt > tmp ;  mv tmp `genre'results.txt
 shell awk '{ gsub("N\t", ""); print $0;}' `genre'results.txt > tmp ;  mv tmp `genre'results.txt
 shell awk '{ gsub("Nt\t", "NR\t"); print $0;}' `genre'results.txt > tmp ;  mv tmp `genre'results.txt
 shell #!/bin/sh
 shell awk -f mansfield.awk `genre'results.txt > tmp ;  mv tmp `genre'results.txt

 insheet using `genre'results.txt, clear
 drop v1 v6
 ren v4 A
 ren v2 B
 ren v3 C
 ren v5 n
 gen b    = -C
 gen nmax = (-B - ((B^2)-4*A*C)^0.5) / (2*C)
 gen a    = A / nmax

 *attach the case ids back, one row per regression, in the order they were run
 gen `caseid'=.
 global n=1
 foreach case of local caselist {
  replace `caseid'=`case' in $n/$n
  global n=$n+1
 }
 save `genre'_mansfield.dta, replace
end

*note, text of mansfield.awk follows
*it should be in the same directory as the data and made executable with the command
*"chmod +x mansfield.awk"
*BEGIN {
*    FS="\n"
*    RS="Nt\t"
*    ORS=""
*}
*
*{
*        x=1
*        while ( x<NF ) {
*                print $x "\t"
*                x++
*        }
*        print $NF "\n"
*}

2 comments March 24, 2009

Append to nothing

| Gabriel |

One of the things that often frustrates me is that if you’re doing a loop that involves an “append” command you need to set a seed for the file. For a long time, I would write one set of code for the first instance, then a second set for the append loop. This is a pain in the ass as it makes the code redundant (and thus error-prone and hard to modify) and you have to remember to remove the first case from the loop list. For these reasons it’s much better to keep everything in the loop, but since you can’t append to nothing, this requires creating a seed file.

To set the seed for an append command loop, you create a dataset with one row that has a missing observation, then remember to delete this record later.

clear
set obs 1
gen x=.
foreach file in $list {
 append using `file'
}
drop in 1 /*alternately, drop if x==. */

Likewise, you often want to use the append option for Stata commands (e.g., “log” or “esttab”) that write ASCII files. The easiest way to do this is to use the “touch” command to create an empty file then the “mv” command to write over your existing file with this blank file. You now have a blank text file suitable for appending to. Note that “touch” and “mv” are Unix commands but I’m guessing you can a) trick Windows into speaking POSIX syntax with Cygwin or b) find equivalent native MSDOS syntax.

shell touch emptyresults
shell mv emptyresults results.txt
foreach case in $list {
 disp "`case'"
 quietly reg y x if serialno==`case'
 esttab using results.txt, plain append
}
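
If you would rather not depend on the shell at all, one possible alternative is to create the blank file with Stata’s own “file” command. This is only a sketch under the same assumptions as the example above ($list, y, x, and serialno); the handle name “fh” is arbitrary.

```stata
* create (or overwrite with) an empty text file without shelling out
tempname fh
file open `fh' using results.txt, write text replace
file close `fh'

foreach case in $list {
 disp "`case'"
 quietly reg y x if serialno==`case'
 esttab using results.txt, plain append
}
```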

3 comments March 19, 2009

Do it to everything in the directory

| Gabriel |

Lately I’ve been experimenting with Stata’s “shell” command, which gives you direct access to the command line. This is especially useful on POSIX syntax systems like Mac and Linux. I like it because it allows me to automate my do-files even further so I can finally eliminate those clumsy comment lines like “before running this do-file remember to use text editor to clean the data with the following regular expression …”

One issue that comes up a lot in my cleaning scripts is that they amount to “run this code on about a thousand text files” so I have to tell Stata which text files. The way I used to do it was to write the list as a global directly into the do-file, listing all thousand filenames, but this was awkward because the global was usually the longest part of the do-file and you had to rewrite it every time you got new data. I realized recently that “shell” can solve both problems.

Partly inspired by a similar hack by the brilliant coders at UCLA ATS, my basic solution is to have a raw directory full of csv formatted text files ($raw), a clean directory where I put the Stata files ($clean), and a foreach loop or a program that imports each text file in $raw, recodes some of the variables, and saves it in $clean.

cd $parentpath
shell touch tmpfile
shell mv tmpfile filelist_text.txt
cd $raw
shell ls *.csv >"$parentpath/filelist_text.txt"
shell awk '{ gsub("\.csv", ""); print $0;}' "$parentpath/filelist_text.txt" > tmp
shell mv tmp "$parentpath/filelist_text.txt"
shell perl -pe 's/\n/ /g'  "$parentpath/filelist_text.txt" > tmp
shell mv tmp "$parentpath/filelist_text.txt"
file open myfile using "$parentpath/filelist_text.txt", read
file read myfile line
file close myfile
global filelist `line'
foreach file in $filelist  {
 insheet using "$raw/`file'.csv", clear
 gen filename="`file'"
 *drop some cases and variables
 *gen and recode some other variables
 save "$clean/`file'.dta", replace
}
cd "$clean"
clear
set obs 1
gen filename="seedfile"
foreach file in $filelist {
 append using "`file'"
}
drop if filename=="seedfile"
save _allthedata.dta, replace
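
As a possible shell-free variant of the file listing step, Stata’s “dir” extended macro function can build the list directly. This is a sketch rather than a drop-in replacement: it assumes $raw and $clean are defined as above and that your version of Stata supports this extended macro function. The local names “csvfiles”, “csv”, and “file” are illustrative.

```stata
* list every csv file in $raw without touching the shell
local csvfiles : dir "$raw" files "*.csv"
foreach csv of local csvfiles {
 * strip the extension to get the bare filename
 local file = subinstr("`csv'", ".csv", "", .)
 insheet using "$raw/`csv'", clear
 gen filename="`file'"
 save "$clean/`file'.dta", replace
}
```

Since the list lives in a local rather than a text file, there is nothing to clean up with awk or perl, though the shell version remains handy when the filenames need heavier regular-expression surgery.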

5 comments March 16, 2009

