Drop the entirely missing variables

December 2, 2009 at 5:00 am 9 comments

| Gabriel |

One of the purely technical frustrations with GSS is that it’s hard to figure out if a particular question was in a particular wave. The SDA server is pretty good about kicking these from extracts, but it still leaves some in. The other day I was playing with the 2008 wave and got sick of repeatedly saying “oooh, that looks interesting” only to find the variable was missing for my wave. (For instance, I thought it would be fun to run Biblical literalism against the Israel feeling thermometer, but no dice, at least for the 2008 wave.)

To get rid of these phantom variables, I wrote this little loop that drops variables with entirely missing data (or that are coded as strings, see below):

foreach varname of varlist * {
 quietly sum `varname'
 if `r(N)'==0 {
  drop `varname'
  disp "dropped `varname' for too much missing data"
 }
}

Unfortunately “sum” thinks that string variables have no observations so this will also drop strings. There’s a workaround, but it involves the “ds” command, which works in Stata 11 but has been deprecated and so may not work in future versions.

ds, not(type string)
foreach varname of varlist `r(varlist)' {
 quietly sum `varname'
 if `r(N)'==0 {
  drop `varname'
  disp "dropped `varname' for too much missing data"
 }
}
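
A quick check on Stata's bundled auto dataset illustrates why the guard against strings matters. This is only a minimal sketch: the variable allmiss is made up for the illustration, and the loop uses ds, has(type numeric), which selects the same variables as the not(type string) form above.

```stata
sysuse auto, clear
generate allmiss = .              // hypothetical all-missing numeric variable

* make is a string variable with no empty values,
* yet summarize reports zero observations for it
quietly summarize make
display r(N)

* restrict the loop to numeric variables so that strings survive
ds, has(type numeric)
foreach varname of varlist `r(varlist)' {
    quietly summarize `varname', meanonly
    if r(N) == 0 {
        drop `varname'
        display "dropped `varname' for too much missing data"
    }
}
```

After this runs, make is still in the dataset and only allmiss has been dropped. Note that an entirely empty string variable would also survive, since the loop never looks at strings.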


9 Comments

  • 1. Neal  |  December 2, 2009 at 6:51 am

    You could also use the “capture confirm” combo:

    sysuse auto, clear

    gen miss_string=""
    gen miss_numer=.

    foreach varname of varlist * {
        local drop 0
        capture confirm string variable `varname'
        if !_rc {
            quietly tab `varname'
        }
        else {
            sum `varname', meanonly
        }

        if `r(N)'==0 {
            drop `varname'
            disp "dropped `varname' for too much missing data"
        }
    }

  • 2. Neal  |  December 2, 2009 at 6:55 am

    You don’t need the “local drop 0” line. That was leftover from before I realized that “tab” returns a r(N) and was going to use levelsof to find the empty strings. tab is much quicker.

  • 3. Conrad Hackett  |  December 2, 2009 at 10:53 am

    To accomplish a similar goal, I used Nick Cox’s dropmiss command.

  • 4. mike3550  |  December 2, 2009 at 3:27 pm

    Here is a way that can handle both string and numeric variables and avoid the if/then logic:

    
    unab ALL: *
    
    foreach var in `ALL' {
         gen test = sum(mi(`var'))
         if test[_N]==_N drop `var'
         drop test
         }
    
    • 5. mike3550  |  December 2, 2009 at 3:37 pm

      I realized that it might be helpful to explain what this is doing.

      The command unab creates a local macro from the variable list given after the colon. Here that list is Stata’s wildcard character, *, which matches every variable, so unab ALL: * creates a local macro named `ALL' containing all variable names.

      Then, for each variable in the dataset, the loop generates a “helper” variable, test, that holds a running count (from the sum() function) of the observations that are missing. The missing() function (abbreviated “mi”) returns a dichotomous indicator of whether each observation is missing on that variable.

      Finally, you examine whether the last observation of the test variable, test[_N], equals the number of observations in the dataset, _N. If so, then the variable is dropped. Either way, the helper variable test is then cleaned up.
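
      One caveat with the loop above is that it fails if the dataset already contains a variable called test. A minimal sketch of the same running-count idea with a temporary variable name follows; emptystr and emptynum are hypothetical variables added to the auto dataset just so there is something to drop.

      ```stata
      sysuse auto, clear
      generate str1 emptystr = ""       // hypothetical empty string variable
      generate emptynum = .             // hypothetical empty numeric variable

      unab ALL : *
      foreach var of local ALL {
          * tempvar picks a name that cannot clash with existing variables
          tempvar running
          generate long `running' = sum(missing(`var'))
          * the last value of the running count is the total number missing
          if `running'[_N] == _N {
              drop `var'
          }
          drop `running'
      }
      ```

      Because missing() accepts both strings and numerics, both hypothetical variables are dropped while the original auto variables are kept.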

  • 6. Kieran  |  December 2, 2009 at 10:29 pm

    In R, if missing values are properly represented in your data (as NAs), then a quick and dirty way of finding the variables with all missing cases would go something like this (here the data frame is called ‘data’):

    ## Returns TRUE if the vector you feed it is 'empty', i.e. full of NAs, FALSE otherwise

    is.empty <- function(x){
    all(is.na(x))
    }

    ## Apply this function to all the columns of data. (Returns a logical vector with as many elements as there are columns in data.)

    empty.vars <- apply(data, 2, is.empty)

    ## Create a new data frame keeping only the columns that are not empty.

    new.data <- data[,!empty.vars]

    • 7. Kieran  |  December 2, 2009 at 10:32 pm

      I guess a more kosher way to write the is.empty() function would be

      is.empty <- function(x){
      out <- all(is.na(x))
      return(out)
      }

  • 8. Robert O'Grady  |  September 22, 2011 at 12:34 pm

    this is awesome, thanks guys! exactly what I need!

  • 9. Nick Cox  |  September 23, 2011 at 5:32 pm

    This old thread evidently continues to attract interest. What follows is only about doing it in Stata.

    As mentioned by Conrad Hackett -dropmiss- (Stata Journal, -search dropmiss- to find download location) is a dedicated command.

    If it didn’t exist, there are shorter solutions than suggested earlier, for example

    qui foreach v of var * {
        count if missing(`v')
        if r(N) == _N drop `v'
    }

    and

    qui foreach v of var * {
        sort `v'
        if missing(`v'[1]) & missing(`v'[_N]) drop `v'
    }

    -sort- sorts missings together, so if the first and the last are both missing, they all are.

    However, -missing()- also catches extended missing values .a to .z, so watch out if you want them.
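
    If the extended missing values carry information you want to keep, one possible variant compares against system missing only. This is a sketch rather than a tested recipe: refused and blank are hypothetical variables added to the auto dataset for illustration, and strings are counted as empty when they equal "".

    ```stata
    sysuse auto, clear
    generate refused = .a             // hypothetical: every value is extended missing .a
    generate blank = .                // hypothetical: every value is system missing

    foreach v of varlist * {
        capture confirm string variable `v'
        if _rc == 0 {
            quietly count if `v' == ""    // strings: count empty values
        }
        else {
            quietly count if `v' == .     // numerics: count system missing only
        }
        if r(N) == _N {
            drop `v'
        }
    }
    ```

    Here blank is dropped, but refused survives because .a is not equal to the system missing value.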

