Shufflevar update

January 24, 2011 at 7:30 pm

Gabriel

Thanks to Elizabeth Blankenspoor (Michigan) I corrected a bug with the “cluster” option in shufflevar. I’ve submitted the update to SSC but it’s also here:

*1.1 GHR January 24, 2011

*changelog
*1.1 -- fixed bug that let one case per "cluster" be misallocated (thanks to Elizabeth Blankenspoor)

capture program drop shufflevar
program define shufflevar
	version 10
	syntax varlist(min=1) [ , Joint DROPold cluster(varname)]
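	* record the current sort order so it can be restored at the end
	* (stored as long so row numbers stay exact in very large datasets)
	tempvar oldsortorder
	gen long `oldsortorder'=_n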
	if "`cluster'"!="" {
		local bystatement "by `cluster': "
	}
	else {
		local bystatement ""
	}
	if "`joint'"=="joint" {
		* joint: one random order shared by all variables, so they stay paired
		tempvar newsortorder
		gen `newsortorder'=uniform()
		sort `cluster' `newsortorder'
		foreach var in `varlist' {
			capture drop `var'_shuffled
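			quietly {
				* each case takes the value of the case before it in the random order
				`bystatement' gen `var'_shuffled=`var'[_n-1]
				* the first case (of each cluster, if specified) wraps around to the last value
				`bystatement' replace `var'_shuffled=`var'[_N] if _n==1
			}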
			if "`dropold'"=="dropold" {
				drop `var'
			}
		}
		sort `oldsortorder'
		drop `newsortorder' `oldsortorder'
	}
	else {
		foreach var in `varlist' {
			* not joint: draw a fresh random order for each variable
			tempvar newsortorder
			gen `newsortorder'=uniform()
			sort `cluster' `newsortorder'
			capture drop `var'_shuffled
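			quietly {
				* each case takes the value of the case before it in the random order
				`bystatement' gen `var'_shuffled=`var'[_n-1]
				* the first case (of each cluster, if specified) wraps around to the last value
				`bystatement' replace `var'_shuffled=`var'[_N] if _n==1
			}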
			drop `newsortorder'
			if "`dropold'"=="dropold" {
				drop `var'
			}
		}
		sort `oldsortorder'
		drop `oldsortorder'
	}
end
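
For anyone who wants to see the within-cluster step in isolation, here is a minimal sketch of what the program does for one variable. It assumes Stata's bundled auto dataset; the variable names u and price_shuffled are just illustrative, and runiform() is the newer counterpart of the uniform() call in the program.

```stata
* sketch of the shuffle-within-cluster step; not a replacement for shufflevar
sysuse auto, clear
set seed 12345
* random sort key, then order the cases randomly within each cluster
gen double u = runiform()
sort foreign u
* each case takes the value of the case just before it within its cluster
quietly by foreign: gen price_shuffled = price[_n-1]
* the first case in each cluster wraps around to that cluster's last value
quietly by foreign: replace price_shuffled = price[_N] if _n==1
drop u
* values only move within clusters, so the cluster means should be unchanged
tabstat price price_shuffled, by(foreign) statistics(mean)
```

The by prefix on the replace line is what keeps the first case of each cluster drawing from its own cluster.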


6 Comments

  • 1. Nick Cox  |  January 25, 2011 at 10:49 am

    The oresumption is that variable names are short enough to allow “_shuffled” to be added as suffix. Evidently no users have yet been bitten by this. An option -suffix()- say could be added to allow users to specify a different suffix.

    • 2. gabrielrossman  |  January 25, 2011 at 2:02 pm

      better yet, i’ll just give an option that lets you choose anything you want for the entire varname (not just the suffix) but which defaults to oldvar_shuffled.

      • 3. Nick Cox  |  January 25, 2011 at 4:15 pm

        I am not sure that would be better yet. Since your program could shuffle a varlist, you’d need to allow all the variables being reshuffled to be renamed, for consistency, not just one.

        Orthogonality of design might be better satisfied by just shuffling the data as they are, encouraging people to rename beforehand any way they want using an appropriate command and (perhaps) refusing to shuffle unless the dataset in memory has been saved.

        Mind you, I don’t always act in such a way in writing code.

      • 4. gabrielrossman  |  January 25, 2011 at 4:24 pm

        good point, the option of shuffling multiple variables would not work well with a “gen” option. i knew there was a reason i did it as a suffix!

        i guess another way to do it would be to give people a “replace” option and otherwise stick to a “_shuffled” suffix. people with ridiculously long varnames can just use the replace option (or rewrite the command). how’s that strike you?

  • 5. Nick Cox  |  January 25, 2011 at 10:50 am

    That’s “presumption”.

  • 6. Nick Cox  |  January 25, 2011 at 5:25 pm

    I’d just separate renaming and shuffling completely. If people want to rename a copy of their data before or after shuffling that’s up to them.

    I am partisan here as the author of renaming commands, but I’d just write all over the help file: Warning: this mangles your data.

    Also, as said your default could be no shuffling if the data are changed since last save (see c(changed)) — unless the user specifies a -force- option. That’s a StataCorp standard. -force- says “I know what I am doing and I am going ahead regardless”.

    The maxim that each program should do one thing well may apply here.
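
The two safeguards raised in this thread could be sketched as a small pre-flight check: a configurable suffix tested against Stata's 32-character limit on variable names, and a refusal to run on unsaved data unless -force- is given. The program name shuffle_precheck and its suffix() and force options are hypothetical; none of this is part of shufflevar 1.1.

```stata
* hypothetical pre-flight check; not part of shufflevar 1.1
capture program drop shuffle_precheck
program define shuffle_precheck
	version 10
	syntax varlist(min=1) [ , SUFfix(string) force ]
	* default to the suffix that shufflevar 1.1 hard-codes
	if "`suffix'"=="" {
		local suffix "_shuffled"
	}
	* c(changed) is 1 if the data in memory have changed since the last save
	if c(changed) & "`force'"=="" {
		display as error "data have changed since last saved; save first or specify force"
		exit 4
	}
	* Stata variable names may be at most 32 characters long
	foreach var of local varlist {
		if length("`var'`suffix'") > 32 {
			display as error "`var'`suffix' is longer than 32 characters"
			exit 198
		}
	}
end

* example call on freshly loaded (unchanged) data
sysuse auto, clear
shuffle_precheck price mpg, suffix(_shuffled)
```

Keeping the check separate from the shuffle itself is in the spirit of the one-thing-well maxim above; a real version would more likely live inside the command.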

