## Correlations and sparseness

| Gabriel |

I just published a paper in which the dependent variable was a binary variable with a frequency of about 1%. You can think of a dummy as basically taking a latent continuous distribution and turning it into a step function. When the dummy is sparse, the step occurs at the extreme right tail. Ideally we would have had the underlying distribution itself, but that’s life. Having not just a binary, but a sparse binary, makes it really hard to do things like fixed effects: when almost all of your cases are coded “0,” it’s really easy to run into perfect prediction.
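To make the step-function idea concrete, here is a minimal sketch. It assumes, purely for illustration, a standard normal latent variable and a cutoff at its 99th percentile. The variable names `latent` and `dummy` and the seed are made up for this example and have nothing to do with the actual paper.

```stata
clear
// arbitrary seed, only so the random draw is reproducible
set seed 12345
set obs 100000
// hypothetical latent continuous variable (standard normal by assumption)
gen latent = rnormal()
// the dummy is a step function of the latent variable:
// 0 below the 99th percentile of the standard normal, 1 above it
gen byte dummy = latent > invnormal(0.99)
// the share of 1s should be close to 1%
summarize dummy
// zero-order correlation between the dummy and its own latent variable
corr dummy latent
```

Even here, where the dummy is a deterministic function of the latent variable, the correlation should come out well short of 1 (something like .27 in expectation under these assumptions), because the dummy throws away everything except whether a case sits in the far right tail.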

Anyway, one of the ways that sparse binary variables are weird is that nothing really correlates with them. A peer reviewer noticed this in the corr-var-cov matrix and asked about it. That was a very reasonable question, since most people don’t have a lot of experience with sparse binary variables and under most circumstances low zero-order correlations with the dependent variable are a sign of trouble. I found that the easiest way to both get a grasp on the issue myself and to explain it to the reviewer was to just demonstrate it with a simple simulation.

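    // empty dataset with 100,000 observations
    clear
    set obs 100000
    // x: the common trait, held by the first 50,000 cases (50%)
    gen x=0
    replace x=1 in 1/50000
    // y: the sparse trait, held by the first 1,000 cases (1%), all of whom also have x
    gen y=0
    replace y=1 in 1/1000
    // chi2 test of association, then the correlation and the covariance
    tab y x, cell nofreq chi2
    corr y x
    corr y x, cov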

What this simulation is doing is creating a dataset with one common binary trait (x) and one sparse binary trait (y), where the sparse trait is effectively a subset of the common trait. In the coded illustration, the correlation is about .1, which is pretty low, and the covariance is even lower. On the other hand the Chi2 is through the roof, which you’d expect given that Chi2 defines the null by the marginals, and this dataset shows as much association as is possible given the marginals.

Here’s a real-world example. Since 1920 there have been hundreds of millions of native born Americans over the age of 35. Of these people, a little under half were men and 16 have been president, all of them men. For this population there would be a very high Chi2 but a very low correlation between being male and being president of the United States.
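For anyone who wants to check the simulation by hand, the arithmetic can be sketched as follows. The notation is mine, not Stata's: $p_x = .5$ and $p_y = .01$ are the shares of 1s on each variable, $p_{xy} = .01$ is the share of cases with both (every y is also an x), and $n = 100{,}000$. This uses population formulas, so Stata's output will differ very slightly because it divides by $n-1$.

$$
\begin{aligned}
\operatorname{cov}(x,y) &= p_{xy} - p_x p_y = .01 - (.5)(.01) = .005 \\
r &= \frac{\operatorname{cov}(x,y)}{\sqrt{p_x(1-p_x)\,p_y(1-p_y)}} = \frac{.005}{\sqrt{(.5)(.5)(.01)(.99)}} \approx \frac{.005}{.04975} \approx .1005 \\
\chi^2 &= n\,r^2 \approx 100{,}000 \times .0101 \approx 1010
\end{aligned}
$$

The last line relies on the fact that for a 2×2 table the Pearson Chi2 is just $n$ times the squared correlation. So the same weak-looking $r$ of about .1 becomes an enormous Chi2 once it's multiplied by a large $n$.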

All this is a good illustration of why, technically, you’re not supposed to run correlations with dummies. This is one of those rules that we violate all the time and usually it’s not a big problem. Not only is it usually not a problem, but it’s pretty convenient because there is no appealing alternative for showing zero-order associations for all combinations of a mix of continuous and dummy variables. However, when the dummy gets sparse you can run into trouble. Fortunately things like this are pretty easy to explain with a simulation that is similar to your data/model but where the true structure is known by assumption.