Stata 11 FV and margin

October 29, 2009 at 5:20 am 8 comments

| Gabriel |

Yesterday I attended the ATS workshop on the new factor variables and margins syntax in Stata 11. Despite the usual statistical usage of the word “factor,” this has nothing to do with eigenvectors and multi-dimensional scaling but is really about dummy sets and interactions. I might still be missing something, but it seems like the factor variables syntax is only an incremental improvement over the old “xi” syntax, mostly because it’s more elegant.

However, the “margins” command is really impressive and should go a long way toward making nonlinear models (including logit) more intelligible. I think a big reason people have p-fetishism is that with a lot of models it’s difficult to understand effect sizes. For this reason I like to close my results section with predicted values for various vignettes. I had been doing this in Excel or Numbers, but “margins” will make this much easier, especially if I continue to experiment with specifications. (In general, I find that if you’re doing something once, a GUI is faster than scripting, but we never do something just once, so scripting is better in the long run.) Anyway, it’s a very promising command.
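To give a sense of what predicted values for vignettes look like, here is a minimal sketch. It assumes the auto dataset that ships with Stata, and the model and the mpg and weight values are arbitrary choices for illustration, not anything from the workshop.

```stata
* load the example dataset that ships with Stata
sysuse auto, clear

* a simple logit: is the car foreign, as a function of mpg and weight
logit foreign mpg weight

* predicted probability of "foreign" for four hypothetical vignettes
* (every combination of the listed mpg and weight values)
margins, at(mpg=(20 30) weight=(2000 3000))
```

Because this runs after estimation, changing the specification and rerunning the do-file regenerates the vignette table without any copying into a spreadsheet.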

My only reservation about both “factor variables” and “margins” is the value labeling. First, like “xi,” neither command carries through value labels, so you have to remember what occupation 3 is instead of it saying “sales.” Second, the numbers aren’t even consistent between factor and margins. Factor shows the value of the underlying variable whereas margins numbers the categories sequentially. So for instance, your basic dummy would be “0” for no and “1” for yes in factor variables because that’s how it’s stored in memory, and “1” for no and “2” for yes in margins because “no” is the first category. What is this, SPSS? Anyway, margins is a very useful command, but it would be even more useful if the command itself or some kind of postestimation or wrapper ado file made the output more intuitive. Not that I’m volunteering to write it. Help us Ben Jann, you’re our only hope!
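Until something nicer comes along, one workaround is to look the codes up by hand. This is only a sketch, again assuming the auto dataset that ships with Stata, where the value label attached to “foreign” happens to be named “origin.” With your own data you would substitute whatever label name “describe” reports.

```stata
sysuse auto, clear

* the coefficient table refers to levels of foreign by number
regress price i.foreign mpg

* find the name of the value label attached to the variable...
describe foreign

* ...then list which number corresponds to which category
label list origin

* or tabulate the raw codes next to the labeled version
tabulate foreign, nolabel
tabulate foreign
```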



  • 1. Kieran  |  October 29, 2009 at 9:06 am

    Do factors in stata work like factors in R?

    • 2. gabrielrossman  |  October 29, 2009 at 11:57 am

      i can’t say as i don’t use R. in Stata the syntax is like this.

      i.x /*break categorical var “x” into dummy set */
      b2.x /*make a dummy set for x with x==2 as the base (omitted) category; 2.x alone is a dummy for when x==2 */
      x#y /*make a dummy set for interactions of x and y */
      x##y /*make interaction and main effects dummies */
      x#c.z /*make interaction between categorical var x and continuous var z*/

      note that the new syntax is both more flexible than “xi” and no longer requires a command prefix so you can just type “reg y i.x” instead of the old “xi: reg y i.x”
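      a minimal sketch of the difference, assuming the auto dataset that ships with Stata (the variables and models are just for illustration):

      ```stata
      sysuse auto, clear

      * old style: needs the xi prefix and adds _I* dummies to the dataset
      xi: regress price i.rep78

      * Stata 11 style: no prefix, no new variables created
      regress price i.rep78

      * same model with rep78==3 as the omitted category
      regress price ib3.rep78

      * main effects plus interaction of a categorical and a continuous var
      regress price i.foreign##c.mpg
      ```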

      the “margins” command syntax is like a cross between “predict” and “forvalues”

      the UCLA ATS demo i linked has a lot more details. also see the Stata 11 brochure entries for factor variables and margins

  • 3. Kieran  |  October 30, 2009 at 9:08 pm

    Looks like the concept is the same: in R, factors are a kind of categorical variable with ordered or unordered levels (e.g., Sex, Country, High/Medium/Low, whatever). Very handy, because you don’t have to think about translating categorical variables into numerical codes and back again. But you have to be careful that you know whether the factor is ordered, what the order of the levels is, and so on. Otherwise it’s really easy to screw up comparisons and interactions in models. Also you need to avoid accidentally coercing factor values back to numeric ones and generating superficially interpretable but actually meaningless results.

    • 4. gabrielrossman  |  October 31, 2009 at 2:23 pm

      >Also you need to avoid accidentally coercing factor values
      >back to numeric ones,

      yeah, i remember a friend once treated occupation as a continuous variable and got results. of course this isn’t surprising as most occupation schemes (US Census, EGP, etc) basically start with professions and management then other nonmanual, etc, so in effect the “nominal” variable is similar to a reverse-coded prestige score.

  • 5. Kieran  |  October 30, 2009 at 9:14 pm

    As always when I read this blog, I think, hey I should be posting more code on my blog. But I hardly ever do, because I never feel like I know enough R, despite having used it (or S) one way or another since 1996. It’s a bit like that article, “Learn Emacs in Ten Years“.

    • 6. gabrielrossman  |  October 31, 2009 at 12:58 pm

      be great if you did, your workflow essay did as much as anything to inspire me to write this. i wouldn’t let the lack of complete mastery stop you — i’m only so-so at stata programming, which is why a lot of my code uses clumsy hacks like piping to the OS shell, writing to and reading from disk, etc.

  • 7. Pierre Azoulay  |  November 2, 2009 at 2:15 pm

    Gabriel, I think your post overlooks one big advantage of factor variables: Stata does not store factor variables the same way as their “elemental components,” which really shrinks the size of the data to be kept in memory. Whether this will really make it possible to estimate models with *really lots* of fixed effects (say for firms and employees in employee/employer matched data) remains to be seen.

  • 8. Meredith  |  July 30, 2010 at 11:14 am

    In googling for a way to carry over value labels into regression output I found the user-written command ‘reformat’.
    It’s a little clunky, relies on xi (so no RAM efficiencies), but works for me for looking at results on the fly and remembering what marital_4 is.
