Archive for October, 2009
Stata 11 FV and margin
| Gabriel |
Yesterday I attended the ATS workshop on the new factor variables and margin syntax in Stata 11. Despite the usual statistical usage of the word “factor,” this has nothing to do with eigenvectors and multi-dimensional scaling but is really about dummy sets and interactions. I might still be missing something, but the factor variables syntax seems like only an incremental improvement over the old “xi” syntax, mostly a matter of elegance.
However, the margin command is really impressive and should go a long way toward making nonlinear models (including logit) more intelligible. I think a big reason people have p-fetishism is that with a lot of models it’s difficult to understand effect sizes. For this reason I like to close my results section with predicted values for various vignettes. I had been doing this in Excel or Numbers, but “margin” will make this much easier, especially if I continue to experiment with specifications. (In general, I find that if you’re doing something once the GUI is faster than scripting, but we never just do something once, so scripting is better in the long run.) Anyway, it’s a very promising command.
My only reservation about both “factor variables” and “margin” is the value labeling. First, (like “xi”) neither command carries through value labels, so you have to remember what occupation 3 is instead of it saying “sales.” Second, the numbers aren’t even consistent between factor variables and margin. Factor variables shows the value of the underlying variable, whereas margin numbers the categories sequentially. So for instance, your basic dummy would be “0” for no and “1” for yes in factor variables, because that’s how it’s stored in memory, and “1” for no and “2” for yes in margin, because “no” is the first category. What is this, SPSS? Anyway, margin is a very useful command, but it would be even more useful if the command itself or some kind of postestimation or wrapper ado file made the output more intuitive. Not that I’m volunteering to write it. Help us Ben Jann, you’re our only hope!
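If you haven’t played with the new syntax yet, here’s a minimal sketch of the kind of thing I mean, using one of the bundled example datasets rather than anything from my own work (note that the command is spelled “margins” in the shipping documentation):
sysuse nlsw88, clear
*factor variable notation: i. marks a categorical variable, c. a continuous one,
*and ## asks for main effects plus the interaction
logit union i.race##c.grade
*predicted probability of union membership at each level of race,
*averaging over the sample values of grade
margins race
*average marginal effect of an additional year of schooling
margins, dydx(grade)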
Why Jay Leno is like classical music
| Gabriel |
As the logical conclusion of a trend that began with reality tv, NBC has concluded that it’s just too expensive to make scripted television. And so they filled their 10pm slot every weeknight with a dirt-cheap variety show. Not surprisingly, the show has much lower ratings and ad revenues than the traditional one-hour scripted dramas that filled the slot until this season. On the one hand it’s embarrassing for the network that ruled tv in the 1990s to embrace a low-cost, low-revenue model or, as some industry people call it, the “winning by losing” model. On the other hand, it’s a much more profitable strategy because low as the revenues are, the costs are even lower.
People have mostly been focusing on the short-term revenues, and in that sense NBC has indeed made the smart (but shameful) call. The long-run picture is more uncertain, even if you put aside Podolny-esque status issues and affiliate defections. Scripted television has always lost money on the first run and only turned a profit over the long run. For decades this was mostly an issue of syndication (re-run) rights, but for the last ten years it’s been dvd box sets. In contrast, non-scripted tv produces basically no long-term revenues — almost nobody watches ESPN Classic or buys dvds of game shows or “the Tonight Show”. There have been some troubling signs lately for the long-run revenue streams. Gilmore Girls is one of the best shows ever on television and you can buy the dvds for $20/season, a huge drop from the $50 or $60 a season the studios were asking for tv shows just a few years ago. Likewise, it’s not clear that the studios will be able to effectively monetize streaming video, despite Rupert Murdoch’s attempt to get media companies to charge for content.
Thus you can read the Leno-ization of NBC as not just the idea that drama production budgets have gotten out of control, but also a bet that streaming video won’t produce long-term revenue streams at all comparable to those produced by syndication or dvd. Note that SAG and WGA seem to be making the opposite bet, as for the past few years Hollywood labor has been doing extremely painful strikes and soft strikes primarily over residuals on streaming. My hunch is that NBC is right and the unions are wrong on this, but it’s an empirical question.
The other interesting thing is that NBC is doing the exact opposite of what public radio did in the 80s and 90s. Originally, public radio mostly consisted of classical music and jazz djs, which was dirt-cheap content to produce but brought in little revenue. Then CPB and NPR bought some Arbitron reports and noticed that “All Things Considered” brought in the lion’s share of listeners. They checked pledge drive data and found that it also brought in most listener contributions. On this basis NPR added “Morning Edition,” and a little later “Weekend Edition,” and it’s gotten to the point that pretty much all public radio stations play news and talk either in their best time slots (e.g., KCRW Santa Monica) or pretty much 24/7 (e.g., KPCC in Pasadena, WHYY in Philadelphia). Of course news is much more expensive to produce than just hiring a dj to spin Bach, but it also brings in more numerous, young, and affluent listeners. So we’ve seen public radio shift from low revenue & low cost to high revenue & high cost, pretty much exactly the opposite of what NBC is doing this season.
Copy mac files when booting from dvd
| Gabriel |
One of the frustrating things about the Mac is that there’s no such thing as a live cd (and live cds for Windows and Linux can’t read HFS disks). Of course you can boot from the installer dvd, but it doesn’t have the Finder. If you have problems booting from your internal disk and you don’t have a reasonably current backup this can induce alternating waves of panic and despair. (I’m speaking from experience. I’ve screwed up my partition table by playing with gparted. Actually, I’ve done this twice — as a dog returns to his vomit so a fool returns to his folly).
However, you can still copy files because the installer dvd does have the Terminal, and the Terminal can invoke the command “cp”. Here’s how to do it.
- Put the dvd in and restart, holding down the Option key so it lets you choose the dvd.
- Choose a language, then instead of installing the OS, go to the Utilities menu and choose Terminal
- Plug in a USB drive and type “ls /Volumes”. Figure out which one is your USB drive and which one is your internal drive, and write the names down. If the USB drive doesn’t show up, you’ll need to mount it manually.
- Use “cd” to navigate to your internal disk and find your most important files, which are probably in “/Volumes/Macintosh HD/Users/yournamehere/Documents”
- Use the “cp source target” command to copy files from the internal disk to the USB disk. To copy a directory use the -R option. For example, to copy the directory “bookmanuscript” you’d use something like
cp -R '/Volumes/Macintosh HD/Users/yournamehere/Documents/bookmanuscript' /Volumes/USBdisk
Shufflevar
| Gabriel |
[Update: I’ve rewritten the command to be more flexible and posted it to SSC. To get it, type “ssc install shufflevar”. This post may still be of interest for understanding how to apply the command.]
Sometimes you face a situation where it’s really hard to see what the null is because the data structure is really complicated and there is all sorts of nonlinearity, etc. Analyses of non-sparse square network matrices can use the quadratic assignment procedure, but you can do something similar with other data structures, including bipartite networks.
A good null keeps everything constant but shows what associations we would expect if the association were random. The simplest way to do this is to keep the actual variable vectors but randomly sort one of the vectors. So for instance, you could keep the actual income distribution and the actual values of people’s education, race, etc., but randomly assign actual incomes to people.
Fernandez, Castilla, and Moore used what was basically this approach to build a null distribution of the effects of employment referrals. Since then Ezra Zuckerman has used it in several papers on Hollywood to measure the strength of repeat collaboration. I myself am using it in some of my current radio work to understand how much corporate clustering we’d expect to see in the diffusion of pop songs under the null hypothesis that radio corporations don’t actually practice central coordination.
I wrote a little program that takes as its argument the variable you want shuffled. It has a similar application to bsample, and like bsample it’s best used as part of a loop.
capture program drop shufflevar
program define shufflevar
	local shufflevar `1'
	*remember the current sort order so it can be restored at the end
	tempvar oldsortorder
	gen `oldsortorder'=[_n]
	*put the data in a random order
	tempvar newsortorder
	gen `newsortorder'=uniform()
	sort `newsortorder'
	*create the shuffled copy of the variable by offsetting it one row
	capture drop `shufflevar'_shuffled
	gen `shufflevar'_shuffled=`shufflevar'[_n-1]
	replace `shufflevar'_shuffled=`shufflevar'[_N] in 1/1
	*restore the original sort order and clean up
	sort `oldsortorder'
	drop `newsortorder' `oldsortorder'
end
Here’s an example to show how much clustering of “y” you’d expect to see by “clusterid” if we keep the observed distributions of “y” and “clusterid” but break any association between them:
shell echo "run rho" > _results_shuffled.txt
forvalues run=1/1000 {
	disp "iteration # `run' of 1000"
	quietly shufflevar clusterid
	quietly xtreg y, re i(clusterid_shuffled)
	shell echo "`run' `e(rho)'" >> _results_shuffled.txt
}
insheet using _results_shuffled.txt, names clear delimiter(" ")
histogram rho
sum rho
(Note that “shell echo” only works with Mac/Unix, Windows users should try postfile).
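For what it’s worth, here’s a rough sketch of what that postfile approach might look like, using the same placeholder variable names as above (untested, so treat it as a starting point rather than a drop-in replacement):
tempname memhold
tempfile results
postfile `memhold' run rho using `results'
forvalues run=1/1000 {
	quietly shufflevar clusterid
	quietly xtreg y, re i(clusterid_shuffled)
	*write this iteration's rho to the postfile
	post `memhold' (`run') (e(rho))
}
postclose `memhold'
use `results', clear
histogram rho
sum rho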
La vie en mort
| Gabriel |
Denis Colombi has contributed a few entries to the thriving sociology of zombies literature. His abstracts (including a description of zombie habitus) are all funny, but for my money the most sublime satire is his rational choice marginal analysis of zombie equilibrium:
Il faut comprendre les zombis en restituant les “bonnes raisons” de devenir zombis, afin de le faire apparaître comme un comportement rationnel. Ainsi, le choix de devenir ou non zombi dépend avant tout d’un calcul en fonction du rendement espéré de cette transformation. L’agrégation de ces comportements se traduit par un effet émergents, à savoir la réduction du nombre d’humains non-zombifiés ce qui réduit les gains de sa propre zombification. On peut ainsi parler d’une inflation zombifique, comme pour les diplômes.
It’s been a long time since lycée, but here’s my loose translation:
We can understand zombies by restoring the “good reasons” to become zombies and thus make it apparent that it is a rational behavior. Thus the choice to become a zombie or not depends primarily on a calculation based on the expected value of this transformation. The aggregation of these behaviors results in an emergent phenomenon, that is, increasing the number of zombies reduces the marginal value of zombification. We can thus speak of zombification inflation, as with credential inflation for diplomas.
Time Machine and rsync
| Gabriel |
I think Time Machine is one of the best features of Leopard / Snow Leopard, but I still have a few issues with it.
First, I’m really not interested in having a Spotlight index of my Time Machine drive, so I go to System Preferences / Spotlight / Privacy and add my Time Machine volume to the “do not index” list. This isn’t so much a privacy issue as a performance issue: the Spotlight indexer (“mdworker”) is a real hog, so why have it index stuff you don’t plan to search?
Second, Time Machine doesn’t work well with more than one backup volume, especially if you want to update one of the backups infrequently or back up different directories to each drive. In my case I have a large drive that I keep at work and a small backup drive that I keep at home in case my office burns down and destroys both my mac and the big backup drive. To use Time Machine for both disks, I would not only need to “select disk” but also “exclude items,” because the disk I keep at home isn’t big enough to hold everything. Furthermore, if I skip a few weeks of backing up to the home disk, Time Machine refuses to do an incremental backup.
My solution to this is to use Time Machine for the main backup drive and rsync for the second one. Every day I use Time Machine with my big backup drive at the office. Once a week or so at home I take my redundant backup drive (“seagate”) out of the drawer, plug it in, and run this shell script.
#!/bin/bash
#backup_seagate.sh
rsync -aE --delete ~/Documents/ /Volumes/seagate/rossman/Documents
rsync -aE --delete ~/Library/ /Volumes/seagate/rossman/Library
rsync -aE --delete ~/scripts/ /Volumes/seagate/rossman/scripts
rsync -aE --delete ~/Pictures/ /Volumes/seagate/rossman/Pictures
rsync -aE --delete ~/Music/ /Volumes/seagate/rossman/Music
rsync -aE --delete ~/Applications/ /Volumes/seagate/rossman/Applications
Note that the version of rsync that ships with OS 10.5 or 10.6 is pretty old. If you install the current version, it will handle the resource fork more efficiently. There are instructions here but for my purposes it’s not worth the hassle.
[Update1: USB flash drives work well as your off-site backup because they are easier to transport than hard drives, being smaller and lacking moving parts. However you’ll need to use Disk Utility to change the file system from FAT to HFS+].
[Update2: Be careful with rsync, as the syntax is important. It needs to be “command options source target”; if you reverse source and target you’re pretty much screwed.]
Towards a sociology of living death
| Gabriel |
Daniel Drezner had a post a few months ago talking about how international relations scholars of the four major schools would react to a zombie epidemic. Aside from the sheer fun of talking about something as silly as zombies, it has much the same illuminating satiric purpose as “how many X does it take to screw in a lightbulb” jokes. If you have even a cursory familiarity with IR it is well worth reading.
Here’s my humble attempt to do the same for several schools within sociology. Note that I’m not even going to get into the Foucauldian “who’s to say that life is ‘normal’ and living death is ‘deviant'” stuff because, really, it would be too easy. Also, I wrote this post last week and originally planned to save it for Halloween, but I figured I’d move it up given that Zombieland is doing so well with critics and at the box office.
Public Opinion. Consider the statement that “Zombies are a growing problem in society.” Would you:
- Strongly disagree
- Somewhat disagree
- Neither agree nor disagree
- Somewhat agree
- Strongly agree
- Um, how do I know you’re really with NORC and not just here to eat my brain?
Criminology. In some areas (e.g., Pittsburgh, Raccoon City), zombification is now more common than attending college or serving in the military and must be understood as a modal life course event. Furthermore, as seen in audit studies, employers are unwilling to hire zombies, and so the mark of zombification has persistent and reverberating effects throughout undeath (at least until complete decomposition and putrefaction). However, race trumps humanity, as most employers prefer to hire a white zombie over a black human.
Cultural toolkit. Being mindless, zombies have no cultural toolkit. Rather the great interest is understanding how the cultural toolkits of the living develop and are invoked during unsettled times of uncertainty, such as an onslaught of walking corpses. The human being besieged by zombies is not constrained by culture, but draws upon it. Actors can draw upon such culturally-informed tools as boarding up the windows of a farmhouse, shotgunning the undead, or simply falling into panicked blubbering.
Categorization. There’s a kind of categorical legitimacy problem to zombies. Initially zombies were supernaturally animated dead; they were sluggish but relentless, and they sought to eat human brains. In contrast, more recent zombies tend to be infected with a virus that leaves them still living in a biological sense but alters their behavior so as to be savage, oblivious to pain, and nimble. Furthermore, even supernatural zombies are not a homogeneous set but encompass varying degrees of decomposition. Thus the first issue with zombies is defining what a zombie is and whether it is commensurable with similar categories (like an inferius in Harry Potter). This categorical uncertainty has effects in that insurance underwriters systematically undervalue life insurance policies against monsters that are ambiguous to categorize (zombies) as compared to those that fall into a clearly delineated category (vampires).
Neo-institutionalism. Saving humanity from the hordes of the undead is a broad goal that is easily decoupled from the means used to achieve it. Especially given that human survivors need legitimacy in order to command access to scarce resources (e.g., shotgun shells, gasoline), it is more important to use strategies that are perceived as legitimate by trading partners (i.e., other terrified humans you’re trying to recruit into your improvised human survival cooperative) than to develop technically efficient means of dispatching the living dead. Although early on strategies for dealing with the undead (panic, “hole up here until help arrives,” “we have to get out of the city,” developing a vaccine, etc) are practiced where they are most technically efficient, once a strategy achieves legitimacy it spreads via isomorphism to technically inappropriate contexts.
Population ecology. Improvised human survival cooperatives (IHSC) demonstrate the liability of newness in that many are overwhelmed and devoured immediately after formation. Furthermore, IHSC demonstrate the essentially fixed nature of organizations as those IHSC that attempt to change core strategy (eg, from “let’s hole up here until help arrives” to “we have to get out of the city”) show a greatly increased hazard for being overwhelmed and devoured.
Diffusion. Viral zombieism (e.g. Resident Evil, 28 Days Later) tends to start with a single patient zero whereas supernatural zombieism (e.g. Night of the Living Dead, the “Thriller” video) tends to start with all recently deceased bodies rising from the grave. By seeing whether the diffusion curve for zombieism more closely approximates a Bass mixed-influence model or a classic s-curve we can estimate whether zombieism is supernatural or viral, and therefore whether policy-makers should direct grants towards biomedical labs to develop a zombie vaccine or the Catholic Church to give priests a crash course in the neglected art of exorcism. Furthermore marketers can plug plausible assumptions into the Bass model so as to make projections of the size of the zombie market over time, and thus how quickly to start manufacturing such products as brain-flavored Doritos.
Social movements. The dominant debate is the extent to which anti-zombie mobilization represents changes in the political opportunity structure brought on by complete societal collapse as compared to an essentially expressive act related to cultural dislocation and contested space. Supporting the latter interpretation is that zombie hunting militias are especially likely to form in counties that have seen recent increases in immigration. (The finding holds even when controlling for such variables as gun registrations, log distance to the nearest army administered “safe zone,” etc.).
Family. Zombieism doesn’t just affect individuals, but families. Having a zombie in the family involves an average of 25 hours of care work per week, including such tasks as going to the butcher to buy pig brains, repairing the boarding that keeps the zombie securely in the basement and away from the rest of the family, and washing a variety of stains out of the zombie’s tattered clothing. Almost all of this care work is performed by women and very little of it is done by paid care workers as no care worker in her right mind is willing to be in a house with a zombie.
Applied micro-economics. We combine two unique datasets, the first being military satellite imagery of zombie mobs and the second records salvaged from the wreckage of Exxon/Mobil headquarters showing which gas stations were due to be refueled just before the start of the zombie epidemic. Since humans can use salvaged gasoline either to set the undead on fire or to power vehicles, chainsaws, etc., we have a source of plausibly exogenous heterogeneity in showing which neighborhoods were more or less hospitable environments for zombies. We show that zombies tended to shuffle towards neighborhoods with low stocks of gasoline. Hence, we find that zombies respond to incentives (just like school teachers, and sumo wrestlers, and crack dealers, and realtors, and hookers, …).
Grounded theory. One cannot fully appreciate zombies by imposing a pre-existing theoretical framework on zombies. Only participant observation can allow one to provide a thick description of the mindless zombie perspective. Unfortunately scientistic institutions tend to be unsupportive of this kind of research. Major research funders reject as “too vague and insufficiently theory-driven” proposals that describe the intention to see what findings emerge from roaming about feasting on the living. Likewise IRB panels raise issues about whether a zombie can give informed consent and whether it is ethical to kill the living and eat their brains.
Ethnomethodology. Zombieism is not so much a state of being as a set of practices and cultural scripts. It is not that one is a zombie but that one does being a zombie such that zombieism is created and enacted through interaction. Even if one is “objectively” a mindless animated corpse, one cannot really be said to be fulfilling one’s cultural role as a zombie unless one shuffles across the landscape in search of brains.
Conversation Analysis.
1  HUMAN:   Hello, (0.5) Uh, I uh, (Ya know) is anyone in there?
2  ZOMBIE1: Br:ai[ns], =
3  ZOMBIE2:      [Br]:ain[s]
4  ZOMBIE1: =    [B]r:ains
5  HUMAN:   Uh, I uh= li:ke, Hello? =
6  ZOMBIE1: Br:ai:ns!
7           (0.5)
8  HUMAN:   Die >motherfuckers!<
9  SHOTGUN: Bang! (0.1) =
10 ZOMBIE1: Aa:ar:gg[gh!]
11 SHOTGUN: =[Chk]-Chk, (0.1) Bang!
Probability distributions
| Gabriel |
I wrote this little demo for my stats class to show how normal distributions result from complex processes that sum the constituent parts, whereas count distributions result from complex processes where a single constituent failure is catastrophic.
*this do-file is a simple demo of how statistical distributions are built up from additive vs sudden-death causation
*this is entirely based on a simulated coin toss -- the function "round(uniform())"
*one either counts how many heads out of 10 tosses or how long a streak of heads lasts
*I'm building up from this simple function for pedagogical purposes, in actual programming there are much more direct functions like rnormal()

*1. The normal distribution
*Failure is an additive setback
clear
set obs 1000
forvalues var=1/10 {
	quietly gen x`var'=.
}
forvalues row=1/1000 {
	forvalues var=1/10 {
		quietly replace x`var'=round(uniform()) in `row'
	}
}
gen sumheads=x1+x2+x3+x4+x5+x6+x7+x8+x9+x10
order sumheads
lab var sumheads "How Many Heads Out of 10 Flips"
*show five examples
list in 1/5
histogram sumheads, discrete normal
graph export sumheads.png, replace

*2. Count distribution
*Failure is catastrophic
clear
set obs 1000
forvalues var=1/30 {
	quietly gen x`var'=.
}
gen streak=0
lab var streak "consecutive heads before first tails"
gen fail=0
forvalues row=1/1000 {
	forvalues var=1/30 {
		quietly replace x`var'=round(uniform()) in `row'
		quietly replace fail=1 if x`var'==0
		quietly replace streak=`var' if fail==0
	}
	quietly replace fail=. in `row'/`row'
}
quietly replace streak=0 if x1==0
*show five partial examples
list streak x1 x2 x3 x4 x5 in 1/5
histogram streak, discrete
graph export streakheads.png, replace
*have a nice day
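As the comments in the do-file say, there are more direct ways to get these distributions. Purely as an aside (this isn’t part of the class demo), something like the following should produce essentially the same two histograms in a couple of lines, assuming a Stata recent enough to have rbinomial() and runiform(); since I don’t know of a canned geometric generator, the streak variable uses the standard inverse-CDF trick instead:
clear
set obs 1000
*binomial: number of heads out of 10 fair coin flips
gen sumheads=rbinomial(10,.5)
histogram sumheads, discrete normal
*geometric: number of heads before the first tails, drawn via the inverse CDF
gen streak=floor(ln(runiform())/ln(.5))
histogram streak, discrete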
The winner is …
| Gabriel |
WASHINGTON — The American Sociological Association announced today that it is giving the distinguished book award to the prospectus for Climbing the Chart by Gabriel Rossman. In a statement, the ASA prize committee said they were awarding the prize for the prospectus’s “extraordinary efforts to synthesize sociology of culture, economic sociology, and social networks.”
Appearing in his front yard, Professor Rossman said he was “surprised and deeply humbled” by the committee’s decision, mostly because he hasn’t finished writing the book yet. Previous ASA book awards have gone to such completed manuscripts as Charles Tilly’s Durable Inequality. However Professor Rossman quickly put to rest any speculation that he might not accept the honor. Describing the award as an “affirmation of the production of culture paradigm’s leadership on behalf of aspirations to scientific rigor held by scholars in all sociological subfields,” he said he would accept it as “a call to action.”
“To be honest,” Professor Rossman said “I do not feel that I deserve to be in the company of so many of the transformative figures who have been honored by this prize, men and women who’ve inspired me and inspired the entire world through their actually finishing writing their books.”
[Update: I see Mankiw made what is essentially the same joke]
Correlations and sparseness
| Gabriel |
I just published a paper in which the dependent variable was a binary variable with a frequency of about 1%. You can think of a dummy as basically taking a latent continuous distribution and turning it into a step function. When the dummy is sparse, the step occurs at the extreme right tail. Ideally we would have had the underlying distribution itself, but that’s life. Having not just a binary but a sparse binary means that it’s really hard to do things like fixed effects, since when almost all of your cases are coded “0” it’s really easy to run into perfect prediction.
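Just to illustrate the perfect prediction point (this is simulated data, not anything from the paper), a conditional fixed-effects logit on a 1% outcome throws away every group that happens to draw no “1”s at all:
clear
set obs 100000
gen groupid=ceil(_n/100)
*sparse binary outcome, about 1% ones, plus a continuous covariate
gen y=(uniform()<.01)
gen x=rnormal()
*conditional fixed-effects logit: note how many of the 1,000 groups
*get dropped for having all-zero outcomes
clogit y x, group(groupid)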
Anyway, one of the ways that sparse binary variables are weird is that nothing really correlates with them. A peer reviewer noticed this in the correlation/covariance matrix and asked about it, and it was a very reasonable question since most people don’t have a lot of experience with sparse binary variables and under most circumstances low zero-order correlations with the dependent variable are a sign of trouble. I found that the easiest way both to get a grasp on the issue myself and to explain it to the reviewer was to just demonstrate it with a simple simulation.
clear
set obs 100000
gen x=0
replace x=1 in 1/50000
gen y=0
replace y=1 in 1/1000
tab y x, cell nofreq chi2
corr y x
corr y x, cov
What this simulation is doing is creating a dataset with one common binary trait (x) and one sparse binary trait (y), where the sparse trait is effectively a subset of the common trait. In the coded illustration, the correlation is about .1, which is pretty low, and the covariance is even lower. On the other hand the Chi2 is through the roof, which you’d expect given that Chi2 defines the null by the marginals, and this dataset shows as much association as is possible given the marginals. Here’s a real world example. Since 1920 there have been hundreds of millions of native born Americans over the age of 35. Of these people, a little under half were men and 16 have been president, all of them men. For this population there would be a very high Chi2 but a very low correlation between being male and being president of the United States.
All this is a good illustration of why, technically, you’re not supposed to run correlations with dummies. This is one of those rules that we violate all the time and usually it’s not a big problem. Not only is it usually not a problem, but it’s pretty convenient, because there is no appealing alternative for showing zero-order associations for all combinations of a mix of continuous and dummy variables. However, when the dummy gets sparse you can run into trouble. Fortunately things like this are pretty easy to explain with a simulation that is similar to your data/model but where the true structure is known by assumption.