Archive for August, 2010
Beyond the Finder
| Gabriel |
In a recent post I mentioned that one of the (few) crappy things about OS X not related to network externalities is that there’s no dual-pane file manager. When I first got a Mac this really bothered me, but as I mentioned, I’ve mostly gotten used to just keeping tons of Finder windows open.
Anyway, while a dual-pane file manager doesn’t come standard, there are a few third-party options. Two options I think are especially worth checking out are the minimalist TotalFinder (free while in development, $15 when version 1.0 comes out) and the mega-featured PathFinder 5 ($40, on sale for $25 to educational users until September 7).
(Forklift is very similar to PathFinder and also very good. The current version isn’t quite as well-featured as PathFinder (for instance, there’s no command line or version control), but a) it has a slightly smaller memory footprint, b) at $30 it’s cheaper for a non-educational license, and c) the interface is less cluttered and looks more like the Finder.)
PathFinder is a stand-alone program (as is Forklift) whereas TotalFinder is a mod of the standard Finder. This is something of a trade-off. On the one hand, PathFinder has a much bigger feature set, including such things as bookmarks, menu/submenu file navigation, a command line, a drop stack, etc. The only thing TotalFinder does is let you choose between a tabbed interface and a dual-pane interface. Also, because it mods the Finder, TotalFinder only works with Snow Leopard. On the plus side, being part of Finder means you don’t have both Finder and your orthodox file manager (OFM) running, as is the case with PathFinder. It also means that any services you’ve written for Finder will automatically work in TotalFinder, whereas I found I had to recreate them in Automator for the benefit of PathFinder.
Parenthetically, a tip for using PathFinder is to add a link to the “Favorites” folder (~/Library/Favorites) to your “Places” in the Finder/PathFinder sidebar. The advantage of doing so is that this gives you access to the “favorites” from within open/save dialogs, which I find very convenient for navigating if I’m trying to reach something that I use a lot but nonetheless is not in “Recent Places.” Note that you can do so without any kind of special file manager software just by creating a “favorites” folder of your own and populating it with aliases.
Also of possible interest for orthodox file manager devotees is the cross-platform muCommander (free). Myself, I gave up on it for reasons that are common to cross-platform software: it’s written in an interpreted language so it launches slowly, and it doesn’t integrate with Spotlight, QuickLook, services, the Aqua toolkit, etc. I do appreciate it when the FOSS community compiles a Mac binary of cross-platform software, and I use Lyx, NetLogo, and GIMP all the time, but a file manager is the kind of thing where I really prefer thick compatibility with the operating system.
Newsola
| Gabriel |
Traditional news organizations have a long-standing ethical code (but no legal restriction) against paying their sources. On the other hand, exclusives with sources can be valuable, especially when those sources can tell us about something cool like sex and/or murder involving celebrities (or at least attractive white girls aged 15-30) rather than some boring shit about Congress or whatshisface from that place where they don’t like us.
The Atlantic has a very interesting story about Larry Garrison, a freelance news producer whose job it is to square the circle between the value being offered and the refusal to pay. Mr. Garrison’s basic business model is to quickly identify people who have been thrust into the news, offer to (for lack of a better term) represent them, and then withhold their appearance from news outlets that refuse to take Garrison on as a segment producer. Mr. Garrison mostly gets paid for being a segment producer, and his sources either get kickbacks from these producing fees or, more often, he gets them book deals and/or arranges for them to license various artifacts and footage to the news outlet. (The rule against paying sources for testimony allows a loophole for buying photographs, etc., from them, and in practice there’s a lot of implicit bundling.)
This whole set of business arrangements is similar to payola in two respects, one of which is articulated in Coase and the other described in Dannen.
First, the Coase point is that while payola is often conceived of as a bribe to corrupt the broadcaster, it can just as easily be conceived of as a payment for a valuable input, and hence a payola prohibition is a monopsonistic cartel: the purchasers of an input conspire to fix a low price. Specifically, record labels refuse to pay anything for publicity and news organizations refuse to pay for sources. This cartel can be informal (as it is with news and as it was with music trade group agreements in 1917 and 1986) or formal (as with the payola law of 1960). Either way, it’s vulnerable to cheating.
Second, the point illustrated in Dannen is that when cheating occurs, the requirements of plausible deniability and/or etiquette will promote the emergence of brokers who can extract rents for their trouble. In the case of news that would be Mr. Garrison and in the case of pop music in the late 1970s and early 1980s that would be a cartel of sketchy radio consultants affiliated with the mafia.
Heads or Tails of your dta
| Gabriel |
A lot of languages (e.g., Bash and R) offer the commands/functions “head” and “tail.” These show you the first or last few lines of a file or object. “Head” is similar to typing “list in 1/10” in Stata, and you’d use it for similar reasons. Because I’m getting used to the Unix versions, I wrote an ado-file that lets “head” and “tail” work in Stata. Note that these Stata programs only work on the master dataset and can’t be applied to using datasets or to output the way the Unix versions can.
Update: see the comments for some suggestions on how to work in error messages and the like.
capture program drop head
program define head
	*list the first ten observations (or all of them if the dataset has fewer)
	if _N < 10 {
		local ten = _N
	}
	else {
		local ten 10
	}
	syntax [varlist(default=none)]
	list `varlist' in 1/`ten'
end

capture program drop tail
program define tail
	*list the last several observations
	syntax [varlist(default=none)]
	local theend = _N
	local theend_min10 = _N - 10
	if `theend_min10' < 1 {
		local theend_min10 1
	}
	list `varlist' in `theend_min10'/`theend'
end
*have a nice day
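Once these programs are defined (or saved as head.ado and tail.ado somewhere on your ado-path), they work like any other command. A quick hypothetical session, using the auto dataset that ships with Stata:

sysuse auto, clear
head
head make price mpg
tail make price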
Blogola
| Gabriel |
The Daily Caller has a very interesting story on political blogs having suspicious relationships with political candidates and committees. For the most part these relationships involve buying ad space or hiring bloggers to do consulting. It’s entirely natural that candidates would buy ad space on these blogs, but it gets suspicious when they are doing it at ten times the market-rate CPM. Likewise, if you’re a campaign committee interested in doing blogger outreach, who better to hire to write the report than a blogger? But this is the kind of thing the blogger really should disclose.
I found this interesting in part because I work on pop music payola and there are parallels, deeper even than the obvious. First, take this as evidence for Coase’s take on payola: when something is valuable, a market for it will emerge. Second, in radio, payola hits every fourteen years, like a cicada, but every time the details are different (this will be described at length in the payola chapter of my book, Climbing the Chart; look for it in fine bookstores everywhere sometime in 2013). The practice of overpaying for legitimate services is very similar to how payola was practiced in the 1950s, when it was common for record labels to hire disk jockeys to moonlight as consultants, party hosts, etc., or to buy services from companies owned by the disk jockeys.
Finally, if the Whitman campaign is reading this, they should know that I’m a Princeton alum (’05) and a California opinion leader with literally dozens of readers. I don’t currently carry ads but I’d consider it for $500 per page view. I’m also a social scientist with invaluable expertise who could do some GSS cross-tabs for a mere $50,000. I’m just saying.
Zotero Hint: Empty the trash
| Gabriel |
I use Zotero to scrape/manage my citations and then I export them to BibTeX for use with Lyx/LaTeX. I noticed some phantom citations in the BibTeX file (and by extension, in Lyx) that didn’t appear in Zotero. For instance, I had two versions of the Espeland and Sauder AJS 07 cite, one of which misspelled “Sauder” as “Saunder,” but only the correct spelling appeared in Zotero. After puzzling over this for a bit, I realized that Zotero has a “trash” folder within “My Library” and for some reason it was including the contents of the trash when I exported to BibTeX. Empty the trash and problem solved.
Grep nlogo
| Gabriel |
NetLogo has pretty good documentation, but sometimes I’d like to see a bit more example code. Fortunately “.nlogo” files are stored as text, which means you can grep them to find examples of how a function is used in the model library. For instance, this command will give you a list of models in the model library that use the function “layout-circle”:
grep -l -r 'layout-circle' "/Applications/NetLogo 4.1/models/Sample Models/"
Note that NetLogo itself only allows one open file at a time, which is why I like to have the example code open in a text editor while the program I’m working on is open in NetLogo itself. Personally, I use TextMate for this, in part because I like it in general but also because the Logo bundle works pretty well with NetLogo and features like code folding are especially useful for NetLogo’s coding style.
Life Without Walls
| Gabriel |
So Microsoft now has a page on why you should choose Windows over Mac. What’s interesting to me as an econ soc guy is that most of the things on the list, and certainly most of the things on the list that are actually compelling, rely in one way or another on network externalities, which implies that the advantages of Windows are mostly an issue of path dependence.
- More familiar interface if you’re already used to Windows — network externality
- Easier to share documents — network externality
- Availability of games — mostly a network externality issue, partly that some developers prefer DirectX over OpenGL
Of course they don’t mention that the main disadvantage of Windows is also in large part a network externality issue.
Although the “PC vs Mac” page mostly just finds different ways to say “because they’re popular,” it also lists a few issues that might be interesting even to Robinson Crusoe. Some of these other issues are good points (e.g., Apple’s insistence on bizarre video ports that require you to use dongles and which aren’t even standard within Apple’s own product line) and others are just stupid or misleading (e.g., that Macs don’t have touch).
To avoid giving the impression that I’ve fallen into the reality distortion field, let me provide my own list of non-network-externality reasons that I see as advantages of PCs over Macs:
- Price
- Two button mice
- A file manager that allows a traditional multi-pane interface
- The availability of tray-loading optical drives that actually work reliably, rather than the exclusive use of slot-loading optical drives that often refuse to accept discs or, having accepted them, require you to turn the machine on its side to get them to eject.
Getting long flat-files out of field-tagged data
| Gabriel |
Some field-tagged data can be pretty unproblematically reshaped into flat-files. However, one of the reasons people like field-tagged data is that it can have internal long structures, and this creates problems for reshaping. For instance, in Web of Science the “CR” field (works that are being cited by the main work) usually has dozens of references separated by carriage returns. To get this into most statistics packages it has to be reshaped into a long flat-file. In other words, you need a script to turn this:
PT J
AU LEVINSON, RM
TI TEACHING LABELING THEORY - 4 EXPERIENCES IN ILLNESS ATTRIBUTION
SO TEACHING SOCIOLOGY
LA English
DT Article
C1 EMORY UNIV,ATLANTA,GA 30322.
CR BECKER HS, 1963, OUTSIDERS
   MENDEL WM, 1969, ARCH GEN PSYCHIAT, V20, P321
   ROSENHAN DL, 1973, SCIENCE, V179, P250
   SCHEFF TJ, 1964, SOC PROBL, V11, P401
   SCHEFF TJ, 1966, BEING MENTALLY ILL
NR 5
TC 5
PU AMER SOCIOLOGICAL ASSOC
PI WASHINGTON
PA 1722 N ST NW, WASHINGTON, DC 20036-2981
SN 0092-055X
J9 TEACH SOCIOL
JI Teach. Sociol.
PY 1975
VL 2
IS 2
BP 207
EP 211
PG 5
SC Education & Educational Research; Sociology
GA AD732
UT ISI:A1975AD73200007
ER
Into this:
ISI:A1975AD73200007	TEACH SOCIOL	BECKER HS, 1963, OUTSIDERS
ISI:A1975AD73200007	TEACH SOCIOL	MENDEL WM, 1969, ARCH GEN PSYCHIAT, V20, P321
ISI:A1975AD73200007	TEACH SOCIOL	ROSENHAN DL, 1973, SCIENCE, V179, P250
ISI:A1975AD73200007	TEACH SOCIOL	SCHEFF TJ, 1964, SOC PROBL, V11, P401
ISI:A1975AD73200007	TEACH SOCIOL	SCHEFF TJ, 1966, BEING MENTALLY ILL
Note that the other fields usually lack carriage returns or other internal long delimiters, so they can be cleaned into an ordinary wide flat-file (as with my wos2tab script). The two files can then be merged (in a flat-file package like Stata) or linked (in a relational setup like R or Access) using the key.
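To make the merge step concrete, here’s a minimal Stata sketch. The file names are hypothetical: say “wos.tab” is the wide, record-level file (one row per article) and “wos.txt.long” is the long citation file written out by the Perl script below; it also assumes Stata 11’s merge syntax.

* read the long citation file and save it as its own dataset
insheet using "wos.txt.long", tab names clear
save wos_cr_long, replace
* read the wide, record-level file and attach the citations by the ut key
insheet using "wos.tab", tab names clear
merge 1:m ut using wos_cr_long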
I’ve actually already done this kind of thing twice, with my code for cleaning Memetracker and the IMDb business file. However, those two datasets had the convenient property that the record key appears in the first line of the record. With data structures like that, you just remember the key and then, every time you come across a long entry, write it out along with the key.
Unfortunately, Web of Science has the record key appear towards the end of the record, a data structure that I propose to call “a huge pain in the ass.” This means that you have to collect all the long values in an array, then record the record key, then loop over the array to write out.
#!/usr/bin/perl
#wos2tab_cr.pl by ghr
#this script converts field-tagged WOS queries to tab-delimited text
#it extracts the CR field and attaches to it the fields UT and J9
#unlike wos2tab, this file outputs long-formatted data
#the two types of output can be merged with the UT field
#since CR comes /before/ UT and J9, must save as array, then loop over array at ER

use warnings;
use strict;
die "usage: wos2tab_cr.pl <wos data>\n" unless @ARGV==1;
my $rawdata = shift(@ARGV);

my $ut = "" ;           #unique article identifier
my $j9 = "" ;           #j9 coding of journal title
my $cr = "" ;           #cited work
my @crlist = () ;       #list of cited works
my $cr_continued = 0 ;  #flag for recently hitting "^CR"

print "starting to read $rawdata\n";

open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">$rawdata.long") or die "error creating $rawdata.long\n";
print OUT "ut\tj9\tcr\n";
while (<IN>) {
	# if line begins with a non-whitespace character other than CR, turn off the cr_continued flag
	if($_ =~ m/^\S/) {
		if($_ =~ m/^[^CR]/) {$cr_continued = 0}
	}
	if($_ =~ m/^J9/) {
		$j9 = $_;
		$j9 =~ s/\015?\012//; #manual chomp
		$j9 =~ s/^J9 //;      #drop leading tag
	}
	if($_ =~ m/^UT/) {
		$ut = $_;
		$ut =~ s/\015?\012//; #manual chomp
		$ut =~ s/^UT //;      #drop leading tag
	}
	#first line of a CR field
	if($_ =~ m/^CR/) {
		$cr = $_;
		$cr =~ s/\015?\012//; #manual chomp
		$cr =~ s/^CR //;      #drop leading tag
		$cr_continued = 1 ;   #flag to allow for multi-line CR field
		push @crlist, $cr;    #add the new cite to the list
	}
	#subsequent (indented) lines of a CR field
	if($_ =~ m/^ /) {
		if($cr_continued==1) {
			$cr = $_ ;
			$cr =~ s/\015?\012//; #manual chomp
			$cr =~ s/^ *//;       #drop leading whitespace
			push @crlist, $cr;    #add the new cite to the list
		}
	}
	#when "end record" code is reached, loop over array to write out as long file, then clear memory
	if($_=~ /^ER/) {
		#loop over CRLIST array, printing for each value so as to have long file
		for (my $i = 0; $i < @crlist; $i++) {
			print OUT "$ut\t$j9\t$crlist[$i]\n"; #write out
		}
		#clear memory, just do once per "end record"
		$j9 = "" ;
		$ut = "" ;
		$cr = "" ;
		$cr_continued = 0 ;
		@crlist = () ;
	}
}
close IN;
close OUT;
print "done writing $rawdata.long \n";
Zero marginal productivity
| Gabriel |
One of the intellectually exciting but practically horrifying ideas being discussed recently on econ blogs is the “zero marginal product” worker (see here and here). That is, labor that contributes nothing to the firm, or at least doesn’t contribute enough to justify the minimum wage and overhead. If you take this seriously and game it out, you’re faced with the prospect of large-scale permanent unemployment, which is not only a bad thing for the underclass themselves but tends to produce politics of redistribution that look a lot less like T. H. Marshall or John Rawls than Gaius Gracchus or Hugo Chávez.
Anyway, I was particularly interested in this post (h/t MR) because it makes explicit that once you take into consideration the possibility of catastrophic failure, you can even have workers whose marginal productivity is negative, at least in expectation. The logic underlying negative productivity gets into theoretical issues reviewed in my paper (with Esparza and Bonacich) on spillovers in Hollywood, but the theoretical Ur-cites for this downside model are Jacobs in sociology and Kremer in economics. Basically, every so often a worker will create a catastrophic failure. If the probability of creating a screw-up declines with skill, then it makes sense to avoid low-skilled workers, even if on the median day they would have nontrivial positive productivity. The scary thing about negative productivity (in expectation) is that it implies that there might still be some nontrivial permanent unemployment even if we tried every plausible policy intervention to make hiring low-skilled workers more attractive: a lower minimum wage, eliminating the payroll tax, single-payer health care (which would make moot the current stigma on not providing health benefits), reasonably generous wage subsidies, etc.
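To make the expected-value logic explicit, here’s a stylized formula (my own notation, not anything from the linked posts): a worker’s expected marginal product is E[MP] = m - p(s)*C, where m is output on an ordinary day, p(s) is the probability of a catastrophic screw-up (declining in skill s), and C is the cost of the catastrophe. For low enough s, p(s)*C can exceed m, so the expectation goes negative even though the worker is productive on the median day.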
Some ways Stata is an unusual language
| Gabriel |
As I’ve tried to learn other languages, I’ve realized that part of the difficulty isn’t that they’re hard (although in some cases they are) but that I’m used to Stata’s very distinctive paradigm and nomenclature. Some aspects of Stata are pretty standard (e.g., “while”/”foreach”/”forvalues” loops, log files, and the “file” syntax for using text files on disk), but other bits are pretty strange. Or rather, they’re strange from a computer science perspective but intuitive from a social science perspective.
Stata seems to have been designed to make sense to social scientists, and if this makes it confusing to programmers, then so be it. A simple example of this is that Stata uses the word “variable” in the sense meant by social scientists. More broadly, Stata is pretty bold about defaults so as to make things easy for beginners. It presumes that anything you’re doing applies to the dataset (aka the master data), which is always a flat-file database. Other things that might be held in memory have a secondary status, and beginning users don’t even know that they’re there. Likewise, commands distinguish between the important arguments (usually variables) and the secondary arguments, which Stata calls “options.” There are also the very sensible assumptions about what to report and what to put in ephemeral data objects that can be accessed immediately after the primary command (but need not be stored as part of the original command, as they would be in most other languages).
Note, I’m not complaining about any of this. Very few of Stata’s quirks are pointlessly arbitrary. (The only arbitrary deviation I can think of is using “*” instead of “#” for commenting.) Most of Stata’s quirks are necessary in order to make it so user-friendly to social scientists. In a lot of ways R is a more conventional language than Stata, but most social scientists find Stata much easier to learn. In part because Stata is willing to deviate from the conventions of general-purpose programming languages, running and interpreting a regression in Stata looks like “reg y x” instead of “summary(lm(y~x))”, and loading a dataset looks like “use mydata, clear” instead of “data <- read.table('mydata.txt')”. Stata has some pretty complicated syntax (e.g., the entire Mata language) but you can get a lot done with just a handful of simple commands like “use,” “gen,” and “reg”.
Nonetheless, all this means that when Stata native speakers like me learn a second programming language, it can be a bit confusing. And FWIW, I worry that rumored improvements to Stata (such as allowing relational data in memory) will detract from its user-friendliness. Anyway, the point is that I love Stata and I think it’s entirely appropriate for social scientists to learn it first. I do most of my work in Stata and I teach/mentor my graduate students in Stata unless there’s a specific reason for them to learn something else. At the same time I know that many social scientists would benefit a lot from also learning other languages. For instance, people into social networks should learn R, people who want to do content analysis should learn Perl or Python, and people who want to do simulations should learn NetLogo or Java. The thing is that when you do, you’re in for a culture shock, and so I’m making explicit some ways in which Stata is weird.
Do-files and Ado-files. In any other language a do-file would be called a script and an ado-file would be called a library. Also note that Stata very conveniently reads all your ado-files automatically, whereas most other languages require you to specifically load the relevant libraries into memory at the beginning of each script.
Commands, Programs, and Functions. In Stata a program is basically just a command that you wrote yourself. Stata is somewhat unusual in drawing a distinction between a command/program and a function. So in Stata a function usually means some kind of transformation that attaches its output to a variable or macro, as in “gen ln_income=log(income)”. In contrast, a command/program is pretty much anything that doesn’t directly attach to an operator and includes all file operations (e.g., “use”) and estimations (e.g., “regress”). Other languages don’t really draw this distinction but consider everything a function, no matter what it does and whether the user wrote it or not. (Some languages use “primitive” to mean something like the Stata command vs. program distinction, but it’s not terribly important.)
Because most languages only have functions, pretty much everything has to be assigned to an object via an operator. Hence Stata users would usually type “reg y x” whereas R users would usually type “myregression <- lm(y~x)”. This is because “regress” in Stata is a command whereas “lm()” in R is a function. Also note that Stata distinguishes between commands and everything else by word order, with the command being the first word. In contrast, functions in other languages (just like Stata functions) put the function name outside the parentheses, and inside the parentheses go all of the arguments, both data objects and options.
The Dataset. Stata is one of the only languages where it’s appropriate to use the definite article in reference to data. (NetLogo is arguably another case of this). In other languages it’s more appropriate to speak of “a data object” than “the dataset,” even if there only happens to be one data object in memory. For the same reason, most languages don’t “use” or “open” data, but “read” the data and assign it to an object. Another way to think about it is that only Stata has a “dataset” whereas other languages only have “matrices.” Of course, Stata/Mata also has matrices but most Stata end users don’t bother with them as they tend to be kind of a backend thing that’s usually handled by ado-files. Furthermore, in other languages (e.g., Perl) it’s common to not even load a file into memory but to process it line-by-line, which in Stata terms is kind of like a cross between the “file read/write” syntax and a “while” loop.
Variables. Stata uses the term “variable” in the statistical or social scientific meaning of the term. In other languages this would usually be called a field or vector.
Macros. What most other languages call variables, Stata calls local and global “macros.” Stata’s usage of the local vs global distinction is standard. In other languages the concept of “declaring” a variable is usually a little more explicit than it is in Stata.
Stata is extremely good about expanding macros in situ and this can spoil us Stata users. In other languages you often have to do a crude workaround: first use some kind of concatenate function to create a string object containing the expansion, and then use that string object. For instance, if you wanted to access a series of numbered files in Stata you could just loop over this:
use ~/project/file`i', clear
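Spelled out as a full loop, that might look like this (a hypothetical sketch assuming five numbered files in ~/project):

forvalues i = 1/5 {
	use ~/project/file`i', clear
	*do whatever needs doing to each file here
}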
In other languages you’d have to add a separate line for the expansion. So in R you’d loop over:
filename <- paste('~/project/file', i, sep="")
data <- read.table(filename)
[Update: Also see this Statalist post by Nick Cox on the distinction between variables and macros]
Reporting. Stata allows you to pass estimations on for further work (that’s what return macros, ereturn matrices, and postestimation commands are all about), but it assumes you probably won’t and so it is unusually generous in reporting most of the really interesting things after a command. In other languages you usually have to specifically ask to get this level of reporting. Another way to put it is that in Stata verbosity is assumed by default and can be suppressed with “quietly,” whereas in R silence is assumed by default and verbosity can be invoked by wrapping the estimation (or an object saving the estimation) in the “summary()” function.
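As a minimal illustration of the Stata side (assuming variables y and x are in memory):

quietly regress y x   // suppress the usual regression table
ereturn list          // inspect the ephemeral results the estimation left behind
display _b[x]         // the coefficients remain available for postestimation work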