Continuing my discussion of Long’s Workflow …
One of the things that Long is appropriately insistent on is good archiving for the long term. First, he notes that the most serious issue is the physical storage medium and the need to migrate data whenever you get a new system, given that even formats that were popular in recent memory, like Zip disks and 3.5-inch floppies, are now almost impossible to find hardware for. I think this should become easier in the future as hard drives get so ginormous that it’s increasingly feasible to keep everything on one disk rather than pushing your archives to removable media. When it’s all on one (or a few) internal disks you tend to migrate and back up, whereas removable media get lost in your file cabinet until they are obsolete and/or physically corroded. Of course, in those increasingly rare instances where IRB or proprietary data provision issues are not a concern, the best way to handle this is to deposit the data with ICPSR, CPANDA, or a similar public archive.
Even if you can access the files, the issue is whether you can read them. Long appropriately stresses keeping data in several formats, but I think he’s a bit too agnostic about which formats are likely to last. As I see it there are basically two issues: popularity and opacity. Popularity is simply how many people use the format. For this reason Long endorses SAS Transport, because it’s the official format of the FDA. However, Long overlooks the other key issue of opacity, which basically comes down to the two related issues of being proprietary and being binary (as compared to text).
The more popular and the less opaque a format is, the more likely it is that you’ll be able to read your data in the future. So looking twenty years or so into my crystal ball, I think it’s fair to guess that Stata binary will not be readable with any ease and that uncompressed tab-delimited ASCII will remain the lingua franca of data. I say tab-delimited instead of fixed-width because dictionary files get lost, tab-delimited instead of csv because embedded literal commas are common whereas embedded literal tabs are virtually nonexistent, and uncompressed because compressed files are more vulnerable to corruption.
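To make the tab-versus-comma point concrete, here is a minimal Python sketch of writing a dataset as uncompressed tab-delimited text. The function name to_tab_delimited and the sample rows are made up for illustration, and the rule of refusing any value that contains a tab or line break is my assumption about how you’d want to guard against the rare embedded tab, not anything Long prescribes.

```python
def to_tab_delimited(header, rows):
    """Return the data as tab-delimited text, one record per line.

    Hypothetical helper: commas pass through untouched, but a value
    containing a tab or line break is rejected, since it would silently
    shift columns or split a record.
    """
    lines = []
    for record in [header, *rows]:
        fields = [str(value) for value in record]
        for field in fields:
            if any(ch in field for ch in "\t\r\n"):
                raise ValueError(f"embedded tab or line break in {field!r}")
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Illustrative data only; note the literal comma inside a value.
    text = to_tab_delimited(["id", "name"], [[1, "Smith, John"], [2, "Doe, Jane"]])
    with open("archive.txt", "w", encoding="ascii") as handle:
        handle.write(text)
```

The comma in "Smith, John" needs no quoting or escaping here, which is exactly the headache csv creates.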
The problem with ASCII is that if (like Long) you find value labels and variable labels to be crucial, then plain ASCII loses a lot of value because the labels don’t survive the export. I think a good compromise is the Stata XML format. As you can see by opening one of these files in a text editor, XML is human-readable text, so even if no off-the-shelf import filters exist (which is unlikely, as XML is increasingly the standard) you could with relatively little effort/cost write a filter yourself in a text-processing language like Perl, or whatever the equivalent of Perl will be in a generation.
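As a rough idea of how little work such a filter would be, here is a sketch in Python rather than Perl. Be warned that the XML layout in SAMPLE (the dataset, variable, obs, and v elements and the name and label attributes) is invented for illustration and is not Stata’s actual XML schema; a real filter would have to be adapted to whatever tags you see when you open the file in a text editor.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout, NOT Stata's real XML schema.
SAMPLE = """<dataset>
  <variables>
    <variable name="id" label="Respondent ID"/>
    <variable name="educ" label="Years of education"/>
  </variables>
  <observations>
    <obs><v>1</v><v>12</v></obs>
    <obs><v>2</v><v>16</v></obs>
  </observations>
</dataset>
"""


def xml_to_tab_delimited(xml_text):
    """Return (variable labels, tab-delimited text) from the assumed layout."""
    root = ET.fromstring(xml_text)
    variables = root.findall("./variables/variable")
    names = [var.get("name") for var in variables]
    # Keep the variable labels, which plain ASCII would otherwise lose.
    labels = {var.get("name"): var.get("label", "") for var in variables}
    lines = ["\t".join(names)]
    for obs in root.findall("./observations/obs"):
        lines.append("\t".join(cell.text or "" for cell in obs.findall("v")))
    return labels, "\n".join(lines) + "\n"


if __name__ == "__main__":
    labels, text = xml_to_tab_delimited(SAMPLE)
    print(labels)
    print(text, end="")
```

The point is only that, because the file is text with self-describing tags, the labels can be recovered with a standard XML parser and a dozen lines of code.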
Because it’s smaller and faster than XML, I still use Stata binary for day-to-day work, but I’m going to make a point of periodically making uncompressed XML archives, especially when I finish a project.