Archiving

July 1, 2009 at 5:01 am 2 comments

Continuing my discussion of Long’s Workflow

One of the things that Long is appropriately insistent on is good archiving for the long term. First, he notes that the most serious issue is the physical storage medium and the need to migrate data whenever you get a new system, given that even formats that were popular in recent memory, like Zip disks and 3.5-inch floppies, are now almost impossible to find hardware for. I think this should become easier in the future as hard drives get so ginormous that it’s increasingly feasible to keep everything on one disk rather than pushing your archives to removable media. When it’s all on one (or a few) internal disks you tend to migrate and back up, unlike removable media, which get lost in your file cabinet until they are obsolete and/or physically corroded. Of course, in those increasingly rare instances where IRB or proprietary data provision issues are not a concern, the best way to handle this is to use ICPSR, CPANDA, or a similar public archive.

Even if you can access the files, the question is whether you can read them. Long appropriately stresses keeping data in several formats, but I think he’s a bit too agnostic about which formats are likely to last. As I see it there are basically two issues: popularity and opacity. Popularity is simply how many people use the format. For this reason Long endorses SAS Transport, because it’s the official format of the FDA. However, Long overlooks the other key issue of opacity, which basically comes down to the two related issues of being proprietary and being binary (as compared to text).

The more popular and the less opaque a format is, the more likely it is that you’ll be able to read your data in the future. So looking twenty years or so into my crystal ball, I think it’s fair to guess that Stata binary will not be readable with any ease and uncompressed tab-delimited ASCII will remain the lingua franca of data. I say tab-delimited instead of fixed-width because dictionary files get lost, tab-delimited instead of CSV because embedded literal commas are common whereas embedded literal tabs are all but nonexistent, and uncompressed because compressed files are more vulnerable to corruption.
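As a minimal sketch of what such an archive file involves, here is one way to write a small table as uncompressed tab-delimited ASCII in Python. The function name `to_tab_delimited` and the sample data are made up for illustration; the only real design choices are that a field containing a tab or line break is rejected rather than quoted, and that non-ASCII text raises an error instead of being silently converted.

```python
def to_tab_delimited(header, rows):
    """Return a table as tab-delimited ASCII text, one record per line.

    `header` is a list of column names and `rows` a list of records.
    Both names are hypothetical, chosen only for this sketch.
    """
    lines = []
    for record in [header, *rows]:
        fields = [str(value) for value in record]
        for text in fields:
            # An embedded tab or line break would silently shift columns,
            # so refuse to write it instead of inventing a quoting rule.
            if any(ch in text for ch in "\t\r\n"):
                raise ValueError(f"field {text!r} contains a tab or line break")
            # Raises UnicodeEncodeError (a ValueError) on non-ASCII text.
            text.encode("ascii")
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


# Embedded commas need no special handling in a tab-delimited file.
archive_text = to_tab_delimited(
    ["city", "population"],
    [["Washington, D.C.", 600000], ["Los Angeles", 3800000]],
)
print(archive_text, end="")
```

Because nothing is quoted or escaped, the file can be read back by splitting each line on tabs, which is about as simple as an import filter gets.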

The problem with ASCII is that if (like Long) you find value labels and variable labels to be crucial, then ASCII loses a lot of value. I think a good compromise is the Stata XML format. As you can see by opening it in a text editor, XML is human-readable text, so even if no off-the-shelf import filters exist (which is unlikely, as XML is increasingly the standard) you could with relatively little effort/cost write a filter yourself in a text-processing language like Perl, or whatever the equivalent of Perl will be in a generation.
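To give a sense of how little effort such a home-made filter takes, here is a rough sketch in Python rather than Perl. The XML layout in `SAMPLE` is invented for this example and is not Stata's actual XML schema; the element and attribute names (`variable`, `label`, `obs`, `v`) and the function `read_labeled_xml` are all assumptions. The point is only that labeled data stored as text can be recovered with a standard XML parser and a few loops.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout, NOT Stata's real schema: variable labels,
# value labels, and observations all live in one text file.
SAMPLE = """<dataset>
  <variables>
    <variable name="educ" label="Years of education"/>
    <variable name="sex" label="Respondent sex"/>
  </variables>
  <value_labels>
    <label variable="sex" value="1">Male</label>
    <label variable="sex" value="2">Female</label>
  </value_labels>
  <observations>
    <obs><v name="educ">12</v><v name="sex">2</v></obs>
    <obs><v name="educ">16</v><v name="sex">1</v></obs>
  </observations>
</dataset>"""


def read_labeled_xml(text):
    """Return (variable_labels, rows) parsed from the hypothetical layout."""
    root = ET.fromstring(text)

    # Map each variable name to its descriptive label.
    variable_labels = {
        var.get("name"): var.get("label") for var in root.iter("variable")
    }

    # Map each variable name to a {raw value: label} lookup table.
    value_labels = {}
    for lab in root.iter("label"):
        value_labels.setdefault(lab.get("variable"), {})[lab.get("value")] = lab.text

    # Build one dict per observation, swapping in value labels where defined.
    rows = []
    for obs in root.iter("obs"):
        row = {}
        for cell in obs.iter("v"):
            name, raw = cell.get("name"), cell.text
            row[name] = value_labels.get(name, {}).get(raw, raw)
        rows.append(row)
    return variable_labels, rows


variable_labels, rows = read_labeled_xml(SAMPLE)
print(variable_labels)
print(rows)
```

A real archive would need the filter adapted to whatever tags the file actually uses, which you can find by opening the file in a text editor.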

Because it’s smaller and faster than XML, I still use Stata binary for day-to-day work, but I’m going to make a point of periodically making uncompressed XML archives, especially when I finish a project.

Entry filed under: Uncategorized.


2 Comments

  • 1. Eszter Hargittai  |  July 5, 2009 at 11:02 am

    Some good points here, thanks.

    The screen-shot video files I generated for my dissertation are very hard to access at this point. It’s amazing how far we’ve come in terms of storage capacity even in just the last 7-8 years. I couldn’t afford to get lots of external hard drives for my data then, so the files are stuck on CDs that are nearly impossible to read (not because of format actually, but because of corruption that I can’t explain).

    I do still keep physical copies of the data separately from the ones backed up in the cloud, just as an extra precaution. I realize you weren’t really talking about backups here, rather, future-proofing, but I thought I’d mention it.

    As for this comment: “when I finish a project” – it made me smile. Have you ever finished a project? I guess there’s one I can think of that I really know I’ve finished, but others, I keep thinking one day I may still revisit…

  • 2. Mike3550  |  July 15, 2009 at 12:13 am

    I was thinking about this post for a while, and I realized that another possibility for archiving data in ASCII format is to store the data dictionary in the same file as the rectangular data. Stata’s infile2 has this ability for fixed-format files, where the data can appear on the lines following the dictionary. Variable labels can be commented out in the same line as the text. Also, I think that Stata’s do files provide a very simple human-readable and easily parsed format to recover variable names and labels.

    I agree that XML is the way to go for permanent archiving. But in the meantime, I always prefer tab- or pipe-delimited formats (pipe-delimited because it has the advantage that you don’t have a large amount of extra space when you open the file in a text editor) as an easy way to store/recover data.

