Field-tagged data

June 10, 2009 at 5:52 am

Gabriel

Most of the datasets we deal with are rectangular, in that the variables are always in the same order (whether free or fixed format) and the records are delimited with a carriage return. A data format that’s less familiar to us but actually quite common in other applications is the field-tagged format. One example is the BibTex citation database format; likewise, some of the files in IMDb are a weird hybrid of rectangular and field-tagged. If data formats were human languages and sentences were data records, rectangular formats would be word-order syntax (like English) and field-tagged formats would be case-marker syntax (like Latin or German). (Yes, I have a bad habit of making overly complicated metaphors that make sense only to me.)

In rectangular formats like field-delimited data (.csv or .tsv) or fixed-width data (.prn) you have one record per row and the same variables in the same order for each row, with the variables either separated by a delimiter (usually comma or tab) or fixed-width, each variable being defined in the data dictionary as columns x-y (which was a really good idea back when we used punch cards, you know, to keep track of our dinosaur herds). In contrast, in a field-tagged format each record spans multiple rows and the first row contains the key that identifies the record. Subsequent rows usually begin with a tab, then a tag that identifies the name of the variable, followed by a delimiter and finally the actual content of the variable for that case. The beginning and end of the record are flagged with special characters. For example, here’s a BibTex entry:

@book{vogel_entertainment_2007,
	address = {Cambridge},
	edition = {7th ed.},
	title = {Entertainment Industry Economics: A Guide for Financial Analysis},
	isbn = {9780521874854},
	publisher = {Cambridge University Press},
	author = {Harold Vogel},
	year = {2007}
},

The first thought is, why would anyone want to organize data this way? It certainly doesn’t make it easier to load into Stata (and even if it’s less difficult in R it’s still going to be harder than reading a csv). Basically, people use field-tagged data because it’s more human-readable and human-editable (a lot of people write BibTex files by hand, although personally I find it easier to let Zotero do it). Not only do you not have to remember what the fifth variable is, but you have more flexibility with things like “comment” fields, which can be any length and have internal carriage returns. This is obviously a nice feature for a citation database, as it means you can keep detailed notes directly in the file. Furthermore, field-tagged formats are good for situations where you have a lot of “missing data.” BibTex entries can potentially have dozens of variables but most works only require a few of them. For instance, the Vogel citation has only eight fields, and most of the other potential fields (things like translator, editor, journal title, series title, etc.) are appropriately “missing” because they are simply not applicable to this book. It saves a lot of whitespace in the file to omit these fields entirely rather than including them but coding them as missing (which is what you’d have to do to format BibTex as rectangular).

Nonetheless, if you want to get it into Stata, you need to shoehorn it into rectangular format. Perhaps this is all possible to handle with the “infile” command, but last time I tried I couldn’t figure it out. (Comments are welcome if anyone actually knows how to do this.) The very clumsy hack I use for this kind of data is to use a text editor to do a regular expression search that first deletes everything but the record key and the variable I want. I then do another search to convert carriage returns to tabs for lines beginning with the record key. I now have a rectangular dataset with the key and one variable, which I can save and get into Stata. This is a totally insane example, both because I can’t imagine why you’d want citation data in Stata and also because there are easier ways to do this (like export filters in citation software), but imagine that you wanted to get “year” and “author” out of a BibTex file and make it rectangular. For “year,” you would run something like the following regexp pattern as a search-and-delete through a text editor (or write it into a perl script if you planned on doing it regularly):

^\t(?!year).+\r
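To make the two passes concrete, here’s a rough Python sketch of the same idea (purely hypothetical translation of the hack; the sample entry is abbreviated from the Vogel example above, and the variable names are just for illustration):

```python
import re

# A field-tagged record, abbreviated from the BibTex example above
bibtex = """@book{vogel_entertainment_2007,
\taddress = {Cambridge},
\tauthor = {Harold Vogel},
\tyear = {2007}
},
"""

# Pass 1: keep only record-key lines (@...) and the one tagged variable we want
wanted = "year"
kept = [ln for ln in bibtex.splitlines()
        if ln.startswith("@") or ln.lstrip("\t").startswith(wanted + " ")]

# Pass 2: fold each record onto a single row: key, then the variable's value
rows = []
for ln in kept:
    if ln.startswith("@"):
        key = ln.split("{")[1].rstrip(",")   # e.g. vogel_entertainment_2007
        rows.append([key, ""])
    else:
        rows[-1][1] = re.search(r"\{(.*)\}", ln).group(1)

print("\n".join("\t".join(r) for r in rows))
# → vogel_entertainment_2007	2007
```

The result is a tab-delimited file with the key and one variable, which is exactly what the text-editor version of the hack produces.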

Sometimes this is all you need, but what if you want several variables? Basically, rinse, wash, repeat until you have one file per variable, then you can merge them in Stata. The reason you need a separate file for each variable is that otherwise it’s really easy to get your variables switched around. Because field-tagged formats are so forgiving about having variables in arbitrary orders or missing altogether, when you try to turn them into rectangular format you’ll get a lot of values in the wrong column.
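If you’d rather avoid one regex pass per variable plus the merge, another approach is to parse each record into a dictionary keyed by tag, so the tag rather than the column position decides where each value lands; that sidesteps the switched-columns problem entirely. A rough Python sketch, assuming records shaped like the BibTex example above (the second entry is invented for illustration):

```python
import re

bibtex = """@book{vogel_entertainment_2007,
\taddress = {Cambridge},
\tauthor = {Harold Vogel},
\tyear = {2007}
},
@book{other_2009,
\tyear = {2009},
\tauthor = {Jane Doe}
},
"""

records = {}            # record key -> {tag: value}
current = None
for ln in bibtex.splitlines():
    if ln.startswith("@"):                        # record-key line
        current = ln.split("{")[1].rstrip(",")
        records[current] = {}
    else:
        m = re.match(r"\t(\w+) = \{(.*)\},?$", ln)
        if m:                                     # tagged variable line
            records[current][m.group(1)] = m.group(2)

# Rectangularize: fixed column order, blank cell for any missing field
cols = ["author", "year", "address"]
print("\t".join(["key"] + cols))
for key, rec in records.items():
    print("\t".join([key] + [rec.get(c, "") for c in cols]))
```

Because each value is stored under its tag, a record that lists year before author, or omits address entirely, still comes out with everything in the right column.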



5 Comments

  • 1. Bobby Chen  |  June 10, 2009 at 8:03 pm

    I’ve been reading Franzosi’s papers and books in trying to figure out the best way to code my dissertation data (which are discourse data). He keeps going on about the advantages of using a database structure to organize discourse data. I am not sold (yet) on the analytical or even practical leverage of coding up language data using a database structure.

  • 2. gabrielrossman  |  June 10, 2009 at 9:56 pm

    field-tagged data is still in the same case/variable paradigm as rectangular data and so it’s pretty different from meta-text tagging of the sort used in qualitative analysis. i’ve never seriously looked into QDA but my understanding is that what atlas.ti and similar programs do is keep the literary text, archival documents, fieldnotes, or whatever, as a narrative but allow you to tag the meta-text. this then builds up what is effectively a turbo-charged index or concordance. all i can say is that i suggest you talk to people who do that kind of research and ask them whether the concordance and related features are worth the investment.
    btw, the last post i had on anything touching on narrative analysis was this piece on textarc

  • 3. Randall Stross  |  June 11, 2009 at 9:08 pm

    I always enjoyed reading Franzosi’s work. He covers the topic quite well.

    Gabriel, yes, human readability is a big concern for a lot of these formats. It’s one of the main reasons for the existence of XML as well.

  • 4. Gross.pl « Code and Culture  |  March 31, 2010 at 3:40 pm

    […] few months ago I talked about reshaping field-tagged data and gave some clumsy advice for doing so. I’ve now written a perl script that does this more […]

  • 5. wos2tab.pl « Code and Culture  |  July 19, 2010 at 2:13 pm

    […] is also interested in node level attributes, not just the network. Unfortunately WOS queries are field-tagged which is kind of a pain to work with and the grad student horrified me but expressing the […]

