Posts tagged ‘cleaning’

importspss.ado (requires R)

| Gabriel |

Mike Gruszczynski has a post up pointing out that you can use R to translate files, for instance from SPSS to Stata. I like this a lot because it let’s you avoid using SPSS but I’d like it even better if it let you avoid using R as well.

As such I rewrote the script to work entirely from Stata. Mike wanted to do this in Bash but couldn’t figure out how to pass arguments from the shell to R. Frankly, I don’t know how to do this either which is why my solution is to have Stata write and execute an R source file so all the argument passing occurs within Stata. This follows my general philosophy of doing a lot of code mise en place in a user-friendly language so I can spend as little time as necessary in R. (Note that you could just as easily write this in Bash, but I figured this way you can a) make it cross-platform and b) attach it to “use” for a one-stop shop “import” command).

*by GHR 6/29/2010
*this script uses R to translate SPSS to Stata
*it takes as arguments the SPSS file and Stata file
*adapted from 

*DEPENDENCY: R and library(foreign) 
*if R exists but is not in PATH, change the reference to "R" in line 27 to be the specific location

capture program drop importspss
program define importspss
	set more off
	local spssfile `1'
	if "`2'"=="" {
		local statafile "`spssfile'.dta"
	else {
		local statafile `2'	
	local sourcefile=round(runiform()*1000)
	capture file close rsource
	file open rsource using `sourcefile'.R, write text replace
	file write rsource "library(foreign)" _n
	file write rsource `"data <- read.spss("`spssfile'","' _n
	file write rsource `"write.dta(data, file="`statafile'")"' _n
	file close rsource
	shell R --vanilla <`sourcefile'.R
	erase `sourcefile'.R
	use `statafile', clear

June 29, 2010 at 3:01 pm 6 comments


| Gabriel |

Over at the Orgtheory mothership, Fabio asked how to do a partial string match in Stata, specifically to see if certain keywords appear in scientific abstracts. This turns out to be hard, not because there are no appropriate functions in Stata (both strmatch() and regexm() can do it) but because Stata can only handle 244 characters in a string variable. Many of the kinds of data we’d want to do content analysis on are much bigger than this. For instance, scientific abstracts are about 2000 characters and news stories are about 10000 characters.

OW suggested using SPSS, and her advice is well-taken as she’s a master at ginormous content analysis projects. Andrew Perrin suggested using Perl and this is closer to my own sympathies. I agree that Perl is generally a good idea for content analysis, but in this case I think a simple grep will suffice.

grep "searchterm" filein.csv | cut -d "," -f 1 > fileout.csv

The way this works is you start with a csv file called filein.csv (or whatever) where the record id key is in the first column. You do a grep search for “searchterm” in that file and pipe the output to the “cut” command. The -d “,” option tells cut that the stream is comma delimited and the -f 1 option tells it to only keep the first field (which is your unique record id). The “> fileout.csv” part says to pipe the output to disk. (Note that in Unix “>” as a file operator means replace and “>>” means append). You then have a text file called fileout.csv that’s just a list of records where your search term appears. You can merge this into Stata and treat a _merge==3 as meaning that the case includes the search term.

You can also wrap the whole thing in a Stata command that takes as arguments (in order): the term to search for, the file to look for it in, the name of the key variable in the master data, and (optionally) the name of the new variable that indicates a match. However for some reason the Stata-wrapped version only works with literal strings and not regexp searches. Also note that all this is for Mac/Linux. You might be able to get it to work on Windows with CygWin or Powershell.

capture program drop grepmerge
program define grepmerge
	local searchterm	"`1'"
	local fileread	"`2'"
	local key "`3'"
	if "`4'"=="" {
		local newvar "`1'"
	else {
		local newvar "`4'"
	tempfile filewrite
	shell grep "`searchterm'" `fileread' | cut -d "," -f 1 > `filewrite'
	tempvar sortorder
	gen `sortorder'=[_n]
	tempfile masterdata
	save `masterdata'
	insheet using `filewrite', clear
	ren v1 `key'
	merge 1:1 `key' using `masterdata', gen(`newvar')
	sort `sortorder'
	recode `newvar' 1=.a 2=0 3=1
	notes `newvar' : "`searchterm'" appears in this case
	lab val `newvar'

April 29, 2010 at 12:45 pm 8 comments

Using grep (or mdfind) to reshape data

| Gabriel |

Sometimes you have cross-class data that’s arranged the opposite of how you want. For instance, suppose I have a bunch of files organized by song, and I’m interested in finding all the song files that mention a particlar radio station, say KIIS- FM. I can run the following command that finds all the song files in my song directory (or its subdirectories) and puts the names of these files in a text file called “kiis.txt”

grep -l -r ’KIIS’ ~/Documents/book/stata/rawsongs/ > kiis.txt

Of course to run it from within Stata I can prefix it with “shell”. By extension, I could then write a program around this shell command that will let me query station data from my song files (or vice versa). You could do something similar to see what news stories saved from Lexis-Nexis or scraped web pages contain a certain keyword.

Unfortunately grep is pretty slow, but you can do it faster by accessing your desktop search index. It’s basically the difference between reading a book looking for a reference versus looking the reference up in the book’s index. This is especially important if you’re searching over a lot of data — grep is fine for a few dozen files but you want indexed search if you’re looking over thousands of files, let alone your whole file system. On a Mac, you can access your Spotlight index from shell scripts (or the Terminal) with “mdfind“. The syntax is a little different than grep so the example above should be rewritten as

mdfind -onlyin ~/Documents/book/stata/rawsongs/ "KIIS" > kiis.txt

While grep is slower than mdfind, it’s also more flexible. Fortunately (as described here), you can get the best of both worlds by doing a broad search with mdfind then piping the results to grep for more refined work.

April 7, 2010 at 5:13 am 1 comment and server logs

| Gabriel |

I recently started scraping a website using curl and cron (for earlier thoughts on this see here). Because I don’t leave my mac turned on at 2am, I’m hosting the scrape on one of the UCLA servers. I get a daily log of the scrape by email, but I know myself well enough to know that I’ll get bored with reading the logs after a few days.

As such, I added a “rule” to that looks for error messages. When curl fails for any reason, the standard error message is “Warning: Failed to create the file …” Using the “rules” tab of the preferences, I told Mail to turn any message red if it has my server’s log boilerplate in the subject line and contains a curl error message anywhere in the text. Now when I open my email in the morning and see a black message I know everything is fine whereas a red message (or no message at all) means there’s a problem.

April 2, 2010 at 5:10 am

| Gabriel |

A few months ago I talked about reshaping field-tagged data and gave some clumsy advice for doing so. I’ve now written a perl script that does this more elegantly. It’s written to extract movie title (“MV”) and domestic box office (“GR”) from the IMDB file business.list, but you could adapt it to get other variables and/or work on other field-tagged data.
Basically, the script will turn this:

MV: Little Shop of Horrors (1986)

AD: 118,418 (Sweden) 

BT: USD 30,000,000 

GR: USD 34,656,704 (USA) (8 February 1987) 
GR: USD 33,126,503 (USA) (1 February 1987) 
GR: USD 30,810,276 (USA) (25 January 1987) 
GR: USD 27,781,027 (USA) (18 January 1987) 
GR: USD 23,727,232 (USA) (11 January 1987) 
GR: USD 19,546,049 (USA) (4 January 1987) 
GR: USD 11,412,248 (USA) (28 December 1986) 
GR: USD 3,659,884 (USA) (21 December 1986) 
GR: USD 38,747,385 (USA) 
GR: SEK 4,318,255 (Sweden) 

OW: USD 3,659,884 (USA) (21 December 1986) (866 screens) 

RT: USD 19,300,000 (USA) 

SD: 21 October 1985 - ? 

WG: USD 1,112,016 (USA) (8 February 1987) (871 screens) 
WG: USD 1,719,329 (USA) (1 February 1987) 
WG: USD 2,093,847 (USA) (25 January 1987) 
WG: USD 3,222,066 (USA) (18 January 1987) 
WG: USD 3,057,666 (USA) (11 January 1987) (858 screens) 
WG: USD 4,004,838 (USA) (4 January 1987) (866 screens) 
WG: USD 5,042,682 (USA) (28 December 1986) (866 screens) 
WG: USD 3,659,884 (USA) (21 December 1986) (866 screens) 


Into this:

Little Shop of Horrors (1986)	34,656,704 (USA) (8 February 1987) 
Little Shop of Horrors (1986)	33,126,503 (USA) (1 February 1987) 
Little Shop of Horrors (1986)	30,810,276 (USA) (25 January 1987) 
Little Shop of Horrors (1986)	27,781,027 (USA) (18 January 1987) 
Little Shop of Horrors (1986)	23,727,232 (USA) (11 January 1987) 
Little Shop of Horrors (1986)	19,546,049 (USA) (4 January 1987) 
Little Shop of Horrors (1986)	11,412,248 (USA) (28 December 1986) 
Little Shop of Horrors (1986)	3,659,884 (USA) (21 December 1986) 
Little Shop of Horrors (1986)	38,747,385 (USA) 

Here’s the code:

#!/usr/bin/perl by ghr
#this script cleans the IMDB file business.list
#raw data is field-tagged, key tags are "MV" (movie title) and "GR" (gross)
#record can have multiple "gross" fields, only interested in those with "(USA)"
#MV: Astronaut's Wife, The (1999)
#GR: USD 10,654,581 (USA) (7 November 1999) 
#find "MV" tag, keep in memory, go to "GR" tag and write out as "GR\tMV"

use warnings; use strict;
die "usage: <IMDB business file>\n" unless @ARGV==1;
my $rawdata = shift(@ARGV);

# if line=MV, redefine the "title" variable
# if line=GR, write out with "title" in front
#optional, screen out non "USA" gross, parse GR into 
#"currency, quantity, country, date"
my $title ;
my $gross ;
open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">gross.txt") or die "error creating gross.txt\n";
print OUT "title\tgross\n";
while (<IN>) {
	#match "MV" lines by looking for lines beginning "MV: "
	if($_=~ /^MV: /) {
		$title = $_; 
		$title =~ s/\015?\012//; #manual chomp
		$title =~ s/^MV: //; #drop leading tag
		print "$title ";
	#match "GR" lines, write out with clid
	if ($_ =~ m/^GR: USD .+\(USA\)/) {
		$gross = $_; 
		$gross =~ s/\015?\012//; #manual chomp
		$gross =~ s/^GR: USD //; #drop leading tag
		print OUT "$title\t$gross\n";
close IN;
close OUT;
print "\ndone\n";

March 31, 2010 at 3:40 pm 4 comments

Affiliation network 2 edge list

| Gabriel |

The grad student for whom I wrote dyadkey came back after realizing he had a more complicated problem than we thought. Long story short, he had a dataset where each record was an affiliation with one of the variables being a space-delimited list of all the members of the affiliation. That is his dataset was “wide” and basically looks like this:

j	i
a	"1 2 3 4"
b	"2 5 6"

Note that in most citation databases the data is field-tagged but the author tag itself is “wide.” Hence this code would be useful for making a co-authorship network but you’d first have to rearrange it so that the publication key is “j” and the author list is “i”.

In contrast, “long” affiliation data like IMDB is structured like this:

j	i
a	1
a	2
a	3
a	4
b	2
b	5
b	6

My student wanted to project the affiliation network into an edge list at the “i” level. As before he only wanted each edge in once (so if we have “1 & 2” we don’t also want “2 & 1”). To accomplish this, I wrote him a program that takes as arguments the name of the “affiliation” variable and the name of the “members list” variable. To do this it first reshapes to a long file (mostly lines 11-18), then uses joinby against itself to create all permutations (mostly lines 22-24), and finally drops redundant cases by only keeping dyads where ego was listed before alter in the original list of affiliation members (mostly lines 25-32). With minor modifications, the script would also work with affiliation data that starts out as long, like IMDB. Also note that it should work well in combination with stata2pajek classic (“ssc install stata2pajek”) or the version that lets you save vertice traits like color.

capture program drop affiliation2edges
program define affiliation2edges
	local affiliation `1'
	local memberslist `2'
	tempfile dataset
	tempvar sortorder
	quietly gen `sortorder'=[_n]
	sort `affiliation'
	quietly save "`dataset'"
	keep `affiliation' `memberslist'
	gen wc=wordcount(`memberslist')
	quietly sum wc
	local maxsize=`r(max)'
	forvalues word=1/`maxsize' {
		quietly gen member`word'=word(`memberslist',`word')
	reshape long member, i(`affiliation') j(membernumber)
	quietly drop if member=="" /*drop the empty cell, n_dropped ~ sd(wc) */
	sort `affiliation'
	tempfile datasetlong
	quietly save "`datasetlong'"
	ren member member_a
	joinby `affiliation' using "`datasetlong'"
	ren member member_b
	ren membernumber membernumber_a
	quietly drop if member_a==member_b /*drop loops*/
	quietly gen membernumber_b=.
	forvalues w=1/`maxsize' {
		quietly replace membernumber_b=`w' if member_b==word(`memberslist',`w')
	quietly drop if membernumber_a < membernumber_b /*keep only one version of a combinatio; ie, treat as "edges"*/
	drop membernumber_a membernumber_b `memberslist'
	lab var wc "n members in affiliation"
	sort `affiliation'
	merge m:1 `affiliation' using "`dataset'"
	disp "'from using' only are those with only one member in affiliation's memberlist"
	quietly sort `sortorder'
	quietly drop `sortorder'

March 17, 2010 at 5:30 am 6 comments

Memetracker into Stata

| Gabriel |

A few months ago I mentioned the Memetracker project to scrape the internet and look for the diffusion of (various variants of) catchphrases. I wanted to play with the dataset but there were a few tricks. First, the dataset is really, really, big. The summary file is 862 megabytes when stored as text and would no doubt be bigger in Stata (because of how Stata allocates memory to string variables). Second, the data is in a moderately complicated hierarchical format, with “C” specific occurrences, nested within “B” phrase variants, which are in turn nested within “A” phrase families. You can immediately identify whether a row is A, B, or C by the numer of leading tabs (0, 1, and 2, respectively).

I figured that the best way to interpret this data in Stata would be two create two flat-files, one a record of all the “A” records that I call “key”, and the other a simplified version of all the “C” records but with the key variable to allow merging with the “A” records. Rather than do this all in Stata, I figured it would be good to pre-process it in perl, which reads text one line at a time and thus is well-suited for handling very large files. The easy part was to make a first pass through the file with grep to create the “key” file by copying all the “A” rows (i.e., those with no leading tabs).

Slightly harder was to cull the “C” rows. If I just wanted the “C” rows this would be easy, but I wanted to associate them with the cluster key variable from the “A” rows. This required looking for “A” rows, copying the key, and keeping it in memory until the next “A” row. Meanwhile, every time I hit a “C” row, I copy it but add in the key variable from the most recent “A” row. Both for debugging and because I get nervous when a program doesn’t give any output for several minutes, I have it print to screen every new “A” key. Finally, to keep the file size down, I set a floor to eliminate reasonably rare phrase clusters (anything with less than 500 occurrences total).

At that point I had two text files, “key” which associates the phrase cluster serial number with the actual phrase string and “data” which records occurrences of the phrases. The reason I didn’t merge them is that it would massively bloat the file size and it’s not necessary for analytic purposes. Anyway, at this point I could easily get both the key and data files into Stata and do whatever I want with them. As a first pass, I graphed the time-series for each catchphrase, with and without special attention drawn to mentions occurring in the top 10 news websites.

Here’s a sample graph.

Here’s the perl file:

#!/usr/bin/perl by ghr
#this script cleans the "phrase cluster" data
#script takes the (local and unzipped) location of this file as an argument
#throws out much of the data, saves as two tab flatfiles
#"key.txt" which associates cluster IDs with phrases
#"data.txt" which contains individual observations of the phrases
# input
# A:  <ClSz>  <TotFq>  <Root>  <ClId>
# B:          <QtFq>   <Urls>  <QtStr>  <QtId>
# C:                   <Tm>    <Fq>     <UrlTy>  <Url>
# output, key file
# A:  <ClSz>  <TotFq>  <Root>  <ClId>
# output, data file
# C:<ClID>	<Tm>	<UrlTy>	<URL>
# make two passes.

use warnings; use strict;
die "usage: <phrase cluster data>\n" unless @ARGV==1;

#define minimum number of occurences a phrase must have
my $minfreq = 500;

my $rawdata = shift(@ARGV);
# use bash grep to write out the "key file"
system("grep '^[0-9]' $rawdata > key.txt");

# read again, and write out the "data file"
# if line=A, redefine the "clid" variable
# optional, if second field of "A" is too small, (eg, below 100), break the loop?
# if line=B, skip
# if line=C, write out with "clid" in front
my $clid  ;
open(IN, "<$rawdata") or die "error opening $rawdata for reading\n";
open(OUT, ">data.txt") or die "error creating data.txt\n";
print OUT "clid\ttm\turlty\turl\n";
while (<IN>) {
	#match "A" lines by looking for numbers in field 0
	if($_=~ /^\d/) {
		my @fields = split("\t", $_); #parse as tab-delimited text
		if($fields[1] < $minfreq) { last;} #quit when you get to a rare phrase
		$clid = $fields[3]; #record the ClID
		$clid =~ s/\015?\012//; #manual chomp
		print "$clid ";
	#match "C" lines, write out with clid
	if ($_ =~ m/^\t\t/) {
		my @fields = split("\t", $_);
		print OUT "$clid\t$fields[2]\t$fields[4]\t$fields[5]\n";
close IN;
close OUT;
print "\ndone\n";

And here’s the Stata file:

set mem 500m
set more off
cd ~/Documents/Sjt/memetracker/
*import key, or "A" records
insheet using key.txt, clear
ren v1 clsz
ren v2 totfq
ren v3 root
ren v4 clid
sort clid
lab var clsz "cluster size, n phrases"
lab var totfq "total frequency"
lab var root "phrase"
lab var clid "cluster id"
save key, replace
*import data, or "C" records
insheet using data.txt, clear
drop if clid==.
gen double timestamp=clock(tm,"YMDhms")
format timestamp %tc
drop tm
gen hostname=regexs(1) if regexm(url, "http://([^/]+)") /*get the website, leaving out the filepath*/
drop url
gen blog=0
replace blog=1 if urlty=="B"
replace blog=1 if hostname==""
gen technoratitop10=0 /*note, as of 2/3/2010, some mismatch with late 2008 memetracker data*/
foreach site in {
	replace technoratitop10=1 if hostname=="`site'"
gen alexanews10=0 /*as w technorati, anachronistic*/
foreach site in {
	replace alexanews10=1 if hostname=="`site'"
drop urlty
sort clid timestamp
contract _all /*eliminate redundant "C" records (from different "B" branches)*/
drop _freq
save data, replace
*draw a graph of each meme's occurrences
levelsof clid, local(clidvalues)
foreach clid in `clidvalues' {
	disp "`clid'"
	quietly use key, clear
	quietly keep if clid==`clid'
	local title=root in 1
	quietly use data, clear
	histogram timestamp if clid==`clid', frequency xlabel(#5, labsize(small) angle(forty_five)) title(`title', size(medsmall))
	graph export graphs/`clid'.png, replace
	twoway (histogram timestamp if clid==`clid') (line alexanews10 timestamp if clid==`clid', yaxis(2)), legend(off) xlabel(#5, labsize(small) angle(forty_five)) title(`title', size(medsmall))
	graph export graphs_alexa/`clid'.png, replace
*have a nice day

February 8, 2010 at 4:31 am 7 comments

Stata2Pajek w vertice colors

| Gabriel |

[updated 6/29/2010, added option for “missingcolor” instead of always defaulting to yellow. also introduced commented quotes so as to make TextMate’s syntax parser happy]
My Stata program stata2pajek let’s you export an edge list or arc list from Stata to Pajek. However I recently wanted to make color-coded vertices and this required tweaking the syntax a bit. Because this is a relational database problem (and Stata likes flat-file), I do it by merging in a vertice level file on disk. Although I’ve written it to merge on colors it should be fairly easy to rewrite to deal with other vertice-level variables. Unfortunately the syntax is pretty awkward, which is why I’m forking it and posting it to the blog instead of updating the main stata2pajek file at SSC.

Note that the script assumes that nodes which appear in the arc list but not the vertice attributes file should be color-coded yellow. If you want yellow to have a substantive meaning you should change line 61 to be something else.

The syntax is the same as stata2pajek except there are a few more options. If you omit these options it behaves just like stata2pajek classic:

  • attributefile() — the file where the vertice level variables are stored
  • attributekey() — the merge key, defaults to “ego”
  • color() — the variable storing the color-codes

For instance, to color-code a graph of radio stations based on when they first played “My Humps” (yes, really), I use this command:

stata2pajekalt ego alter, filename(ties_bounded_humpcolor) attributefile(humps_color) attributekey(ego) color(color)

It says to treat the data in memory as an arc list, to merge on station-level data on adoption time where the station name is stored as the variable “ego” in the file “humps_color”, and to write out the whole mess as a Pajek formatted text file called “”.

*1.4 GHR 6/29/2010
*forked from stata2pajek 1.2 (available at ssc)
capture program drop stata2pajekalt
program define stata2pajekalt
	version 10
	set more off
	syntax varlist(min=2 max=2) [ , tiestrength(string asis) filename(string asis) EDGEs attributefile(string asis) attributekey(string asis) color(string asis) missingcolor(string asis)]

	tempfile dataset
	quietly save `dataset'
	gettoken ego alter : varlist
	local starchar="*"
	if "`filename'"=="" {
		local pajeknetfile "mypajekfile"
	else {
		local pajeknetfile "`filename'"
	if "`attributekey'"=="" {
		local attributekey "`ego'"
	if "`missingcolor'"=="" {
		local missingcolor "Yellow"
	capture file close pajeknetfile
	file open pajeknetfile using `pajeknetfile'.net, write text replace
	use `dataset', clear
	drop `ego'
	ren `alter' `ego'
	append using `dataset'
	keep `ego'
	contract `ego'
	ren `ego' verticelabel
	drop _freq
	sort verticelabel
	gen number=[_n]
	order number verticelabel
	if "`attributefile'"!="" {
		ren verticelabel `attributekey'
		merge `attributekey' using `attributefile'
		drop if _merge==2
		ren `attributekey' verticelabel
		drop _merge
		keep number verticelabel `color'
		sort verticelabel
	tempfile verticelabels
	quietly save `verticelabels', replace
	local nvertices=[_N]
	file write pajeknetfile "`starchar'Vertices `nvertices'" _n
	if "`attributefile'"=="" {
		forvalues x=1/`nvertices' {
			local c2=verticelabel in `x'
			file write pajeknetfile `"`x' "`c2'""' _n
	else {
		forvalues x=1/`nvertices' {
			local c2=verticelabel in `x'
			local colvalue=`color' in `x'
			if "`colvalue'"=="" {
				local colvalue "`missingcolor'"
			file write pajeknetfile `"`x' "`c2'" ic `colvalue'"' _n
	use `dataset', clear
	ren `ego' verticelabel
	sort verticelabel
	quietly merge verticelabel using `verticelabels'
	quietly keep if _merge==3
	drop _merge verticelabel
	ren number `ego'
	ren `alter' verticelabel
	sort verticelabel
	quietly merge verticelabel using `verticelabels'
	quietly keep if _merge==3
	drop _merge verticelabel
	ren number `alter'
	order `ego' `alter' `tiestrength'
	keep  `ego' `alter' `tiestrength'
	local narcs=[_N]
	if "`edges'"=="edges" {
		file write pajeknetfile `"`starchar'Edges"' _n
	else {
		file write pajeknetfile `"`starchar'Arcs"' _n
	sort `ego' `alter'      
	if "`tiestrength'"~="" {
		forvalues x=1/`narcs' {
		        local c1=`ego' in `x'
		        local c2=`alter' in `x'
		        local c3=`tiestrength' in `x'                   
		        file write pajeknetfile "`c1' `c2' `c3'" _n
	else {
		forvalues x=1/`narcs' {
		        local c1=`ego' in `x'
		        local c2=`alter' in `x'
		        file write pajeknetfile "`c1' `c2'" _n
	file close pajeknetfile
	*ensure that it's windows (CRLF) text format
	if "$S_OS"~="Windows" {
		filefilter `pajeknetfile'.net tmp, from(\M) to(\W) replace
		shell mv tmp `pajeknetfile'.net
		filefilter `pajeknetfile'.net tmp, from(\U) to(\W) replace
		shell mv tmp `pajeknetfile'.net
	use `dataset', clear
	disp "Your output is saved as"
	disp "`c(pwd)'`c(dirsep)'`pajeknetfile'.net"

December 10, 2009 at 7:29 pm 9 comments

Drop the entirely missing variables

| Gabriel |

One of the purely technical frustrations with GSS is that it’s hard to figure out if a particular question was in a particular wave. The SDA server is pretty good about kicking these from extracts, but it still leaves some in. The other day I was playing with the 2008 wave and got sick of saying “oooh, that looks interesting” only to find the variable was missing for my wave, which happened repeatedly. (For instance, I thought it would be fun to run Biblical literalism against the Israel feeling thermometer, but no dice, at least for the 2008 wave).

To get rid of these phantom variables, I wrote this little loop that drops variables with entirely missing data (or that are coded as strings, see below):

foreach varname of varlist * {
 quietly sum `varname'
 if `r(N)'==0 {
  drop `varname'
  disp "dropped `varname' for too much missing data"

Unfortunately “sum” thinks that string variables have no observations so this will also drop strings. There’s a workaround, but it involves the “ds” command, which works in Stata 11 but has been deprecated and so may not work in future versions.

ds, not(type string)
foreach varname of varlist `r(varlist)' {
 quietly sum `varname'
 if `r(N)'==0 {
  drop `varname'
  disp "dropped `varname' for too much missing data"

December 2, 2009 at 5:00 am 9 comments

Perl text library

| Gabriel |

I found this very useful library of perl scripts for text cleaning. You can use them even if you can’t code perl yourself, for instance to transpose a dataset just download “” script to your ~/scripts directory and enter the shell command:
perl ~/scripts/ row_col.txt > col_row.txt

The transpose script is particularly useful to me as I’ve never gotten Excel’s transpose function to work and for some bizarre reason Stata’s “xpose” command only works with numeric variables. You can even use these scripts from directly in a do-file like so:

tempfile foo1
tempfile foo2
outsheet using `foo1'.txt
shell perl ~/scripts/ `foo1'.txt > `foo2'.txt
insheet using `foo2'.txt, clear

November 30, 2009 at 4:49 am 1 comment

Older Posts Newer Posts

The Culture Geeks